Bug 1791739

Summary: watch closed unexpectedly while running [Feature:DeploymentConfig] test on ppc64le platform

Product: OpenShift Container Platform
Component: openshift-controller-manager
Version: 4.3.0
Target Release: 4.6.0
Hardware: ppc64le
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: medium
Reporter: mkumatag
Assignee: Tomáš Nožička <tnozicka>
QA Contact: zhou ying <yinzhou>
CC: adam.kaplan, aos-bugs, clnperez, danili, lmcfadde, maszulik, mfojtik, scuppett, sjohn, tnozicka
Flags: mkumatag: needinfo+, mkumatag: needinfo-
Whiteboard: workloads, Multi-Arch
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-10-27 15:54:52 UTC
Bug Blocks: 1878851

Description mkumatag 2020-01-16 12:14:47 UTC
Description of problem:

watch closed unexpectedly while running "[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]" test

Version-Release number of selected component (if applicable):

[root@localhost origin]# oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.3.0-0.nightly-ppc64le-2020-01-09-234845
Kubernetes Version: v1.16.2
[root@localhost origin]#

How reproducible:


Steps to Reproduce:
1. Clone the code from https://github.com/mkumatag/origin/commits/release-4.3-ppc64le, which contains the proper images to run the tests on Power
2. export KUBECONFIG=/root/.kube/config
3. _output/local/bin/linux/ppc64le/openshift-tests run-test "[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]"

Actual results:

Jan 16 07:05:59.761: INFO: minreadytest-1-dv5d8[e2e-test-cli-deployment-dzs42].container[myapp].log

Jan 16 07:05:59.765: INFO: skipping dumping cluster info - cluster too large
Jan 16 07:05:59.807: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-cli-deployment-dzs42-user}, err: <nil>
Jan 16 07:05:59.878: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-cli-deployment-dzs42}, err: <nil>
Jan 16 07:05:59.917: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  6kIq0Tk4ROKoJRCRfzzUGwAAAAAAAAAA}, err: <nil>
[AfterEach] [Feature:DeploymentConfig] deploymentconfigs
  /root/openshift_ws/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Jan 16 07:05:59.917: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-cli-deployment-dzs42" for this suite.
Jan 16 07:05:59.954: INFO: Running AfterSuite actions on all nodes
Jan 16 07:05:59.954: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/deployments/util.go:747]: watch closed unexpectedly
Expected
    <bool>: false
to be equivalent to
    <bool>: true
[root@localhost origin]#

complete log: http://pastebin.test.redhat.com/827902

Expected results:

Test should pass without any errors

Additional info:

Community issue: https://github.com/openshift/origin/issues/24158
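
For context on the failure message: the assertion at util.go:747 comes from a helper that consumes events from a Kubernetes watch until a condition is met. The sketch below is not the actual origin helper, only the general pattern; the point is that when something between the client and the apiserver drops the connection, the watch result channel closes early and the helper reports "watch closed unexpectedly" (the Expected <bool>: false / true output is the test asserting the channel was still open).

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// waitForCondition is a hypothetical sketch of a watch-based wait helper.
// It reads events from an open watch until the condition returns true or
// the timeout expires. If the connection is cut, the result channel closes
// and the helper fails the same way this test does.
func waitForCondition(w watch.Interface, timeout time.Duration, condition func(watch.Event) bool) error {
	defer w.Stop()
	deadline := time.After(timeout)
	for {
		select {
		case event, ok := <-w.ResultChan():
			if !ok {
				// A test asserting ok here produces the
				// "Expected <bool>: false to be equivalent to <bool>: true" output.
				return fmt.Errorf("watch closed unexpectedly")
			}
			if condition(event) {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("timed out waiting for condition")
		}
	}
}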

Comment 1 Maciej Szulik 2020-01-28 11:56:08 UTC
Please attach logs from the kube-apiserver, openshift-apiserver, kube-controller-manager and openshift-controller-manager.

Comment 2 mkumatag 2020-01-30 12:20:45 UTC
I ran `oc adm must-gather` and am sharing the output via http://10.0.1.233:4444/must-gather.local.6070290862417239547/ for the time being; I'm not sure where to upload those logs.

The test ran and the watch crashed between Jan 30 07:02:31 and Jan 30 07:12:26 GMT.

Comment 3 Maciej Szulik 2020-01-30 12:44:04 UTC
Can you please tar this directory and upload it as a single archive? That's easier than going through those files manually.

Comment 4 mkumatag 2020-01-30 19:18:19 UTC
Uploaded the tar file to the same server here: http://10.0.1.233:4444/must-gather.local.6070290862417239547.tar.gz

Comment 6 mkumatag 2020-04-24 08:33:19 UTC
We have seen this issue more often now, and the following test cases are failing most of the time:


[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]
[Feature:DeploymentConfig] deploymentconfigs won't deploy RC with unresolved images [Conformance] when patched with empty image [Suite:openshift/conformance/parallel/minimal]


with error:

Apr 23 16:09:36.303: INFO: minreadytest-1-deploy[e2e-test-cli-deployment-4rszc].container[deployment].log
--> Scaling minreadytest-1 to 2
--> Success

Apr 23 16:09:36.312: INFO: minreadytest-1-ft4lw[e2e-test-cli-deployment-4rszc].container[myapp].log

Apr 23 16:09:36.317: INFO: skipping dumping cluster info - cluster too large
Apr 23 16:09:36.330: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-cli-deployment-4rszc-user}, err: <nil>
Apr 23 16:09:36.337: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-cli-deployment-4rszc}, err: <nil>
Apr 23 16:09:36.354: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  AJJwsg6mS0qMHCcHi4yHcAAAAAAAAAAA}, err: <nil>
[AfterEach] [Feature:DeploymentConfig] deploymentconfigs
  /tmp/openshift_ws/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Apr 23 16:09:36.354: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-cli-deployment-4rszc" for this suite.
Apr 23 16:09:36.381: INFO: Running AfterSuite actions on all nodes
Apr 23 16:09:36.381: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/deployments/util.go:747]: watch closed unexpectedly
Expected
    <bool>: false
to be equivalent to
    <bool>: true



latest logs pushed: http://file.rdu.redhat.com/~mkumatag/logs/BZ1791739/must-gather.local.5141860892755809294.tar.gz

Comment 7 Christy Norman 2020-05-14 14:17:58 UTC
This has started failing in our public CI (which is running on libvirt/kvm, instead of PowerVM). 


Can we get someone to take a look? 

Here are some runs where it failed:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.3/53
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.3/56

Comment 8 Dan Li 2020-05-20 13:32:20 UTC
Hi Manju, let us know if you are the right person to investigate this bug. Our team thinks this bug is related to CI and testing on Power.

Comment 9 mkumatag 2020-05-20 14:08:35 UTC
(In reply to Dan Li from comment #8)
> Hi Manju, let us know if you are the right person to investigate this bug.
> Our team thinks this bug is related to CI and testing on Power

This is a new area for me to explore, and I am tied up with other work items right now. I don't think I will be able to spare any bandwidth for this bug in the next few weeks.

Comment 10 Dan Li 2020-05-20 15:13:13 UTC
Hi Adam, I saw that Manju re-assigned this bug to you. Is this a bug that you can help us look into? This bug was escalated during our weekly multi-arch on IBM Power meeting last week.

Comment 12 Tomáš Nožička 2020-05-21 08:54:25 UTC
It seems like your platform LB is cutting off the connection, causing the watch to be closed prematurely. If that's the case, can you fix the LB config? I have not seen this on other platforms.

There are a lot of these in the apiserver logs:
2020-04-24T07:03:40.032281401Z W0424 07:03:40.032193       1 cacher.go:162] Terminating all watchers from cacher *unstructured.Unstructured
2020-04-24T07:03:40.034951064Z I0424 07:03:40.034919       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-04-24T07:03:40.035082244Z I0424 07:03:40.035061       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-04-24T07:03:40.035247851Z I0424 07:03:40.035228       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I'll see if I can make the test more resilient in the future, but I think the real issue is the LB setup.
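
As a sketch of what "more resilient" can look like (assuming client-go's RetryWatcher; the resource, namespace, and condition below are illustrative, not the actual fix), the wait can go through watchtools.Until, which re-establishes the watch from the last observed resourceVersion instead of giving up when the connection is dropped:

package example

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// waitForRCModification is an illustrative sketch, not the actual origin patch.
// resourceVersion should come from a preceding List/Get of the object.
func waitForRCModification(client kubernetes.Interface, namespace, name, resourceVersion string) error {
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = "metadata.name=" + name
			return client.CoreV1().ReplicationControllers(namespace).Watch(context.TODO(), options)
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// watchtools.Until wraps a RetryWatcher: if the underlying connection is
	// cut, it re-watches from the last seen resourceVersion instead of
	// surfacing "watch closed unexpectedly".
	_, err := watchtools.Until(ctx, resourceVersion, lw, func(event watch.Event) (bool, error) {
		// Illustrative condition: stop on any modification of the object.
		return event.Type == watch.Modified, nil
	})
	if err != nil {
		return fmt.Errorf("waiting for %s/%s: %w", namespace, name, err)
	}
	return nil
}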

Comment 13 mkumatag 2020-05-26 04:50:02 UTC
(In reply to Tomáš Nožička from comment #12)
> It seems like your platform LB is cutting off the connection, causing watch
> to be closed prematurely. If that's the case, can you fix the LB config? I
> have not seen it on other platforms.
> 
> There is a lot of these in apiserver logs:
> 2020-04-24T07:03:40.032281401Z W0424 07:03:40.032193       1 cacher.go:162]
> Terminating all watchers from cacher *unstructured.Unstructured
> 2020-04-24T07:03:40.034951064Z I0424 07:03:40.034919       1
> controlbuf.go:508] transport: loopyWriter.run returning. connection error:
> desc = "transport is closing"
> 2020-04-24T07:03:40.035082244Z I0424 07:03:40.035061       1
> controlbuf.go:508] transport: loopyWriter.run returning. connection error:
> desc = "transport is closing"
> 2020-04-24T07:03:40.035247851Z I0424 07:03:40.035228       1
> controlbuf.go:508] transport: loopyWriter.run returning. connection error:
> desc = "transport is closing"
> 
> I'll see if I can make the test more resilient for future, but I think the
> real issue is LB setup.

I don't see any additional LB running in the CI setup that @christy mentioned; we just have a libvirt IPI-based cluster. Thanks for the patch to the test; I hope it gets fixed in master and backported to 4.4 and 4.5 soon.

Comment 14 Christy Norman 2020-05-26 15:28:34 UTC
Right, Manju, there is no LB in the upstream CI. There is an ssh tunnel, but this is the only test we ever see fail in this way, so it would seem like a strange coincidence if the tunnel were causing any problems.

Comment 16 Tomáš Nožička 2020-05-27 16:47:43 UTC
> I don't see any additional LB is running in the CI setup where @christy mentioned, we just have a libvirt IPI based cluster. Thanks for the patch in the test, hope it will get fixed in the master and backported to 4.4, 4.5 soon.

Not an additional load balancer, but the two that should be part of the OCP install. What do you get from

 oc get infrastructure/cluster --template='{{ printf "internalLB: %s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'

and what's behind those addresses?

This test isn't failing in supported setups like AWS and GCE, where the LB is set up correctly. Something is cutting off the connection.

Comment 17 mkumatag 2020-05-28 10:37:06 UTC
(In reply to Tomáš Nožička from comment #16)
> > I don't see any additional LB is running in the CI setup where @christy mentioned, we just have a libvirt IPI based cluster. Thanks for the patch in the test, hope it will get fixed in the master and backported to 4.4, 4.5 soon.
> 
> Not an additional load balancer, but the 2 that should be part of OCP
> install. What do you get from
> 
>  oc get infrastructure/cluster --template='{{ printf "internalLB:
> %s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'
> 
> and what's behind those addresses?
> 
> This test isn't failing in the supported setups like AWS, GCE where the LB
> is setup correctly. Something is cutting off the connection.

Here are the LBs:

sh-4.2$ oc get infrastructure/cluster --template='{{ printf "internalLB:%s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'
internalLB:https://api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443
externalLB: https://api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443

And here is the network config in libvirt for these endpoints, load-balanced across all the masters in the cluster:

[root@C155F2U33 ~]# virsh net-dumpxml ci-op-rimznz4n-8c2eb-5gj6c
<network>
  <name>ci-op-rimznz4n-8c2eb-5gj6c</name>
  <uuid>2f671596-3cf5-4a23-9682-7e9c2daf4121</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='br8c2eb1' stp='on' delay='0'/>
  <mac address='52:54:00:a3:b2:4f'/>
  <domain name='ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' localOnly='yes'/>
  <dns enable='yes'>
    <host ip='192.168.4.11'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.12'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.13'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.51'>
      <hostname>alertmanager-main-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>console-openshift-console.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>downloads-openshift-console.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>grafana-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>oauth-openshift.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>prometheus-k8s-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
  </dns>
  <ip family='ipv4' address='192.168.4.1' prefix='24'>
    <dhcp>
      <range start='192.168.4.2' end='192.168.4.254'/>
      <host mac='52:54:00:3a:6e:81' name='ci-op-rimznz4n-8c2eb-5gj6c-bootstrap.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.10'/>
      <host mac='52:54:00:7a:a3:5a' name='ci-op-rimznz4n-8c2eb-5gj6c-master-1.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.12'/>
      <host mac='52:54:00:9e:b5:9b' name='ci-op-rimznz4n-8c2eb-5gj6c-master-2.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.13'/>
      <host mac='52:54:00:c6:46:98' name='ci-op-rimznz4n-8c2eb-5gj6c-master-0.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.11'/>
      <host mac='52:3c:1e:6a:dd:b2' name='ci-op-rimznz4n-8c2eb-5gj6c-worker-0-xzvkf' ip='192.168.4.51'/>
      <host mac='de:91:4a:1c:9f:c3' name='ci-op-rimznz4n-8c2eb-5gj6c-worker-0-w2brz' ip='192.168.4.52'/>
    </dhcp>
  </ip>
</network>

[root@C155F2U33 ~]#

Comment 18 Christy Norman 2020-05-28 19:56:01 UTC
Is there some more information somewhere about those two LBs? This is obviously new to me. Any suggestions on debugging them on our end? I'm not sure if we have access to those LBs or not, as it seems they live inside the OCP CI cluster when they're alive.

Comment 19 lmcfadde 2020-06-17 20:27:12 UTC
Hi @tnozicka, do you have any updates here, and did you see the question (comment 18) from Christy?

Comment 20 Tomáš Nožička 2020-06-18 10:44:41 UTC
Documentation for LB requirements lives in

  https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md
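
For a quick manual check of what an LB health check along those lines should see (a sketch assuming the /readyz endpoint described in that document; the URL below is the external API address from comment 17, and TLS verification is skipped only to keep the example short):

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Probe the kube-apiserver readiness endpoint through the API address,
	// roughly what the LB health check is expected to do for each backend.
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443/readyz")
	if err != nil {
		fmt.Println("apiserver not reachable:", err)
		return
	}
	defer resp.Body.Close()
	// A ready backend returns 200; anything else should be taken out of rotation.
	fmt.Println("readyz status:", resp.Status)
}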

Comment 21 wewang 2020-06-28 05:55:31 UTC
Hi @yin, I changed the QA Contact to you. Thanks.

Comment 23 Christy Norman 2020-09-11 18:14:27 UTC
Seeing a flurry of activity on this bz. Was an issue identified and fixed?

Comment 24 Tomáš Nožička 2020-09-14 10:52:28 UTC
As you can see, the PR is merged and the BZ is ON_QA. It should be fixed in master.

Comment 25 Christy Norman 2020-09-14 13:36:13 UTC
Interesting. Are there comments that aren't visible to the public bz? I don't see any mention of a fix or a PR from my view.

Comment 26 Christy Norman 2020-09-14 13:38:48 UTC
Actually, there's a PR link added in an update a long time ago by a bot, but no mention of it by humans. Very strange. But if that's the PR that fixes this, then we should try to verify it.

Comment 27 Tomáš Nožička 2020-09-14 14:00:47 UTC
> Actually, there's a PR link added in an update a long time ago by a bot, but no mention of it by humans.

That's it. If a PR mentions fixing this BZ in its title, the bot will automatically link it and change the state to POST. When the PR merges, the bug goes to MODIFIED, and when a payload (nightly) containing the fix is available, it transitions to ON_QA.


Interestingly, I can see changes like:

OpenShift BugZilla Robot 2020-09-09 20:41:32 UTC
Status: POST → MODIFIED

shown as public, but I don't see them when I am not logged in. The `Status: ON_QA` at the top of the BZ is publicly visible, though.

Comment 28 Christy Norman 2020-09-14 14:10:37 UTC
Yes all those things you pointed out are visible, but I missed the implication. Thanks for the explanation.

Comment 29 zhou ying 2020-09-15 02:00:49 UTC
Hi,

Would you please help take this issue, since I don't have the required cluster? Thanks a lot.

Comment 30 Dan Li 2020-09-15 12:25:59 UTC
Making Ying's QE request comment un-private, as Manju cannot see private comments.

Comment 31 mkumatag 2020-09-24 10:10:31 UTC
Working fine with the latest build; tested on both libvirt and PowerVM environments.

started: (10/694/2365) "[sig-apps][Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel]"
passed: (1m39s) 2020-09-24T06:35:36 "[sig-apps][Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel]"

Comment 32 Christy Norman 2020-09-25 22:23:42 UTC
I think we may have hit this again in a 4.6 CI run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/12255/rehearse-12255-release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.6/1309570945945440256/build-log.txt


This was during cluster setup, not a specific test:

level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-ppc64le-2020-09-25-190030: 88% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-ppc64le-2020-09-25-190030: 89% complete"
level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://192.168.126.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
W0925 19:56:00.901569      89 reflector.go:326] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ClusterVersion ended with: very short watch: k8s.io/client-go/tools/watch/informerwatcher.go:146: Unexpected watch close - watch lasted less than a second and no items received

Comment 33 Dan Li 2020-10-05 19:01:27 UTC
@Tomas, per Christy's Comment 32, it seems that we are still running into this error while testing the fix. Is it possible to take another look at the bug?

Also, after chatting with Christy, we determined that this bug is not a release blocker for 4.6. It causes inconvenience to the CI team, but it is not blocking the 4.6 release. With that said, I am de-escalating this bug to "Low" severity.

Comment 35 Tomáš Nožička 2020-10-15 05:55:35 UTC
> I think we may have hit this again in a 4.6 CI run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/12255/rehearse-12255-release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.6/1309570945945440256/build-log.txt

E0925 20:47:39.962347     380 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ClusterVersion: Get "https://api.ci-op-pn6byhsz-32479.libvirt-ppc64le-1-0:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": EOF
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get \"https://api.ci-op-pn6byhsz-32479.libvirt-ppc64le-1-0:6443/apis/config.openshift.io/v1/clusteroperators\": EOF"
level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

That seems completely unrelated to the deployment config e2e test reported in this BZ.

Your LB is likely set up the wrong way, as I noted initially in https://bugzilla.redhat.com/show_bug.cgi?id=1791739#c12. The DC test was also correct initially; we just made it more resilient to misconfigured infrastructure.

Comment 36 zhou ying 2020-10-15 06:03:19 UTC
@Christy, @Dan:

Since the error is not related to the DeploymentConfig test, could we move this to verified status? Thanks.

Comment 37 mkumatag 2020-10-15 15:31:49 UTC
The issue originally reported here is not happening anymore with this fix, so we can close this issue and track the new error in a separate BZ if required. Thanks.

Comment 38 Dan Li 2020-10-15 15:42:21 UTC
Making Ying's request un-private as some folks on this bug cannot view private comments.

Comment 39 zhou ying 2020-10-19 01:11:24 UTC
Will move this to verified status per https://bugzilla.redhat.com/show_bug.cgi?id=1791739#c37.

Comment 40 Christy Norman 2020-10-19 14:19:15 UTC
Removing the needinfo flag since Manju answered and the bug has moved to verified.

Comment 42 errata-xmlrpc 2020-10-27 15:54:52 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196