Description of problem:
watch closed unexpectedly while running the "[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]" test

Version-Release number of selected component (if applicable):
[root@localhost origin]# oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.3.0-0.nightly-ppc64le-2020-01-09-234845
Kubernetes Version: v1.16.2
[root@localhost origin]#

How reproducible:

Steps to Reproduce:
1. Clone the code from https://github.com/mkumatag/origin/commits/release-4.3-ppc64le, which contains the proper images to run tests on Power
2. export KUBECONFIG=/root/.kube/config
3. _output/local/bin/linux/ppc64le/openshift-tests run-test "[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]"

Actual results:
Jan 16 07:05:59.761: INFO: minreadytest-1-dv5d8[e2e-test-cli-deployment-dzs42].container[myapp].log
Jan 16 07:05:59.765: INFO: skipping dumping cluster info - cluster too large
Jan 16 07:05:59.807: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-cli-deployment-dzs42-user}, err: <nil>
Jan 16 07:05:59.878: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-cli-deployment-dzs42}, err: <nil>
Jan 16 07:05:59.917: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  6kIq0Tk4ROKoJRCRfzzUGwAAAAAAAAAA}, err: <nil>
[AfterEach] [Feature:DeploymentConfig] deploymentconfigs
  /root/openshift_ws/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Jan 16 07:05:59.917: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-cli-deployment-dzs42" for this suite.
Jan 16 07:05:59.954: INFO: Running AfterSuite actions on all nodes
Jan 16 07:05:59.954: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/deployments/util.go:747]: watch closed unexpectedly
Expected
    <bool>: false
to be equivalent to
    <bool>: true
[root@localhost origin]#

complete log: http://pastebin.test.redhat.com/827902

Expected results:
Test should pass without any errors

Additional info:
Community issue: https://github.com/openshift/origin/issues/24158
Please attach logs from the kube-apiserver, openshift-apiserver, kube-controller-manager and openshift-controller-manager.
Ran oc adm must-gather and am sharing the output via http://10.0.1.233:4444/must-gather.local.6070290862417239547/ for the time being; not sure where to upload those logs! The test ran and the watch crashed between Jan 30 07:02:31 and Jan 30 07:12:26 GMT.
Can you please tar this directory and upload it as a single archive? That's easier than going through those files manually.
Uploaded the tar file to the same server here: http://10.0.1.233:4444/must-gather.local.6070290862417239547.tar.gz
Pushed the logs to http://file.rdu.redhat.com/~mkumatag/logs/BZ1791739/must-gather.local.6070290862417239547.tar.gz as well.
We have seen this issue more often now, and the following test cases are failing most of the time:

[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]
[Feature:DeploymentConfig] deploymentconfigs won't deploy RC with unresolved images [Conformance] when patched with empty image [Suite:openshift/conformance/parallel/minimal]

with error:

Apr 23 16:09:36.303: INFO: minreadytest-1-deploy[e2e-test-cli-deployment-4rszc].container[deployment].log
--> Scaling minreadytest-1 to 2
--> Success
Apr 23 16:09:36.312: INFO: minreadytest-1-ft4lw[e2e-test-cli-deployment-4rszc].container[myapp].log
Apr 23 16:09:36.317: INFO: skipping dumping cluster info - cluster too large
Apr 23 16:09:36.330: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-cli-deployment-4rszc-user}, err: <nil>
Apr 23 16:09:36.337: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-cli-deployment-4rszc}, err: <nil>
Apr 23 16:09:36.354: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  AJJwsg6mS0qMHCcHi4yHcAAAAAAAAAAA}, err: <nil>
[AfterEach] [Feature:DeploymentConfig] deploymentconfigs
  /tmp/openshift_ws/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Apr 23 16:09:36.354: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-cli-deployment-4rszc" for this suite.
Apr 23 16:09:36.381: INFO: Running AfterSuite actions on all nodes
Apr 23 16:09:36.381: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/deployments/util.go:747]: watch closed unexpectedly
Expected
    <bool>: false
to be equivalent to
    <bool>: true

latest logs pushed: http://file.rdu.redhat.com/~mkumatag/logs/BZ1791739/must-gather.local.5141860892755809294.tar.gz
This has started failing in our public CI (which runs on libvirt/KVM instead of PowerVM). Can we get someone to take a look? Here are some runs where it failed:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.3/53
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.3/56
Hi Manju, let us know if you are the right person to investigate this bug. Our team thinks this bug is related to CI and testing on Power.
(In reply to Dan Li from comment #8)
> Hi Manju, let us know if you are the right person to investigate this bug.
> Our team thinks this bug is related to CI and testing on Power.

This is a new area for me to explore, and I am tied up with other work items at the moment. I don't think I will be able to spare any bandwidth for this bug for the next few weeks.
Hi Adam, I saw that Manju re-assigned this bug to you. Is this a bug that you can help us look into? This bug was escalated during our weekly multi-arch on IBM Power meeting last week.
It seems like your platform LB is cutting off the connection, causing the watch to be closed prematurely. If that's the case, can you fix the LB config? I have not seen this on other platforms.

There are a lot of these in the apiserver logs:

2020-04-24T07:03:40.032281401Z W0424 07:03:40.032193       1 cacher.go:162] Terminating all watchers from cacher *unstructured.Unstructured
2020-04-24T07:03:40.034951064Z I0424 07:03:40.034919       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-04-24T07:03:40.035082244Z I0424 07:03:40.035061       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-04-24T07:03:40.035247851Z I0424 07:03:40.035228       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I'll see if I can make the test more resilient for the future, but I think the real issue is the LB setup.
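To illustrate the test-side mitigation, here is a minimal sketch (assuming a recent client-go; this is not the actual origin patch, and the namespace/clientset wiring is hypothetical) that survives a prematurely closed watch by resuming from the last-seen resourceVersion via client-go's RetryWatcher:

package watchutil

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchPodsResiliently keeps watching pods in ns even if the underlying
// connection is dropped (e.g. by an LB idle timeout): RetryWatcher
// re-establishes the watch from the last resourceVersion it has seen.
// resourceVersion must come from a prior List; it cannot be empty.
func watchPodsResiliently(ctx context.Context, client kubernetes.Interface, ns, resourceVersion string) error {
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return client.CoreV1().Pods(ns).Watch(ctx, options)
		},
	}
	rw, err := watchtools.NewRetryWatcher(resourceVersion, lw)
	if err != nil {
		return err
	}
	defer rw.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-rw.ResultChan():
			if !ok {
				return fmt.Errorf("retry watcher stopped")
			}
			// Real test code would inspect ev.Object here.
			fmt.Printf("observed event: %s\n", ev.Type)
		}
	}
}

The actual patch may differ; the point is only that a single dropped connection should not fail the whole test.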
(In reply to Tomáš Nožička from comment #12)
> It seems like your platform LB is cutting off the connection, causing the
> watch to be closed prematurely. If that's the case, can you fix the LB
> config? I have not seen this on other platforms.
>
> I'll see if I can make the test more resilient for the future, but I think
> the real issue is the LB setup.

I don't see any additional LB running in the CI setup @christy mentioned; we just have a libvirt IPI-based cluster. Thanks for the patch to the test; I hope it gets merged to master and backported to 4.4 and 4.5 soon.
Right, Manju, there is no LB in the upstream CI. There is an ssh tunnel, but this is the only test we ever see fail in this way, so it would seem like a strange coincidence if the tunnel were causing any problems.
> I don't see any additional LB running in the CI setup @christy mentioned;
> we just have a libvirt IPI-based cluster.

Not an additional load balancer, but the two that should be part of the OCP install. What do you get from

  oc get infrastructure/cluster --template='{{ printf "internalLB: %s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'

and what's behind those addresses?

This test isn't failing in supported setups like AWS and GCE, where the LB is set up correctly. Something is cutting off the connection.
(In reply to Tomáš Nožička from comment #16)
> Not an additional load balancer, but the two that should be part of the OCP
> install. What do you get from
>
>   oc get infrastructure/cluster --template='{{ printf "internalLB: %s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'
>
> and what's behind those addresses?

Here are the LBs:

sh-4.2$ oc get infrastructure/cluster --template='{{ printf "internalLB:%s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'
internalLB:https://api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443
externalLB: https://api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443

and here is the network config in libvirt for these endpoints, load-balanced between all the masters in the cluster:

[root@C155F2U33 ~]# virsh net-dumpxml ci-op-rimznz4n-8c2eb-5gj6c
<network>
  <name>ci-op-rimznz4n-8c2eb-5gj6c</name>
  <uuid>2f671596-3cf5-4a23-9682-7e9c2daf4121</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='br8c2eb1' stp='on' delay='0'/>
  <mac address='52:54:00:a3:b2:4f'/>
  <domain name='ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' localOnly='yes'/>
  <dns enable='yes'>
    <host ip='192.168.4.11'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.12'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.13'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.51'>
      <hostname>alertmanager-main-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>console-openshift-console.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>downloads-openshift-console.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>grafana-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>oauth-openshift.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>prometheus-k8s-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
  </dns>
  <ip family='ipv4' address='192.168.4.1' prefix='24'>
    <dhcp>
      <range start='192.168.4.2' end='192.168.4.254'/>
      <host mac='52:54:00:3a:6e:81' name='ci-op-rimznz4n-8c2eb-5gj6c-bootstrap.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.10'/>
      <host mac='52:54:00:7a:a3:5a' name='ci-op-rimznz4n-8c2eb-5gj6c-master-1.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.12'/>
      <host mac='52:54:00:9e:b5:9b' name='ci-op-rimznz4n-8c2eb-5gj6c-master-2.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.13'/>
      <host mac='52:54:00:c6:46:98' name='ci-op-rimznz4n-8c2eb-5gj6c-master-0.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.11'/>
      <host mac='52:3c:1e:6a:dd:b2' name='ci-op-rimznz4n-8c2eb-5gj6c-worker-0-xzvkf' ip='192.168.4.51'/>
      <host mac='de:91:4a:1c:9f:c3' name='ci-op-rimznz4n-8c2eb-5gj6c-worker-0-w2brz' ip='192.168.4.52'/>
    </dhcp>
  </ip>
</network>
[root@C155F2U33 ~]#
Is there some more information somewhere about those two LBs? This is obviously new to me. Any suggestions on debugging them on our end? I'm not sure if we have access to those LBs or not, as it seems they live inside the OCP CI cluster when they're alive.
Hi @tnozicka, do you have any updates here? Did you see the question (comment 18) from Christy?
Documentation for LB requirements lives in https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md
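For a concrete sanity check of what that document expects, below is a hedged standalone probe (a sketch, not code from the doc) of the kube-apiserver /readyz endpoint that LB health checks should target; the address is illustrative:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// /readyz reports whether this kube-apiserver instance should receive
	// traffic; the installer doc expects LB health checks to use it rather
	// than a bare TCP check, so backends are removed before shutdown.
	client := &http.Client{
		Timeout: 5 * time.Second,
		// InsecureSkipVerify only because this ad-hoc probe has no cluster
		// CA bundle at hand; a real LB check should verify the serving cert.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := client.Get("https://192.168.126.12:6443/readyz") // hypothetical master address
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("readyz HTTP status:", resp.StatusCode) // 200 => healthy backend
}

An LB that instead relies on a bare TCP check, or on plain round-robin DNS with no check at all, can keep routing to, or abruptly drop, long-lived watch connections.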
Hi @yin, changed the QA Contact to you, thanks.
Seeing a flurry of activity on this bz. Was an issue identified and fixed?
As you can see, the PR is merged and the BZ is ON_QA. It should be fixed in master.
Interesting. Are there comments that aren't visible to the public bz? I don't see any mention of a fix or a PR from my view.
Actually, there's a PR link added in an update a long time ago by a bot, but no mention of it by humans. Very strange. But if that's the PR that fixes this, then we should try and verify.
> Actually, there's a PR link added in an update a long time ago by a bot,
> but no mention of it by humans.

That's it. If a PR mentions fixing this BZ in its title, the bot will automatically link it and change the state to POST. When the PR merges, the bug goes to MODIFIED, and when a payload (nightly) containing the fix becomes available, it transitions to ON_QA.

Interesting, I can see changes like

  OpenShift BugZilla Robot 2020-09-09 20:41:32 UTC  Status: POST → MODIFIED

shown as public, but I don't see them when I am not logged in. The `Status: ON_QA` at the top of the BZ is publicly visible, though.
Yes, all the things you pointed out are visible, but I missed the implication. Thanks for the explanation.
Hi, would you please help take this issue? I don't have the required cluster. Thanks a lot.
Making Ying's QE request comment un-private as Manju cannot see private comments
Working fine with the latest build; tested on both libvirt and PowerVM environments.

started: (10/694/2365) "[sig-apps][Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel]"

passed: (1m39s) 2020-09-24T06:35:36 "[sig-apps][Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel]"
I think we may have hit this again in a 4.6 CI run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/12255/rehearse-12255-release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.6/1309570945945440256/build-log.txt

This was during cluster setup, not a specific test:

level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-ppc64le-2020-09-25-190030: 88% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-ppc64le-2020-09-25-190030: 89% complete"
level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://192.168.126.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
W0925 19:56:00.901569      89 reflector.go:326] k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *v1.ClusterVersion ended with: very short watch: k8s.io/client-go/tools/watch/informerwatcher.go:146: Unexpected watch close - watch lasted less than a second and no items received
@Tomas, per Christy's Comment 32, it seems that we are still running into this error even with the fix in place. Is it possible to take another look at the bug? Also, after chatting with Christy, we determined that this bug is not a release blocker for 4.6. It causes inconvenience to the CI team, but it is not blocking the 4.6 release. With that said, I am de-escalating this bug to "Low" severity.
> I think we may have hit this again in a 4.6 CI run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/12255/rehearse-12255-release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.6/1309570945945440256/build-log.txt

E0925 20:47:39.962347     380 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ClusterVersion: Get "https://api.ci-op-pn6byhsz-32479.libvirt-ppc64le-1-0:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": EOF
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get \"https://api.ci-op-pn6byhsz-32479.libvirt-ppc64le-1-0:6443/apis/config.openshift.io/v1/clusteroperators\": EOF"
level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

That seems completely unrelated to the DeploymentConfig e2e test reported in this BZ. Your LB is likely set up the wrong way, as I noted initially in https://bugzilla.redhat.com/show_bug.cgi?id=1791739#c12. The DC test was correct even before the patch; we just made it more resilient to broken infrastructure.
@Christy, @Dan: Since the error is not related to the DeploymentConfig test, could we move this to verified status? Thanks.
The origin issue reported here is no longer happening with this fix, so we are closing this issue; the new error can be tracked in a separate BZ if required. -Thanks
Making Ying's request un-private as some folks on this bug cannot view private comments.
Will move to verified status according to https://bugzilla.redhat.com/show_bug.cgi?id=1791739#c37.
Removing the needinfo flag since Manju answered and the bug has moved to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196