Bug 1791739 - watch closed unexpectedly while running [Feature:DeploymentConfig] test on ppc64le platform [NEEDINFO]
Summary: watch closed unexpectedly while running [Feature:DeploymentConfig] test on ppc64le platform
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-controller-manager
Version: 4.3.0
Hardware: ppc64le
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Tomáš Nožička
QA Contact: zhou ying
URL:
Whiteboard: workloads, Multi-Arch
Depends On:
Blocks: 1878851
Reported: 2020-01-16 12:14 UTC by mkumatag
Modified: 2020-09-15 12:25 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
mkumatag: needinfo+
mkumatag: needinfo-
yinzhou: needinfo? (mkumatag)


Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 25010 None closed Bug 1791739: Restart watches in DC e2e tests 2020-09-21 03:40:32 UTC

Description mkumatag 2020-01-16 12:14:47 UTC
Description of problem:

watch closed unexpectedly while running "[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]" test

Version-Release number of selected component (if applicable):

[root@localhost origin]# oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.3.0-0.nightly-ppc64le-2020-01-09-234845
Kubernetes Version: v1.16.2
[root@localhost origin]#

How reproducible:


Steps to Reproduce:
1. Clone the code from https://github.com/mkumatag/origin/commits/release-4.3-ppc64le which contains the proper images to run tests on Power
2. export KUBECONFIG=/root/.kube/config
3. _output/local/bin/linux/ppc64le/openshift-tests run-test "[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]"

Actual results:

Jan 16 07:05:59.761: INFO: minreadytest-1-dv5d8[e2e-test-cli-deployment-dzs42].container[myapp].log

Jan 16 07:05:59.765: INFO: skipping dumping cluster info - cluster too large
Jan 16 07:05:59.807: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-cli-deployment-dzs42-user}, err: <nil>
Jan 16 07:05:59.878: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-cli-deployment-dzs42}, err: <nil>
Jan 16 07:05:59.917: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  6kIq0Tk4ROKoJRCRfzzUGwAAAAAAAAAA}, err: <nil>
[AfterEach] [Feature:DeploymentConfig] deploymentconfigs
  /root/openshift_ws/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Jan 16 07:05:59.917: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-cli-deployment-dzs42" for this suite.
Jan 16 07:05:59.954: INFO: Running AfterSuite actions on all nodes
Jan 16 07:05:59.954: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/deployments/util.go:747]: watch closed unexpectedly
Expected
    <bool>: false
to be equivalent to
    <bool>: true
[root@localhost origin]#

complete log: http://pastebin.test.redhat.com/827902

Expected results:

Test should pass without any errors

Additional info:

Community issue: https://github.com/openshift/origin/issues/24158
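
(Illustrative aside: the "Expected <bool>: false to be equivalent to <bool>: true" output above is the shape Gomega prints when a boolean assertion with a custom description fails. A minimal sketch of the kind of test helper that produces it when a watch result channel closes early follows; this is an assumption-labeled illustration, not the actual code at test/extended/deployments/util.go:747.)

// Hedged sketch only: a helper in this style fails with "watch closed
// unexpectedly" plus the Expected <bool> false/true output when the watch's
// result channel is closed before the expected event arrives. Not the real
// util.go code.
package deployments

import (
	o "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/watch"
)

func waitForNextEvent(w watch.Interface) watch.Event {
	event, ok := <-w.ResultChan()
	// ok is false when the channel was closed, e.g. by the server or an
	// intermediary dropping the connection, which trips the assertion below.
	o.Expect(ok).To(o.BeEquivalentTo(true), "watch closed unexpectedly")
	return event
}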

Comment 1 Maciej Szulik 2020-01-28 11:56:08 UTC
Please attach logs from the kube-apiserver, openshift-apiserver, kube-controller-manager and openshift-controller-manager.

Comment 2 mkumatag 2020-01-30 12:20:45 UTC
Ran oc adm must-gather and am sharing the output via http://10.0.1.233:4444/must-gather.local.6070290862417239547/ for the time being; not sure where to upload those logs!

The test ran and the watch crashed between Jan 30 07:02:31 and Jan 30 07:12:26 GMT.

Comment 3 Maciej Szulik 2020-01-30 12:44:04 UTC
Can you please tar this directory and upload it as a single archive? That's easier than going through those files manually.

Comment 4 mkumatag 2020-01-30 19:18:19 UTC
Uploaded the tar file in the same server here: http://10.0.1.233:4444/must-gather.local.6070290862417239547.tar.gz

Comment 6 mkumatag 2020-04-24 08:33:19 UTC
We are seeing this issue more often now, and the following test cases fail most of the time:


[Feature:DeploymentConfig] deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied [Suite:openshift/conformance/parallel/minimal]
[Feature:DeploymentConfig] deploymentconfigs won't deploy RC with unresolved images [Conformance] when patched with empty image [Suite:openshift/conformance/parallel/minimal]


with error:

Apr 23 16:09:36.303: INFO: minreadytest-1-deploy[e2e-test-cli-deployment-4rszc].container[deployment].log
--> Scaling minreadytest-1 to 2
--> Success

Apr 23 16:09:36.312: INFO: minreadytest-1-ft4lw[e2e-test-cli-deployment-4rszc].container[myapp].log

Apr 23 16:09:36.317: INFO: skipping dumping cluster info - cluster too large
Apr 23 16:09:36.330: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-cli-deployment-4rszc-user}, err: <nil>
Apr 23 16:09:36.337: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-cli-deployment-4rszc}, err: <nil>
Apr 23 16:09:36.354: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  AJJwsg6mS0qMHCcHi4yHcAAAAAAAAAAA}, err: <nil>
[AfterEach] [Feature:DeploymentConfig] deploymentconfigs
  /tmp/openshift_ws/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Apr 23 16:09:36.354: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-cli-deployment-4rszc" for this suite.
Apr 23 16:09:36.381: INFO: Running AfterSuite actions on all nodes
Apr 23 16:09:36.381: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/deployments/util.go:747]: watch closed unexpectedly
Expected
    <bool>: false
to be equivalent to
    <bool>: true



latest logs pushed: http://file.rdu.redhat.com/~mkumatag/logs/BZ1791739/must-gather.local.5141860892755809294.tar.gz

Comment 7 Christy Norman 2020-05-14 14:17:58 UTC
This has started failing in our public CI (which is running on libvirt/kvm, instead of PowerVM). 


Can we get someone to take a look? 

Here are some runs where it failed:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.3/53
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-ppc64le-4.3/56

Comment 8 Dan Li 2020-05-20 13:32:20 UTC
Hi Manju, let us know if you are the right person to investigate this bug. Our team thinks this bug is related to CI and testing on Power.

Comment 9 mkumatag 2020-05-20 14:08:35 UTC
(In reply to Dan Li from comment #8)
> Hi Manju, let us know if you are the right person to investigate this bug.
> Our team thinks this bug is related to CI and testing on Power

This is a new area for me to explore, and I am tied up with other work items right now. I don't think I will be able to spare any bandwidth for this bug for the next few weeks.

Comment 10 Dan Li 2020-05-20 15:13:13 UTC
Hi Adam, I saw that Manju re-assigned this bug to you. Is this a bug that you can help us look into? This bug was escalated during our weekly multi-arch on IBM Power meeting last week.

Comment 12 Tomáš Nožička 2020-05-21 08:54:25 UTC
It seems like your platform LB is cutting off the connection, causing the watch to be closed prematurely. If that's the case, can you fix the LB config? I have not seen this on other platforms.

There are a lot of these in the apiserver logs:
2020-04-24T07:03:40.032281401Z W0424 07:03:40.032193       1 cacher.go:162] Terminating all watchers from cacher *unstructured.Unstructured
2020-04-24T07:03:40.034951064Z I0424 07:03:40.034919       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-04-24T07:03:40.035082244Z I0424 07:03:40.035061       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-04-24T07:03:40.035247851Z I0424 07:03:40.035228       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I'll see if I can make the test more resilient in the future, but I think the real issue is the LB setup.
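
(Illustrative aside: one way to make a test watch survive dropped connections is client-go's RetryWatcher, which re-establishes the watch from the last observed resourceVersion. The sketch below illustrates the general technique and is not necessarily what the linked PR "Bug 1791739: Restart watches in DC e2e tests" implements; the namespace and resourceVersion arguments are placeholders, and the client-go >= v0.18 Watch signature is assumed.)

// Hedged sketch: resilient watching of ReplicationControllers via
// client-go's RetryWatcher. Illustrative only.
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchRCs streams RC events in the given namespace until ctx is cancelled,
// transparently re-opening the watch if the connection is closed.
func watchRCs(ctx context.Context, client kubernetes.Interface, namespace, resourceVersion string) error {
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return client.CoreV1().ReplicationControllers(namespace).Watch(ctx, options)
		},
	}
	// RetryWatcher restarts the underlying watch from the last seen
	// resourceVersion whenever the connection drops.
	rw, err := watchtools.NewRetryWatcher(resourceVersion, lw)
	if err != nil {
		return err
	}
	defer rw.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case event, ok := <-rw.ResultChan():
			if !ok {
				return fmt.Errorf("retry watcher stopped")
			}
			if rc, isRC := event.Object.(*corev1.ReplicationController); isRC {
				fmt.Printf("%s %s ready=%d\n", event.Type, rc.Name, rc.Status.ReadyReplicas)
			}
		}
	}
}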

Comment 13 mkumatag 2020-05-26 04:50:02 UTC
(In reply to Tomáš Nožička from comment #12)
> It seems like your platform LB is cutting off the connection, causing the
> watch to be closed prematurely. If that's the case, can you fix the LB
> config? I have not seen this on other platforms.
> 
> There are a lot of these in the apiserver logs:
> 2020-04-24T07:03:40.032281401Z W0424 07:03:40.032193       1 cacher.go:162]
> Terminating all watchers from cacher *unstructured.Unstructured
> 2020-04-24T07:03:40.034951064Z I0424 07:03:40.034919       1
> controlbuf.go:508] transport: loopyWriter.run returning. connection error:
> desc = "transport is closing"
> 2020-04-24T07:03:40.035082244Z I0424 07:03:40.035061       1
> controlbuf.go:508] transport: loopyWriter.run returning. connection error:
> desc = "transport is closing"
> 2020-04-24T07:03:40.035247851Z I0424 07:03:40.035228       1
> controlbuf.go:508] transport: loopyWriter.run returning. connection error:
> desc = "transport is closing"
> 
> I'll see if I can make the test more resilient in the future, but I think
> the real issue is the LB setup.

I don't see any additional LB running in the CI setup that @christy mentioned; we just have a libvirt IPI-based cluster. Thanks for the patch to the test; I hope it will land in master and be backported to 4.4 and 4.5 soon.

Comment 14 Christy Norman 2020-05-26 15:28:34 UTC
Right, Manju, there is no LB in the upstream CI. There is an ssh tunnel, but this is the only test we ever see fail in this way, so it would seem like a strange coincidence if the tunnel were causing any problems.

Comment 16 Tomáš Nožička 2020-05-27 16:47:43 UTC
> I don't see any additional LB running in the CI setup that @christy mentioned; we just have a libvirt IPI-based cluster. Thanks for the patch to the test; I hope it will land in master and be backported to 4.4 and 4.5 soon.

Not an additional load balancer, but the two that should be part of the OCP install. What do you get from

 oc get infrastructure/cluster --template='{{ printf "internalLB: %s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'

and what's behind those addresses?

This test isn't failing in supported setups like AWS or GCE, where the LB is set up correctly. Something is cutting off the connection.

Comment 17 mkumatag 2020-05-28 10:37:06 UTC
(In reply to Tomáš Nožička from comment #16)
> > I don't see any additional LB running in the CI setup that @christy mentioned; we just have a libvirt IPI-based cluster. Thanks for the patch to the test; I hope it will land in master and be backported to 4.4 and 4.5 soon.
> 
> Not an additional load balancer, but the two that should be part of the
> OCP install. What do you get from
> 
>  oc get infrastructure/cluster --template='{{ printf "internalLB:
> %s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'
> 
> and what's behind those addresses?
> 
> This test isn't failing in supported setups like AWS or GCE, where the LB
> is set up correctly. Something is cutting off the connection.

here are the LBs:

sh-4.2$ oc get infrastructure/cluster --template='{{ printf "internalLB:%s\nexternalLB: %s\n" .status.apiServerInternalURI .status.apiServerURL}}'
internalLB:https://api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443
externalLB: https://api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04:6443

and here is the libvirt network config for these endpoints, load-balanced across all the masters in the cluster:

[root@C155F2U33 ~]# virsh net-dumpxml ci-op-rimznz4n-8c2eb-5gj6c
<network>
  <name>ci-op-rimznz4n-8c2eb-5gj6c</name>
  <uuid>2f671596-3cf5-4a23-9682-7e9c2daf4121</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='br8c2eb1' stp='on' delay='0'/>
  <mac address='52:54:00:a3:b2:4f'/>
  <domain name='ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' localOnly='yes'/>
  <dns enable='yes'>
    <host ip='192.168.4.11'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.12'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.13'>
      <hostname>api.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>api-int.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
    <host ip='192.168.4.51'>
      <hostname>alertmanager-main-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>console-openshift-console.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>downloads-openshift-console.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>grafana-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>oauth-openshift.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
      <hostname>prometheus-k8s-openshift-monitoring.apps.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04</hostname>
    </host>
  </dns>
  <ip family='ipv4' address='192.168.4.1' prefix='24'>
    <dhcp>
      <range start='192.168.4.2' end='192.168.4.254'/>
      <host mac='52:54:00:3a:6e:81' name='ci-op-rimznz4n-8c2eb-5gj6c-bootstrap.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.10'/>
      <host mac='52:54:00:7a:a3:5a' name='ci-op-rimznz4n-8c2eb-5gj6c-master-1.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.12'/>
      <host mac='52:54:00:9e:b5:9b' name='ci-op-rimznz4n-8c2eb-5gj6c-master-2.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.13'/>
      <host mac='52:54:00:c6:46:98' name='ci-op-rimznz4n-8c2eb-5gj6c-master-0.ci-op-rimznz4n-8c2eb.libvirt-ppc64le-04' ip='192.168.4.11'/>
      <host mac='52:3c:1e:6a:dd:b2' name='ci-op-rimznz4n-8c2eb-5gj6c-worker-0-xzvkf' ip='192.168.4.51'/>
      <host mac='de:91:4a:1c:9f:c3' name='ci-op-rimznz4n-8c2eb-5gj6c-worker-0-w2brz' ip='192.168.4.52'/>
    </dhcp>
  </ip>
</network>

[root@C155F2U33 ~]#

Comment 18 Christy Norman 2020-05-28 19:56:01 UTC
Is there some more information somewhere about those two LBs? This is obviously new to me. Any suggestions on debugging them on our end? I'm not sure if we have access to those LBs or not, as it seems they live inside the OCP CI cluster when they're alive.

Comment 19 lmcfadde 2020-06-17 20:27:12 UTC
Hi @tnozicka@redhat.com, do you have any updates here, and did you see the question (comment 18) from Christy?

Comment 20 Tomáš Nožička 2020-06-18 10:44:41 UTC
Documentation for LB requirements lives in

  https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md
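
(Illustrative aside: the linked document describes, among other things, health-checking the kube-apiserver over its /readyz endpoint rather than only its TCP port. A minimal Go probe in that spirit follows; the master address is a placeholder matching one of the IPs from comment 17, and skipping TLS verification mirrors what a typical LB health check does. Treat this as an illustration, not a prescribed configuration.)

// Hedged sketch: probing a kube-apiserver's /readyz endpoint, the kind of
// check the linked document expects load balancers to perform.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Health checks usually skip cert verification; placeholder setup.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://192.168.4.11:6443/readyz") // placeholder master address
	if err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode) // 200 indicates the apiserver reports ready
}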

Comment 21 wewang 2020-06-28 05:55:31 UTC
Hi @yin, I changed the QA Contact to you, thanks.

Comment 23 Christy Norman 2020-09-11 18:14:27 UTC
Seeing a flurry of activity on this bz. Was an issue identified and fixed?

Comment 24 Tomáš Nožička 2020-09-14 10:52:28 UTC
As you can see, the PR is merged and the BZ is ON_QA. It should be fixed in master.

Comment 25 Christy Norman 2020-09-14 13:36:13 UTC
Interesting. Are there comments that aren't visible to the public bz? I don't see any mention of a fix or a PR from my view.

Comment 26 Christy Norman 2020-09-14 13:38:48 UTC
Actually, there's a PR link added in an update a long time ago by a bot, but no mention of it by humans. Very strange. But if that's the PR that fixes this, then we should try to verify.

Comment 27 Tomáš Nožička 2020-09-14 14:00:47 UTC
> Actually, there's a PR link added in an update a long time ago by a bot, but no mention of it by humans.

That's it. If a PR mentions fixing this BZ in its title, the bot will automatically link it and change the state to POST. When the PR merges, the bug goes to MODIFIED, and when a payload (nightly) containing the fix is available, it transitions to ON_QA.


Interestingly, I can see changes like:

OpenShift BugZilla Robot 2020-09-09 20:41:32 UTC
Status: POST → MODIFIED

shown as public, but I don't see them when I am not logged in. The `Status: ON_QA` at the top of the BZ is publicly visible, though.

Comment 28 Christy Norman 2020-09-14 14:10:37 UTC
Yes, all the things you pointed out are visible, but I missed the implication. Thanks for the explanation.

Comment 29 zhou ying 2020-09-15 02:00:49 UTC
Hi,

Would you please help take this issue, because I don't have the required cluster? Thanks a lot.

Comment 30 Dan Li 2020-09-15 12:25:59 UTC
Making Ying's QE request comment un-private, as Manju cannot see private comments.

