Bug 1824007

Summary: [must gather] Got "timed out waiting for the condition" for must gather
Product: OpenShift Container Platform
Reporter: Yan Du <yadu>
Component: oc
Assignee: Maciej Szulik <maszulik>
Status: CLOSED DUPLICATE
QA Contact: zhou ying <yinzhou>
Severity: medium
Priority: low
Version: 4.4
CC: aos-bugs, cnv-qe-bugs, danken, erich, fdeutsch, jokerman, jsaucier, knarra, mfojtik, ncredi, rcernin, rgarcia, wking, xtian, ycui, yinzhou
Target Milestone: ---
Flags: mfojtik: needinfo?
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleReset
Last Closed: 2021-01-15 13:47:29 UTC
Type: Bug

Description Yan Du 2020-04-15 03:28:26 UTC
Description of problem:
$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-0/must_gather0
[must-gather      ] OUT Using must-gather plugin-in image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38
[must-gather      ] OUT namespace/openshift-must-gather-t7tn7 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-kc56b created
[must-gather      ] OUT pod for plug-in image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 created
[must-gather-km89s] POD Gathering data for ns/openshift-cnv...
[must-gather-km89s] OUT gather logs unavailable: unexpected EOF
[must-gather-km89s] OUT waiting for gather to complete
[must-gather-km89s] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-kc56b deleted
[must-gather      ] OUT namespace/openshift-must-gather-t7tn7 deleted
error: gather never finished for pod must-gather-km89s: timed out waiting for the condition

Version-Release number of selected component (if applicable):
OCP4.4
CNV2.3

How reproducible:
Sometimes

Steps to Reproduce:
1. Run must-gather:
# oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-0/must_gather0

Actual results:
Same as in the description

Expected results:
must-gather completes without any error

Additional info:
$ oc describe pod
Name:         must-gather-km89s
Namespace:    openshift-must-gather-t7tn7
Priority:     0
Node:         host-172-16-0-22/172.16.0.22
Start Time:   Wed, 15 Apr 2020 02:29:19 +0000
Labels:       app=must-gather
Annotations:  k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.129.2.54"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.129.2.1"
                    ]
                }]
Status:       Pending
IP:           10.129.2.54
IPs:
  IP:  10.129.2.54
Init Containers:
  gather:
    Container ID:  cri-o://0a1128fce000895017e6c492488b18c8368fa8cee95dec9fdb9f603b132509ab
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38
    Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8@sha256:985170afa4b301264969c0b1b3985baca89e40219582efbdbd52735b3eaddcec
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/gather
    State:          Running
      Started:      Wed, 15 Apr 2020 02:29:21 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tfm2n (ro)
Containers:
  copy:
    Container ID:  
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      trap : TERM INT; sleep infinity & wait
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tfm2n (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  must-gather-output:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-tfm2n:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-tfm2n
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     
Events:
  Type    Reason     Age        From                       Message
  ----    ------     ----       ----                       -------
  Normal  Scheduled  <unknown>  default-scheduler          Successfully assigned openshift-must-gather-t7tn7/must-gather-km89s to host-172-16-0-22
  Normal  Pulled     6m43s      kubelet, host-172-16-0-22  Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38" already present on machine
  Normal  Created    6m43s      kubelet, host-172-16-0-22  Created container gather
  Normal  Started    6m42s      kubelet, host-172-16-0-22  Started container gather

Comment 1 Yan Du 2020-04-15 03:39:51 UTC
Client Version: 4.4.0-0.nightly-2020-04-09-220855
Server Version: 4.4.0-rc.7
Kubernetes Version: v1.17.1

OpenShift's own must-gather has the same timeout issue:

$ oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
[must-gather      ] OUT namespace/openshift-must-gather-t2qhp created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gkb7m created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051 created
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-version...
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-config...
[must-gather-x65n6] POD Gathering data for ns/openshift-config-managed...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress...
[must-gather-x65n6] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-machine-api...
[must-gather-x65n6] POD Gathering data for ns/openshift-console-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-console...
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-storage-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd...
[must-gather-x65n6] POD Gathering data for ns/openshift-image-registry...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-insights...
[must-gather-x65n6] POD Gathering data for ns/openshift-kube-apiserver-operator...

$ oc get pod -o wide -w
NAME                READY   STATUS     RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
must-gather-x65n6   0/1     Init:0/1   0          10m   10.129.2.55   host-172-16-0-22   <none>           <none>

$ oc describe pod
Name:         must-gather-x65n6
Namespace:    openshift-must-gather-t2qhp
Priority:     0
Node:         host-172-16-0-22/172.16.0.22
Start Time:   Wed, 15 Apr 2020 02:50:23 +0000
Labels:       app=must-gather
Annotations:  k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.129.2.55"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.129.2.1"
                    ]
                }]
Status:       Pending
IP:           10.129.2.55
IPs:
  IP:  10.129.2.55
Init Containers:
  gather:
    Container ID:  cri-o://1c412570dc9f4c17b00612fe7703e2f7a6c61944b9c863018507c88c41a898e0
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/gather
    State:          Running
      Started:      Wed, 15 Apr 2020 02:50:29 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x6twm (ro)
Containers:
  copy:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      trap : TERM INT; sleep infinity & wait
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x6twm (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  must-gather-output:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-x6twm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-x6twm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     
Events:
  Type    Reason     Age        From                       Message
  ----    ------     ----       ----                       -------
  Normal  Scheduled  <unknown>  default-scheduler          Successfully assigned openshift-must-gather-t2qhp/must-gather-x65n6 to host-172-16-0-22
  Normal  Pulling    10m        kubelet, host-172-16-0-22  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051"
  Normal  Pulled     10m        kubelet, host-172-16-0-22  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051"
  Normal  Created    10m        kubelet, host-172-16-0-22  Created container gather
  Normal  Started    10m        kubelet, host-172-16-0-22  Started container gather

Comment 2 Yan Du 2020-04-15 03:45:42 UTC
$ oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
[must-gather      ] OUT namespace/openshift-must-gather-t2qhp created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gkb7m created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051 created
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-version...
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-config...
[must-gather-x65n6] POD Gathering data for ns/openshift-config-managed...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress...
[must-gather-x65n6] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-machine-api...
[must-gather-x65n6] POD Gathering data for ns/openshift-console-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-console...
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-storage-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd...
[must-gather-x65n6] POD Gathering data for ns/openshift-image-registry...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-insights...
[must-gather-x65n6] POD Gathering data for ns/openshift-kube-apiserver-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-kube-apiserver...
[must-gather-x65n6] OUT gather logs unavailable: unexpected EOF
[must-gather-x65n6] OUT waiting for gather to complete
[must-gather-x65n6] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gkb7m deleted
[must-gather      ] OUT namespace/openshift-must-gather-t2qhp deleted
error: gather never finished for pod must-gather-x65n6: timed out waiting for the condition

Comment 3 Dan Kenigsberg 2020-04-19 12:16:58 UTC
Since this reproduces without CNV, I think this bug should be moved to OCP 4.4. Please make sure to include the precise OpenShift cluster version.

Comment 4 RamaKasturi 2020-04-20 10:07:50 UTC
I tried running oc adm must-gather on the latest nightly and did not see the issues mentioned in comment 2. Below are the oc client and server versions used for testing.

[ramakasturinarra@dhcp35-60 verification-tests]$ oc version
Client Version: 4.4.0-202004040654-2576e48
Server Version: 4.4.0-0.nightly-2020-04-20-010655
Kubernetes Version: v1.17.1

I tried again after updating to the latest oc client and still did not see any issue.
[ramakasturinarra@dhcp35-60 verification-tests]$ oc version
Client Version: 4.4.0-202004180654-2576e48
Server Version: 4.4.0-0.nightly-2020-04-20-010655
Kubernetes Version: v1.17.1

Comment 5 Nelly Credi 2020-04-20 15:07:18 UTC
I'm afraid I cannot live with pushing this out to 2.4:
if we cannot collect logs, our customers will suffer
(and so will we, as they open bugs).

I would like to make sure this is not a real issue before pushing it out,
so for now I'm untargeting it, and I want to compare the oc client we are using vs. the one OCP ships; maybe that is the problem.

@Yan, can you please check which oc client version you were working with?

Comment 6 Yan Du 2020-04-21 01:24:32 UTC
Here are the versions:

Client Version: 4.4.0-0.nightly-2020-04-09-220855
Server Version: 4.4.0-rc.7
Kubernetes Version: v1.17.1

Comment 7 zhou ying 2020-04-21 07:47:32 UTC
I've checked: `oc adm must-gather` did not complete within the 15 minutes.
I opened two terminals. In the first I ran `oc adm must-gather`; in the second I checked the logs from the container (see http://pastebin.test.redhat.com/857263). The command was still gathering data from the cluster when the timeout hit, so the gather never completed. I'm curious why the cluster runs the command so slowly.
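
For reference, a minimal sketch of that second-terminal check (the namespace and pod names are the ones printed in comment 1 of this report; substitute whatever names your own run prints):

$ oc get pod -n openshift-must-gather-t2qhp -o wide -w    # watch the ephemeral must-gather pod
$ oc logs -n openshift-must-gather-t2qhp must-gather-x65n6 -c gather -f    # follow the gather init container's log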

Comment 8 Nelly Credi 2020-04-21 08:07:21 UTC
IIUC this means that only oc timed out, but the gathering itself continues,
meaning that, if needed, customers would still be able to get the full must-gather log.
In that case, I think we can add it to the known issues,
and we should maybe revisit the amount of data we are collecting.

@Zhou is my understanding correct?

Comment 9 Yan Du 2020-04-21 08:26:44 UTC
(In reply to Nelly Credi from comment #8)
> IIUC this means that only OC timed-out, but the gathering itself still
> continues, 

If the oc client times out, the gathering pod has not completed all of its gathering tasks, and no gather log is available.

> meaning that if needed, customers will still be able to get the full must
> gather log.
> in this case, I think we can add it to known issues
> and we should maybe revisit the amount of data we are collecting.
> 
> @Zhou is my understanding correct?

Comment 10 RamaKasturi 2020-04-21 09:46:06 UTC
Hi Yan Du,

  I have updated the mail thread with the details of why this happened. Please take a look; that should explain why it failed.

Thanks
kasturi

Comment 11 zhou ying 2020-04-21 10:06:29 UTC
Nelly Credi:

Yan is right.

Comment 12 Nelly Credi 2020-04-21 10:15:22 UTC
Thanks, both.
But this is not happening on plain OCP, so I assume we should keep the bug on CNV
and look into the data we are collecting to understand why it is taking so much time.
Targeting to 2.3 and suggesting it be a blocker.

Comment 13 Fabian Deutsch 2020-04-21 11:07:20 UTC
But comment 7 looks like this also happens without the CNV bits.

Comment 14 Nelly Credi 2020-04-21 11:51:22 UTC
@Fabian, you are correct,
so I will move it to OCP and mark it as affecting CNV.

Comment 15 Ying Cui 2020-05-06 09:10:43 UTC
(In reply to Fabian Deutsch from comment #13)
> But comment 7 looks like this also happens without the CNV bits.

@Zhou Ying, could you help clarify whether you tested on an OCP env or an OCP + CNV env? I cannot quite tell from comment #7.

Comment 16 zhou ying 2020-05-06 09:13:27 UTC
@Ying Cui :

Yes, I tested on an OCP + CNV env; I can't reproduce it on a plain OCP env.

Comment 17 Maciej Szulik 2020-05-20 09:40:04 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 18 Maciej Szulik 2020-05-25 12:05:45 UTC
This still requires further investigation and as such won't make 4.5 cut, moving to 4.6.

Comment 19 Ric Garcia 2020-05-26 20:09:52 UTC
Maciej Szulik, would it be possible to include a fix for this bug in a z-stream?

Comment 20 Maciej Szulik 2020-05-27 09:02:02 UTC
(In reply to Ric Garcia from comment #19)
> Maciej Szulik, would it be possible to include a fix for this bug in a
> z-stream?

We might consider that, if the fix is small enough.

Comment 21 Ric Garcia 2020-05-27 18:19:25 UTC
Thanks, Maciej! By when will we know?

Comment 23 Maciej Szulik 2020-06-18 10:09:56 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 24 Maciej Szulik 2020-07-09 11:04:54 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 25 Maciej Szulik 2020-08-21 14:09:35 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 26 Michal Fojtik 2020-08-24 13:11:39 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 28 Michal Fojtik 2020-08-24 14:08:15 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 30 Maciej Szulik 2020-09-11 11:21:13 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 31 Michal Fojtik 2020-09-23 15:02:56 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 33 Michal Fojtik 2021-01-14 13:38:58 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 34 Maciej Szulik 2021-01-15 13:47:29 UTC
There are some improvements to must-gather pending in https://bugzilla.redhat.com/show_bug.cgi?id=1888192
The timeout itself, as mentioned earlier, is configurable.
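
For example, a minimal sketch (assuming a recent oc client; check `oc adm must-gather -h` to confirm the flag is available in your version):

$ oc adm must-gather --timeout=30m --dest-dir=/tmp/must-gather    # raise the gather timeout from the default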

*** This bug has been marked as a duplicate of bug 1888192 ***