Bug 1824007
| Summary: | [must gather] Got "timed out waiting for the condition" for must gather | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yan Du <yadu> |
| Component: | oc | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED DUPLICATE | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.4 | CC: | aos-bugs, cnv-qe-bugs, danken, erich, fdeutsch, jokerman, jsaucier, knarra, mfojtik, ncredi, rcernin, rgarcia, wking, xtian, ycui, yinzhou |
| Target Milestone: | --- | Flags: | mfojtik: needinfo? |
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | LifecycleReset | ||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-15 13:47:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Client Version: 4.4.0-0.nightly-2020-04-09-220855
Server Version: 4.4.0-rc.7
Kubernetes Version: v1.17.1
OpenShift must-gather has the same timeout issue:
$ oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
[must-gather ] OUT namespace/openshift-must-gather-t2qhp created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gkb7m created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051 created
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-version...
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-config...
[must-gather-x65n6] POD Gathering data for ns/openshift-config-managed...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress...
[must-gather-x65n6] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-machine-api...
[must-gather-x65n6] POD Gathering data for ns/openshift-console-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-console...
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-storage-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd...
[must-gather-x65n6] POD Gathering data for ns/openshift-image-registry...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-insights...
[must-gather-x65n6] POD Gathering data for ns/openshift-kube-apiserver-operator...
$ oc get pod -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
must-gather-x65n6 0/1 Init:0/1 0 10m 10.129.2.55 host-172-16-0-22 <none> <none>
$ oc describe pod
Name: must-gather-x65n6
Namespace: openshift-must-gather-t2qhp
Priority: 0
Node: host-172-16-0-22/172.16.0.22
Start Time: Wed, 15 Apr 2020 02:50:23 +0000
Labels: app=must-gather
Annotations: k8s.v1.cni.cncf.io/networks-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.2.55"
],
"dns": {},
"default-route": [
"10.129.2.1"
]
}]
Status: Pending
IP: 10.129.2.55
IPs:
IP: 10.129.2.55
Init Containers:
gather:
Container ID: cri-o://1c412570dc9f4c17b00612fe7703e2f7a6c61944b9c863018507c88c41a898e0
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
Port: <none>
Host Port: <none>
Command:
/usr/bin/gather
State: Running
Started: Wed, 15 Apr 2020 02:50:29 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/must-gather from must-gather-output (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-x6twm (ro)
Containers:
copy:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
trap : TERM INT; sleep infinity & wait
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/must-gather from must-gather-output (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-x6twm (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
must-gather-output:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-x6twm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-x6twm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned openshift-must-gather-t2qhp/must-gather-x65n6 to host-172-16-0-22
Normal Pulling 10m kubelet, host-172-16-0-22 Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051"
Normal Pulled 10m kubelet, host-172-16-0-22 Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051"
Normal Created 10m kubelet, host-172-16-0-22 Created container gather
Normal Started 10m kubelet, host-172-16-0-22 Started container gather
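The describe output above shows the gather init container still in State: Running with Ready: False, which is why the pod stays at Init:0/1. A quick way to confirm that directly is sketched below; this is illustrative only, using the pod and namespace names from this particular run (they change on every invocation):

# check the state of the first (and only) init container of the must-gather pod
$ oc get pod must-gather-x65n6 -n openshift-must-gather-t2qhp -o jsonpath='{.status.initContainerStatuses[0].state}'
# prints something like: map[running:map[startedAt:2020-04-15T02:50:29Z]]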
$ oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051
[must-gather ] OUT namespace/openshift-must-gather-t2qhp created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gkb7m created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad8508bdb8bd3a48c5f48011e37234937cc5bf8bd24dcae29bf888c987dda051 created
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-version...
[must-gather-x65n6] POD Wrote inspect data to must-gather.
[must-gather-x65n6] POD Gathering data for ns/openshift-config...
[must-gather-x65n6] POD Gathering data for ns/openshift-config-managed...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication...
[must-gather-x65n6] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress...
[must-gather-x65n6] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-machine-api...
[must-gather-x65n6] POD Gathering data for ns/openshift-console-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-console...
[must-gather-x65n6] POD Gathering data for ns/openshift-cluster-storage-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-dns...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-etcd...
[must-gather-x65n6] POD Gathering data for ns/openshift-image-registry...
[must-gather-x65n6] POD Gathering data for ns/openshift-ingress-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-insights...
[must-gather-x65n6] POD Gathering data for ns/openshift-kube-apiserver-operator...
[must-gather-x65n6] POD Gathering data for ns/openshift-kube-apiserver...
[must-gather-x65n6] OUT gather logs unavailable: unexpected EOF
[must-gather-x65n6] OUT waiting for gather to complete
[must-gather-x65n6] OUT gather never finished: timed out waiting for the condition
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-gkb7m deleted
[must-gather ] OUT namespace/openshift-must-gather-t2qhp deleted
error: gather never finished for pod must-gather-x65n6: timed out waiting for the condition

Since this reproduces without CNV, I think this bug should be moved to OCP 4.4.

Please make sure to include the precise OpenShift cluster version.

I tried performing oc adm must-gather on the latest nightly and I did not see any issues as mentioned in comment 2. Below are the versions of the oc client and server where the testing was done:
[ramakasturinarra@dhcp35-60 verification-tests]$ oc version
Client Version: 4.4.0-202004040654-2576e48
Server Version: 4.4.0-0.nightly-2020-04-20-010655
Kubernetes Version: v1.17.1

Just tried again after updating to the latest version of the oc client and did not see any issue.
[ramakasturinarra@dhcp35-60 verification-tests]$ oc version
Client Version: 4.4.0-202004180654-2576e48
Server Version: 4.4.0-0.nightly-2020-04-20-010655
Kubernetes Version: v1.17.1

I'm afraid I cannot live with pushing this out to 2.4 if we cannot collect logs; our customers will suffer (and so will we as they open bugs). I would like to make sure this is not a real issue before pushing it out, so for now I'm untargeting it, and I want to compare the oc client we are using vs. OCP, maybe that is the problem. @Yan, can you please check which oc client version you were working with?

Here is the version:
Client Version: 4.4.0-0.nightly-2020-04-09-220855
Server Version: 4.4.0-rc.7
Kubernetes Version: v1.17.1

I've checked: the `oc adm must-gather` did not complete within the 15 minutes. I opened two terminals, ran `oc adm must-gather` in the first, and checked the logs from the gather container in the second (a sketch of that second-terminal check follows these comments); we can see from http://pastebin.test.redhat.com/857263 that the command was still gathering data from the cluster when it timed out, so the gather never completed. I'm curious why the cluster runs the command so slowly.

IIUC this means that only oc timed out, but the gathering itself still continues, meaning that, if needed, customers will still be able to get the full must-gather log. In this case, I think we can add it to known issues, and we should maybe revisit the amount of data we are collecting. @Zhou, is my understanding correct?

(In reply to Nelly Credi from comment #8)
> IIUC this means that only OC timed-out, but the gathering itself still
> continues,
If the oc client times out, that means the gathering pod did not complete all the gathering tasks, and there is no gathering log available.
> meaning that if needed, customers will still be able to get the full must
> gather log.
> in this case, I think we can add it to known issues
> and we should maybe revisit the amount of data we are collecting.
>
> @Zhou is my understanding correct?

Hi Yan Du, I have updated the mail thread with all the details on why it happened. Please take a look; that should help explain why it failed. Thanks, kasturi

Nelly Credi: Yan is right.

Thanks, both. But this is not happening in OCP, so I assume we should keep the bug on CNV and look into the data we are collecting to understand why it is taking so much time. Targeting to 2.3 and suggesting it to be a blocker.

But comment 7 looks like this also happens without the CNV bits.

@Fabian, you are correct. So I will move it to OCP and mark it as affecting CNV.

(In reply to Fabian Deutsch from comment #13)
> But comment 7 looks like this also happens without the CNV bits.
@Zhou Ying, could you help clarify whether you tested on an OCP env or an OCP + CNV env? I am not quite sure from comment #7.

@Ying Cui: Yes, I tested on an OCP + CNV env; I can't reproduce it with an OCP-only env.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This still requires further investigation and as such won't make the 4.5 cut, moving to 4.6.

Maciej Szulik, would it be possible to include a fix for this bug in a z-stream?

(In reply to Rick Garcia from comment #19)
> Maciej Szulik, would it be possible to include a fix for this bug in a
> z-stream?
We might consider that, if the fix will be small enough.

Thanks Maciej! By when would we be able to know?
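For reference, the two-terminal check described in comment 7 above amounts to something like the following. This is only an illustrative sketch; the temporary namespace and pod name are taken from the run in this report and change on every invocation:

# terminal 1: start the collection
$ oc adm must-gather

# terminal 2: locate the temporary must-gather pod and follow its gather container's logs
$ oc get pods --all-namespaces | grep must-gather
$ oc logs -f -n openshift-must-gather-t2qhp must-gather-x65n6 -c gather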
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently. The bug assignee was notified.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

There are some improvements being made to must-gather which are pending in https://bugzilla.redhat.com/show_bug.cgi?id=1888192. The timeout itself, as mentioned earlier, is configurable.

*** This bug has been marked as a duplicate of bug 1888192 ***
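Since the closing comment notes that the timeout is configurable, the usual mitigation is to raise it on the command line. A sketch, assuming an oc client recent enough to expose the --timeout flag (check `oc adm must-gather -h` for your version):

# allow the gather pod more time before oc gives up waiting
$ oc adm must-gather --timeout=30m
# the same flag can be combined with a plug-in image, e.g. the CNV image from the original report
$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 --timeout=30m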
Description of problem:
$ oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-0/must_gather0
[must-gather ] OUT Using must-gather plugin-in image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38
[must-gather ] OUT namespace/openshift-must-gather-t7tn7 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-kc56b created
[must-gather ] OUT pod for plug-in image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 created
[must-gather-km89s] POD Gathering data for ns/openshift-cnv...
[must-gather-km89s] OUT gather logs unavailable: unexpected EOF
[must-gather-km89s] OUT waiting for gather to complete
[must-gather-km89s] OUT gather never finished: timed out waiting for the condition
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-kc56b deleted
[must-gather ] OUT namespace/openshift-must-gather-t7tn7 deleted
error: gather never finished for pod must-gather-km89s: timed out waiting for the condition

Version-Release number of selected component (if applicable):
OCP4.4 CNV2.3

How reproducible:
Sometimes

Steps to Reproduce:
1. Run must gather
# oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38 --dest-dir=/tmp/pytest-of-cnv-qe-jenkins/pytest-0/must_gather0

Actual results:
Same as description

Expected results:
must gather works well without any error

Additional info:
$ oc describe pod
Name: must-gather-km89s
Namespace: openshift-must-gather-t7tn7
Priority: 0
Node: host-172-16-0-22/172.16.0.22
Start Time: Wed, 15 Apr 2020 02:29:19 +0000
Labels: app=must-gather
Annotations: k8s.v1.cni.cncf.io/networks-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.129.2.54"
],
"dns": {},
"default-route": [
"10.129.2.1"
]
}]
Status: Pending
IP: 10.129.2.54
IPs:
IP: 10.129.2.54
Init Containers:
gather:
Container ID: cri-o://0a1128fce000895017e6c492488b18c8368fa8cee95dec9fdb9f603b132509ab
Image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38
Image ID: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8@sha256:985170afa4b301264969c0b1b3985baca89e40219582efbdbd52735b3eaddcec
Port: <none>
Host Port: <none>
Command:
/usr/bin/gather
State: Running
Started: Wed, 15 Apr 2020 02:29:21 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/must-gather from must-gather-output (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tfm2n (ro)
Containers:
copy:
Container ID:
Image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
trap : TERM INT; sleep infinity & wait
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/must-gather from must-gather-output (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tfm2n (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
must-gather-output:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-tfm2n:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-tfm2n
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned openshift-must-gather-t7tn7/must-gather-km89s to host-172-16-0-22
Normal Pulled 6m43s kubelet, host-172-16-0-22 Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.3.0-38" already present on machine
Normal Created 6m43s kubelet, host-172-16-0-22 Created container gather
Normal Started 6m42s kubelet, host-172-16-0-22 Started container gather