test: [sig-storage] Managed cluster should have no crashlooping recycler pods over four minutes is failing frequently in CI, see search results: https://search.ci.openshift.org/?search=Managed+cluster+should+have+no+crashlooping+recycler+pods+over+four+minutes&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job

It fails in 60% of metal-e2e failures, 44% of gcp-e2e failures, and 29% of aws-e2e failures. These are specifically recycler pods, not any other crashlooping pods. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1295260912655011840 is a good example.
If this gets duped against a non-storage-specific bug, please show evidence of other pods that remained pending for the entire four minutes. I haven't seen that failure mode anywhere else.
After going through the logs: the NFS volume the recycler is trying to erase never seems to get unmounted (the unmount operation starts, but there is no tear-down or detach). This is most likely because the NFS server became unreachable, i.e. it was killed or the network was taken down.
The failure referenced in the description (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1295260912655011840) looks like this:

Run #0: Failed (4m3s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
<[]string | len:1, cap:1>: [
    "Pod openshift-infra/recycler-for-nfs-vgkns was pending entire time: unknown error",
]
to be empty

Run #1: Failed (4m2s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
<[]string | len:1, cap:1>: [
    "Pod openshift-infra/recycler-for-nfs-vgkns was pending entire time: unknown error",
]
to be empty

============================================
From the build.log:
...
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Started: Started container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {multus } AddedInterface: Add eth0 [10.131.0.84/23]
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Created: Created container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:38 +0000 UTC - event for recycler-for-nfs-vgkns: {multus } AddedInterface: Add eth0 [10.131.0.91/23]
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:39 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Created: Created container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:39 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Started: Started container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:39 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:09:53 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Killing: Stopping container recyler-container
Aug 17 08:18:09.506: INFO: POD NODE PHASE GRACE CONDITIONS
Aug 17 08:18:09.506: INFO: recycler-for-nfs-vgkns ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 Pending 30s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:08:36 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:10:25 +0000 UTC ContainersNotReady containers with unready status: [recyler-container]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:10:25 +0000 UTC ContainersNotReady containers with unready status: [recyler-container]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:08:36 +0000 UTC }]
============================================
From the workers log:
Aug 17 08:08:22.562930 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.562889 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Pulled" message="Container image \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9\" already present on machine"
...
Aug 17 08:08:22.842460 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.842412 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container recyler-container" ... Aug 17 08:08:22.899708 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.899631 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container recyler-container" ... Aug 17 08:08:23.877366 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.877167 1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.84 PodIPs:[{IP:10.131.0.84}] StartTime:2020-08-17 08:08:20 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-17 08:08:22 +0000 UTC,FinishedAt:2020-08-17 08:08:22 +0000 UTC,ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896,}} LastTerminationState:{Waiting:nil 
Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896 Started:0xc000e5ddca}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
...
Aug 17 08:08:26.755245 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:26.746451 1502 manager.go:1044] Destroyed container: "/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb7dd1acc_b524_4397_affb_038525555b8a.slice/crio-7cb5e8ad7ad7b8f933fb8d6c4e2dc77b355046eaf2b27f03018f16c6cdcf6a49.scope" (aliases: [k8s_POD_recycler-for-nfs-vgkns_openshift-infra_b7dd1acc-b524-4397-affb-038525555b8a_0 7cb5e8ad7ad7b8f933fb8d6c4e2dc77b355046eaf2b27f03018f16c6cdcf6a49], namespace: "crio")
...
I would say the recycler pod (b7dd1acc-b524-4397-affb-038525555b8a) that started at 08:08:22 finished successfully, and that is not captured in the test results. There was a second one, though (af234fcb-2bb3-45bc-a580-ece202da5074):
Aug 17 08:08:39.140866 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:39.140795 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Pulled" message="Container image \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9\" already present on machine"
...
Aug 17 08:08:39.425660 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:39.425523 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container recyler-container"
...
Aug 17 08:08:39.474532 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:39.474491 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container recyler-container" ... Aug 17 08:08:40.264303 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.264041 1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (2, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-08-17 08:08:39 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff Started:0xc000e5cf8a}] QOSClass:BestEffort EphemeralContainerStatuses:[]}) ... 
Aug 17 08:09:53.957215 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956978 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container recyler-container" ... Aug 17 08:10:25.929353 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:25.927896 1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (4, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID: ContainerID: Started:0xc00458eea9}] QOSClass:BestEffort EphemeralContainerStatuses:[]}) ... 
Aug 17 08:10:30.688974 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:30.688472 1502 kubelet_pods.go:974] Pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" is terminated, but some volumes have not been cleaned up
============================================
The second recycler got killed, and it seems the volume is hung. The container might have ended up in uninterruptible sleep due to pending IO. The "pod was pending entire time" reason for the test failure is quite inaccurate: both recyclers actually ran, and the first one seemed to run just fine.
Based on content from that run: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1295260912655011840/artifacts/e2e-gcp/pods.json

The recycler-for-nfs-vgkns pod indicates

```
"status": {
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2020-08-17T08:08:36Z",
      "status": "True",
      "type": "Initialized"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2020-08-17T08:10:25Z",
      "message": "containers with unready status: [recyler-container]",
      "reason": "ContainersNotReady",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2020-08-17T08:10:25Z",
      "message": "containers with unready status: [recyler-container]",
      "reason": "ContainersNotReady",
      "status": "False",
      "type": "ContainersReady"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2020-08-17T08:08:36Z",
      "status": "True",
      "type": "PodScheduled"
    }
  ],
  "containerStatuses": [
    {
      "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9",
      "imageID": "",
      "lastState": {},
      "name": "recyler-container",
      "ready": false,
      "restartCount": 0,
      "started": false,
      "state": {
        "waiting": {
          "reason": "ContainerCreating"
        }
      }
    }
  ],
```

It doesn't look to me like it ever started running. It appears to be waiting for some kind of storage attachment. If the failure message is inaccurate, I'd be interested in the kubelet fixes to update the status and the test fixes to properly categorize the failure.
Thanks for continuing to look into this... I'm somewhat confused about what is going on. I tried to dig into the workers log again:

Aug 17 08:08:20.295934 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:20.295873 1502 config.go:383] Receiving a new pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)"
...
Aug 17 08:08:20.630345 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:20.630260 1502 volume_manager.go:403] All volumes are attached and mounted for pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)"
...
Aug 17 08:08:22.795613 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.795004 1502 kubelet.go:1852] SyncLoop (PLEG): "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)", event: &pleg.PodLifecycleEvent{ID:"b7dd1acc-b524-4397-affb-038525555b8a", Type:"ContainerStarted", Data:"7cb5e8ad7ad7b8f933fb8d6c4e2dc77b355046eaf2b27f03018f16c6cdcf6a49"
...
Aug 17 08:08:23.801781 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.801704 1502 kubelet_pods.go:1486] Generating status for "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)"
Aug 17 08:08:23.801781 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.801732 1502 helpers.go:85] Already ran container "recyler-container" of pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)", do nothing
...
Aug 17 08:08:23.877366 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.877167 1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.84 PodIPs:[{IP:10.131.0.84}] StartTime:2020-08-17 08:08:20 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-17 08:08:22 +0000 UTC,FinishedAt:2020-08-17 08:08:22 +0000 UTC,ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896 Started:0xc000e5ddca}] QOSClass:BestEffort EphemeralContainerStatuses:[]}) ... 
Aug 17 08:08:24.180747 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:24.180737 1502 kubelet_pods.go:1204] marking pod termination "b7dd1acc-b524-4397-affb-038525555b8a"
...
Aug 17 08:08:24.846180 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:24.844079 1502 reconciler.go:319] Volume detached for volume "pv-recycler-controller-token-9tp2x" (UniqueName: "kubernetes.io/secret/b7dd1acc-b524-4397-affb-038525555b8a-pv-recycler-controller-token-9tp2x") on node "ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9" DevicePath ""
Aug 17 08:08:24.846180 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:24.844098 1502 reconciler.go:319] Volume detached for volume "vol" (UniqueName: "kubernetes.io/nfs/b7dd1acc-b524-4397-affb-038525555b8a-vol") on node "ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9" DevicePath ""

The first pod appeared at 08:08:20 and was finished and cleaned up by 08:08:24 (roughly four seconds). The test polls the pod states every 5 seconds (https://github.com/openshift/origin/blob/master/test/extended/operators/cluster.go#L58), so these short-lived pods are below its sampling resolution. I'm not sure that's completely correct behaviour, but it's definitely not OK to claim the pod was doing anything the "entire time". Five seconds is a lot, especially when the pod's goal is to remove files from an empty volume. Five seconds is also enough for another pod of the same name to start, so the results we'd get would be bogus.

The second pod is more interesting:

Aug 17 08:08:36.152943 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:36.152754 1502 config.go:383] Receiving a new pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)"
...
Aug 17 08:08:36.785728 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:36.784994 1502 volume_manager.go:403] All volumes are attached and mounted for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" ... Aug 17 08:08:40.200977 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.200958 1502 kubelet.go:1852] SyncLoop (PLEG): "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)", event: &pleg.PodLifecycleEvent{ID:"af234fcb-2bb3-45bc-a580-ece202da5074", Type:"ContainerStarted", Data:"5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff" ... Aug 17 08:08:40.201504 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.201471 1502 volume_manager.go:403] All volumes are attached and mounted for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" ... Aug 17 08:08:40.264303 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.264041 1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (2, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +00 00 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] Container Statuses:[{Name:recyler-container State:{Waiting:nil 
Running:&ContainerStateRunning{StartedAt:2020-08-17 08:08:39 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff Started:0xc000e5cf8a}] QOSClass:BestEffort EphemeralContainerStatuses:[]}) ... Aug 17 08:08:41.219711 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:41.219112 1502 status_manager.go:435] Ignoring same status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: hostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-08-17 08:08:39 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 
ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff Started:0xc0018ebf89}] QOSClass:BestEffort EphemeralContainerStatuses:[]} ... Aug 17 08:09:53.952960 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.952228 1502 config.go:278] Setting pods for source api Aug 17 08:09:53.956962 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956472 1502 kubelet.go:1836] SyncLoop (DELETE, "api"): "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" Aug 17 08:09:53.956962 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956570 1502 kubelet_pods.go:1486] Generating status for "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" Aug 17 08:09:53.956962 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956788 1502 kuberuntime_container.go:635] Killing container "cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff" with a 30 second grace period Aug 17 08:09:53.957215 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956978 1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container recyler-container ... 
Aug 17 08:10:25.929353 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:25.927896 1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (4, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID: ContainerID: Started:0xc00458eea9}] QOSClass:BestEffort EphemeralContainerStatuses:[]}) Aug 17 08:10:25.929353 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:25.929184 1502 kubelet_pods.go:974] Pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" is terminated, but some volumes have not been cleaned up The second pod ran for some ~13 seconds before it got killed. 
Then we failed to clean up the volumes (unmount?), which suggests the NFS server became unreachable or the volume was busy. This is something to look into, since it suggests there are races in the tests. However: the pods.json seems to be mixing events from two different pods of the same name, and yes, if I sampled the pods' states at those times, I'd get "Pending". Both recyclers were apparently in the running state according to the kubelet logs, and both got their volumes attached and mounted OK. The problems start appearing during the second pod's termination: it got killed and the volume seems to have hung. I still think that since it's NFS, the obvious suspects are pending IO (volume busy) or the server going away too early. I'll examine logs of successful passes of the test for comparison.
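The sampling problem mentioned above (a 5-second poll interval vs. a ~4-second pod lifetime) is easy to demonstrate. A minimal sketch, with hypothetical timestamps rather than the real test code:

```go
package main

import "fmt"

func main() {
	const pollInterval = 5.0 // seconds, the interval the origin test uses
	// Hypothetical pod lifetime: created at t=20.6s, cleaned up at t=24.6s,
	// about four seconds, like the first recycler pod in the logs above.
	created, deleted := 20.6, 24.6

	observations := 0
	for t := 0.0; t <= 240; t += pollInterval { // 4-minute observation window
		if t >= created && t < deleted {
			observations++
		}
	}
	// The polls land at t=20 and t=25, so the pod is never seen at all.
	fmt.Println("samples that saw the pod:", observations) // prints 0
}
```

Any pod whose whole life fits between two consecutive samples is invisible to the test, and a replacement pod with the same name can slip into the next sample unnoticed.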
Looked at a successful run, and it really does seem to be the failed cleanup that causes the pod to go from "Running" to "Pending". Need to find out what happened.
*** Bug 1867780 has been marked as a duplicate of this bug. ***
I went through some of the later flakes and there's not much consistency. Some failed for no obvious reason (like the above), some really failed to mount the NFS volume (with the NFS server complaining about "no such file or directory" on the export path), and several failed with the kubelet complaining about system load and an inability to reserve the container name.
And another one where two pods seemed to be somehow treated as one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.6/1300436795133005824

Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:25 +0000 UTC - event for recycler-for-nfs-s2jp8: {multus } AddedInterface: Add eth0 [10.128.2.29/23]
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:26 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-130-140.us-west-1.compute.internal} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:27 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-130-140.us-west-1.compute.internal} Started: Started container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:27 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-130-140.us-west-1.compute.internal} Created: Created container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:41 +0000 UTC - event for recycler-for-nfs-s2jp8: {multus } AddedInterface: Add eth0 [10.129.2.35/23]
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:42 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:43 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Created: Created container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:43 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Started: Started container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:47 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Killing: Stopping container recyler-container

This is strange: Why is the image pulled twice? Why is the pod started twice? But the first one actually finished:

Aug 31 15:01:28.789695 ip-10-0-130-140 hyperkube[1722]: I0831 15:01:28.782943 1722 status_manager.go:578] Status for pod "recycler-for-nfs-s2jp8_openshift-infra(3c6cf55a-da39-43d6-b847-628835057881)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason:PodCompleted Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.130.140 PodIP:10.128.2.29 PodIPs:[{IP:10.128.2.29}] StartTime:2020-08-31 15:01:19 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-31 15:01:27 +0000 UTC,FinishedAt:2020-08-31 15:01:27 +0000 UTC,ContainerID:cri-o://432259daba3a48e209ab9c9e56c85d58e6b6a431fea49a9b3fae5a6c000f30f1,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://432259daba3a48e209ab9c9e56c85d58e6b6a431fea49a9b3fae5a6c000f30f1 Started:0xc00204c7e0}] QOSClass:BestEffort EphemeralContainerStatuses:[]})

The whole wiping of the (empty) disk took slightly more than one second. I just noticed the second pod is killed 30 seconds after the first one started. I'll check the other instances of this failure.
The latest failures ended with "mount.nfs: No route to host" and the logs are odd again:

Sep 1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {multus } AddedInterface: Add eth0 [10.129.3.80/23]
Sep 1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-0} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Sep 1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-0} Created: Created container recyler-container
Sep 1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-0} Started: Started container recyler-container
Sep 1 15:27:34.924: INFO: At 2020-09-01 15:23:14 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-1} FailedMount: Unable to attach or mount volumes: unmounted volumes=[vol], unattached volumes=[vol pv-recycler-controller-token-x9grs]: timed out waiting for the condition
Sep 1 15:27:34.924: INFO: At 2020-09-01 15:23:27 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-1} FailedMount: MountVolume.SetUp failed for volume "vol" : mount failed: exit status 32

So the pod was started on the "compute-0" node, yet the mount failed on "compute-1".
Anyway, the first thing I can start with is the test itself. It really doesn't look correct:
- It polls for a list of pods every 5 seconds for 4 minutes, assuming that "namespace/pod_name" uniquely identifies a pod. We should use the UID instead, otherwise the results can be bogus.
- If I read the code correctly, it builds a list of "pending" pods every 5 seconds, and after the 4-minute timeout it claims that every pod on the list was "pending the entire time", while all we actually know is that the pod was in the Pending state 0-10 seconds ago. Without checking the pod's start time, it's not possible to draw the conclusion the test makes.

There are other reasons this test fails, so I need to keep looking. What is also concerning is comment #4: the pods.json clearly mixes events from two different pods too. That might be another case where namespace/pod is used instead of the UID. And I wonder whether the excerpt from the build log with mixed events from two nodes isn't caused by the same error in yet another place.
FTR: https://github.com/openshift/origin/pull/25471 It's probably not going to fix all instances of the problem, but it should help diagnose the failures more accurately.
The second largest problem is the timeout of the NFS mount on the system:

Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/da144b06-0823-4c34-b5f6-a361cdb513af/volumes/kubernetes.io~nfs/vol --scope -- mount -t nfs 10.128.10.78:/exports /var/lib/kubelet/pods/da144b06-0823-4c34-b5f6-a361cdb513af/volumes/kubernetes.io~nfs/vol
Output: Running scope as unit: run-r2f9a9644fdfa424e9c3873cdc41c22f8.scope
mount.nfs: No route to host
...
Sep 03 19:07:40.453 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: load-balancer-api-external: failed to establish a TCP connection to api.ci-op-n8m712wg-45fdb.origin-ci-int-gce.dev.openshift.com:6443: dial tcp 10.0.0.2:6443: i/o timeout (152 times)

No idea what's wrong in this case, only that the network looks to be down.
I have to poke somebody to take a look at the PR. Or change it to a bugfix instead of "test:" to attract attention... UpcomingSprint.
The PR has been merged, which should hopefully decrease the failure rate of the crashlooping recycler test. I will review the new failures and try to fix those too.
Seems like the past failures are quite consistent: the pod fails to mount the volume with "mount.nfs: No route to host".
The NFS recycling test seems to be racy: it launches an NFS server, provisions a PV with the Recycle policy, starts a pod that writes something to the volume, stops the pod, and then starts another pod that verifies the volume is clean and was properly recycled. It then stops the second pod, and the test finishes with a cleanup that also stops the NFS server. However, the PV still has the Recycle policy, so it must be recycled again -- and depending on how the recycler pod launch interleaves with the NFS server removal, the recycler either succeeds or (quite often) hits the server in various stages of RPC service shutdown and fails.
Upstream PR: https://github.com/kubernetes/kubernetes/pull/96310
Hi Tomas,

Still found a flake: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.7/1331014646701756416

This build should include the PR, but the error message is a little different:

Nov 24 00:21:11.399: INFO: At 2020-11-24 00:19:23 +0000 UTC - event for recycler-for-nfs-8zcrl: {kubelet ci-op-hc958kw9-fdcb7-z8fln-worker-c-tkqtx} FailedMount: MountVolume.SetUp failed for volume "vol" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a6619264-2c85-462e-8f46-2c7152e0d4b7/volumes/kubernetes.io~nfs/vol --scope -- mount -t nfs 10.128.11.235:/exports /var/lib/kubelet/pods/a6619264-2c85-462e-8f46-2c7152e0d4b7/volumes/kubernetes.io~nfs/vol
Output: Running scope as unit: run-re8dc488a17704573a26a884bd2b2f7d9.scope
mount.nfs: Connection timed out

Could you take a look?
I noticed too. :( This test was failing for several reasons, and I think I removed only some of them... There might be more causes for the "no route to host", but the most probable one is that the NFS server pod is no longer running -- which is strange, since we should no longer reach the cleanup code before the volume is recycled. Assigning back to myself.
I was under the impression this was cherry-picked to 4.6... Seems like it wasn't. The 4.7 runs are OK: https://search.ci.openshift.org/?search=Managed+cluster+should+have+no+crashlooping+recycler+pods+over+four+minutes&maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job
Back to MODIFIED: this is really fixed in 4.7. We need to backport/cherry-pick it to previous releases eventually.
Backport bug: https://bugzilla.redhat.com/show_bug.cgi?id=1906011
Checked the e2e test runs from the past 14 days and did not find the error, so I am marking this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633