Bug 1870337 - [sig-storage] Managed cluster should have no crashlooping recycler pods over four minutes
Summary: [sig-storage] Managed cluster should have no crashlooping recycler pods over four minutes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Tomas Smetana
QA Contact: Qin Ping
URL:
Whiteboard:
Duplicates: 1867780
Depends On:
Blocks: 1906011 1916335
 
Reported: 2020-08-19 19:48 UTC by David Eads
Modified: 2021-02-24 15:16 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
Last Closed: 2021-02-24 15:15:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes pull 96310 0 None closed PV e2e: fix race in NFS recycling test 2021-02-19 05:34:27 UTC
Github openshift kubernetes pull 441 0 None closed Bug 1870337: UPSTREAM: 96310: PV e2e: fix race in NFS recycling test 2021-02-19 05:34:27 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:16:17 UTC

Description David Eads 2020-08-19 19:48:47 UTC
test:
[sig-storage] Managed cluster should have no crashlooping recycler pods over four minutes 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?search=Managed+cluster+should+have+no+crashlooping+recycler+pods+over+four+minutes&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job


It fails in 60% of metal-e2e failures, 44% of gcp-e2e failures, and 29% of aws-e2e failures.  These are specifically recycler pods, not any other crashlooping pods.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1295260912655011840

is a good example.

Comment 1 David Eads 2020-08-19 19:49:54 UTC
If this gets closed as a duplicate of a non-storage-specific bug, please show evidence of other pods that remained pending for the entire four minutes.  I haven't seen that failure mode elsewhere.

Comment 2 Tomas Smetana 2020-08-21 15:05:50 UTC
After going through the logs: the NFS volume the recycler is trying to erase never seems to get unmounted (the unmount operation starts but there is no tear-down or detach). This is most likely because the NFS server became unreachable, i.e. it got killed or the network was taken down.

Comment 3 Tomas Smetana 2020-08-24 15:18:25 UTC
The failure referenced in the description (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1295260912655011840) looks like this:

Run #0: Failed (4m3s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-infra/recycler-for-nfs-vgkns was pending entire time: unknown error",
    ]
to be empty
Run #1: Failed (4m2s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-infra/recycler-for-nfs-vgkns was pending entire time: unknown error",
    ]
to be empty

============================================ From the build.log:
...
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Started: Started container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {multus } AddedInterface: Add eth0 [10.131.0.84/23]
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Created: Created container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:22 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:38 +0000 UTC - event for recycler-for-nfs-vgkns: {multus } AddedInterface: Add eth0 [10.131.0.91/23]
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:39 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Created: Created container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:39 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Started: Started container recyler-container
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:08:39 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 17 08:18:09.488: INFO: At 2020-08-17 08:09:53 +0000 UTC - event for recycler-for-nfs-vgkns: {kubelet ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9} Killing: Stopping container recyler-container
Aug 17 08:18:09.506: INFO: POD                     NODE                                       PHASE    GRACE  CONDITIONS
Aug 17 08:18:09.506: INFO: recycler-for-nfs-vgkns  ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9  Pending  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:08:36 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:10:25 +0000 UTC ContainersNotReady containers with unready status: [recyler-container]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:10:25 +0000 UTC ContainersNotReady containers with unready status: [recyler-container]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-08-17 08:08:36 +0000 UTC  }]

============================================ From the workers log:

Aug 17 08:08:22.562930 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.562889    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Pulled" message="Container image \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9\" already present on machine"
...
Aug 17 08:08:22.842460 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.842412    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container recyler-container"
...
Aug 17 08:08:22.899708 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.899631    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container recyler-container"
...
Aug 17 08:08:23.877366 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.877167    1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.84 PodIPs:[{IP:10.131.0.84}] StartTime:2020-08-17 08:08:20 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-17 08:08:22 +0000 UTC,FinishedAt:2020-08-17 08:08:22 +0000 UTC,ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896 Started:0xc000e5ddca}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
...
Aug 17 08:08:26.755245 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:26.746451    1502 manager.go:1044] Destroyed container: "/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb7dd1acc_b524_4397_affb_038525555b8a.slice/crio-7cb5e8ad7ad7b8f933fb8d6c4e2dc77b355046eaf2b27f03018f16c6cdcf6a49.scope" (aliases: [k8s_POD_recycler-for-nfs-vgkns_openshift-infra_b7dd1acc-b524-4397-affb-038525555b8a_0 7cb5e8ad7ad7b8f933fb8d6c4e2dc77b355046eaf2b27f03018f16c6cdcf6a49], namespace: "crio")
...


I would say the recycler pod (b7dd1acc-b524-4397-affb-038525555b8a) that started at 08:08:22 finished successfully and is not captured in the test results.
There was a second one though (af234fcb-2bb3-45bc-a580-ece202da5074):


Aug 17 08:08:39.140866 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:39.140795    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Pulled" message="Container image \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9\" already present on machine"
...
Aug 17 08:08:39.425660 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:39.425523    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container recyler-container"
...
Aug 17 08:08:39.474532 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:39.474491    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container recyler-container"
...
Aug 17 08:08:40.264303 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.264041    1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (2, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-08-17 08:08:39 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff Started:0xc000e5cf8a}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
...
Aug 17 08:09:53.957215 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956978    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container recyler-container"
...
Aug 17 08:10:25.929353 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:25.927896    1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (4, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID: ContainerID: Started:0xc00458eea9}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
...
Aug 17 08:10:30.688974 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:30.688472    1502 kubelet_pods.go:974] Pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" is terminated, but some volumes have not been cleaned up

============================================

The second recycler got killed and it seems the volume is hung, possibly because the container ended up in uninterruptible sleep due to pending IO. The "pod was pending entire time" reason for the test failure is quite inaccurate: both recyclers actually ran, and the first one seemed to run just fine.

Comment 4 David Eads 2020-08-24 15:26:57 UTC
Based on content from that run: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1295260912655011840/artifacts/e2e-gcp/pods.json 

The recycler-for-nfs-vgkns pod indicates

```
            "status": {
                "conditions": [
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-08-17T08:08:36Z",
                        "status": "True",
                        "type": "Initialized"
                    },
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-08-17T08:10:25Z",
                        "message": "containers with unready status: [recyler-container]",
                        "reason": "ContainersNotReady",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-08-17T08:10:25Z",
                        "message": "containers with unready status: [recyler-container]",
                        "reason": "ContainersNotReady",
                        "status": "False",
                        "type": "ContainersReady"
                    },
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2020-08-17T08:08:36Z",
                        "status": "True",
                        "type": "PodScheduled"
                    }
                ],
                "containerStatuses": [
                    {
                        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9",
                        "imageID": "",
                        "lastState": {},
                        "name": "recyler-container",
                        "ready": false,
                        "restartCount": 0,
                        "started": false,
                        "state": {
                            "waiting": {
                                "reason": "ContainerCreating"
                            }
                        }
                    }
                ],
```

It doesn't look like it started running to me.  It appears to be waiting for some kind of storage attachment.  


If the failure is inaccurate, I'd be interested in the kubelet fixes to update status and the test fixes to properly categorize the failure.

Comment 5 Tomas Smetana 2020-08-26 10:08:36 UTC
Thanks for continuing to look into this... I'm somewhat confused about what is going on. I tried to dig into the worker's log again:

Aug 17 08:08:20.295934 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:20.295873    1502 config.go:383] Receiving a new pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)"
...
Aug 17 08:08:20.630345 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:20.630260    1502 volume_manager.go:403] All volumes are attached and mounted for pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)"
...
Aug 17 08:08:22.795613 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:22.795004    1502 kubelet.go:1852] SyncLoop (PLEG): "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)", event: &pleg.PodLifecycleEvent{ID:"b7dd1acc-b524-4397-affb-038525555b8a", Type:"ContainerStarted", Data:"7cb5e8ad7ad7b8f933fb8d6c4e2dc77b355046eaf2b27f03018f16c6cdcf6a49"
...
Aug 17 08:08:23.801781 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.801704    1502 kubelet_pods.go:1486] Generating status for "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)"
Aug 17 08:08:23.801781 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.801732    1502 helpers.go:85] Already ran container "recyler-container" of pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)", do nothing
...
Aug 17 08:08:23.877366 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:23.877167    1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(b7dd1acc-b524-4397-affb-038525555b8a)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:20 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.84 PodIPs:[{IP:10.131.0.84}] StartTime:2020-08-17 08:08:20 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-17 08:08:22 +0000 UTC,FinishedAt:2020-08-17 08:08:22 +0000 UTC,ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://0cf04f1ece66ce5d4147f66ef851df3d65cec785477f722cf3023be056cfd896 Started:0xc000e5ddca}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
...
Aug 17 08:08:24.180747 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:24.180737    1502 kubelet_pods.go:1204] marking pod termination "b7dd1acc-b524-4397-affb-038525555b8a"
...
Aug 17 08:08:24.846180 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:24.844079    1502 reconciler.go:319] Volume detached for volume "pv-recycler-controller-token-9tp2x" (UniqueName: "kubernetes.io/secret/b7dd1acc-b524-4397-affb-038525555b8a-pv-recycler-controller-token-9tp2x") on node "ci-op-6pn9js75-ad214-ssdz9
-worker-b-ww2r9" DevicePath ""
Aug 17 08:08:24.846180 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:24.844098    1502 reconciler.go:319] Volume detached for volume "vol" (UniqueName: "kubernetes.io/nfs/b7dd1acc-b524-4397-affb-038525555b8a-vol") on node "ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9" DevicePath ""


The first pod appeared at 08:08:20 and was finished and cleaned up by 08:08:24 (roughly four seconds). The test polls the pod states every 5 seconds (https://github.com/openshift/origin/blob/master/test/extended/operators/cluster.go#L58), so these short-lived pods are below its resolution. I'm not sure that's completely correct behaviour, but it's definitely not OK to claim the pod was doing something the "entire time". Five seconds is a lot, especially if the pod's only job is to remove files from an empty volume. Five seconds is also enough for another pod of the same name to start, so the results we'd get would be bogus.

The second pod is more interesting:

Aug 17 08:08:36.152943 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:36.152754    1502 config.go:383] Receiving a new pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)"
...
Aug 17 08:08:36.785728 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:36.784994    1502 volume_manager.go:403] All volumes are attached and mounted for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)"
...
Aug 17 08:08:40.200977 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.200958    1502 kubelet.go:1852] SyncLoop (PLEG): "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)", event: &pleg.PodLifecycleEvent{ID:"af234fcb-2bb3-45bc-a580-ece202da5074", Type:"ContainerStarted", Data:"5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff"
...
Aug 17 08:08:40.201504 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.201471    1502 volume_manager.go:403] All volumes are attached and mounted for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)"
...
Aug 17 08:08:40.264303 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:40.264041    1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (2, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +00 00 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] Container Statuses:[{Name:recyler-container State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-08-17 08:08:39 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff Started:0xc000e5cf8a}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
...
Aug 17 08:08:41.219711 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:08:41.219112    1502 status_manager.go:435] Ignoring same status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:40 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: hostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-08-17 08:08:39 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff Started:0xc0018ebf89}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
...
Aug 17 08:09:53.952960 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.952228    1502 config.go:278] Setting pods for source api
Aug 17 08:09:53.956962 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956472    1502 kubelet.go:1836] SyncLoop (DELETE, "api"): "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)"
Aug 17 08:09:53.956962 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956570    1502 kubelet_pods.go:1486] Generating status for "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)"
Aug 17 08:09:53.956962 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956788    1502 kuberuntime_container.go:635] Killing container "cri-o://5e5c9ce7311bd4687cd876277f4f1f95dbf2c0f3208ff456b5d39b33c920d7ff" with a 30 second grace period
Aug 17 08:09:53.957215 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:09:53.956978    1502 event.go:291] "Event occurred" object="openshift-infra/recycler-for-nfs-vgkns" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container recyler-container
...
Aug 17 08:10:25.929353 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:25.927896    1502 status_manager.go:578] Status for pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" updated successfully: (4, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:10:25 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [recyler-container]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-17 08:08:36 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.32.2 PodIP:10.131.0.91 PodIPs:[{IP:10.131.0.91}] StartTime:2020-08-17 08:08:36 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID: ContainerID: Started:0xc00458eea9}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
Aug 17 08:10:25.929353 ci-op-6pn9js75-ad214-ssdz9-worker-b-ww2r9 hyperkube[1502]: I0817 08:10:25.929184    1502 kubelet_pods.go:974] Pod "recycler-for-nfs-vgkns_openshift-infra(af234fcb-2bb3-45bc-a580-ece202da5074)" is terminated, but some volumes have not been cleaned up

The second pod ran for ~13 seconds before it got killed. Then we failed to clean up the volumes (unmount?), which suggests the NFS server became unreachable or the volume was busy. This is something to look into since it suggests there are races in the tests.

However: the "pods.json" seems to be mixing events from two different pods of the same name and yes, if I sampled the pods' states at those times, I'd get "Pending". Both recyclers were apparently in the Running state too according to the kubelet logs, and both of them got their volumes attached and mounted OK. The problems start appearing during the second pod's termination: it got killed and it seems the volume hung. I still think that since it's NFS the obvious suspects are pending IO (volume busy) or the server going away too early. I'll examine the logs of successful passes of the test for comparison.

Comment 6 Tomas Smetana 2020-08-27 12:08:55 UTC
Looked at a successful run and it really seems to be the failed cleanup that causes the pod to go from "Running" to "Pending". Need to find out what happened.

Comment 7 Tomas Smetana 2020-08-28 13:27:13 UTC
*** Bug 1867780 has been marked as a duplicate of this bug. ***

Comment 8 Tomas Smetana 2020-08-31 14:09:21 UTC
I went through some of the later flakes and there's not much consistency. Some failed with no obvious reason (like the above), some really failed to mount the NFS volume (with the NFS server complaining about "no such file or directory" on the export path), and several failed with the kubelet complaining about system load and an inability to reserve the container name.

Comment 9 Tomas Smetana 2020-08-31 19:01:28 UTC
And another one where two pods seemed to be somehow treated as one:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.6/1300436795133005824

Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:25 +0000 UTC - event for recycler-for-nfs-s2jp8: {multus } AddedInterface: Add eth0 [10.128.2.29/23]
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:26 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-130-140.us-west-1.compute.internal} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:27 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-130-140.us-west-1.compute.internal} Started: Started container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:27 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-130-140.us-west-1.compute.internal} Created: Created container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:41 +0000 UTC - event for recycler-for-nfs-s2jp8: {multus } AddedInterface: Add eth0 [10.129.2.35/23]
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:42 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:43 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Created: Created container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:43 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Started: Started container recyler-container
Aug 31 15:07:03.725: INFO: At 2020-08-31 15:01:47 +0000 UTC - event for recycler-for-nfs-s2jp8: {kubelet ip-10-0-211-150.us-west-1.compute.internal} Killing: Stopping container recyler-container

This is strange: Why is the image pulled twice? Why is the pod started twice? But the first one actually finished:

Aug 31 15:01:28.789695 ip-10-0-130-140 hyperkube[1722]: I0831 15:01:28.782943    1722 status_manager.go:578] Status for pod "recycler-for-nfs-s2jp8_openshift-infra(3c6cf55a-da39-43d6-b847-628835057881)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason:PodCompleted Message:} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-08-31 15:01:19 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.130.140 PodIP:10.128.2.29 PodIPs:[{IP:10.128.2.29}] StartTime:2020-08-31 15:01:19 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:recyler-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-08-31 15:01:27 +0000 UTC,FinishedAt:2020-08-31 15:01:27 +0000 UTC,ContainerID:cri-o://432259daba3a48e209ab9c9e56c85d58e6b6a431fea49a9b3fae5a6c000f30f1,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ImageID:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 ContainerID:cri-o://432259daba3a48e209ab9c9e56c85d58e6b6a431fea49a9b3fae5a6c000f30f1 Started:0xc00204c7e0}] QOSClass:BestEffort EphemeralContainerStatuses:[]})

The whole wiping of the (empty) volume took slightly more than one second. I just noticed the second pod was killed 30 seconds after the first one had started. I'll check the other instances of this failure.

Comment 10 Tomas Smetana 2020-09-02 06:54:41 UTC
The latest failures ended with "mount.nfs: No route to host" and the logs are odd again:

Sep  1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {multus } AddedInterface: Add eth0 [10.129.3.80/23]
Sep  1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-0} Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9" already present on machine
Sep  1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-0} Created: Created container recyler-container
Sep  1 15:27:34.924: INFO: At 2020-09-01 15:20:57 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-0} Started: Started container recyler-container
Sep  1 15:27:34.924: INFO: At 2020-09-01 15:23:14 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-1} FailedMount: Unable to attach or mount volumes: unmounted volumes=[vol], unattached volumes=[vol pv-recycler-controller-token-x9grs]: timed out waiting for the condition
Sep  1 15:27:34.924: INFO: At 2020-09-01 15:23:27 +0000 UTC - event for recycler-for-nfs-j6xqq: {kubelet compute-1} FailedMount: MountVolume.SetUp failed for volume "vol" : mount failed: exit status 32

So the pod was started on the "compute-0" node; however, the mount failed on "compute-1".

Comment 11 Tomas Smetana 2020-09-02 10:46:18 UTC
Anyway, the first thing I can start with is the test itself. It really doesn't look to be correct:

- It polls for a list of pods every 5 seconds for 4 minutes, assuming that "namespace/pod_name" uniquely identifies a pod. We should use the UID, otherwise the results are bogus.
- If I read the code correctly, it creates a list of "pending" pods every 5 seconds and then, after the 4-minute timeout, claims that any pod on the list was "pending the entire time", while all we know is that the pod was in the Pending state 0-10 seconds ago. Without checking the pod's start time it is not possible to draw the conclusion the test makes (see the sketch at the end of this comment).

There are other reasons this test fails, so I need to keep looking. What is also concerning is comment #4: clearly the pods.json mixes events from two different pods too. It might be another case where namespace/pod_name is used instead of the UID. And I actually wonder whether the excerpt from the build log with mixed events from two nodes isn't caused by the same error in another place too.
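
For illustration only -- this is a rough Go sketch of the two points above, not the actual origin test code, and it assumes a client-go clientset: key the sampled pods by UID rather than namespace/name, and only report a pod as pending the entire window if that particular pod was already present in the first sample and was observed Pending in every sample.

```
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// pendingEntireWindow polls every 5 seconds for the given window and returns
// a message for each pod (identified by UID) that was Pending the whole time.
func pendingEntireWindow(ctx context.Context, c kubernetes.Interface, window time.Duration) []string {
	type record struct {
		name           string
		firstSample    int  // sample index at which the pod was first seen
		everNonPending bool // true if the pod was ever seen outside Pending
	}
	seen := map[types.UID]*record{}

	sample := 0
	for start := time.Now(); time.Since(start) < window; sample++ {
		pods, err := c.CoreV1().Pods("openshift-infra").List(ctx, metav1.ListOptions{})
		if err == nil {
			for i := range pods.Items {
				pod := &pods.Items[i]
				r, ok := seen[pod.UID]
				if !ok {
					r = &record{name: pod.Namespace + "/" + pod.Name, firstSample: sample}
					seen[pod.UID] = r
				}
				if pod.Status.Phase != corev1.PodPending {
					r.everNonPending = true
				}
			}
		}
		time.Sleep(5 * time.Second)
	}

	var failures []string
	for uid, r := range seen {
		// Complain only about a pod that existed in the very first sample and
		// never left Pending; a pod created later, or one that ran and was
		// merely sampled while Pending, is not reported.
		if r.firstSample == 0 && !r.everNonPending {
			failures = append(failures, fmt.Sprintf("pod %s (uid %s) was Pending for the entire window", r.name, uid))
		}
	}
	return failures
}
```

Keying on the UID also avoids conflating two recycler pods that reuse the same name within one polling interval, which is exactly what happened with recycler-for-nfs-vgkns above.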

Comment 12 Tomas Smetana 2020-09-07 08:21:36 UTC
FTR: https://github.com/openshift/origin/pull/25471

It's probably not going to fix all instances of the problem, but it should help categorize the failures more accurately.

Comment 13 Tomas Smetana 2020-09-09 11:42:49 UTC
The second largest problem is the timeout of the NFS mount on the system:

Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/da144b06-0823-4c34-b5f6-a361cdb513af/volumes/kubernetes.io~nfs/vol --scope -- mount -t nfs 10.128.10.78:/exports /var/lib/kubelet/pods/da144b06-0823-4c34-b5f6-a361cdb513af/volumes/kubernetes.io~nfs/vol
Output: Running scope as unit: run-r2f9a9644fdfa424e9c3873cdc41c22f8.scope
mount.nfs: No route to host
...
Sep 03 19:07:40.453 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: load-balancer-api-external: failed to establish a TCP connection to api.ci-op-n8m712wg-45fdb.origin-ci-int-gce.dev.openshift.com:6443: dial tcp 10.0.0.2:6443: i/o timeout (152 times)

No idea what's wrong in this case, just that the network looks to be down.

Comment 14 Tomas Smetana 2020-10-01 08:21:52 UTC
I have to poke somebody to take a look at the PR. Or change it to a bugfix instead of "test:" to attract attention... UpcomingSprint.

Comment 15 Tomas Smetana 2020-10-09 14:07:27 UTC
The PR has been merged, which should hopefully decrease the failure rate of the recycler crashlooping test -- I will review the new failures and try to fix those too.

Comment 16 Tomas Smetana 2020-10-12 14:31:05 UTC
Seems like the past failures are quite consistent: the pod fails to mount the volume with "mount.nfs: No route to host".

Comment 19 Tomas Smetana 2020-11-06 12:45:09 UTC
The NFS recycling test seems to be racy: it launches an NFS server, provisions a PV with the Recycle policy, starts a pod that writes something to the volume, stops the pod, and then starts another one that verifies the volume is clean and has been properly recycled. Then it stops the second pod and the test finishes with a cleanup that also stops the NFS server. However, the PV still has the Recycle policy so it must be recycled again -- and depending on how the recycler pod launch interleaves with the NFS server removal, the recycler either succeeds or (quite often) hits the server in various stages of RPC service shutdown and fails.
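
A minimal sketch of the idea behind the fix -- hypothetical helper names, not the actual upstream change referenced in the next comment: make the test wait for the PV (which has the Recycle reclaim policy) to be recycled back to the Available phase before the cleanup deletes the NFS server pod, so the recycler can never race against the server shutdown.

```
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPVRecycled blocks until the named PV has been recycled, i.e. the
// controller has moved it from Released back to Available.
func waitForPVRecycled(ctx context.Context, c kubernetes.Interface, pvName string) error {
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		pv, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and keep polling
		}
		return pv.Status.Phase == corev1.VolumeAvailable, nil
	})
}

// cleanup tears the fixture down in the safe order: recycling first, then the
// NFS server pod that backs the volume.
func cleanup(ctx context.Context, c kubernetes.Interface, pvName, nfsServerPod, ns string) error {
	if err := waitForPVRecycled(ctx, c, pvName); err != nil {
		return fmt.Errorf("PV %s was not recycled before cleanup: %v", pvName, err)
	}
	return c.CoreV1().Pods(ns).Delete(ctx, nfsServerPod, metav1.DeleteOptions{})
}
```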

Comment 20 Tomas Smetana 2020-11-06 15:54:59 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/96310

Comment 25 Qin Ping 2020-12-03 10:02:50 UTC
Hi Tomas,

Still found a flake: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.7/1331014646701756416
This build should include the PR.

But the error message is a little different:
Nov 24 00:21:11.399: INFO: At 2020-11-24 00:19:23 +0000 UTC - event for recycler-for-nfs-8zcrl: {kubelet ci-op-hc958kw9-fdcb7-z8fln-worker-c-tkqtx} FailedMount: MountVolume.SetUp failed for volume "vol" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a6619264-2c85-462e-8f46-2c7152e0d4b7/volumes/kubernetes.io~nfs/vol --scope -- mount -t nfs 10.128.11.235:/exports /var/lib/kubelet/pods/a6619264-2c85-462e-8f46-2c7152e0d4b7/volumes/kubernetes.io~nfs/vol
Output: Running scope as unit: run-re8dc488a17704573a26a884bd2b2f7d9.scope
mount.nfs: Connection timed out

Could you help take a look?

Comment 26 Tomas Smetana 2020-12-03 12:01:53 UTC
I noticed it too. :( This test was failing for several reasons and I think I removed only some of them... There might be more reasons for the "no route to host", but the most probable is that the NFS server pod is no longer running -- which is strange, since we should not get to the cleanup code before the volume is recycled now. Assigning back to myself.

Comment 27 Tomas Smetana 2020-12-09 14:13:03 UTC
I was under the impression this was cherry-picked to 4.6... Seems like it wasn't. The 4.7 runs are OK:
https://search.ci.openshift.org/?search=Managed+cluster+should+have+no+crashlooping+recycler+pods+over+four+minutes&maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 28 Tomas Smetana 2020-12-09 14:18:38 UTC
Back to MODIFIED: this is really fixed in 4.7. We need to backport/cherry-pick it to previous releases eventually.

Comment 29 Tomas Smetana 2020-12-09 14:54:47 UTC
Backport bug: https://bugzilla.redhat.com/show_bug.cgi?id=1906011

Comment 31 Qin Ping 2020-12-14 01:56:04 UTC
Checked the last 14 days of e2e test runs and did not find the error, so marking this as verified.

Comment 34 errata-xmlrpc 2021-02-24 15:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

