In https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25787/pull-ci-openshift-origin-master-e2e-aws-fips/1348717863535382528 the pod pod-submit-status-0-0 is created and is still pending 5m later (other pods created later are created and fail). There are no obvious failure events that would explain why the pod is not started. This test creates and starts lots of pods in bulk; the fact that this run failed twice in a row indicates the failure is fairly reproducible. Looking back over the last 14 days of jobs in master, about 2% of failed runs include this failure, which suggests we may be failing to start pods quickly around 2% of the time. Marking urgent because pods not starting within a reasonable period is certainly a release blocker.
This issue is a race where the pod is created and then deleted before the container has fully started. The relevant kubelet and CRI-O log entries:
Jan 11 21:04:47.627101 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.627056 1733 kubelet.go:1895] SyncLoop (ADD, "api"): "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"
Jan 11 21:04:47.703426 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.703262 1733 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "pod-submit-status-0-0" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")
Jan 11 21:04:47.804122 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.803923 1733 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "pod-submit-status-0-0" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")
Jan 11 21:04:47.808769 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.808731 1733 operation_generator.go:672] MountVolume.SetUp succeeded for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "pod-submit-status-0-0" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")
Jan 11 21:04:47.950956 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:47.950805 1733 kuberuntime_manager.go:439] No sandbox for pod "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" can be found. Need to start a new one
Jan 11 21:04:48.815520 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:48.812665 1733 kubelet.go:1911] SyncLoop (DELETE, "api"): "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"
Jan 11 21:04:49.025532 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.024632 1733 reconciler.go:196] operationExecutor.UnmountVolume started for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") pod "c76833c3-6e57-4135-8968-2ed47a665cf8" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8")
Jan 11 21:04:49.181520 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.175663 1733 operation_generator.go:797] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8" (OuterVolumeSpecName: "default-token-29fj8") pod "c76833c3-6e57-4135-8968-2ed47a665cf8" (UID: "c76833c3-6e57-4135-8968-2ed47a665cf8"). InnerVolumeSpecName "default-token-29fj8". PluginName "
Jan 11 21:04:49.278204 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.278115 1733 reconciler.go:319] Volume detached for volume "default-token-29fj8" (UniqueName: "kubernetes.io/secret/c76833c3-6e57-4135-8968-2ed47a665cf8-default-token-29fj8") on node "ip-10-0-166-126.ec2.internal" DevicePath ""
Jan 11 21:04:49.475544 ip-10-0-166-126 crio[1696]: 2021-01-11T21:04:49Z [verbose] Add: e2e-pods-3370:pod-submit-status-0-0:c76833c3-6e57-4135-8968-2ed47a665cf8:(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/ca245555-8ee4-42c9-860c-011cc8cb612b"}],"ips":[{"version":"4","interface":0,"address":"10.128.2.55/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.128.2.1"},{"dst":"224.0.0.0
Jan 11 21:04:49.475544 ip-10-0-166-126 crio[1696]: I0111 21:04:49.389759 246192 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-pods-3370", Name:"pod-submit-status-0-0", UID:"c76833c3-6e57-4135-8968-2ed47a665cf8", APIVersion:"v1", ResourceVersion:"181788", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.2.55/23]
Jan 11 21:04:49.664620 ip-10-0-166-126 hyperkube[1733]: I0111 21:04:49.664296 1733 kubelet.go:1911] SyncLoop (DELETE, "api"): "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"
Jan 11 21:04:52.326687 ip-10-0-166-126 hyperkube[1733]: W0111 21:04:52.325987 1733 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable podc76833c3-6e57-4135-8968-2ed47a665cf8] : unable to destroy cgroup paths for cgroup [kubepods burstable podc76833c3-6e57-4135-8968-2ed47a665cf8] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-
Jan 11 21:04:53.309527 ip-10-0-166-126 crio[1696]: 2021-01-11T21:04:51Z [verbose] Del: e2e-pods-3370:pod-submit-status-0-0:c76833c3-6e57-4135-8968-2ed47a665cf8:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Jan 11 21:04:53.478638 ip-10-0-166-126 crio[1696]: time="2021-01-11 21:04:53.478493634Z" level=info msg="RunSandbox: releasing container name: k8s_POD_pod-submit-status-0-0_e2e-pods-3370_c76833c3-6e57-4135-8968-2ed47a665cf8_0" id=392e4627-a43c-4c0b-8fbc-3126a016d104 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 11 21:04:53.478638 ip-10-0-166-126 crio[1696]: time="2021-01-11 21:04:53.478590851Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_pod-submit-status-0-0_e2e-pods-3370_c76833c3-6e57-4135-8968-2ed47a665cf8_0" id=392e4627-a43c-4c0b-8fbc-3126a016d104 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 11 21:04:53.482614 ip-10-0-166-126 hyperkube[1733]: E0111 21:04:53.479356 1733 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Jan 11 21:04:53.482614 ip-10-0-166-126 hyperkube[1733]: E0111 21:04:53.479381 1733 kuberuntime_manager.go:755] createPodSandbox for pod "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Jan 11 21:04:53.482614 ip-10-0-166-126 hyperkube[1733]: E0111 21:04:53.479452 1733 pod_workers.go:191] Error syncing pod c76833c3-6e57-4135-8968-2ed47a665cf8 ("pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)"), skipping: failed to "CreatePodSandbox" for "pod-submit-status-0-0_e2e-pods-3370(c76833c3-6e57-4135-8968-2ed47a665cf8)" with CreatePodSandboxError: "CreatePodSandbox for pod
https://github.com/openshift/origin/blob/e49fe16bd3eb5a177d04e59739bc91864d608d1c/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L311 Intn() generates a number in [0, 2000) that is used as a delay in milliseconds. The lower the number, the more likely we are to hit the race where the pod attempts to start and the kubelet then tears it down. Perhaps we could add a random range between 2-4 seconds, as sketched below? Harshal: Could you create an upstream issue for this as well (if one doesn't already exist)?
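For illustration, a minimal sketch of the delay calculation and the proposed 2-4 second range. This is not the verbatim upstream test code; the 2000 ms bound comes from the comment above and the variable names are made up:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Current behavior described above: a delay drawn from [0ms, 2000ms),
	// so the delete can land almost immediately after the create, which is
	// exactly what exercises the create-then-delete race in the kubelet.
	currentDelay := time.Duration(rand.Intn(2000)) * time.Millisecond

	// Suggested adjustment: keep the randomness but add a 2s floor so the
	// delete always lands 2-4s after the create.
	proposedDelay := 2*time.Second + time.Duration(rand.Intn(2000))*time.Millisecond

	fmt.Println(currentDelay, proposedDelay)
}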
Just filed an upstream issue - https://github.com/kubernetes/kubernetes/issues/98142
*** Bug 1883991 has been marked as a duplicate of this bug. ***
*** Bug 1912880 has been marked as a duplicate of this bug. ***
Release impact: customers who create and delete pods in job environments (batch, data processing), have fast-running or looping pods, or perform a deployment and then immediately roll out a new deployment could see substantial (>5m) delays to their apps, preventing new versions from rolling out. The test deliberately abuses this path because any failure here can have significant impact during upgrades on applications trying to remain available, causing us to violate our own or our customers' SLOs for applications. Any bug in starting pods is always a release blocker because running pods is our one job.
https://search.ci.openshift.org/?search=Pod+Container+Status+should+never+report+success+for+a+pending+container+&maxAge=48h&context=1&type=junit&name=4%5C.7&maxMatches=5&maxBytes=20971520&groupBy=job In the last 2 days, 11% of all failed 4.7 CI jobs reported this as one of their errors, so as we fix other flakes this is climbing toward the top of the blocker list.
Making it a blocker for the release. Please share if there are any mitigating factors for this bug.
This bug and the underlying race condition are complex, and we are still looking into it. I'm reviewing https://github.com/kubernetes/kubernetes/pull/89155 to see if it might help mitigate this issue, as it touches the sync loop that we think is the underlying problem. If it does, I will pick up the PR.
Filling in the environment field with the related CI job. The original CI bug was bug #1887857, closed as a dup of bug #1883991, which was closed as a dup of this one.
*** Bug 1908378 has been marked as a duplicate of this bug. ***
Reproduced bug 1912880 on 4.7.0-0.nightly-2021-01-27-110023. This is blocking the ability to run a cluster with a reasonable user workload for > 1 day. Marking this as a TestBlocker (same as bug 1912880).
*** Bug 1898614 has been marked as a duplicate of this bug. ***
How to detect the issue:
1) Check pod metadata. Look at the creation and deletion timestamps. If they are close together (e.g. <1m apart) and the pod is stuck in termination, you have likely hit this bug. The pods may be in "Pending" state. Example:
   creationTimestamp: "2021-01-29T19:50:57Z"
   deletionGracePeriodSeconds: 30
   deletionTimestamp: "2021-01-29T19:51:27Z"
2) Check kubelet logs. You will likely see entries similar to the ones above or on the linked BZs: https://bugzilla.redhat.com/show_bug.cgi?id=1915085#c1 You may also see that the kubelet main SyncLoop calls "DELETE" on the pod but never calls a final "REMOVE".

How to mitigate the issue when present: force terminate the affected pods. This should be low risk once it is confirmed that the pod profile matches the criteria above in "how to detect the issue". When stateless pods controlled by e.g. replicasets or deployments are spun up and then quickly killed, it should be fine to force remove them. There is a risk of running into this scenario: https://bugzilla.redhat.com/show_bug.cgi?id=1903228 The biggest risk with force removal is when the pod is long-running and has a long terminationGracePeriodSeconds set, or is a statefulset pod. Check for these conditions before removal. We do not believe these sorts of pods are affected by this bug.
"Force terminate pod" meaning running "oc delete pod --force podname"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1934691 has been marked as a duplicate of this bug. ***
*** Bug 1946418 has been marked as a duplicate of this bug. ***
*** Bug 1914022 has been marked as a duplicate of this bug. ***
I still experience this in recent 4.7.x clusters:
  deletionTimestamp: '2021-10-28T16:18:09Z'
  resourceVersion: '1494521179'
  name: deployment-deployment-7f5d48bdd7-pb69f
  uid: 7d94e616-c460-494e-bb3b-72ee803b609f
  deletionGracePeriodSeconds: 1
  creationTimestamp: '2021-10-28T16:18:03Z'
@david.karlsen can you confirm if the kubelet logs also match the symptoms as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1915085#c22 ?
I find this in messages.log:
Oct 29 23:53:20 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:20.706492067+02:00" level=info msg="Running pod sandbox: kuberhealthy/deployment-deployment-5468b65549-6g989/POD" id=8e43e8fc-bc05-4929-83de-656559afbf50 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Oct 29 23:53:21 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:21.533273834+02:00" level=info msg="Got pod network &{Name:deployment-deployment-5468b65549-6g989 Namespace:kuberhealthy ID:2dcc07a20c6ccc2e20fdc7dcc982912a95d4c0af34b4c9734b154a4c51193c24 UID:6aaaa5ea-fcfc-404a-91dd-408a33080105 NetNS:/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Oct 29 23:53:21 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:21.533787417+02:00" level=info msg="Adding pod kuberhealthy_deployment-deployment-5468b65549-6g989 to CNI network \"multus-cni-network\" (type=multus)"
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: 2021-10-29T23:53:23+02:00 [verbose] Add: kuberhealthy:deployment-deployment-5468b65549-6g989:6aaaa5ea-fcfc-404a-91dd-408a33080105:openshift-sdn(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0"}],"ips":[{"version":"4","interface":0,"address":"10.200.7.98/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.200.6.1"},{"dst":"224.0.0.0/4"},{"dst":"10.200.0.0/16"}],"dns":{}}
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: I1029 23:53:23.999505 1212108 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kuberhealthy", Name:"deployment-deployment-5468b65549-6g989", UID:"6aaaa5ea-fcfc-404a-91dd-408a33080105", APIVersion:"v1", ResourceVersion:"916476433", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.200.7.98/23] from openshift-sdn
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:24.122058234+02:00" level=info msg="Got pod network &{Name:deployment-deployment-5468b65549-6g989 Namespace:kuberhealthy ID:2dcc07a20c6ccc2e20fdc7dcc982912a95d4c0af34b4c9734b154a4c51193c24 UID:6aaaa5ea-fcfc-404a-91dd-408a33080105 NetNS:/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Oct 29 23:53:24 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:24.122942532+02:00" level=info msg="Checking pod kuberhealthy_deployment-deployment-5468b65549-6g989 for CNI network multus-cni-network (type=multus)"
Oct 29 23:53:26 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:26.810628606+02:00" level=info msg="Got pod network &{Name:deployment-deployment-5468b65549-6g989 Namespace:kuberhealthy ID:2dcc07a20c6ccc2e20fdc7dcc982912a95d4c0af34b4c9734b154a4c51193c24 UID:6aaaa5ea-fcfc-404a-91dd-408a33080105 NetNS:/var/run/netns/d54964b5-85ad-40e2-a731-eed553ea81b0 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Oct 29 23:53:26 alt-ksx-r-c01oco03 crio: time="2021-10-29 23:53:26.810973273+02:00" level=info msg="Deleting pod kuberhealthy_deployment-deployment-5468b65549-6g989 from CNI network \"multus-cni-network\" (type=multus)"
Oct 29 23:53:27 alt-ksx-r-c01oco03 crio: 2021-10-29T23:53:26+02:00 [verbose] Del: kuberhealthy:deployment-deployment-5468b65549-6g989:6aaaa5ea-fcfc-404a-91dd-408a33080105:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Oct 29 23:54:33 alt-ksx-r-c01oco03 hyperkube: E1029 23:54:33.433190 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:54:59 alt-ksx-r-c01oco03 hyperkube: E1029 23:54:59.767814 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:56:34 alt-ksx-r-c01oco03 hyperkube: E1029 23:56:34.487743 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:57:04 alt-ksx-r-c01oco03 hyperkube: E1029 23:57:04.368029 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:57:30 alt-ksx-r-c01oco03 hyperkube: E1029 23:57:30.978273 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 29 23:59:04 alt-ksx-r-c01oco03 hyperkube: E1029 23:59:04.782507 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 30 00:01:34 alt-ksx-r-c01oco03 hyperkube: E1030 00:01:34.186534 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 30 00:05:07 alt-ksx-r-c01oco03 hyperkube: E1030 00:05:07.083424 3127 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kuberhealthy_deployment-deployment-5468b65549-6g989_6aaaa5ea-fcfc-404a-91dd-408a33080105: no such file or directory" pod="kuberhealthy/deployment-deployment-5468b65549-6g989"
Oct 30 00:06:05 alt-ksx-r-c01oco03 hyperkube: E
@david.karlsen the logs you've provided do not contain any SyncLoop logs that match my comment so I can't confirm. It is possible that you are encountering similar symptoms with a different underlying root cause. If you can attach a full kubelet log at verbosity v=4 + the full pod statuses as obtained with `oc get pod -o yaml` I will be able to tell for sure.
Per comment 28, the linked pull requests shipped with 4.7.0, and it's important to track that and not re-open CLOSED ERRATA bugs. If you are running a later release and see similar issues, please open a new bug, from which you can refer back to this one with "...similar symptoms as bug 1915085" or similar, if that seems appropriate to your new bug.
*** Bug 2063138 has been marked as a duplicate of this bug. ***