A possible static pod racing issue is once again observed, this time with kube-controller-manager: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26831/pull-ci-openshift-origin-master-e2e-aws-fips/1491798226607542272

TIMELINE OF EVENTS:
0. "Kubelet version" kubeletVersion="v1.23.3+f14faf2" - contains rphillips' fix for static pods
1. kcm revision/7 node/215 installer (normal pod) started at 16:22:03Z, finished at 16:22:57Z
2. kcm revision/7 node/215 static pod file created before 16:22:57Z; pod in API created at 16:23:10Z, kcm container started at 16:23:10Z
3. kcm-o observes node/215 as available at 16:23:30Z
4. kcm revision/7 node/177 installer (normal pod) started at 16:23:48Z, finished at 16:24:21Z
5. kcm revision/7 node/177 static pod file created before 16:24:21Z; pod in API still not updated by 16:42:35Z. No pod was ever created for node/177 revision 7.

This is a serious issue: any customer whose upgrade hits this race will get stuck.
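For anyone reproducing the triage, the wedge is visible purely from the API side: the revision/7 manifest is on disk on node/177, but the mirror pod for it never shows up. Below is a minimal client-go sketch of that check, assuming the usual `kube-controller-manager-<nodeName>` mirror pod naming; the `revision` label lookup is an assumption about how the rendered static pod manifest is labeled, so adjust it to whatever the manifest actually carries.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Poll the API for the kube-controller-manager mirror pod on a given node and
// report which revision (if any) it is running. In the failure above this
// never converges for node ip-10-0-191-177: no revision/7 pod ever appears.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const (
		ns       = "openshift-kube-controller-manager"
		nodeName = "ip-10-0-191-177.ec2.internal" // node/177 from the timeline
		podName  = "kube-controller-manager-" + nodeName
		want     = "7" // revision the installer rolled out
	)

	for i := 0; i < 60; i++ {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), podName, metav1.GetOptions{})
		switch {
		case err != nil:
			fmt.Printf("mirror pod not found yet: %v\n", err)
		case pod.Labels["revision"] == want: // assumed label; verify against the manifest
			fmt.Printf("node %s is running revision %s (phase %s)\n", nodeName, want, pod.Status.Phase)
			return
		default:
			fmt.Printf("mirror pod still at revision %q\n", pod.Labels["revision"])
		}
		time.Sleep(10 * time.Second)
	}
	fmt.Println("gave up waiting; revision rollout appears wedged")
}
```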
This race is not self-resolving. We have four static pod operators: etcd, kube-apiserver, kube-controller-manager, and kube-scheduler. If they wedge in this manner during an upgrade, they will not recover on their own, and the upgrade fails. Beyond failed upgrades, there are cases like certificate and trust rotation: if those revisions wedge and do not unstick, the cluster loses trust between internal components, resulting in cases like kubelets being unable to communicate with the kube-apiserver, which causes workloads to fail.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it has always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Looks like the static pod never came up, similar to previous bug 2040533.

Feb 10 16:24:21.445590 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.445488 1699 kubelet.go:2096] "SyncLoop REMOVE" source="file" pods=[openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal]
Feb 10 16:24:21.446157 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.445820 1699 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal" podUID=63909fc6133bbfa6049558bdad60a428 containerName="kube-controller-manager-cert-syncer" containerID="cri-o://872e4d7de952818350a46b0a8406dd16bc6dc6662bd2e3e8a78365bf34cefaa8" gracePeriod=30
Feb 10 16:24:21.446157 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.445835 1699 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal" podUID=63909fc6133bbfa6049558bdad60a428 containerName="kube-controller-manager" containerID="cri-o://556073693d5ba8577cb4a34283625e0379796ec96d55df2237efed1fd7568f23" gracePeriod=30
Feb 10 16:24:21.446157 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.445834 1699 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal" podUID=63909fc6133bbfa6049558bdad60a428 containerName="kube-controller-manager-recovery-controller" containerID="cri-o://7f059db2c67d497b2fbc00034dd91ff93e2b0ffc0b425b92cdd0e4996064173d" gracePeriod=30
Feb 10 16:24:21.446157 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.445918 1699 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal" podUID=63909fc6133bbfa6049558bdad60a428 containerName="cluster-policy-controller" containerID="cri-o://24162b3a61ea8cf2e3a266995bf88fcf726b918c8f8630cd5719e4d441bc943b" gracePeriod=30
Feb 10 16:24:21.446575 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.446397 1699 kubelet.go:2086] "SyncLoop ADD" source="file" pods=[openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal]
Feb 10 16:24:21.530283 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.530246 1699 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"cert-dir\" (UniqueName: \"kubernetes.io/host-path/2e4bf7cd59f497640222976da89b4026-cert-dir\") pod \"kube-controller-manager-ip-10-0-191-177.ec2.internal\" (UID: \"2e4bf7cd59f497640222976da89b4026\") " pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.530439 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.530323 1699 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"resource-dir\" (UniqueName: \"kubernetes.io/host-path/2e4bf7cd59f497640222976da89b4026-resource-dir\") pod \"kube-controller-manager-ip-10-0-191-177.ec2.internal\" (UID: \"2e4bf7cd59f497640222976da89b4026\") " pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.631636 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.631606 1699 reconciler.go:258] "operationExecutor.MountVolume started for volume \"cert-dir\" (UniqueName: \"kubernetes.io/host-path/2e4bf7cd59f497640222976da89b4026-cert-dir\") pod \"kube-controller-manager-ip-10-0-191-177.ec2.internal\" (UID: \"2e4bf7cd59f497640222976da89b4026\") " pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.631742 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.631680 1699 reconciler.go:258] "operationExecutor.MountVolume started for volume \"resource-dir\" (UniqueName: \"kubernetes.io/host-path/2e4bf7cd59f497640222976da89b4026-resource-dir\") pod \"kube-controller-manager-ip-10-0-191-177.ec2.internal\" (UID: \"2e4bf7cd59f497640222976da89b4026\") " pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.631842 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.631610 1699 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"cert-dir\" (UniqueName: \"kubernetes.io/host-path/2e4bf7cd59f497640222976da89b4026-cert-dir\") pod \"kube-controller-manager-ip-10-0-191-177.ec2.internal\" (UID: \"2e4bf7cd59f497640222976da89b4026\") " pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.631874 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.631859 1699 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"resource-dir\" (UniqueName: \"kubernetes.io/host-path/2e4bf7cd59f497640222976da89b4026-resource-dir\") pod \"kube-controller-manager-ip-10-0-191-177.ec2.internal\" (UID: \"2e4bf7cd59f497640222976da89b4026\") " pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.777293 ip-10-0-191-177 crio[1666]: time="2022-02-10 16:24:21.777253691Z" level=info msg="Stopped container 7f059db2c67d497b2fbc00034dd91ff93e2b0ffc0b425b92cdd0e4996064173d: openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal/kube-controller-manager-recovery-controller" id=e913f7e3-7ffd-4c86-94da-410c719bbcfd name=/runtime.v1.RuntimeService/StopContainer
Feb 10 16:24:21.785877 ip-10-0-191-177 crio[1666]: time="2022-02-10 16:24:21.785840209Z" level=info msg="Stopped container 872e4d7de952818350a46b0a8406dd16bc6dc6662bd2e3e8a78365bf34cefaa8: openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal/kube-controller-manager-cert-syncer" id=752d424d-5cb5-4375-bc55-05af8848a3ca name=/runtime.v1.RuntimeService/StopContainer
Feb 10 16:24:21.804655 ip-10-0-191-177 crio[1666]: time="2022-02-10 16:24:21.804621711Z" level=info msg="Stopped container 24162b3a61ea8cf2e3a266995bf88fcf726b918c8f8630cd5719e4d441bc943b: openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal/cluster-policy-controller" id=0c1c4898-7304-4964-b4c1-ca3126046634 name=/runtime.v1.RuntimeService/StopContainer
Feb 10 16:24:21.806952 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.806881 1699 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-191-177.ec2.internal_63909fc6133bbfa6049558bdad60a428/cluster-policy-controller/2.log"
Feb 10 16:24:21.810502 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.809671 1699 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-191-177.ec2.internal_63909fc6133bbfa6049558bdad60a428/kube-controller-manager-cert-syncer/0.log"
Feb 10 16:24:21.811298 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.811095 1699 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-191-177.ec2.internal_63909fc6133bbfa6049558bdad60a428/kube-controller-manager/0.log"
Feb 10 16:24:21.839956 ip-10-0-191-177 crio[1666]: time="2022-02-10 16:24:21.839916503Z" level=info msg="Removed container f58970829953d238191f0d3fd4e7527ee388b276979f433ba701f0d11cd6570c: openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal/cluster-policy-controller" id=3b4e74be-f97f-4d87-929c-872b4691f9ca name=/runtime.v1.RuntimeService/RemoveContainer
Feb 10 16:24:21.861885 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:21.861859 1699 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-191-177.ec2.internal_63909fc6133bbfa6049558bdad60a428/kube-controller-manager-cert-syncer/0.log"
Feb 10 16:24:21.862910 ip-10-0-191-177 hyperkube[1699]: E0210 16:24:21.862162 1699 kuberuntime_manager.go:1072] "getPodContainerStatuses for pod failed" err="rpc error: code = NotFound desc = could not find container \"826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3\": specified container not found: 826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal"
Feb 10 16:24:21.862910 ip-10-0-191-177 hyperkube[1699]: E0210 16:24:21.862185 1699 kubelet.go:1833] "Unable to read pod status prior to final pod termination" err="rpc error: code = NotFound desc = could not find container \"826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3\": specified container not found: 826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal" podUID=63909fc6133bbfa6049558bdad60a428
Feb 10 16:24:21.862910 ip-10-0-191-177 hyperkube[1699]: E0210 16:24:21.862204 1699 pod_workers.go:949] "Error syncing pod, skipping" err="rpc error: code = NotFound desc = could not find container \"826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3\": specified container not found: 826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal" podUID=63909fc6133bbfa6049558bdad60a428
Feb 10 16:24:21.876585 ip-10-0-191-177 crio[1666]: time="2022-02-10 16:24:21.876550672Z" level=info msg="Removed container 826a938adc6c3cd84a295db53b4e7910d69f127f9f0b71dfdc06f41ea9d3bdd3: openshift-kube-controller-manager/kube-controller-manager-ip-10-0-191-177.ec2.internal/kube-controller-manager" id=a6667f10-0a5e-453d-95e7-9e0cd9e7bb0c name=/runtime.v1.RuntimeService/RemoveContainer
Feb 10 16:24:22.815270 ip-10-0-191-177 hyperkube[1699]: I0210 16:24:22.815244 1699 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-191-177.ec2.internal_63909fc6133bbfa6049558bdad60a428/kube-controller-manager-cert-syncer/0.log"

Kubelet version is:

Feb 10 16:09:29.721959 ip-10-0-191-177 hyperkube[1699]: I0210 16:09:29.721953 1699 server.go:446] "Kubelet version" kubeletVersion="v1.23.3+f14faf2"

Have we confirmed that this SHA has the patch?
Yes, see https://github.com/openshift/kubernetes/commit/f14faf24a932226615d1498350e79296daa3c1b4
You will need kubelet logging at -v=4 to triage this further. I thought our CI environments were historically set to v=4, but if they are not, we will need to bump them back to v=4 to see where things hung. I opened https://github.com/openshift/kubernetes/pull/1175, which tries to force v=4 (it may not work) and adds logging to allowPodStart. It would be nice to get back to v=4 in CI by default. If someone can poke 1175 until the issue recreates and let me know, that will help.
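For context, the point of the extra logging is that today the static pod simply goes silent when the start gate refuses it. A rough sketch of the kind of instrumentation that helps, gated behind -v=4 with klog; this is not the actual content of PR 1175, and the helper name and call site are assumptions.

```go
// Package kubelet: illustrative instrumentation only.
package kubelet

import "k8s.io/klog/v2"

// logAllowPodStartDecision would be called from the static pod start gate so
// that, at -v=4, the journal records why a static pod was (or was not) allowed
// to start, including which UID is blocking it, instead of leaving no trace.
func logAllowPodStartDecision(fullName, uid string, allowed bool, blockingUID string) {
	if allowed {
		klog.V(4).InfoS("Static pod allowed to start", "podFullName", fullName, "podUID", uid)
		return
	}
	klog.V(4).InfoS("Static pod not yet allowed to start",
		"podFullName", fullName, "podUID", uid, "waitingOnUID", blockingUID)
}
```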
Bug 2040533 has UpgradeBlocker since Harshal added it on 2022-02-14, so no need to track blocker-ness over here.