[1] failed a recent 4.8 update, with one of the compute nodes stuck coming back after reboot. From the JUnit:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 2h35m9s
May 27 06:34:41.899: Unexpected alerts fired or pending during the upgrade:

alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
alert ClusterOperatorDegraded fired for 2712 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="machine-config", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", reason="MachineConfigDaemonFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDegraded fired for 2742 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
alert ClusterOperatorDown fired for 3942 seconds with labels: {endpoint="metrics", instance="10.0.139.25:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-86bdc6877d-8dpfp", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-05-26-225946"}
alert KubeContainerWaiting fired for 1645 seconds with labels: {container="machine-config-daemon", namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubeContainerWaiting fired for 1645 seconds with labels: {container="oauth-proxy", namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubePodNotReady fired for 360 seconds with labels: {namespace="openshift-authentication", pod="oauth-openshift-5d96b55df8-ldqdp", severity="warning"}
alert KubePodNotReady fired for 4345 seconds with labels: {namespace="openshift-machine-config-operator", pod="machine-config-daemon-mm7gt", severity="warning"}
alert KubePodNotReady fired for 4525 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-0", severity="warning"}
alert KubeStatefulSetReplicasMismatch fired for 4525 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}

Digging into that machine-config daemon pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "machine-config-daemon-mm7gt").status.containerStatuses[]'
{
  "image": "registry.ci.openshift.org/ocp/4.8-2021-05-26-225946@sha256:26b5b45670764e9270780891b2142602ebaa8b364d6917ef28e3f898e28725d9",
  "imageID": "",
  "lastState": {},
  "name": "machine-config-daemon",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "reason": "ContainerCreating"
    }
  }
}
{
  "image": "registry.ci.openshift.org/ocp/4.8-2021-05-26-225946@sha256:b3177a6ad870f49f3bb6ce9a53344b14a4150c8c3a0711f20943501014b22f67",
  "imageID": "",
  "lastState": {},
  "name": "oauth-proxy",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "reason": "ContainerCreating"
    }
  }
}
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "machine-config-daemon-mm7gt").spec.nodeName'
ip-10-0-220-230.ec2.internal

Which is not really helpful. But we did gather kubelet logs for that node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-220-230.ec2.internal/journal | gunzip | grep -1 machine-config-daemon-mm7gt | tail -n3
May 27 06:38:19.960408 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:19.960361 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16
May 27 06:38:33.962303 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:33.962251 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16
May 27 06:38:47.962911 ip-10-0-220-230 hyperkube[1448]: E0527 06:38:47.962851 1448 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16] : Unit kubepods-burstable-pod5ac83c3f_0b16_4cf2_a3cb_f67c19cd0e16.slice already exists." pod="openshift-machine-config-operator/machine-config-daemon-mm7gt" podUID=5ac83c3f-0b16-4cf2-a3cb-f67c19cd0e16

That error message was mentioned way back in bug 1466636, but I don't see anything since, so opening a new bug about this new instance. Stuck-creating is pretty serious, but going with medium until we have a better handle on frequency.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1397752317276590080
Rough upper bound on frequency:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubePodNotReady+fired+for' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 33 runs, 27% failed, 22% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 32 runs, 38% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 32 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 31 runs, 100% failed, 10% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 38 runs, 32% failed, 17% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 38 runs, 100% failed, 34% of failures match = 34% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 15 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 16 runs, 100% failed, 44% of failures match = 44% impact
release-openshift-ocp-installer-e2e-aws-ovn-4.8 (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.8 (all) - 16 runs, 31% failed, 40% of failures match = 13% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 16 runs, 38% failed, 17% of failures match = 6% impact
release-openshift-origin-installer-launch-aws (all) - 116 runs, 65% failed, 1% of failures match = 1% impact

Certainly feels like a 4.8 regression. Stretching out to maxAge=336h doesn't change the impact numbers all that much, so this has probably been broken for at least a week or two. Raising to high based on these rates, but feel free to push back if that seems off base.
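KubePodNotReady is a fairly generic alert, so the numbers above are an upper bound. A narrower query keyed on the KubeContainerWaiting firings for the machine-config daemon would look something like this (a sketch; the search parameter is treated as a regexp, and this will still pick up unrelated machine-config-daemon waits):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+KubeContainerWaiting+fired+for.*machine-config-daemon' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort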
> Kir, could this be related to the recent runc bump to rc95?

Yes, looks like it. Looking...
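For anyone wanting to cross-check which runc a node actually has, a sketch (assuming a live cluster and oc debug access; in CI we'd have to dig the package set out of the gathered artifacts instead):

# print the runc version installed on the node host
$ oc debug node/<node-name> -- chroot /host runc --version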
Filed https://github.com/opencontainers/runc/issues/2996, reproduced.
Fix: https://github.com/opencontainers/runc/pull/2997
Backport notes:

1. Created a branch for runc: https://github.com/openshift/opencontainers-runc/tree/openshift-4.8
2. Created a backport of the runc fix: https://github.com/openshift/opencontainers-runc/pull/9, merged it.
3. Created a bump PR for 4.8: https://github.com/openshift/kubernetes/pull/790
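Once the 4.8 bump PR merges, a quick way to sanity-check that the patched runc is what gets vendored (a sketch, run from a checkout of whichever openshift/kubernetes branch PR 790 targets; the exact require/replace mechanics of the bump may differ):

# look for the runc version/replace lines in the module metadata
$ grep 'opencontainers/runc' go.mod vendor/modules.txt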
Upstream issue (just created by me): https://github.com/kubernetes/kubernetes/issues/102676
[1] is 4.8.0-0.ci-2021-06-07-101955 -> 4.8.0-0.ci-2021-06-07-155134, with:

...
alert KubeNodeNotReady fired for 2100 seconds with labels: {condition="Ready", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", node="ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn", service="kube-state-metrics", severity="warning", status="true"}
alert KubePodNotReady fired for 2040 seconds with labels: {namespace="openshift-multus", pod="network-metrics-daemon-cd52c", severity="warning"}
alert KubePodNotReady fired for 2040 seconds with labels: {namespace="openshift-network-diagnostics", pod="network-check-target-59phj", severity="warning"}
alert KubePodNotReady fired for 2040 seconds with labels: {namespace="openshift-sdn", pod="sdn-68n94", severity="warning"}
...

Looks the same as this bug:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1401930066266427392/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn/journal | gunzip | grep 'Unit .*slice already exists' | tail -n2
Jun 07 17:07:23.648904 ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn hyperkube[1440]: E0607 17:07:23.648854 1440 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 37965ca8-72fb-4fd4-83e6-51142c02904d cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod37965ca8-72fb-4fd4-83e6-51142c02904d] : Unit kubepods-burstable-pod37965ca8_72fb_4fd4_83e6_51142c02904d.slice already exists." pod="openshift-sdn/sdn-68n94" podUID=37965ca8-72fb-4fd4-83e6-51142c02904d
Jun 07 17:07:36.649006 ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn hyperkube[1440]: E0607 17:07:36.648938 1440 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 37965ca8-72fb-4fd4-83e6-51142c02904d cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod37965ca8-72fb-4fd4-83e6-51142c02904d] : Unit kubepods-burstable-pod37965ca8_72fb_4fd4_83e6_51142c02904d.slice already exists." pod="openshift-sdn/sdn-68n94" podUID=37965ca8-72fb-4fd4-83e6-51142c02904d

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1401930066266427392
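To see everything wedged on that node in one shot, the pods.json trick from the initial report works here too. A sketch (assuming the gcp-upgrade job lays out gather-extra the same way the aws job does):

# list namespace/name of pods on the node stuck with a container in ContainerCreating
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1401930066266427392/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.spec.nodeName == "ci-op-ycrp2437-17f95-8r28k-worker-c-25hzn" and any(.status.containerStatuses[]?; .state.waiting.reason == "ContainerCreating")) | .metadata.namespace + "/" + .metadata.name'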
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438