While debugging the permafailing ovn-upgrade jobs and the consistent failure of the "Cluster should remain functional during upgrade" test case, the following alerts are seen in the test log:

alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="nbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="northd", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovn-acl-logging", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovn-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovn-dbchecker", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovnkube-master", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovnkube-node", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="sbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="ovn-acl-logging", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="ovn-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="ovnkube-node", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="nbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="northd", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovn-acl-logging", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovn-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovn-dbchecker", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovnkube-master", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovnkube-node", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="sbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}

The e2e test log has them here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1442936187571408896/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/build-log.txt

From this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1442936187571408896

I'm not sure why all these pods are restarting, or whether it is expected, but after the upgrade you can see that the ovnkube pods do have a higher restart count than all other pods: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1442936187571408896/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/oc_cmds/pods
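The pattern in the alert dump above is easier to see when the alerts are grouped by pod (every container in a given pod alerts together, with the same pending duration and uid). A minimal sketch of that grouping, using a hypothetical helper script that is not part of the origin test suite, with a few of the alert lines abbreviated to the labels that vary:

```python
import re
from collections import defaultdict

# Abbreviated alert lines from the test log; the real lines carry
# additional constant labels (endpoint, job, namespace, service, ...).
alerts = [
    'alert NetworkPodsCrashLooping pending for 139.3 seconds with labels: {container="nbdb", pod="ovnkube-master-cqlxr"}',
    'alert NetworkPodsCrashLooping pending for 139.3 seconds with labels: {container="sbdb", pod="ovnkube-master-cqlxr"}',
    'alert NetworkPodsCrashLooping pending for 499.3 seconds with labels: {container="ovnkube-node", pod="ovnkube-node-j6tg2"}',
]

def group_by_pod(lines):
    """Collect the alerting container names per pod."""
    pods = defaultdict(list)
    for line in lines:
        container = re.search(r'container="([^"]+)"', line).group(1)
        pod = re.search(r'pod="([^"]+)"', line).group(1)
        pods[pod].append(container)
    return dict(pods)

print(group_by_pod(alerts))
```

Run against the full dump, this shows one alert per container of each affected ovnkube pod, which matches the per-node reboot explanation given below.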
This is expected. I performed an upgrade myself: the upgrade of ovn-k completed successfully with 0 restarts on all pods. The MCO then restarts each node individually, and I watched it go through each node to reboot. The high restart count is due to the high number of containers in the ovn-kubernetes pods: the restart count usually jumped from 0 to the pod's container count when kubelet came back up.
I recommend closing this if it isn't directly causing the test to fail, and it looks like that is the case right now. OVN-Kubernetes restarting is expected during a reboot.
> This is expected. I performed an upgrade. Upgrade of ovn-k was completed successfully with 0 restarts on all pods.
> MCO restarts each node individually. The high restart count is due to the high number of containers for the ovn-kubernetes pods.
> I saw MCO going through each node to reboot. I saw the restart count jump from 0 to the pod's container count usually when kubelet came back up.

OK, the pod restart count makes sense now, but does it correlate with the alerts as well? In other words, is it OK/expected that something like the sbdb container in an ovnkube-master pod would be marked as crashlooping for almost 10 minutes?

> I recommend closing this if it isn't directly causing the test to fail and it looks like that is the case right now.
> OVN-kubernetes restarting is expected during a reboot.

The failure is because some service was not responding for several minutes:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:161]: Sep 28 21:28:46.157: Service was unreachable during disruption for at least 2m7s of 1h17m5s (3%):

If the crashlooping alerts are expected, though, and are not part of the reason why we have an unreachable service, we can close this bug with that explanation.
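One relevant detail here: all of the alerts in the log are reported as "pending", not "firing". In Prometheus terms, an alert whose expression is true stays pending until it has been true for the rule's `for:` duration, and only then fires. A minimal sketch of that distinction; the 15-minute window is an assumption for illustration, not the actual NetworkPodsCrashLooping rule:

```python
# Sketch of Prometheus "pending" vs "firing" semantics: an alert stays
# "pending" until its expression has been true for the rule's `for:`
# duration. FOR_SECONDS is an assumed value, not the real rule's.
FOR_SECONDS = 15 * 60

def alert_state(true_for_seconds):
    """Return the alert state for an expression true this long."""
    if true_for_seconds <= 0:
        return "inactive"
    return "firing" if true_for_seconds >= FOR_SECONDS else "pending"

# The longest-pending alert in the log (~559 s) never actually fired:
print(alert_state(559.3250000476837))  # -> pending
```

Under that (assumed) window, even the ~10-minute sbdb alert would resolve before firing once the node finished rebooting, which is consistent with "expected during a reboot".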
Closing this as not a bug: the alerts are expected, as Martin pointed out. Another thread discussing the same: https://coreos.slack.com/archives/CDCP2LA9L/p1633109908134100
This error is hiding CI signal for alerts on 4.10 upgrade jobs. We need a PR to avoid reporting this in our tests.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056