Bug 2009078 - NetworkPodsCrashLooping alerts in upgrade CI jobs
Summary: NetworkPodsCrashLooping alerts in upgrade CI jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Nadia Pinaeva
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-29 21:11 UTC by jamo luhrsen
Modified: 2022-03-10 16:14 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:13:56 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-network-operator pull 1212 (open): Bug 2009078: Remove NetworkPodsCrashLooping alert for ovn-kubernetes (last updated 2021-11-02 10:23:26 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:14:21 UTC)

Description jamo luhrsen 2021-09-29 21:11:22 UTC
While debugging the permafailing ovn-upgrade jobs and the consistent failure of
the "Cluster should remain functional during upgrade" test case, the following alerts
are seen in the test log:

alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="nbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="northd", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovn-acl-logging", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovn-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovn-dbchecker", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovnkube-master", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="ovnkube-node", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-ncjv9", service="kube-state-metrics", severity="warning", uid="05bdb346-82bf-4e88-9c02-eda29e5c19bd"}
alert NetworkPodsCrashLooping pending for 139.32500004768372 seconds with labels: {container="sbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-cqlxr", service="kube-state-metrics", severity="warning", uid="ccdb0462-0ead-4cd1-b234-c409919576f0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="ovn-acl-logging", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="ovn-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 499.3250000476837 seconds with labels: {container="ovnkube-node", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-j6tg2", service="kube-state-metrics", severity="warning", uid="2b560bae-b242-4d24-8171-f0fe02001cb0"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="kube-rbac-proxy", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="nbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="northd", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovn-acl-logging", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovn-controller", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovn-dbchecker", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovnkube-master", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="ovnkube-node", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-node-rq9k6", service="kube-state-metrics", severity="warning", uid="f4333f0f-6cb0-4949-a633-7adfe3c00576"}
alert NetworkPodsCrashLooping pending for 559.3250000476837 seconds with labels: {container="sbdb", endpoint="https-main", job="kube-state-metrics", namespace="openshift-ovn-kubernetes", pod="ovnkube-master-bldnj", service="kube-state-metrics", severity="warning", uid="08747776-008f-473f-a6a6-78b4719fec31"}


The e2e test log has them here:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1442936187571408896/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/build-log.txt

From this job:
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1442936187571408896


I'm not sure why all these pods are restarting, or whether it's expected, but after the upgrade you can see that the ovnkube pods
do have a higher restart count than all other pods:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1442936187571408896/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/oc_cmds/pods
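
For reference, that gather-extra artifact appears to be captured pod listings; on a live cluster the same restart counts can be checked with something like the commands below (illustrative, the RESTARTS column is what matters):

  # restart counts for the ovn-kubernetes pods (RESTARTS column)
  oc get pods -n openshift-ovn-kubernetes -o wide
  # compare against the rest of the cluster
  oc get pods --all-namespaces -o wide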

Comment 1 Martin Kennelly 2021-09-30 13:09:33 UTC
This is expected. I performed an upgrade. Upgrade of ovn-k was completed successfully with 0 restarts on all pods.
MCO restarts each node individually. The high restart count is due to the high number of containers for the ovn-kubernetes pods.
I saw MCO going through each node to reboot. I saw the restart count jump from 0 to the pod's container count usually when kubelet came back up.
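
For context, that restart pattern should also be visible in the metric the alert labels point at (job="kube-state-metrics" with container/pod/uid labels suggests kube_pod_container_status_restarts_total). A rough PromQL sketch, with an illustrative window:

  # per-container restart increases for ovn-kubernetes pods over the last 15 minutes;
  # during an MCO-driven node reboot every container on that node shows a restart at roughly the same time
  increase(kube_pod_container_status_restarts_total{namespace="openshift-ovn-kubernetes"}[15m])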

Comment 2 Martin Kennelly 2021-09-30 14:05:12 UTC
I recommend closing this if it isn't directly causing the test to fail and it looks like that is the case right now.
OVN-kubernetes restarting is expected during a reboot.

Comment 3 jamo luhrsen 2021-09-30 20:31:51 UTC
> This is expected. I performed an upgrade. Upgrade of ovn-k was completed successfully with 0 restarts on all pods.
> MCO restarts each node individually. The high restart count is due to the high number of containers for the ovn-kubernetes pods.
> I saw MCO going through each node to reboot. I saw the restart count jump from 0 to the pod's container count usually when kubelet came back up.

ok, the pod restart count makes sense now, but does it correlate to the alerts as well? In other words, is it ok/expected
that something like the sbdb container in an ovnkube-master pod would be marked as crashlooping for almost 10 minutes?

> I recommend closing this if it isn't directly causing the test to fail and it looks like that is the case right now.
> OVN-kubernetes restarting is expected during a reboot.

the failure is because some service was not responding for several minutes:
  fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:161]: Sep 28 21:28:46.157: Service was unreachable during disruption for at least 2m7s of 1h17m5s (3%):

if the crashlooping alerts are expected though, and not part of the reason why we have
an unreachable service, we can close this bug with that explanation.

Comment 4 jamo luhrsen 2021-10-01 20:06:07 UTC
Closing this as not a bug; the alerts are expected, as Martin pointed out. Another thread discussing the same issue:
  https://coreos.slack.com/archives/CDCP2LA9L/p1633109908134100

Comment 5 David Eads 2021-10-05 13:13:02 UTC
This error is hiding CI signal for alerts on 4.10 upgrade jobs. We need a PR to avoid reporting this in our tests.
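
The cluster-network-operator PR in the Links section above takes the approach of removing the NetworkPodsCrashLooping alert for ovn-kubernetes. As a rough sketch only (the real rule lives in cluster-network-operator's alert manifests; the expression and "for" duration below are assumptions, not the operator's actual definition), the rule being dropped has roughly this PrometheusRule shape:

  # hypothetical shape of the alert; the warning severity matches the alert labels in the logs above
  - alert: NetworkPodsCrashLooping
    expr: increase(kube_pod_container_status_restarts_total{namespace="openshift-ovn-kubernetes"}[15m]) > 0
    for: 15m
    labels:
      severity: warning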

Comment 11 errata-xmlrpc 2022-03-10 16:13:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

