2064866 – ip-reconcile job is failing consistently

Bug 2064866 - ip-reconcile job is failing consistently

Summary: ip-reconcile job is failing consistently

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Douglas Smith
QA Contact:	Weibin Liang
Docs Contact:
URL:
Whiteboard:
Depends On:	2064861
Blocks:
TreeView+	depends on / blocked

Reported:	2022-03-16 19:11 UTC by Douglas Smith
Modified:	2022-08-25 11:56 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2064861
Environment:
Last Closed:	2022-08-25 11:56:17 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Douglas Smith 2022-03-16 19:11:50 UTC

+++ This bug was initially created as a clone of Bug #2064861 +++

+++ This bug was initially created as a clone of Bug #2064860 +++

+++ This bug was initially created as a clone of Bug #2064859 +++

+++ This bug was initially created as a clone of Bug #2054404 +++

+++ This bug was initially created as a clone of Bug #2050409 +++

Description of problem:
As seen in this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26804/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1488985461421510656

The test that ensures we don't have alerts firing is failing because of the "KubeFailedJob" alert, which is firing because of the ip-reconciler job failing:


fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Feb  2 22:55:22.399: Unexpected alerts fired or pending during the upgrade:

alert KubeJobFailed fired for 2925 seconds with labels: {condition="true", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", job_name="ip-reconciler-27397305", namespace="openshift-multus", service="kube-state-metrics", severity="warning"}



Version-Release number of selected component (if applicable):
4.10->4.11 upgrade


How reproducible:
Showing up pretty regularly in CI:

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job



The way the alert is defined, it looks like a single failure of this job is enough to cause the alert to fire and stay firing until the failed job is removed:

https://github.com/openshift/cluster-monitoring-operator/blob/a1f93e39508feed796e30a7a944db72a79628951/assets/control-plane/prometheus-rule.yaml#L187-L192

So if there's reason to think this job is going to fail semi-regularly, perhaps we need to reconsider the definition of that alert(work w/ the Monitoring team) or whitelist this particular alert in the CI test(discuss w/ OTA team first, as they are strongly concerned about alerts that fire that admins will see)


Unfortunately i wasn't able to find much about why the job failed, here's the event:

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2022-02-02T21:49:56Z",
            "involvedObject": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": "ip-reconciler-27397305",
                "namespace": "openshift-multus",
                "resourceVersion": "20074",
                "uid": "ef821118-aa3a-4d80-bc2d-07a274c058aa"
            },
            "kind": "Event",
            "lastTimestamp": "2022-02-02T21:49:56Z",
            "message": "Job has reached the specified backoff limit",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:49:56Z",
                "name": "ip-reconciler-27397305.16d0167e64971e60",
                "namespace": "openshift-multus",
                "resourceVersion": "20105",
                "uid": "c939de9f-fac2-4301-89e1-b354a335c9de"
            },
            "reason": "BackoffLimitExceeded",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "job-controller"
            },
            "type": "Warning"
        },


it looks like it may have failed scheduling:

        {
            "action": "Scheduling",
            "apiVersion": "v1",
            "eventTime": "2022-02-02T21:45:00.241986Z",
            "firstTimestamp": null,
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "ip-reconciler-27397305-v4p2q",
                "namespace": "openshift-multus",
                "resourceVersion": "12112",
                "uid": "cc6d32ea-55c3-4003-aede-c52f24ca63a7"
            },
            "kind": "Event",
            "lastTimestamp": null,
            "message": "0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:45:00Z",
                "name": "ip-reconciler-27397305-v4p2q.16d01639821b22bf",
                "namespace": "openshift-multus",
                "resourceVersion": "12114",
                "uid": "adaaf922-c176-4f3f-8903-8b72c802edf2"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "default-scheduler",
            "reportingInstance": "default-scheduler-ci-op-050xwfij-db044-zv4gx-master-0",
            "source": {},
            "type": "Warning"
        },


which might mean it failed scheduling during upgrade (I haven't evaluated the timeline of the upgrade against this job execution), though all 3 workers should not have been down during upgrade to cause that.

--- Additional comment from W. Trevor King on 2022-02-10 19:37:15 UTC ---

This is completely hammering CI.  Just over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length'
301

And there's 4.10 impact too, so I'm moving the Version back to include that:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort
openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact
openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
...many, many more impacted jobs...

--- Additional comment from Douglas Smith on 2022-02-14 12:18:40 UTC ---

I've got a PR to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped this up quick independently in the meanwhile to look at impact.

We had a previous set of changes related to the same error, see: https://bugzilla.redhat.com/show_bug.cgi?id=2051639

However it looks as though the previous changes may have been insufficient.

Note You need to log in before you can comment on or make changes to this bug.