Bug 2050409

Summary:	ip-reconcile job is failing consistently
Product:	OpenShift Container Platform	Reporter:	Ben Parees <bparees>
Component:	Networking	Assignee:	Douglas Smith <dosmith>
Networking sub component:	multus	QA Contact:	Weibin Liang <weliang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	wking
Version:	4.10
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	2054404		Environment:
Last Closed:	2022-08-10 10:47:09 UTC	Type:	Bug
Bug Blocks:	2054404

Description Ben Parees 2022-02-03 21:56:21 UTC

Description of problem:
As seen in this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26804/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1488985461421510656

The test that ensures we don't have alerts firing is failing because of the "KubeJobFailed" alert, which is firing because the ip-reconciler job is failing:


fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Feb  2 22:55:22.399: Unexpected alerts fired or pending during the upgrade:

alert KubeJobFailed fired for 2925 seconds with labels: {condition="true", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", job_name="ip-reconciler-27397305", namespace="openshift-multus", service="kube-state-metrics", severity="warning"}



Version-Release number of selected component (if applicable):
4.10->4.11 upgrade


How reproducible:
Showing up pretty regularly in CI:

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job



The way the alert is defined, it looks like a single failure of this job is enough to cause the alert to fire and stay firing until the failed job is removed:

https://github.com/openshift/cluster-monitoring-operator/blob/a1f93e39508feed796e30a7a944db72a79628951/assets/control-plane/prometheus-rule.yaml#L187-L192

So if there's reason to think this job is going to fail semi-regularly, perhaps we need to either reconsider the definition of that alert (work with the Monitoring team) or whitelist this particular alert in the CI test (discuss with the OTA team first, as they are strongly concerned about any firing alerts that admins will see).
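To illustrate why one failure is enough, here is a minimal Python sketch of the alert's behavior. It assumes the rule is essentially "any Job object with a failed status" and ignores the rule's pending period and label matchers; the exact expression lives in the linked prometheus-rule.yaml and is not reproduced here. The `Job` class, the function name, and the later job names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    failed: bool  # True once the Job object carries a failed status


def jobs_keeping_alert_active(jobs):
    """Names of Job objects that would keep a KubeJobFailed-style alert active.

    Assumption: the alert depends only on failed Job objects that still
    exist, not on whether later runs of the same CronJob succeeded.
    """
    return [job.name for job in jobs if job.failed]


# One failed run followed by successful runs (later names are made up).
jobs = [
    Job("ip-reconciler-27397305", failed=True),
    Job("ip-reconciler-27397320", failed=False),
    Job("ip-reconciler-27397335", failed=False),
]

# Later successes do not clear the alert.
still_firing = jobs_keeping_alert_active(jobs)

# Only removing the failed Job object clears it.
remaining = [job for job in jobs if job.name != "ip-reconciler-27397305"]
after_cleanup = jobs_keeping_alert_active(remaining)
```

Under that assumption, `still_firing` holds the one failed job and `after_cleanup` is empty, which matches the alert staying active for 2925 seconds in the run above.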


Unfortunately I wasn't able to find much about why the job failed; here's the event:

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2022-02-02T21:49:56Z",
            "involvedObject": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": "ip-reconciler-27397305",
                "namespace": "openshift-multus",
                "resourceVersion": "20074",
                "uid": "ef821118-aa3a-4d80-bc2d-07a274c058aa"
            },
            "kind": "Event",
            "lastTimestamp": "2022-02-02T21:49:56Z",
            "message": "Job has reached the specified backoff limit",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:49:56Z",
                "name": "ip-reconciler-27397305.16d0167e64971e60",
                "namespace": "openshift-multus",
                "resourceVersion": "20105",
                "uid": "c939de9f-fac2-4301-89e1-b354a335c9de"
            },
            "reason": "BackoffLimitExceeded",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "job-controller"
            },
            "type": "Warning"
        },
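For anyone digging through an events dump from a CI run, a rough Python sketch for pulling out the Warning events that belong to one Job and its pods might look like the following. It assumes the dump has already been parsed into a list of event dicts shaped like the one above; the function name and the third sample event are hypothetical, and the other two samples are trimmed from this report.

```python
def job_events(events, job_name, namespace="openshift-multus"):
    """Return Warning events for a Job and for pods named after it, oldest first."""
    matched = []
    for event in events:
        obj = event.get("involvedObject", {})
        if obj.get("namespace") != namespace:
            continue
        name = obj.get("name", "")
        # Pods created by a Job are named "<job-name>-<suffix>".
        belongs = name == job_name or name.startswith(job_name + "-")
        if belongs and event.get("type") == "Warning":
            matched.append(event)

    def when(event):
        # Some events set eventTime and leave lastTimestamp null, others the
        # reverse, so fall back from one to the other. Comparing the UTC
        # strings as text is close enough for a quick look.
        return event.get("eventTime") or event.get("lastTimestamp") or ""

    return sorted(matched, key=when)


EVENTS = [
    {"involvedObject": {"kind": "Job", "name": "ip-reconciler-27397305",
                        "namespace": "openshift-multus"},
     "eventTime": None, "lastTimestamp": "2022-02-02T21:49:56Z",
     "reason": "BackoffLimitExceeded", "type": "Warning"},
    {"involvedObject": {"kind": "Pod", "name": "ip-reconciler-27397305-v4p2q",
                        "namespace": "openshift-multus"},
     "eventTime": "2022-02-02T21:45:00.241986Z", "lastTimestamp": None,
     "reason": "FailedScheduling", "type": "Warning"},
    # Hypothetical event from a different run, included to show the filter.
    {"involvedObject": {"kind": "Job", "name": "ip-reconciler-27397320",
                        "namespace": "openshift-multus"},
     "eventTime": None, "lastTimestamp": "2022-02-02T22:00:05Z",
     "reason": "Completed", "type": "Normal"},
]

reasons = [e["reason"] for e in job_events(EVENTS, "ip-reconciler-27397305")]
```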


It looks like it may have failed scheduling:

        {
            "action": "Scheduling",
            "apiVersion": "v1",
            "eventTime": "2022-02-02T21:45:00.241986Z",
            "firstTimestamp": null,
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "ip-reconciler-27397305-v4p2q",
                "namespace": "openshift-multus",
                "resourceVersion": "12112",
                "uid": "cc6d32ea-55c3-4003-aede-c52f24ca63a7"
            },
            "kind": "Event",
            "lastTimestamp": null,
            "message": "0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:45:00Z",
                "name": "ip-reconciler-27397305-v4p2q.16d01639821b22bf",
                "namespace": "openshift-multus",
                "resourceVersion": "12114",
                "uid": "adaaf922-c176-4f3f-8903-8b72c802edf2"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "default-scheduler",
            "reportingInstance": "default-scheduler-ci-op-050xwfij-db044-zv4gx-master-0",
            "source": {},
            "type": "Warning"
        },


That might mean it failed scheduling during the upgrade (I haven't evaluated the timeline of the upgrade against this job execution), though all 3 workers should not have been down at once during the upgrade to cause that.
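The FailedScheduling message is what the scheduler reports when every candidate node carries a taint the pod does not tolerate. A minimal Python sketch of the upstream toleration-matching rule follows. The event prints only the taint's key and empty value, so the NoSchedule effect here is an assumption, as are the function names; the sketch says nothing about which tolerations the ip-reconciler pod actually had.

```python
def tolerates(toleration, taint):
    """True when a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        # An empty key is only meaningful with Exists, and then matches any taint.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations, node_taints):
    """True when every scheduling-blocking taint on the node is tolerated."""
    return all(
        any(tolerates(toleration, taint) for toleration in pod_tolerations)
        for taint in node_taints
        # PreferNoSchedule is only a preference and never blocks scheduling.
        if taint.get("effect") in ("NoSchedule", "NoExecute")
    )


# Taint key and empty value come from the event; the effect is assumed.
MASTER_TAINT = {"key": "node-role.kubernetes.io/master", "value": "",
                "effect": "NoSchedule"}

no_tolerations = schedulable([], [MASTER_TAINT])
with_toleration = schedulable(
    [{"key": "node-role.kubernetes.io/master", "operator": "Exists"}],
    [MASTER_TAINT],
)
```

With no tolerations the pod cannot land on any of the three tainted nodes, which is consistent with the "0/3 nodes are available" message.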

Comment 1 W. Trevor King 2022-02-10 19:37:15 UTC

This is completely hammering CI.  Just over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length'
301

And there's 4.10 impact too, so I'm moving the Version back to include that:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort
openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact
openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
...many, many more impacted jobs...
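As a reading aid for the output above: the percentages are consistent with "impact" being the number of matching runs divided by the total number of runs, with each figure rounded independently. The Python sketch below rebuilds two of the lines from run counts. The counts (8 failed and 3 matching out of 11, and 3 failed and 1 matching out of 8) are back-calculated from the percentages rather than taken from the search output, and the function names are hypothetical.

```python
def pct(numerator, denominator):
    """Integer percentage, rounding halves up, using integer arithmetic only."""
    return (200 * numerator + denominator) // (2 * denominator)


def summarize(runs, failed, matching):
    """Format run counts the way the search output reports them."""
    return (
        f"{runs} runs, {pct(failed, runs)}% failed, "
        f"{pct(matching, failed)}% of failures match = "
        f"{pct(matching, runs)}% impact"
    )


# 4.11 e2e-gcp-upgrade line, with counts inferred from its percentages.
gcp_upgrade = summarize(11, 8, 3)
# -> "11 runs, 73% failed, 38% of failures match = 27% impact"

# 4.10 arm64 techpreview line, with counts inferred from its percentages.
arm64_techpreview = summarize(8, 3, 1)
# -> "8 runs, 38% failed, 33% of failures match = 13% impact"
```

Note that multiplying the two rounded percentages (73% of 38%) gives roughly 28%, not 27%, which is why the sketch works from run counts instead.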

Comment 2 Douglas Smith 2022-02-14 12:18:40 UTC

I've got a PR to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped this up quickly on my own in the meantime to look at the impact).

We had a previous set of changes related to the same error, see: https://bugzilla.redhat.com/show_bug.cgi?id=2051639

However it looks as though the previous changes may have been insufficient.

Comment 5 Weibin Liang 2022-02-17 16:52:27 UTC

Cannot find any failed logs in recent e2e CI testing:

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=6h&context=1&type=bug&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 7 errata-xmlrpc 2022-08-10 10:47:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069