Bug 2064860

Summary:	ip-reconcile job is failing consistently
Product:	OpenShift Container Platform	Reporter:	Douglas Smith <dosmith>
Component:	Networking	Assignee:	Douglas Smith <dosmith>
Networking sub component:	multus	QA Contact:	Weibin Liang <weliang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	weliang, wking
Version:	4.10
Target Milestone:	---
Target Release:	4.8.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	2064859
Clones:	2064861 (view as bug list)		Environment:
Last Closed:	2022-09-14 20:38:55 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2064859
Bug Blocks:	2064861

Description Douglas Smith 2022-03-16 19:08:22 UTC

+++ This bug was initially created as a clone of Bug #2064859 +++

+++ This bug was initially created as a clone of Bug #2054404 +++

+++ This bug was initially created as a clone of Bug #2050409 +++

Description of problem:
As seen in this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26804/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1488985461421510656

The test that ensures we don't have alerts firing is failing because of the "KubeFailedJob" alert, which is firing because of the ip-reconciler job failing:


fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Feb  2 22:55:22.399: Unexpected alerts fired or pending during the upgrade:

alert KubeJobFailed fired for 2925 seconds with labels: {condition="true", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", job_name="ip-reconciler-27397305", namespace="openshift-multus", service="kube-state-metrics", severity="warning"}



Version-Release number of selected component (if applicable):
4.10->4.11 upgrade


How reproducible:
Showing up pretty regularly in CI:

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job



The way the alert is defined, it looks like a single failure of this job is enough to cause the alert to fire and stay firing until the failed job is removed:

https://github.com/openshift/cluster-monitoring-operator/blob/a1f93e39508feed796e30a7a944db72a79628951/assets/control-plane/prometheus-rule.yaml#L187-L192

So if there's reason to think this job is going to fail semi-regularly, perhaps we need to reconsider the definition of that alert(work w/ the Monitoring team) or whitelist this particular alert in the CI test(discuss w/ OTA team first, as they are strongly concerned about alerts that fire that admins will see)


Unfortunately i wasn't able to find much about why the job failed, here's the event:

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2022-02-02T21:49:56Z",
            "involvedObject": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": "ip-reconciler-27397305",
                "namespace": "openshift-multus",
                "resourceVersion": "20074",
                "uid": "ef821118-aa3a-4d80-bc2d-07a274c058aa"
            },
            "kind": "Event",
            "lastTimestamp": "2022-02-02T21:49:56Z",
            "message": "Job has reached the specified backoff limit",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:49:56Z",
                "name": "ip-reconciler-27397305.16d0167e64971e60",
                "namespace": "openshift-multus",
                "resourceVersion": "20105",
                "uid": "c939de9f-fac2-4301-89e1-b354a335c9de"
            },
            "reason": "BackoffLimitExceeded",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "job-controller"
            },
            "type": "Warning"
        },


it looks like it may have failed scheduling:

        {
            "action": "Scheduling",
            "apiVersion": "v1",
            "eventTime": "2022-02-02T21:45:00.241986Z",
            "firstTimestamp": null,
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "ip-reconciler-27397305-v4p2q",
                "namespace": "openshift-multus",
                "resourceVersion": "12112",
                "uid": "cc6d32ea-55c3-4003-aede-c52f24ca63a7"
            },
            "kind": "Event",
            "lastTimestamp": null,
            "message": "0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:45:00Z",
                "name": "ip-reconciler-27397305-v4p2q.16d01639821b22bf",
                "namespace": "openshift-multus",
                "resourceVersion": "12114",
                "uid": "adaaf922-c176-4f3f-8903-8b72c802edf2"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "default-scheduler",
            "reportingInstance": "default-scheduler-ci-op-050xwfij-db044-zv4gx-master-0",
            "source": {},
            "type": "Warning"
        },


which might mean it failed scheduling during upgrade (I haven't evaluated the timeline of the upgrade against this job execution), though all 3 workers should not have been down during upgrade to cause that.

--- Additional comment from W. Trevor King on 2022-02-10 19:37:15 UTC ---

This is completely hammering CI.  Just over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length'
301

And there's 4.10 impact too, so I'm moving the Version back to include that:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort
openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact
openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
...many, many more impacted jobs...

--- Additional comment from Douglas Smith on 2022-02-14 12:18:40 UTC ---

I've got a PR to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped this up quick independently in the meanwhile to look at impact.

We had a previous set of changes related to the same error, see: https://bugzilla.redhat.com/show_bug.cgi?id=2051639

However it looks as though the previous changes may have been insufficient.

Comment 2 Weibin Liang 2022-08-29 14:02:26 UTC

Following verifying steps from https://gist.github.com/dougbtv/b84c5dec4953f4b85048d16ddcf72c15, testing passed in 4.8.0-0.nightly-2022-08-25-142953

[weliang@weliang tmp]$ oc project openshift-multus
Already on project "openshift-multus" on server "https://api.weliang-829a.qe.gcp.devcluster.openshift.com:6443".
[weliang@weliang tmp]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
[weliang@weliang tmp]$ oc create -f /tmp/reconcile.yml
cronjob.batch/test-reconciler created
[weliang@weliang tmp]$ oc create job --from=cronjob/test-reconciler -n openshift-multus testrun-ip-reconciler
job.batch/testrun-ip-reconciler created
[weliang@weliang tmp]$ oc get pods | grep testrun
testrun-ip-reconciler-n5x7c           0/1     Error     0          8s
[weliang@weliang tmp]$ oc logs testrun-ip-reconciler-n5x7c
invalid value "invalid" for flag -timeout: parse error
Usage of /ip-reconciler:
  -kubeconfig string
    	the path to the Kubernetes configuration file
  -log-level ip-reconciler
    	the logging level for the ip-reconciler app. Valid values are: "debug", "verbose", "error", and "panic". (default "error")
  -timeout int
    	the value for a request timeout in seconds. (default 30)
[weliang@weliang tmp]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-08-25-142953   True        False         28m     Cluster version is 4.8.0-0.nightly-2022-08-25-142953
[weliang@weliang tmp]$

Comment 5 errata-xmlrpc 2022-09-14 20:38:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.8.49 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6308