Bug 2064859 - ip-reconcile job is failing consistently
Summary: ip-reconcile job is failing consistently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.z
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On: 2054404
Blocks: 2064860
 
Reported: 2022-03-16 19:05 UTC by Douglas Smith
Modified: 2022-05-12 20:41 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2054404
Clones: 2064860
Environment:
Last Closed: 2022-05-12 20:40:46 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/whereabouts-cni pull 89 (open): Bug 2064859: Sync context improvements [backport 4.9] - last updated 2022-04-15 01:00:50 UTC
Red Hat Product Errata RHBA-2022:1694 - last updated 2022-05-12 20:41:01 UTC

Description Douglas Smith 2022-03-16 19:05:49 UTC
+++ This bug was initially created as a clone of Bug #2054404 +++

+++ This bug was initially created as a clone of Bug #2050409 +++

Description of problem:
As seen in this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26804/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1488985461421510656

The test that ensures we don't have alerts firing is failing because of the "KubeJobFailed" alert, which is firing because the ip-reconciler job failed:


fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Feb  2 22:55:22.399: Unexpected alerts fired or pending during the upgrade:

alert KubeJobFailed fired for 2925 seconds with labels: {condition="true", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", job_name="ip-reconciler-27397305", namespace="openshift-multus", service="kube-state-metrics", severity="warning"}
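(As a rough aid for triage, assuming the cluster from the run is still reachable, the failed Job the alert points at can be inspected with something like the commands below; the alert's job_name label gives the exact name, and the numeric suffix differs per run.)

  # list Jobs in the multus namespace and dump the failed one's conditions
  $ oc -n openshift-multus get jobs
  $ oc -n openshift-multus get job ip-reconciler-27397305 -o jsonpath='{.status.conditions}{"\n"}'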



Version-Release number of selected component (if applicable):
4.10->4.11 upgrade


How reproducible:
Showing up pretty regularly in CI:

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
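(The same search can be counted from the command line via the search API, along the lines of:)

  # rough count of matching junit results over the last 48h
  $ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | jq -r 'keys | length'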



The way the alert is defined, it looks like a single failure of this job is enough to cause the alert to fire and stay firing until the failed job is removed:

https://github.com/openshift/cluster-monitoring-operator/blob/a1f93e39508feed796e30a7a944db72a79628951/assets/control-plane/prometheus-rule.yaml#L187-L192

So if there's reason to think this job is going to fail semi-regularly, perhaps we need to reconsider the definition of that alert (work with the Monitoring team) or whitelist this particular alert in the CI test (discuss with the OTA team first, as they are strongly concerned about alerts firing that admins will see).
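(Since the alert keeps firing as long as the failed Job object exists, one hedged workaround sketch for an affected cluster is simply to remove that Job; the CronJob creates a fresh one at its next scheduled run.)

  # remove the failed Job so KubeJobFailed stops firing (name varies per run)
  $ oc -n openshift-multus delete job ip-reconciler-27397305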


Unfortunately, I wasn't able to find much about why the job failed; here's the event:

        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2022-02-02T21:49:56Z",
            "involvedObject": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": "ip-reconciler-27397305",
                "namespace": "openshift-multus",
                "resourceVersion": "20074",
                "uid": "ef821118-aa3a-4d80-bc2d-07a274c058aa"
            },
            "kind": "Event",
            "lastTimestamp": "2022-02-02T21:49:56Z",
            "message": "Job has reached the specified backoff limit",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:49:56Z",
                "name": "ip-reconciler-27397305.16d0167e64971e60",
                "namespace": "openshift-multus",
                "resourceVersion": "20105",
                "uid": "c939de9f-fac2-4301-89e1-b354a335c9de"
            },
            "reason": "BackoffLimitExceeded",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "job-controller"
            },
            "type": "Warning"
        },
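(To dig further into why the retries were exhausted, something like the following should show the Job's backoff limit and the pod-level failures; the paths and labels here are the standard batch/v1 ones, not anything specific to this job.)

  # how many retries the Job allows before giving up
  $ oc -n openshift-multus get job ip-reconciler-27397305 -o jsonpath='{.spec.backoffLimit}{"\n"}'
  # pods created by the Job and their logs, selected via the standard job-name label
  $ oc -n openshift-multus get pods -l job-name=ip-reconciler-27397305
  $ oc -n openshift-multus logs -l job-name=ip-reconciler-27397305 --tail=50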


It looks like it may have failed scheduling:

        {
            "action": "Scheduling",
            "apiVersion": "v1",
            "eventTime": "2022-02-02T21:45:00.241986Z",
            "firstTimestamp": null,
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "ip-reconciler-27397305-v4p2q",
                "namespace": "openshift-multus",
                "resourceVersion": "12112",
                "uid": "cc6d32ea-55c3-4003-aede-c52f24ca63a7"
            },
            "kind": "Event",
            "lastTimestamp": null,
            "message": "0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
            "metadata": {
                "creationTimestamp": "2022-02-02T21:45:00Z",
                "name": "ip-reconciler-27397305-v4p2q.16d01639821b22bf",
                "namespace": "openshift-multus",
                "resourceVersion": "12114",
                "uid": "adaaf922-c176-4f3f-8903-8b72c802edf2"
            },
            "reason": "FailedScheduling",
            "reportingComponent": "default-scheduler",
            "reportingInstance": "default-scheduler-ci-op-050xwfij-db044-zv4gx-master-0",
            "source": {},
            "type": "Warning"
        },


This might mean it failed scheduling during the upgrade (I haven't evaluated the timeline of the upgrade against this job execution), though all three workers shouldn't have been down at once during the upgrade, which is what it would take to cause that.
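(To confirm the FailedScheduling message above, a quick sketch is to compare the current node taints against the tolerations on the ip-reconciler CronJob's pod template, roughly:)

  # current taints on each node
  $ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
  # tolerations the ip-reconciler pods are created with
  $ oc -n openshift-multus get cronjob ip-reconciler -o jsonpath='{.spec.jobTemplate.spec.template.spec.tolerations}{"\n"}'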

--- Additional comment from W. Trevor King on 2022-02-10 19:37:15 UTC ---

This is completely hammering CI.  Just over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length'
301

And there's 4.10 impact too, so I'm moving the Version back to include that:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort
openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact
openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
...many, many more impacted jobs...

--- Additional comment from Douglas Smith on 2022-02-14 12:18:40 UTC ---

I've got a PR up to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped it up quickly on my own in the meantime to look at the impact).

We had a previous set of changes related to the same error, see: https://bugzilla.redhat.com/show_bug.cgi?id=2051639

However, it looks as though the previous changes may have been insufficient.

Comment 3 Weibin Liang 2022-04-28 17:39:30 UTC
Tested and verified in 4.9.0-0.nightly-2022-04-27-100704

[weliang@weliang ~]$ oc project openshift-multus
Now using project "openshift-multus" on server "https://api.weliang-4282.qe.gcp.devcluster.openshift.com:6443".
[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
[weliang@weliang ~]$ oc create -f /tmp/reconcile.yml
cronjob.batch/test-reconciler created
[weliang@weliang ~]$ oc create job --from=cronjob/test-reconciler -n openshift-multus testrun-ip-reconciler
job.batch/testrun-ip-reconciler created
[weliang@weliang ~]$ oc get pods | grep testrun
testrun-ip-reconciler--1-49zsr        0/1     Error     0          8s
[weliang@weliang ~]$ oc logs testrun-ip-reconciler--1-49zsr
invalid value "invalid" for flag -timeout: parse error
Usage of /ip-reconciler:
  -kubeconfig string
    	the path to the Kubernetes configuration file
  -log-level ip-reconciler
    	the logging level for the ip-reconciler app. Valid values are: "debug", "verbose", "error", and "panic". (default "error")
  -timeout int
    	the value for a request timeout in seconds. (default 30)
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-04-27-100704   True        False         3h14m   Cluster version is 4.9.0-0.nightly-2022-04-27-100704
[weliang@weliang ~]$
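The filtering one-liner above gets the job done but is hard to read; an equivalent hedged sketch with jq (assuming the reconciler's flags live under .spec.jobTemplate.spec.template.spec.containers[0].args, which may differ) would be:

  # clone the CronJob under a new name and append an invalid -timeout flag
  $ oc -n openshift-multus get cronjob ip-reconciler -o json \
      | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.ownerReferences, .status)
            | .metadata.name = "test-reconciler"
            | .spec.jobTemplate.spec.template.spec.containers[0].args += ["-timeout=invalid"]' \
      | oc create -f -
  # trigger it once and check that it fails with the flag parse error
  $ oc -n openshift-multus create job --from=cronjob/test-reconciler testrun-ip-reconciler
  $ oc -n openshift-multus logs -l job-name=testrun-ip-reconciler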

Comment 6 errata-xmlrpc 2022-05-12 20:40:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.32 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1694

