+++ This bug was initially created as a clone of Bug #2064861 +++
+++ This bug was initially created as a clone of Bug #2064860 +++
+++ This bug was initially created as a clone of Bug #2064859 +++
+++ This bug was initially created as a clone of Bug #2054404 +++
+++ This bug was initially created as a clone of Bug #2050409 +++

Description of problem:

As seen in this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26804/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1488985461421510656

The test that ensures we don't have alerts firing is failing because of the "KubeJobFailed" alert, which is firing because the ip-reconciler job failed:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Feb 2 22:55:22.399: Unexpected alerts fired or pending during the upgrade:

alert KubeJobFailed fired for 2925 seconds with labels: {condition="true", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", job_name="ip-reconciler-27397305", namespace="openshift-multus", service="kube-state-metrics", severity="warning"}

Version-Release number of selected component (if applicable):
4.10->4.11 upgrade

How reproducible:
Showing up pretty regularly in CI:
https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The way the alert is defined, a single failure of this job is enough to make the alert fire and keep firing until the failed job is removed:
https://github.com/openshift/cluster-monitoring-operator/blob/a1f93e39508feed796e30a7a944db72a79628951/assets/control-plane/prometheus-rule.yaml#L187-L192

So if there's reason to think this job is going to fail semi-regularly, perhaps we need to reconsider the definition of that alert (work with the Monitoring team) or whitelist this particular alert in the CI test (discuss with the OTA team first, as they are strongly concerned about alerts that fire and that admins will see).
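For context, the linked rule is essentially the upstream kubernetes-mixin KubeJobFailed rule. The sketch below is a paraphrase from memory, not a copy of the cluster-monitoring-operator asset, so the linked file is authoritative if they differ:

# Approximate shape of the KubeJobFailed rule (upstream kubernetes-mixin); the
# exact expression/annotations in the linked CMO asset may differ slightly.
- alert: KubeJobFailed
  annotations:
    description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert.
    summary: Job failed to complete.
  # kube_job_failed is exported per Job by kube-state-metrics; the series with
  # condition="true" goes to 1 when the Job hits its backoff limit and stays
  # there as long as the failed Job object exists.
  expr: kube_job_failed{job="kube-state-metrics"} > 0
  for: 15m
  labels:
    severity: warning

Because the metric only clears when the failed Job object goes away, presumably the alert keeps firing until someone deletes the Job or the CronJob's failed-jobs history rotates it out, which matches the ~2925 seconds of firing seen above.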
"ip-reconciler-27397305-v4p2q.16d01639821b22bf", "namespace": "openshift-multus", "resourceVersion": "12114", "uid": "adaaf922-c176-4f3f-8903-8b72c802edf2" }, "reason": "FailedScheduling", "reportingComponent": "default-scheduler", "reportingInstance": "default-scheduler-ci-op-050xwfij-db044-zv4gx-master-0", "source": {}, "type": "Warning" }, which might mean it failed scheduling during upgrade (I haven't evaluated the timeline of the upgrade against this job execution), though all 3 workers should not have been down during upgrade to cause that. --- Additional comment from W. Trevor King on 2022-02-10 19:37:15 UTC --- This is completely hammering CI. Just over the past 24h: $ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length' 301 And there's 4.10 impact too, so I'm moving the Version back to include that: $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact ...many, many more impacted jobs... 
--- Additional comment from W. Trevor King on 2022-02-10 19:37:15 UTC ---

This is completely hammering CI. Just over the past 24h:

$ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length'
301

And there's 4.10 impact too, so I'm moving the Version back to include that:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort
openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact
openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact
openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact
periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
...many, many more impacted jobs...

--- Additional comment from Douglas Smith on 2022-02-14 12:18:40 UTC ---

I've got a PR up to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped it up quickly on my own in the meantime to look at the impact). We had a previous set of changes related to the same error, see: https://bugzilla.redhat.com/show_bug.cgi?id=2051639. However, it looks as though the previous changes may have been insufficient.