+++ This bug was initially created as a clone of Bug #2054404 +++

+++ This bug was initially created as a clone of Bug #2050409 +++

Description of problem:

As seen in this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26804/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1488985461421510656

The test that ensures we don't have alerts firing is failing because of the "KubeJobFailed" alert, which is firing because of the ip-reconciler job failing:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Feb 2 22:55:22.399: Unexpected alerts fired or pending during the upgrade:

alert KubeJobFailed fired for 2925 seconds with labels: {condition="true", container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", job_name="ip-reconciler-27397305", namespace="openshift-multus", service="kube-state-metrics", severity="warning"}

Version-Release number of selected component (if applicable):
4.10->4.11 upgrade

How reproducible:
Showing up pretty regularly in CI:
https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The way the alert is defined, it looks like a single failure of this job is enough to cause the alert to fire and stay firing until the failed job is removed:
https://github.com/openshift/cluster-monitoring-operator/blob/a1f93e39508feed796e30a7a944db72a79628951/assets/control-plane/prometheus-rule.yaml#L187-L192

So if there's reason to think this job is going to fail semi-regularly, perhaps we need to reconsider the definition of that alert (work w/ the Monitoring team) or whitelist this particular alert in the CI test (discuss w/ OTA team first, as they are strongly concerned about alerts that fire that admins will see).
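For context (not from the job artifacts): the linked rule appears to be essentially the upstream kubernetes-mixin KubeJobFailed definition, so it keeps firing for as long as a failed Job object exists. A rough way to inspect it on a live cluster and to clear the alert once the failure is understood; these are illustrative commands, and the exact expression/label matchers in the pinned file may differ:

# Dump the KubeJobFailed rule as shipped by cluster-monitoring-operator; the expression is
# roughly kube_job_failed{job="kube-state-metrics"} > 0, held for ~15m at severity=warning.
$ oc -n openshift-monitoring get prometheusrules -o yaml | grep -B 2 -A 8 'alert: KubeJobFailed'

# Because the alert only clears when the failed Job goes away, deleting the failed run
# (after investigation) is what resolves it:
$ oc -n openshift-multus get jobs
$ oc -n openshift-multus delete job ip-reconciler-27397305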
"reportingInstance": "default-scheduler-ci-op-050xwfij-db044-zv4gx-master-0", "source": {}, "type": "Warning" }, which might mean it failed scheduling during upgrade (I haven't evaluated the timeline of the upgrade against this job execution), though all 3 workers should not have been down during upgrade to cause that. --- Additional comment from W. Trevor King on 2022-02-10 19:37:15 UTC --- This is completely hammering CI. Just over the past 24h: $ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length' 301 And there's 4.10 impact too, so I'm moving the Version back to include that: $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact ...many, many more impacted jobs... --- Additional comment from Douglas Smith on 2022-02-14 12:18:40 UTC --- I've got a PR to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped this up quick independently in the meanwhile to look at impact. 
We had a previous set of changes related to the same error, see:
https://bugzilla.redhat.com/show_bug.cgi?id=2051639

However, it looks as though the previous changes may have been insufficient.
Tested and verified in 4.9.0-0.nightly-2022-04-27-100704

[weliang@weliang ~]$ oc project openshift-multus
Now using project "openshift-multus" on server "https://api.weliang-4282.qe.gcp.devcluster.openshift.com:6443".
[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/ - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
[weliang@weliang ~]$ oc create -f /tmp/reconcile.yml
cronjob.batch/test-reconciler created
[weliang@weliang ~]$ oc create job --from=cronjob/test-reconciler -n openshift-multus testrun-ip-reconciler
job.batch/testrun-ip-reconciler created
[weliang@weliang ~]$ oc get pods | grep testrun
testrun-ip-reconciler--1-49zsr   0/1   Error   0   8s
[weliang@weliang ~]$ oc logs testrun-ip-reconciler--1-49zsr
invalid value "invalid" for flag -timeout: parse error
Usage of /ip-reconciler:
  -kubeconfig string
        the path to the Kubernetes configuration file
  -log-level ip-reconciler
        the logging level for the ip-reconciler app. Valid values are: "debug", "verbose", "error", and "panic". (default "error")
  -timeout int
        the value for a request timeout in seconds. (default 30)
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-04-27-100704   True        False         3h14m   Cluster version is 4.9.0-0.nightly-2022-04-27-100704
[weliang@weliang ~]$
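As an additional sanity check (not part of the transcript above), one could confirm whether the induced failure drives KubeJobFailed into pending/firing by querying the cluster's Prometheus through the thanos-querier route; a rough sketch, assuming the default openshift-monitoring setup and a token with permission to query:

# Hypothetical follow-up, not from the verification run above.
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer ${TOKEN}" \
    "https://${HOST}/api/v1/query" \
    --data-urlencode 'query=ALERTS{alertname="KubeJobFailed", namespace="openshift-multus"}'

# Cleanup of the throwaway test objects once done:
$ oc -n openshift-multus delete job testrun-ip-reconciler
$ oc -n openshift-multus delete cronjob test-reconciler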
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.32 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1694