Bug 2050409
Summary: | ip-reconcile job is failing consistently | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> | |
Component: | Networking | Assignee: | Douglas Smith <dosmith> | |
Networking sub component: | multus | QA Contact: | Weibin Liang <weliang> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | wking | |
Version: | 4.10 | |||
Target Milestone: | --- | |||
Target Release: | 4.11.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2054404 (view as bug list) | Environment: | ||
Last Closed: | 2022-08-10 10:47:09 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2054404 |
Description
Ben Parees
2022-02-03 21:56:21 UTC
This is completely hammering CI. Just over the past 24h: $ curl -s 'https://search.ci.openshift.org/search?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=24h&type=junit' | jq -r 'keys | length' 301 And there's 4.10 impact too, so I'm moving the Version back to include that: $ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&type=junit' | grep 'failures match' | grep -v 'rehearse\|pull-ci' | sort openshift-router-375-ci-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 60% failed, 17% of failures match = 10% impact openshift-router-375-ci-4.11-e2e-gcp-upgrade (all) - 11 runs, 73% failed, 38% of failures match = 27% impact openshift-router-375-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 11 runs, 55% failed, 50% of failures match = 27% impact openshift-router-375-nightly-4.11-e2e-aws-upgrade (all) - 10 runs, 40% failed, 25% of failures match = 10% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-single-node (all) - 8 runs, 100% failed, 38% of failures match = 38% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 8 runs, 38% failed, 33% of failures match = 13% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-remote-libvirt-ppc64le (all) - 4 runs, 75% failed, 67% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-serial-aws-arm64 (all) - 8 runs, 50% failed, 50% of failures match = 25% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 14 runs, 93% failed, 62% of failures match = 57% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 14 runs, 29% failed, 25% of failures match = 7% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-remote-libvirt-ppc64le (all) - 3 runs, 100% failed, 67% of failures match = 67% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 14 runs, 36% failed, 80% of failures match = 29% impact periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.10-e2e-aws (all) - 2 runs, 50% failed, 100% of failures match = 50% impact periodic-ci-openshift-release-master-ci-4.10-e2e-aws-cgroupsv2 (all) - 16 runs, 13% failed, 50% of failures match = 6% impact ...many, many more impacted jobs... I've got a PR to see how much impact this has on CI: https://github.com/openshift/whereabouts-cni/pull/84 -- I still need to talk to my team about this approach (I whipped this up quick independently in the meanwhile to look at impact. We had a previous set of changes related to the same error, see: https://bugzilla.redhat.com/show_bug.cgi?id=2051639 However it looks as though the previous changes may have been insufficient. Can not find failed log in recently e2e CI testing: https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=6h&context=1&type=bug&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069 |