Bug 2054426

Summary: ip-reconciler still fails during initial cluster installs
Product: OpenShift Container Platform Reporter: David Eads <deads>
Component: Networking Assignee: Douglas Smith <dosmith>
Networking sub component: multus QA Contact: Weibin Liang <weliang>
Status: CLOSED DEFERRED Docs Contact:
Severity: medium    
Priority: high CC: bparees, dgoodwin, wking
Version: 4.10 Keywords: Reopened
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-03-09 01:12:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Eads 2022-02-14 23:00:32 UTC
The situation appears to have improved with the retry, but has not fully resolved.  The cronjob retries and succeeds quickly enough now that we see failing pods removed.

At this time, the error definitely qualifies as weird (notice the failure here https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1493301582970753024 happening during an upgrade with no obvious reason for a failure)

I don't consider this a blocker at this time, but correcting this in 4.11 or devoting the effort to fix the test to unblock the org is important.

Comment 1 Douglas Smith 2022-02-14 23:25:30 UTC
Thanks David. I've got a couple of PRs posted for a change that introduces a set of known errors for the ip-reconciler: if an error matches one of them, the ip-reconciler exits zero. It's a trade-off between correctness testing and visibility into other issues that match a known error. But given that it's been giving us some headaches, I think having the ip-reconciler ignore some errors is the direction I'd tip the scales.

https://github.com/openshift/whereabouts-cni/pull/84
https://github.com/openshift/whereabouts-cni/pull/85

I'm looking to get a review from my team tomorrow morning. But if we can check if those improve CI, that's also good feedback.

There was also an additional report @ https://bugzilla.redhat.com/show_bug.cgi?id=2050409 -- which is where I have the PRs posted
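
For readers following along, the "known errors" idea described in this comment can be sketched roughly as below. This is only an illustration in Python, not the code from the PRs: the patterns in KNOWN_ERROR_PATTERNS and the function names are hypothetical, and the real ip-reconciler may match errors differently.

```python
import re
import sys
from typing import Optional

# HYPOTHETICAL patterns, for illustration only. The actual set of known
# errors is defined in the PRs linked above and is not reproduced here.
KNOWN_ERROR_PATTERNS = [
    re.compile(r"context deadline exceeded"),
    re.compile(r"connection refused"),
]


def is_known_error(message: str) -> bool:
    """Return True if the message matches any known (tolerated) error."""
    return any(p.search(message) for p in KNOWN_ERROR_PATTERNS)


def exit_code_for(error: Optional[Exception]) -> int:
    """Map the outcome of a reconcile run to a process exit code.

    Known errors are logged but reported as success (exit zero), so the
    job is not marked failed. Anything else still exits non-zero.
    """
    if error is None:
        return 0
    if is_known_error(str(error)):
        print(f"ignoring known error: {error}", file=sys.stderr)
        return 0
    print(f"reconcile failed: {error}", file=sys.stderr)
    return 1
```

The trade-off mentioned above is visible in the sketch: any unrelated failure whose message happens to match a known pattern is also reported as success.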

Comment 2 Devan Goodwin 2022-02-15 15:09:11 UTC
For tracking current hit rate: 

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired.*ip-reconciler&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

I have filed a PR to skip the test for now: https://github.com/openshift/origin/pull/26842. Once it merges, we'll need a new search CI query to track how often it's occurring.

Jira filed to make this its own test in the future.
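
As a convenience for anyone adapting the search CI link in this comment, here is a small sketch that builds an equivalent URL from its decoded parameters. The parameter names and values are copied from the link above; the helper names are made up for this example, and what a replacement query should search for is left open.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://search.ci.openshift.org/"

# Parameters decoded from the link in this comment ("+" in the URL is a
# space, "%2B" is a literal plus sign).
PARAMS = {
    "search": r"alert KubeJobFailed fired.*ip-reconciler",
    "maxAge": "48h",
    "context": "1",
    "type": "bug+junit",
    "name": "",
    "excludeName": "",
    "maxMatches": "5",
    "maxBytes": "20971520",
    "groupBy": "job",
}


def build_query_url(params: dict) -> str:
    """Encode the parameters into a search CI URL."""
    return BASE + "?" + urlencode(params)


def decode_query_url(url: str) -> dict:
    """Decode a search CI URL back into a flat parameter dict."""
    query = urlsplit(url).query
    return {k: v[0] for k, v in parse_qs(query, keep_blank_values=True).items()}


if __name__ == "__main__":
    print(build_query_url(PARAMS))
```

The generated URL may percent-encode some characters differently from the hand-written link (for example the `*` in the regex), but it decodes to the same parameters.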

Comment 3 Douglas Smith 2022-11-09 14:14:24 UTC
Please reopen or create another BZ if we're still seeing CI results for this problem.

Comment 4 Ben Parees 2022-11-14 22:37:57 UTC
This is causing a failed 4.8 payload acceptance due to our 4.7->4.8 upgrade jobs (which likely means any fix that's been applied needs to get into 4.7 to fully avoid this).

see:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1592129606809292800

Comment 5 Ben Parees 2022-11-14 22:38:45 UTC
If you want to close this and create a new Jira bug to track resolving this in 4.7/4.8, that's OK with me.

Comment 7 Shiftzilla 2023-03-09 01:12:58 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9119