Bug 1723966
| Field | Value |
|---|---|
| Summary | During upgrade, kube-apiserver operator reporting: NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found |
| Product | OpenShift Container Platform |
| Component | kube-apiserver |
| Version | 4.1.z |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Target Milestone | --- |
| Target Release | 4.7.0 |
| Reporter | Justin Pierce <jupierce> |
| Assignee | Stefan Schimanski <sttts> |
| QA Contact | Xingxing Xia <xxia> |
| CC | aos-bugs, brad.williams, deads, dollierp, florin-alexandru.peter, igor.tiunov, jokerman, kewang, mfojtik, palonsor, sttts, wking |
| Keywords | DeliveryBlocker, Reopened, Upgrades |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2021-01-12 09:23:20 UTC |
| Attachments | audit log mentions of installer pod (1584478), node / operator listings (1584479) |
Description
Justin Pierce, 2019-06-25 21:46:28 UTC
Created attachment 1584478 [details]
audit log mentions of installer pod
Created attachment 1584479 [details]
node / operator listings
One issue I found while doing this is that our operator reads pods to determine status. That status becomes unreliable as pods are deleted. I still suspect that a pod is being deleted unexpectedly (all pods on a node, actually), but we can make our operator more resilient by using a non-pod resource to track whether installer pods succeeded. It's fairly involved surgery, but I don't see a way to function reliably otherwise when pods can be deleted by other actors. (A rough sketch of this idea appears at the end of this report.)

I'm adding UpcomingSprint because I lack the information to properly root-cause the bug. I will revisit this bug when that information is available.

In other bugs I have worked on, LifecycleStale was added and the bug was auto-closed 7 days later. To prevent that, I am removing it. Please let me know if this is wrong. Thanks and regards.

This needs arch discussions and is not targeted for the current release. Moving it to 4.6.

I am currently working on other priority items. I am adding the UpcomingSprint keyword as I am working on other deliverables for 4.6.

I am working on other high priority items. I will get to this bug next sprint.

We hit this in 4.4.3 -> 4.4.15 CI [1], so there should be a must-gather there with all the data that modern CI runs collect. If that's not sufficient to debug the issue, we probably need a separate bug about growing the data collected in CI runs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1288352996144451584

Working on this bug will be evaluated in the next sprint.

Closing this tracker. The same issue is tracked in the link below and a PR to resolve it is posted.
https://bugzilla.redhat.com/show_bug.cgi?id=1858763

*** This bug has been marked as a duplicate of bug 1858763 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
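The suggestion in the first comment above (track installer outcomes in something more durable than the installer pod itself) could look roughly like the sketch below. This is a minimal illustration, not the actual kube-apiserver operator or library-go installer-controller code; `recordedOutcome` and `markInstallerSucceeded` are hypothetical stand-ins for whatever non-pod record (for example, a field on the operator's status) would hold the result.

```go
// Minimal sketch of the "non-pod resource" idea from the first comment; it is
// not the real installer controller. recordedOutcome and markInstallerSucceeded
// are hypothetical stand-ins for a durable record of installer results.
package installerstatus

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recordedOutcome stands in for a durable, non-pod record keyed by installer pod name.
var recordedOutcome = map[string]bool{}

// markInstallerSucceeded would persist the outcome somewhere that outlives the pod,
// e.g. the operator's status, rather than this in-memory map.
func markInstallerSucceeded(podName string) { recordedOutcome[podName] = true }

// installerPodSucceeded reports whether the installer pod for a revision completed.
// If the pod has been deleted by another actor, it consults the durable record
// instead of immediately reporting a "pods ... not found" degraded condition.
func installerPodSucceeded(ctx context.Context, client kubernetes.Interface, namespace, podName string) (bool, error) {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		if recordedOutcome[podName] {
			return true, nil
		}
		return false, fmt.Errorf("installer pod %s/%s was deleted before its outcome was recorded", namespace, podName)
	}
	if err != nil {
		return false, err
	}
	if pod.Status.Phase == corev1.PodSucceeded {
		// Record the result so that a later deletion of the pod is harmless.
		markInstallerSucceeded(podName)
		return true, nil
	}
	return false, nil
}
```

The sketch only illustrates the direction discussed in that comment; as noted above, the issue itself was ultimately tracked and resolved via bug 1858763.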