Description of problem:

Version-Release number of selected component (if applicable):
During upgrade from 4.1.2 to 4.1.3

How reproducible:
1 out of 3 upgrades

Steps to Reproduce:
1. Trigger an upgrade from 4.1.2 to 4.1.3
2.
3.

Actual results:
Messages like the following were reported for multiple hours:

02:49:47.109 ClusterOperator not fully ready: kube-apiserver
02:49:47.109   Degraded=True :: NodeInstallerDegraded: 1 nodes are failing on revision 22:
02:49:47.109   NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
02:49:47.109   Progressing=True :: Progressing: 1 nodes are at revision 20; 2 nodes are at revision 22
02:49:47.109   Available=True :: Available: 3 nodes are active; 1 nodes are at revision 20; 2 nodes are at revision 22

Additional info:
The upgrade was unable to complete. See attachments for additional details.
Created attachment 1584478 [details] audit log mentions of installer pod
Created attachment 1584479 [details] node / operator listings
One issue I found while doing this is that our operator reads pods to determine status. That status becomes unreliable once pods are deleted. I still suspect that a pod is being deleted unexpectedly (all pods on a node, actually), but we can make our operator more resilient by using a non-pod resource to track whether installer pods succeeded. It's fairly involved surgery, but I don't see a way to function reliably otherwise when pods can be deleted by other actors.
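A minimal sketch of that idea, assuming client-go and a hypothetical ConfigMap-based record that the installer pod would write before exiting (the resource name, naming convention, and data format below are illustrative, not the operator's actual API):

package installerstatus

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// installerSucceeded first checks the installer pod, then falls back to a
// durable record (a ConfigMap) if the pod has already been deleted by
// another actor, instead of immediately reporting the node as degraded.
func installerSucceeded(ctx context.Context, client kubernetes.Interface, ns string, revision int, node string) (bool, error) {
	podName := fmt.Sprintf("installer-%d-%s", revision, node)

	pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err == nil {
		return pod.Status.Phase == corev1.PodSucceeded, nil
	}
	if !apierrors.IsNotFound(err) {
		return false, err
	}

	// Pod is gone; consult the durable record. "installer-status-<rev>-<node>"
	// is an assumed naming convention for this sketch only.
	cmName := fmt.Sprintf("installer-status-%d-%s", revision, node)
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if err != nil {
		// Includes NotFound: no record exists, so the result is unknown.
		return false, err
	}
	return cm.Data["result"] == "succeeded", nil
}

The key design point is that the fallback resource is written by the installer itself and is not garbage-collected with the pod, so status reporting survives pod deletion.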
I'm adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available.
In other bugs I have worked on, LifecycleStale was added and the bug was auto-closed 7 days later. To prevent that, I am removing it. Please let me know if this is wrong. Thanks and regards.
This needs arch discussions and is not targeted for the current release. Moving it to 4.6.
I am currently working on other priority items.
I am adding the UpcomingSprint keyword, as I am working on other deliverables for 4.6.
I am working on other high priority items. I will get to this bug next sprint.
We hit this in 4.4.3 -> 4.4.15 CI [1], so there should be a must-gather there with all the stuff that modern CI runs collect in place. If that's not sufficient to debug the issue, we probably need a separate bug about growing the data collected in CI runs. [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1288352996144451584
Work on this bug will be evaluated next sprint.
Closing this tracker. The same issue is tracked at the link below, and a PR to resolve it has been posted. https://bugzilla.redhat.com/show_bug.cgi?id=1858763 *** This bug has been marked as a duplicate of bug 1858763 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days