Bug 1723966

Summary: During upgrade, kube-apiserver operator reporting: NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: kube-apiserver    Assignee: Stefan Schimanski <sttts>
Status: CLOSED CURRENTRELEASE QA Contact: Xingxing Xia <xxia>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.z    CC: aos-bugs, brad.williams, deads, dollierp, florin-alexandru.peter, igor.tiunov, jokerman, kewang, mfojtik, palonsor, sttts, wking
Target Milestone: ---    Keywords: DeliveryBlocker, Reopened, Upgrades
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-12 09:23:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
audit log mentions of installer pod      none
node / operator listings                 none

Description Justin Pierce 2019-06-25 21:46:28 UTC
Description of problem:


Version-Release number of selected component (if applicable):
During upgrade from 4.1.2 to 4.1.3

How reproducible:
1 out of 3 upgrades

Steps to Reproduce:
1. Trigger an upgrade from 4.1.2 to 4.1.3
2.
3.

Actual results:
Messages like the following were reported for multiple hours:
02:49:47.109  ClusterOperator not fully ready: kube-apiserver
02:49:47.109  	Degraded=True  :: NodeInstallerDegraded: 1 nodes are failing on revision 22:
02:49:47.109  NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
02:49:47.109  	Progressing=True  :: Progressing: 1 nodes are at revision 20; 2 nodes are at revision 22
02:49:47.109  	Available=True  :: Available: 3 nodes are active; 1 nodes are at revision 20; 2 nodes are at revision 22


Additional info:
Upgrade was unable to complete. See attachments for additional details.

Comment 2 Justin Pierce 2019-06-25 21:51:46 UTC
Created attachment 1584478 [details]
audit log mentions of installer pod

Comment 3 Justin Pierce 2019-06-25 21:52:56 UTC
Created attachment 1584479 [details]
node / operator listings

Comment 19 David Eads 2019-08-22 21:37:28 UTC
One issue I found while doing this is that our operator reads pods to determine status.  That status becomes unreliable as pods are deleted.  I still suspect that a pod is being deleted unexpectedly (all pods on a node actually), but we can make our operator more resilient by using a non-pod resource to track the status of whether installer pods are successful or not.  It's fairly involved surgery, but I don't see a way to reliably function otherwise when pods can be deleted by other actors.
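
A minimal sketch of the approach described above, assuming a hypothetical ConfigMap named "installer-results" and hypothetical helper names (this is not the actual operator code): persist each installer pod's terminal phase in a ConfigMap so the operator can still evaluate node status after the pod itself has been deleted.

package installerstate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resultsConfigMap is a hypothetical ConfigMap that outlives the installer pods.
const resultsConfigMap = "installer-results"

// RecordInstallerResult stores the terminal phase of an installer pod keyed by
// the pod name (e.g. "installer-22-ip-10-0-129-159...": "Succeeded").
func RecordInstallerResult(ctx context.Context, client kubernetes.Interface, ns string, pod *corev1.Pod) error {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, resultsConfigMap, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: resultsConfigMap, Namespace: ns},
			Data:       map[string]string{pod.Name: string(pod.Status.Phase)},
		}
		_, err = client.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[pod.Name] = string(pod.Status.Phase)
	_, err = client.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

// InstallerSucceeded answers "did this installer pod succeed?" from the recorded
// result instead of the pod object, which may already have been deleted.
func InstallerSucceeded(ctx context.Context, client kubernetes.Interface, ns, podName string) (bool, error) {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, resultsConfigMap, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return cm.Data[podName] == string(corev1.PodSucceeded), nil
}

With something along these lines, the status check could consult the recorded results first and only fall back to listing pods, so a deleted installer pod would no longer leave the operator Degraded with "pods ... not found".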

Comment 24 Venkata Siva Teja Areti 2020-05-20 12:27:38 UTC
I’m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available.

Comment 25 Pablo Alonso Rodriguez 2020-05-20 12:29:53 UTC
In other bugs I have worked on, LifecycleStale was added and the bug was auto-closed 7 days later. To prevent that from happening here, I am removing it. Please let me know if this is wrong.

Thanks and regards.

Comment 26 Venkata Siva Teja Areti 2020-05-27 17:26:07 UTC
This needs architecture discussions and is not targeted for the current release. Moving it to 4.6.

Comment 27 Venkata Siva Teja Areti 2020-06-18 10:56:19 UTC
I am currently working on other priority items.

Comment 28 Venkata Siva Teja Areti 2020-07-09 19:07:45 UTC
I am adding the UpcomingSprint keyword as I am working on other deliverables for 4.6.

Comment 29 Venkata Siva Teja Areti 2020-07-31 18:45:19 UTC
I am working on other high priority items. I will get to this bug next sprint.

Comment 30 W. Trevor King 2020-08-03 19:04:54 UTC
We hit this in a 4.4.3 -> 4.4.15 CI run [1], so there should be a must-gather there with everything that modern CI runs collect. If that's not sufficient to debug the issue, we probably need a separate bug about growing the data collected in CI runs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1288352996144451584

Comment 31 Venkata Siva Teja Areti 2020-08-21 22:00:38 UTC
Work on this bug will be evaluated in the next sprint.

Comment 32 Venkata Siva Teja Areti 2020-08-26 14:41:16 UTC
Closing this tracker. The same issue is tracked in the bug linked below, and a PR to resolve it has been posted.

https://bugzilla.redhat.com/show_bug.cgi?id=1858763

*** This bug has been marked as a duplicate of bug 1858763 ***

Comment 34 Red Hat Bugzilla 2023-09-15 00:17:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days