Bug 1723966 - During upgrade, kube-apiserver operator reporting:: NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-25 21:46 UTC by Justin Pierce
Modified: 2023-12-15 16:34 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-12 09:23:20 UTC
Target Upstream Version:
Embargoed:


Attachments
audit log mentions of installer pod (8.26 KB, text/plain) - 2019-06-25 21:51 UTC, Justin Pierce
node / operator listings (18.31 KB, text/plain) - 2019-06-25 21:52 UTC, Justin Pierce


Links
Red Hat Knowledge Base (Solution) 4849711 (last updated 2020-02-21 16:46:30 UTC)

Description Justin Pierce 2019-06-25 21:46:28 UTC
Description of problem:


Version-Release number of selected component (if applicable):
During upgrade from 4.1.2 to 4.1.3

How reproducible:
1 out of 3 upgrades

Steps to Reproduce:
1. Trigger an upgrade from 4.1.2 to 4.1.3
2.
3.

Actual results:
Messages like the following were reported for multiple hours:
02:49:47.109  ClusterOperator not fully ready: kube-apiserver
02:49:47.109  	Degraded=True  :: NodeInstallerDegraded: 1 nodes are failing on revision 22:
02:49:47.109  NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
02:49:47.109  	Progressing=True  :: Progressing: 1 nodes are at revision 20; 2 nodes are at revision 22
02:49:47.109  	Available=True  :: Available: 3 nodes are active; 1 nodes are at revision 20; 2 nodes are at revision 22
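For reference, here is a minimal sketch (not part of the original report) of how the conditions quoted above can be read programmatically with openshift/client-go when automating upgrade monitoring. It assumes a reachable cluster and a kubeconfig at $KUBECONFIG or ~/.kube/config; the operator name "kube-apiserver" is taken from the messages above.

// Sketch: print the kube-apiserver ClusterOperator conditions that produced
// the NodeInstallerDegraded message quoted above. Assumes a reachable cluster
// and a kubeconfig; not taken from this bug's tooling.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	kubeconfig := os.Getenv("KUBECONFIG")
	if kubeconfig == "" {
		home, _ := os.UserHomeDir()
		kubeconfig = filepath.Join(home, ".kube", "config")
	}
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "kube-apiserver", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// Print Degraded/Progressing/Available in the same shape as the messages above.
	for _, c := range co.Status.Conditions {
		fmt.Printf("%s=%s :: %s\n", c.Type, c.Status, c.Message)
	}
}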


Additional info:
Upgrade was unable to complete. See attachments for additional details.

Comment 2 Justin Pierce 2019-06-25 21:51:46 UTC
Created attachment 1584478 [details]
audit log mentions of installer pod

Comment 3 Justin Pierce 2019-06-25 21:52:56 UTC
Created attachment 1584479 [details]
node / operator listings

Comment 19 David Eads 2019-08-22 21:37:28 UTC
One issue I found while doing this is that our operator reads pods to determine status.  That status becomes unreliable as pods are deleted.  I still suspect that a pod is being deleted unexpectedly (all pods on a node actually), but we can make our operator more resilient by using a non-pod resource to track the status of whether installer pods are successful or not.  It's fairly involved surgery, but I don't see a way to reliably function otherwise when pods can be deleted by other actors.
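To make that failure mode concrete, here is a minimal sketch of the pod-lookup pattern described above; the function name is hypothetical and the namespace/pod name follow this bug, it is not the operator's actual code. Once another actor deletes the installer pod, a NotFound lookup is indistinguishable from "never ran", which is why tracking the outcome in a non-pod resource is suggested.

// Sketch of the ambiguity described in comment 19: installer progress is read
// back from the installer pod itself, so a deleted pod erases the evidence.
// installerStatus is a hypothetical helper, not operator code.
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// installerStatus reports how the installer pod for a node/revision ended.
// If the pod has been deleted by another actor, there is nothing left to
// inspect: "succeeded and cleaned up" and "never ran" look identical, and the
// operator keeps reporting NodeInstallerDegraded, as seen in this bug.
func installerStatus(ctx context.Context, client kubernetes.Interface, podName string) (string, error) {
	pod, err := client.CoreV1().Pods("openshift-kube-apiserver").Get(ctx, podName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return "", fmt.Errorf("pods %q not found: installer outcome is unknown", podName)
	}
	if err != nil {
		return "", err
	}
	switch pod.Status.Phase {
	case corev1.PodSucceeded:
		return "succeeded", nil
	case corev1.PodFailed:
		return "failed", nil
	default:
		return "in progress", nil
	}
}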

Comment 24 Venkata Siva Teja Areti 2020-05-20 12:27:38 UTC
I'm adding UpcomingSprint because I lack the information to properly root-cause the bug. I will revisit this bug when the information is available.

Comment 25 Pablo Alonso Rodriguez 2020-05-20 12:29:53 UTC
On other bugs I have worked on, the LifecycleStale keyword was added and the bug was auto-closed 7 days later. To prevent that here, I am removing it. Please let me know if this is wrong.

Thanks and regards.

Comment 26 Venkata Siva Teja Areti 2020-05-27 17:26:07 UTC
This needs architecture discussion and is not targeted for the current release. Moving it to 4.6.

Comment 27 Venkata Siva Teja Areti 2020-06-18 10:56:19 UTC
I am currently working on other priority items.

Comment 28 Venkata Siva Teja Areti 2020-07-09 19:07:45 UTC
I am adding the UpcomingSprint keyword as I am working on other deliverables for 4.6.

Comment 29 Venkata Siva Teja Areti 2020-07-31 18:45:19 UTC
I am working on other high priority items. I will get to this bug next sprint.

Comment 30 W. Trevor King 2020-08-03 19:04:54 UTC
We hit this in 4.4.3 -> 4.4.15 CI [1], so there should be a must-gather there with all of the data that modern CI runs collect.  If that's not sufficient to debug the issue, we probably need a separate bug about growing the data collected in CI runs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1288352996144451584

Comment 31 Venkata Siva Teja Areti 2020-08-21 22:00:38 UTC
Work on this bug will be evaluated in the next sprint.

Comment 32 Venkata Siva Teja Areti 2020-08-26 14:41:16 UTC
Closing this tracker. The same issue is tracked in the bug linked below, and a PR to resolve it has been posted.

https://bugzilla.redhat.com/show_bug.cgi?id=1858763

*** This bug has been marked as a duplicate of bug 1858763 ***

Comment 34 Red Hat Bugzilla 2023-09-15 00:17:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

