Description of problem:

Version-Release number of selected component (if applicable):
During upgrade from 4.1.2 to 4.1.3

How reproducible:
1 out of 3 upgrades

Steps to Reproduce:
1. Trigger an upgrade from 4.1.2 to 4.1.3
2.
3.

Actual results:
Messages like the following were reported for multiple hours:

02:49:47.109 ClusterOperator not fully ready: kube-apiserver
02:49:47.109   Degraded=True :: NodeInstallerDegraded: 1 nodes are failing on revision 22:
02:49:47.109   NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
02:49:47.109   Progressing=True :: Progressing: 1 nodes are at revision 20; 2 nodes are at revision 22
02:49:47.109   Available=True :: Available: 3 nodes are active; 1 nodes are at revision 20; 2 nodes are at revision 22

Additional info:
The upgrade was unable to complete. See attachments for additional details.
Created attachment 1584478 [details] audit log mentions of installer pod
Created attachment 1584479 [details] node / operator listings
One issue I found while doing this is that our operator reads pods to determine status. That status becomes unreliable once pods are deleted. I still suspect that a pod is being deleted unexpectedly (all pods on a node, actually), but we can make our operator more resilient by using a non-pod resource to track whether installer pods succeeded. It's fairly involved surgery, but I don't see a way to function reliably otherwise when pods can be deleted by other actors.
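A minimal sketch of that idea, assuming client-go and a hypothetical ConfigMap-based record that the installer pod would write before exiting (the resource name, naming convention, and data format below are illustrative, not the operator's actual API):

package installerstatus

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// installerSucceeded first checks the installer pod, then falls back to a
// durable record (a ConfigMap) if the pod has already been deleted by
// another actor, instead of immediately reporting the node as degraded.
func installerSucceeded(ctx context.Context, client kubernetes.Interface, ns string, revision int, node string) (bool, error) {
	podName := fmt.Sprintf("installer-%d-%s", revision, node)

	pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err == nil {
		return pod.Status.Phase == corev1.PodSucceeded, nil
	}
	if !apierrors.IsNotFound(err) {
		return false, err
	}

	// Pod is gone; consult the durable record. "installer-status-<rev>-<node>"
	// is an assumed naming convention for this sketch only.
	cmName := fmt.Sprintf("installer-status-%d-%s", revision, node)
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if err != nil {
		// Includes NotFound: no record exists, so the result is unknown.
		return false, err
	}
	return cm.Data["result"] == "succeeded", nil
}

The key design point is that the fallback resource is written by the installer itself and is not garbage-collected with the pod, so status reporting survives pod deletion.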
I'm adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available.
In other bugs I have worked on, LifecycleStale was added and the bug was auto-closed 7 days later. To prevent that, I am removing it. Please let me know if this is wrong. Thanks and regards.
This needs arch discussions and is not targeted for the current release. Moving it to 4.6.
I am currently working on other priority items.
I am adding the UpcomingSprint keyword, as I am working on other deliverables for 4.6.
I am working on other high priority items. I will get to this bug next sprint.
We hit this in 4.4.3 -> 4.4.15 CI [1], so there should be a must-gather there with all the stuff that modern CI runs collect in place. If that's not sufficient to debug the issue, we probably need a separate bug about growing the data collected in CI runs. [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1288352996144451584
Work on this bug will be evaluated next sprint.
Closing this tracker. The same issue is tracked at the link below, and a PR to resolve it has been posted. https://bugzilla.redhat.com/show_bug.cgi?id=1858763 *** This bug has been marked as a duplicate of bug 1858763 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days