Bug 1723966

Summary: During upgrade, kube-apiserver operator reporting: NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: kube-apiserver    Assignee: Stefan Schimanski <sttts>
Status: CLOSED CURRENTRELEASE QA Contact: Xingxing Xia <xxia>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.z    CC: aos-bugs, brad.williams, deads, dollierp, florin-alexandru.peter, igor.tiunov, jokerman, kewang, mfojtik, palonsor, sttts, wking
Target Milestone: ---    Keywords: DeliveryBlocker, Reopened, Upgrades
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-12 09:23:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
audit log mentions of installer pod      none
node / operator listings                 none

Description Justin Pierce 2019-06-25 21:46:28 UTC
Description of problem:


Version-Release number of selected component (if applicable):
During upgrade from 4.1.2 to 4.1.3

How reproducible:
1 out of 3 upgrades

Steps to Reproduce:
1. Trigger an upgrade from 4.1.2 to 4.1.3
2.
3.

Actual results:
Messages like the following were reported for multiple hours:
02:49:47.109  ClusterOperator not fully ready: kube-apiserver
02:49:47.109  	Degraded=True  :: NodeInstallerDegraded: 1 nodes are failing on revision 22:
02:49:47.109  NodeInstallerDegraded: pods "installer-22-ip-10-0-129-159.us-west-2.compute.internal" not found
02:49:47.109  	Progressing=True  :: Progressing: 1 nodes are at revision 20; 2 nodes are at revision 22
02:49:47.109  	Available=True  :: Available: 3 nodes are active; 1 nodes are at revision 20; 2 nodes are at revision 22


Additional info:
Upgrade was unable to complete. See attachments for additional details.

Comment 2 Justin Pierce 2019-06-25 21:51:46 UTC
Created attachment 1584478 [details]
audit log mentions of installer pod

Comment 3 Justin Pierce 2019-06-25 21:52:56 UTC
Created attachment 1584479 [details]
node / operator listings

Comment 19 David Eads 2019-08-22 21:37:28 UTC
One issue I found while doing this is that our operator reads pods to determine status.  That status becomes unreliable as pods are deleted.  I still suspect that a pod is being deleted unexpectedly (all pods on a node actually), but we can make our operator more resilient by using a non-pod resource to track the status of whether installer pods are successful or not.  It's fairly involved surgery, but I don't see a way to reliably function otherwise when pods can be deleted by other actors.
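
A minimal sketch of the approach described above, assuming a hypothetical ConfigMap named "installer-results" and hypothetical helper names (this is not the actual operator code): persist each installer pod's terminal phase in a ConfigMap so the operator can still evaluate node status after the pod itself has been deleted.

package installerstate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resultsConfigMap is a hypothetical ConfigMap that outlives the installer pods.
const resultsConfigMap = "installer-results"

// RecordInstallerResult stores the terminal phase of an installer pod keyed by
// the pod name (e.g. "installer-22-ip-10-0-129-159...": "Succeeded").
func RecordInstallerResult(ctx context.Context, client kubernetes.Interface, ns string, pod *corev1.Pod) error {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, resultsConfigMap, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: resultsConfigMap, Namespace: ns},
			Data:       map[string]string{pod.Name: string(pod.Status.Phase)},
		}
		_, err = client.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[pod.Name] = string(pod.Status.Phase)
	_, err = client.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

// InstallerSucceeded answers "did this installer pod succeed?" from the recorded
// result instead of the pod object, which may already have been deleted.
func InstallerSucceeded(ctx context.Context, client kubernetes.Interface, ns, podName string) (bool, error) {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, resultsConfigMap, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return cm.Data[podName] == string(corev1.PodSucceeded), nil
}

With something along these lines, the status check could consult the recorded results first and only fall back to listing pods, so a deleted installer pod would no longer leave the operator Degraded with "pods ... not found".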

Comment 24 Venkata Siva Teja Areti 2020-05-20 12:27:38 UTC
I’m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available.

Comment 25 Pablo Alonso Rodriguez 2020-05-20 12:29:53 UTC
In other bugs I have worked on, LifecycleStale was added and the bug was auto-closed 7 days later. To prevent that from happening here, I am removing it. Please let me know if this is wrong.

Thanks and regards.

Comment 26 Venkata Siva Teja Areti 2020-05-27 17:26:07 UTC
This needs architecture discussions and is not targeted for the current release. Moving it to 4.6.

Comment 27 Venkata Siva Teja Areti 2020-06-18 10:56:19 UTC
I am currently working on other priority items.

Comment 28 Venkata Siva Teja Areti 2020-07-09 19:07:45 UTC
I am adding the UpcomingSprint keyword as I am working on other deliverables for 4.6.

Comment 29 Venkata Siva Teja Areti 2020-07-31 18:45:19 UTC
I am working on other high priority items. I will get to this bug next sprint.

Comment 30 W. Trevor King 2020-08-03 19:04:54 UTC
We hit this in a 4.4.3 -> 4.4.15 CI run [1], so there should be a must-gather there with everything that modern CI runs collect. If that's not sufficient to debug the issue, we probably need a separate bug about growing the data collected in CI runs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1288352996144451584

Comment 31 Venkata Siva Teja Areti 2020-08-21 22:00:38 UTC
Work on this bug will be evaluated in the next sprint.

Comment 32 Venkata Siva Teja Areti 2020-08-26 14:41:16 UTC
Closing this tracker. The same issue is tracked in the bug linked below, and a PR to resolve it has been posted.

https://bugzilla.redhat.com/show_bug.cgi?id=1858763

*** This bug has been marked as a duplicate of bug 1858763 ***

Comment 34 Red Hat Bugzilla 2023-09-15 00:17:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days