Bug 1803188 - Openshift-Apiservers restarting in a loop after Upgrade to 4.3.0 with Multi-Master
Summary: Openshift-Apiservers restarting in a loop after Upgrade to 4.3.0 with Multi-Master
Keywords:
Status: CLOSED DUPLICATE of bug 1807638
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Aniket Bhat
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-14 16:06 UTC by Markus Frahm
Modified: 2020-05-28 11:16 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-26 13:29:25 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-apiserver.log (19.54 KB, text/plain)
2020-02-17 16:34 UTC, Markus Frahm

Description Markus Frahm 2020-02-14 16:06:00 UTC
Description of problem:
Apiservers are restarted before being ready.

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979

How reproducible:
Happens after an upgrade of a cluster with several master servers.

Steps to Reproduce:
1. Upgrade from 4.2.16 to 4.3.0 with 3 master servers
2. The apiservers restart in a loop.

Actual results:
Only 2 of 3 apiservers are available; one is restarting in a loop.

Expected results:
Apiservers are all in ready state and not restarting.

Additional info:
If the failureThreshold of the readinessProbe in the daemonset is increased from 10 to 30, the apiservers are stable and do not restart. But this value is then overridden by the openshift-apiserver-operator.

Proposal: Increase the failureThreshold for the apiserver daemonset from 10 to 30 to give the apiserver more time to become ready.
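
For reference, the current threshold can be inspected and temporarily patched with standard oc commands (a sketch, assuming the DaemonSet is named "apiserver" in the openshift-apiserver namespace and the probe sits on the first container; the openshift-apiserver-operator reverts any manual change):

# show the current readinessProbe failureThreshold
oc -n openshift-apiserver get daemonset apiserver \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.failureThreshold}'

# temporarily raise it for testing; the operator will overwrite this again
oc -n openshift-apiserver patch daemonset apiserver --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/failureThreshold","value":30}]'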

Comment 1 Stefan Schimanski 2020-02-17 11:34:43 UTC
The readiness probe does not cause restarts, only the liveness probe does.

To understand what's going on we need openshift-apiserver logs.
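
As a quick way to tell the two probes apart (a sketch, assuming the default openshift-apiserver namespace): failures of either probe show up as "Unhealthy" events, while only liveness failures and crashes increase the restart counter.

# probe failures are reported as Unhealthy events
oc -n openshift-apiserver get events --field-selector reason=Unhealthy

# a rising RESTARTS column points at the liveness probe or crashes, not readiness
oc -n openshift-apiserver get pods -o wide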

Comment 2 Markus Frahm 2020-02-17 16:34:00 UTC
Created attachment 1663569 [details]
openshift-apiserver.log

Comment 3 Markus Frahm 2020-02-17 16:34:55 UTC
Attached is the log of an apiserver just before it gets killed.
The corresponding event log is:

[markus@mfrahm-pc installeruat9]$ oc get events  | grep 59pf6
<unknown>   Normal    Scheduled                pod/apiserver-59pf6   Successfully assigned openshift-apiserver/apiserver-59pf6 to master-0
7m14s       Normal    Pulled                   pod/apiserver-59pf6   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979" already present on machine
7m14s       Normal    Created                  pod/apiserver-59pf6   Created container fix-audit-permissions
7m14s       Normal    Started                  pod/apiserver-59pf6   Started container fix-audit-permissions
7m13s       Normal    Pulled                   pod/apiserver-59pf6   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979" already present on machine
7m13s       Normal    Created                  pod/apiserver-59pf6   Created container openshift-apiserver
7m13s       Normal    Started                  pod/apiserver-59pf6   Started container openshift-apiserver
7m6s        Normal    Killing                  pod/apiserver-59pf6   Stopping container openshift-apiserver

Comment 4 Stefan Schimanski 2020-02-18 10:09:02 UTC
Can you attach the kubelet log for that node? I think from the liveness probe we should only see container restarts and not recreated pods. Something else is going on here.

Please run must-gather and attach more logs (https://docs.openshift.com/container-platform/4.1/support/gathering-cluster-data.html). Next to kubelet logs we need the operator logs, events and possibly more.
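
A sketch of the collection steps mentioned above, using standard oc tooling (the node name master-0 is taken from the events in comment 3):

# collect the full diagnostic dump into a local must-gather directory
oc adm must-gather

# pull the kubelet journal from the affected node
oc adm node-logs master-0 -u kubelet > kubelet-master-0.log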

Comment 6 Markus Frahm 2020-02-21 10:50:07 UTC
Updated the cluster to 4.3.1; the problem still persists, no change.

Comment 7 Markus Frahm 2020-02-21 10:59:32 UTC
Additional info:
Increasing the failureThreshold in the readinessProbe of the apiserver daemonset from 10 to 15 is already sufficient to solve the problem.
But it is overwritten by the apiserver-operator.

Comment 9 Michal Fojtik 2020-05-12 10:45:19 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing the severity. 

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 10 Markus Frahm 2020-05-12 11:09:55 UTC
The bug still persists. No change.

Comment 11 Lukasz Szaszkiewicz 2020-05-20 09:16:24 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level.
I will revisit this bug next sprint.

Comment 12 Lukasz Szaszkiewicz 2020-05-20 14:08:07 UTC
Today I’ve been looking into this issue and here is what I’ve found:

First I went from 4.2.16 to 4.2.28 and I didn’t observe any issues. Kube/OpenShift APIs were upgraded to the new version without any interruptions. I noticed that only the network and the DNS operators didn’t upgrade to the newer version. 

Next, I went from 4.2.28 to 4.3.19. Again Kube/OpenShift APIs were upgraded to the new version without any interruptions and remained in that state until (a few minutes later) the network operator started its upgrade (from 4.2.16 to 4.3.19) procedure.

When the network operator was being upgraded:
- SSH connections from my local machine to all master nodes were periodically dropped and I had to reconnect a few times
- I wasn’t able to download must-gather - I was able to get the logs only after the operator was fully upgraded.
- openshift-apiserver pods were restarted due to failed liveness probes (connection refused)
- events from “openshift-console” suggest that “console-7489846965-fm96l” failed the check as well (connection refused)
- same for “sdn-lft85” in “openshift-sdn” (connection refused)

It looks like the network wasn’t stable and the kubelet couldn’t monitor the pods. I’m attaching the must-gather output and assigning to the network team for further investigation.
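
The state during that window can be followed with standard commands (a sketch; exact pod names vary per cluster):

# watch the network cluster operator while it rolls out the new version
oc get clusteroperator network -w

# check the SDN pods that failed their probes with "connection refused"
oc -n openshift-sdn get pods -o wide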

Comment 13 Ben Bennett 2020-05-20 19:13:14 UTC
Aniket, do you think this is resolved by the other upgrade stability work you have been doing?

Comment 17 Aniket Bhat 2020-05-22 13:03:38 UTC
Markus,

Part of the connection disruption issue, which deals with not deleting OVS flows during upgrade, has been fixed in OpenShift 4.5. We have reason to believe that this will fix the issue you are seeing with API servers restarting. We are in the process of backporting this to the 4.4.z and 4.3.z streams; it should land in a 4.3.z release soon.

Thanks,
Aniket.

Comment 18 Tomas Smetana 2020-05-26 13:29:25 UTC

*** This bug has been marked as a duplicate of bug 1807638 ***

