Bug 1803188 - Openshift-Apiservers restarting in a loop after Upgrade to 4.3.0 with Multi-Master
Summary: Openshift-Apiservers restarting in a loop after Upgrade to 4.3.0 with Multi-Master
Keywords:
Status: CLOSED DUPLICATE of bug 1807638
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Aniket Bhat
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-14 16:06 UTC by Markus Frahm
Modified: 2020-05-28 11:16 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-26 13:29:25 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-apiserver.log (19.54 KB, text/plain)
2020-02-17 16:34 UTC, Markus Frahm

Description Markus Frahm 2020-02-14 16:06:00 UTC
Description of problem:
Apiservers are restarted before being ready.

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979

How reproducible:
Happens after an upgrade of a cluster with several master servers.

Steps to Reproduce:
1. Upgrade from 4.2.16 to 4.3.0 with 3 master servers
2. The apiservers restart in a loop.

Actual results:
Only 2 of 3 apiservers are available; one is restarting in a loop.

Expected results:
Apiservers are all in ready state and not restarting.

Additional info:
If the failureThreshold of the readinessProbe in the daemonset is increased from 10 to 30, the apiservers are stable and do not restart. But this value is then overridden by the openshift-apiserver-operator.

Proposal: Increase the failureThreshold for the apiserver daemonset from 10 to 30 to give the apiserver more time to become ready.
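
For reference, the current threshold can be inspected and temporarily patched with standard oc commands (a sketch, assuming the DaemonSet is named "apiserver" in the openshift-apiserver namespace and the probe sits on the first container; the openshift-apiserver-operator reverts any manual change):

# show the current readinessProbe failureThreshold
oc -n openshift-apiserver get daemonset apiserver \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.failureThreshold}'

# temporarily raise it for testing; the operator will overwrite this again
oc -n openshift-apiserver patch daemonset apiserver --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/failureThreshold","value":30}]'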

Comment 1 Stefan Schimanski 2020-02-17 11:34:43 UTC
The readiness probe does not cause restarts, only the liveness probe does.

To understand what's going on we need openshift-apiserver logs.
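
As a quick way to tell the two probes apart (a sketch, assuming the default openshift-apiserver namespace): failures of either probe show up as "Unhealthy" events, while only liveness failures and crashes increase the restart counter.

# probe failures are reported as Unhealthy events
oc -n openshift-apiserver get events --field-selector reason=Unhealthy

# a rising RESTARTS column points at the liveness probe or crashes, not readiness
oc -n openshift-apiserver get pods -o wide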

Comment 2 Markus Frahm 2020-02-17 16:34:00 UTC
Created attachment 1663569 [details]
openshift-apiserver.log

Comment 3 Markus Frahm 2020-02-17 16:34:55 UTC
Attached is the log of an apiserver just before it gets killed.
The corresponding event log is:

[markus@mfrahm-pc installeruat9]$ oc get events  | grep 59pf6
<unknown>   Normal    Scheduled                pod/apiserver-59pf6   Successfully assigned openshift-apiserver/apiserver-59pf6 to master-0
7m14s       Normal    Pulled                   pod/apiserver-59pf6   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979" already present on machine
7m14s       Normal    Created                  pod/apiserver-59pf6   Created container fix-audit-permissions
7m14s       Normal    Started                  pod/apiserver-59pf6   Started container fix-audit-permissions
7m13s       Normal    Pulled                   pod/apiserver-59pf6   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979" already present on machine
7m13s       Normal    Created                  pod/apiserver-59pf6   Created container openshift-apiserver
7m13s       Normal    Started                  pod/apiserver-59pf6   Started container openshift-apiserver
7m6s        Normal    Killing                  pod/apiserver-59pf6   Stopping container openshift-apiserver

Comment 4 Stefan Schimanski 2020-02-18 10:09:02 UTC
Can you attach the kubelet log for that node? I think from the liveness probe we should only see container restarts and not recreated pods. Something else is going on here.

Please run must-gather and attach more logs (https://docs.openshift.com/container-platform/4.1/support/gathering-cluster-data.html). Next to kubelet logs we need the operator logs, events and possibly more.
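
A sketch of the collection steps mentioned above, using standard oc tooling (the node name master-0 is taken from the events in comment 3):

# collect the full diagnostic dump into a local must-gather directory
oc adm must-gather

# pull the kubelet journal from the affected node
oc adm node-logs master-0 -u kubelet > kubelet-master-0.log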

Comment 6 Markus Frahm 2020-02-21 10:50:07 UTC
Updated the cluster to 4.3.1; the problem still persists, no change.

Comment 7 Markus Frahm 2020-02-21 10:59:32 UTC
Additional info:
Increasing the failureThreshold in the readinessProbe of the apiserver daemonset from 10 to 15 is already sufficient to solve the problem.
But it is overwritten by the apiserver-operator.

Comment 9 Michal Fojtik 2020-05-12 10:45:19 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing the severity. 

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 10 Markus Frahm 2020-05-12 11:09:55 UTC
The bug still persists. No change.

Comment 11 Lukasz Szaszkiewicz 2020-05-20 09:16:24 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level.
I will revisit this bug next sprint.

Comment 12 Lukasz Szaszkiewicz 2020-05-20 14:08:07 UTC
Today I’ve been looking into this issue and here is what I’ve found:

First I went from 4.2.16 to 4.2.28 and I didn’t observe any issues. Kube/OpenShift APIs were upgraded to the new version without any interruptions. I noticed that only the network and the DNS operators didn’t upgrade to the newer version. 

Next, I went from 4.2.28 to 4.3.19. Again Kube/OpenShift APIs were upgraded to the new version without any interruptions and remained in that state until (a few minutes later) the network operator started its upgrade (from 4.2.16 to 4.3.19) procedure.

When the network operator was being upgraded:
- SSH connections from my local machine to all master nodes were periodically dropped and I had to reconnect a few times
- I wasn’t able to download must-gather - I was able to get the logs only after the operator was fully upgraded.
- openshift-apiserver pods were restarted due to failed liveness probes (connection refused)
- events from “openshift-console” suggest that “console-7489846965-fm96l” failed the check as well (connection refused)
- same for “sdn-lft85” in “openshift-sdn” (connection refused)

It looks like the network wasn’t stable and the kubelet couldn’t monitor the pods. I’m attaching the must-gather output and assigning to the network team for further investigation.
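
The state during that window can be followed with standard commands (a sketch; exact pod names vary per cluster):

# watch the network cluster operator while it rolls out the new version
oc get clusteroperator network -w

# check the SDN pods that failed their probes with "connection refused"
oc -n openshift-sdn get pods -o wide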

Comment 13 Ben Bennett 2020-05-20 19:13:14 UTC
Aniket, do you think this is resolved by the other upgrade stability work you have been doing?

Comment 17 Aniket Bhat 2020-05-22 13:03:38 UTC
Markus,

Part of the connection disruption issue, which deals with not deleting OVS flows during upgrade, has been fixed in OpenShift 4.5. We have reason to believe that this will fix the issue you are seeing with API servers restarting. We are in the process of backporting this to the 4.4.z and 4.3.z streams; it should land in a 4.3.z release soon.

Thanks,
Aniket.

Comment 18 Tomas Smetana 2020-05-26 13:29:25 UTC

*** This bug has been marked as a duplicate of bug 1807638 ***

