Description of problem:
Apiservers are restarted before being ready.

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979

How reproducible:
Happens after upgrade with several master servers.

Steps to Reproduce:
1. Upgrade from 4.2.16 to 4.3.0 with 3 master servers
2. Apiservers restart in a loop

Actual results:
Only 2 of 3 apiservers are available; one is restarting in a loop.

Expected results:
All apiservers are in ready state and not restarting.

Additional info:
If the failureThreshold of the readinessProbe in the daemonset is increased from 10 to 30, the apiservers are stable and do not restart. However, this value is then overridden by the openshift-apiserver-operator.

Proposal: Increase the failureThreshold for the apiserver daemonset from 10 to 30 to give the apiserver more time to become ready.
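As a rough sanity check on the proposal, here is a minimal sketch of the time budget the readiness probe allows before the pod is marked unready. The periodSeconds value of 10 is an assumption for illustration, not taken from the actual daemonset spec:

```shell
#!/bin/sh
# Unready window = periodSeconds * failureThreshold.
# PERIOD=10 is an assumed probe interval; check the daemonset's readinessProbe spec.
PERIOD=10
for THRESHOLD in 10 30; do
  echo "failureThreshold=${THRESHOLD} -> $((PERIOD * THRESHOLD))s of failed probes tolerated"
done
```

Under these assumed values, raising the threshold from 10 to 30 triples the window of tolerated probe failures.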
The readiness probe does not cause restarts, only the liveness probe does. To understand what's going on we need openshift-apiserver logs.
Created attachment 1663569 [details] openshift-apiserver.log
Attached is the log of an apiserver just before it gets killed. The corresponding event log is:

[markus@mfrahm-pc installeruat9]$ oc get events | grep 59pf6
<unknown>  Normal  Scheduled  pod/apiserver-59pf6  Successfully assigned openshift-apiserver/apiserver-59pf6 to master-0
7m14s      Normal  Pulled     pod/apiserver-59pf6  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979" already present on machine
7m14s      Normal  Created    pod/apiserver-59pf6  Created container fix-audit-permissions
7m14s      Normal  Started    pod/apiserver-59pf6  Started container fix-audit-permissions
7m13s      Normal  Pulled     pod/apiserver-59pf6  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:248d2d3c94484a907bcf123d371f5349034802be877fc0cdab5391acf87da979" already present on machine
7m13s      Normal  Created    pod/apiserver-59pf6  Created container openshift-apiserver
7m13s      Normal  Started    pod/apiserver-59pf6  Started container openshift-apiserver
7m6s       Normal  Killing    pod/apiserver-59pf6  Stopping container openshift-apiserver
Can you attach the kubelet log for that node? I think from the liveness probe we should only see container restarts, not recreated pods. Something else is going on here. Please run must-gather and attach more logs (https://docs.openshift.com/container-platform/4.1/support/gathering-cluster-data.html). Next to kubelet logs we need the operator logs, events, and possibly more.
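For reference, gathering that data typically looks like the following sketch. The destination directory and the node name "master-0" are placeholders; the script skips gracefully when no `oc` client is available:

```shell
#!/bin/sh
# Requires cluster access; bail out cleanly when the oc client is not installed.
command -v oc >/dev/null 2>&1 || { echo "oc not found; run this on a machine with cluster access"; exit 0; }

# Collect cluster-wide diagnostic data (operator logs, events, and more)
oc adm must-gather --dest-dir=./must-gather

# Kubelet journal from one node; "master-0" is a placeholder node name
oc adm node-logs master-0 -u kubelet > kubelet.log
```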
Here are links to the requested log files:
https://www.dropbox.com/s/gdwb4yqy2pjbhq8/must-gather.tar.gz?dl=0
https://www.dropbox.com/s/9zuto26ys47asje/kubelet.log.gz?dl=0
Updated the cluster to 4.3.1. The problem still persists, no change.
Additional info: increasing the failureThreshold in the readinessProbe of the apiserver daemonset from 10 to 15 is sufficient to solve the problem, but it is overwritten by the apiserver-operator.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity. If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
The bug still persists. No change.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Today I’ve been looking into this issue and here is what I’ve found:

First I went from 4.2.16 to 4.2.28 and I didn’t observe any issues. Kube/OpenShift APIs were upgraded to the new version without any interruptions. I noticed that only the network and the DNS operators didn’t upgrade to the newer version.

Next, I went from 4.2.28 to 4.3.19. Again, Kube/OpenShift APIs were upgraded to the new version without any interruptions and remained in that state until (a few minutes later) the network operator started its upgrade procedure (from 4.2.16 to 4.3.19).

While the network operator was being upgraded:
- SSH connections from my local machine to all master nodes were periodically dropped and I had to reconnect a few times
- I wasn’t able to download must-gather; I was able to get the logs only after the operator was fully upgraded
- openshift-apiserver pods were restarted due to failed liveness probes (connection refused)
- events from “openshift-console” suggest that “console-7489846965-fm96l” failed the check as well (connection refused)
- same for “sdn-lft85” in “openshift-sdn” (connection refused)

It looks like the network wasn’t stable and kubelet couldn’t monitor the pods. I’m attaching must-gather and assigning to the network team for further investigation.
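A quick way to see which pods were failing their probes during the upgrade window is to filter events by the Unhealthy reason in the affected namespaces. A sketch (the namespace list matches the components mentioned above; the script skips cleanly without an `oc` client):

```shell
#!/bin/sh
# Requires cluster access; bail out cleanly when the oc client is not installed.
command -v oc >/dev/null 2>&1 || { echo "oc not found; run this on a machine with cluster access"; exit 0; }

# Failed readiness/liveness probes are recorded as events with reason=Unhealthy
for ns in openshift-apiserver openshift-console openshift-sdn; do
  echo "== ${ns} =="
  oc get events -n "${ns}" --field-selector reason=Unhealthy
done
```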
Aniket, do you think this is resolved by the other upgrade stability work you have been doing?
Markus, the part of the connection disruption issue that deals with not deleting OVS flows during upgrade has been fixed in OpenShift 4.5. We have reason to believe this will potentially fix the issue you are seeing with API servers restarting. We are in the process of backporting it to the 4.4.z and 4.3.z streams; it should land in a 4.3.z release soon. Thanks, Aniket.
*** This bug has been marked as a duplicate of bug 1807638 ***