Bug 1684547
Summary: | kube-apiserver certificate rotation causes API service impact | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Sebastian Jug <sejug> |
Component: | Master | Assignee: | David Eads <deads> |
Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
Severity: | high | Docs Contact: | |
Priority: | urgent | | |
Version: | 4.1.0 | CC: | akrzos, aos-bugs, ekuric, florin-alexandru.peter, hongkliu, jeder, jmencak, jokerman, maszulik, mifiedle, mmccomas, nelluri, sponnaga, wsun, xtian, xxia |
Target Milestone: | --- | Keywords: | TestBlocker |
Target Release: | 4.1.0 | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | aos-scalability-41 | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2019-06-04 10:44:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Sebastian Jug
2019-03-01 14:14:42 UTC
Sounds similar to bug 1678847? That bug has more comments and logs, FYI.

@Xingxing I agree it's the same symptom. I suppose the difference is that we now expect the kube-apiserver pods to restart, but we want to ensure that the restarts don't cause apiserver outages and errors. Bug 1684602 appears to "avoid unnecessary restarts".

We are going to avoid restarts on cert rotations before we ship 4.0. David is working on dynamic cert reloading.

Just so I'm clear: "dynamic cert reloading" means that a client connection to the apiserver whose cert is reloaded will *not* be interrupted?

We'll see how the golang stack handles it. If the default stack doesn't terminate connections, they'll remain open. If it does terminate connections, then the connections will be broken. (An illustrative sketch of this approach appears after the comment stream below.)

*** Bug 1688503 has been marked as a duplicate of this bug. ***

The fix landed in https://github.com/openshift/origin/pull/22322 and https://github.com/openshift/installer/pull/1421

Same reliability bug 1678847 is closed. This bug 1684547 should be kept for verification. Hongkai Liu, could you help check this bug?

(In reply to Xingxing Xia from comment #12)
> Same reliability bug 1678847 is closed. This bug 1684547 should be kept for
> verification. Hongkai Liu, could you help check this bug?

Hi Sebastian, can you help verify whether the bug is fixed? The PRs in Comment 11 have been merged. Thanks.

(In reply to Hongkai Liu from comment #13)
> Hi Sebastian,
>
> can you help verify whether the bug is fixed? The PRs in Comment 11 have been
> merged.
>
> Thanks.

Yes, I see that. What build are the fixes in?

Both PRs got merged 3-4 days ago. My best guess would be to just use the latest nightly builds. ^_^

Hongkai, Sebastian, after the fix lands in a payload, please help verify this bug; since it is a reliability issue, the SVT team seems better placed to check it. Thank you!

Thanks, Xingxing. I will keep checking whether the PRs are included. Glad to see the commands to check; I did not know them before. Checked the latest green build for the moment, 4.0.0-0.nightly-2019-03-20-153904: the PR is not there yet.

    # BUILD_TAG=4.0.0-0.nightly-2019-03-20-153904
    # IMAGE_NAME=hyperkube
    # oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:${BUILD_TAG} | grep "${IMAGE_NAME}"
    hyperkube https://github.com/openshift/ose bfd0e7ce8aa0777eb7d8022bee8eb831c08ecb28
    # COMMIT_HASH=bfd0e7ce8aa0777eb7d8022bee8eb831c08ecb28
    # PR_NUMBER=#22322
    # git clone https://github.com/openshift/ose
    # cd ose/
    # git log --oneline "${COMMIT_HASH}" | grep "${PR_NUMBER}"

FYI, all above PRs landed in 4.0.0-0.nightly-2019-03-23-222829 (the latest Accepted build as of now), please have a check, thanks.

Please help check whether it can be verified, thanks.

Hi Sebastian, please help verify. Thanks.

Yes... I'm not having any luck installing new builds. I was able to get the clusters up yesterday afternoon, but now the issue is that there's no way to manually trigger cert rotation, and as of now no user-configurable way to change the rotation duration either. With the increase of the cert rotation period to 1 month, this is not very easy to verify.

Some workarounds were tried without success. However, since we do have the 31-day cert rotation and we know from bug 1688820 that we can run commands without error for more than 24 hours, I am removing the BetaBlocker flag. The current state of things should be fine for beta customers.
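For illustration, a minimal Go sketch of the dynamic cert reloading idea discussed above, assuming a server that hands out certificates through a tls.Config.GetCertificate callback; the file paths, reload interval, and listen address are hypothetical and not taken from the OpenShift PRs:

```go
// Hedged sketch: dynamic serving-cert reloading with crypto/tls.
// Paths, interval, and address are illustrative, not the kube-apiserver's configuration.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var current atomic.Value // holds the latest *tls.Certificate

	load := func() error {
		cert, err := tls.LoadX509KeyPair("/etc/certs/tls.crt", "/etc/certs/tls.key")
		if err != nil {
			return err
		}
		current.Store(&cert)
		return nil
	}
	if err := load(); err != nil {
		log.Fatal(err)
	}

	// Periodically re-read the key pair; rotation on disk is picked up
	// without restarting the process or closing existing connections.
	go func() {
		for range time.Tick(time.Minute) {
			if err := load(); err != nil {
				log.Printf("cert reload failed: %v", err)
			}
		}
	}()

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			// Each new TLS handshake asks for the certificate, so it always
			// sees the latest one; connections negotiated with the old cert
			// stay open until the client closes them.
			GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
				return current.Load().(*tls.Certificate), nil
			},
		},
	}
	// Empty cert/key file names: the certificate comes from GetCertificate.
	log.Fatal(server.ListenAndServeTLS("", ""))
}
```

Whether existing client connections survive a rotation then depends only on the TLS stack's behavior described above, not on restarting the server process.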
Verified the /readyz endpoint is available and returning the correct status (a hedged example of such a check appears at the end of this report). Build verified on 4.0.0-0.nightly-2019-03-28-030453.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
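For illustration, a minimal Go sketch of the kind of /readyz availability check described in the verification comment above, assuming an external poller; the API URL, polling interval, and TLS settings are hypothetical:

```go
// Hedged sketch: poll the kube-apiserver /readyz endpoint and log any
// failures observed while certificates rotate. The API host, polling
// interval, and InsecureSkipVerify setting are illustrative assumptions.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// For a real check, trust the cluster CA instead of skipping verification.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	url := "https://api.example-cluster:6443/readyz"
	for {
		resp, err := client.Get(url)
		switch {
		case err != nil:
			fmt.Printf("%s ERROR %v\n", time.Now().Format(time.RFC3339), err)
		case resp.StatusCode != http.StatusOK:
			fmt.Printf("%s NOT READY %d\n", time.Now().Format(time.RFC3339), resp.StatusCode)
			resp.Body.Close()
		default:
			resp.Body.Close()
		}
		time.Sleep(2 * time.Second)
	}
}
```

Any ERROR or NOT READY lines printed during a certificate rotation would indicate the kind of API service impact this bug tracks.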