Description of problem:
During kube-apiserver certificate rotation, client API calls see multiple failures:

get pods:
  an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (get pods)

create pods:
  Unexpected error occurred: Post https://api.ravi-scale-cluster114.perf-testing.devcluster.openshift.com:6443/api/v1/namespaces/clusterproject0/pods: read tcp 10.0.0.191:56114->10.0.175.137:6443: read: connection reset by peer

Can the deployment of the apiserver be handled more gracefully, with the load balancer moving requests to the "new" pods and preventing requests from being sent to the "old" ones, or do we expect consumers/users of the API to have to adapt their tooling to this?

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-27-213933   True        False         22h     Cluster version is 4.0.0-0.nightly-2019-02-27-213933

How reproducible:
Every time

Steps to Reproduce:
1. Create a test that uses wait.Poll() to wrap a pod List() (for example, github.com/openshift/origin/test/extended/util.WaitForPods()); see the sketch under Additional info below
2. Run the test during apiserver certificate rotation
3. Observe errors related to apiserver certificate rotation

Actual results:
The client sees errors from requests made during the apiserver redeployment.

Expected results:
The client is unaffected by the apiserver redeployment.

Additional info:
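For illustration only, here is a minimal sketch of the kind of polling loop described in step 1. This is not the origin test code itself; it assumes a recent client-go, and the namespace name and the poll/timeout values are placeholders. Any List() error is returned straight out of the poll, which is how the "apiserver is shutting down" and connection-reset errors above surface to the caller.

// Minimal sketch of the polling loop described in step 1 (illustrative only;
// "clusterproject0" and the 5s/60s values are placeholders).
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForRunningPods(clientset kubernetes.Interface, namespace string) error {
	// Poll every 5s for up to 60s; any List() error is returned immediately,
	// which is where the apiserver-rollout errors show up.
	return wait.Poll(5*time.Second, 60*time.Second, func() (bool, error) {
		pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, err
		}
		for _, pod := range pods.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false, nil
			}
		}
		return true, nil
	})
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	if err := waitForRunningPods(clientset, "clusterproject0"); err != nil {
		fmt.Println("error:", err)
	}
}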
Sounds similar to bug 1678847? That bug has more comments and logs, FYI.
@Xingxing I agree it's the same symptom. I suppose the difference is that we now expect the kube-apiserver pods to restart, but we want to ensure that it doesn't cause apiserver outages & errors.
Bug 1684602 appears to "avoid unnecessary restarts"
We are going to avoid restarts on cert rotations before we ship 4.0. David is working on dynamic cert reloading.
Just so I'm clear: "dynamic cert reloading" makes it so that a client connection to the apiserver having its cert reloaded will *not* be interrupted?
We'll see how the golang stack handles it. If the default stack doesn't terminate connections, they'll remain open. If it does terminate connections, then the connections will be broken.
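To make the point above concrete, "dynamic cert reloading" at the Go TLS layer boils down to serving certificates through a callback that is consulted per handshake, so a rotated certificate only affects new connections; connections that are already established keep whatever certificate they negotiated. The following is only an illustrative sketch of that mechanism, not the kube-apiserver's actual implementation; the file paths, port, and reload interval are placeholders.

// Illustrative sketch only: a TLS listener whose certificate is re-read for new
// handshakes, so a rotated cert is picked up without restarting the process.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"sync"
	"time"
)

type reloadingCert struct {
	mu       sync.Mutex
	cert     *tls.Certificate
	lastLoad time.Time
}

// getCertificate is called by crypto/tls for each new handshake; connections
// that are already established are not touched when the cert on disk changes.
func (r *reloadingCert) getCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cert == nil || time.Since(r.lastLoad) > 10*time.Second {
		cert, err := tls.LoadX509KeyPair("/etc/serving-certs/tls.crt", "/etc/serving-certs/tls.key")
		if err != nil {
			return nil, err
		}
		r.cert = &cert
		r.lastLoad = time.Now()
	}
	return r.cert, nil
}

func main() {
	reloader := &reloadingCert{}
	server := &http.Server{
		Addr:      ":6443",
		Handler:   http.NewServeMux(),
		TLSConfig: &tls.Config{GetCertificate: reloader.getCertificate},
	}
	// Cert and key come from GetCertificate, so the file arguments stay empty.
	log.Fatal(server.ListenAndServeTLS("", ""))
}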
*** Bug 1688503 has been marked as a duplicate of this bug. ***
The fix landed in https://github.com/openshift/origin/pull/22322 and https://github.com/openshift/installer/pull/1421
Same reliability bug 1678847 is closed. This bug 1684547 should be kept for verification. Hongkai Liu, could you help check this bug?
(In reply to Xingxing Xia from comment #12)
> Same reliability bug 1678847 is closed. This bug 1684547 should be kept for
> verification. Hongkai Liu, could you help check this bug?

Hi Sebastian,

can you help verify whether the bug is fixed? The PRs in Comment 11 have been merged.

Thanks.
(In reply to Hongkai Liu from comment #13)
> (In reply to Xingxing Xia from comment #12)
> > Same reliability bug 1678847 is closed. This bug 1684547 should be kept for
> > verification. Hongkai Liu, could you help check this bug?
>
> Hi Sebastian,
>
> can you help verify whether the bug is fixed? The PRs in Comment 11 have been
> merged.
>
> Thanks.

Yes, I see that. What build are the fixes in?
Both PRs got merged 3 or 4 days ago. My best guess would be to just use the latest nightly builds. ^_^
Hongkai, Sebastian, after the fix lands in a payload, please help verify this bug. Since it is a reliability issue, the SVT team seems better placed to check it. Thank you!
Thanks, Xingxing. I will keep checking whether the PRs are included. Glad to see the commands for checking; I did not know about them before.
Checked the latest green build for the moment: 4.0.0-0.nightly-2019-03-20-153904
The PR is not there yet.

# BUILD_TAG=4.0.0-0.nightly-2019-03-20-153904
# IMAGE_NAME=hyperkube
# oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:${BUILD_TAG} | grep "${IMAGE_NAME}"
hyperkube https://github.com/openshift/ose bfd0e7ce8aa0777eb7d8022bee8eb831c08ecb28
# COMMIT_HASH=bfd0e7ce8aa0777eb7d8022bee8eb831c08ecb28
# PR_NUMBER=#22322
# git clone https://github.com/openshift/ose
# cd ose/
# git log --oneline "${COMMIT_HASH}" | grep "${PR_NUMBER}"
FYI, all above PRs landed in 4.0.0-0.nightly-2019-03-23-222829 (latest Accepted one as of now), please have a check, thanks.
Please help check whether it can be verified, thanks.
Hi Sebastian, Please help verify. Thanks.
Yes... I'm not having any luck installing new builds.
I was able to get the clusters up yesterday afternoon, but now the issue is that there's no way to manually trigger cert rotation, and as of now there is no user-configurable way to change the rotation duration either.
With the increase of the cert rotation interval to 1 month, this is not very easy to verify. Some workarounds were tried without success. However, since we do have the 31-day cert rotation and we know from bug 1688820 that we can run commands without error for more than 24 hours, I am removing the BetaBlocker flag. The current state of things should be fine for beta customers.
Verified the /readyz endpoint is available and returning the correct status. Build verified on 4.0.0-0.nightly-2019-03-28-030453.
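The comment above does not show how the check was run; as a hedged illustration only, one way to spot-check /readyz with client-go (assuming a kubeconfig at the default location) is:

// Illustrative spot check of the kube-apiserver's /readyz endpoint.
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// GET /readyz?verbose returns the overall status plus each readiness check.
	raw, err := clientset.Discovery().RESTClient().
		Get().
		AbsPath("/readyz").
		Param("verbose", "true").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}

The verbose parameter makes the endpoint list each individual readiness check along with the overall status.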
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758