Description of problem:
After a certain amount of time (overnight), the openshift-apiserver pods start reporting x509 failures.

Version-Release number of selected component (if applicable):
master

How reproducible:
100%

Steps to Reproduce:
1. Create a cluster with the installer from master
2. Wait

Actual results:
The openshift-apiserver pods die silently (they still show as Running), yet they are not serving anything and their logs show:

E0314 13:40:24.040295 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]

The openshift-apiserver clusteroperator status at this point:

openshift-apiserver   4.0.0-0.alpha-2019-03-12-153711   False   False   False   39h

Expected results:
openshift-apiserver is healthy and OpenShift resources are available.
*** Bug 1688824 has been marked as a duplicate of this bug. ***
Additional info: removing the pods makes stuff work again
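For reference, a rough sketch of checking for and recovering from this state (the pod name is a placeholder; use any pod in the openshift-apiserver namespace):

# pods still report Running even though nothing is served:
oc get pods -n openshift-apiserver
# their logs show the repeated authentication failures:
oc logs <apiserver-pod> -n openshift-apiserver | grep x509
# workaround, per the note above - delete the pods so they get recreated:
oc delete pods --all -n openshift-apiserver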
Same as bug 1688147 / bug 1688503
*** Bug 1688147 has been marked as a duplicate of this bug. ***
WIP PR is here for the rotation period increase: https://github.com/openshift/cluster-kube-apiserver-operator/pull/338
Note that this is still a GA blocker.
*** Bug 1690101 has been marked as a duplicate of this bug. ***
This was fixed in https://github.com/openshift/library-go/pull/311
Will continue checking tomorrow.
Checked today's latest build 4.0.0-0.nightly-2019-03-19-004004 env, which already includes the above https://github.com/openshift/cluster-kube-apiserver-operator/pull/338. So far, after 9 hours, the bug has not been hit.

BTW, I can check whether a payload includes a cluster-kube-apiserver-operator PR by running `oc adm release info --commits $PAYLOAD` to get the cluster-kube-apiserver-operator commit and then `git log $COMMIT | grep '$PR_NUMBER'` under the cluster-kube-apiserver-operator repo. But how can I check whether a payload already includes a library-go PR? Check the library-go commit in the cluster-kube-apiserver-operator glide.lock, then check `git log` under the library-go repo?
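Spelling out the cluster-kube-apiserver-operator check above as a rough sketch (PAYLOAD, COMMIT and PR_NUMBER are placeholders, e.g. PR_NUMBER=338):

oc adm release info --commits "$PAYLOAD" | grep cluster-kube-apiserver-operator
# note the commit sha printed for the operator, then, in a clone of the repo:
cd cluster-kube-apiserver-operator
git log "$COMMIT" | grep "#$PR_NUMBER"    # a match on the merge commit means the PR is in the payload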
Although I said above that "after 9 hours the bug is still not hit", after about 19 hours the bug is hit:

oc get po -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-fmj9t   1/1     Running   0          19h
apiserver-r4867   1/1     Running   0          19h
apiserver-rc6gs   1/1     Running   0          19h

See the time gap between the pod's first log line and its first x509 log:

oc logs apiserver-fmj9t -n openshift-apiserver
I0319 05:10:06.835679 1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0319 05:10:06.835759 1 plugins.go:84] Registered admission plugin "Initializers"
...
E0320 00:45:45.666531 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]
E0320 00:45:45.884978 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]

The above commands can run, but other commands cannot:

oc get route
error: You must be logged in to the server (Unauthorized)
oc project openshift-apiserver
error: You must be logged in to the server (Unauthorized)
oc whoami
error: You must be logged in to the server (Unauthorized)
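A quick way to see that gap without reading the whole log (just a sketch, using the pod name from above):

oc logs apiserver-fmj9t -n openshift-apiserver | head -1        # pod start time
oc logs apiserver-fmj9t -n openshift-apiserver | grep -m1 x509  # first x509 failure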
oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-19-004004
...
cluster-kube-apiserver-operator   https://github.com/openshift/cluster-kube-apiserver-operator   887bdfc3431fb56fb113166aa43a7787178872f3
...

git clone git:openshift/cluster-kube-apiserver-operator.git
cd cluster-kube-apiserver-operator
git log --date=local --pretty="%h %an %cd - %s" 887bdfc3431fb56fb113166aa43a7787178872f3 -1 -- vendor/github.com/openshift/library-go/
408fa0dc Michal Fojtik Mon Mar 18 23:10:51 2019 - bump library-go to fix cert syncer

From the above info, in the latest build 4.0.0-0.nightly-2019-03-19-004004 env, the cluster-kube-apiserver-operator component has already bumped https://github.com/openshift/library-go/pull/311. But I checked the cluster-openshift-apiserver-operator and hyperkube/hypershift components, and they have not bumped it yet. Not sure whether they need the bump too.
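For the open question above about library-go, a sketch of the glide.lock approach (assuming the consumer repo still pins library-go via glide; PINNED_SHA and PR_MERGE_COMMIT are placeholders):

# in a clone of the consumer repo (e.g. cluster-openshift-apiserver-operator), at the commit the payload was built from:
grep -A 2 'name: github.com/openshift/library-go' glide.lock   # note the pinned "version:" sha
# then, in a clone of library-go:
cd library-go
git merge-base --is-ancestor "$PR_MERGE_COMMIT" "$PINNED_SHA" && echo "library-go PR is included"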
This PR should fix the problem: https://github.com/openshift/origin/pull/22395
PR merged, let's move to QE.
The PR didn't land in yesterday's latest payload, but it landed in today's latest payload. I will verify it today.
Checked today's latest payload 4.0.0-0.nightly-2019-03-25-180911 env. 6 hours have elapsed and I haven't hit the bug so far. Mike, per my comment 14 and comment 15, many more hours (at least 19) are needed to verify this bug. (This morning I had an env on yesterday's latest payload, which does not include this fix, and it ALSO didn't hit the bug until 18 hours had elapsed.) https://bugzilla.redhat.com/show_bug.cgi?id=1690101#c0 even says 48h, but I checked its log and the time gap between the pod's first log line and the first "Unauthorized" log is about 3h, not 48h.

Could you rely on observing my env to verify this bug? All my env info is in the top stanza of https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/58546/, including host.spec, users.spec, kubeconfig etc. (download link and "view" page).

BTW, does your team have a simulation-testing approach to make time on the env elapse faster, so that we don't have to wait this long for this kind of bug? Thanks!
Hi Xingxing, I will keep an eye on your env today. We do have the reliability test for long runs, but I think in this case it is the real elapsed time that matters - the real clock time.
After 20 hours (8 PM Eastern US) the cluster is still running well - no x509 certificate authentication failures, no openshift-apiserver restarts, and oc commands have been running continuously. Out of 19000 oc commands executed, 7 of them failed:

Error from server (NotFound): Unable to list {"route.openshift.io" "v1" "routes"}: the server could not find the requested resource (get routes.route.openshift.io)
Error from server (NotFound): the server could not find the requested resource (get users.user.openshift.io ~)
Error from server (NotFound): Unable to list {"route.openshift.io" "v1" "routes"}: the server could not find the requested resource (get routes.route.openshift.io)
Error from server (NotFound): the server could not find the requested resource (get users.user.openshift.io ~)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
Error from server (NotFound): the server could not find the requested resource (get users.user.openshift.io ~)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get users.user.openshift.io ~)

I am fairly certain the failures were due to timing - changing active projects while doing oc gets. In any case, they are not the authentication failures tracked by this bz. @Xingxing, let me know if you want me to continue tracking this cluster tomorrow.
Mike, thank you very much! Now 22 hours have elapsed and all is well. I was also running a loop `while true; do oc get route; oc whoami; oc project default; [ "$?" != "0" ] && date && echo "Hit error"; sleep 300; done` to check "openshift"-related stuff like comment 15, and didn't see a single error output. Now I checked the openshift-apiserver pod logs with `oc logs ... > ...log` and `grep -Ein "x509|Unauthorize|(Unable to list)|(could not find the requested resource)" apiserver-*.log`, and there are no errors there either. So moving to VERIFIED.
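For future runs, a slightly more defensive variant of that loop, checking every command's exit status instead of only the last one (just a sketch):

while true; do
  for cmd in "oc get route" "oc whoami" "oc project default"; do
    if ! $cmd > /dev/null 2>&1; then
      date; echo "Hit error: $cmd"
    fi
  done
  sleep 300
done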
Now 30 hours have elapsed and it is still running OK. The above env has been recycled for other use.
*** Bug 1694527 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758