Bug 1688820 - openshift-apiserver silently dies with x509 errors
Summary: openshift-apiserver silently dies with x509 errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: 4.1.0
Assignee: Maciej Szulik
QA Contact: Xingxing Xia
URL:
Whiteboard: aos-scalability-41
: 1688147 1688824 1690101 1694527 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-14 14:24 UTC by Standa Laznicka
Modified: 2019-06-04 10:46 UTC (History)
20 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:45:52 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:45:59 UTC
Red Hat Bugzilla 1688147 'urgent' 'CLOSED' 'Service discovery is broken for a several hours cluster' 2019-12-04 08:17:32 UTC

Description Standa Laznicka 2019-03-14 14:24:41 UTC
Description of problem:
After a certain amount of time (over night), the openshift-apiserver pods start reporting x509 failures.

Version-Release number of selected component (if applicable):
master

How reproducible:
100%

Steps to Reproduce:
1. create a cluster with the installer from master
2. wait

Actual results:
openshift-apiserver pods die silently (still show as running), yet they are not serving anything and their logs show:

E0314 13:40:24.040295       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]

---------
Openshift-apiserver clusteroperator status at this point:
openshift-apiserver                   4.0.0-0.alpha-2019-03-12-153711   False       False         False     39h

Expected results:
openshift-apiserver is happy, openshift resources are available

Comment 1 Jeremy Eder 2019-03-14 14:31:47 UTC
*** Bug 1688824 has been marked as a duplicate of this bug. ***

Comment 2 Standa Laznicka 2019-03-14 14:52:46 UTC
Additional info:
removing the pods makes stuff work again

Comment 4 Xingxing Xia 2019-03-15 00:58:14 UTC
Same as bug 1688147 / bug 1688503

Comment 7 Radek Vokál 2019-03-18 14:40:31 UTC
*** Bug 1688147 has been marked as a duplicate of this bug. ***

Comment 8 Stefan Schimanski 2019-03-18 14:57:05 UTC
WIP PR is here for the rotation period increase: https://github.com/openshift/cluster-kube-apiserver-operator/pull/338

Comment 10 Michal Fojtik 2019-03-18 16:18:04 UTC
Note that this is still GA blocker.

Comment 11 Xingxing Xia 2019-03-19 03:10:12 UTC
*** Bug 1690101 has been marked as a duplicate of this bug. ***

Comment 12 Maciej Szulik 2019-03-19 09:13:25 UTC
This was fixed in https://github.com/openshift/library-go/pull/311

Comment 13 Xingxing Xia 2019-03-19 10:53:30 UTC
Will continue checking next day

Comment 14 Xingxing Xia 2019-03-19 14:16:39 UTC
Checked in today's latest build 4.0.0-0.nightly-2019-03-19-004004 env which already include above https://github.com/openshift/cluster-kube-apiserver-operator/pull/338 , so far after 9 hours the bug is still not hit.

BTW I can check whether a payload included cluster-kube-apiserver-operator PR by `oc adm release info --commits $PAYLOAD` to check cluster-kube-apiserver-operator commit and then `git log $COMMIT | grep '$PR_NUMBER'` under cluster-kube-apiserver-operator repo.
But how to check whether a payload already include a library-go PR? Check library-go commit in cluster-kube-apiserver-operator glide.lock, then check `git log` under library-go repo?

Comment 15 Xingxing Xia 2019-03-20 03:31:22 UTC
Though above "after 9 hours the bug is still not hit", now after about 19 hours, the bug is still hit:
oc get po -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-fmj9t   1/1     Running   0          19h
apiserver-r4867   1/1     Running   0          19h
apiserver-rc6gs   1/1     Running   0          19h

See the time gap between the pod's beginning log and the first x509 log:
oc logs apiserver-fmj9t -n openshift-apiserver
I0319 05:10:06.835679       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0319 05:10:06.835759       1 plugins.go:84] Registered admission plugin "Initializers"
...
E0320 00:45:45.666531       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]
E0320 00:45:45.884978       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]

Above commands can run, but other commands cannot:
oc get route
error: You must be logged in to the server (Unauthorized)
oc project openshift-apiserver
error: You must be logged in to the server (Unauthorized)
oc whoami 
error: You must be logged in to the server (Unauthorized

Comment 16 Xingxing Xia 2019-03-20 03:42:38 UTC
oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-19-004004
...
  cluster-kube-apiserver-operator               https://github.com/openshift/cluster-kube-apiserver-operator               887bdfc3431fb56fb113166aa43a7787178872f3
...

git clone git@github.com:openshift/cluster-kube-apiserver-operator.git
cd cluster-kube-apiserver-operator
git log --date=local --pretty="%h %an %cd - %s" 887bdfc3431fb56fb113166aa43a7787178872f3 -1 -- vendor/github.com/openshift/library-go/
408fa0dc Michal Fojtik Mon Mar 18 23:10:51 2019 - bump library-go to fix cert syncer

From above info, in above latest build 4.0.0-0.nightly-2019-03-19-004004 env, component cluster-kube-apiserver-operator already bumped https://github.com/openshift/library-go/pull/311. But checked components cluster-openshift-apiserver-operator and hyperkube/hypershift, they haven't bumped yet. Not sure whether they need bump too.

Comment 17 Maciej Szulik 2019-03-22 15:30:27 UTC
This PR should fix the problem: https://github.com/openshift/origin/pull/22395

Comment 18 Michal Fojtik 2019-03-25 10:57:47 UTC
PR merged, lets move to QE.

Comment 19 Xingxing Xia 2019-03-26 01:45:10 UTC
Yesterday the PR didn't land into yesterday's latest payload. Today, it landed into today's latest payload. Will verify it today

Comment 20 Xingxing Xia 2019-03-26 10:44:48 UTC
Checked today latest payload 4.0.0-0.nightly-2019-03-25-180911 env. 6 hours elapsed, didn't hit it so far.

Mike, per my comment 14 and comment 15, much more hours (at least 19 hours) are needed to verify this bug. (This morning I had a yesterday's latest payload env that doesn't include this fix, it ALSO didn't hit it until 18 hours elapsed). https://bugzilla.redhat.com/show_bug.cgi?id=1690101#c0 even says 48h but I checked its log the time gap between the pod's beginning log and the first "Unauthorized log" is about 3h, not 48h.
Could you relay on observing my env to verify this bug?
All my env info is in https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/58546/ top stanza, including host.spec, users.spec, kubeconfig etc (download link and "view" page).
BTW, do your team have simulation testing approach to make time on env elapse faster so that we need not wait really so long for such kind of bug?
Thanks!

Comment 21 Mike Fiedler 2019-03-26 14:28:23 UTC
Hi Xingxing,  I will keep an eye on your env today.   We do have the reliability test for long runs, but I think in this case it is the real elapsed time that matter - the real clock time.

Comment 22 Mike Fiedler 2019-03-27 00:15:12 UTC
After 20 hours (8PM Eastern US) the cluster is still running well - no x509 certificate authentication failures, no openshift-apiserver restarts and oc commands have been running continuously.   Out of 19000 oc commands executed, 7 of them failed.

Error from server (NotFound): Unable to list {"route.openshift.io" "v1" "routes"}: the server could not find the requested resource (get routes.route.openshift.io)
Error from server (NotFound): the server could not find the requested resource (get users.user.openshift.io ~)
Error from server (NotFound): Unable to list {"route.openshift.io" "v1" "routes"}: the server could not find the requested resource (get routes.route.openshift.io)
Error from server (NotFound): the server could not find the requested resource (get users.user.openshift.io ~)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
Error from server (NotFound): the server could not find the requested resource (get users.user.openshift.io ~)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get users.user.openshift.io ~)

I am fairly certain the failures were due to timing - changing active projects while doing oc gets.   In any case, they are not the authentication failures tracked by this bz.
@Xingxing, let me know if you want me to continue tracking this cluster tomorrow.

Comment 23 Xingxing Xia 2019-03-27 02:21:28 UTC
Mike, thank you very much!
Now 22 hours elapsed, all is well. I was also running a loop `while true; do   oc get route;   oc whoami;   oc project default;   [ "$?" != "0" ] && date && echo "Hit error";   sleep 300; done` to check "openshift" related stuff like comment 15, didn't see one error output. Now I checked the openshift-apiserver pods logs by `oc logs ... > ...log` and `grep -Ein  "x509|Unauthorize|(Unable to list)|(could not find the requested resource)" apiserver-*.log`, no error either. So moving to VERIFIED.

Comment 24 Xingxing Xia 2019-03-27 10:10:49 UTC
Now 30 hours elapsed, still running OK. Above env is recycled for other use.

Comment 25 Mo 2019-04-01 13:47:03 UTC
*** Bug 1694527 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2019-06-04 10:45:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758


Note You need to log in before you can comment on or make changes to this bug.