Bug 1769247 - 4.2 -> 4.2.2 upgrade seems to work, but can no longer login, and console does not come up.
Keywords:
Status: CLOSED DUPLICATE of bug 1764670
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-06 09:43 UTC by Daniel Webster
Modified: 2022-08-04 22:24 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-06 14:42:51 UTC
Target Upstream Version:
Embargoed:


Attachments
journalctl -xefl | grep -i auth (33.76 KB, text/plain)
2019-11-06 10:20 UTC, Daniel Webster

Description Daniel Webster 2019-11-06 09:43:27 UTC
Description of problem:

After a very stable period with 4.2, I decided to upgrade to 4.2.2 last night. We appeared to reach around 73% or more (reported by the cluster upgrade part of the console), and then upon refreshing, the console was no longer available (ERR_CONNECTION_CLOSED from Chrome).

Now, no login to the cluster is possible, not even from a master as system:admin; no oc command works at all.

"error: EOF" is thrown, and we seem to fall straight back to being unauthenticated. I have based much of my debugging on the only thing available, which is crictl logs on the masters. I will provide more detail in subsequent updates to this BZ.

dan@danbox:~$ oc login -u admin lb-ocp4a.redact.ed:6443 --loglevel=6
I1106 10:34:59.727291   31605 loader.go:359] Config loaded from file /home/dan/ocp4a/auth/kubeconfig
I1106 10:34:59.729765   31605 loader.go:359] Config loaded from file /home/dan/ocp4a/auth/kubeconfig
I1106 10:34:59.781641   31605 round_trippers.go:438] HEAD https://lb-ocp4a.redact.ed:6443/  in 51 milliseconds
I1106 10:34:59.809043   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 27 milliseconds
I1106 10:34:59.819922   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 9 milliseconds
I1106 10:34:59.830692   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 10 milliseconds
I1106 10:34:59.841394   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 10 milliseconds
I1106 10:34:59.851918   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 10 milliseconds
I1106 10:34:59.861777   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 9 milliseconds
I1106 10:34:59.871462   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 9 milliseconds
I1106 10:34:59.882014   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/apis/user.openshift.io/v1/users/~ 401 Unauthorized in 10 milliseconds
I1106 10:34:59.882121   31605 request_token.go:87] GSSAPI Enabled
I1106 10:34:59.884329   31605 round_trippers.go:438] GET https://lb-ocp4a.redact.ed:6443/.well-known/oauth-authorization-server 200 OK in 2 milliseconds
I1106 10:34:59.887090   31605 round_trippers.go:438] HEAD https://oauth-openshift.apps.ocp4a.redact.ed  in 2 milliseconds
I1106 10:34:59.887213   31605 request_token.go:445] falling back to kubeconfig CA due to possible IO error: EOF
I1106 10:34:59.889848   31605 round_trippers.go:438] GET https://oauth-openshift.apps.ocp4a.redact.ed/oauth/authorize?client_id=openshift-challenging-client&code_challenge=xcv-sATjWYCCUJzeh_QAPYfbycFHiAYIE6FPz4RiJik&code_challenge_method=S256&redirect_uri=https%3A%2F%2Foauth-openshift.apps.ocp4a.redact.ed%2Foauth%2Ftoken%2Fimplicit&response_type=code  in 2 milliseconds
F1106 10:34:59.889905   31605 helpers.go:116] error: EOF


Version-Release number of selected component (if applicable):

4.2.0 (attempted upgrade to 4.2.2).

Using Trident 19.07.1 with CSI (support on 4.2 is questionable, but it has worked fine; 19.10 has been available since 2019-10-30 but is not yet installed).

Using GitLab Runner / Helm Tiller.

How reproducible:

Presumably: take a stable 4.2.0 cluster and upgrade it to 4.2.2. I don't currently have the resources to build another cluster to confirm.
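
(For reference, a rough CLI sketch of that reproduction path, assuming cluster-admin access and that 4.2.2 is available on the cluster's current update channel; this is not necessarily the exact sequence used here:)

$ oc get clusterversion                  # confirm the cluster reports 4.2.0 and is healthy
$ oc adm upgrade --to=4.2.2              # request the update to 4.2.2
$ watch oc get clusteroperators          # follow operator progress during the rollout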


Steps to Reproduce:
1. N/A
2. N/A
3. N/A

Actual results:

Broken cluster.

Expected results:

A cluster running 4.2.2, with authentication at least still working.

Additional info:

Will provide more debug below.

Comment 1 Daniel Webster 2019-11-06 09:59:16 UTC
Note the restart count on the console container; it is constantly retrying the OAuth endpoint:

[root@master0 ~]# crictl ps| grep console
954ffddcf7d87       4dd2f7a47c2fc6470e0ccc36d9890e65a94d63f0956d6f1e238696d01d76831b                                                         About a minute ago   Running             console                                  154                 ed586e311e03a
[root@master0 ~]# crictl logs 954ffddcf7d87
2019/11/6 09:53:07 cmd/main: cookies are secure!
2019/11/6 09:53:07 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:53:17 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:53:27 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:53:37 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:53:47 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:53:57 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:54:07 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:54:17 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
2019/11/6 09:54:27 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4a.redact.ed/oauth/token failed: Head https://oauth-openshift.apps.ocp4a.redact.ed: EOF
[root@master0 ~]# 

Interestingly, though, the oauth-openshift containers themselves seem to be OK, although I'm not an expert:

[root@master0 ~]# crictl ps| grep auth
d5770262b946f       ab8cc51728abdc698d0f7e0abd1dad4d26736e46f496a5137aef52537457cb6a                                                         6 hours ago         Running             oauth-openshift                          0                   9fe4e6ba3838a
[root@master0 ~]# crictl logs d5770262b946f
Copying system trust bundle
I1105 20:02:50.596646       1 secure_serving.go:65] Forcing use of http/1.1 only
I1105 20:02:50.596899       1 secure_serving.go:127] Serving securely on 0.0.0.0:6443
Copying system trust bundle
I1106 03:40:40.211003       1 secure_serving.go:65] Forcing use of http/1.1 only
I1106 03:40:40.211178       1 secure_serving.go:127] Serving securely on 0.0.0.0:6443
[root@master0 ~]# 

[root@master0 ~]# crictl exec -it d5770262b946f /bin/bash
[root@oauth-openshift-7d4bb548d6-m8x4v /]# ps auxwww
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.1  0.3 595656 53668 ?        Ssl  03:40   0:37 oauth-server osinserver --config=/var/config/system/configmaps/v4-0-config-system-cliconfig/v4-0-config-system-cliconfig --v=2
root         30  0.0  0.0  11824  2948 pts/0    Ss   09:57   0:00 /bin/bash
root         46  0.0  0.0  51764  3484 pts/0    R+   09:57   0:00 ps auxwww

Comment 2 Daniel Webster 2019-11-06 10:20:37 UTC
Created attachment 1633242 [details]
journalctl -xefl | grep -i auth

I grepped a followed journalctl from one of the three masters for "auth", in the hope that it sheds some light for those far more familiar with this than I am.

Comment 3 Daniel Webster 2019-11-06 10:40:53 UTC
I just managed to log in using the service account token I had previously set up for a GitLab runner, so I am able to upload a must-gather if needed.
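
(For anyone else stuck in this state: a token-based login bypasses the browser OAuth flow entirely, so something along these lines should work; the token value is a placeholder.)

$ oc login --server=https://lb-ocp4a.redact.ed:6443 --token=<service-account-token>
$ oc adm must-gather --dest-dir=./must-gather    # collect diagnostics for the support case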

Comment 4 Standa Laznicka 2019-11-06 13:09:43 UTC
Since the oauth-server pods are running fine, I'm suspecting this is going to be a problem somewhere either in route or sdn.

Daniel, could you please share must-gather in the meantime so that the other folks handling the BZ have some more insight on what happened in the cluster?
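
(A few checks that could help narrow down route vs. SDN; the resource and namespace names below assume the default OCP 4.2 layout, so treat this as a sketch rather than a prescription.)

$ oc get clusteroperators authentication ingress network
$ oc get route oauth-openshift -n openshift-authentication
$ oc get pods -n openshift-ingress -o wide       # router pods
$ oc get pods -n openshift-sdn -o wide           # SDN pods per node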

Comment 5 Daniel Webster 2019-11-06 13:21:48 UTC
Hi Standa,

I also have a Red Hat support case open for this, and just received the update below from Red Hat engineer Jaspreet Kaur (I have added her to the CC):

I could see the errors below in your authentication operator:


finishedAt: 2019-11-05T20:00:04Z
        message: |2
           *v1.ConfigMap ended with: too old resource version: 13614852 (13617538)
          E1105 19:45:48.329481       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: EOF
          W1105 19:45:58.279162       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13614852 (13617612)
          W1105 19:46:04.270745       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13614852 (13617631)
          E1105 19:46:16.333263       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: EOF
          W1105 19:47:38.218524       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13614852 (13618181)
          W1105 19:52:08.278910       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13617781 (13619414)
          W1105 19:52:49.225506       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13618343 (13619626)
          W1105 19:52:50.262096       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13617708 (13619631)
          E1105 19:52:52.333114       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: EOF
          W1105 19:54:25.285752       1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 13617763 (13620068)
          I1105 19:56:54.871296       1 cmd.go:79] Received SIGTERM or SIGINT signal, shutting down controller.
          F1105 19:56:54.871397       1 leaderelection.go:66] leaderelection lost
        reason: Error
        startedAt: 2019-11-05T19:37:01Z
    name: operator


On checking this further along with the previous logs, I think we are hitting the following:

https://bugzilla.redhat.com/show_bug.cgi?id=1764670

This was originally deferred in this Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1759523

Let me check the status with Engineering and get back to you on this case.
-----------
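
(The excerpt above is the terminated-container status of the authentication operator; on a cluster where oc works again, roughly the following should surface the same information. The pod name is a placeholder.)

$ oc get clusteroperator authentication -o yaml
$ oc get pods -n openshift-authentication-operator
$ oc logs -n openshift-authentication-operator <authentication-operator-pod> --previous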

Would you still like me to upload the MG?

Cheers

dan

Comment 6 Daniel Webster 2019-11-06 13:33:40 UTC
(In reply to Standa Laznicka from comment #4)
> Since the oauth-server pods are running fine, I'm suspecting this is going
> to be a problem somewhere either in route or sdn.
> 
> Daniel, could you please share must-gather in the meantime so that the other
> folks handling the BZ have some more insight on what happened in the cluster?

Regarding the MG - I have uploaded it to our RH support case... Are you able to access that system?  Otherwise, there is sensitive information surrounding our apps, and it's not something that is yet ideal in the public domain.. so am actually reluctant to upload here. Hope you understand.

cheers
dan

Comment 7 Dan Mace 2019-11-06 14:05:23 UTC
(In reply to Daniel from comment #6)
> (In reply to Standa Laznicka from comment #4)
> > Since the oauth-server pods are running fine, I'm suspecting this is going
> > to be a problem somewhere either in route or sdn.
> > 
> > Daniel, could you please share must-gather in the meantime so that the other
> > folks handling the BZ have some more insight on what happened in the cluster?
> 
> Regarding the MG - I have uploaded it to our RH support case... Are you able
> to access that system?  Otherwise, there is sensitive information
> surrounding our apps, and it's not something that is yet ideal in the public
> domain.. so am actually reluctant to upload here. Hope you understand.
> 
> cheers
> dan

What case? There is none linked to this bug. I also don't see any linked to case 02497879.

This also looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1764670.

Is there any reason I shouldn't close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1764670?

Comment 8 Daniel Webster 2019-11-06 14:19:11 UTC
(In reply to Dan Mace from comment #7)
> (In reply to Daniel from comment #6)
> > (In reply to Standa Laznicka from comment #4)
> > > Since the oauth-server pods are running fine, I'm suspecting this is going
> > > to be a problem somewhere either in route or sdn.
> > > 
> > > Daniel, could you please share must-gather in the meantime so that the other
> > > folks handling the BZ have some more insight on what happened in the cluster?
> > 
> > Regarding the MG - I have uploaded it to our RH support case... Are you able
> > to access that system?  Otherwise, there is sensitive information
> > surrounding our apps, and it's not something that is yet ideal in the public
> > domain.. so am actually reluctant to upload here. Hope you understand.
> > 
> > cheers
> > dan
> 
> What case? There is none linked to this bug. I also don't see any linked to
> case 02497879.
> 
> This also looks like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1764670.
> 
> Is there any reason I shouldn't close this as a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1764670?

Hi Dan,

The case number is:  02511382  -- How can I link it?

Not that I'm an expert, but those bugs seem to refer to new installs. This was a perfectly happy, stable 4.2.0 cluster that had been running for a good month before I decided, last night, to upgrade to 4.2.2, and it did not survive. There was no waiting an indeterminate period of time post-install for the "error: EOF" to appear, as in the other bugs; the breakage was immediate.

What are your thoughts?

cheers

dan

Comment 9 Daniel Webster 2019-11-06 14:29:08 UTC
In fact, it is perhaps worth mentioning that this cluster started life roughly two months ago as a 4.1 cluster. It was upgraded all the way to 4.2.0, through a handful of intermediate versions, without many issues.

Comment 10 Dan Mace 2019-11-06 14:42:51 UTC
Thanks for the case details. Looking at the must-gather output, the ingresscontroller is using HostNetwork, and all router pods are healthy. These details align with prior investigations into this class of issue, which have concluded that there is no strong evidence of a routing problem specifically. Because there don't appear to be any new details or symptoms here, I'm closing this as a duplicate of bug 1764670. The nature of the problem is intermittent, so I don't read too much into the fact that trouble can emerge at install time, during an upgrade, or during steady state; that is, I don't consider the timing details here to represent the discovery of a new bug.

My bet is still on either an SDN issue or the auth endpoint itself returning a bad response. Investigation should continue in the context of the existing bug, though.
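
(For anyone reading later, the details referenced above can be checked on a live cluster roughly as follows; the ingresscontroller name assumes the default.)

$ oc get ingresscontroller default -n openshift-ingress-operator -o jsonpath='{.status.endpointPublishingStrategy.type}'
$ oc get pods -n openshift-ingress -o wide       # router pod health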

*** This bug has been marked as a duplicate of bug 1764670 ***

