Bug 1988576
| Summary: | Authentication operator fails to become available during upgrade to 4.8.2 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | rvanderp |
| Component: | apiserver-auth | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | alkazako, amcdermo, aos-bugs, cblecker, david.karlsen, fsoppels, g.parera, jscalf, liyao, lmohanty, mbargenq, mfojtik, mtleilia, mwhittin, nsu, rsandu, sdodson, slaznick, surbania, wking, xxia, yanyang, ychoukse |
| Target Milestone: | --- | Keywords: | Regression, Reopened, ServiceDeliveryBlocker, ServiceDeliveryImpact, UpgradeBlocker |
| Target Release: | 4.9.0 | Flags: | ychoukse: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | UpdateRecommendationsBlocked | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| : | 1989587 (view as bug list) | | |
| Last Closed: | 2021-10-18 17:43:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1989587 | | |
| Attachments: | | | |
Description
rvanderp
2021-07-30 21:37:36 UTC
The underlying issue seems to be that the route endpoint check fails with a timeout:

```
OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz": context canceled
```

This is rather network related; the oauth pod cannot reach external routes. Could you enable router access logs?

```
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'
```

That will restart the ingress router pods, and each pod will then have a "logs" container. I'd like to see if this helps provide any insight or correlation w.r.t. the auth GET failure mentioned in comment #1.
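For reference, a minimal way to follow those access logs once the patch has rolled out; this is a hedged sketch, assuming the default ingresscontroller, whose router pods run in the openshift-ingress namespace behind the router-default deployment:

```
# Tail the access logs from the "logs" sidecar container added by the patch above.
# Assumes the default ingresscontroller; adjust the deployment name otherwise.
$ oc -n openshift-ingress logs deploy/router-default -c logs -f
```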
On top of that, would it be possible to run the curler binary (see attachments)? It is a wrapped-up version of curl that repeatedly makes a GET request. Usage would be:

```
$ O=1 ./curler https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz
reopening stdout to "curler-R0-2021-08-02-133557.stdout"
reopening stderr to "curler-R0-2021-08-02-133557.stderr"
```

You can `tail -f` the .stdout file to watch the GET requests to the endpoint, looking for slow requests, requests that fail, DNS issues, et al. It will repeat the GET indefinitely. Can we run this external to the cluster and from a pod (or node?) within the cluster? I'd like to see the same failure from the curler binary that we do from the auth pod.

Created attachment 1810102 [details]
curler binary - make repeated calls to an endpoint.

Usage:

```
O=1 ./curler <URL>
```
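For readers without the attachment, a rough bash approximation of what such a loop does; this is a sketch, not the attached binary, and only mirrors the per-request `http_code` lines that are grepped in the test results below:

```
#!/usr/bin/env bash
# Stand-in for the curler attachment: GET the URL in a loop and log the HTTP
# status code and total time for each request, so failures, slow responses,
# or DNS hiccups stand out. Usage: ./curl-loop.sh <URL>
url="$1"
while true; do
  # -s silent, -k skip TLS verification, -o discard body, -m 10s timeout,
  # -w print the status code and total request time
  line=$(curl -sk -o /dev/null -m 10 \
         -w "http_code %{http_code} time_total %{time_total}" "$url" 2>&1)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${line}"
  sleep 1
done
```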
needinfo should go to OP.

Performed 36 minutes of curl testing. From the perspective of the auth pod:

```
sh-4.4# cat curler-R0-2021-08-02-130644.stderr
sh-4.4# cat curler-R0-2021-08-02-130644.stdout | grep "http_code 200" | wc -l
123425
sh-4.4# cat curler-R0-2021-08-02-130644.stdout | grep -v "http_code 200" | wc -l
0
```

From the perspective of an external caller:

```
$ cat curler-R0-2021-08-02-090657.stderr
$ cat curler-R0-2021-08-02-090657.stdout | grep "http_code 200" | wc -l
5808
$ cat curler-R0-2021-08-02-090657.stdout | grep -v "http_code 200" | wc -l
0
```

No failures were observed from the degraded operator pod or externally.

The issue is that the failing route status was set by the version 4.7 authentication operator:

```
$ kubectl get co authentication -o yaml
...
  - lastTransitionTime: "2021-08-02T21:47:41Z"
    message: 'OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.build01-us-west-2.vmc.ci.openshift.org/healthz": context canceled'
    reason: OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable
    status: "False"
    type: Available
```

However, the prefix OAuthRoute was changed to OAuthServerRoute, and the stale status controller entries have not been updated correctly. Instead of referring to OAuthRouteCheckEndpointAccessibleController_Available_, they refer to OAuthRouteCheckEndpointAccessibleController_Degraded_: https://github.com/openshift/cluster-authentication-operator/blob/4dfd59792e303282731f6120a22b042144901b39/pkg/operator/starter.go#L256

Worked around this issue by appending an `Available: True` condition to the operator conditions in etcd. The upgrade is proceeding. Edit: I had to remove the condition `OAuthRouteCheckEndpointAccessibleController` from the authentications/cluster resource.
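For anyone inspecting a cluster in this state, a hedged sketch of how the stale conditions can be viewed before changing anything by hand; the jq filter is illustrative and assumes jq is available:

```
# Aggregated view on the ClusterOperator (same data as the YAML above):
$ oc get clusteroperator authentication \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

# Per-controller conditions on the operator config, where the stale
# OAuthRouteCheckEndpointAccessibleController* entries were observed:
$ oc get authentications.operator.openshift.io cluster -o json \
    | jq '.status.conditions[] | select(.type | startswith("OAuthRouteCheckEndpointAccessibleController"))'
```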
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

- example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?

- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

> > Customers upgrading from 4.7.z to 4.8.z during a period when the authentication operator is unable to reach the oauth route just as the authentication operator is rolling out 4.8.
> >
> > This should be prevented by CVO as upgrades should not be possible while operators report degraded. We are still not sure why the upgrade was still possible.

Because sometimes updating to a new release is how we want folks to fix a degraded operator. During updates, the CVO blocks on ClusterOperator manifests when they are Degraded=True [1], but that's generally (always?) after the manifest for the operator deployment. Blocking a move from release A to release B because A's operator X is Degraded=True isn't crisp, because we'll only block mid-update if B's operator X is also Degraded=True. Or maybe the degradation root is outside operator X completely, in which case maybe it gets sorted out before we get around to ClusterOperator X, or maybe not, but that chance is still better than forcing folks to manually recover or force through a guard before they can attempt an update.

[1]: https://github.com/openshift/enhancements/blob/ac1c27da8307933263e5273bc087b407d79f713f/dev-guide/cluster-version-operator/user/reconciliation.md#clusteroperator

We have not seen this issue in Telemetry, and per our discussion it seems like a corner-case issue, so we have decided not to remove the edge from 4.7 to 4.8 for this bug. However, if we get evidence of this bug impacting more clusters, we will reconsider the decision.

It seems that has changed: https://github.com/openshift/cincinnati-graph-data/pull/987

*** Bug 1993712 has been marked as a duplicate of this bug. ***

It's been confirmed that a cluster which had run into this upgrade-halting problem completed the upgrade after applying 4.8.5, which contained the backported version of this fix. This bug primarily served as a prerequisite for backporting that change to 4.8, and as such I'm marking this CLOSED CURRENTRELEASE after the verification I just mentioned.

With a few more clusters getting stuck on this issue, and 4.8.5 now in fast-4.8 with the fix [1,2], we've blocked 4.7 -> 4.8[234] [3] to keep future updates from hitting this same problem.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1989587#c5
[2]: https://github.com/openshift/cincinnati-graph-data/pull/988#event-5164866620
[3]: https://github.com/openshift/cincinnati-graph-data/pull/987

Updated Impact Statement

Who is impacted?
Some clusters upgrading from 4.7 to 4.8.2-4.8.4.

What is the impact? Is it serious enough to warrant blocking edges?
The authentication operator incorrectly marks itself Available=False, and the upgrade process halts once that happens. The upgrade will never complete, but absent other unrelated issues the cluster should be healthy. Additionally, since the CVO halts reconciliation, any out-of-band changes made will remain intact until this problem is resolved.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
We have shipped a fix for this issue in 4.8.5; upgrading to that version will heal the cluster. The following command should work:

```
oc adm upgrade --to=4.8.5 --allow-upgrade-with-warnings
```

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, it's a regression between 4.7 and 4.8.2-4.8.4 and has been fixed in 4.8.5.
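A quick, hedged way to confirm recovery after moving to 4.8.5; these are standard oc commands and the exact output will vary by cluster:

```
# The update should resume and complete once 4.8.5 is applied
$ oc adm upgrade
$ oc get clusterversion

# The authentication ClusterOperator should report Available=True again
$ oc get clusteroperator authentication
```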
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759