Bug 1737591 - OCP 4.2 - Azure - Authentication operator along with monitoring, console, openshift-apiserver operator degraded 24 hour post-install
Keywords:
Status: CLOSED DUPLICATE of bug 1736800
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: Chuan Yu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-05 18:32 UTC by Walid A.
Modified: 2019-08-07 06:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-07 06:57:47 UTC
Target Upstream Version:
Embargoed:



Description Walid A. 2019-08-05 18:32:16 UTC
Description of problem:

After a successful IPI install of an OCP 4.2 cluster (3 masters, 2 workers) on Azure had been running for more than 24 hours, several operators (authentication, monitoring, console, openshift-apiserver) became degraded and/or remained Progressing/Not available. The openshift-cluster-version operator logs show several "Unauthorized" errors for these degraded operators. Additionally, I can no longer run some oc commands such as "oc get projects", and I cannot log in as kubeadmin with the kubeadmin password at the api-server URL.

Errors from CVO logs:

E0805 16:32:19.198913       1 memcache.go:135] couldn't get resource list for template.openshift.io/v1: Unauthorized
E0805 16:32:19.200984       1 memcache.go:135] couldn't get resource list for user.openshift.io/v1: Unauthorized
.
.
.
E0805 16:32:39.460766       1 task.go:77] error running apply for clusteroperator "monitoring" (250 of 431): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
E0805 16:32:39.460867       1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (106 of 431): Cluster operator openshift-apiserver has not yet reported success
E0805 16:32:39.462231       1 sync_worker.go:311] unable to synchronize image (waiting 2m50.956499648s): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
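
To see which CVO code paths are hitting "Unauthorized", the log lines above can be grouped by source file. The following is a minimal sketch, assuming the log follows the usual klog header layout (severity letter, MMDD, time, thread ID, file:line). The names KLOG_RE and unauthorized_by_source are hypothetical helpers and not part of any OpenShift tooling. The sample lines are copied from this report.

```python
import re
from collections import Counter

# Assumed klog header: severity letter, MMDD, time, thread id, file:line, "] message".
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def unauthorized_by_source(log_text):
    """Count error-level lines mentioning 'Unauthorized', keyed by source file."""
    counts = Counter()
    for raw in log_text.splitlines():
        match = KLOG_RE.match(raw.strip())
        if match is None:
            continue  # not a klog line (for example the "." elision markers)
        if match.group("sev") == "E" and "Unauthorized" in match.group("msg"):
            counts[match.group("file")] += 1
    return counts


# Lines taken verbatim from the CVO log excerpt in this report.
cvo_sample = """\
E0805 16:32:19.198913       1 memcache.go:135] couldn't get resource list for template.openshift.io/v1: Unauthorized
E0805 16:32:19.200984       1 memcache.go:135] couldn't get resource list for user.openshift.io/v1: Unauthorized
.
E0805 16:32:39.460867       1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (106 of 431): Cluster operator openshift-apiserver has not yet reported success
E0805 16:32:39.462231       1 sync_worker.go:311] unable to synchronize image (waiting 2m50.956499648s): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
"""

for source, count in sorted(unauthorized_by_source(cvo_sample).items()):
    print(source, count)
```

Lines that do not match the assumed header are skipped rather than treated as errors, so the elided portions of the excerpt do not affect the counts.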


# oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h
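
The grep filter above depends on the exact column spacing of the oc output. As a rough alternative, the table can be split on whitespace and each status column compared directly. This is only a sketch: unhealthy_operators is a hypothetical helper, and it assumes every row has all six columns populated (a row with an empty VERSION would be misread). The sample rows are copied from this report.

```python
def unhealthy_operators(table_text):
    """Return (name, available, progressing, degraded) for operators that are
    not in the Available=True, Progressing=False, Degraded=False state."""
    rows = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, _version, available, progressing, degraded = fields[:5]
        if (available, progressing, degraded) != ("True", "False", "False"):
            rows.append((name, available, progressing, degraded))
    return rows


# Rows taken verbatim from the "oc get co" output in this report.
sample = """\
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h
"""

for name, available, progressing, degraded in unhealthy_operators(sample):
    print(name, available, progressing, degraded)
```

Where column alignment is a concern, reading the conditions from "oc get co -o json" is likely more robust than parsing the table text.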


For Authentication Operator:

  - lastTransitionTime: "2019-08-05T15:21:28Z"
    message: 'OAuthClientsDegraded: Unauthorized'
    reason: OAuthClientsDegradedError
    status: "True"


Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-01-113533   True        False         2d4h    Error while reconciling 4.2.0-0.nightly-2019-08-01-113533: the cluster operator monitoring is degraded

# oc version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-alpha.0-43-g86a09cad", GitCommit:"86a09cad3831361c2f1efb70c0faa1aac611d3e0", GitTreeState:"clean", BuildDate:"2019-07-31T23:47:33Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+bf9534a", GitCommit:"bf9534a", GitTreeState:"clean", BuildDate:"2019-07-31T23:43:56Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
OpenShift Version: 4.2.0-0.nightly-2019-08-01-113533



How reproducible:
Happened once so far

Steps to Reproduce:
1. IPI Install of OCP 4.2.0-0.nightly-2019-08-01-113533 on Azure
2. Initially all the cluster operators are running and available
3. Wait at least 24 hours

Actual results:
Some operators are degraded or progressing/not available
# oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h


Expected results:
After install, all cluster operators should remain available, not progressing, and not degraded

Additional info:

Links to must-gather logs and individual operator pod logs are provided in the next comment

Comment 3 Mike Fiedler 2019-08-06 14:06:17 UTC
Blocks long-running reliability tests.

Comment 4 Standa Laznicka 2019-08-07 06:57:47 UTC

*** This bug has been marked as a duplicate of bug 1736800 ***

