Bug 1737591

Summary: OCP 4.2 - Azure - Authentication operator along with monitoring, console, openshift-apiserver operator degraded 24 hour post-install
Product: OpenShift Container Platform Reporter: Walid A. <wabouham>
Component: apiserver-auth Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE QA Contact: Chuan Yu <chuyu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.2.0 CC: aos-bugs, juzhao, mfojtik, mifiedle, slaznick
Target Milestone: --- Keywords: TestBlocker
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-08-07 06:57:47 UTC Type: Bug

Description Walid A. 2019-08-05 18:32:16 UTC
Description of problem:

After a successful IPI install of an OCP 4.2 cluster (3 masters, 2 workers) on Azure had been running for longer than 24 hours, several operators (authentication, monitoring, console, openshift-apiserver) became degraded and/or stayed Progressing/not Available. The openshift-cluster-version operator (CVO) logs show several "Unauthorized" errors for these degraded operators. Additionally, I can no longer run some oc commands, such as "oc get projects", and I cannot log in as kubeadmin with the kubeadmin password at the api-server URL.

Errors from CVO logs:

E0805 16:32:19.198913       1 memcache.go:135] couldn't get resource list for template.openshift.io/v1: Unauthorized
E0805 16:32:19.200984       1 memcache.go:135] couldn't get resource list for user.openshift.io/v1: Unauthorized
.
.
.
E0805 16:32:39.460766       1 task.go:77] error running apply for clusteroperator "monitoring" (250 of 431): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
E0805 16:32:39.460867       1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (106 of 431): Cluster operator openshift-apiserver has not yet reported success
E0805 16:32:39.462231       1 sync_worker.go:311] unable to synchronize image (waiting 2m50.956499648s): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
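
When triaging a longer CVO log, it can help to group the "Unauthorized" errors by the source location that emitted them. The following is a minimal sketch, assuming the log lines follow the klog header layout seen above (severity letter, MMDD, time, thread ID, file:line, then the message). The function name `unauthorized_by_source` and the regular expression are illustrative and not part of any OpenShift tooling.

```python
import re
from collections import Counter

# klog-style header: severity, MMDD, HH:MM:SS.micros, thread id, file:line], message
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+)\s+(?P<src>[^\s:]+:\d+)\]\s(?P<msg>.*)$"
)


def unauthorized_by_source(lines):
    """Count error-level lines mentioning 'Unauthorized', keyed by file:line."""
    counts = Counter()
    for line in lines:
        match = KLOG_LINE.match(line.strip())
        if match and match.group("sev") == "E" and "Unauthorized" in match.group("msg"):
            counts[match.group("src")] += 1
    return counts


# Three lines copied from the CVO log excerpt in this report.
sample = [
    "E0805 16:32:19.198913       1 memcache.go:135] couldn't get resource list for template.openshift.io/v1: Unauthorized",
    "E0805 16:32:19.200984       1 memcache.go:135] couldn't get resource list for user.openshift.io/v1: Unauthorized",
    'E0805 16:32:39.460867       1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (106 of 431): Cluster operator openshift-apiserver has not yet reported success',
]

for source, count in unauthorized_by_source(sample).items():
    print(source, count)
```

Lines that do not match the header pattern are skipped, so wrapped or truncated log lines would be missed by this sketch.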


# oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h
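
The `grep -v` filter above depends on the exact column spacing of the `oc get co` table. As a rough alternative, the sketch below splits each row on whitespace and compares the AVAILABLE, PROGRESSING, and DEGRADED columns directly. It assumes every data row has at least five whitespace-separated columns with no spaces inside a value; the function name `unhealthy_operators` and the `dns` row in the sample are hypothetical and only there to show a healthy operator being filtered out.

```python
def unhealthy_operators(table_text):
    """Return (name, available, progressing, degraded) for rows that are not True/False/False."""
    rows = []
    for line in table_text.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        if len(fields) < 5:
            # Assumption: a short row means the table layout is not what we expect.
            raise ValueError(f"unexpected row layout: {line!r}")
        name, _version, available, progressing, degraded = fields[:5]
        if (available, progressing, degraded) != ("True", "False", "False"):
            rows.append((name, available, progressing, degraded))
    return rows


# The four rows from this report, plus one hypothetical healthy row ("dns").
sample_table = """
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console               4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
dns                   4.2.0-0.nightly-2019-08-01-113533   True        False         False      2d4h
monitoring            4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver   4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h
"""

for row in unhealthy_operators(sample_table):
    print(*row)
```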


For Authentication Operator:

  - lastTransitionTime: "2019-08-05T15:21:28Z"
    message: 'OAuthClientsDegraded: Unauthorized'
    reason: OAuthClientsDegradedError
    status: "True"


Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-01-113533   True        False         2d4h    Error while reconciling 4.2.0-0.nightly-2019-08-01-113533: the cluster operator monitoring is degraded

# oc version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-alpha.0-43-g86a09cad", GitCommit:"86a09cad3831361c2f1efb70c0faa1aac611d3e0", GitTreeState:"clean", BuildDate:"2019-07-31T23:47:33Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+bf9534a", GitCommit:"bf9534a", GitTreeState:"clean", BuildDate:"2019-07-31T23:43:56Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
OpenShift Version: 4.2.0-0.nightly-2019-08-01-113533



How reproducible:
Happened once so far

Steps to Reproduce:
1. IPI Install of OCP 4.2.0-0.nightly-2019-08-01-113533 on Azure
2. Initially, all the cluster operators are running and available
3. Wait at least 24 hours

Actual results:
Some operators are degraded, progressing, or not available:
# oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h


Expected results:
After install, all cluster operators should remain available, not progressing, and not degraded

Additional info:

Links to must-gather logs and individual operator pod logs are provided in the next comment

Comment 3 Mike Fiedler 2019-08-06 14:06:17 UTC
Blocks long running reliability tests.

Comment 4 Standa Laznicka 2019-08-07 06:57:47 UTC

*** This bug has been marked as a duplicate of bug 1736800 ***