Bug 1737591

Summary:	OCP 4.2 - Azure - Authentication operator along with monitoring, console, openshift-apiserver operators degraded 24 hours post-install
Product:	OpenShift Container Platform	Reporter:	Walid A. <wabouham>
Component:	apiserver-auth	Assignee:	Stefan Schimanski <sttts>
Status:	CLOSED DUPLICATE	QA Contact:	Chuan Yu <chuyu>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.2.0	CC:	aos-bugs, juzhao, mfojtik, mifiedle, slaznick
Target Milestone:	---	Keywords:	TestBlocker
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-08-07 06:57:47 UTC	Type:	Bug

Description Walid A. 2019-08-05 18:32:16 UTC

Description of problem:

After a successful IPI install of an OCP 4.2 cluster (3 masters, 2 workers) on Azure had been running for more than 24 hours, several operators (authentication, monitoring, console, openshift-apiserver) became degraded and/or stayed Progressing/not Available. The openshift-cluster-version operator logs show several "Unauthorized" errors for these degraded operators. Additionally, I can no longer run some oc commands, such as "oc get projects", and I cannot log in as kubeadmin with the kubeadmin password at the api-server URL.

Errors from CVO logs:

E0805 16:32:19.198913       1 memcache.go:135] couldn't get resource list for template.openshift.io/v1: Unauthorized
E0805 16:32:19.200984       1 memcache.go:135] couldn't get resource list for user.openshift.io/v1: Unauthorized
.
.
.
E0805 16:32:39.460766       1 task.go:77] error running apply for clusteroperator "monitoring" (250 of 431): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
E0805 16:32:39.460867       1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (106 of 431): Cluster operator openshift-apiserver has not yet reported success
E0805 16:32:39.462231       1 sync_worker.go:311] unable to synchronize image (waiting 2m50.956499648s): Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: Unauthorized
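
To gauge how widespread the "Unauthorized" errors are, log lines like the ones above can be tallied by the source file that emitted them. This is only a minimal sketch, assuming the klog line layout shown above (source file and line number in the fourth whitespace-separated field); the function name count_unauthorized is hypothetical and not part of oc.

```shell
# Hypothetical helper: count log lines that end in "Unauthorized",
# grouped by emitting source file (4th field, e.g. "memcache.go:135]").
# Reads the named files, or stdin when no file is given.
count_unauthorized() {
  awk '/Unauthorized$/ { sub(/:.*/, "", $4); n[$4]++ }
       END { for (f in n) print n[f], f }' "$@" | sort -k2
}
```

Assuming the default namespace and deployment name of the cluster-version operator, it could be fed with: oc logs -n openshift-cluster-version deployment/cluster-version-operator | count_unauthorized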


# oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         True       2d4h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        True          True       2d4h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       False         True       33h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   False       False         False      33h
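
The grep filter above matches on exact column padding, so it would stop hiding healthy rows if the spacing ever changed. A minimal sketch of a column-based alternative follows, assuming every row has a non-empty VERSION so that AVAILABLE, PROGRESSING and DEGRADED are fields 3 to 5; the function name unhealthy_operators is hypothetical and not part of oc.

```shell
# Hypothetical helper: read "oc get co" output on stdin, keep the header,
# and print only rows that are not Available=True, Progressing=False,
# Degraded=False.
unhealthy_operators() {
  awk 'NR == 1 { print; next }
       $3 != "True" || $4 != "False" || $5 != "False"'
}
```

Usage: oc get co | unhealthy_operators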


For Authentication Operator:

  - lastTransitionTime: "2019-08-05T15:21:28Z"
    message: 'OAuthClientsDegraded: Unauthorized'
    reason: OAuthClientsDegradedError
    status: "True"
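
For reference, a condition like the one above can be listed for any cluster operator in one line per condition. This is a sketch, assuming the standard clusteroperator status layout (.status.conditions with type, status, reason and message); the wrapper name operator_conditions is hypothetical.

```shell
# Hypothetical wrapper: print type, status, reason and message of every
# condition reported by the named cluster operator, tab-separated.
operator_conditions() {
  oc get clusteroperator "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

Usage: operator_conditions authentication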


Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-01-113533   True        False         2d4h    Error while reconciling 4.2.0-0.nightly-2019-08-01-113533: the cluster operator monitoring is degraded

# oc version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-alpha.0-43-g86a09cad", GitCommit:"86a09cad3831361c2f1efb70c0faa1aac611d3e0", GitTreeState:"clean", BuildDate:"2019-07-31T23:47:33Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+bf9534a", GitCommit:"bf9534a", GitTreeState:"clean", BuildDate:"2019-07-31T23:43:56Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
OpenShift Version: 4.2.0-0.nightly-2019-08-01-113533



How reproducible:
Happened once so far

Steps to Reproduce:
1. IPI Install of OCP 4.2.0-0.nightly-2019-08-01-113533 on Azure
2. Initially, all the cluster operators are running and available
3. Wait at least 24 hours

Actual results:
Some operators are Degraded or Progressing/not Available; "oc get co" returns the same output as shown in the description of the problem above.


Expected results:
After install, all cluster operators should remain Available, not Progressing, and not Degraded

Additional info:

Links to the must-gather logs and individual operator pod logs are provided in the next comment

Comment 3 Mike Fiedler 2019-08-06 14:06:17 UTC

Blocks long running reliability tests.

Comment 4 Standa Laznicka 2019-08-07 06:57:47 UTC


*** This bug has been marked as a duplicate of bug 1736800 ***