Bug 1750269
| Summary: | vSphere UPI: failed to initialize the cluster: Some cluster operators are still updating: authentication, console | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vijay Avuthu <vavuthu> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | aos-bugs, lsm5, mfojtik, ratamir, rphillips, sbatsche, scuppett, sttts |
| Version: | 4.2.0 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.2.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-11 00:20:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Vijay Avuthu 2019-09-09 08:00:58 UTC
It seems that the kube-apiserver pod fails to be created because networking is not available on one of the nodes; here is the reported error:
```
'Failed create pod sandbox: rpc error: code = Unknown desc = failed to
create pod network sandbox k8s_installer-5-control-plane-1_openshift-kube-apiserver_900db7f3-d2ce-11e9-8fc8-005056be0641_0(6eb4a350ef81b980482f853dc2585bcac49e5b395ab03fb7472c0736833d91e3):
Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate
add - "openshift-sdn": failed to send CNI request: Post http://dummy/: dial unix
/var/run/openshift-sdn/cniserver/socket: connect: connection refused'
```
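A minimal set of checks to confirm whether the openshift-sdn CNI server is actually up on the affected node (a sketch, not output from this cluster; `sdn-7llq6` is the pod named below, and `control-plane-1` is the node name implied by the sandbox name in the error):

```
# Is the sdn daemonset pod on the affected node Running and Ready?
oc -n openshift-sdn get pods -o wide | grep control-plane-1

# Recent logs from the sdn container of the failing pod
oc -n openshift-sdn logs sdn-7llq6 -c sdn --tail=100

# On the node, does the CNI server socket from the error message exist?
oc debug node/control-plane-1 -- chroot /host ls -l /var/run/openshift-sdn/cniserver/socket
```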
sdn-7llq6 is failing with "rm: cannot remove '/etc/cni/net.d/80-openshift-network.conf': Permission denied". How!?!? Is this cluster still up? Can we get the node journal?

(In reply to Casey Callendrello from comment #4)
> Is this cluster still up? Can we get the node journal?

The cluster is not available; it was removed after collecting all the logs.

Unfortunately must-gather doesn't actually gather everything we need to debug this. Please try to reproduce and keep the cluster up. I suspect this is an SELinux issue. Running `ls -Z /etc/cni/net.d/80-openshift-network.conf` on all the nodes would tell us whether different SELinux labels are in use (a sketch of this check is at the end of this report).

I have tried 3 times to reproduce the issue, but the installation succeeded every time. Below are the builds used for each attempt:
1st attempt: 4.2.0-0.ci-2019-09-10-121820
2nd attempt: 4.2.0-0.ci-2019-09-09-021340 (same build where I faced the issue previously)
3rd attempt: 4.2.0-0.ci-2019-09-09-021340
```
$ date;oc --kubeconfig /home/vavuthu/VJ/installations/clusterdirs/qe1/auth/kubeconfig get ClusterOperator
Wed Sep 11 00:27:32 IST 2019
NAME                 VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication       4.2.0-0.ci-2019-09-09-021340   True        False         False      3m4s
cloud-credential     4.2.0-0.ci-2019-09-09-021340   True        False         False      40m
cluster-autoscaler   4.2.0-0.ci-2019-09-09-021340   True        False         False      12m
console              4.2.0-0.ci-2019-09-09-021340   True        False         False      6m38s
dns                  4.2.0-0.ci-2019-09-09-021340   True        False         False      28m
image-registry       4.2.0-0.ci-2019-09-09-021340   True        False         False      17m
```
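A minimal sketch of the SELinux check suggested above, to run on each node (for example via `oc debug node/<node>` and `chroot /host`); the `restorecon` step is a hypothetical remediation that only applies if one node's label actually differs from the others:

```
# Compare the SELinux label of the CNI config across nodes
ls -Z /etc/cni/net.d/80-openshift-network.conf

# Look for recent AVC denials that would explain the "Permission denied" on rm
ausearch -m avc -ts recent

# If one node is mislabeled, restore the default context (hypothetical fix)
restorecon -v /etc/cni/net.d/80-openshift-network.conf
```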