Bug 1734276

Summary: [DOCS] upi/vsphere cluster with nodes on multiple vCenters can not succeed
Product: OpenShift Container Platform
Reporter: liujia <jiajliu>
Component: Documentation
Assignee: Max Bridges <mbridges>
Status: POST
QA Contact: liujia <jiajliu>
Severity: high
Docs Contact: Vikram Goyal <vigoyal>
Priority: high
Version: 4.1.z
CC: aos-bugs, dphillip, jokerman, kalexand, mbridges, mstaeble, rkant, rsunog, suchaudh
Target Milestone: ---
Flags: jiajliu: needinfo-
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified

Comment 3 Matthew Staebler 2019-08-05 16:36:07 UTC
This looks like an issue with rolling out a new revision of the api server. I assume that the control plane node was cordoned so that the pods running on it could be drained. I am not sure whether that is possible with only a single control plane node.
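
For reference, one way to check whether a control plane node is currently cordoned for the rollout (a sketch only; the node name is an example):

# oc get nodes
# oc describe node control-plane-1 | grep -i unschedulable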

Comment 4 liujia 2019-08-06 05:16:09 UTC
(In reply to Matthew Staebler from comment #3)
> This looks like an issue with rolling out a new revision of the api server.
> I assume that the control plane node was cordoned so that the pods running
> on it could be drained. I am not sure whether that is possible with only a
> single control plane node.

Hi Matthew

I had the same suspicion as you about the single control plane node on the first try, so I then tried with 3 control plane nodes, but unfortunately I hit the same error: some of the operators failed to come back.

And just now I tried again with a deployment of 3 masters + 2 compute nodes, and it still fails.

[root@preserve-jliu-worker tmp]# oc get node
NAME              STATUS                     ROLES    AGE   VERSION
compute-0         Ready                      worker   19h   v1.13.4+ab8449285
compute-1         Ready                      worker   19h   v1.13.4+ab8449285
control-plane-0   Ready                      master   19h   v1.13.4+ab8449285
control-plane-1   Ready,SchedulingDisabled   master   19h   v1.13.4+ab8449285
control-plane-2   Ready                      master   19h   v1.13.4+ab8449285
[root@preserve-jliu-worker tmp]# oc get co
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.8     True        True          False      19h
cloud-credential                     4.1.8     True        False         False      19h
cluster-autoscaler                   4.1.8     True        False         False      19h
console                              4.1.8     True        False         False      19h
dns                                  4.1.8     True        False         False      19h
image-registry                       4.1.8     True        False         False      19h
ingress                              4.1.8     True        False         False      19h
kube-apiserver                       4.1.8     True        True          False      19h
kube-controller-manager              4.1.8     True        True          False      19h
kube-scheduler                       4.1.8     True        False         False      19h
machine-api                          4.1.8     True        False         False      19h
machine-config                       4.1.8     True        False         False      19h
marketplace                          4.1.8     False       False         False      99m
monitoring                           4.1.8     False       True          True       95m
network                              4.1.8     True        True          False      19h
node-tuning                          4.1.8     True        False         False      19h
openshift-apiserver                  4.1.8     False       False         False      100m
openshift-controller-manager         4.1.8     True        False         False      19h
openshift-samples                    4.1.8     True        False         False      19h
operator-lifecycle-manager           4.1.8     True        False         False      19h
operator-lifecycle-manager-catalog   4.1.8     True        False         False      19h
service-ca                           4.1.8     True        False         False      19h
service-catalog-apiserver            4.1.8     True        False         False      19h
service-catalog-controller-manager   4.1.8     True        False         False      19h
storage                              4.1.8     True        False         False      19h
[root@preserve-jliu-worker tmp]# oc describe co kube-apiserver
Name:         kube-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-08-05T08:15:44Z
  Generation:          1
  Resource Version:    266783
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 38d8e61a-b759-11e9-85d3-0050568b1015
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-08-05T08:26:35Z
    Message:               StaticPodsDegraded: nodes/control-plane-2 pods/kube-apiserver-control-plane-2 container="kube-apiserver-6" is not ready
StaticPodsDegraded: nodes/control-plane-2 pods/kube-apiserver-control-plane-2 container="kube-apiserver-cert-syncer-6" is not ready
StaticPodsDegraded: nodes/control-plane-0 pods/kube-apiserver-control-plane-0 container="kube-apiserver-5" is not ready
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-08-06T02:02:23Z
    Message:               Progressing: 2 nodes are at revision 5; 1 nodes are at revision 6
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-08-05T08:20:24Z
    Message:               Available: 3 nodes are active; 2 nodes are at revision 5; 1 nodes are at revision 6
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-08-05T08:15:45Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.1.8
    Name:     kube-apiserver
    Version:  1.13.4
    Name:     operator
    Version:  4.1.8
Events:       <none>

Comment 5 davis phillips 2019-08-06 19:00:15 UTC
You can recreate this by making a relatively small change to the cloud provider configmap. (I changed insecure-flag    = 0)

Can we verify this is the process to successfully implement changes to the cloud provider configmap?

Comment 6 Matthew Staebler 2019-08-06 23:41:17 UTC
(In reply to davis phillips from comment #5)
> You can recreate this by making a relatively small change to the cloud
> provider configmap. (I changed insecure-flag    = 0)
> 
> Can we verify this is the process to successfully implement changes to the
> cloud provider configmap?

Yes, this is the expected way to change the cloud provider config. It looks like there is a bug here that needs to be filed against the OpenShift API Server.
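
For reference, the general flow for such a change is to extract the ConfigMap, edit it, and replace it. A sketch of that extract/replace pattern (the same commands used in comment 16 below; the file path is an example):

# oc extract cm/cloud-provider-config -n openshift-config --to=/tmp
# vi /tmp/config        (apply the change, e.g. insecure-flag)
# oc create cm cloud-provider-config -n openshift-config --from-file=/tmp/config --dry-run -o yaml | oc replace -f -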

Comment 7 Matthew Staebler 2019-08-06 23:44:02 UTC
It would be helpful to attach the logs from running `oc adm must-gather` for a cluster that is stuck like this.

Comment 8 liujia 2019-08-07 08:48:44 UTC
> You can recreate this by making a relatively small change to the cloud provider configmap. (I changed insecure-flag    = 0)
Tried "insecure-flag=0" and still fail.

> It would be helpful to attach the logs from running `oc adm must-gather` for a cluster that is stuck like this.
`oc adm must-gather` does not work while the cluster is in this stuck state, so all I can do is keep the cluster around as it is.

> Yes, this is the expected way to change the cloud provider config. It looks like there is a bug here that needs to be created against the OpenShift API Server.
I agree; I will file a bug against the OpenShift API server to track the issue and will paste it here later.

Comment 9 Matthew Staebler 2019-08-07 15:09:17 UTC
(In reply to liujia from comment #8)
> > It would be helpful to attach the logs from running `oc adm must-gather` for a cluster that is stuck like this.
> `oc adm must-gather` can not work when cluster in this stuck status. So what
> i can do is keeping the cluster there.
This surprises me. If `oc adm must-gather` does not work when the cluster fails, then that seems to defeat the purpose of `oc adm must-gather`. Is the cluster not functioning at all? I would have expected that the cluster would still be functioning but in a degraded state with only 2 control plane nodes instead of all 3.

Comment 10 liujia 2019-08-08 02:05:33 UTC
# ./oc adm must-gather
the server is currently unable to handle the request (get imagestreams.image.openshift.io must-gather)
Using image: quay.io/openshift/origin-must-gather:latest
namespace/openshift-must-gather-dc4mn created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp deleted
namespace/openshift-must-gather-dc4mn deleted
Error from server (Forbidden): pods "must-gather-" is forbidden: error looking up service account openshift-must-gather-dc4mn/default: serviceaccount "default" not found

It seems related to the unavailable OpenShift API server.
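
As a fallback while the OpenShift API server is unavailable, node-level logs can still be pulled over SSH (a sketch only; the host name and the core user are assumptions about this environment):

# ssh core@control-plane-0 'sudo journalctl -u kubelet --no-pager | tail -n 200'
# ssh core@control-plane-0 'sudo crictl ps -a'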

Comment 12 liujia 2019-08-08 03:28:41 UTC
I dug deeper into the stuck cluster; I think it is related to machine-config, which causes the node update to fail. I will track it in another bug.

Comment 13 liujia 2019-08-08 09:09:41 UTC
Filed a bug against the machine-config-operator to track the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1738834

Comment 16 liujia 2019-09-20 07:28:27 UTC
@davis phillips 
I tried again with the above method to update cloud-provider-config, but the cluster still failed to get back to a normal state.

1) get cm/cloud-provider-config
# oc extract cm/cloud-provider-config -n openshift-config --to=/tmp
/tmp/config

2) update /tmp/config to add vcsa2
...
[VirtualCenter "vcsa2-qe.vmware.devcluster.openshift.com"]
datacenters = dc1
secret-name      = vcenter2-creds
secret-namespace = kube-system

3) update cm/cloud-provider-config
# oc create cm cloud-provider-config -n openshift-config --from-file=/tmp/config --dry-run -o yaml| oc replace -f -
configmap/cloud-provider-config replaced

[root@preserve-jliu-worker 20190920_15060]# oc get node
NAME              STATUS                     ROLES    AGE    VERSION
compute-0         Ready                      worker   141m   v1.13.4+244797462
control-plane-0   Ready                      master   141m   v1.13.4+244797462
control-plane-1   Ready                      master   141m   v1.13.4+244797462
control-plane-2   Ready,SchedulingDisabled   master   141m   v1.13.4+244797462
[root@preserve-jliu-worker 20190920_15060]# oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-c58c9cc78f695e1e69342a9a1cde4d9d   False     True       False
worker   rendered-worker-67eb1c3ade1acf565b6334975b2a5e22   True      False      False
[root@preserve-jliu-worker 20190920_15060]# oc describe node |grep render
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-67eb1c3ade1acf565b6334975b2a5e22
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-67eb1c3ade1acf565b6334975b2a5e22
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-b325801ce7f1a75461527adaab220b3e
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-b325801ce7f1a75461527adaab220b3e
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-b325801ce7f1a75461527adaab220b3e
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-b325801ce7f1a75461527adaab220b3e
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-c58c9cc78f695e1e69342a9a1cde4d9d
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-b325801ce7f1a75461527adaab220b3e
[root@preserve-jliu-worker 20190920_15060]# oc get machineconfig|grep render
rendered-master-b325801ce7f1a75461527adaab220b3e            a2175e587b007272f26305fe7d8b603c49e8f1fc   2.2.0             60m
rendered-master-c58c9cc78f695e1e69342a9a1cde4d9d            a2175e587b007272f26305fe7d8b603c49e8f1fc   2.2.0             136m
rendered-worker-67eb1c3ade1acf565b6334975b2a5e22            a2175e587b007272f26305fe7d8b603c49e8f1fc   2.2.0             60m
rendered-worker-ab143d2160ce1a70ec80c672fc708fdd            a2175e587b007272f26305fe7d8b603c49e8f1fc   2.2.0             136m
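
To see why the master pool is stuck, the machine-config-daemon logs on the node that still reports the old rendered config can be checked (a sketch; the pod name is a placeholder to be taken from the first command):

# oc get pods -n openshift-machine-config-operator -o wide | grep daemon
# oc logs -n openshift-machine-config-operator <machine-config-daemon-pod> -c machine-config-daemon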

I captured some logs from kubelet.service:
Sep 20 06:58:53 control-plane-1 hyperkube[904]: W0920 06:56:46.565268       1 plugins.go:118] WARNING: vsphere built-in cloud provider is now deprecated. The vSphere provider is deprecated and will be removed in a future release
Sep 20 06:58:53 control-plane-1 hyperkube[904]: F0920 06:56:46.565623       1 controllermanager.go:231] error building controller context: cloud provider could not be initialized: could not init cloud provider "vsphere": warnings:
Sep 20 06:58:53 control-plane-1 hyperkube[904]: can't store data at section "VirtualCenter", subsection "vcsa2-qe.vmware.devcluster.openshift.com", variable "secret-name"
Sep 20 06:58:53 control-plane-1 hyperkube[904]: can't store data at section "VirtualCenter", subsection "vcsa2-qe.vmware.devcluster.openshift.com", variable "secret-namespace"

Could you help check whether my secret was not added correctly or whether the config was updated incorrectly?

Comment 17 liujia 2019-09-20 07:30:45 UTC
*** Bug 1744839 has been marked as a duplicate of this bug. ***

Comment 18 liujia 2019-09-20 07:32:09 UTC
Since this bug is used to track the doc issue for multiple vCenters, I am closing bz1744839 as a duplicate of this one and keeping bz1738834 open for v4.2 verification.

As for this bug, since there is no doc PR available yet for QE's final verification, I am changing the bug status back to waiting for the PR.

Comment 20 liujia 2019-09-26 06:29:08 UTC
Due to the limited resources on vSphere, I cannot keep the cluster for too long, so I am destroying it now. Please contact me if a newly reproduced cluster is needed.

Comment 27 Vikram Goyal 2019-10-14 10:55:21 UTC
*** Bug 1744954 has been marked as a duplicate of this bug. ***