Bug 1738834 - [DOCS] [upi-vmware] update vsphere cloud provider config will get the cluster stuck
Summary: [DOCS] [upi-vmware] update vsphere cloud provider config will get the cluster stuck
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.z
Assignee: Kathryn Alexander
QA Contact: liujia
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks: 1744839
 
Reported: 2019-08-08 09:06 UTC by liujia
Modified: 2019-10-02 03:06 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1744839
Environment:
Last Closed:
Target Upstream Version:



Description liujia 2019-08-08 09:06:36 UTC
Description of problem:
Following http://file.rdu.redhat.com/kalexand/061019/osdocs404/installing/install_config/vsphere-hosts.html, I tried a UPI/VMware cluster with nodes located on multiple vCenters. After editing configmap/cloud-provider-config according to the linked VMware doc [1], the cluster never returns to a normal status because the newly rendered machineconfig cannot be applied to all master and worker nodes.

[1] https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/existing.html#multiple-vcenters

================================================================================
# ./oc adm must-gather
the server is currently unable to handle the request (get imagestreams.image.openshift.io must-gather)
Using image: quay.io/openshift/origin-must-gather:latest
namespace/openshift-must-gather-dc4mn created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp deleted
namespace/openshift-must-gather-dc4mn deleted
Error from server (Forbidden): pods "must-gather-" is forbidden: error looking up service account openshift-must-gather-dc4mn/default: serviceaccount "default" not found

Because must-gather cannot run against this stuck cluster, I collected the following information manually to help debug, and will leave the cluster up for developers to dig further.

1) Checked the cluster operators; some are not healthy.
# ./oc describe co kube-apiserver
...
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-08-07T09:56:54Z
    Message:               StaticPodsDegraded: nodes/control-plane-0 pods/kube-apiserver-control-plane-0 container="kube-apiserver-7" is not ready
StaticPodsDegraded: nodes/control-plane-0 pods/kube-apiserver-control-plane-0 container="kube-apiserver-cert-syncer-7" is not ready
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-08-07T10:15:52Z
    Message:               Progressing: 1 nodes are at revision 6; 2 nodes are at revision 7
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-08-07T09:44:14Z
    Message:               Available: 3 nodes are active; 1 nodes are at revision 6; 2 nodes are at revision 7
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-08-07T09:34:13Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
...

# ./oc describe co machine-config
...
Status:
  Conditions:
    Last Transition Time:  2019-08-07T10:34:08Z
    Message:               Cluster not available for 4.1.9
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-08-07T09:43:54Z
    Message:               Cluster version is 4.1.9
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-08-07T10:34:08Z
    Message:               Failed to resync 4.1.9 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 2, updated: 2, unavailable: 1)
    Reason:                FailedToSync
    Status:                True
    Type:                  Degraded
  Extension:
    Master:  2 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-741bbd2385241c3507ad5854b17862ac
    Worker:  all 2 nodes are at latest configuration rendered-worker-96b981964718500f4bc19d4a65504bfa
...

2) Checked node status and machineconfig update status; only control-plane-2 is stuck applying the new machine config (after several retries, one of the master nodes always fails to apply the update).
# ./oc get node
NAME              STATUS                     ROLES    AGE   VERSION
compute-0         Ready                      worker   18h   v1.13.4+0cb23916f
compute-1         Ready                      worker   18h   v1.13.4+0cb23916f
control-plane-0   Ready                      master   18h   v1.13.4+0cb23916f
control-plane-1   Ready                      master   18h   v1.13.4+0cb23916f
control-plane-2   Ready,SchedulingDisabled   master   18h   v1.13.4+0cb23916f

# ./oc describe node|grep rendered
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-741bbd2385241c3507ad5854b17862ac
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
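
The annotation pairs above identify the stuck node: it is the one whose `currentConfig` still lags its `desiredConfig`. A minimal Python sketch of that comparison, using sample data copied from the output above (the node names and helper function are illustrative, not part of any OpenShift tooling):

```python
# Find nodes whose MCD currentConfig lags desiredConfig, mirroring the
# `oc describe node | grep rendered` output above. Sample data reproduces
# the annotations shown in this report.
CURRENT = "machineconfiguration.openshift.io/currentConfig"
DESIRED = "machineconfiguration.openshift.io/desiredConfig"

nodes = {
    "control-plane-1": {
        CURRENT: "rendered-master-c761fde1ce7bd6027b1846802116f469",
        DESIRED: "rendered-master-c761fde1ce7bd6027b1846802116f469",
    },
    "control-plane-2": {
        CURRENT: "rendered-master-741bbd2385241c3507ad5854b17862ac",
        DESIRED: "rendered-master-c761fde1ce7bd6027b1846802116f469",
    },
}

def stuck_nodes(nodes):
    """Return names of nodes still converging to a new rendered config."""
    return [n for n, ann in nodes.items() if ann[CURRENT] != ann[DESIRED]]

print(stuck_nodes(nodes))  # → ['control-plane-2']
```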

# ./oc get machineconfig|grep rendered
rendered-master-741bbd2385241c3507ad5854b17862ac            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             19h
rendered-master-c761fde1ce7bd6027b1846802116f469            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             18h
rendered-worker-5588104dc718990968912addaa51c557            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             19h
rendered-worker-96b981964718500f4bc19d4a65504bfa            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             18h

# ./oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-741bbd2385241c3507ad5854b17862ac   False     True       False
worker   rendered-worker-96b981964718500f4bc19d4a65504bfa   True      False      False

# ./oc describe machineconfigpool master
Name:         master
Namespace:    
Labels:       operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-08-07T09:33:59Z
  Generation:          3
  Resource Version:    27412
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:                 7c188687-b8f6-11e9-9b04-0050568bbb78
Spec:
  Configuration:
    Name:  rendered-master-c761fde1ce7bd6027b1846802116f469
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-7c188687-b8f6-11e9-9b04-0050568bbb78-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  master
  Max Unavailable:                             <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/master:  
  Paused:                              false
Status:
  Conditions:
    Last Transition Time:  2019-08-07T09:35:43Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-07T09:43:53Z
    Message:               
    Reason:                
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-07T09:43:53Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-08-07T10:16:01Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-07T10:16:01Z
    Message:               
    Reason:                All nodes are updating to rendered-master-c761fde1ce7bd6027b1846802116f469
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-master-741bbd2385241c3507ad5854b17862ac
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-master
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-master-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-master-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-master-7c188687-b8f6-11e9-9b04-0050568bbb78-registries
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-master-ssh
  Degraded Machine Count:     0
  Machine Count:              3
  Observed Generation:        3
  Ready Machine Count:        2
  Unavailable Machine Count:  1
  Updated Machine Count:      2
Events:                       <none>

3) Compared the new and old machine configs; the only difference is the cloud config.
# ./oc get machineconfig rendered-master-c761fde1ce7bd6027b1846802116f469 -o yaml > new_mc.yaml
# ./oc get machineconfig rendered-master-741bbd2385241c3507ad5854b17862ac -o yaml > old_mc.yaml
...
- contents:
          source: data:,%5BGlobal%5D%0Asecret-name%20%20%20%20%20%20%3D%20vsphere-creds%0Asecret-namespace%20%3D%20kube-system%0Ainsecure-flag%20%20%20%20%3D%200%0A%0A%5BWorkspace%5D%0Aserver%20%20%20%20%20%20%20%20%20%20%20%20%3D%20vcsa-qe.vmware.devcluster.openshift.com%0Adatacenter%20%20%20%20%20%20%20%20%3D%20dc1%0Adefault-datastore%20%3D%20nfs-ds1%0Afolder%20%20%20%20%20%20%20%20%20%20%20%20%3D%20jliu-24290%0A%0A%5BVirtualCenter%20%22vcsa-qe.vmware.devcluster.openshift.com%22%5D%0Adatacenters%20%3D%20dc1%0A%5BVirtualCenter%20%22vcsa2-qe.vmware.devcluster.openshift.com%22%5D%0Adatacenters%20%3D%20dc1%0Asecret-name%20%20%20%20%20%20%3D%20vcenter2-creds%0Asecret-namespace%20%3D%20kube-system
          verification: {}
        filesystem: root
        mode: 420
        path: /etc/kubernetes/cloud.conf
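
The `source:` field above is a percent-encoded `data:,` URI containing the rendered `/etc/kubernetes/cloud.conf`. To read it, one can decode it with any URL-unescape routine; a sketch using Python's standard library (the `data_uri` here is a truncated excerpt of the value above, for illustration only):

```python
from urllib.parse import unquote

# Decode the percent-encoded `data:,` URI from the rendered MachineConfig
# to recover the cloud.conf it writes. Truncated excerpt of the value above.
data_uri = ("data:,%5BGlobal%5D%0Asecret-name%20%20%20%20%20%20%3D%20"
            "vsphere-creds%0Asecret-namespace%20%3D%20kube-system")

assert data_uri.startswith("data:,")
cloud_conf = unquote(data_uri[len("data:,"):])
print(cloud_conf)  # INI-style text beginning with [Global]
```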

4) Checked the machine-config-daemon (MCD) log on the control-plane-2 node.

# ./oc logs machine-config-daemon-hhfmm -n openshift-machine-config-operator
I0807 09:40:38.187045    2526 start.go:67] Version: 4.1.9-201907311355-dirty (83392b13a5c17e56656acf3f7b0031e3303fb5c0)
I0807 09:40:38.187762    2526 start.go:100] Starting node writer
I0807 09:40:38.191674    2526 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0807 09:40:38.568894    2526 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60d15a500766bb720cec7900da9b00d0ba68887087ec8242158a53f41cf19bc5 (410.8.20190801.0)
I0807 09:40:40.091552    2526 start.go:196] Calling chroot("/rootfs")
I0807 09:40:40.091602    2526 start.go:206] Starting MachineConfigDaemon
I0807 09:40:40.091711    2526 update.go:836] Starting to manage node: control-plane-2
I0807 09:40:40.291846    2526 run.go:22] Running captured: rpm-ostree status
I0807 09:40:40.329722    2526 daemon.go:740] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60d15a500766bb720cec7900da9b00d0ba68887087ec8242158a53f41cf19bc5
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190801.0 (2019-08-01T10:02:02Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190516.0 (2019-05-16T20:26:25Z)
I0807 09:40:40.329741    2526 run.go:22] Running captured: journalctl --list-boots
I0807 09:40:40.336004    2526 daemon.go:747] journalctl --list-boots:
-2 0d21277961194df69646d587027ab1ad Wed 2019-08-07 09:09:05 UTC—Wed 2019-08-07 09:13:52 UTC
-1 c1a84e082ea64b5e9241569d0b2eba71 Wed 2019-08-07 09:14:06 UTC—Wed 2019-08-07 09:38:12 UTC
 0 c8cc5f9e22814f8db83a7acedfc8882d Wed 2019-08-07 09:38:24 UTC—Wed 2019-08-07 09:40:40 UTC
I0807 09:40:40.336027    2526 daemon.go:494] Enabling Kubelet Healthz Monitor
E0807 09:41:10.294464    2526 reflector.go:134] k8s.io/client-go/informers/factory.go:132: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
E0807 09:41:10.297104    2526 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0807 09:41:12.402204    2526 update.go:723] logger doesn't support --jounald, grepping the journal
I0807 09:41:12.718313    2526 daemon.go:667] In bootstrap mode
I0807 09:41:12.718334    2526 daemon.go:695] Current+desired config: rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:12.718346    2526 update.go:836] using pending config same as desired
I0807 09:41:12.727265    2526 daemon.go:854] No bootstrap pivot required; unlinking bootstrap node annotations
I0807 09:41:12.727327    2526 update.go:836] Validating against pending config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:12.729671    2526 daemon.go:894] Validating against pending config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:12.897467    2526 daemon.go:904] Validated on-disk state
I0807 09:41:13.093883    2526 update.go:801] logger doesn't support --jounald, logging json directly
I0807 09:41:13.194552    2526 daemon.go:938] Completing pending config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:13.393431    2526 update.go:836] completed update for config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:13.593751    2526 daemon.go:951] In desired config rendered-master-741bbd2385241c3507ad5854b17862ac
E0807 09:46:04.201162    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=15, ErrCode=NO_ERROR, debug=""
E0807 09:46:04.201899    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=15, ErrCode=NO_ERROR, debug=""
W0807 09:46:04.280917    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 3853 (8147)
E0807 09:49:34.230488    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=7, ErrCode=NO_ERROR, debug=""
E0807 09:49:34.230783    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=7, ErrCode=NO_ERROR, debug=""
W0807 09:49:37.433475    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 8147 (12467)
E0807 10:00:13.863151    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0807 10:00:13.863632    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
W0807 10:00:14.163932    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 12467 (16992)
W0807 10:20:35.037056    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 21264 (23000)
W0807 10:23:25.410505    2526 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1.Node ended with: very short watch: k8s.io/client-go/informers/factory.go:132: Unexpected watch close - watch lasted less than a second and no items received
W0807 10:23:25.754556    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 23000 (25279)
I0807 10:26:25.893097    2526 update.go:174] Checking reconcilable for config rendered-master-741bbd2385241c3507ad5854b17862ac to rendered-master-c761fde1ce7bd6027b1846802116f469
I0807 10:26:25.960870    2526 update.go:836] Starting update from rendered-master-741bbd2385241c3507ad5854b17862ac to rendered-master-c761fde1ce7bd6027b1846802116f469
I0807 10:26:25.972660    2526 update.go:388] Updating files
I0807 10:26:25.972690    2526 update.go:590] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0807 10:26:26.060795    2526 update.go:590] Writing file "/etc/etcd/etcd.conf"
I0807 10:26:26.065888    2526 update.go:590] Writing file "/etc/kubernetes/manifests/etcd-member.yaml"
I0807 10:26:26.159548    2526 update.go:590] Writing file "/etc/kubernetes/static-pod-resources/etcd-member/ca.crt"
I0807 10:26:26.161569    2526 update.go:590] Writing file "/etc/kubernetes/static-pod-resources/etcd-member/metric-ca.crt"
I0807 10:26:26.164905    2526 update.go:590] Writing file "/etc/kubernetes/static-pod-resources/etcd-member/root-ca.crt"
I0807 10:26:26.166876    2526 update.go:590] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0807 10:26:26.168295    2526 update.go:590] Writing file "/var/lib/kubelet/config.json"
I0807 10:26:26.258376    2526 update.go:590] Writing file "/etc/kubernetes/ca.crt"
I0807 10:26:26.262001    2526 update.go:590] Writing file "/etc/sysctl.d/forward.conf"
I0807 10:26:26.264040    2526 update.go:590] Writing file "/usr/local/bin/etcd-member-recover.sh"
I0807 10:26:26.265444    2526 update.go:590] Writing file "/usr/local/bin/etcd-snapshot-backup.sh"
I0807 10:26:26.266993    2526 update.go:590] Writing file "/usr/local/bin/etcd-snapshot-restore.sh"
I0807 10:26:26.358069    2526 update.go:590] Writing file "/usr/local/bin/recover-kubeconfig.sh"
I0807 10:26:26.361154    2526 update.go:590] Writing file "/usr/local/bin/openshift-recovery-tools"
I0807 10:26:26.363067    2526 update.go:590] Writing file "/usr/local/bin/tokenize-signer.sh"
I0807 10:26:26.364559    2526 update.go:590] Writing file "/usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template"
I0807 10:26:26.458831    2526 update.go:590] Writing file "/usr/local/share/openshift-recovery/template/kube-etcd-cert-signer.yaml.template"
I0807 10:26:26.860185    2526 update.go:590] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0807 10:26:26.958303    2526 update.go:590] Writing file "/etc/containers/registries.conf"
I0807 10:26:27.059967    2526 update.go:590] Writing file "/etc/containers/storage.conf"
I0807 10:26:27.160150    2526 update.go:590] Writing file "/etc/crio/crio.conf"
I0807 10:26:27.261613    2526 update.go:590] Writing file "/etc/kubernetes/cloud.conf"
I0807 10:26:27.363001    2526 update.go:590] Writing file "/etc/kubernetes/kubelet-ca.crt"
I0807 10:26:27.459489    2526 update.go:590] Writing file "/etc/kubernetes/kubelet.conf"
I0807 10:26:27.461190    2526 update.go:545] Writing systemd unit "kubelet.service"
I0807 10:26:27.464222    2526 update.go:562] Enabling systemd unit "kubelet.service"
I0807 10:26:27.464374    2526 update.go:482] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0807 10:26:27.464395    2526 update.go:407] Deleting stale data
I0807 10:26:27.464420    2526 update.go:652] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0807 10:26:27.657774    2526 update.go:801] logger doesn't support --jounald, logging json directly
I0807 10:26:27.763309    2526 update.go:836] Update prepared; beginning drain

 

Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
100%

Steps to Reproduce:
We have two vSphere vCenters (vcsa and vcsa2) with different credentials. My steps are as follows:
1. Set up the cluster on vcsa.
2. Create secret vcenter2-creds with vcsa2's username/password (this step is not in the doc, but I think it should be added):
# ./oc create -f vcenter2-creds 
secret/vcenter2-creds created
# ./oc get secret vcenter2-creds -n kube-system -o yaml
apiVersion: v1
data:
  vcsa2-qe.vmware.devcluster.openshift.com.password: xxx
  vcsa2-qe.vmware.devcluster.openshift.com.username: xxx
kind: Secret
metadata:
  creationTimestamp: "2019-08-07T10:10:18Z"
  name: vcenter2-creds
  namespace: kube-system
  resourceVersion: "19890"
  selfLink: /api/v1/namespaces/kube-system/secrets/vcenter2-creds
  uid: 8f010060-b8fb-11e9-bd37-0050568ba9d6
type: Opaque
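
As the secret above shows, the per-vCenter credentials keys are named `<vcenter-host>.username` and `<vcenter-host>.password`. A small sketch of building that `data` map (the `vcenter_secret_data` helper and the sample credentials are hypothetical, purely to illustrate the key-naming convention):

```python
import base64

# Hypothetical helper: build the data keys for a per-vCenter credentials
# secret, following the "<host>.username" / "<host>.password" convention
# seen in the vcenter2-creds secret above. Credentials here are dummies.
def vcenter_secret_data(host, username, password):
    return {
        f"{host}.username": base64.b64encode(username.encode()).decode(),
        f"{host}.password": base64.b64encode(password.encode()).decode(),
    }

data = vcenter_secret_data("vcsa2-qe.vmware.devcluster.openshift.com",
                           "admin", "s3cret")
print(sorted(data))  # the two expected key names, host-prefixed
```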

3. Update the vSphere cloud config:
# ./oc edit configmap/cloud-provider-config -n openshift-config
configmap/cloud-provider-config edited
# ./oc get configmap/cloud-provider-config -n openshift-config -o yaml
apiVersion: v1
data:
  config: |
    [Global]
    secret-name      = vsphere-creds
    secret-namespace = kube-system
    insecure-flag    = 1

    [Workspace]
    server            = vcsa-qe.vmware.devcluster.openshift.com
    datacenter        = dc1
    default-datastore = nfs-ds1
    folder            = jliu-24290

    [VirtualCenter "vcsa-qe.vmware.devcluster.openshift.com"]
    datacenters = dc1

    [VirtualCenter "vcsa2-qe.vmware.devcluster.openshift.com"]
    datacenters = dc1
    secret-name      = vcenter2-creds
    secret-namespace = kube-system
kind: ConfigMap
metadata:
  creationTimestamp: "2019-08-07T09:30:24Z"
  name: cloud-provider-config
  namespace: openshift-config
  resourceVersion: "21153"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/cloud-provider-config
  uid: fc3bd27a-b8f5-11e9-9b04-0050568bbb78

4. Check the cluster and operator status (even after 16+ hours, it still cannot recover):
# ./oc get node
NAME              STATUS                     ROLES    AGE   VERSION
compute-0         Ready                      worker   16h   v1.13.4+0cb23916f
compute-1         Ready                      worker   16h   v1.13.4+0cb23916f
control-plane-0   Ready                      master   16h   v1.13.4+0cb23916f
control-plane-1   Ready                      master   16h   v1.13.4+0cb23916f
control-plane-2   Ready,SchedulingDisabled   master   16h   v1.13.4+0cb23916f

# ./oc get co
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.9     True        False         False      15h
cloud-credential                     4.1.9     True        False         False      16h
cluster-autoscaler                   4.1.9     True        False         False      16h
console                              4.1.9     True        False         False      15h
dns                                  4.1.9     True        False         False      16h
image-registry                       4.1.9     True        False         False      15h
ingress                              4.1.9     True        False         False      16h
kube-apiserver                       4.1.9     True        True          False      16h
kube-controller-manager              4.1.9     True        False         False      16h
kube-scheduler                       4.1.9     True        False         False      16h
machine-api                          4.1.9     True        False         False      16h
machine-config                       4.1.9     False       False         True       15h
marketplace                          4.1.9     True        False         False      3m38s
monitoring                           4.1.9     False       True          True       15h
network                              4.1.9     True        True          False      16h
node-tuning                          4.1.9     True        False         False      16h
openshift-apiserver                  4.1.9     True        False         False      15h
openshift-controller-manager         4.1.9     True        False         False      16h
openshift-samples                    4.1.9     True        False         False      15h
operator-lifecycle-manager           4.1.9     True        False         False      16h
operator-lifecycle-manager-catalog   4.1.9     True        False         False      16h
service-ca                           4.1.9     True        False         False      16h
service-catalog-apiserver            4.1.9     True        False         False      16h
service-catalog-controller-manager   4.1.9     True        False         False      16h
storage                              4.1.9     True        False         False      16h

Actual results:
The cluster is stuck.

Expected results:
The cluster should return to a working status after the vSphere cloud provider config is updated.

Additional info:
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1734276
We have a doc bug for the steps to deploy a UPI/vSphere cluster on multiple vCenters; it is now blocked on the issue above, so I am filing a separate bug to track the issue against machine config.

Comment 2 Antonio Murdaca 2019-08-08 12:19:11 UTC
```
I0807 10:26:27.763309    2526 update.go:836] Update prepared; beginning drain
```

Looks like it's stuck there draining. Could you check:

oc get pods -n openshift-machine-config-operator


The etcd-quorum-guard might still be rolling.

Comment 3 Matthew Staebler 2019-08-08 14:36:30 UTC
(In reply to liujia from comment #0)
> # ./oc adm must-gather
> the server is currently unable to handle the request (get
> imagestreams.image.openshift.io must-gather)
> Using image: quay.io/openshift/origin-must-gather:latest
> namespace/openshift-must-gather-dc4mn created
> clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp created
> clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp deleted
> namespace/openshift-must-gather-dc4mn deleted
> Error from server (Forbidden): pods "must-gather-" is forbidden: error
> looking up service account openshift-must-gather-dc4mn/default:
> serviceaccount "default" not found

How are you providing the kubeconfig when running `oc adm must-gather`? Are you specifying it using an environment variable? Or are you specifying it using the `--config` flag? There is a bug where it does not work using the `--config` flag.

Comment 4 liujia 2019-08-09 02:00:39 UTC
(In reply to Matthew Staebler from comment #3)

> How are you providing the kubeconfig when running `oc adm must-gather`? Are
> you specifying it using an environment variable? Or are you specifying it
> using the `--config` flag? There is a bug where it does not work using the
> `--config` flag.

Yes, I use the environment variable:
# export KUBECONFIG=`pwd`/auth/kubeconfig

Comment 5 liujia 2019-08-09 02:02:18 UTC
(In reply to Antonio Murdaca from comment #2)
> ```
> I0807 10:26:27.763309    2526 update.go:836] Update prepared; beginning drain
> ```
> 
> Looks like it's there draining, could you check:
> 
> oc get pods -n openshift-machine-config-operator
> 
> 
> There might be the etcd-quorum-guard still rolling

# ./oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP             NODE              NOMINATED NODE   READINESS GATES
etcd-quorum-guard-8646778784-2mqb8           1/1     Running   0          39h   139.178.76.6   control-plane-0   <none>           <none>
etcd-quorum-guard-8646778784-jgs2p           1/1     Running   0          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-controller-79776b8675-k7cgx   1/1     Running   0          39h   10.128.2.73    control-plane-0   <none>           <none>
machine-config-daemon-7th94                  1/1     Running   2          40h   139.178.76.5   compute-1         <none>           <none>
machine-config-daemon-856cd                  1/1     Running   2          40h   139.178.76.8   control-plane-1   <none>           <none>
machine-config-daemon-hhfmm                  1/1     Running   1          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-daemon-k7dpm                  1/1     Running   2          40h   139.178.76.6   control-plane-0   <none>           <none>
machine-config-daemon-xgfwf                  1/1     Running   2          40h   139.178.76.9   compute-0         <none>           <none>
machine-config-operator-6569bbbbdd-w4c78     1/1     Running   0          39h   10.128.2.67    control-plane-0   <none>           <none>
machine-config-server-8j9zm                  1/1     Running   1          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-server-rjd6z                  1/1     Running   1          40h   139.178.76.6   control-plane-0   <none>           <none>
machine-config-server-xl6vq                  1/1     Running   2          40h   139.178.76.8   control-plane-1   <none>           <none>

Comment 7 liujia 2019-08-23 00:42:01 UTC
The same issue occurs on 4.2, so I updated the target version to 4.2 and cloned a bug to 4.1 for tracking.

Comment 8 Antonio Murdaca 2019-08-23 14:57:42 UTC
Do you have a cluster stuck in this situation? The one you provided earlier is no longer valid, I guess, as it shows other errors not related to this.

Comment 9 Kirsten Garrison 2019-08-23 19:03:47 UTC
Looking at the `oc get pods -n openshift-machine-config-operator` output above:

There are 3 control-plane nodes but only 2 etcd-quorum-guard pods; the quorum-guard pod for control-plane-1 seems to be missing.

@liujia, the next time you encounter this issue, can you please note whether there is a matching number of etcd-quorum-guard pods (1 for each control-plane node)?

Comment 10 Sam Batschelet 2019-09-12 13:00:22 UTC
> Following up http://file.rdu.redhat.com/kalexand/061019/osdocs404/installing/install_config/vsphere-hosts.html to try upi/vmware cluster with nodes located on multiple vCenters. After edit configmap/cloud-provider-config according to linked vmware doc[1], the cluster will never return back in normal status due to the new updated machineconfig can not apply to both master and worker nodes.

This is a very specific installation method which I have not been able to reproduce. Moving this to 4.3; we will need to work with the reporter to reproduce it and document that this configuration could have issues. Also, because quorum-guard is a deployment defined with 3 replicas, the fact that one is missing is probably a scheduling issue rather than a bug in quorum-guard/etcd.

># ./oc get co
>NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
>authentication                       4.1.9     True        False         False      15h
>[..]

Is this even supported in 4.1? If yes, this should be a bug on 4.1; if not, it should be retested using 4.2.

Comment 11 liujia 2019-09-16 01:52:46 UTC
It should be supported on both 4.1 and 4.2, according to the labels on https://github.com/openshift/openshift-docs/pull/15312.

BTW, this one is for 4.2, and I cloned this bug to 4.1 at https://bugzilla.redhat.com/show_bug.cgi?id=1744839

Comment 16 Sam Batschelet 2019-09-19 04:20:21 UTC
As a result of further testing, it appears this is not a bug in etcd but in the docs, so I am going to move this to documentation.

