Description of problem:
Following http://file.rdu.redhat.com/kalexand/061019/osdocs404/installing/install_config/vsphere-hosts.html to try a UPI/VMware cluster with nodes located on multiple vCenters. After editing configmap/cloud-provider-config according to the linked vmware doc[1], the cluster never returns to a normal status, because the newly rendered machineconfig cannot be applied to all master and worker nodes.

[1] https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/existing.html#multiple-vcenters

================================================================================
# ./oc adm must-gather
the server is currently unable to handle the request (get imagestreams.image.openshift.io must-gather)
Using image: quay.io/openshift/origin-must-gather:latest
namespace/openshift-must-gather-dc4mn created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp created
clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp deleted
namespace/openshift-must-gather-dc4mn deleted
Error from server (Forbidden): pods "must-gather-" is forbidden: error looking up service account openshift-must-gather-dc4mn/default: serviceaccount "default" not found

Because must-gather cannot run in this stuck cluster, I tried to give more info to help debug, and will leave the cluster for dev to dig into further.

1) Checked that some clusteroperators are not healthy.
# ./oc describe co kube-apiserver
...
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-08-07T09:56:54Z
    Message:               StaticPodsDegraded: nodes/control-plane-0 pods/kube-apiserver-control-plane-0 container="kube-apiserver-7" is not ready
StaticPodsDegraded: nodes/control-plane-0 pods/kube-apiserver-control-plane-0 container="kube-apiserver-cert-syncer-7" is not ready
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-08-07T10:15:52Z
    Message:               Progressing: 1 nodes are at revision 6; 2 nodes are at revision 7
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-08-07T09:44:14Z
    Message:               Available: 3 nodes are active; 1 nodes are at revision 6; 2 nodes are at revision 7
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-08-07T09:34:13Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
...

# ./oc describe co machine-config
...
Status:
  Conditions:
    Last Transition Time:  2019-08-07T10:34:08Z
    Message:               Cluster not available for 4.1.9
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-08-07T09:43:54Z
    Message:               Cluster version is 4.1.9
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-08-07T10:34:08Z
    Message:               Failed to resync 4.1.9 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 2, updated: 2, unavailable: 1)
    Reason:                FailedToSync
    Status:                True
    Type:                  Degraded
  Extension:
    Master:  2 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-741bbd2385241c3507ad5854b17862ac
    Worker:  all 2 nodes are at latest configuration rendered-worker-96b981964718500f4bc19d4a65504bfa
...

2) Checked node status and machineconfig update status, and found that only control-plane-2 is stuck applying the new machine config (after several tries, one of the master nodes always fails to apply the update).
# ./oc get node
NAME              STATUS                     ROLES    AGE   VERSION
compute-0         Ready                      worker   18h   v1.13.4+0cb23916f
compute-1         Ready                      worker   18h   v1.13.4+0cb23916f
control-plane-0   Ready                      master   18h   v1.13.4+0cb23916f
control-plane-1   Ready                      master   18h   v1.13.4+0cb23916f
control-plane-2   Ready,SchedulingDisabled   master   18h   v1.13.4+0cb23916f

# ./oc describe node|grep rendered
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-96b981964718500f4bc19d4a65504bfa
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c761fde1ce7bd6027b1846802116f469
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-741bbd2385241c3507ad5854b17862ac
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c761fde1ce7bd6027b1846802116f469

# ./oc get machineconfig|grep rendered
rendered-master-741bbd2385241c3507ad5854b17862ac   83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0   19h
rendered-master-c761fde1ce7bd6027b1846802116f469   83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0   18h
rendered-worker-5588104dc718990968912addaa51c557   83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0   19h
rendered-worker-96b981964718500f4bc19d4a65504bfa   83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0   18h

# ./oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-741bbd2385241c3507ad5854b17862ac   False     True       False
worker   rendered-worker-96b981964718500f4bc19d4a65504bfa   True      False      False

# ./oc describe machineconfigpool master
Name:         master
Namespace:
Labels:       operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-08-07T09:33:59Z
  Generation:          3
  Resource Version:    27412
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:                 7c188687-b8f6-11e9-9b04-0050568bbb78
Spec:
  Configuration:
    Name:  rendered-master-c761fde1ce7bd6027b1846802116f469
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-7c188687-b8f6-11e9-9b04-0050568bbb78-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  master
  Max Unavailable:  <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/master:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2019-08-07T09:35:43Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-07T09:43:53Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-07T09:43:53Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-08-07T10:16:01Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-07T10:16:01Z
    Message:
    Reason:                All nodes are updating to rendered-master-c761fde1ce7bd6027b1846802116f469
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-master-741bbd2385241c3507ad5854b17862ac
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-7c188687-b8f6-11e9-9b04-0050568bbb78-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Degraded Machine Count:     0
  Machine Count:              3
  Observed Generation:        3
  Ready Machine Count:        2
  Unavailable Machine Count:  1
  Updated Machine Count:      2
Events:                       <none>

3) Checked the new and old machine configs; the only difference is the cloud config.
# ./oc get machineconfig rendered-master-c761fde1ce7bd6027b1846802116f469 -o yaml > new_mc.yaml
# ./oc get machineconfig rendered-master-741bbd2385241c3507ad5854b17862ac -o yaml > old_mc.yaml
...
        - contents:
            source: data:,%5BGlobal%5D%0Asecret-name%20%20%20%20%20%20%3D%20vsphere-creds%0Asecret-namespace%20%3D%20kube-system%0Ainsecure-flag%20%20%20%20%3D%200%0A%0A%5BWorkspace%5D%0Aserver%20%20%20%20%20%20%20%20%20%20%20%20%3D%20vcsa-qe.vmware.devcluster.openshift.com%0Adatacenter%20%20%20%20%20%20%20%20%3D%20dc1%0Adefault-datastore%20%3D%20nfs-ds1%0Afolder%20%20%20%20%20%20%20%20%20%20%20%20%3D%20jliu-24290%0A%0A%5BVirtualCenter%20%22vcsa-qe.vmware.devcluster.openshift.com%22%5D%0Adatacenters%20%3D%20dc1%0A%5BVirtualCenter%20%22vcsa2-qe.vmware.devcluster.openshift.com%22%5D%0Adatacenters%20%3D%20dc1%0Asecret-name%20%20%20%20%20%20%3D%20vcenter2-creds%0Asecret-namespace%20%3D%20kube-system
            verification: {}
          filesystem: root
          mode: 420
          path: /etc/kubernetes/cloud.conf

4) Checked the MCD log on the control-plane-2 node.
# ./oc logs machine-config-daemon-hhfmm -n openshift-machine-config-operator
I0807 09:40:38.187045    2526 start.go:67] Version: 4.1.9-201907311355-dirty (83392b13a5c17e56656acf3f7b0031e3303fb5c0)
I0807 09:40:38.187762    2526 start.go:100] Starting node writer
I0807 09:40:38.191674    2526 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0807 09:40:38.568894    2526 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60d15a500766bb720cec7900da9b00d0ba68887087ec8242158a53f41cf19bc5 (410.8.20190801.0)
I0807 09:40:40.091552    2526 start.go:196] Calling chroot("/rootfs")
I0807 09:40:40.091602    2526 start.go:206] Starting MachineConfigDaemon
I0807 09:40:40.091711    2526 update.go:836] Starting to manage node: control-plane-2
I0807 09:40:40.291846    2526 run.go:22] Running captured: rpm-ostree status
I0807 09:40:40.329722    2526 daemon.go:740] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60d15a500766bb720cec7900da9b00d0ba68887087ec8242158a53f41cf19bc5
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190801.0 (2019-08-01T10:02:02Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190516.0 (2019-05-16T20:26:25Z)
I0807 09:40:40.329741    2526 run.go:22] Running captured: journalctl --list-boots
I0807 09:40:40.336004    2526 daemon.go:747] journalctl --list-boots:
-2 0d21277961194df69646d587027ab1ad Wed 2019-08-07 09:09:05 UTC—Wed 2019-08-07 09:13:52 UTC
-1 c1a84e082ea64b5e9241569d0b2eba71 Wed 2019-08-07 09:14:06 UTC—Wed 2019-08-07 09:38:12 UTC
 0 c8cc5f9e22814f8db83a7acedfc8882d Wed 2019-08-07 09:38:24 UTC—Wed 2019-08-07 09:40:40 UTC
I0807 09:40:40.336027    2526 daemon.go:494] Enabling Kubelet Healthz Monitor
E0807 09:41:10.294464    2526 reflector.go:134] k8s.io/client-go/informers/factory.go:132: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
E0807 09:41:10.297104    2526 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0807 09:41:12.402204    2526 update.go:723] logger doesn't support --jounald, grepping the journal
I0807 09:41:12.718313    2526 daemon.go:667] In bootstrap mode
I0807 09:41:12.718334    2526 daemon.go:695] Current+desired config: rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:12.718346    2526 update.go:836] using pending config same as desired
I0807 09:41:12.727265    2526 daemon.go:854] No bootstrap pivot required; unlinking bootstrap node annotations
I0807 09:41:12.727327    2526 update.go:836] Validating against pending config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:12.729671    2526 daemon.go:894] Validating against pending config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:12.897467    2526 daemon.go:904] Validated on-disk state
I0807 09:41:13.093883    2526 update.go:801] logger doesn't support --jounald, logging json directly
I0807 09:41:13.194552    2526 daemon.go:938] Completing pending config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:13.393431    2526 update.go:836] completed update for config rendered-master-741bbd2385241c3507ad5854b17862ac
I0807 09:41:13.593751    2526 daemon.go:951] In desired config rendered-master-741bbd2385241c3507ad5854b17862ac
E0807 09:46:04.201162    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=15, ErrCode=NO_ERROR, debug=""
E0807 09:46:04.201899    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=15, ErrCode=NO_ERROR, debug=""
W0807 09:46:04.280917    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 3853 (8147)
E0807 09:49:34.230488    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=7, ErrCode=NO_ERROR, debug=""
E0807 09:49:34.230783    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=7, ErrCode=NO_ERROR, debug=""
W0807 09:49:37.433475    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 8147 (12467)
E0807 10:00:13.863151    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0807 10:00:13.863632    2526 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
W0807 10:00:14.163932    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 12467 (16992)
W0807 10:20:35.037056    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 21264 (23000)
W0807 10:23:25.410505    2526 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1.Node ended with: very short watch: k8s.io/client-go/informers/factory.go:132: Unexpected watch close - watch lasted less than a second and no items received
W0807 10:23:25.754556    2526 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 23000 (25279)
I0807 10:26:25.893097    2526 update.go:174] Checking reconcilable for config rendered-master-741bbd2385241c3507ad5854b17862ac to rendered-master-c761fde1ce7bd6027b1846802116f469
I0807 10:26:25.960870    2526 update.go:836] Starting update from rendered-master-741bbd2385241c3507ad5854b17862ac to rendered-master-c761fde1ce7bd6027b1846802116f469
I0807 10:26:25.972660    2526 update.go:388] Updating files
I0807 10:26:25.972690    2526 update.go:590] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0807 10:26:26.060795    2526 update.go:590] Writing file "/etc/etcd/etcd.conf"
I0807 10:26:26.065888    2526 update.go:590] Writing file "/etc/kubernetes/manifests/etcd-member.yaml"
I0807 10:26:26.159548    2526 update.go:590] Writing file "/etc/kubernetes/static-pod-resources/etcd-member/ca.crt"
I0807 10:26:26.161569    2526 update.go:590] Writing file "/etc/kubernetes/static-pod-resources/etcd-member/metric-ca.crt"
I0807 10:26:26.164905    2526 update.go:590] Writing file "/etc/kubernetes/static-pod-resources/etcd-member/root-ca.crt"
I0807 10:26:26.166876    2526 update.go:590] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0807 10:26:26.168295    2526 update.go:590] Writing file "/var/lib/kubelet/config.json"
I0807 10:26:26.258376    2526 update.go:590] Writing file "/etc/kubernetes/ca.crt"
I0807 10:26:26.262001    2526 update.go:590] Writing file "/etc/sysctl.d/forward.conf"
I0807 10:26:26.264040    2526 update.go:590] Writing file "/usr/local/bin/etcd-member-recover.sh"
I0807 10:26:26.265444    2526 update.go:590] Writing file "/usr/local/bin/etcd-snapshot-backup.sh"
I0807 10:26:26.266993    2526 update.go:590] Writing file "/usr/local/bin/etcd-snapshot-restore.sh"
I0807 10:26:26.358069    2526 update.go:590] Writing file "/usr/local/bin/recover-kubeconfig.sh"
I0807 10:26:26.361154    2526 update.go:590] Writing file "/usr/local/bin/openshift-recovery-tools"
I0807 10:26:26.363067    2526 update.go:590] Writing file "/usr/local/bin/tokenize-signer.sh"
I0807 10:26:26.364559    2526 update.go:590] Writing file "/usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template"
I0807 10:26:26.458831    2526 update.go:590] Writing file "/usr/local/share/openshift-recovery/template/kube-etcd-cert-signer.yaml.template"
I0807 10:26:26.860185    2526 update.go:590] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0807 10:26:26.958303    2526 update.go:590] Writing file "/etc/containers/registries.conf"
I0807 10:26:27.059967    2526 update.go:590] Writing file "/etc/containers/storage.conf"
I0807 10:26:27.160150    2526 update.go:590] Writing file "/etc/crio/crio.conf"
I0807 10:26:27.261613    2526 update.go:590] Writing file "/etc/kubernetes/cloud.conf"
I0807 10:26:27.363001    2526 update.go:590] Writing file "/etc/kubernetes/kubelet-ca.crt"
I0807 10:26:27.459489    2526 update.go:590] Writing file "/etc/kubernetes/kubelet.conf"
I0807 10:26:27.461190    2526 update.go:545] Writing systemd unit "kubelet.service"
I0807 10:26:27.464222    2526 update.go:562] Enabling systemd unit "kubelet.service"
I0807 10:26:27.464374    2526 update.go:482] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0807 10:26:27.464395    2526 update.go:407] Deleting stale data
I0807 10:26:27.464420    2526 update.go:652] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0807 10:26:27.657774    2526 update.go:801] logger doesn't support --jounald, logging json directly
I0807 10:26:27.763309    2526 update.go:836] Update prepared; beginning drain

Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
100%

Steps to Reproduce:
We have two vSphere servers (vcsa and vcsa2) with different user/passwd.
My steps are as follows:

1. Set up the cluster on vcsa.

2. Create secret vcenter2-creds with vcsa2's user/passwd (this step is not in the doc, but I think it should be added).
# ./oc create -f vcenter2-creds
secret/vcenter2-creds created
# ./oc get secret vcenter2-creds -n kube-system -o yaml
apiVersion: v1
data:
  vcsa2-qe.vmware.devcluster.openshift.com.password: xxx
  vcsa2-qe.vmware.devcluster.openshift.com.username: xxx
kind: Secret
metadata:
  creationTimestamp: "2019-08-07T10:10:18Z"
  name: vcenter2-creds
  namespace: kube-system
  resourceVersion: "19890"
  selfLink: /api/v1/namespaces/kube-system/secrets/vcenter2-creds
  uid: 8f010060-b8fb-11e9-bd37-0050568ba9d6
type: Opaque

3. Update the vSphere cloud config.
# ./oc edit configmap/cloud-provider-config -n openshift-config
configmap/cloud-provider-config edited
# ./oc get configmap/cloud-provider-config -n openshift-config -o yaml
apiVersion: v1
data:
  config: |
    [Global]
    secret-name      = vsphere-creds
    secret-namespace = kube-system
    insecure-flag    = 1

    [Workspace]
    server            = vcsa-qe.vmware.devcluster.openshift.com
    datacenter        = dc1
    default-datastore = nfs-ds1
    folder            = jliu-24290

    [VirtualCenter "vcsa-qe.vmware.devcluster.openshift.com"]
    datacenters = dc1

    [VirtualCenter "vcsa2-qe.vmware.devcluster.openshift.com"]
    datacenters = dc1
    secret-name      = vcenter2-creds
    secret-namespace = kube-system
kind: ConfigMap
metadata:
  creationTimestamp: "2019-08-07T09:30:24Z"
  name: cloud-provider-config
  namespace: openshift-config
  resourceVersion: "21153"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/cloud-provider-config
  uid: fc3bd27a-b8f5-11e9-9b04-0050568bbb78

4. Check the cluster and operator status (even after 16hrs+, it still cannot recover).
# ./oc get node
NAME              STATUS                     ROLES    AGE   VERSION
compute-0         Ready                      worker   16h   v1.13.4+0cb23916f
compute-1         Ready                      worker   16h   v1.13.4+0cb23916f
control-plane-0   Ready                      master   16h   v1.13.4+0cb23916f
control-plane-1   Ready                      master   16h   v1.13.4+0cb23916f
control-plane-2   Ready,SchedulingDisabled   master   16h   v1.13.4+0cb23916f
# ./oc get co
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.9     True        False         False      15h
cloud-credential                     4.1.9     True        False         False      16h
cluster-autoscaler                   4.1.9     True        False         False      16h
console                              4.1.9     True        False         False      15h
dns                                  4.1.9     True        False         False      16h
image-registry                       4.1.9     True        False         False      15h
ingress                              4.1.9     True        False         False      16h
kube-apiserver                       4.1.9     True        True          False      16h
kube-controller-manager              4.1.9     True        False         False      16h
kube-scheduler                       4.1.9     True        False         False      16h
machine-api                          4.1.9     True        False         False      16h
machine-config                       4.1.9     False       False         True       15h
marketplace                          4.1.9     True        False         False      3m38s
monitoring                           4.1.9     False       True          True       15h
network                              4.1.9     True        True          False      16h
node-tuning                          4.1.9     True        False         False      16h
openshift-apiserver                  4.1.9     True        False         False      15h
openshift-controller-manager         4.1.9     True        False         False      16h
openshift-samples                    4.1.9     True        False         False      15h
operator-lifecycle-manager           4.1.9     True        False         False      16h
operator-lifecycle-manager-catalog   4.1.9     True        False         False      16h
service-ca                           4.1.9     True        False         False      16h
service-catalog-apiserver            4.1.9     True        False         False      16h
service-catalog-controller-manager   4.1.9     True        False         False      16h
storage                              4.1.9     True        False         False      16h

Actual results:
The cluster is stuck.

Expected results:
The cluster should return to a working status after updating the vSphere cloud provider config.

Additional info:
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1734276. We have a doc bug for the steps to deploy a UPI/vSphere cluster on multiple vCenters, and it is now blocked on the issue above, so I'm filing a separate bug to track the issue against machine-config.
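For reference, the MCO renders the ConfigMap from step 3 into a URL-encoded Ignition data URL for /etc/kubernetes/cloud.conf (see section 3 of the description). The sketch below decodes such a payload and lists the per-vCenter credential secrets; Python's configparser stands in for the cloud provider's own Go INI parser, so treat the parsing details as an illustration of the expected structure, not the real implementation. The string is an excerpt of the actual rendered value (the [Workspace] section is omitted).

```python
import configparser
from urllib.parse import unquote

# Excerpt of the Ignition data URL from the rendered machineconfig.
source = ("data:,%5BGlobal%5D%0Asecret-name%20%20%20%20%20%20%3D%20vsphere-creds%0A"
          "secret-namespace%20%3D%20kube-system%0A"
          "%5BVirtualCenter%20%22vcsa-qe.vmware.devcluster.openshift.com%22%5D%0A"
          "datacenters%20%3D%20dc1%0A"
          "%5BVirtualCenter%20%22vcsa2-qe.vmware.devcluster.openshift.com%22%5D%0A"
          "datacenters%20%3D%20dc1%0A"
          "secret-name%20%20%20%20%20%20%3D%20vcenter2-creds%0A"
          "secret-namespace%20%3D%20kube-system")

# Strip the "data:," prefix and percent-decode the INI payload.
cloud_conf = unquote(source.split(",", 1)[1])

cfg = configparser.ConfigParser()
cfg.read_string(cloud_conf)

# Each [VirtualCenter "<server>"] section names one vCenter; a section may
# override the [Global] credentials secret for that server.
secrets = {
    section.split('"')[1]: cfg[section].get("secret-name",
                                            cfg["Global"]["secret-name"])
    for section in cfg.sections()
    if section.startswith("VirtualCenter")
}
print(secrets)
```

Decoding confirms the rendered file carries the vcenter2-creds override only for the second vCenter, while the first falls back to the global vsphere-creds secret.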
```
I0807 10:26:27.763309 2526 update.go:836] Update prepared; beginning drain
```

Looks like it's there draining, could you check:

oc get pods -n openshift-machine-config-operator

There might be the etcd-quorum-guard still rolling
(In reply to liujia from comment #0)
> # ./oc adm must-gather
> the server is currently unable to handle the request (get
> imagestreams.image.openshift.io must-gather)
> Using image: quay.io/openshift/origin-must-gather:latest
> namespace/openshift-must-gather-dc4mn created
> clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp created
> clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp deleted
> namespace/openshift-must-gather-dc4mn deleted
> Error from server (Forbidden): pods "must-gather-" is forbidden: error
> looking up service account openshift-must-gather-dc4mn/default:
> serviceaccount "default" not found

How are you providing the kubeconfig when running `oc adm must-gather`? Are you specifying it using an environment variable? Or are you specifying it using the `--config` flag? There is a bug where it does not work using the `--config` flag.
(In reply to Matthew Staebler from comment #3)
> How are you providing the kubeconfig when running `oc adm must-gather`? Are
> you specifying it using an environment variable? Or are you specifying it
> using the `--config` flag? There is a bug where it does not work using the
> `--config` flag.

I use an environment variable:
# export KUBECONFIG=`pwd`/auth/kubeconfig
(In reply to Antonio Murdaca from comment #2)
> Looks like it's there draining, could you check:
>
> oc get pods -n openshift-machine-config-operator
>
> There might be the etcd-quorum-guard still rolling

# ./oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP             NODE              NOMINATED NODE   READINESS GATES
etcd-quorum-guard-8646778784-2mqb8           1/1     Running   0          39h   139.178.76.6   control-plane-0   <none>           <none>
etcd-quorum-guard-8646778784-jgs2p           1/1     Running   0          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-controller-79776b8675-k7cgx   1/1     Running   0          39h   10.128.2.73    control-plane-0   <none>           <none>
machine-config-daemon-7th94                  1/1     Running   2          40h   139.178.76.5   compute-1         <none>           <none>
machine-config-daemon-856cd                  1/1     Running   2          40h   139.178.76.8   control-plane-1   <none>           <none>
machine-config-daemon-hhfmm                  1/1     Running   1          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-daemon-k7dpm                  1/1     Running   2          40h   139.178.76.6   control-plane-0   <none>           <none>
machine-config-daemon-xgfwf                  1/1     Running   2          40h   139.178.76.9   compute-0         <none>           <none>
machine-config-operator-6569bbbbdd-w4c78     1/1     Running   0          39h   10.128.2.67    control-plane-0   <none>           <none>
machine-config-server-8j9zm                  1/1     Running   1          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-server-rjd6z                  1/1     Running   1          40h   139.178.76.6   control-plane-0   <none>           <none>
machine-config-server-xl6vq                  1/1     Running   2          40h   139.178.76.8   control-plane-1   <none>           <none>
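A quick cross-check of the listing above: every control-plane node should host one etcd-quorum-guard pod. The sketch below hard-codes the pod-to-node pairs copied from the `-o wide` output; on a live cluster you would derive them from `oc get pods -o json` instead.

```python
# (pod name, node) pairs for the etcd-quorum-guard pods listed above.
guard_pods = [
    ("etcd-quorum-guard-8646778784-2mqb8", "control-plane-0"),
    ("etcd-quorum-guard-8646778784-jgs2p", "control-plane-2"),
]
masters = {"control-plane-0", "control-plane-1", "control-plane-2"}

# Which masters are covered by a quorum-guard pod, and which are not?
guarded = {node for _, node in guard_pods}
missing = sorted(masters - guarded)
print(missing)  # ['control-plane-1'] — one master has no quorum-guard pod
```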
The same issue occurs on 4.2, so I'm updating the target_version to 4.2 and cloning a bug to 4.1 for tracking.
Do you have a cluster stuck in this situation? The one you provided earlier is no longer valid, I guess, as it shows other errors not related to this.
Seeing in the `oc get pods -n openshift-machine-config-operator` output above: there are 3 control-plane nodes but only 2 etcd-quorum-guard pods? The quorum-guard pod for control-plane-1 seems to be missing...

@liujia, the next time you encounter this issue, can you please note whether there is a matching number of etcd-quorum-guard pods (1 for each control-plane node)?
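If a missing quorum-guard pod is indeed what blocks the drain, the mechanism would be its PodDisruptionBudget: assuming the etcd-quorum-guard deployment runs 3 replicas with maxUnavailable=1 (my assumption, not verified against this exact release), one guard pod already missing means any further eviction is refused, so the node drain never completes. A toy model of that eviction check:

```python
# Toy model of the PDB eviction check (assumption: etcd-quorum-guard runs
# 3 replicas behind a PodDisruptionBudget with maxUnavailable=1).
replicas = 3
healthy = 2          # only 2 guard pods are running in the listing above
max_unavailable = 1

currently_unavailable = replicas - healthy
# The eviction API refuses if evicting one more pod would exceed the budget.
eviction_allowed = currently_unavailable + 1 <= max_unavailable
print(eviction_allowed)  # False — the drain of a master stays blocked
```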
> Following up http://file.rdu.redhat.com/kalexand/061019/osdocs404/installing/install_config/vsphere-hosts.html
> to try upi/vmware cluster with nodes located on multiple vCenters. After
> edit configmap/cloud-provider-config according to linked vmware doc[1], the
> cluster will never return back in normal status due to the new updated
> machineconfig can not apply to both master and worker nodes.

This is a very specific installation method which I have not been able to reproduce. Moving this to 4.3; we will need to work with the reporter to reproduce it and to document that this configuration could have issues.

Also, because quorum-guard is a deployment defined with 3 replicas, the fact that one is missing is probably a scheduling issue rather than a bug in quorum-guard/etcd.

> # ./oc get co
> NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
> authentication   4.1.9     True        False         False      15h
> [..]

Is this even supported in 4.1? If yes, this should be a bug on 4.1; if not, retest using 4.2.
It should be supported on both 4.1 and 4.2 according to the labels on https://github.com/openshift/openshift-docs/pull/15312. BTW, this bug is for 4.2, and I cloned it to 4.1 at https://bugzilla.redhat.com/show_bug.cgi?id=1744839.
As a result of further testing, it appears this is not a bug on etcd but in the docs, so I am going to move this to documentation.