Bug 1738834
Summary: | [DOCS] [upi-vmware] update vsphere cloud provider config will get the cluster stuck | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> | |
Component: | Documentation | Assignee: | Kathryn Alexander <kalexand> | |
Status: | CLOSED DEFERRED | QA Contact: | liujia <jiajliu> | |
Severity: | high | Docs Contact: | Latha S <lmurthy> | |
Priority: | high | |||
Version: | 4.2.0 | CC: | dphillip, gblomqui, kalexand, kgarriso, lmurthy | |
Target Milestone: | --- | |||
Target Release: | 4.2.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
: | 1744839 | Environment: | ||
Last Closed: | 2022-07-06 12:46:24 UTC | Type: | Bug | |
Bug Blocks: | 1744839 |
Description
liujia
2019-08-08 09:06:36 UTC
```
I0807 10:26:27.763309 2526 update.go:836] Update prepared; beginning drain
```

Looks like it's there draining, could you check:

oc get pods -n openshift-machine-config-operator

There might be the etcd-quorum-guard still rolling

(In reply to liujia from comment #0)
> # ./oc adm must-gather
> the server is currently unable to handle the request (get
> imagestreams.image.openshift.io must-gather)
> Using image: quay.io/openshift/origin-must-gather:latest
> namespace/openshift-must-gather-dc4mn created
> clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp created
> clusterrolebinding.rbac.authorization.k8s.io/must-gather-mskjp deleted
> namespace/openshift-must-gather-dc4mn deleted
> Error from server (Forbidden): pods "must-gather-" is forbidden: error
> looking up service account openshift-must-gather-dc4mn/default:
> serviceaccount "default" not found

How are you providing the kubeconfig when running `oc adm must-gather`? Are you specifying it using an environment variable? Or are you specifying it using the `--config` flag? There is a bug where it does not work using the `--config` flag.

(In reply to Matthew Staebler from comment #3)
> How are you providing the kubeconfig when running `oc adm must-gather`? Are
> you specifying it using an environment variable? Or are you specifying it
> using the `--config` flag? There is a bug where it does not work using the
> `--config` flag.

Yes, I use env variable.
# export KUBECONFIG=`pwd`/auth/kubeconfig

(In reply to Antonio Murdaca from comment #2)
> ```
> I0807 10:26:27.763309 2526 update.go:836] Update prepared; beginning drain
> ```
>
> Looks like it's there draining, could you check:
>
> oc get pods -n openshift-machine-config-operator
>
>
> There might be the etcd-quorum-guard still rolling

# ./oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP             NODE              NOMINATED NODE   READINESS GATES
etcd-quorum-guard-8646778784-2mqb8           1/1     Running   0          39h   139.178.76.6   control-plane-0   <none>           <none>
etcd-quorum-guard-8646778784-jgs2p           1/1     Running   0          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-controller-79776b8675-k7cgx   1/1     Running   0          39h   10.128.2.73    control-plane-0   <none>           <none>
machine-config-daemon-7th94                  1/1     Running   2          40h   139.178.76.5   compute-1         <none>           <none>
machine-config-daemon-856cd                  1/1     Running   2          40h   139.178.76.8   control-plane-1   <none>           <none>
machine-config-daemon-hhfmm                  1/1     Running   1          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-daemon-k7dpm                  1/1     Running   2          40h   139.178.76.6   control-plane-0   <none>           <none>
machine-config-daemon-xgfwf                  1/1     Running   2          40h   139.178.76.9   compute-0         <none>           <none>
machine-config-operator-6569bbbbdd-w4c78     1/1     Running   0          39h   10.128.2.67    control-plane-0   <none>           <none>
machine-config-server-8j9zm                  1/1     Running   1          40h   139.178.76.4   control-plane-2   <none>           <none>
machine-config-server-rjd6z                  1/1     Running   1          40h   139.178.76.6   control-plane-0   <none>           <none>
machine-config-server-xl6vq                  1/1     Running   2          40h   139.178.76.8   control-plane-1   <none>           <none>

The same issue for 4.2. So update the target_version to 4.2. And clone a bug to 4.1 for tracking.

Do you have a cluster stuck into this situation? the one you provided earlier is no longer valid I guess as it shows other errors not related to this.

Seeing in the oc get pods -n openshift-machine-config-operator above: There are 3 control plane nodes but only 2 etcd-quorum guard pods?
the quorum guard pod for control-plane-1 seems to be missing...

@liuja, the next time you encounter this issue can you please note whether there are a matching number of etcd-quorum guard pods (1 for each control-plane)?

> Following up http://file.rdu.redhat.com/kalexand/061019/osdocs404/installing/install_config/vsphere-hosts.html to try upi/vmware cluster with nodes located on multiple vCenters. After edit configmap/cloud-provider-config according to linked vmware doc[1], the cluster will never return back in normal status due to the new updated machineconfig can not apply to both master and worker nodes.

This is a very specific installation method which I have not been able to reproduce. Moving this to 4.3 we will need to work with the reporter to reproduce and document this configuration could have issues.

Also because quorum-guard is a deployment defined with 3 replicas the fact that one is missing is probably a scheduling issue vs a bug on quorum-guard/etcd.

># ./oc get co
>NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
>authentication 4.1.9 True False False 15h
>[..]

Is this even supported in 4.1? If yes this should be a bug on 4.1 if no retested using 4.2.

It should be supported on both 4.1 and 4.2 according to the labels https://github.com/openshift/openshift-docs/pull/15312. btw, this one is for 4.2, and i clone this bug to 4.1 at https://bugzilla.redhat.com/show_bug.cgi?id=1744839

As a result of further testing, it appears this is not a bug on etcd but on docs I am going to move this to documentation.
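
For anyone retesting, the check requested above (one etcd-quorum-guard pod per control-plane node) could be scripted roughly as follows. This is only a sketch, not part of any OpenShift tooling: the function name `missing_quorum_guard_nodes` is made up for illustration, the control-plane node names are passed in by hand, and it assumes the `-o wide` column layout shown in this bug (NODE in the 7th column, RESTARTS as a plain integer).

```shell
#!/usr/bin/env bash
# Sketch only: print each listed control-plane node that has no Running
# etcd-quorum-guard pod. The pod listing is read from stdin.
# Assumptions (not from any official tool): input is the output of
#   oc get pods -n openshift-machine-config-operator -o wide
# with NAME in column 1, STATUS in column 3 and NODE in column 7.
missing_quorum_guard_nodes() {
  awk -v nodes="$*" '
    # Remember every node that hosts a Running quorum-guard pod.
    $1 ~ /^etcd-quorum-guard-/ && $3 == "Running" { seen[$7] = 1 }
    END {
      n = split(nodes, want, " ")
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) print want[i]
    }
  '
}

# Usage (hypothetical node names, taken from the listing in this bug):
#   oc get pods -n openshift-machine-config-operator -o wide \
#     | missing_quorum_guard_nodes control-plane-0 control-plane-1 control-plane-2
```

Fed the pod listing pasted earlier in this bug, this prints control-plane-1, the node whose quorum-guard pod was noted as missing. Empty output means every listed node has a Running guard pod.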