*Description of problem:*
The machineconfig state is stuck in `Working` for one node due to an error while evicting pod "router-default-X":

# oc describe node | grep machineconfiguration.openshift.io/state
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Working
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done

*Version-Release number of selected component (if applicable):*
4.2.30-x86_64

*How reproducible:*
Only once

*Steps to Reproduce:*
1. Install a UPI vSphere disconnected cluster and scale up 3 RHEL nodes
2. Mirror the release image:

# oc adm release mirror --from=registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-05-07-171148 --to=vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release --to-release-image=vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.3.0-0.nightly-2020-05-07-171148

3. Create an ImageContentSourcePolicy:

# oc create -f - << EOF
> apiVersion: operator.openshift.io/v1alpha1
> kind: ImageContentSourcePolicy
> metadata:
>   name: example
> spec:
>   repositoryDigestMirrors:
>   - mirrors:
>     - vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release
>     source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
>   - mirrors:
>     - vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release
>     source: registry.svc.ci.openshift.org/ocp/release
> EOF

4. Run `oc describe node | grep machineconfig` to verify every node's machineconfig state is `Done`

*Actual results:*
The compute-1 node shows machineconfig state `Working`; however, all other nodes show `Done`.

# oc describe node
Name:               compute-1
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=compute-1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-c2a0628c4b495939f2f1d70907b10365
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-7becd32232fe3362f41961d7ae89ce54
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 07 May 2020 21:46:35 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  compute-1
  AcquireTime:     <unset>
  RenewTime:       Fri, 08 May 2020 04:02:14 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:43 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:43 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:43 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:53 -0400   KubeletReady                 kubelet is posting ready status

# oc logs machine-config-daemon-9k62m -n openshift-machine-config-operator
...
I0508 02:16:54.112791    2646 daemon.go:678] Current config: rendered-worker-b2f7e3e33c5d3004cace8ffcfa91c5b6
I0508 02:16:54.112806    2646 daemon.go:679] Desired config: rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.119530    2646 update.go:993] Disk currentConfig rendered-worker-c2a0628c4b495939f2f1d70907b10365 overrides node annotation rendered-worker-b2f7e3e33c5d3004cace8ffcfa91c5b6
I0508 02:16:54.123482    2646 daemon.go:892] Validating against pending config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.126062    2646 daemon.go:908] Validated on-disk state
I0508 02:16:54.140736    2646 daemon.go:942] Completing pending config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.147994    2646 update.go:993] completed update for config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.150319    2646 daemon.go:955] In desired config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 06:40:22.312840    2646 update.go:201] Checking Reconcilable for config rendered-worker-c2a0628c4b495939f2f1d70907b10365 to rendered-worker-7becd32232fe3362f41961d7ae89ce54
I0508 06:40:22.316829    2646 update.go:993] Starting update from rendered-worker-c2a0628c4b495939f2f1d70907b10365 to rendered-worker-7becd32232fe3362f41961d7ae89ce54: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false}
I0508 06:40:22.322374    2646 update.go:993] Update prepared; beginning drain
I0508 06:40:22.331700    2646 update.go:89] cordoned node "compute-1"
I0508 06:40:22.393807    2646 update.go:93] deleting pods with local storage: alertmanager-main-0, alertmanager-main-2, grafana-854f66454c-b7bgj, kube-state-metrics-7bdc885788-z8z89, prometheus-adapter-7f649db549-lv56d, prometheus-k8s-1; ignoring DaemonSet-managed pods: tuned-72hb7, dns-default-2vz7l, node-ca-gfxk6, machine-config-daemon-9k62m, node-exporter-lnwfr, multus-mknp2, ovs-7thrh, sdn-cndjz
I0508 06:40:22.449312    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:25.907502    2646 update.go:89] pod "kube-state-metrics-7bdc885788-z8z89" removed (evicted)
I0508 06:40:26.108030    2646 update.go:89] pod "alertmanager-main-0" removed (evicted)
I0508 06:40:26.707929    2646 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0508 06:40:27.455158    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:27.908037    2646 update.go:89] pod "downloads-65d5666bb5-gwzrc" removed (evicted)
I0508 06:40:28.520458    2646 update.go:89] pod "grafana-854f66454c-b7bgj" removed (evicted)
I0508 06:40:29.530267    2646 update.go:89] pod "prometheus-k8s-1" removed (evicted)
I0508 06:40:32.460947    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:33.529952    2646 update.go:89] pod "prometheus-adapter-7f649db549-lv56d" removed (evicted)
I0508 06:40:37.467765    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:42.472982    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:47.478749    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
I0508 09:54:39.256122    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 09:54:44.262612    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 09:54:49.268884    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

# oc logs router-default-85f8d8cfff-c7268 -n openshift-ingress
I0508 02:20:29.665627       1 template.go:303] Starting template router (v4.2.30-202004200449)
I0508 02:20:29.668365       1 metrics.go:147] Router health and metrics port listening at 0.0.0.0:1936 on HTTP and HTTPS
I0508 02:20:29.676089       1 router.go:306] Watching "/etc/pki/tls/private" for changes
E0508 02:20:29.677616       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0508 02:20:29.718681       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 02:20:29.718713       1 router.go:255] Router is including routes in all namespaces
I0508 02:20:29.955147       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 02:20:34.958016       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 02:20:39.958305       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
E0508 02:20:43.094677       1 reflector.go:322] github.com/openshift/router/pkg/router/template/service_lookup.go:32: Failed to watch *v1.Service: Get https://172.30.0.1:443/api/v1/services?resourceVersion=30400&timeoutSeconds=344&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0508 02:20:43.095469       1 reflector.go:322] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to watch *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?resourceVersion=30422&timeoutSeconds=449&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
I0508 02:20:44.966015       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
...
 - Health check ok : 0 retry attempt(s).
W0508 09:40:23.215903       1 reflector.go:341] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0508 09:53:57.189106       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 09:54:05.728030       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0508 09:54:06.322228       1 reflector.go:341] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0508 09:54:10.714367       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 09:54:15.727002       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 09:54:20.723666       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0508 10:09:22.377285       1 reflector.go:341] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.

# oc get pod --all-namespaces | grep -Ev "Running|Completed"
NAMESPACE                            NAME                                        READY   STATUS             RESTARTS   AGE
openshift-cluster-storage-operator   cluster-storage-operator-6d55d8dbbd-x2ttp   0/1     ImagePullBackOff   0          3h53m
openshift-cluster-version            cluster-version-operator-66ff49777b-rjd44   0/1     ImagePullBackOff   0          3h53m
openshift-console                    downloads-65d5666bb5-4g7qz                  0/1     ImagePullBackOff   0          3h51m
openshift-ingress-operator           ingress-operator-7b5b669578-t9j7j           0/1     ImagePullBackOff   0          3h55m
openshift-ingress                    router-default-85f8d8cfff-sbs2p             0/1     ImagePullBackOff   0          161m
openshift-marketplace                certified-operators-64d745c9c6-9twrq        0/1     ImagePullBackOff   0          53m
openshift-marketplace                certified-operators-6c6b6f4c5d-srf2h        0/1     ImagePullBackOff   0          53m
openshift-marketplace                community-operators-848b86fc84-5zdfx        0/1     ImagePullBackOff   0          3h25m
openshift-marketplace                redhat-operators-7dbc98b497-tt2pm           0/1     ImagePullBackOff   0          3h25m
openshift-monitoring                 openshift-state-metrics-594bdd755d-v969v    2/3     ImagePullBackOff   0          3h55m
openshift-monitoring                 prometheus-adapter-7f649db549-rjgpd         0/1     ImagePullBackOff   0          160m

*Expected results:*
All nodes' machineconfig state should show `Done`.

*Additional info:*
All steps are preparation steps for an upgrade.
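For anyone triaging a similar stuck drain, a minimal sketch of how to confirm the blocking budget, assuming the default ingress controller's PodDisruptionBudget is named router-default (these commands are a suggestion, not part of the original report):

# oc get pdb -n openshift-ingress
# oc get pods -n openshift-ingress -o wide

If ALLOWED DISRUPTIONS on the PDB is 0 and the other router replica sits in ImagePullBackOff (as router-default-85f8d8cfff-sbs2p does above), the eviction will keep retrying indefinitely and the node stays in `Working`.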
Created attachment 1686442 [details] compute_1_machine-config-daemon-9k62m.log
Created attachment 1686443 [details] machine-config-controller-dd6bd7cc5-twt4t.log
I don't quite understand what's wrong here. If the router can't yet be evicted, it's correct for the MCD to show `Working`, since that's exactly what it is doing. Once the MCD finishes and evicts the pod, the state goes to `Done`. If it remains `Working` after that, that's a bug, but it doesn't sound like that's the case here.
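As a side note, a quick way to tell "still progressing" from "actually stuck" is to watch the pool counters; a sketch, assuming the node belongs to the worker pool (generic oc invocations, not taken from this bug):

# oc get machineconfigpool worker
# oc get machineconfigpool worker -o jsonpath='{.status.updatedMachineCount}/{.status.machineCount}{"\n"}'

If updatedMachineCount keeps climbing between checks, the state is transient; if it sits unchanged while the MCD loops on the same eviction, something is holding the drain.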
I0508 09:54:49.268884    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

The log line above isn't actually an error either, so the assumption that `Working` isn't the right state is wrong. That can very well be transient.

openshift-marketplace   community-operators-848b86fc84-5zdfx   0/1   ImagePullBackOff   0   3h25m

ImagePullBackOff is a completely different error, so if that's happening on a node it's highly unlikely the MCO is playing a role there.
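To rule the mirror configuration in or out for those ImagePullBackOff pods, one could check the pull error and whether the ICSP mirrors actually landed on the node; a sketch (the pod name is taken from the report above; /etc/containers/registries.conf is where the MCO renders ImageContentSourcePolicy mirrors):

# oc describe pod community-operators-848b86fc84-5zdfx -n openshift-marketplace
# oc debug node/compute-1 -- chroot /host cat /etc/containers/registries.conf

The Events section of the describe output names the registry the pull actually hit, which shows whether the request was mirrored at all.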
Can you please provide a full must-gather for this cluster?
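(For reference, a full must-gather is usually collected and attached with something like the following; the destination directory name is arbitrary:)

# oc adm must-gather --dest-dir=./must-gather
# tar czf must-gather.tar.gz must-gather/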
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days