Bug 1833284

Summary: The machineconfig state is stuck in `Working` for one node due to an error while evicting pod "router-default-X"
Product: OpenShift Container Platform
Component: Machine Config Operator
Reporter: xiyuan
Assignee: Antonio Murdaca <amurdaca>
QA Contact: Michael Nguyen <mnguyen>
CC: hongli, kgarriso, smilner
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Version: 4.2.z
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-05-20 17:59:14 UTC
Attachments:
- compute_1_machine-config-daemon-9k62m.log
- machine-config-controller-dd6bd7cc5-twt4t.log

Description xiyuan 2020-05-08 10:22:25 UTC
*Description of problem:*
The machineconfig state is stuck in `Working` for one node due to an error while evicting pod "router-default-X"
# oc describe node | grep  machineconfiguration.openshift.io/state
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Working
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/state: Done


*Version-Release number of selected component (if applicable):*
4.2.30-x86_64

*How reproducible:*
Only once

*Steps to Reproduce:*
 1. install a UPI vSphere disconnected cluster, and scale up 3 RHEL nodes
 2. mirror image:
 # oc adm release mirror  --from=registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-05-07-171148 --to=vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release --to-release-image=vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.3.0-0.nightly-2020-05-07-171148
 3. create an ImageContentSourcePolicy:
 # oc create -f - << EOF
> apiVersion: operator.openshift.io/v1alpha1
> kind: ImageContentSourcePolicy
> metadata:
>   name: example
> spec:
>   repositoryDigestMirrors:
>   - mirrors:
>     - vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release
>     source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
>   - mirrors:
>     - vsphere.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release
>     source: registry.svc.ci.openshift.org/ocp/release
> EOF
 4. Run `oc describe node | grep machineconfig` to ensure every node's machineconfig state is `Done`
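The check in step 4 can also be scripted instead of eyeballed. A minimal sketch (assuming the `oc describe node` output has already been captured to a string; the helper name and sample text are illustrative, not part of any OpenShift tooling) that flags any node whose machineconfig state is not `Done`:

```python
import re

def nodes_not_done(describe_output: str) -> list:
    """Scan `oc describe node` output and return (node, state) pairs
    whose machineconfiguration.openshift.io/state is not Done.

    Relies on `oc describe node` printing a `Name:` line for each node
    before that node's annotations.
    """
    current = None
    bad = []
    for line in describe_output.splitlines():
        m = re.match(r"Name:\s+(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.search(r"machineconfiguration\.openshift\.io/state:\s+(\S+)", line)
        if m and m.group(1) != "Done":
            bad.append((current, m.group(1)))
    return bad

# Trimmed-down sample mimicking the output shape shown in this report.
sample = """\
Name:               compute-1
Annotations:        machineconfiguration.openshift.io/state: Working
Name:               compute-2
Annotations:        machineconfiguration.openshift.io/state: Done
"""
print(nodes_not_done(sample))  # → [('compute-1', 'Working')]
```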

*Actual results:*
The compute-1 node shows machineconfig state `Working`; however, the other nodes show `Done`.

# oc describe node
              
Name:               compute-1
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=compute-1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-c2a0628c4b495939f2f1d70907b10365
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-7becd32232fe3362f41961d7ae89ce54
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 07 May 2020 21:46:35 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  compute-1
  AcquireTime:     <unset>
  RenewTime:       Fri, 08 May 2020 04:02:14 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:43 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:43 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:43 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 08 May 2020 04:02:03 -0400   Thu, 07 May 2020 22:16:53 -0400   KubeletReady                 kubelet is posting ready status

# oc logs machine-config-daemon-9k62m -n openshift-machine-config-operator
...
I0508 02:16:54.112791    2646 daemon.go:678] Current config: rendered-worker-b2f7e3e33c5d3004cace8ffcfa91c5b6
I0508 02:16:54.112806    2646 daemon.go:679] Desired config: rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.119530    2646 update.go:993] Disk currentConfig rendered-worker-c2a0628c4b495939f2f1d70907b10365 overrides node annotation rendered-worker-b2f7e3e33c5d3004cace8ffcfa91c5b6
I0508 02:16:54.123482    2646 daemon.go:892] Validating against pending config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.126062    2646 daemon.go:908] Validated on-disk state
I0508 02:16:54.140736    2646 daemon.go:942] Completing pending config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.147994    2646 update.go:993] completed update for config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 02:16:54.150319    2646 daemon.go:955] In desired config rendered-worker-c2a0628c4b495939f2f1d70907b10365
I0508 06:40:22.312840    2646 update.go:201] Checking Reconcilable for config rendered-worker-c2a0628c4b495939f2f1d70907b10365 to rendered-worker-7becd32232fe3362f41961d7ae89ce54
I0508 06:40:22.316829    2646 update.go:993] Starting update from rendered-worker-c2a0628c4b495939f2f1d70907b10365 to rendered-worker-7becd32232fe3362f41961d7ae89ce54: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false}
I0508 06:40:22.322374    2646 update.go:993] Update prepared; beginning drain
I0508 06:40:22.331700    2646 update.go:89] cordoned node "compute-1"
I0508 06:40:22.393807    2646 update.go:93] deleting pods with local storage: alertmanager-main-0, alertmanager-main-2, grafana-854f66454c-b7bgj, kube-state-metrics-7bdc885788-z8z89, prometheus-adapter-7f649db549-lv56d, prometheus-k8s-1; ignoring DaemonSet-managed pods: tuned-72hb7, dns-default-2vz7l, node-ca-gfxk6, machine-config-daemon-9k62m, node-exporter-lnwfr, multus-mknp2, ovs-7thrh, sdn-cndjz
I0508 06:40:22.449312    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:25.907502    2646 update.go:89] pod "kube-state-metrics-7bdc885788-z8z89" removed (evicted)
I0508 06:40:26.108030    2646 update.go:89] pod "alertmanager-main-0" removed (evicted)
I0508 06:40:26.707929    2646 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0508 06:40:27.455158    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:27.908037    2646 update.go:89] pod "downloads-65d5666bb5-gwzrc" removed (evicted)
I0508 06:40:28.520458    2646 update.go:89] pod "grafana-854f66454c-b7bgj" removed (evicted)
I0508 06:40:29.530267    2646 update.go:89] pod "prometheus-k8s-1" removed (evicted)
I0508 06:40:32.460947    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:33.529952    2646 update.go:89] pod "prometheus-adapter-7f649db549-lv56d" removed (evicted)
I0508 06:40:37.467765    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:42.472982    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 06:40:47.478749    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
I0508 09:54:39.256122    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 09:54:44.262612    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0508 09:54:49.268884    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
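The repeating log lines above come from the MCD's drain loop: an eviction rejected because of a PodDisruptionBudget is simply retried 5 seconds later, indefinitely, and the node annotation stays `Working` the whole time. A minimal sketch of that retry pattern (the `try_evict` callback, `max_attempts` escape hatch, and function name are illustrative assumptions, not the MCO's actual code):

```python
import time

RETRY_INTERVAL = 5  # seconds, matching the "(will retry after 5s)" in the log

def drain_pod(pod: str, try_evict, sleep=time.sleep, max_attempts=None):
    """Keep retrying an eviction until it succeeds; return the attempt count.

    `try_evict(pod)` returns True on success, False when the API server
    rejects the eviction (e.g. it would violate a PodDisruptionBudget).
    """
    attempts = 0
    while True:
        attempts += 1
        if try_evict(pod):
            return attempts
        if max_attempts and attempts >= max_attempts:
            raise RuntimeError(f"giving up on {pod} after {attempts} attempts")
        print(f'error when evicting pod "{pod}" (will retry after {RETRY_INTERVAL}s)')
        sleep(RETRY_INTERVAL)

# Simulate a PDB that blocks the first three attempts, then allows eviction.
results = iter([False, False, False, True])
attempts = drain_pod("router-default-85f8d8cfff-c7268",
                     lambda pod: next(results), sleep=lambda s: None)
print(attempts)  # → 4
```

In this report the eviction never succeeds, so the loop never exits and the state never transitions back to `Done`.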

# oc logs router-default-85f8d8cfff-c7268 -n openshift-ingress
I0508 02:20:29.665627       1 template.go:303] Starting template router (v4.2.30-202004200449)
I0508 02:20:29.668365       1 metrics.go:147] Router health and metrics port listening at 0.0.0.0:1936 on HTTP and HTTPS
I0508 02:20:29.676089       1 router.go:306] Watching "/etc/pki/tls/private" for changes
E0508 02:20:29.677616       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0508 02:20:29.718681       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 02:20:29.718713       1 router.go:255] Router is including routes in all namespaces
I0508 02:20:29.955147       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 02:20:34.958016       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 02:20:39.958305       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
E0508 02:20:43.094677       1 reflector.go:322] github.com/openshift/router/pkg/router/template/service_lookup.go:32: Failed to watch *v1.Service: Get https://172.30.0.1:443/api/v1/services?resourceVersion=30400&timeoutSeconds=344&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0508 02:20:43.095469       1 reflector.go:322] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to watch *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?resourceVersion=30422&timeoutSeconds=449&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
I0508 02:20:44.966015       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
...
 - Health check ok : 0 retry attempt(s).
W0508 09:40:23.215903       1 reflector.go:341] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0508 09:53:57.189106       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 09:54:05.728030       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0508 09:54:06.322228       1 reflector.go:341] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0508 09:54:10.714367       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 09:54:15.727002       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0508 09:54:20.723666       1 router.go:561] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0508 10:09:22.377285       1 reflector.go:341] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
# oc get pod --all-namespaces | grep -Ev "Running|Completed"
NAMESPACE                                               NAME                                                              READY   STATUS             RESTARTS   AGE
openshift-cluster-storage-operator                      cluster-storage-operator-6d55d8dbbd-x2ttp                         0/1     ImagePullBackOff   0          3h53m
openshift-cluster-version                               cluster-version-operator-66ff49777b-rjd44                         0/1     ImagePullBackOff   0          3h53m
openshift-console                                       downloads-65d5666bb5-4g7qz                                        0/1     ImagePullBackOff   0          3h51m
openshift-ingress-operator                              ingress-operator-7b5b669578-t9j7j                                 0/1     ImagePullBackOff   0          3h55m
openshift-ingress                                       router-default-85f8d8cfff-sbs2p                                   0/1     ImagePullBackOff   0          161m
openshift-marketplace                                   certified-operators-64d745c9c6-9twrq                              0/1     ImagePullBackOff   0          53m
openshift-marketplace                                   certified-operators-6c6b6f4c5d-srf2h                              0/1     ImagePullBackOff   0          53m
openshift-marketplace                                   community-operators-848b86fc84-5zdfx                              0/1     ImagePullBackOff   0          3h25m
openshift-marketplace                                   redhat-operators-7dbc98b497-tt2pm                                 0/1     ImagePullBackOff   0          3h25m
openshift-monitoring                                    openshift-state-metrics-594bdd755d-v969v                          2/3     ImagePullBackOff   0          3h55m
openshift-monitoring                                    prometheus-adapter-7f649db549-rjgpd                               0/1     ImagePullBackOff   0          160m



*Expected results:*
All nodes' machineconfig state should show `Done`.

*Additional info:*
All of the steps above are preparation steps for an upgrade.

Comment 1 xiyuan 2020-05-08 11:03:30 UTC
Created attachment 1686442 [details]
compute_1_machine-config-daemon-9k62m.log

Comment 2 xiyuan 2020-05-08 11:04:34 UTC
Created attachment 1686443 [details]
machine-config-controller-dd6bd7cc5-twt4t.log

Comment 3 Antonio Murdaca 2020-05-08 11:39:51 UTC
I don't understand what's wrong here. If the router can't yet be evicted, it's correct for the MCD to show `Working`, as that's what it is doing. Once the MCD finishes and evicts the pod, the state goes to `Done`; if it remains `Working` after that, that's a bug, but it doesn't sound like the case here.

Comment 5 Antonio Murdaca 2020-05-08 11:41:46 UTC
I0508 09:54:49.268884    2646 update.go:89] error when evicting pod "router-default-85f8d8cfff-c7268" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.


The log line above isn't actually an error either, so the assumption that `Working` is the wrong state is incorrect. That condition can very well be transient.


openshift-marketplace                                   community-operators-848b86fc84-5zdfx                              0/1     ImagePullBackOff   0          3h25m



ImagePullBackOff is a completely different error, so if that's happening on a node, it's highly unlikely the MCO is playing a role there.
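The PDB rejection itself is mechanical: the eviction API permits a voluntary disruption only while the budget's `disruptionsAllowed` is positive. A simplified sketch of that arithmetic for a `minAvailable`-style budget (a model for illustration, not the controller's code). Note the likely connection to the pod list above: the other router replica, router-default-85f8d8cfff-sbs2p, is stuck in ImagePullBackOff, so only one router pod is healthy and no eviction can be allowed:

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """disruptionsAllowed for a minAvailable PodDisruptionBudget:
    how many healthy pods may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)

# Likely situation in this report (assumed: 2 router replicas, minAvailable=1):
# the second replica is in ImagePullBackOff, so only one pod is healthy and
# the eviction of router-default-85f8d8cfff-c7268 is rejected.
print(disruptions_allowed(healthy_pods=1, min_available=1))  # → 0
# With both replicas healthy, one eviction would be allowed:
print(disruptions_allowed(healthy_pods=2, min_available=1))  # → 1
```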

Comment 6 Kirsten Garrison 2020-05-08 23:43:09 UTC
Can you please provide a full must-gather for this cluster?

Comment 9 Red Hat Bugzilla 2023-09-14 05:57:40 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days