Created attachment 1818463 [details]
machine-config-daemon logs

Description of problem:
In an OpenShift upgrade on baremetal from 4.7.11 to 4.7.24, during the upgrade of the machine-config operator, the machine-config-daemon pods take a long time to terminate. This not only causes the machine-config operator to degrade but also adds significantly to the overall upgrade time. From what I've seen, each pod stays stuck in Terminating for at least 5-7 minutes.

Log snippet
===============================================================================
I0827 18:12:04.150851 5901 update.go:1292] Deleting stale data
I0827 18:12:04.163526 5901 update.go:1735] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0827 18:12:04.182142 5901 update.go:1904] Node has Desired Config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef, skipping reboot
I0827 18:12:04.183722 5901 daemon.go:802] Current config: rendered-worker-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.183732 5901 daemon.go:803] Desired config: rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.193152 5901 daemon.go:1151] Completing pending config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.193166 5901 update.go:1904] completed update for config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.194679 5901 daemon.go:1167] In desired config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 22:34:51.888447 5901 daemon.go:586] Got SIGTERM, but actively updating
=============================================================================
[kni@e16-h18-b03-fc640 kube-burner-templates]$ oc get pods -o wide
NAME                                        READY   STATUS        RESTARTS   AGE     IP                NODE              NOMINATED NODE   READINESS GATES
machine-config-controller-7d9bcdf859-mg54q  1/1     Running       1          2d23h   10.128.0.23       master-0          <none>           <none>
machine-config-daemon-2bxlp                 2/2     Running       0          16m     192.168.216.52    worker039-fc640   <none>           <none>
machine-config-daemon-2jsk4                 2/2     Running       0          2d22h   192.168.216.61    worker048-fc640   <none>           <none>
machine-config-daemon-2mvbq                 2/2     Running       0          16m     192.168.216.113   worker100-fc640   <none>           <none>
machine-config-daemon-2xbq4                 2/2     Running       0          2d22h   192.168.216.53    worker040-fc640   <none>           <none>
machine-config-daemon-2xpwl                 2/2     Running       0          2d19h   192.168.216.94    worker081-fc640   <none>           <none>
machine-config-daemon-2xzdc                 2/2     Running       0          2d19h   192.168.216.106   worker093-fc640   <none>           <none>
machine-config-daemon-474qf                 2/2     Running       0          2d22h   192.168.216.75    worker062-fc640   <none>           <none>
machine-config-daemon-49j6b                 2/2     Running       0          2d22h   192.168.216.81    worker068-fc640   <none>           <none>
machine-config-daemon-4b2tw                 2/2     Running       0          16m     192.168.216.69    worker056-fc640   <none>           <none>
machine-config-daemon-4h4sr                 2/2     Running       0          2d22h   192.168.216.29    worker016-fc640   <none>           <none>
machine-config-daemon-4rrwz                 2/2     Running       0          2d22h   192.168.216.56    worker043-fc640   <none>           <none>
machine-config-daemon-598mz                 2/2     Running       0          26m     192.168.216.99    worker086-fc640   <none>           <none>
machine-config-daemon-5c6kk                 2/2     Terminating   0          2d22h   192.168.216.80    worker067-fc640   <none>           <none>
machine-config-daemon-5j9wp                 2/2     Running       0          26m     192.168.216.78    worker065-fc640   <none>           <none>
machine-config-daemon-5lk7r                 2/2     Running       0          16m     192.168.216.44    worker031-fc640   <none>           <none>
machine-config-daemon-5x4xm                 2/2     Running       0          2d22h   192.168.216.66    worker053-fc640   <none>           <none>
machine-config-daemon-5xqmw                 2/2     Running       0          2d22h   192.168.216.27    worker014-fc640   <none>           <none>
machine-config-daemon-69f54                 2/2     Running       0          2d22h   192.168.216.91    worker078-fc640   <none>           <none>
machine-config-daemon-6cs8x                 2/2     Running       0          16m     192.168.216.129   worker116-fc640   <none>           <none>
machine-config-daemon-6nxlj                 2/2     Running       0          2d19h   192.168.216.118   worker105-fc640   <none>           <none>
machine-config-daemon-6q6n4                 2/2     Running       0          2d22h   192.168.216.45    worker032-fc640   <none>           <none>
machine-config-daemon-6qsq5                 2/2     Running       0          2d19h   192.168.216.130   worker117-r640    <none>           <none>
machine-config-daemon-6r8w8                 2/2     Running       0          2d19h   192.168.216.107   worker094-fc640   <none>           <none>
machine-config-daemon-76vw9                 2/2     Running       0          2d19h   192.168.216.110   worker097-fc640   <none>           <none>
machine-config-daemon-7fzv2                 2/2     Running       0          6m10s   192.168.216.30    worker017-fc640   <none>           <none>
machine-config-daemon-87gpz                 2/2     Running       0          2d19h   192.168.216.102   worker089-fc640   <none>           <none>
machine-config-daemon-8dx2l                 2/2     Running       0          26m     192.168.216.59    worker046-fc640   <none>           <none>
machine-config-daemon-9b94m                 2/2     Running       0          26m     192.168.216.37    worker024-fc640   <none>           <none>
machine-config-daemon-9bvcm                 2/2     Terminating   0          2d22h   192.168.216.89    worker076-fc640   <none>           <none>
machine-config-daemon-9twkq                 2/2     Running       0          2d22h   192.168.216.76    worker063-fc640   <none>           <none>
machine-config-daemon-9w2mg                 2/2     Running       0          16m     192.168.216.16    worker003-fc640   <none>           <none>
machine-config-daemon-9wtq9                 2/2     Running       0          2d19h   192.168.216.126   worker113-fc640   <none>           <none>
machine-config-daemon-b26hp                 2/2     Running       0          2d22h   192.168.216.25    worker012-fc640   <none>           <none>
machine-config-daemon-bv7d5                 2/2     Running       0          6m11s   192.168.216.95    worker082-fc640   <none>           <none>
machine-config-daemon-bw7cs                 2/2     Running       0          6m21s   192.168.216.22    worker009-fc640   <none>           <none>
machine-config-daemon-c5wqs                 2/2     Running       0          2d22h   192.168.216.48    worker035-fc640   <none>           <none>
machine-config-daemon-c66nv                 2/2     Terminating   0          2d22h   192.168.216.72    worker059-fc640   <none>           <none>
machine-config-daemon-cghbz                 2/2     Running       0          2d22h   192.168.216.39    worker026-fc640   <none>           <none>
machine-config-daemon-cjxkx                 2/2     Terminating   0          2d22h   192.168.216.21    worker008-fc640   <none>           <none>
machine-config-daemon-cx8t4                 2/2     Running       0          2d22h   192.168.216.88    worker075-fc640   <none>           <none>
machine-config-daemon-dgskf                 2/2     Running       0          26m     192.168.216.86    worker073-fc640   <none>           <none>
machine-config-daemon-dtkjx                 2/2     Running       0          2d19h   192.168.216.122   worker109-fc640   <none>           <none>
machine-config-daemon-f4npx                 2/2     Running       0          26m     192.168.216.108   worker095-fc640   <none>           <none>
machine-config-daemon-fjrs2                 2/2     Running       0          26m     192.168.216.82    worker069-fc640   <none>           <none>
machine-config-daemon-fn2nk                 2/2     Running       0          2d22h   192.168.216.73    worker060-fc640   <none>           <none>
machine-config-daemon-g8dsp                 2/2     Running       0          2d22h   192.168.216.32    worker019-fc640   <none>           <none>
machine-config-daemon-g96sq                 2/2     Terminating   0          2d19h   192.168.216.93    worker080-fc640   <none>           <none>
machine-config-daemon-gl4sv                 2/2     Terminating   0          2d22h   192.168.216.70    worker057-fc640   <none>           <none>
machine-config-daemon-gr9ls                 2/2     Running       0          2d22h   192.168.216.57    worker044-fc640   <none>           <none>
machine-config-daemon-h9ltw                 2/2     Running       0          6m17s   192.168.216.14    worker001-fc640   <none>           <none>
machine-config-daemon-hgt59                 2/2     Running       0          2d19h   192.168.216.97    worker084-fc640   <none>           <none>
machine-config-daemon-hpb4r                 2/2     Running       0          2d22h   192.168.216.24    worker011-fc640   <none>           <none>
machine-config-daemon-hrhnc                 2/2     Running       0          2d22h   192.168.216.35    worker022-fc640   <none>           <none>
machine-config-daemon-hzrwt                 2/2     Running       0          2d22h   192.168.216.47    worker034-fc640   <none>           <none>
machine-config-daemon-jgsrh                 2/2     Running       0          2d22h   192.168.216.68    worker055-fc640   <none>           <none>
machine-config-daemon-jhqxl                 2/2     Running       0          16m     192.168.216.46    worker033-fc640   <none>           <none>
machine-config-daemon-jtb8b                 2/2     Running       0          2d22h   192.168.216.31    worker018-fc640   <none>           <none>
machine-config-daemon-jtz4f                 2/2     Terminating   0          2d19h   192.168.216.105   worker092-fc640   <none>           <none>
machine-config-daemon-kdzwz                 2/2     Running       0          2d22h   192.168.216.60    worker047-fc640   <none>           <none>
machine-config-daemon-kgjmq                 2/2     Running       0          2d22h   192.168.216.87    worker074-fc640   <none>           <none>
machine-config-daemon-kgrjk                 2/2     Terminating   0          2d22h   192.168.216.85    worker072-fc640   <none>           <none>
machine-config-daemon-kh2fm                 2/2     Running       0          2d19h   192.168.216.115   worker102-fc640   <none>           <none>
machine-config-daemon-krbsz                 2/2     Running       0          26m     192.168.216.74    worker061-fc640   <none>           <none>
machine-config-daemon-ktckx                 2/2     Running       0          2d19h   192.168.216.119   worker106-fc640   <none>           <none>
machine-config-daemon-ktz4x                 2/2     Running       0          2d23h   192.168.216.11    master-1          <none>           <none>
machine-config-daemon-l7vhs                 2/2     Running       0          2d19h   192.168.216.112   worker099-fc640   <none>           <none>
machine-config-daemon-lbbvp                 2/2     Running       0          6m27s   192.168.216.127   worker114-fc640   <none>           <none>
machine-config-daemon-lct9m                 2/2     Running       0          16m     192.168.216.104   worker091-fc640   <none>           <none>
machine-config-daemon-ldbbl                 2/2     Running       0          2d22h   192.168.216.65    worker052-fc640   <none>           <none>
machine-config-daemon-lfv9p                 2/2     Running       0          16m     192.168.216.50    worker037-fc640   <none>           <none>
machine-config-daemon-lks2p                 2/2     Running       0          2d22h   192.168.216.34    worker021-fc640   <none>           <none>
machine-config-daemon-lrbzb                 2/2     Running       0          2d22h   192.168.216.67    worker054-fc640   <none>           <none>
machine-config-daemon-lw7qh                 2/2     Running       0          2d19h   192.168.216.103   worker090-fc640   <none>           <none>
machine-config-daemon-m65cw                 2/2     Running       0          2d19h   192.168.216.125   worker112-fc640   <none>           <none>
machine-config-daemon-m9jr6                 2/2     Running       0          26m     192.168.216.49    worker036-fc640   <none>           <none>
machine-config-daemon-mgv64                 2/2     Running       0          2d22h   192.168.216.13    worker000-fc640   <none>           <none>
machine-config-daemon-mkspb                 2/2     Running       0          2d22h   192.168.216.36    worker023-fc640   <none>           <none>
machine-config-daemon-msbpk                 2/2     Running       0          2d22h   192.168.216.26    worker013-fc640   <none>           <none>
machine-config-daemon-mvsp6                 2/2     Running       0          26m     192.168.216.63    worker050-fc640   <none>           <none>
machine-config-daemon-nz4d2                 2/2     Running       0          2d22h   192.168.216.79    worker066-fc640   <none>           <none>
machine-config-daemon-p4xcb                 2/2     Running       0          2d22h   192.168.216.62    worker049-fc640   <none>           <none>
machine-config-daemon-pdnn5                 2/2     Running       0          2d23h   192.168.216.10    master-0          <none>           <none>
machine-config-daemon-pq5zx                 2/2     Running       0          6m23s   192.168.216.90    worker077-fc640   <none>           <none>
machine-config-daemon-pt45j                 2/2     Running       0          2d19h   192.168.216.116   worker103-fc640   <none>           <none>
machine-config-daemon-pvdrf                 2/2     Running       0          6m17s   192.168.216.38    worker025-fc640   <none>           <none>
machine-config-daemon-pvr8d                 2/2     Running       0          36m     192.168.216.18    worker005-fc640   <none>           <none>
machine-config-daemon-pvs9w                 2/2     Running       0          2d22h   192.168.216.43    worker030-fc640   <none>           <none>
machine-config-daemon-pvw2k                 2/2     Running       0          6m29s   192.168.216.71    worker058-fc640   <none>           <none>
machine-config-daemon-pwbbk                 2/2     Running       0          2d19h   192.168.216.117   worker104-fc640   <none>           <none>
machine-config-daemon-pz2lq                 2/2     Running       0          16m     192.168.216.17    worker004-fc640   <none>           <none>
machine-config-daemon-qpd76                 2/2     Running       0          16m     192.168.216.40    worker027-fc640   <none>           <none>
machine-config-daemon-qvbht                 2/2     Running       0          2d22h   192.168.216.54    worker041-fc640   <none>           <none>
machine-config-daemon-rbv6b                 2/2     Running       0          16m     192.168.216.121   worker108-fc640   <none>           <none>
machine-config-daemon-rghkc                 2/2     Running       0          6m29s   192.168.216.128   worker115-fc640   <none>           <none>
machine-config-daemon-rtnzr                 2/2     Running       0          2d22h   192.168.216.83    worker070-fc640   <none>           <none>
machine-config-daemon-rx7xs                 2/2     Running       0          26m     192.168.216.33    worker020-fc640   <none>           <none>
machine-config-daemon-rzlf9                 2/2     Terminating   0          2d22h   192.168.216.41    worker028-fc640   <none>           <none>
machine-config-daemon-s5qps                 2/2     Running       0          6m7s    192.168.216.55    worker042-fc640   <none>           <none>
machine-config-daemon-s8t6c                 2/2     Running       0          2d19h   192.168.216.96    worker083-fc640   <none>           <none>
machine-config-daemon-scq75                 2/2     Running       0          2d19h   192.168.216.114   worker101-fc640   <none>           <none>
machine-config-daemon-sdjh2                 2/2     Running       0          2d22h   192.168.216.84    worker071-fc640   <none>           <none>
machine-config-daemon-sh8hr                 2/2     Running       0          26m     192.168.216.58    worker045-fc640   <none>           <none>
machine-config-daemon-sjrgd                 2/2     Running       0          5m59s   192.168.216.51    worker038-fc640   <none>           <none>
machine-config-daemon-sqvf8                 2/2     Terminating   0          2d22h   192.168.216.42    worker029-fc640   <none>           <none>
machine-config-daemon-tcbnk                 2/2     Running       0          2d22h   192.168.216.77    worker064-fc640   <none>           <none>
machine-config-daemon-td2bl                 2/2     Running       0          2d19h   192.168.216.111   worker098-fc640   <none>           <none>
machine-config-daemon-tw9xw                 2/2     Running       0          6m      192.168.216.19    worker006-fc640   <none>           <none>
machine-config-daemon-vkpw8                 2/2     Running       0          2d19h   192.168.216.123   worker110-fc640   <none>           <none>
machine-config-daemon-w8hd2                 2/2     Running       0          2d22h   192.168.216.23    worker010-fc640   <none>           <none>
machine-config-daemon-w9l6h                 2/2     Running       0          6m15s   192.168.216.109   worker096-fc640   <none>           <none>
machine-config-daemon-wctwv                 2/2     Running       0          2d19h   192.168.216.120   worker107-fc640   <none>           <none>
machine-config-daemon-wf69d                 2/2     Running       0          2d19h   192.168.216.100   worker087-fc640   <none>           <none>
machine-config-daemon-wrg7l                 2/2     Terminating   0          2d19h   192.168.216.124   worker111-fc640   <none>           <none>
machine-config-daemon-wvfs5                 2/2     Running       0          6m22s   192.168.216.28    worker015-fc640   <none>           <none>
machine-config-daemon-x98qj                 2/2     Running       0          26m     192.168.216.15    worker002-fc640   <none>           <none>
machine-config-daemon-xcr72                 2/2     Running       0          16m     192.168.216.12    master-2          <none>           <none>
machine-config-daemon-xzv49                 2/2     Terminating   0          2d22h   192.168.216.20    worker007-fc640   <none>           <none>
machine-config-daemon-zmz9b                 2/2     Running       0          2d22h   192.168.216.92    worker079-fc640   <none>           <none>
machine-config-daemon-zrlrp                 2/2     Running       0          2d19h   192.168.216.101   worker088-fc640   <none>           <none>
machine-config-daemon-zzpcc                 2/2     Running       0          2d19h   192.168.216.98    worker085-fc640   <none>           <none>
machine-config-operator-b67f5997c-5qcwn     1/1     Running       0          38m     10.129.0.6        master-1          <none>           <none>
machine-config-server-4v4fx                 1/1     Running       0          2d23h   192.168.216.11    master-1          <none>           <none>
machine-config-server-p9fvw                 1/1     Running       0          2d23h   192.168.216.12    master-2          <none>           <none>
machine-config-server-vf58h                 1/1     Running       0          2d23h   192.168.216.10    master-0          <none>           <none>
===========================================================================
[kni@e16-h18-b03-fc640 kube-burner-templates]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.24    True        False         False      126m
baremetal                                  4.7.24    True        False         False      2d23h
cloud-credential                           4.7.24    True        False         False      2d23h
cluster-autoscaler                         4.7.24    True        False         False      2d23h
config-operator                            4.7.24    True        False         False      2d23h
console                                    4.7.24    True        False         False      133m
csi-snapshot-controller                    4.7.24    True        False         False      126m
dns                                        4.7.24    True        False         False      2d23h
etcd                                       4.7.24    True        False         False      2d23h
image-registry                             4.7.24    True        False         False      3h58m
ingress                                    4.7.24    True        False         False      2d22h
insights                                   4.7.24    True        False         False      2d23h
kube-apiserver                             4.7.24    True        False         False      2d23h
kube-controller-manager                    4.7.24    True        False         False      2d23h
kube-scheduler                             4.7.24    True        False         False      2d23h
kube-storage-version-migrator              4.7.24    True        False         False      2d22h
machine-api                                4.7.24    True        False         False      2d23h
machine-approver                           4.7.24    True        False         False      2d23h
machine-config                             4.7.11    False       True          True       39m
marketplace                                4.7.24    True        False         False      134m
monitoring                                 4.7.24    True        False         False      132m
network                                    4.7.24    True        False         False      2d23h
node-tuning                                4.7.24    True        False         False      135m
openshift-apiserver                        4.7.24    True        False         False      126m
openshift-controller-manager               4.7.24    True        False         False      2d23h
openshift-samples                          4.7.24    True        False         False      135m
operator-lifecycle-manager                 4.7.24    True        False         False      2d23h
operator-lifecycle-manager-catalog         4.7.24    True        False         False      2d23h
operator-lifecycle-manager-packageserver   4.7.24    True        False         False      135m
service-ca                                 4.7.24    True        False         False      2d23h
storage                                    4.7.24    True        False         False      2d23h
=========================================================================

Version-Release number of selected component (if applicable):
4.7.11 -> 4.7.24 upgrade

How reproducible:
100%

Steps to Reproduce:
1. Kick off an upgrade on the cluster
2. Wait until the machine-config operator is updated
3. Observe the status of the machine-config-daemon pods

Actual results:
Pods are stuck in Terminating for a long time

Expected results:
machine-config-daemon pods, like other pods, should terminate gracefully shortly after receiving SIGTERM

Additional info:
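To keep an eye on how many machine-config-daemon pods are stuck at a given moment, the `oc get pods -o wide` output above can be filtered with a few lines of Python. This is only a rough sketch: the function name `terminating_pods` is made up for illustration, and it assumes the whitespace-separated column order shown above (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE) with a plain integer in RESTARTS.

```python
def terminating_pods(oc_output, prefix="machine-config-daemon-"):
    """Return (pod, node) pairs whose STATUS column reads Terminating."""
    stuck = []
    for line in oc_output.splitlines():
        fields = line.split()
        # Assumed columns: NAME READY STATUS RESTARTS AGE IP NODE ...
        if len(fields) < 7 or not fields[0].startswith(prefix):
            continue  # header, blank line, or a pod we are not interested in
        if fields[2] == "Terminating":
            stuck.append((fields[0], fields[6]))
    return stuck


# Three rows copied from the listing in this report, plus the header.
sample = """\
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
machine-config-daemon-598mz 2/2 Running 0 26m 192.168.216.99 worker086-fc640 <none> <none>
machine-config-daemon-5c6kk 2/2 Terminating 0 2d22h 192.168.216.80 worker067-fc640 <none> <none>
machine-config-daemon-9bvcm 2/2 Terminating 0 2d22h 192.168.216.89 worker076-fc640 <none> <none>
"""

for pod, node in terminating_pods(sample):
    print(pod, node)
```

Feeding it the saved output of `oc get pods -o wide` at intervals gives a quick view of which nodes are still waiting on the old daemon pod.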
This could be related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1995853? Before update, was there a MachineConfig change applied that didn't require node reboot?
In the attached MCD logs, we see:

I0827 18:11:43.227189 6634 update.go:1904] Node has Desired Config rendered-worker-group-3-be7070fffc9b1bb28637063800e3cfef, skipping reboot

So it is very likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1995853, which is in the backport process. I will mark this as a duplicate and raise the urgency of that bug. If you would like to make sure, please attach a must-gather with the full logs, so we can see what the update was and the timing.

*** This bug has been marked as a duplicate of bug 1995853 ***
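The check described in this comment (a "skipping reboot" line followed later by a refused SIGTERM) can be run over a saved MCD log with a short script. A minimal sketch, assuming the two message strings appear exactly as in the log snippets in this report; `find_signature` is a hypothetical helper name, not part of any OpenShift tooling.

```python
import re

# Message texts copied from the log snippets in this report.
SKIP_REBOOT = re.compile(r"Node has Desired Config (\S+), skipping reboot")
SIGTERM_BLOCKED = "Got SIGTERM, but actively updating"


def find_signature(log_text):
    """Return (configs applied without reboot, whether a later SIGTERM was refused)."""
    rebootless = []
    sigterm_blocked = False
    for line in log_text.splitlines():
        match = SKIP_REBOOT.search(line)
        if match:
            rebootless.append(match.group(1))
        elif SIGTERM_BLOCKED in line and rebootless:
            # Only count a refused SIGTERM that comes after a rebootless update.
            sigterm_blocked = True
    return rebootless, sigterm_blocked


sample_log = """\
I0827 18:12:04.182142 5901 update.go:1904] Node has Desired Config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef, skipping reboot
I0827 18:12:04.194679 5901 daemon.go:1167] In desired config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 22:34:51.888447 5901 daemon.go:586] Got SIGTERM, but actively updating
"""

configs, blocked = find_signature(sample_log)
print(configs, blocked)
```

A result with a non-empty config list and `True` matches the pattern seen here; it does not by itself prove the two bugs are the same, which is why the full must-gather is still useful.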
(In reply to Sinny Kumari from comment #1)
> This could be related to BZ
> https://bugzilla.redhat.com/show_bug.cgi?id=1995853? Before update, was
> there a MachineConfig change applied that didn't require node reboot?

Yes, this is a large 120-node environment, so we split the existing worker nodes into 11 MCPs. Since the configuration didn't change, a reboot was not required.
So yes, I did split up the worker nodes into multiple MCPs before the upgrade, and they were added to the new MCPs without needing a reboot. So rebootless updates are the trigger for https://bugzilla.redhat.com/show_bug.cgi?id=1995853 as well, right?
Correct. https://bugzilla.redhat.com/show_bug.cgi?id=1995853 would manifest if you perform a rebootless update of any kind, and then another update. So it sounds like a duplicate. The fix is already in 4.9 and 4.8, if you would like to test that. Otherwise we need to wait for patch manager approval of the linked BZ for 4.7.
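For anyone reading along, here is a toy model of how the "rebootless update, then another update" sequence could lead to the refused SIGTERM seen in the logs. This is not the MCD source: the class `ToyDaemon`, the `update_active` flag, and the idea that the flag is left set after a rebootless update are all assumptions made for illustration, and nothing in this thread confirms the actual mechanism.

```python
class ToyDaemon:
    """Hypothetical daemon that refuses SIGTERM while it believes an update is running."""

    def __init__(self, clear_flag_without_reboot):
        self.update_active = False
        # Assumption: the unfixed behavior corresponds to False here.
        self.clear_flag_without_reboot = clear_flag_without_reboot

    def apply_update(self, needs_reboot):
        self.update_active = True
        # ... config files would be written here ...
        if needs_reboot:
            # A reboot restarts the process, so in-memory state starts fresh.
            self.update_active = False
        elif self.clear_flag_without_reboot:
            self.update_active = False

    def on_sigterm(self):
        if self.update_active:
            return "Got SIGTERM, but actively updating"
        return "shutting down"


unfixed = ToyDaemon(clear_flag_without_reboot=False)
unfixed.apply_update(needs_reboot=False)  # e.g. node moved to a new MCP, same config
print(unfixed.on_sigterm())

fixed = ToyDaemon(clear_flag_without_reboot=True)
fixed.apply_update(needs_reboot=False)
print(fixed.on_sigterm())
```

In this model the stale flag only matters on the next pod replacement, which would be consistent with the hang showing up during the later 4.7.24 upgrade rather than at the time of the MCP split.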