Bug 1998673

Summary: Machine Config Daemon pod takes a long time to terminate due to "Got SIGTERM, but actively updating"
Product: OpenShift Container Platform Reporter: Sai Sindhur Malleni <smalleni>
Component: Machine Config Operator Assignee: Yu Qi Zhang <jerzhang>
Status: CLOSED DUPLICATE QA Contact: Jian Zhang <jiazha>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.7 CC: dblack, jerzhang, murali, skumari
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-08-30 17:32:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
machine-config-daemon logs none

Description Sai Sindhur Malleni 2021-08-27 22:42:56 UTC
Created attachment 1818463
machine-config-daemon logs

Description of problem:
In an OpenShift upgrade on baremetal from 4.7.11 to 4.7.24, during the upgrade of the machine-config operator, the machine-config-daemon pods take a long time to terminate. This not only causes the machine-config operator to degrade but also adds significantly to the overall upgrade time. From what I've seen, each pod stays stuck in Terminating for at least 5-7 minutes.


Log snippet
===============================================================================
I0827 18:12:04.150851    5901 update.go:1292] Deleting stale data
I0827 18:12:04.163526    5901 update.go:1735] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0827 18:12:04.182142    5901 update.go:1904] Node has Desired Config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef, skipping reboot
I0827 18:12:04.183722    5901 daemon.go:802] Current config: rendered-worker-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.183732    5901 daemon.go:803] Desired config: rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.193152    5901 daemon.go:1151] Completing pending config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.193166    5901 update.go:1904] completed update for config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 18:12:04.194679    5901 daemon.go:1167] In desired config rendered-worker-group-4-be7070fffc9b1bb28637063800e3cfef
I0827 22:34:51.888447    5901 daemon.go:586] Got SIGTERM, but actively updating
=============================================================================
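
For reference, the gap between the last "In desired config" line and the SIGTERM line in the snippet above can be worked out from the log stamps. This is only a minimal sketch: the helper name gapBetween is made up for illustration, and it assumes both stamps fall in the same year, since these log lines do not record one.

```go
package main

import (
	"fmt"
	"time"
)

// klogLayout matches the "MMDD hh:mm:ss.uuuuuu" stamp that follows the
// severity letter in the log lines above (no year is recorded).
const klogLayout = "0102 15:04:05.000000"

// gapBetween is a hypothetical helper: it returns the time elapsed between
// two such stamps, assuming both belong to the same year.
func gapBetween(earlier, later string) (time.Duration, error) {
	t0, err := time.Parse(klogLayout, earlier)
	if err != nil {
		return 0, err
	}
	t1, err := time.Parse(klogLayout, later)
	if err != nil {
		return 0, err
	}
	return t1.Sub(t0), nil
}

func main() {
	// Stamps copied from the "In desired config" and "Got SIGTERM" lines.
	gap, err := gapBetween("0827 18:12:04.194679", "0827 22:34:51.888447")
	if err != nil {
		panic(err)
	}
	fmt.Println(gap) // prints "4h22m47.693768s"
}
```

So the daemon reported "actively updating" more than four hours after it logged that the update had completed.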


[kni@e16-h18-b03-fc640 kube-burner-templates]$ oc get pods -o wide
NAME                                         READY   STATUS        RESTARTS   AGE     IP                NODE              NOMINATED NODE   READINESS GATES
machine-config-controller-7d9bcdf859-mg54q   1/1     Running       1          2d23h   10.128.0.23       master-0          <none>           <none>
machine-config-daemon-2bxlp                  2/2     Running       0          16m     192.168.216.52    worker039-fc640   <none>           <none>
machine-config-daemon-2jsk4                  2/2     Running       0          2d22h   192.168.216.61    worker048-fc640   <none>           <none>
machine-config-daemon-2mvbq                  2/2     Running       0          16m     192.168.216.113   worker100-fc640   <none>           <none>
machine-config-daemon-2xbq4                  2/2     Running       0          2d22h   192.168.216.53    worker040-fc640   <none>           <none>
machine-config-daemon-2xpwl                  2/2     Running       0          2d19h   192.168.216.94    worker081-fc640   <none>           <none>
machine-config-daemon-2xzdc                  2/2     Running       0          2d19h   192.168.216.106   worker093-fc640   <none>           <none>
machine-config-daemon-474qf                  2/2     Running       0          2d22h   192.168.216.75    worker062-fc640   <none>           <none>
machine-config-daemon-49j6b                  2/2     Running       0          2d22h   192.168.216.81    worker068-fc640   <none>           <none>
machine-config-daemon-4b2tw                  2/2     Running       0          16m     192.168.216.69    worker056-fc640   <none>           <none>
machine-config-daemon-4h4sr                  2/2     Running       0          2d22h   192.168.216.29    worker016-fc640   <none>           <none>
machine-config-daemon-4rrwz                  2/2     Running       0          2d22h   192.168.216.56    worker043-fc640   <none>           <none>
machine-config-daemon-598mz                  2/2     Running       0          26m     192.168.216.99    worker086-fc640   <none>           <none>
machine-config-daemon-5c6kk                  2/2     Terminating   0          2d22h   192.168.216.80    worker067-fc640   <none>           <none>
machine-config-daemon-5j9wp                  2/2     Running       0          26m     192.168.216.78    worker065-fc640   <none>           <none>
machine-config-daemon-5lk7r                  2/2     Running       0          16m     192.168.216.44    worker031-fc640   <none>           <none>
machine-config-daemon-5x4xm                  2/2     Running       0          2d22h   192.168.216.66    worker053-fc640   <none>           <none>
machine-config-daemon-5xqmw                  2/2     Running       0          2d22h   192.168.216.27    worker014-fc640   <none>           <none>
machine-config-daemon-69f54                  2/2     Running       0          2d22h   192.168.216.91    worker078-fc640   <none>           <none>
machine-config-daemon-6cs8x                  2/2     Running       0          16m     192.168.216.129   worker116-fc640   <none>           <none>
machine-config-daemon-6nxlj                  2/2     Running       0          2d19h   192.168.216.118   worker105-fc640   <none>           <none>
machine-config-daemon-6q6n4                  2/2     Running       0          2d22h   192.168.216.45    worker032-fc640   <none>           <none>
machine-config-daemon-6qsq5                  2/2     Running       0          2d19h   192.168.216.130   worker117-r640    <none>           <none>
machine-config-daemon-6r8w8                  2/2     Running       0          2d19h   192.168.216.107   worker094-fc640   <none>           <none>
machine-config-daemon-76vw9                  2/2     Running       0          2d19h   192.168.216.110   worker097-fc640   <none>           <none>
machine-config-daemon-7fzv2                  2/2     Running       0          6m10s   192.168.216.30    worker017-fc640   <none>           <none>
machine-config-daemon-87gpz                  2/2     Running       0          2d19h   192.168.216.102   worker089-fc640   <none>           <none>
machine-config-daemon-8dx2l                  2/2     Running       0          26m     192.168.216.59    worker046-fc640   <none>           <none>
machine-config-daemon-9b94m                  2/2     Running       0          26m     192.168.216.37    worker024-fc640   <none>           <none>
machine-config-daemon-9bvcm                  2/2     Terminating   0          2d22h   192.168.216.89    worker076-fc640   <none>           <none>
machine-config-daemon-9twkq                  2/2     Running       0          2d22h   192.168.216.76    worker063-fc640   <none>           <none>
machine-config-daemon-9w2mg                  2/2     Running       0          16m     192.168.216.16    worker003-fc640   <none>           <none>
machine-config-daemon-9wtq9                  2/2     Running       0          2d19h   192.168.216.126   worker113-fc640   <none>           <none>
machine-config-daemon-b26hp                  2/2     Running       0          2d22h   192.168.216.25    worker012-fc640   <none>           <none>
machine-config-daemon-bv7d5                  2/2     Running       0          6m11s   192.168.216.95    worker082-fc640   <none>           <none>
machine-config-daemon-bw7cs                  2/2     Running       0          6m21s   192.168.216.22    worker009-fc640   <none>           <none>
machine-config-daemon-c5wqs                  2/2     Running       0          2d22h   192.168.216.48    worker035-fc640   <none>           <none>
machine-config-daemon-c66nv                  2/2     Terminating   0          2d22h   192.168.216.72    worker059-fc640   <none>           <none>
machine-config-daemon-cghbz                  2/2     Running       0          2d22h   192.168.216.39    worker026-fc640   <none>           <none>
machine-config-daemon-cjxkx                  2/2     Terminating   0          2d22h   192.168.216.21    worker008-fc640   <none>           <none>
machine-config-daemon-cx8t4                  2/2     Running       0          2d22h   192.168.216.88    worker075-fc640   <none>           <none>
machine-config-daemon-dgskf                  2/2     Running       0          26m     192.168.216.86    worker073-fc640   <none>           <none>
machine-config-daemon-dtkjx                  2/2     Running       0          2d19h   192.168.216.122   worker109-fc640   <none>           <none>
machine-config-daemon-f4npx                  2/2     Running       0          26m     192.168.216.108   worker095-fc640   <none>           <none>
machine-config-daemon-fjrs2                  2/2     Running       0          26m     192.168.216.82    worker069-fc640   <none>           <none>
machine-config-daemon-fn2nk                  2/2     Running       0          2d22h   192.168.216.73    worker060-fc640   <none>           <none>
machine-config-daemon-g8dsp                  2/2     Running       0          2d22h   192.168.216.32    worker019-fc640   <none>           <none>
machine-config-daemon-g96sq                  2/2     Terminating   0          2d19h   192.168.216.93    worker080-fc640   <none>           <none>
machine-config-daemon-gl4sv                  2/2     Terminating   0          2d22h   192.168.216.70    worker057-fc640   <none>           <none>
machine-config-daemon-gr9ls                  2/2     Running       0          2d22h   192.168.216.57    worker044-fc640   <none>           <none>
machine-config-daemon-h9ltw                  2/2     Running       0          6m17s   192.168.216.14    worker001-fc640   <none>           <none>
machine-config-daemon-hgt59                  2/2     Running       0          2d19h   192.168.216.97    worker084-fc640   <none>           <none>
machine-config-daemon-hpb4r                  2/2     Running       0          2d22h   192.168.216.24    worker011-fc640   <none>           <none>
machine-config-daemon-hrhnc                  2/2     Running       0          2d22h   192.168.216.35    worker022-fc640   <none>           <none>
machine-config-daemon-hzrwt                  2/2     Running       0          2d22h   192.168.216.47    worker034-fc640   <none>           <none>
machine-config-daemon-jgsrh                  2/2     Running       0          2d22h   192.168.216.68    worker055-fc640   <none>           <none>
machine-config-daemon-jhqxl                  2/2     Running       0          16m     192.168.216.46    worker033-fc640   <none>           <none>
machine-config-daemon-jtb8b                  2/2     Running       0          2d22h   192.168.216.31    worker018-fc640   <none>           <none>
machine-config-daemon-jtz4f                  2/2     Terminating   0          2d19h   192.168.216.105   worker092-fc640   <none>           <none>
machine-config-daemon-kdzwz                  2/2     Running       0          2d22h   192.168.216.60    worker047-fc640   <none>           <none>
machine-config-daemon-kgjmq                  2/2     Running       0          2d22h   192.168.216.87    worker074-fc640   <none>           <none>
machine-config-daemon-kgrjk                  2/2     Terminating   0          2d22h   192.168.216.85    worker072-fc640   <none>           <none>
machine-config-daemon-kh2fm                  2/2     Running       0          2d19h   192.168.216.115   worker102-fc640   <none>           <none>
machine-config-daemon-krbsz                  2/2     Running       0          26m     192.168.216.74    worker061-fc640   <none>           <none>
machine-config-daemon-ktckx                  2/2     Running       0          2d19h   192.168.216.119   worker106-fc640   <none>           <none>
machine-config-daemon-ktz4x                  2/2     Running       0          2d23h   192.168.216.11    master-1          <none>           <none>
machine-config-daemon-l7vhs                  2/2     Running       0          2d19h   192.168.216.112   worker099-fc640   <none>           <none>
machine-config-daemon-lbbvp                  2/2     Running       0          6m27s   192.168.216.127   worker114-fc640   <none>           <none>
machine-config-daemon-lct9m                  2/2     Running       0          16m     192.168.216.104   worker091-fc640   <none>           <none>
machine-config-daemon-ldbbl                  2/2     Running       0          2d22h   192.168.216.65    worker052-fc640   <none>           <none>
machine-config-daemon-lfv9p                  2/2     Running       0          16m     192.168.216.50    worker037-fc640   <none>           <none>
machine-config-daemon-lks2p                  2/2     Running       0          2d22h   192.168.216.34    worker021-fc640   <none>           <none>
machine-config-daemon-lrbzb                  2/2     Running       0          2d22h   192.168.216.67    worker054-fc640   <none>           <none>
machine-config-daemon-lw7qh                  2/2     Running       0          2d19h   192.168.216.103   worker090-fc640   <none>           <none>
machine-config-daemon-m65cw                  2/2     Running       0          2d19h   192.168.216.125   worker112-fc640   <none>           <none>
machine-config-daemon-m9jr6                  2/2     Running       0          26m     192.168.216.49    worker036-fc640   <none>           <none>
machine-config-daemon-mgv64                  2/2     Running       0          2d22h   192.168.216.13    worker000-fc640   <none>           <none>
machine-config-daemon-mkspb                  2/2     Running       0          2d22h   192.168.216.36    worker023-fc640   <none>           <none>
machine-config-daemon-msbpk                  2/2     Running       0          2d22h   192.168.216.26    worker013-fc640   <none>           <none>
machine-config-daemon-mvsp6                  2/2     Running       0          26m     192.168.216.63    worker050-fc640   <none>           <none>
machine-config-daemon-nz4d2                  2/2     Running       0          2d22h   192.168.216.79    worker066-fc640   <none>           <none>
machine-config-daemon-p4xcb                  2/2     Running       0          2d22h   192.168.216.62    worker049-fc640   <none>           <none>
machine-config-daemon-pdnn5                  2/2     Running       0          2d23h   192.168.216.10    master-0          <none>           <none>
machine-config-daemon-pq5zx                  2/2     Running       0          6m23s   192.168.216.90    worker077-fc640   <none>           <none>
machine-config-daemon-pt45j                  2/2     Running       0          2d19h   192.168.216.116   worker103-fc640   <none>           <none>
machine-config-daemon-pvdrf                  2/2     Running       0          6m17s   192.168.216.38    worker025-fc640   <none>           <none>
machine-config-daemon-pvr8d                  2/2     Running       0          36m     192.168.216.18    worker005-fc640   <none>           <none>
machine-config-daemon-pvs9w                  2/2     Running       0          2d22h   192.168.216.43    worker030-fc640   <none>           <none>
machine-config-daemon-pvw2k                  2/2     Running       0          6m29s   192.168.216.71    worker058-fc640   <none>           <none>
machine-config-daemon-pwbbk                  2/2     Running       0          2d19h   192.168.216.117   worker104-fc640   <none>           <none>
machine-config-daemon-pz2lq                  2/2     Running       0          16m     192.168.216.17    worker004-fc640   <none>           <none>
machine-config-daemon-qpd76                  2/2     Running       0          16m     192.168.216.40    worker027-fc640   <none>           <none>
machine-config-daemon-qvbht                  2/2     Running       0          2d22h   192.168.216.54    worker041-fc640   <none>           <none>
machine-config-daemon-rbv6b                  2/2     Running       0          16m     192.168.216.121   worker108-fc640   <none>           <none>
machine-config-daemon-rghkc                  2/2     Running       0          6m29s   192.168.216.128   worker115-fc640   <none>           <none>
machine-config-daemon-rtnzr                  2/2     Running       0          2d22h   192.168.216.83    worker070-fc640   <none>           <none>
machine-config-daemon-rx7xs                  2/2     Running       0          26m     192.168.216.33    worker020-fc640   <none>           <none>
machine-config-daemon-rzlf9                  2/2     Terminating   0          2d22h   192.168.216.41    worker028-fc640   <none>           <none>
machine-config-daemon-s5qps                  2/2     Running       0          6m7s    192.168.216.55    worker042-fc640   <none>           <none>
machine-config-daemon-s8t6c                  2/2     Running       0          2d19h   192.168.216.96    worker083-fc640   <none>           <none>
machine-config-daemon-scq75                  2/2     Running       0          2d19h   192.168.216.114   worker101-fc640   <none>           <none>
machine-config-daemon-sdjh2                  2/2     Running       0          2d22h   192.168.216.84    worker071-fc640   <none>           <none>
machine-config-daemon-sh8hr                  2/2     Running       0          26m     192.168.216.58    worker045-fc640   <none>           <none>
machine-config-daemon-sjrgd                  2/2     Running       0          5m59s   192.168.216.51    worker038-fc640   <none>           <none>
machine-config-daemon-sqvf8                  2/2     Terminating   0          2d22h   192.168.216.42    worker029-fc640   <none>           <none>
machine-config-daemon-tcbnk                  2/2     Running       0          2d22h   192.168.216.77    worker064-fc640   <none>           <none>
machine-config-daemon-td2bl                  2/2     Running       0          2d19h   192.168.216.111   worker098-fc640   <none>           <none>
machine-config-daemon-tw9xw                  2/2     Running       0          6m      192.168.216.19    worker006-fc640   <none>           <none>
machine-config-daemon-vkpw8                  2/2     Running       0          2d19h   192.168.216.123   worker110-fc640   <none>           <none>
machine-config-daemon-w8hd2                  2/2     Running       0          2d22h   192.168.216.23    worker010-fc640   <none>           <none>
machine-config-daemon-w9l6h                  2/2     Running       0          6m15s   192.168.216.109   worker096-fc640   <none>           <none>
machine-config-daemon-wctwv                  2/2     Running       0          2d19h   192.168.216.120   worker107-fc640   <none>           <none>
machine-config-daemon-wf69d                  2/2     Running       0          2d19h   192.168.216.100   worker087-fc640   <none>           <none>
machine-config-daemon-wrg7l                  2/2     Terminating   0          2d19h   192.168.216.124   worker111-fc640   <none>           <none>
machine-config-daemon-wvfs5                  2/2     Running       0          6m22s   192.168.216.28    worker015-fc640   <none>           <none>
machine-config-daemon-x98qj                  2/2     Running       0          26m     192.168.216.15    worker002-fc640   <none>           <none>
machine-config-daemon-xcr72                  2/2     Running       0          16m     192.168.216.12    master-2          <none>           <none>
machine-config-daemon-xzv49                  2/2     Terminating   0          2d22h   192.168.216.20    worker007-fc640   <none>           <none>
machine-config-daemon-zmz9b                  2/2     Running       0          2d22h   192.168.216.92    worker079-fc640   <none>           <none>
machine-config-daemon-zrlrp                  2/2     Running       0          2d19h   192.168.216.101   worker088-fc640   <none>           <none>
machine-config-daemon-zzpcc                  2/2     Running       0          2d19h   192.168.216.98    worker085-fc640   <none>           <none>
machine-config-operator-b67f5997c-5qcwn      1/1     Running       0          38m     10.129.0.6        master-1          <none>           <none>
machine-config-server-4v4fx                  1/1     Running       0          2d23h   192.168.216.11    master-1          <none>           <none>
machine-config-server-p9fvw                  1/1     Running       0          2d23h   192.168.216.12    master-2          <none>           <none>
machine-config-server-vf58h                  1/1     Running       0          2d23h   192.168.216.10    master-0          <none>           <none>

===========================================================================
[kni@e16-h18-b03-fc640 kube-burner-templates]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.24    True        False         False      126m
baremetal                                  4.7.24    True        False         False      2d23h
cloud-credential                           4.7.24    True        False         False      2d23h
cluster-autoscaler                         4.7.24    True        False         False      2d23h
config-operator                            4.7.24    True        False         False      2d23h
console                                    4.7.24    True        False         False      133m
csi-snapshot-controller                    4.7.24    True        False         False      126m
dns                                        4.7.24    True        False         False      2d23h
etcd                                       4.7.24    True        False         False      2d23h
image-registry                             4.7.24    True        False         False      3h58m
ingress                                    4.7.24    True        False         False      2d22h
insights                                   4.7.24    True        False         False      2d23h
kube-apiserver                             4.7.24    True        False         False      2d23h
kube-controller-manager                    4.7.24    True        False         False      2d23h
kube-scheduler                             4.7.24    True        False         False      2d23h
kube-storage-version-migrator              4.7.24    True        False         False      2d22h
machine-api                                4.7.24    True        False         False      2d23h
machine-approver                           4.7.24    True        False         False      2d23h
machine-config                             4.7.11    False       True          True       39m
marketplace                                4.7.24    True        False         False      134m
monitoring                                 4.7.24    True        False         False      132m
network                                    4.7.24    True        False         False      2d23h
node-tuning                                4.7.24    True        False         False      135m
openshift-apiserver                        4.7.24    True        False         False      126m
openshift-controller-manager               4.7.24    True        False         False      2d23h
openshift-samples                          4.7.24    True        False         False      135m
operator-lifecycle-manager                 4.7.24    True        False         False      2d23h
operator-lifecycle-manager-catalog         4.7.24    True        False         False      2d23h
operator-lifecycle-manager-packageserver   4.7.24    True        False         False      135m
service-ca                                 4.7.24    True        False         False      2d23h
storage                                    4.7.24    True        False         False      2d23h
=========================================================================
Version-Release number of selected component (if applicable):
4.7.11 -> 4.7.24 upgrade

How reproducible:
100%

Steps to Reproduce:
1. Kick off an upgrade on the cluster
2. Wait until the machine-config operator is updated
3. Observe the status of the machine-config-daemon pods

Actual results:
Pods are stuck in terminating for a long time

Expected results:
machine-config-daemon pods, like other pods, should terminate gracefully shortly after receiving SIGTERM

Additional info:

Comment 1 Sinny Kumari 2021-08-30 16:41:44 UTC
This could be related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1995853. Before the update, was there a MachineConfig change applied that didn't require a node reboot?

Comment 2 Yu Qi Zhang 2021-08-30 17:32:04 UTC
In the attached MCD logs, we see:

I0827 18:11:43.227189    6634 update.go:1904] Node has Desired Config rendered-worker-group-3-be7070fffc9b1bb28637063800e3cfef, skipping reboot

So it is very likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1995853, which is in the backport process. I will mark this as a duplicate and up the urgency of that.

If you would like to make sure, please attach a must-gather with the full logs, so we can see what the update was and the timing.

*** This bug has been marked as a duplicate of bug 1995853 ***

Comment 3 Sai Sindhur Malleni 2021-08-30 18:19:30 UTC
(In reply to Sinny Kumari from comment #1)
> This could be related to BZ
> https://bugzilla.redhat.com/show_bug.cgi?id=1995853? Before update, was
> there a MachineConfig change applied that didn't require node reboot?

Yes, this is a large 120-node environment, so we split the existing worker nodes into 11 MCPs, and since the configuration didn't change, a reboot was not required.

Comment 4 Sai Sindhur Malleni 2021-08-30 18:24:24 UTC
So yes, I did split up the worker nodes into multiple MCPs before the upgrade, so they got added to a new MCP without needing a reboot. So rebootless upgrades are the trigger even for https://bugzilla.redhat.com/show_bug.cgi?id=1995853, right?

Comment 5 Yu Qi Zhang 2021-08-30 19:41:58 UTC
Correct. https://bugzilla.redhat.com/show_bug.cgi?id=1995853 would manifest if you perform a rebootless update of any kind, and then another update. So it sounds like a duplicate.
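
To illustrate the sequence described above, here is a toy model, not the actual machine-config-daemon code. It assumes, purely for illustration, that the daemon tracks an "update active" flag, that a reboot would normally reset it by restarting the process, and that a rebootless completion might leave it set. The type and method names (daemon, beginUpdate, finishRebootless, onSIGTERM) are hypothetical; the real cause and fix are tracked in the linked BZ.

```go
package main

import (
	"fmt"
	"sync"
)

// daemon is a hypothetical stand-in, not the real machine-config-daemon type.
type daemon struct {
	mu           sync.Mutex
	updateActive bool
}

// beginUpdate marks an update as in progress.
func (d *daemon) beginUpdate() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.updateActive = true
}

// finishRebootless models an update that completes without a reboot.
// clearFlag=false models the assumed stale state; clearFlag=true models
// a completion path that resets the flag.
func (d *daemon) finishRebootless(clearFlag bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if clearFlag {
		d.updateActive = false
	}
}

// onSIGTERM reports whether the process may exit right away. In a real
// process this would be called from a goroutine fed by signal.Notify.
func (d *daemon) onSIGTERM() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.updateActive {
		fmt.Println("Got SIGTERM, but actively updating")
		return false // shutdown is deferred until the grace period expires
	}
	return true
}

func main() {
	stale := &daemon{}
	stale.beginUpdate()
	stale.finishRebootless(false)
	fmt.Println("stale flag, exit immediately:", stale.onSIGTERM())

	reset := &daemon{}
	reset.beginUpdate()
	reset.finishRebootless(true)
	fmt.Println("reset flag, exit immediately:", reset.onSIGTERM())
}
```

Under these assumptions, the first daemon refuses to exit on SIGTERM even though no update is running, which would match the long Terminating state seen in this report.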

The fix is already in 4.9 and 4.8, if you would like to test that. Otherwise, we need to wait for patch manager approval of the linked BZ for 4.7.