Bug 1728873 - Node stuck on scheduling disabled due to MCD being degraded
Summary: Node stuck on scheduling disabled due to MCD being degraded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.1.z
Assignee: Antonio Murdaca
QA Contact: Robert Fairley
URL:
Whiteboard:
Depends On: 1732120
Blocks:
 
Reported: 2019-07-10 20:07 UTC by Abhinav Dahiya
Modified: 2019-09-12 18:56 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-12 18:56:13 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 995 None closed Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot 2020-07-17 00:52:41 UTC
Red Hat Product Errata RHBA-2019:2681 None None None 2019-09-12 18:56:19 UTC

Description Abhinav Dahiya 2019-07-10 20:07:04 UTC
I've had a node which has been in 'SchedulingDisabled' state for a week due to the following annotations on the node:

```
Annotations:        machine.openshift.io/machine: openshift-machine-api/02cd6e91-gn7q6-worker-us-west-1b-j8rn8
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-7f899c2c109ce7eea3cde169e3f51092
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-71095e466f5b906cdbcc9ff8cb04414c
                    machineconfiguration.openshift.io/reason:
                      during bootstrap: pending config rendered-worker-71095e466f5b906cdbcc9ff8cb04414c bootID e4144af2-e895-44b6-97f1-2b205684e956 matches curr...
```

Manually rebooting the host fixed the issue (because the boot ID no longer matched, so the reconciliation loop re-ran against the currentConfig).

Comment 1 Antonio Murdaca 2019-07-10 21:35:03 UTC
Can you provide the journals from the stuck node, please, to help debug why it didn't reboot?

Comment 3 Antonio Murdaca 2019-07-11 10:25:42 UTC
Looks like there's no hint in those journals. Can you run `oc adm must-gather` and send me the archive?

Comment 4 Antonio Murdaca 2019-07-11 11:08:13 UTC
```
west-1.compute.internal: Unauthorized
I0702 19:41:32.074086    1585 openshift-tuned.go:176] Extracting tuned profiles
I0702 19:41:32.077351    1585 openshift-tuned.go:596] Resync period to pull node/pod labels: 59 [s]
E0702 19:41:32.151300    1585 openshift-tuned.go:686] Error getting node ip-192-168-179-98.us-west-1.compute.internal: Unauthorized
I0702 19:41:37.151566    1585 openshift-tuned.go:176] Extracting tuned profiles
I0702 19:41:37.155047    1585 openshift-tuned.go:596] Resync period to pull node/pod labels: 56 [s]
E0702 19:41:37.190108    1585 openshift-tuned.go:686] Error getting node ip-192-168-179-98.us-west-1.compute.internal: Unauthorized
I0702 19:41:42.191031    1585 openshift-tuned.go:176] Extracting tuned profiles
I0702 19:41:42.194887    1585 openshift-tuned.go:596] Resync period to pull node/pod labels: 63 [s]
E0702 19:41:42.214100    1585 openshift-tuned.go:686] Error getting node ip-192-168-179-98.us-west-1.compute.internal: Unauthorized
I0702 19:41:47.214347    1585 openshift-tuned.go:176] Extracting tuned profiles
I0702 19:41:47.217476    1585 openshift-tuned.go:596] Resync period to pull node/pod labels: 65 [s]
E0702 19:41:47.258283    1585 openshift-tuned.go:686] Error getting node ip-192-168-179-98.us-west-1.compute.internal: Unauthorized
I0702 19:41:52.258524    1585 openshift-tuned.go:176] Extracting tuned profiles
I0702 19:41:52.262917    1585 openshift-tuned.go:596] Resync period to pull node/pod labels: 54 [s]
E0702 19:41:52.333866    1585 openshift-tuned.go:686] Error getting node ip-192-168-179-98.us-west-1.compute.internal: Unauthorized
E0702 19:41:52.333895    1585 openshift-tuned.go:690] Seen 5 errors in 25 seconds, terminating...
panic: Error getting node ip-192-168-179-98.us-west-1.compute.internal: Unauthorized

goroutine 1 [running]:
main.main()
	/go/src/github.com/openshift/openshift-tuned/cmd/openshift-tuned.go:720 +0x283
","reason":"Error","startedAt":"2019-07-01T23:36:50Z"}}}]}}
```

I also saw this panic in the log; not sure who to ping for tuned.

Comment 5 Antonio Murdaca 2019-07-11 11:28:42 UTC
The logs show that we were draining for more than 600s, which is why we were killed and restarted. I think it's a fair assumption that _if_ we have a pending config but the same boot ID, we were draining or rebooting the node and failed. I wonder whether we can safely reconcile that state; it should not result in a reboot loop anyway.

Comment 6 Colin Walters 2019-07-11 13:17:11 UTC
Right, we should be able to detect when we just failed to drain and were restarted; a stamp file in /run or something would be easy.
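The stamp-file idea above can be sketched as follows. This is a minimal illustration, not the actual MCD code: the stamp path and function names are hypothetical, and `/tmp` is used here so the sketch runs unprivileged (comment 6 suggests `/run`). Before draining, the daemon records the current boot ID; on restart, a stamp matching the current boot ID means the previous run was killed mid-drain and the drain should be resumed rather than marking the node degraded.

```shell
# Hypothetical sketch of the stamp-file approach; not the real MCD logic.
STAMP=/tmp/mcd-drain.bootid            # real life would use /run, per comment 6
BOOT_ID=$(cat /proc/sys/kernel/random/boot_id)

# Write a stamp tagged with the current boot ID just before draining.
mark_drain_started() {
  printf '%s\n' "$BOOT_ID" > "$STAMP"
}

# True only if a stamp exists from this same boot, i.e. the previous
# daemon process was killed after starting the drain but before rebooting.
should_retry_drain() {
  [ -f "$STAMP" ] && [ "$(cat "$STAMP")" = "$BOOT_ID" ]
}

mark_drain_started
if should_retry_drain; then
  echo "killed mid-drain in this boot, retrying drain..."
fi
```

After a real reboot the boot ID changes, `should_retry_drain` is false, and the normal post-reboot reconciliation path runs instead.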

Comment 7 Colin Walters 2019-07-11 13:18:41 UTC
An interesting question here is *why* we failed to drain. I bet, for example, that if a user adds an app with a PDB that doesn't tolerate even maxUnavailable=1 (e.g. they have the default 3 workers and their app needs all 3 replicas up), we'll get into this MCD failure state.
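As a hypothetical illustration of that scenario (not taken from this bug), a PDB like the following on a 3-worker cluster with a 3-replica app means no replica may ever be evicted, so the drain blocks indefinitely:

```yaml
# Illustrative only: with 3 replicas spread over 3 workers and
# minAvailable: 3, every eviction would violate the budget, so
# `oc adm drain` (and hence the MCD) waits forever.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: myapp
```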

Comment 41 Brian 'redbeard' Harrington 2019-08-15 20:10:00 UTC
As of OpenShift 4.1.9 this error was still occurring.  At some point it spawned another node (increasing my AWS bill without any active notification):

[~/openshift-install-02cd6e91]$ oc get nodes
NAME                                            STATUS                     ROLES    AGE   VERSION
ip-192-168-176-110.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+0cb23916f
ip-192-168-176-174.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+0cb23916f
ip-192-168-177-192.us-west-1.compute.internal   Ready                      worker   30d   v1.13.4+c9e4f28ff
ip-192-168-177-21.us-west-1.compute.internal    Ready                      worker   30d   v1.13.4+c9e4f28ff
ip-192-168-178-191.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+0cb23916f
ip-192-168-178-211.us-west-1.compute.internal   Ready                      worker   30d   v1.13.4+c9e4f28ff
ip-192-168-179-98.us-west-1.compute.internal    Ready,SchedulingDisabled   worker   79d   v1.13.4+9b19d73a0

Deleting the node was successful and didn't respawn a failed node:

[~/openshift-install-02cd6e91]$ oc delete node/ip-192-168-179-98.us-west-1.compute.internal
node "ip-192-168-179-98.us-west-1.compute.internal" deleted
[~/openshift-install-02cd6e91]$ oc get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-192-168-176-110.us-west-1.compute.internal   Ready    master   79d   v1.13.4+0cb23916f
ip-192-168-176-174.us-west-1.compute.internal   Ready    master   79d   v1.13.4+0cb23916f
ip-192-168-177-192.us-west-1.compute.internal   Ready    worker   30d   v1.13.4+c9e4f28ff
ip-192-168-177-21.us-west-1.compute.internal    Ready    worker   30d   v1.13.4+c9e4f28ff
ip-192-168-178-191.us-west-1.compute.internal   Ready    master   79d   v1.13.4+0cb23916f
ip-192-168-178-211.us-west-1.compute.internal   Ready    worker   30d   v1.13.4+c9e4f28ff


A half hour later everything was still fine and I began the update to 4.1.11:
[~/openshift-install-02cd6e91]$ oc get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-192-168-176-110.us-west-1.compute.internal   Ready    master   79d   v1.13.4+0cb23916f
ip-192-168-176-174.us-west-1.compute.internal   Ready    master   79d   v1.13.4+0cb23916f
ip-192-168-177-192.us-west-1.compute.internal   Ready    worker   30d   v1.13.4+c9e4f28ff
ip-192-168-177-21.us-west-1.compute.internal    Ready    worker   30d   v1.13.4+c9e4f28ff
ip-192-168-178-191.us-west-1.compute.internal   Ready    master   79d   v1.13.4+0cb23916f
ip-192-168-178-211.us-west-1.compute.internal   Ready    worker   30d   v1.13.4+c9e4f28ff
[~/openshift-install-02cd6e91]$ oc adm upgrade 
Cluster version is 4.1.9

Updates:

VERSION IMAGE
4.1.11  quay.io/openshift-release-dev/ocp-release@sha256:bfca31dbb518b35f312cc67516fa18aa40df9925dc84fdbcd15f8bbca425d7ff
[~/openshift-install-02cd6e91]$ oc adm upgrade  --to=4.1.11
Updating to 4.1.11



Subsequently one of the nodes has gone unschedulable again:

[~/openshift-install-02cd6e91]$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.10-201908061216+68e229c-dirty", GitCommit:"68e229c", GitTreeState:"dirty", BuildDate:"2019-08-06T18:30:32Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+df9cebc", GitCommit:"df9cebc", GitTreeState:"clean", BuildDate:"2019-08-06T18:31:45Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
[~/openshift-install-02cd6e91]$ oc adm upgrade  
Error while reconciling 4.1.11: some cluster operators have not yet rolled out

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.
[~/openshift-install-02cd6e91]$ oc get nodes
NAME                                            STATUS                     ROLES    AGE   VERSION
ip-192-168-176-110.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+0cb23916f
ip-192-168-176-174.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+0cb23916f
ip-192-168-177-192.us-west-1.compute.internal   Ready                      worker   30d   v1.13.4+0cb23916f
ip-192-168-177-21.us-west-1.compute.internal    Ready                      worker   30d   v1.13.4+0cb23916f
ip-192-168-178-191.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+0cb23916f
ip-192-168-178-211.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   30d   v1.13.4+c9e4f28ff

Comment 42 Brian 'redbeard' Harrington 2019-08-15 21:17:52 UTC
And after the update to 4.1.11 completed, the nodes are still in the same state:

[~/openshift-install-02cd6e91]$ oc adm upgrade 
Cluster version is 4.1.11

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.
[~/openshift-install-02cd6e91]$ oc get nodes
NAME                                            STATUS                     ROLES    AGE   VERSION
ip-192-168-176-110.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+d81afa6ba
ip-192-168-176-174.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+d81afa6ba
ip-192-168-177-192.us-west-1.compute.internal   Ready                      worker   30d   v1.13.4+0cb23916f
ip-192-168-177-21.us-west-1.compute.internal    Ready                      worker   31d   v1.13.4+0cb23916f
ip-192-168-178-191.us-west-1.compute.internal   Ready                      master   79d   v1.13.4+d81afa6ba
ip-192-168-178-211.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   30d   v1.13.4+c9e4f28ff
[~/openshift-install-02cd6e91]$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.10-201908061216+68e229c-dirty", GitCommit:"68e229c", GitTreeState:"dirty", BuildDate:"2019-08-06T18:30:32Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+df9cebc", GitCommit:"df9cebc", GitTreeState:"clean", BuildDate:"2019-08-06T18:31:45Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}

Comment 43 Brian 'redbeard' Harrington 2019-08-22 18:22:48 UTC
Same state, nothing has converged in the last week.



[~/openshift-install-02cd6e91]$ oc get nodes
NAME                                            STATUS                     ROLES    AGE   VERSION
ip-192-168-176-110.us-west-1.compute.internal   Ready                      master   86d   v1.13.4+d81afa6ba
ip-192-168-176-174.us-west-1.compute.internal   Ready                      master   86d   v1.13.4+d81afa6ba
ip-192-168-177-192.us-west-1.compute.internal   Ready                      worker   37d   v1.13.4+0cb23916f
ip-192-168-177-21.us-west-1.compute.internal    Ready                      worker   37d   v1.13.4+0cb23916f
ip-192-168-178-191.us-west-1.compute.internal   Ready                      master   86d   v1.13.4+d81afa6ba
ip-192-168-178-211.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   37d   v1.13.4+c9e4f28ff
[~/openshift-install-02cd6e91]$ oc adm upgrade 
Cluster version is 4.1.11

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.
[~/openshift-install-02cd6e91]$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.12-201908150609+4b989a4-dirty", GitCommit:"4b989a4", GitTreeState:"dirty", BuildDate:"2019-08-15T13:03:05Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+df9cebc", GitCommit:"df9cebc", GitTreeState:"clean", BuildDate:"2019-08-06T18:31:45Z", GoVersion:"go1.11.6", Compiler:"gc", Platform:"linux/amd64"}

Comment 44 Antonio Murdaca 2019-08-28 11:08:32 UTC
(In reply to Brian 'redbeard' Harrington from comment #43)
> Same state, nothing has converged in the last week.

We're working on verifying the 4.2 BZ here: https://bugzilla.redhat.com/show_bug.cgi?id=1732120. It's taking some time, but afterwards we'll work on a 4.1 backport. The current workaround is to jump on the node and reboot it manually.

Comment 48 Robert Fairley 2019-09-06 19:00:29 UTC
Verified on `4.1.0-0.nightly-2019-09-05-023547`. `killed by kube, retrying...` is shown in the logs, before the drain is resumed. Followed the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1732120#c7, with minor modifications:

- used `machineconfigpool/worker` instead of `mcp/worker` in step 2
- added `-n openshift-machine-config-operator` to command in step 4 (alternatively, work in the `openshift-machine-config-operator` project)
- the range of scheduled pods is [751 - 1000], which I had set in step 3
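The dummy workload from the linked steps can be sketched roughly as below. This is a hypothetical reconstruction (the exact manifest lives in bug 1732120 comment 7): ~250 bare pods named zzz751..zzz1000, unmanaged by any controller, so that draining the node takes longer than the 600s after which kube kills the MCD pod.

```shell
# Generate ~250 bare pod manifests; image and grace period are
# illustrative choices, not the values from the original steps.
for i in $(seq 751 1000); do
  cat <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: zzz$i
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1   # any long-running image works
  terminationGracePeriodSeconds: 60
EOF
done > file.yaml
```

The file would then be applied with `oc create -f file.yaml` in the project used for verification.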

$ oc logs -f machine-config-daemon-wfncp -n openshift-machine-config-operator
-- snip --
I0906 18:33:50.169569    3133 update.go:848] Update prepared; beginning drain
I0906 18:33:50.223938    3133 update.go:89] cordoned node "ip-10-0-154-7.ec2.internal"
I0906 18:33:50.577247    3133 update.go:93] deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ip-10-0-154-7ec2internal-debug, zzz751, zzz752, zzz753, zzz754, zzz755, zzz756, zzz757, zzz758, zzz759, zzz760, zzz761, zzz762, zzz763, zzz764, zzz765, zzz766, zzz767, zzz768, zzz769, zzz770, zzz771, zzz772, zzz773, zzz774, zzz775, zzz776, zzz777, zzz778, zzz779, zzz780, zzz781, zzz782, zzz783, zzz784, zzz785, zzz786, zzz787, zzz788, zzz789, zzz790, zzz791, zzz792, zzz793, zzz794, zzz795, zzz796, zzz797, zzz798, zzz799, zzz800, zzz801, zzz802, zzz803, zzz804, zzz805, zzz806, zzz807, zzz808, zzz809, zzz810, zzz811, zzz812, zzz813, zzz814, zzz815, zzz816, zzz817, zzz818, zzz819, zzz820, zzz821, zzz822, zzz823, zzz824, zzz825, zzz826, zzz827, zzz828, zzz829, zzz830, zzz831, zzz832, zzz833, zzz834, zzz835, zzz836, zzz837, zzz838, zzz839, zzz840, zzz841, zzz842, zzz843, zzz844, zzz845, zzz846, zzz847, zzz848, zzz849, zzz850, zzz851, zzz852, zzz853, zzz854, zzz855, zzz856, zzz857, zzz858, zzz859, zzz860, zzz861, zzz862, zzz863, zzz864, zzz865, zzz866, zzz867, zzz868, zzz869, zzz870, zzz871, zzz872, zzz873, zzz874, zzz875, zzz876, zzz877, zzz878, zzz879, zzz880, zzz881, zzz882, zzz883, zzz884, zzz885, zzz886, zzz887, zzz888, zzz889, zzz890, zzz891, zzz892, zzz893, zzz894, zzz895, zzz896, zzz897, zzz898, zzz899, zzz900, zzz901, zzz902, zzz903, zzz904, zzz905, zzz906, zzz907, zzz908, zzz909, zzz910, zzz911, zzz912, zzz913, zzz914, zzz915, zzz916, zzz917, zzz918, zzz919, zzz920, zzz921, zzz922, zzz923, zzz924, zzz925, zzz926, zzz927, zzz928, zzz929, zzz930, zzz931, zzz932, zzz933, zzz934, zzz935, zzz936, zzz937, zzz938, zzz939, zzz940, zzz941, zzz942, zzz943, zzz944, zzz945, zzz946, zzz947, zzz948, zzz949, zzz950, zzz951, zzz952, zzz953, zzz954, zzz955, zzz956, zzz957, zzz958, zzz959, zzz960, zzz961, zzz962, zzz963, zzz964, zzz965, zzz966, zzz967, zzz968, zzz969, zzz970, zzz971, zzz972, zzz973, zzz974, zzz975, zzz976, zzz977, zzz978, 
zzz979, zzz980, zzz981, zzz982; ignoring DaemonSet-managed pods: tuned-4whtm, dns-default-ds947, node-ca-dwl4s, machine-config-daemon-wfncp, node-exporter-gccbf, multus-xgvvf, ovs-s7bhc, sdn-8ht7c; deleting pods with local storage: alertmanager-main-2, kube-state-metrics-779d7c6547-qknpt
I0906 18:34:00.379183    3133 update.go:89] pod "zzz764" removed (evicted)
I0906 18:34:01.179166    3133 update.go:89] pod "zzz765" removed (evicted)
I0906 18:34:01.782169    3133 update.go:89] pod "zzz761" removed (evicted)
I0906 18:34:02.796623    3133 update.go:89] pod "zzz767" removed (evicted)
I0906 18:34:04.779051    3133 update.go:89] pod "zzz763" removed (evicted)
I0906 18:34:08.578740    3133 update.go:89] pod "zzz755" removed (evicted)
I0906 18:34:09.980593    3133 update.go:89] pod "zzz756" removed (evicted)

$ oc logs -f machine-config-daemon-b86fb -n openshift-machine-config-operator
I0906 18:36:57.022341   10378 start.go:67] Version: 4.1.15-201909041605-dirty (916adbf5fdc1381714fadc327c17000a8b3707e1)
I0906 18:36:57.023086   10378 start.go:100] Starting node writer
I0906 18:36:57.027220   10378 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0906 18:36:57.107391   10378 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb025ff2b405cf5902a451675cd27a9d702ced7c88b5a33280d42069524874c6 (410.8.20190904.0)
I0906 18:36:57.110675   10378 start.go:196] Calling chroot("/rootfs")
I0906 18:36:57.110707   10378 start.go:206] Starting MachineConfigDaemon
I0906 18:36:57.111328   10378 update.go:848] Starting to manage node: ip-10-0-154-7.ec2.internal
I0906 18:36:57.114990   10378 run.go:22] Running captured: rpm-ostree status
I0906 18:36:57.168417   10378 daemon.go:740] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb025ff2b405cf5902a451675cd27a9d702ced7c88b5a33280d42069524874c6
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190904.0 (2019-09-04T20:31:27Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/ootpa@sha256:683a6a866a8ec789fedb5da63b6a2ff68c1b0788ec90e7def778f0c4c13197a4
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T20:10:04Z)
I0906 18:36:57.168445   10378 run.go:22] Running captured: journalctl --list-boots
I0906 18:36:57.174857   10378 daemon.go:747] journalctl --list-boots:
-1 9e51bf6bb831421abb162ae2b515957f Fri 2019-09-06 17:57:26 UTC—Fri 2019-09-06 17:59:04 UTC
 0 3d2d8be6dce94f80adbf9adfe174bdf6 Fri 2019-09-06 17:59:33 UTC—Fri 2019-09-06 18:36:57 UTC
I0906 18:36:57.174884   10378 daemon.go:494] Enabling Kubelet Healthz Monitor
I0906 18:37:00.884948   10378 update.go:737] logger doesn't support --jounald, grepping the journal
I0906 18:37:01.014860   10378 daemon.go:702] Current config: rendered-worker-1487d7a166bb6b60d6db96b0d5f34e31
I0906 18:37:01.014892   10378 daemon.go:703] Desired config: rendered-worker-3ff06e8b04f5751b838b9a61a833df14
I0906 18:37:01.014905   10378 update.go:848] using pending config same as desired
I0906 18:37:01.017828   10378 update.go:848] killed by kube, retrying...
I0906 18:37:01.020332   10378 update.go:813] logger doesn't support --jounald, logging json directly
I0906 18:37:01.025173   10378 update.go:848] Update prepared; beginning drain
I0906 18:37:01.120695   10378 update.go:93] deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ip-10-0-154-7ec2internal-debug, zzz774, zzz845, zzz846, zzz847, zzz848, zzz849, zzz850, zzz851, zzz852, zzz853, zzz854, zzz855, zzz856, zzz857, zzz858, zzz859, zzz860, zzz861, zzz862, zzz863, zzz864, zzz865, zzz866, zzz867, zzz868, zzz869, zzz870, zzz871, zzz872, zzz873, zzz874, zzz875, zzz876, zzz877, zzz878, zzz879, zzz880, zzz881, zzz882, zzz883, zzz884, zzz885, zzz886, zzz887, zzz888, zzz889, zzz890, zzz891, zzz892, zzz893, zzz894, zzz895, zzz896, zzz897, zzz898, zzz899, zzz900, zzz901, zzz902, zzz903, zzz904, zzz905, zzz906, zzz907, zzz908, zzz909, zzz910, zzz911, zzz912, zzz913, zzz914, zzz926, zzz928, zzz929, zzz930, zzz931, zzz932, zzz933, zzz934, zzz935, zzz936, zzz937, zzz938, zzz939, zzz940, zzz941, zzz942, zzz943, zzz944, zzz945, zzz946, zzz947, zzz948, zzz949, zzz950, zzz951, zzz952, zzz953, zzz954, zzz955, zzz956, zzz957, zzz958, zzz959, zzz960, zzz961, zzz962, zzz963, zzz964, zzz965, zzz966, zzz967, zzz968, zzz969, zzz970, zzz971, zzz972, zzz973, zzz974, zzz975, zzz976, zzz977, zzz978, zzz979, zzz980, zzz981, zzz982; ignoring DaemonSet-managed pods: tuned-4whtm, dns-default-ds947, node-ca-dwl4s, machine-config-daemon-b86fb, node-exporter-gccbf, multus-xgvvf, ovs-s7bhc, sdn-8ht7c; deleting pods with local storage: alertmanager-main-2, kube-state-metrics-779d7c6547-qknpt
I0906 18:37:50.630424   10378 update.go:89] pod "community-operators-7c66dcb698-x7mbj" removed (evicted)
I0906 18:37:52.230818   10378 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0906 18:37:52.641665   10378 update.go:89] pod "kube-state-metrics-779d7c6547-qknpt" removed (evicted)
I0906 18:38:04.230108   10378 update.go:89] pod "prometheus-operator-bbcc7db88-vb5mq" removed (evicted)
I0906 18:38:15.030237   10378 update.go:89] pod "image-registry-54cfffdb49-fvjn8" removed (evicted)
I0906 18:38:15.430177   10378 update.go:89] pod "router-default-c6fb89cd7-rk8w8" removed (evicted)
I0906 18:38:17.030069   10378 update.go:89] pod "certified-operators-7f6f86fc49-p87kp" removed (evicted)
I0906 18:38:18.831410   10378 update.go:89] pod "redhat-operators-7f96d497b-xm5fg" removed (evicted)
I0906 18:44:30.030474   10378 update.go:89] pod "ip-10-0-154-7ec2internal-debug" removed (evicted)
I0906 18:45:38.830078   10378 update.go:89] pod "zzz774" removed (evicted)
I0906 18:47:27.433488   10378 update.go:89] pod "zzz896" removed (evicted)
I0906 18:47:27.632685   10378 update.go:89] pod "zzz894" removed (evicted)
I0906 18:47:28.231005   10378 update.go:89] pod "zzz845" removed (evicted)
I0906 18:47:37.834994   10378 update.go:89] pod "zzz847" removed (evicted)
I0906 18:47:45.234606   10378 update.go:89] pod "zzz890" removed (evicted)
I0906 18:47:49.430267   10378 update.go:89] pod "zzz854" removed (evicted)
I0906 18:47:49.830147   10378 update.go:89] pod "zzz904" removed (evicted)
I0906 18:48:01.430390   10378 update.go:89] pod "zzz898" removed (evicted)
I0906 18:48:03.630277   10378 update.go:89] pod "zzz977" removed (evicted)
I0906 18:48:04.430040   10378 update.go:89] pod "zzz859" removed (evicted)
I0906 18:48:04.630092   10378 update.go:89] pod "zzz939" removed (evicted)
I0906 18:48:05.231503   10378 update.go:89] pod "zzz848" removed (evicted)
I0906 18:48:05.830636   10378 update.go:89] pod "zzz940" removed (evicted)
I0906 18:48:06.030610   10378 update.go:89] pod "zzz901" removed (evicted)
I0906 18:48:08.430552   10378 update.go:89] pod "zzz881" removed (evicted)
I0906 18:48:09.430287   10378 update.go:89] pod "zzz945" removed (evicted)
I0906 18:48:09.630471   10378 update.go:89] pod "zzz899" removed (evicted)
I0906 18:48:10.030184   10378 update.go:89] pod "zzz850" removed (evicted)
I0906 18:48:11.230557   10378 update.go:89] pod "zzz885" removed (evicted)
I0906 18:48:11.830403   10378 update.go:89] pod "zzz903" removed (evicted)
I0906 18:48:12.230234   10378 update.go:89] pod "zzz864" removed (evicted)
I0906 18:48:14.230342   10378 update.go:89] pod "zzz905" removed (evicted)
I0906 18:48:15.830244   10378 update.go:89] pod "zzz951" removed (evicted)
I0906 18:48:16.230257   10378 update.go:89] pod "zzz895" removed (evicted)
I0906 18:48:16.430640   10378 update.go:89] pod "zzz897" removed (evicted)
I0906 18:48:17.430235   10378 update.go:89] pod "zzz930" removed (evicted)
I0906 18:48:21.030387   10378 update.go:89] pod "zzz933" removed (evicted)
I0906 18:48:23.036390   10378 update.go:89] pod "zzz873" removed (evicted)
I0906 18:48:24.030370   10378 update.go:89] pod "zzz900" removed (evicted)
I0906 18:48:24.430501   10378 update.go:89] pod "zzz935" removed (evicted)
I0906 18:48:24.630253   10378 update.go:89] pod "zzz934" removed (evicted)
I0906 18:48:26.030147   10378 update.go:89] pod "zzz851" removed (evicted)
I0906 18:48:26.230610   10378 update.go:89] pod "zzz875" removed (evicted)
I0906 18:48:28.030325   10378 update.go:89] pod "zzz979" removed (evicted)
I0906 18:48:28.230403   10378 update.go:89] pod "zzz877" removed (evicted)
I0906 18:48:29.030383   10378 update.go:89] pod "zzz941" removed (evicted)
I0906 18:48:31.033897   10378 update.go:89] pod "zzz879" removed (evicted)
I0906 18:48:31.630253   10378 update.go:89] pod "zzz849" removed (evicted)
I0906 18:48:31.830611   10378 update.go:89] pod "zzz863" removed (evicted)
I0906 18:48:33.430315   10378 update.go:89] pod "zzz853" removed (evicted)
I0906 18:48:33.630284   10378 update.go:89] pod "zzz949" removed (evicted)
I0906 18:48:34.030414   10378 update.go:89] pod "zzz886" removed (evicted)
I0906 18:48:35.430009   10378 update.go:89] pod "zzz889" removed (evicted)
I0906 18:48:35.629966   10378 update.go:89] pod "zzz866" removed (evicted)
I0906 18:48:36.031034   10378 update.go:89] pod "zzz929" removed (evicted)
I0906 18:48:36.230292   10378 update.go:89] pod "zzz893" removed (evicted)
I0906 18:48:37.230105   10378 update.go:89] pod "zzz954" removed (evicted)
I0906 18:48:37.630300   10378 update.go:89] pod "zzz892" removed (evicted)
I0906 18:48:38.230798   10378 update.go:89] pod "zzz931" removed (evicted)
I0906 18:48:40.830152   10378 update.go:89] pod "zzz906" removed (evicted)
I0906 18:48:41.430476   10378 update.go:89] pod "zzz857" removed (evicted)
I0906 18:48:42.030206   10378 update.go:89] pod "zzz871" removed (evicted)
I0906 18:48:42.629922   10378 update.go:89] pod "zzz971" removed (evicted)
I0906 18:48:43.430093   10378 update.go:89] pod "zzz846" removed (evicted)
I0906 18:48:43.830591   10378 update.go:89] pod "zzz980" removed (evicted)
I0906 18:48:45.230108   10378 update.go:89] pod "zzz909" removed (evicted)
I0906 18:48:46.030160   10378 update.go:89] pod "zzz938" removed (evicted)
I0906 18:48:46.431169   10378 update.go:89] pod "zzz852" removed (evicted)
I0906 18:48:46.630276   10378 update.go:89] pod "zzz876" removed (evicted)
I0906 18:48:47.031533   10378 update.go:89] pod "zzz911" removed (evicted)
I0906 18:48:47.238579   10378 update.go:89] pod "zzz910" removed (evicted)
I0906 18:48:47.430218   10378 update.go:89] pod "zzz861" removed (evicted)
I0906 18:48:48.629939   10378 update.go:89] pod "zzz880" removed (evicted)
I0906 18:48:49.630653   10378 update.go:89] pod "zzz947" removed (evicted)
I0906 18:48:50.030621   10378 update.go:89] pod "zzz913" removed (evicted)
I0906 18:48:50.229995   10378 update.go:89] pod "zzz883" removed (evicted)
I0906 18:48:50.629983   10378 update.go:89] pod "zzz948" removed (evicted)
I0906 18:48:50.830264   10378 update.go:89] pod "zzz926" removed (evicted)
I0906 18:48:51.030507   10378 update.go:89] pod "zzz887" removed (evicted)
I0906 18:48:51.630453   10378 update.go:89] pod "zzz928" removed (evicted)
I0906 18:48:52.030142   10378 update.go:89] pod "zzz950" removed (evicted)
I0906 18:48:52.230036   10378 update.go:89] pod "zzz952" removed (evicted)
I0906 18:48:54.030270   10378 update.go:89] pod "zzz960" removed (evicted)
I0906 18:48:55.630420   10378 update.go:89] pod "zzz966" removed (evicted)
I0906 18:48:56.030203   10378 update.go:89] pod "zzz856" removed (evicted)
I0906 18:48:56.230379   10378 update.go:89] pod "zzz969" removed (evicted)
I0906 18:48:56.430431   10378 update.go:89] pod "zzz965" removed (evicted)
I0906 18:48:56.830211   10378 update.go:89] pod "zzz872" removed (evicted)
I0906 18:48:57.230210   10378 update.go:89] pod "zzz907" removed (evicted)
I0906 18:48:57.630099   10378 update.go:89] pod "zzz968" removed (evicted)
I0906 18:48:57.834684   10378 update.go:89] pod "zzz858" removed (evicted)
I0906 18:48:58.230328   10378 update.go:89] pod "zzz974" removed (evicted)
I0906 18:48:58.429992   10378 update.go:89] pod "zzz936" removed (evicted)
I0906 18:48:58.630245   10378 update.go:89] pod "zzz937" removed (evicted)
I0906 18:48:58.830118   10378 update.go:89] pod "zzz981" removed (evicted)
I0906 18:48:59.030430   10378 update.go:89] pod "zzz975" removed (evicted)
I0906 18:49:00.030176   10378 update.go:89] pod "zzz874" removed (evicted)
I0906 18:49:00.430265   10378 update.go:89] pod "zzz902" removed (evicted)
I0906 18:49:00.830284   10378 update.go:89] pod "zzz944" removed (evicted)
I0906 18:49:01.230335   10378 update.go:89] pod "zzz862" removed (evicted)
I0906 18:49:01.430425   10378 update.go:89] pod "zzz882" removed (evicted)
I0906 18:49:02.030374   10378 update.go:89] pod "zzz943" removed (evicted)
I0906 18:49:02.631130   10378 update.go:89] pod "zzz865" removed (evicted)
I0906 18:49:03.230660   10378 update.go:89] pod "zzz958" removed (evicted)
I0906 18:49:03.430689   10378 update.go:89] pod "zzz959" removed (evicted)
I0906 18:49:04.030459   10378 update.go:89] pod "zzz955" removed (evicted)
I0906 18:49:04.229950   10378 update.go:89] pod "zzz867" removed (evicted)
I0906 18:49:04.430224   10378 update.go:89] pod "zzz962" removed (evicted)
I0906 18:49:05.230165   10378 update.go:89] pod "zzz869" removed (evicted)
I0906 18:49:05.430161   10378 update.go:89] pod "zzz961" removed (evicted)
I0906 18:49:05.831872   10378 update.go:89] pod "zzz964" removed (evicted)
I0906 18:49:06.045103   10378 update.go:89] pod "zzz870" removed (evicted)
I0906 18:49:06.230726   10378 update.go:89] pod "zzz855" removed (evicted)
I0906 18:49:06.431556   10378 update.go:89] pod "zzz967" removed (evicted)
I0906 18:49:07.830407   10378 update.go:89] pod "zzz976" removed (evicted)
I0906 18:49:08.030158   10378 update.go:89] pod "zzz860" removed (evicted)
I0906 18:49:08.630250   10378 update.go:89] pod "zzz884" removed (evicted)
I0906 18:49:08.830166   10378 update.go:89] pod "zzz912" removed (evicted)
I0906 18:49:10.430251   10378 update.go:89] pod "zzz957" removed (evicted)
I0906 18:49:10.630241   10378 update.go:89] pod "zzz963" removed (evicted)
I0906 18:49:11.630074   10378 update.go:89] pod "zzz978" removed (evicted)
I0906 18:49:11.832262   10378 update.go:89] pod "zzz973" removed (evicted)
I0906 18:49:12.030238   10378 update.go:89] pod "zzz908" removed (evicted)
I0906 18:49:12.431965   10378 update.go:89] pod "zzz878" removed (evicted)
I0906 18:49:12.631312   10378 update.go:89] pod "zzz946" removed (evicted)
I0906 18:49:13.230258   10378 update.go:89] pod "zzz891" removed (evicted)
I0906 18:49:13.430333   10378 update.go:89] pod "zzz953" removed (evicted)
I0906 18:49:13.630208   10378 update.go:89] pod "zzz868" removed (evicted)
I0906 18:49:14.030403   10378 update.go:89] pod "zzz932" removed (evicted)
I0906 18:49:14.230721   10378 update.go:89] pod "zzz970" removed (evicted)
I0906 18:49:14.430212   10378 update.go:89] pod "zzz972" removed (evicted)
I0906 18:49:14.830484   10378 update.go:89] pod "zzz942" removed (evicted)
I0906 18:49:15.630252   10378 update.go:89] pod "zzz982" removed (evicted)
I0906 18:49:16.230271   10378 update.go:89] pod "zzz888" removed (evicted)
I0906 18:49:16.642332   10378 update.go:89] pod "zzz914" removed (evicted)
I0906 18:49:17.841294   10378 update.go:89] pod "zzz956" removed (evicted)
I0906 18:49:17.841345   10378 update.go:89] drained node "ip-10-0-154-7.ec2.internal"
I0906 18:49:17.841359   10378 update.go:848] drain complete
I0906 18:49:17.844990   10378 update.go:848] initiating reboot: Node will reboot into config rendered-worker-3ff06e8b04f5751b838b9a61a833df14
I0906 18:49:17.947572   10378 start.go:215] Shutting down MachineConfigDaemon

Comment 49 Robert Fairley 2019-09-06 19:05:01 UTC
To note, not all 250 pods in [751 - 1000] show in the logs above, as some pods were still Pending at the time `oc create -f file.yaml` was run to apply the dummy config.

Comment 51 errata-xmlrpc 2019-09-12 18:56:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2681

