Bug 1838206 - Nodes on cluster reboot continually
Summary: Nodes on cluster reboot continually
Keywords:
Status: CLOSED DUPLICATE of bug 1809007
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-20 16:06 UTC by Mario Abajo
Modified: 2023-09-07 23:11 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-22 12:40:02 UTC
Target Upstream Version:
Embargoed:



Description Mario Abajo 2020-05-20 16:06:13 UTC
Description of problem:
Nodes on an OpenShift 4.3 cluster are continually rebooted. The reason for the reboots is that the machineconfigpool configuration keeps being applied to the nodes:

- extract from "machine-config-daemon-2mfbb" log:

~~~
I0515 10:23:54.059595    2287 start.go:74] Version: v4.3.14-202004200457-dirty (f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54)
I0515 10:23:54.068432    2287 start.go:84] Calling chroot("/rootfs")
I0515 10:23:54.069495    2287 rpm-ostree.go:366] Running captured: rpm-ostree status --json
I0515 10:23:54.252483    2287 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4cd521fb34c0d362205a1e55ad8c9c8dd6c7365b71a357ef705692ed80f7b112 (43.81.202004280317.0)
...output suppressed...
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4cd521fb34c0d362205a1e55ad8c9c8dd6c7365b71a357ef705692ed80f7b112
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.202004280317.0 (2020-04-28T03:22:44Z)

  ostree://624bfc39e8091d69b3c48bb16e85683ff5166dc8b03ac0686753d8e555613b54
                   Version: 43.81.202003191953.0 (2020-03-19T19:59:17Z)
I0515 10:23:54.572041    2287 rpm-ostree.go:366] Running captured: journalctl --list-boots
I0515 10:23:57.110763    2287 daemon.go:785] journalctl --list-boots:
-126 2a4ac46b42b543ab9eef1617a3a4c161 Sat 2020-05-09 23:08:27 UTC—Sat 2020-05-09 23:08:39 UTC
-125 d8315c36fdfe4c688db9bcca4d8f2ee3 Sat 2020-05-09 23:08:56 UTC—Sun 2020-05-10 00:55:50 UTC
-124 09d0d5845b8f4a279a8233684f9d7343 Sun 2020-05-10 00:56:08 UTC—Sun 2020-05-10 01:09:12 UTC
...output suppressed...
  -2 82d07edfd9a84a24878493c5d31993be Fri 2020-05-15 07:29:30 UTC—Fri 2020-05-15 10:11:45 UTC
  -1 42f374f40fc643e2a71e868da52a5d8d Fri 2020-05-15 10:12:01 UTC—Fri 2020-05-15 10:23:24 UTC
   0 b7fae3fd0a3b4e9da1f6a5dabe318bdd Fri 2020-05-15 10:23:41 UTC—Fri 2020-05-15 10:23:57 UTC
...output suppressed...
I0515 10:24:04.196965    2287 daemon.go:731] Current config: rendered-worker-e09847e8fddf3cca3a3cfc89a033eee6
I0515 10:24:04.196986    2287 daemon.go:732] Desired config: rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.205520    2287 update.go:1051] Disk currentConfig rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73 overrides node annotation rendered-worker-e09847e8fddf3cca3a3cfc89a033eee6
I0515 10:24:04.208700    2287 daemon.go:955] Validating against pending config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.211216    2287 daemon.go:971] Validated on-disk state
I0515 10:24:04.224263    2287 daemon.go:1005] Completing pending config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.229678    2287 update.go:1051] completed update for config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.232522    2287 daemon.go:1021] In desired config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
~~~
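
For reference, the extract above is from the machine-config-daemon pod running on the affected node; it can be pulled with something like the following (pod name as seen in this cluster, adjust per node; add -c machine-config-daemon if the pod reports more than one container):

~~~
# Find the machine-config-daemon pod scheduled on the affected node
oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon

# Dump its log (this is the pod quoted above)
oc -n openshift-machine-config-operator logs machine-config-daemon-2mfbb
~~~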

As you can see, the node rebooted 126 times in a period of 5 days, and this is happening in turn to all nodes in the cluster.

Investigating the issue, we found that the machineconfigpools are being regenerated, pushing new configurations to the nodes and then rebooting them (this output was taken at a different moment, hence the difference between observedGeneration and the reboot count):

~~~
$ oc get mcp -o name | while read mcp; do echo "------------  $mcp -------------------"; oc get $mcp -oyaml | grep observed; done
------------  machineconfigpool.machineconfiguration.openshift.io/infra -------------------
  observedGeneration: 260
------------  machineconfigpool.machineconfiguration.openshift.io/master -------------------
  observedGeneration: 233
------------  machineconfigpool.machineconfiguration.openshift.io/worker -------------------
  observedGeneration: 261
~~~
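
The same mismatch can also be checked directly on the node objects. A minimal sketch, assuming the standard MCO node annotations (currentConfig/desiredConfig); this is not output taken from this cluster:

~~~
# Show which rendered config each node is on versus what the MCO wants it on
oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'
~~~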

Following this line of investigation, we checked the machine configs and saw that two of them are being continuously regenerated:

~~~
$ cat 03-machineconfigs.log | awk '/^Name:|Generation:/ {print $0}'
Name:         00-master
  Generation:          3
Name:         00-worker
  Generation:          3
...output suppressed...
Name:         99-master-a3e5f6eb-a1c2-4a80-8a5c-53ccc9f0b56e-registries
  Generation:          147
Name:         99-worker-d747fbe5-23a7-4baf-bbe7-15798060d81d-registries
  Generation:          175
...output suppressed...
~~~
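
On a live cluster the same generation comparison can be made without the must-gather dump; a rough equivalent using only standard metadata fields:

~~~
# List machine configs ordered by generation; the continuously regenerated ones sort last
oc get machineconfigs --sort-by=.metadata.generation -o custom-columns='NAME:.metadata.name,GENERATION:.metadata.generation,CREATED:.metadata.creationTimestamp'
~~~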

I'm attaching the machineconfigs for your analysis.
These two particular machine configs set the contents of the files "/etc/containers/registries.conf" and "/etc/containers/policy.json" with customer-specific content. These files are also already set by the default machine configs "01-master-container-runtime" and "01-worker-container-runtime".
The only interesting thing that I found about these two machine configs is that they carry this annotation:

~~~
Name:         99-master-a3e5f6eb-a1c2-4a80-8a5c-53ccc9f0b56e-registries
Annotations:  machineconfiguration.openshift.io/generated-by-controller-version: f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54

Name:         99-worker-d747fbe5-23a7-4baf-bbe7-15798060d81d-registries
Annotations:  machineconfiguration.openshift.io/generated-by-controller-version: f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54
~~~

This makes me think that they are probably being regenerated by the machine-config-operator.
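
If that is the case, the *-registries machine configs are normally rendered by the MCO's container-runtime-config controller from the registry/image related cluster resources, so it may be worth checking whether any of those are being churned (for instance by the GitOps tooling). This is an assumption about the usual mechanism, not something confirmed from this cluster:

~~~
# Cluster resources that commonly drive regeneration of the 99-*-registries machine configs
oc get containerruntimeconfigs
oc get imagecontentsourcepolicies
oc get image.config.openshift.io/cluster -o yaml
~~~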


Version-Release number of selected component (if applicable):
OpenShift 4.3.18

How reproducible:
Not tested in a lab.

Steps to Reproduce:
1.
2.
3.

Actual results:
Nodes continuously reboot.

Expected results:
Nodes stay stable.

Additional info:
The cluster was recently installed.
We stopped an external component: for GitOps the customer is using ArgoCD. But the nodes still reboot after that.
The customer has just removed the configurations they added, which are these ones:
99-worker-z-container-registry-conf
99-master-z-container-registry-conf

Interestingly enough, these two machine configs are not the ones being regenerated. This makes me think that somehow they are triggering the regeneration of the ones mentioned previously.
The result of this "deletion" is going to be reviewed next Monday.
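
(For reference, that removal amounts to deleting the two customer-added machine configs, roughly as below; the MCO then rolls out a fresh rendered config to the affected pools, which itself normally causes one more reboot per node.)

~~~
# Illustrative only - names as listed above
oc delete machineconfig 99-worker-z-container-registry-conf 99-master-z-container-registry-conf
~~~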

Comment 2 Kirsten Garrison 2020-05-20 17:41:54 UTC
Can you please attach a must-gather for the cluster?
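
(The standard way to collect one, for reference:)

~~~
# Collect a must-gather and attach the resulting archive to this bug
oc adm must-gather --dest-dir=./must-gather
tar -czf must-gather.tar.gz ./must-gather
~~~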

Comment 4 Kirsten Garrison 2020-05-20 18:08:19 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1809007

In this case, we'd advise upgrading to 4.3.19 or higher to pick up the fix before applying the changes.
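
A sketch of that path, assuming the cluster's update graph offers the target version:

~~~
# Check available updates, then move to 4.3.19 (or later) before re-applying the registry changes
oc adm upgrade
oc adm upgrade --to=4.3.19
~~~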

Reassigning to the Node team to verify against the must-gather and the analysis.

Comment 5 Urvashi Mohnani 2020-05-22 12:40:02 UTC

*** This bug has been marked as a duplicate of bug 1809007 ***

