Bug 1783621 - After successful upgrade to 4.1.27, machineconfigpools stuck in degraded state
Summary: After successful upgrade to 4.1.27, machineconfigpools stuck in degraded state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.1.z
Assignee: Jonathan Lebon
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1781665
Blocks:
Reported: 2019-12-14 05:12 UTC by brad.williams
Modified: 2023-09-14 05:48 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-13 06:14:02 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/pivot pull 58 (closed): cmd/root: randomize pivot container name (last updated 2020-10-07 13:09:15 UTC)
Red Hat Product Errata RHBA-2020:0399 (last updated 2020-02-13 06:14:11 UTC)

Description brad.williams 2019-12-14 05:12:49 UTC
Description of problem:
On 12/5/19, we upgraded the starter clusters to version 4.1.27. The upgrade was successful, and all verification of the environment passed. Earlier today, 12/13/19, an upgrade to 4.1.28 was initiated; it failed to begin because the "worker" machineconfigpool was degraded.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.27    True        False         7d22h   Cluster version is 4.1.27


How reproducible:
We upgraded 3 clusters from 4.1.26 to 4.1.27. Of those 3 clusters, 2 are now experiencing this problem.

Steps to Reproduce:
1. Upgrade cluster to 4.1.27
2. Observe that at least one of the machineconfigpools ends up stuck in a degraded state


Actual results:
Machineconfigpool stuck in a degraded state
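
For reference, a quick way to confirm the condition (a minimal sketch; assumes the affected pool is "worker"):

```
# The DEGRADED column should show True for the stuck pool.
oc get machineconfigpools

# Show the degraded conditions and their messages.
oc describe machineconfigpool worker | grep -i -A3 degraded
```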

Expected results:
Cluster upgrades successfully.

Additional info:
I had originally commented about this problem in another BZ:
     https://bugzilla.redhat.com/show_bug.cgi?id=1752508
However, it was suggested that I open a separate BZ.

Comment 2 brad.williams 2019-12-14 05:15:43 UTC
Adding original comments from the original BZ...

@brad, briefly looking, this seems to be different from the original post.

I see in the MCC:
```
2019-12-13T20:50:58.681266227Z E1213 20:50:58.680777       1 render_controller.go:216] error finding pools for machineconfig: could not find any MachineConfigPool set for MachineConfig managed-ssh-keys-infra with labels: map[cr.applier/hash:00338f448a68ac10398cab84da147ec6eb29fb7b cr.applier/unit:machineconfig machineconfiguration.openshift.io/role:infra]
```

so perhaps check that your pools are set up correctly.
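
For example (a sketch; comparing each pool's selector against the labels in the error above):

```
# Each pool's machineConfigSelector, to compare against the
# machineconfiguration.openshift.io/role:infra label in the error.
oc get machineconfigpool \
  -o custom-columns=NAME:.metadata.name,SELECTOR:.spec.machineConfigSelector

# Labels on the MachineConfig that no pool is selecting.
oc get machineconfig managed-ssh-keys-infra --show-labels
```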

Also seeing:

```
2019-12-13T20:56:22.353291917Z E1213 20:56:22.353269   21974 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:56:22.353308527Z E1213 20:56:22.353294   21974 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-e4183503741870a5d8f36d4f4135b773
```

Comment 3 brad.williams 2019-12-14 05:19:44 UTC
@Kirsten, the "managed-ssh-keys-infra" message is most likely benign. The machine config is present, but we do not have any running infra nodes in these clusters. Also, these are 50-node clusters with identical machineconfigs, and all the machines, except the few mentioned in this BZ, are running as expected.

Comment 4 brad.williams 2019-12-14 05:21:08 UTC
Some more information from the original BZ, regarding the second starter cluster that's having the same issue...

I added a cluster verification step to detect this scenario and ran it against all our production clusters...

2 out of 5 clusters have this issue:

1) us-east-2 (4.1.27)
    Node ip-10-0-166-99.us-east-2.compute.internal is reporting: "failed to run pivot: failed to start pivot.service: exit status 1"

2) ca-central-1 (4.1.27)
    Node ip-10-0-135-172.ca-central-1.compute.internal is reporting: "failed to run pivot: failed to start pivot.service: exit status 1", 
    Node ip-10-0-129-175.ca-central-1.compute.internal is reporting: "failed to run pivot: failed to start pivot.service: exit status 1"

Please let me know if you need anything to help investigate this issue!

Comment 5 Kirsten Garrison 2019-12-16 19:18:07 UTC
@brad, thanks for the info and new bz.

From the MCD on the failing worker node (ip-10-0-166-99.us-east-2.compute.internal):
2019-12-13T20:20:17.72795471Z Deployments:
2019-12-13T20:20:17.72795471Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d9aa91b1085a877be1fb496c2146a005741c93be52f4cb0685e398c96eb37eb
2019-12-13T20:20:17.72795471Z               CustomOrigin: Managed by pivot tool
2019-12-13T20:20:17.72795471Z                    Version: 410.81.20191126.0 (2019-11-26T08:43:52Z)
2019-12-13T20:20:17.72795471Z 
2019-12-13T20:20:17.72795471Z   pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60bd2e7abab54720fc9cafa6c4b7898ad4c0bb46a39a2bf620da22d8dd532c28
2019-12-13T20:20:17.72795471Z               CustomOrigin: Managed by pivot tool
2019-12-13T20:20:17.72795471Z                    Version: 410.81.20191119.0 (2019-11-19T08:42:54Z)

From the MCD on a non-failing worker node:
2019-12-13T20:20:17.984653212Z AutomaticUpdates: disabled
2019-12-13T20:20:17.984653212Z Deployments:
2019-12-13T20:20:17.984653212Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:20:17.984653212Z               CustomOrigin: Managed by pivot tool
2019-12-13T20:20:17.984653212Z                    Version: 410.81.20191204.0 (2019-12-04T04:42:46Z)
2019-12-13T20:20:17.984653212Z 
2019-12-13T20:20:17.984653212Z   pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d9aa91b1085a877be1fb496c2146a005741c93be52f4cb0685e398c96eb37eb
2019-12-13T20:20:17.984653212Z               CustomOrigin: Managed by pivot tool
2019-12-13T20:20:17.984653212Z                    Version: 410.81.20191126.0 (2019-11-26T08:43:52Z)

4.1.27 is a new z-stream release (from last week); unsure why pivot.service isn't starting on 1 node out of 50. The MCD is behaving correctly, I think: since the pivot service failed, it can't validate the on-disk state of the new config. It expects to see the osImageURL it successfully pivoted to (from 4.1.27) on disk, but that doesn't exist because the pivot service didn't start.
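
One way to see the mismatch directly (a sketch; the rendered-config and node names are taken from the logs above):

```
# osImageURL the node is expected to be running, per the rendered config.
oc get machineconfig rendered-worker-e4183503741870a5d8f36d4f4135b773 \
  -o jsonpath='{.spec.osImageURL}{"\n"}'

# Actual rpm-ostree deployments on the failing node.
oc debug node/ip-10-0-166-99.us-east-2.compute.internal -- \
  chroot /host rpm-ostree status
```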

Moving to RHCOS to take a look.

Comment 6 Stephen Cuppett 2019-12-17 11:43:11 UTC
Setting Target to 4.1.z for now. Will this fix be needed in 4.1+ or just 4.1.z?

Comment 8 Stephen Cuppett 2019-12-18 12:22:00 UTC
Setting to the current version (4.3) for investigation as a potential upgrade bug. Once investigated, clones will be created in the needed releases for backports, if any. Can we take a look today?

Comment 9 brad.williams 2019-12-18 19:12:47 UTC
We really need to get some eyes on this problem as soon as possible. We are holding our 4.1.28 upgrades while we wait for someone to investigate, and with 4.1.29 due out later this week, we really need these environments to be available.

Any assistance is greatly appreciated!
Please let me know if you need anything from our end.

Comment 10 Jonathan Lebon 2019-12-18 19:40:42 UTC
Hmm, so it looks like the error here is:

```
Dec 13 20:13:13 ip-10-0-166-99 pivot[2401]: I1213 20:13:13.752278    2401 run.go:16] Running: podman create --net=none --name ostree-container-pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
Dec 13 20:13:14 ip-10-0-166-99 pivot[2401]: Error: error creating container storage: the container name "ostree-container-pivot" is already in use by "9ec7acf7f8dba6eed7438b45b0254dfd5748e0613875468e0d2e8e9135fd9d93". You have to remove that container to be able to reuse that name.: that name is already in use
Dec 13 20:13:14 ip-10-0-166-99 pivot[2401]: F1213 20:13:14.045294    2401 run.go:84] podman: exit status 125
Dec 13 20:13:14 ip-10-0-166-99 systemd[1]: pivot.service: Main process exited, code=exited, status=255/n/a
```

Will try to reproduce this.
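
The collision itself is straightforward to reproduce outside of pivot (a sketch; busybox stands in for the OS image):

```
# First create succeeds, leaving a container in "configured" state.
podman create --net=none --name ostree-container-pivot docker.io/library/busybox

# Reusing the fixed name fails with exit status 125, the same failure
# pivot.service hits when a previous run left its container behind.
podman create --net=none --name ostree-container-pivot docker.io/library/busybox
echo $?   # 125
```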

Comment 11 Colin Walters 2019-12-18 19:43:25 UTC
This is a dup; see https://github.com/openshift/machine-config-operator/pull/1285

Comment 12 Kirsten Garrison 2019-12-18 19:46:42 UTC
@colin, so I can just backport that to 4.2, then 4.1?

Comment 13 brad.williams 2019-12-18 20:04:35 UTC
Is there a known workaround?

Comment 17 Jonathan Lebon 2019-12-18 20:51:36 UTC
> Is there a known workaround?

You should be able to log into the affected machines and forcibly remove the stale container with podman, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1775647#c5.

And actually, I think one will *have* to do this since we can't deliver the patch (which is part of the OS) without pivoting to it. You could probably script something with selectors + `oc debug` to run the workaround on all the machines.
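
For example (an untested sketch; assumes the affected nodes all carry the worker role):

```
# Force-remove the stale pivot container on every worker node so
# pivot.service can recreate it on its next run.
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  oc debug "$node" -- chroot /host podman rm -f ostree-container-pivot
done
```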

Comment 19 brad.williams 2019-12-18 22:08:58 UTC
I have verified that the steps outlined above (https://bugzilla.redhat.com/show_bug.cgi?id=1775647#c5) do resolve the pivot.service issue.  However, I did have to manually reboot the node to get things moving again.

Comment 23 Michael Nguyen 2020-01-18 00:29:25 UTC
Verified on registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2020-01-03-052918 running 410.81.20200106.0 (the first RHCOS build with the updated pivot). After upgrading to registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2020-01-14-065036, the journal showed the ostree-container-pivot container name with a UUID suffix.

Jan 18 00:14:43 ip-10-0-139-8 pivot[57570]: I0118 00:14:43.173832   57570 run.go:16] Running: podman kill ostree-container-pivot-7b31be9b-3987-11ea-8cab-0ad255f99f4b
Jan 18 00:14:43 ip-10-0-139-8 rpm-ostree[57578]: client(id:cli dbus:1.412 unit:pivot.service uid:0) vanished; remaining=0
Jan 18 00:14:43 ip-10-0-139-8 pivot[57570]: Error: can only kill running containers. e8f4e8e6e8166ed9ae14c38aa9cb2b2e30f41e114c6495384829fde81d46571c is in state configured: container state improper
Jan 18 00:14:43 ip-10-0-139-8 pivot[57570]: W0118 00:14:43.277940   57570 run.go:75] (ignored) podman: exit status 125
Jan 18 00:14:43 ip-10-0-139-8 pivot[57570]: I0118 00:14:43.277975   57570 run.go:16] Running: podman rm -f ostree-container-pivot-7b31be9b-3987-11ea-8cab-0ad255f99f4b
Jan 18 00:14:43 ip-10-0-139-8 podman[58913]: 2020-01-18 00:14:43.417900806 +0000 UTC m=+0.106096849 container remove e8f4e8e6e8166ed9ae14c38aa9cb2b2e30f41e114c6495384829fde81d46571c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c97112885ec4102014f8cd56825d7170f92dbcc9d17ab243566843040ab33aa2, name=ostree-container-pivot-7b31be9b-3987-11ea-8cab-0ad255f99f4b)
Jan 18 00:14:43 ip-10-0-139-8 pivot[57570]: e8f4e8e6e8166ed9ae14c38aa9cb2b2e30f41e114c6495384829fde81d46571c
Jan 18 00:14:43 ip-10-0-139-8 pivot[57570]: I0118 00:14:43.421987   57570 run.go:16] Running: podman rmi quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c97112885ec4102014f8cd56825d7170f92dbcc9d17ab243566843040ab33aa2
Jan 18 00:14:44 ip-10-0-139-8 pivot[57570]: f699d91e08a680da3fc940a16d25ba5eb030ea08538069ce767d361a3c16782f
Jan 18 00:14:44 ip-10-0-139-8 pivot[57570]: I0118 00:14:44.588981   57570 root.go:403] unable to parse tuning file /etc/pivot/kernel-args: stat /etc/pivot/kernel-args: no such file or directory
Jan 18 00:14:44 ip-10-0-139-8 systemd[1]: pivot.service: Consumed 37.904s CPU time
Jan 18 00:14:44 ip-10-0-139-8 hyperkube[1360]: I0118 00:14:44.592452    1360 manager.go:1068] Destroyed container: "/system.slice/pivot.service" (aliases: [], namespace: "")
Jan 18 00:15:59 localhost systemd[1]: Starting dracut pre-pivot and cleanup hook...
Jan 18 00:15:59 localhost dracut-pre-pivot[839]: Jan 18 00:15:59 | /etc/multipath.conf does not exist, blacklisting all devices.
Jan 18 00:15:59 localhost dracut-pre-pivot[839]: Jan 18 00:15:59 | You can run "/sbin/mpathconf --enable" to create
Jan 18 00:15:59 localhost dracut-pre-pivot[839]: Jan 18 00:15:59 | /etc/multipath.conf. See man mpathconf(8) for more details
Jan 18 00:15:59 localhost systemd[1]: Started dracut pre-pivot and cleanup hook.
Jan 18 00:15:59 localhost systemd[1]: Stopped dracut pre-pivot and cleanup hook.
Jan 18 00:16:08 localhost.localdomain rpm-ostree[1245]: ● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c97112885ec4102014f8cd56825d7170f92dbcc9d17ab243566843040ab33aa2
Jan 18 00:16:08 localhost.localdomain rpm-ostree[1245]:               CustomOrigin: Managed by pivot tool

Comment 25 errata-xmlrpc 2020-02-13 06:14:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0399

Comment 26 Red Hat Bugzilla 2023-09-14 05:48:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

