Bug 1827712 - Rebase to 2020.2 to fix initrd regeneration
Summary: Rebase to 2020.2 to fix initrd regeneration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: rpm-ostree
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 8.3
Assignee: Colin Walters
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On: 1836306
Blocks: 1845427
 
Reported: 2020-04-24 15:30 UTC by Antonio Murdaca
Modified: 2020-11-04 03:12 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-04 03:11:13 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1593769 0 unspecified CLOSED local-fs.target: Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle ... 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2020:4708 0 None None None 2020-11-04 03:11:25 UTC

Description Antonio Murdaca 2020-04-24 15:30:43 UTC
Description of problem:

The scenario is as follows (RT kernel feature):

- switching to realtime (works 100% of the time):

I0424 13:35:34.004411    2154 update.go:1352] Initiating switch from kernel default to realtime
I0424 13:35:34.007494    2154 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
911e6fd3f6449785f67780583687050438f7ea0b444b7e3c17a840356f23f8ef
I0424 13:35:34.579774    2154 rpm-ostree.go:368] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name mcd-1f0fb1c9-113e-4da9-9b93-dce24a250074 registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
I0424 13:35:34.681894    2154 rpm-ostree.go:368] Running captured: podman mount b1a209a0996dc5c4ecd25f9b416a7af94f6af101ab9e51b332b88eadad54db67
I0424 13:35:34.765150    2154 update.go:1352] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "/var/lib/containers/storage/overlay/4bbc7af933accb547ada917fd4770054da858dc52f9f1d5d26fcc1ea8c9c1fe3/merged/kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64.rpm" "--install" "/var/lib/containers/storage/overlay/4bbc7af933accb547ada917fd4770054da858dc52f9f1d5d26fcc1ea8c9c1fe3/merged/kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64.rpm" "--install" "/var/lib/containers/storage/overlay/4bbc7af933accb547ada917fd4770054da858dc52f9f1d5d26fcc1ea8c9c1fe3/merged/kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64.rpm" "--install" "/var/lib/containers/storage/overlay/4bbc7af933accb547ada917fd4770054da858dc52f9f1d5d26fcc1ea8c9c1fe3/merged/kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64.rpm"]
I0424 13:37:24.913737    2154 update.go:1352] Deleted container and removed OSContainer image

The above will boot into this:

Deployments:
* pivot://registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202004240827-0 (2020-04-24T08:33:13Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-147.8.1.el8_1
             LocalPackages: kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64

  pivot://registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202004240827-0 (2020-04-24T08:33:13Z)

- switching back to default kernel (rollback) (flakes out about 1 in 3 times on some CI jobs, link below):

I0424 13:40:06.449101    2794 update.go:1352] Initiating switch from kernel realtime to default
I0424 13:40:06.454685    2794 rpm-ostree.go:368] Running captured: rpm-ostree status --json
I0424 13:40:06.585732    2794 update.go:1352] Switching to kernelType=default, invoking rpm-ostree ["override" "reset" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--uninstall" "kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64" "--uninstall" "kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64" "--uninstall" "kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64" "--uninstall" "kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64"]

With the above, about 1 in 3 runs doesn't boot into the other, correct deployment but stays on realtime, as you can see below:

Deployments:
* pivot://registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202004240827-0 (2020-04-24T08:33:13Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-147.8.1.el8_1
             LocalPackages: kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64

  pivot://registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202004240827-0 (2020-04-24T08:33:13Z)

A normal rollback would boot back into this:

Deployments:
* pivot://registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202004240827-0 (2020-04-24T08:33:13Z)

  pivot://registry.svc.ci.openshift.org/ci-op-rmdnn7k7/stable@sha256:84fdfb9305bb8ab1ba96e27cf7051e836a84496ce1d15753f39dc9048f6f0eb7
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202004240827-0 (2020-04-24T08:33:13Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-147.8.1.el8_1
             LocalPackages: kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
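
For reference, a minimal sketch of how the stale state above could be spotted from a shell on the node; the grep patterns just target the fields shown in the status output, this is not MCO code:

# The booted kernel still carries an ".rt" suffix if the rollback didn't take effect
uname -r | grep -q '\.rt' && echo "node is still on the realtime kernel"
# After a real rollback, the booted deployment should no longer show
# RemovedBasePackages or the kernel-rt LocalPackages seen above
rpm-ostree status | grep -E 'RemovedBasePackages|kernel-rt' || echo "rollback took effect"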

Version-Release number of selected component (if applicable):

RHCOS 4.5 rpm-ostree I guess


How reproducible:

1/3 in some jobs, not 100% reproducible. One job is here: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2025

The logs from above are here as well. The first one is the MCD log for the node that ISN'T rolling back: rpm-ostree didn't error out, so the MCD assumes everything is OK, but the kernel never rolled back:

- https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2025/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-q4rl4_machine-config-daemon.log

These 2 nodes rolled back successfully instead:

- https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2025/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-jz5tp_machine-config-daemon.log
- https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2025/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-l72qp_machine-config-daemon.log

Unfortunately, the CI isn't grabbing the journal for the node, only the one from the last boot (which isn't the boot our test runs in). So, no journal, though it seems like something easily reproducible with rpm-ostree if the bug is there.

Steps to Reproduce:
1. not sure, but there are jobs which have failed like this
2.
3.

Actual results:

RHCOS didn't boot into the correct deployment (the one without the RT kernel)


Expected results:

RHCOS boots back into the correct deployment without the RT kernel when rpm-ostree reports the rollback succeeded


Additional info:

Comment 1 Jonathan Lebon 2020-04-24 15:56:58 UTC
When the node doesn't roll back, it's likely because ostree-finalize-staged.service failed. Need to look at the journal logs of the previous boot.

> Unfortunately, the CI isn't grabbing the journal for the node but only the one from last boot (which isn't the time our test runs). So, no journal, tho it seems something easily reproducible with rpm-ostree if the bug is there.

Ahh... that would have been helpful. :) Is there a way to configure CI to capture the journal? Could we just add a `journalctl -b -1` to the MCO tests when it reboots and finds that the node isn't back in vanilla kernel?

We can try to reproduce this locally, but if it only happens 1/3 of the time, I'm going to bet it's some I/O-related thing that just doing it in qemu won't reproduce (rpm-ostree upstream CI has tons of tests that implicitly test ostree-finalize-staged.service). So, we'd probably need to reproduce this in an AWS cluster with kernel-rt turned on.
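
A minimal sketch of what that journal grab could look like from the test side, assuming `oc debug` access to the node (the node name is a placeholder):

# previous boot's journal, scoped to the finalization unit that likely failed
oc debug node/<node-name> -- chroot /host journalctl -b -1 -u ostree-finalize-staged.service --no-pager
# or capture the entire previous boot for the CI artifacts
oc debug node/<node-name> -- chroot /host journalctl -b -1 --no-pager > previous-boot.journal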

Comment 2 Antonio Murdaca 2020-04-24 16:37:00 UTC
- grab previous boot journal in CI: https://github.com/openshift/release/pull/8558
- make the MCO fail tests whenever a test fails (this prevents the tests from going on so we can grab the logs we need): https://github.com/openshift/machine-config-operator/pull/1683

Once those merge, I'll write back here, and at the first occurrence we see in our CI we should be able to provide those logs to you!

Comment 3 Sinny Kumari 2020-04-27 06:59:45 UTC
With Antonio's PR merged (https://github.com/openshift/release/pull/8558), we now have the journal log from the previous boot. From one of the failing e2e-gcp-op tests, the journal log for the worker nodes is at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1685/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2034/artifacts/e2e-gcp-op/nodes/workers-journal. From the node log where the RT kernel rollback failed, the failure below could be related:

Apr 27 02:18:19.799745 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal ostree[35041]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
Apr 27 02:18:19.800480 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: Stopped target Sockets.
Apr 27 02:18:19.806791 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: Closed D-Bus System Message Bus Socket.
Apr 27 02:18:19.817182 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited status=1
Apr 27 02:18:19.817591 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.
Apr 27 02:18:19.819596 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: Stopped OSTree Finalize Staged Deployment.
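
For anyone following along, a minimal sketch of pulling those lines out of that CI artifact locally (the output filename is arbitrary):

curl -sL -o workers-journal 'https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1685/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2034/artifacts/e2e-gcp-op/nodes/workers-journal'
grep -E 'ostree-finalize-staged|Unexpected state' workers-journal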

Comment 6 Colin Walters 2020-05-08 14:07:48 UTC
> Apr 27 02:18:19.799745 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal ostree[35041]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory

Hmmm.  I wonder if we're sometimes racing with having `/boot` being unmounted or something?  Definitely a CoreOS issue.

Will see if we can reproduce.

Comment 7 Colin Walters 2020-05-08 14:09:40 UTC
A bit of background on this, see https://github.com/ostreedev/ostree/commit/eb506c759c666af2461f1ba3dda4e31ea49ebc41

Comment 8 Antonio Murdaca 2020-05-08 14:13:15 UTC
(In reply to Colin Walters from comment #6)
> > Apr 27 02:18:19.799745 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal ostree[35041]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
> 
> Hmmm.  I wonder if we're sometimes racing with having `/boot` being
> unmounted or something?  Definitely a CoreOS issue.
> 
> Will see if we can reproduce.

Might not be fully related, but we're hitting this on the RT kernel test, which just installs the RT rpms, verifies it works, and rolls back (and the rollback fails). We never encountered this behavior in the other MCO tests.

Comment 9 Colin Walters 2020-05-08 14:13:56 UTC
Two further notes: rpm-ostree has special support for *displaying* this state:
https://github.com/coreos/rpm-ostree/pull/1609

And in theory the MCO could actively say "hmm, last deploy failed, let's try again".
(Though we still want to fix this)

Also related, if we were doing https://github.com/openshift/machine-config-operator/issues/1190 then the "source of truth" would be ostree - which means when the MCD came up again it'd be more obvious that "ok I tried to reboot but it didn't work", and none of the new config changes would be applied either.

Comment 10 Jonathan Lebon 2020-05-08 14:29:41 UTC
> Hmmm.  I wonder if we're sometimes racing with having `/boot` being unmounted or something?  Definitely a CoreOS issue.

Yeah, I think you're right. Looking at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1688/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2040/artifacts/e2e-gcp-op/nodes/workers-journal makes it really easy to see:

Apr 27 15:31:18.993110 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: Unmounting Boot partition...
...
Apr 27 15:31:19.057339 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: boot.mount: Consumed 27ms CPU time
...
Apr 27 15:31:19.291991 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: Stopping OSTree Finalize Staged Deployment...
Apr 27 15:31:19.324848 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal ostree[40698]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory

So it might somehow be a regression from https://github.com/ostreedev/ostree/commit/a8754ae3708431d72be63723a6362a4edc82df84 ? It's weird though because we have `After=local-fs.target`, so we should already be stopped before boot.mount.
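
A minimal sketch for double-checking that ordering on a node (plain systemd commands, nothing RHCOS-specific):

# show the unit file, including the After=local-fs.target ordering mentioned above
systemctl cat ostree-finalize-staged.service
# list the units ostree-finalize-staged.service is ordered after
systemctl list-dependencies --after ostree-finalize-staged.service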

Comment 11 Colin Walters 2020-05-11 21:31:44 UTC
https://github.com/ostreedev/ostree/pull/2096

Comment 12 Colin Walters 2020-05-12 17:29:05 UTC
Ahh this might be it:


Apr 27 15:34:13.321405 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found ordering cycle on var-lib-nfs-rpc_pipefs.mount/stop
Apr 27 15:34:13.321421 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on systemd-tmpfiles-setup.service/stop
Apr 27 15:34:13.321426 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on local-fs.target/stop
Apr 27 15:34:13.321433 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle starting with local-fs.target/stop

Comment 13 Colin Walters 2020-05-12 19:31:51 UTC
This type of ordering cycle is a "really shouldn't happen" bug that we need some alerting on
(xref https://github.com/openshift/machine-config-operator/issues/1365 )

I'm not reproducing it in a local 4.4 libvirt cluster even after `systemctl start nfs-server`.

I also did some ad-hoc checking of the journals from this recent 4.4.3 upgrade test:
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/27710/artifacts/e2e-aws-upgrade/nodes/
and not seeing it there.

Also looked at the logs from a recent MCO merged PR that passed:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1711/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2099
and it's not there either.

At least one benefit of our CI system is we have...a whole lot...of logs from running clusters, so I can probably work up a script to scrape the journals for this and see whether it's some kind of race.
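
A minimal sketch of such a scrape, assuming the journals from the CI artifacts have already been downloaded into a local directory (the directory name is arbitrary):

# list which downloaded journals contain the ordering-cycle breakage
grep -rl 'Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle' ./ci-journals/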

Comment 14 Sinny Kumari 2020-05-13 06:26:07 UTC
(In reply to Colin Walters from comment #13)

> Also looked at the logs from a recent MCO merged PR that passed:
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1711/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2099
> and it's not there either.

Yeah, we haven't seen this issue recently. It existed consistently for a few days and seems to have gone away, but who knows when it will be back to surprise us.

Comment 15 Jonathan Lebon 2020-05-13 14:17:40 UTC
> Ahh this might be it:
>
> Apr 27 15:34:13.321405 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found ordering cycle on var-lib-nfs-rpc_pipefs.mount/stop
> Apr 27 15:34:13.321421 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on systemd-tmpfiles-setup.service/stop
> Apr 27 15:34:13.321426 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on local-fs.target/stop
> Apr 27 15:34:13.321433 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle starting with local-fs.target/stop

Ahh yeah, I think you're on the right track. Specifically, I think it's this other slightly different occurrence of the ordering cycle:

> Apr 27 15:31:16.363678 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Found ordering cycle on local-fs.target/stop
> Apr 27 15:31:16.363693 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Found dependency on var-lib-nfs-rpc_pipefs.mount/stop
> Apr 27 15:31:16.363697 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Found dependency on systemd-tmpfiles-setup.service/stop
> Apr 27 15:31:16.363703 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Job local-fs.target/stop deleted to break ordering cycle starting with systemd-tmpfiles-setup.service/stop

Here, we deleted local-fs.target/stop, which I think caused systemd to then run ostree-finalize-staged later than it normally would (since it's normally keyed off of local-fs.target), when /boot was already unmounted.

Thinking more on this, the NFS stuff definitely rings a bell. Ahh yup: https://github.com/openshift/os/issues/379 (the ordering cycle shows up in https://github.com/openshift/os/issues/379#issuecomment-472164793), which leads to https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/213.

And looking at the same journal logs, I see:

> Apr 27 15:20:35.501506 localhost rpcbind[629]: cannot get uid of 'rpc': No such file or directory
> Apr 27 15:20:35.584554 localhost rpc.statd[633]: Version 2.3.3 starting
> Apr 27 15:20:35.590062 localhost rpc.statd[633]: Initializing NSM state
> Apr 27 15:20:35.635044 localhost rpc.statd[633]: Running as root.  chown /var/lib/nfs/statd to choose different user
> Apr 27 15:20:35.637588 localhost rpc.statd[633]: Failed to register (statd, 1, udp): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.639994 localhost rpc.statd[633]: Failed to register (statd, 1, tcp): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.642382 localhost rpc.statd[633]: Failed to register (statd, 1, udp6): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.644711 localhost rpc.statd[633]: Failed to register (statd, 1, tcp6): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.644742 localhost rpc.statd[633]: failed to create RPC listeners, exiting
> Apr 27 15:20:35.694535 localhost rpc.idmapd[638]: Setting log level to 0

in the initrd -- something is pulling the nfs module back in?

I think what might be going on here is that the nfs dracut module is mounting /var/lib/nfs/rpc_pipefs, which confuses real root systemd's handling of the var-lib-nfs-rpc_pipefs.mount unit because it has `After=systemd-tmpfiles-setup.service` yet it's already mounted. So on shutdown, it hits the dependency cycle.
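
A minimal sketch of confirming that state on an affected node; both the path and the mount unit name come from the log lines above:

# is rpc_pipefs already mounted by the time the real root comes up?
findmnt /var/lib/nfs/rpc_pipefs
# what ordering does the mount unit declare in the real root?
systemctl cat var-lib-nfs-rpc_pipefs.mount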

And ohhh, the reason why the NFS module is getting pulled back in is that we're rebuilding the initramfs in this case and we're not re-running dracut with the canonical args. I think backporting https://github.com/coreos/rpm-ostree/commit/e9011530e52fbc4e7b002c29ff20d2c08a97435c would fix this (we initially were going to backport that for https://bugzilla.redhat.com/show_bug.cgi?id=1806588, but ended up keeping it conservative).

An easy workaround to try of course for now would be to add `rpm-ostree initramfs --enable --arg='--omit' --arg='nfs'`. Let's see if that helps and if so, use this ticket to track backporting the rpm-ostree patch for canonical dracut args? At the same time, we should engage with NFS maintainers to fix the dependencies there. Offhand I think that the initrd mount should be torn down on switch root if rootfs isn't on NFS?
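
A minimal sketch of that workaround plus a sanity check; the lsinitrd step is an extra check on top of the suggested command, and the initramfs path layout is assumed for an OSTree-based host:

# client-side initramfs regeneration, omitting the nfs dracut module
rpm-ostree initramfs --enable --arg='--omit' --arg='nfs'
systemctl reboot
# after the reboot, confirm the nfs dracut module really was omitted
lsinitrd -m "$(ls -t /boot/ostree/*/initramfs-*.img | head -n1)" | grep -w nfs || echo "nfs module omitted"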

Comment 16 Colin Walters 2020-05-13 14:22:38 UTC
OK this search turned up more hits: "Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle"

- https://github.com/openshift/os/issues/379
- https://github.com/systemd/systemd/issues/5991

Hmmm...but I'm still not hitting this trying to enable NFS and am confused how it's happening because again AFAICS the pipefs mount *isn't* ordered against local-fs.target.

Comment 17 Colin Walters 2020-05-13 14:23:13 UTC
> And ohhh, the reason why the NFS module is getting pulled back in is that we're rebuilding the initramfs in this case and we're not re-running dracut with the canonical args. I think backporting https://github.com/coreos/rpm-ostree/commit/e9011530e52fbc4e7b002c29ff20d2c08a97435c would fix this (we initially were going to backport that for https://bugzilla.redhat.com/show_bug.cgi?id=1806588, but ended up keeping it conservative).

Nice call!  Yes that has to be it.

Comment 18 Colin Walters 2020-05-13 17:49:57 UTC
OK so I think let's get new ostree/rpm-ostree releases spun up for RHCOS and queued for RHEL 8.3 on general principle, which will pull in the above patch.  Then we can cherry pick those.

Comment 22 Sinny Kumari 2020-05-19 09:56:30 UTC
Noticed this today in MCO e2e-gcp-op test run on 4.4 PR https://github.com/openshift/machine-config-operator/pull/1737#issuecomment-630716314

Comment 23 Yaniv Joseph 2020-07-02 09:05:15 UTC
Any plans to backport to 8.2 ?

Comment 24 Micah Abbott 2020-07-24 12:57:43 UTC
I ran a loop on an OCP 4.5.2 cluster (RHCOS 45.82.202007141718-0 - rpm-ostree-2020.2-2.el8) that created a MachineConfig to switch to the RT kernel and then deleted the MachineConfig to switch back to the vanilla kernel.

The loop completed 10+ times successfully, confirming each time that the nodes were properly updated with the right kernel.

Marking as VERIFIED.
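
For completeness, a minimal sketch of one iteration of such a loop, assuming a cluster-admin kubeconfig; the MachineConfig name is arbitrary, and `kernelType: realtime` is the knob being exercised:

cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-realtime
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelType: realtime
EOF
# wait for the worker pool to roll out (in practice, wait for Updating=True first,
# then Updated=True), then check that each worker boots the RT kernel
oc wait mcp/worker --for=condition=Updated --timeout=30m
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  oc debug "$node" -- chroot /host uname -r
done
# switch back to the vanilla kernel and repeat the check (no ".rt" expected)
oc delete machineconfig 99-worker-realtime
oc wait mcp/worker --for=condition=Updated --timeout=30m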

Comment 27 errata-xmlrpc 2020-11-04 03:11:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (rpm-ostree bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4708

