Bug 1827712
Summary: | Rebase to 2020.2 to fix initrd regeneration | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Antonio Murdaca <amurdaca> |
Component: | rpm-ostree | Assignee: | Colin Walters <walters> |
Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 8.3 | CC: | bbreard, imcleod, jlebon, jligon, miabbott, nstielau, skumari, smilner, yjoseph |
Target Milestone: | rc | Keywords: | Rebase |
Target Release: | 8.3 | Flags: | pm-rhel: mirror+ |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-11-04 03:11:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1836306 | ||
Bug Blocks: | 1845427 |
Description: Antonio Murdaca, 2020-04-24 15:30:43 UTC
When the node doesn't roll back, it's likely because ostree-finalize-staged.service failed. Need to look at the journal logs of the previous boot.
> Unfortunately, the CI isn't grabbing the journal for the node but only the one from the last boot (which isn't the time our test runs). So, no journal, though it seems like something easily reproducible with rpm-ostree if the bug is there.
Ahh... that would have been helpful. :) Is there a way to configure CI to capture the journal? Could we just add a `journalctl -b -1` to the MCO tests when the node reboots and the test finds that it isn't back on the vanilla kernel?
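A minimal sketch of what such a gather step could look like, assuming it runs on the node after the reboot (the artifact path is just an example, not the actual MCO test code):

```bash
# Dump the full journal from the previous boot (the one where finalization ran).
journalctl -b -1 --no-pager > /tmp/artifacts/previous-boot.journal

# Or narrow it down to the unit suspected above.
journalctl -b -1 --no-pager -u ostree-finalize-staged.service
```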
We can try to reproduce this locally, but if it only happens 1/3 of the time, I'm going to bet it's some I/O-related thing that just running in qemu won't reproduce (the rpm-ostree upstream CI has tons of tests that implicitly exercise ostree-finalize-staged.service). So we'd probably need to reproduce this in an AWS cluster with kernel-rt turned on.
- grab the previous boot journal in CI: https://github.com/openshift/release/pull/8558
- make the MCO fail the test suite whenever a test fails (this prevents the tests from going on, so we can grab the logs we need): https://github.com/openshift/machine-config-operator/pull/1683

Once those merge, I'll write back here, and for the first occurrences we see in our CI we should be able to provide those logs to you!

With Antonio's PR merged (https://github.com/openshift/release/pull/8558), we now have the journal log from the previous boot. From one of the failing e2e-gcp-op tests, the journal log for the worker node is at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1685/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2034/artifacts/e2e-gcp-op/nodes/workers-journal. From the node log where the RT kernel rollback failed, the failure below could be related:

Apr 27 02:18:19.799745 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal ostree[35041]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
Apr 27 02:18:19.800480 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: Stopped target Sockets.
Apr 27 02:18:19.806791 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: Closed D-Bus System Message Bus Socket.
Apr 27 02:18:19.817182 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited status=1
Apr 27 02:18:19.817591 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.
Apr 27 02:18:19.819596 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal systemd[1]: Stopped OSTree Finalize Staged Deployment.

Here's another job that failed because of this, with logs, if it helps further: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1688/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2040

> Apr 27 02:18:19.799745 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal ostree[35041]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
Hmmm. I wonder if we're sometimes racing with having `/boot` being unmounted or something? Definitely a CoreOS issue.
Will see if we can reproduce.
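For reference, a quick way to check a downloaded workers-journal artifact (like the one linked above) for this failure signature; the grep patterns are just an illustration:

```bash
# Find the finalize failure lines in a saved node journal.
grep -nE 'Unexpected state: /run/ostree-booted|ostree-finalize-staged.service: Failed' workers-journal

# Show some context before each hit to see what systemd was unmounting at the time.
grep -nB8 'Unexpected state: /run/ostree-booted' workers-journal
```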
A bit of background on this, see https://github.com/ostreedev/ostree/commit/eb506c759c666af2461f1ba3dda4e31ea49ebc41

(In reply to Colin Walters from comment #6)
> > Apr 27 02:18:19.799745 ci-op-49sqd-w-c-8jd2x.c.openshift-gce-devel-ci.internal ostree[35041]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
>
> Hmmm. I wonder if we're sometimes racing with having `/boot` being
> unmounted or something? Definitely a CoreOS issue.
>
> Will see if we can reproduce.

Might not be fully related, but we're hitting this on the RT kernel test, which just installs the RT rpms, verifies they work, and rolls back (and the rollback fails). We never encountered this behavior in the other MCO tests.

Two further notes: rpm-ostree has special support for *displaying* this state: https://github.com/coreos/rpm-ostree/pull/1609
And in theory the MCO could actively say "hmm, last deploy failed, let's try again". (Though we still want to fix this.)

Also related, if we were doing https://github.com/openshift/machine-config-operator/issues/1190 then the "source of truth" would be ostree - which means when the MCD came up again it'd be more obvious that "ok, I tried to reboot but it didn't work", and none of the new config changes would be applied either.

> Hmmm. I wonder if we're sometimes racing with having `/boot` being unmounted or something? Definitely a CoreOS issue.

Yeah, I think you're right. Looking at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1688/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2040/artifacts/e2e-gcp-op/nodes/workers-journal makes it really easy to see:

Apr 27 15:31:18.993110 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: Unmounting Boot partition...
...
Apr 27 15:31:19.057339 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: boot.mount: Consumed 27ms CPU time
...
Apr 27 15:31:19.291991 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: Stopping OSTree Finalize Staged Deployment...
Apr 27 15:31:19.324848 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal ostree[40698]: error: Unexpected state: /run/ostree-booted found, but no /boot/loader directory

So it might somehow be a regression from https://github.com/ostreedev/ostree/commit/a8754ae3708431d72be63723a6362a4edc82df84 ? It's weird though, because we have `After=local-fs.target`, so we should already be stopped before boot.mount.

Ahh, this might be it:

Apr 27 15:34:13.321405 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found ordering cycle on var-lib-nfs-rpc_pipefs.mount/stop
Apr 27 15:34:13.321421 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on systemd-tmpfiles-setup.service/stop
Apr 27 15:34:13.321426 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on local-fs.target/stop
Apr 27 15:34:13.321433 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle starting with local-fs.target/stop

This type of ordering cycle is a "really shouldn't happen" bug that we need some alerting on (xref https://github.com/openshift/machine-config-operator/issues/1365).

I'm not reproducing it in a local 4.4 libvirt cluster even after `systemctl start nfs-server`.
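As a rough illustration of the kind of alerting mentioned above (this is not an existing MCO check; the messages are the ones systemd logs when it breaks a cycle):

```bash
# Fail loudly if systemd had to break an ordering cycle on this boot.
if journalctl -b --no-pager | grep -E 'Found ordering cycle|deleted to break ordering cycle'; then
    echo "systemd ordering cycle detected; see the journal lines above" >&2
    exit 1
fi
```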
I also did some ad-hoc checking of the journals from this recent 4.4.3 upgrade test: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/27710/artifacts/e2e-aws-upgrade/nodes/ and I'm not seeing it there.

Also looked at the logs from a recent MCO merged PR that passed: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1711/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2099 and it's not there either.

At least one benefit of our CI system is we have...a whole lot...of logs from running clusters, so I can probably work up a script to scrape the journals for this and see whether it's some kind of race.

(In reply to Colin Walters from comment #13)
> Also looked at the logs from a recent MCO merged PR that passed:
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1711/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2099
> and it's not there either.

Yeah, we haven't seen this issue recently. It existed consistently for a few days and seems to have gone away, but who knows when it will come back to surprise us.

> Ahh this might be it:
>
> Apr 27 15:34:13.321405 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found ordering cycle on var-lib-nfs-rpc_pipefs.mount/stop
> Apr 27 15:34:13.321421 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on systemd-tmpfiles-setup.service/stop
> Apr 27 15:34:13.321426 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Found dependency on local-fs.target/stop
> Apr 27 15:34:13.321433 ci-op-vdzgv-w-d-q8t5p.c.openshift-gce-devel-ci.internal systemd[1]: local-fs.target: Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle starting with local-fs.target/stop

Ahh yeah, I think you're on the right track. Specifically, I think it's this other, slightly different occurrence of the ordering cycle:

> Apr 27 15:31:16.363678 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Found ordering cycle on local-fs.target/stop
> Apr 27 15:31:16.363693 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Found dependency on var-lib-nfs-rpc_pipefs.mount/stop
> Apr 27 15:31:16.363697 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Found dependency on systemd-tmpfiles-setup.service/stop
> Apr 27 15:31:16.363703 ci-op-vdzgv-w-b-mjlw9.c.openshift-gce-devel-ci.internal systemd[1]: systemd-tmpfiles-setup.service: Job local-fs.target/stop deleted to break ordering cycle starting with systemd-tmpfiles-setup.service/stop

Here, we deleted local-fs.target/stop, which I think caused systemd to then run ostree-finalize-staged later than it normally would (since it's normally keyed off of local-fs.target), when /boot was already unmounted.

Thinking more on this, the NFS stuff definitely rings a bell. Ahh yup: https://github.com/openshift/os/issues/379 (the ordering cycle shows up in https://github.com/openshift/os/issues/379#issuecomment-472164793), which leads to https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/213.
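A sketch of the kind of scrape script mentioned above, assuming the per-node journals have already been downloaded locally (the directory layout is hypothetical):

```bash
# Count how many gathered node journals contain the broken-ordering-cycle signature.
pattern='Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle'
hits=0
for journal in ./artifacts/*/nodes/*-journal; do
    if grep -q "$pattern" "$journal"; then
        echo "hit: $journal"
        hits=$((hits + 1))
    fi
done
echo "journals with the cycle: $hits"
```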
And looking at the same journal logs, I see:

> Apr 27 15:20:35.501506 localhost rpcbind[629]: cannot get uid of 'rpc': No such file or directory
> Apr 27 15:20:35.584554 localhost rpc.statd[633]: Version 2.3.3 starting
> Apr 27 15:20:35.590062 localhost rpc.statd[633]: Initializing NSM state
> Apr 27 15:20:35.635044 localhost rpc.statd[633]: Running as root. chown /var/lib/nfs/statd to choose different user
> Apr 27 15:20:35.637588 localhost rpc.statd[633]: Failed to register (statd, 1, udp): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.639994 localhost rpc.statd[633]: Failed to register (statd, 1, tcp): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.642382 localhost rpc.statd[633]: Failed to register (statd, 1, udp6): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.644711 localhost rpc.statd[633]: Failed to register (statd, 1, tcp6): svc_reg() err: RPC: Remote system error - Connection refused
> Apr 27 15:20:35.644742 localhost rpc.statd[633]: failed to create RPC listeners, exiting
> Apr 27 15:20:35.694535 localhost rpc.idmapd[638]: Setting log level to 0

in the initrd -- something is pulling the nfs module back in?

I think what might be going on here is that the nfs dracut module is mounting /var/lib/nfs/rpc_pipefs, which confuses the real root systemd's handling of the var-lib-nfs-rpc_pipefs.mount unit, because that unit has `After=systemd-tmpfiles-setup.service` yet the filesystem is already mounted. So on shutdown, it hits the dependency cycle.

And ohhh, the reason why the NFS module is getting pulled back in is that we're rebuilding the initramfs in this case and we're not re-running dracut with the canonical args. I think backporting https://github.com/coreos/rpm-ostree/commit/e9011530e52fbc4e7b002c29ff20d2c08a97435c would fix this (we initially were going to backport that for https://bugzilla.redhat.com/show_bug.cgi?id=1806588, but ended up keeping it conservative).

An easy workaround to try for now, of course, would be to add `rpm-ostree initramfs --enable --arg='--omit' --arg='nfs'`. Let's see if that helps, and if so, use this ticket to track backporting the rpm-ostree patch for canonical dracut args?

At the same time, we should engage with the NFS maintainers to fix the dependencies there. Offhand, I think the initrd mount should be torn down on switch root if the rootfs isn't on NFS?

OK, this search turned up more hits for "Job var-lib-nfs-rpc_pipefs.mount/stop deleted to break ordering cycle":
- https://github.com/openshift/os/issues/379
- https://github.com/systemd/systemd/issues/5991

Hmmm...but I'm still not hitting this trying to enable NFS, and I'm confused how it's happening, because again AFAICS the pipefs mount *isn't* ordered against local-fs.target.

> And ohhh, the reason why the NFS module is getting pulled back in is that we're rebuilding the initramfs in this case and we're not re-running dracut with the canonical args. I think backporting https://github.com/coreos/rpm-ostree/commit/e9011530e52fbc4e7b002c29ff20d2c08a97435c would fix this (we initially were going to backport that for https://bugzilla.redhat.com/show_bug.cgi?id=1806588, but ended up keeping it conservative).
Nice call! Yes that has to be it.
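For anyone trying the workaround above, a hedged sketch of applying it and checking the result (the lsinitrd glob is just an example; the exact initramfs path depends on the deployment):

```bash
# Enable client-side initramfs regeneration with the nfs dracut module omitted, then reboot.
rpm-ostree initramfs --enable --arg='--omit' --arg='nfs'
systemctl reboot

# After the reboot, confirm the regenerated initramfs no longer carries the nfs module.
lsinitrd /boot/ostree/*/initramfs-*.img | grep -i nfs || echo "no nfs bits found"
```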
OK, so I think let's get new ostree/rpm-ostree releases spun up for RHCOS and queued for RHEL 8.3 on general principle, which will pull in the above patch. Then we can cherry-pick those.

Noticed this today in an MCO e2e-gcp-op test run on a 4.4 PR: https://github.com/openshift/machine-config-operator/pull/1737#issuecomment-630716314

Any plans to backport to 8.2?

I ran a loop on an OCP 4.5.2 cluster (RHCOS 45.82.202007141718-0 - rpm-ostree-2020.2-2.el8) that created a MachineConfig to switch to the RT kernel and then deleted the MachineConfig to switch back to the vanilla kernel. The loop completed 10+ times successfully, confirming each time that the nodes were properly updated with the right kernel. Marking as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (rpm-ostree bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4708
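For reference, a rough sketch of the kind of RT-kernel toggle loop described in the verification above; the MachineConfig name, timeouts, and wait conditions are assumptions, while `kernelType: realtime` is the MCO knob the test exercises:

```bash
# Repeatedly switch the worker pool to the realtime kernel and back to the default kernel.
for i in $(seq 1 10); do
    cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-realtime
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelType: realtime
EOF
    # Wait for the rollout to start and then finish before flipping back.
    oc wait mcp/worker --for=condition=Updating --timeout=10m
    oc wait mcp/worker --for=condition=Updated --timeout=60m

    oc delete machineconfig 99-worker-realtime
    oc wait mcp/worker --for=condition=Updating --timeout=10m
    oc wait mcp/worker --for=condition=Updated --timeout=60m
done
```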