On today's Fedora-IoT 38 compose, greenboot-grub2-set-counter.service fails to start with this error: Aug 27 02:32:02 localhost greenboot-grub2-set-counter[605]: grub2-editenv: error: cannot open `/boot/grub2/grubenv.new': No such file or directory. This is a violation of "All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present" and "The Greenboot service must be present on all images and installed by default when using the ISO installer. The service must be enabled to run on boot and function as intended", so proposing as a Final blocker. It *may* also be interfering with the iot_rpmostree_rebase test which is also failing, so it may be a Beta blocker. Not sure about that.
Yeah, the iot_rpmostree_rebase test failed again on retry and it seems to be failing because the rebase it attempts to do gets blocked/nerfed/superseded by one triggered by the greenboot failure. The test runs a rebase to 'stable' then boots and checks that the output of `rpm-ostree status -b` includes "stable", but it actually gets this: State: busy Transaction: rollback Initiator: client(id:cli dbus:1.13 unit:greenboot-rpm-ostree-grub2-check-fallback.service uid:0) so it seems like the manual rebase the test tries to do is sorta colliding with an automatic one triggered by the greenboot service failure, I think. I'm not sure if we should strictly count this as a failure of the "It must be possible to rebase to a different IoT release by using the rpm-ostree rebase command" criterion though, because I don't think rollbacks are *broken* exactly, just the test is hitting an unexpected scenario that causes it to fail.
This does also affect F37, as expected: https://openqa.fedoraproject.org/tests/1403294
Discussed during the 2022-08-29 blocker review meeting: [0] The decision to classify this bug as both an "AcceptedBlocker (Final)" and an "AcceptedFreezeException (Final)" was made as this is a clear violation of IoT criterion "The Greenboot service must be present on all images and installed by default when using the ISO installer. The service must be enabled to run on boot and function as intended" and the generic "all services must start successfully" criterion. It is also accepted as a Beta FE as it's a significant issue we would like to be fixed. [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-08-29/f37-blocker-review.2022-08-29-16.01.txt
Edit to comment 3: "AcceptedFreezeException (Beta)"
Ping? This is an F37 Final blocker, it needs addressing. Peter, can you do anything to move this along?
(In reply to Adam Williamson from comment #5) > Ping? This is an F37 Final blocker, it needs addressing. Peter, can you do > anything to move this along? Are we sure this is greenboot and not grub2, there's been a lot of changes in grub2 of late and I suspect that might be more likely.
Unless there was a change to mount /boot ro in Fedora IoT, I'd also think this is a grub2 issue. The service that failed here has not had any changes to it in 1 year. It calls `grub2-editenv list` and conditionally `grub2-editenv - unset boot_counter`. I wonder why grub-editenv is looking at `/boot/grub2/grubenv.new', though. IIRC that path used to be `/boot/grub2/grubenv'. Has anything relating to that changed in grub2 recently?
I was looking at the wrong service, the set-counter one actually does: grub2-editenv - set boot_counter=3 grub2-editenv - set boot_success=0 but that service also hasn't changed in 10 months. So this still points to grub2.
"Unless there was a change to mount /boot ro in Fedora IoT" Uh...we *did* do that, didn't we? Isn't that the source of all the other fallout we've been dealing with too?
I think it's just `/sysroot` that got changed to be read-only: https://fedoraproject.org/wiki/Changes/Silverblue_Kinoite_readonly_sysroot
Christian: the code that makes it look for (actually, create) `grubenv.new` is `grub_util_create_envblk_file`: https://github.com/rhboot/grub2/blob/fedora-38/util/editenv.c#L39 . namenew = xasprintf ("%s.new", name); fp = grub_util_fopen (namenew, "wb"); if (! fp) grub_util_error (_("cannot open `%s': %s"), namenew, strerror (errno)); that's where we're failing. (Note: `grub_util_fopen` is really just `fopen`.) That code isn't new, it's been around for at least 9 years according to the git history. That git repo is where RH maintains our changes to grub2, but this code isn't part of our downstream changes, it's part of upstream grub2, and AFAICT it's been there forever. So it looks a lot like doing `fopen (namenew, "wb");` used to work, but now fails. I've no idea why.
hmm, well, it could be something else. An interesting question I guess is how do we wind up in `grub_util_create_envblk_file`. I think we must be getting there from here: https://github.com/rhboot/grub2/blob/fedora-38/util/grub-editenv.c#L135-L139 i.e. first grub2-editenv tries to open the existing /boot/grub2/grubenv for reading, and if that, it tries to create it (which involves creating /boot/grub2/grubenv.new and then renaming it to /boot/grub2/grubenv). So another question is, why is trying to open the existing file for reading failing... meh, I guess I need to grab an IoT image and poke around at this a bit myself.
So I reproduced this locally, and I think I see what's going on. I think the service is actually racing with the mount of `/boot`. The service has almost no restrictions on ordering: it has `DefaultDependencies=no` and no `Requires=` or `After=`. Its only ordering directive is `Before=ostree-finalize-staged.service`. So systemd could theoretically order it before almost anything else. If I restart the service after the system boots, and /boot is mounted, it works fine. Finally, the smoking gun: in the system journal, the failure of greenboot-grub2-set-counter comes just before the line "Mounted boot.mount - /boot." (I'd paste the log extract here, but IoT doesn't have fpaste and it's awkward getting it out any other way). So I think the fix here is to ensure greenboot-grub2-set-counter.service gets ordered after /boot is mounted. I think the appropriate way to do this would be to set `RequiresMountsFor=/boot`. And indeed if I copy the unit to `/etc/systemd/system` and add that line and reboot, the service starts successfully. I guess I'll send a PR somewhere.
FEDORA-2022-627769ab98 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-627769ab98
FEDORA-2022-627769ab98 has been pushed to the Fedora 37 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-627769ab98` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-627769ab98 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
Adam, thank you for your thorough investigation and filing the fix!
FEDORA-2022-627769ab98 has been pushed to the Fedora 37 stable repository. If problem still persists, please make note of it in this bug report.