Bug 2121944 - greenboot-grub2-set-counter.service fails on 38 IoT with "cannot open `/boot/grub2/grubenv.new`: No such file or directory."
Summary: greenboot-grub2-set-counter.service fails on 38 IoT with "cannot open `/boot/...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: greenboot
Version: rawhide
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Sayan
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa AcceptedBlocker AcceptedFreeze...
Depends On:
Blocks: F37BetaFreezeException F37FinalBlocker
TreeView+ depends on / blocked
 
Reported: 2022-08-27 16:57 UTC by Adam Williamson
Modified: 2022-09-28 11:16 UTC (History)
11 users (show)

Fixed In Version: greenboot-0.15.2-2.fc37
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-28 11:16:38 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github fedora-iot greenboot pull 84 0 None open greenboot-grub2-set-counter.service: ensure /boot is mounted 2022-09-22 21:02:46 UTC

Description Adam Williamson 2022-08-27 16:57:28 UTC
On today's Fedora-IoT 38 compose, greenboot-grub2-set-counter.service fails to start with this error:

Aug 27 02:32:02 localhost greenboot-grub2-set-counter[605]: grub2-editenv: error: cannot open `/boot/grub2/grubenv.new': No such file or directory.

This is a violation of "All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present" and "The Greenboot service must be present on all images and installed by default when using the ISO installer. The service must be enabled to run on boot and function as intended", so proposing as a Final blocker.

It *may* also be interfering with the iot_rpmostree_rebase test which is also failing, so it may be a Beta blocker. Not sure about that.

Comment 1 Adam Williamson 2022-08-27 17:04:23 UTC
Yeah, the iot_rpmostree_rebase test failed again on retry and it seems to be failing because the rebase it attempts to do gets blocked/nerfed/superseded by one triggered by the greenboot failure. The test runs a rebase to 'stable' then boots and checks that the output of `rpm-ostree status -b` includes "stable", but it actually gets this:

State: busy
Transaction: rollback
  Initiator: client(id:cli dbus:1.13 unit:greenboot-rpm-ostree-grub2-check-fallback.service uid:0)

so it seems like the manual rebase the test tries to do is sorta colliding with an automatic one triggered by the greenboot service failure, I think. I'm not sure if we should strictly count this as a failure of the "It must be possible to rebase to a different IoT release by using the rpm-ostree rebase command" criterion though, because I don't think rollbacks are *broken* exactly, just the test is hitting an unexpected scenario that causes it to fail.

Comment 2 Adam Williamson 2022-08-29 17:04:11 UTC
This does also affect F37, as expected:

https://openqa.fedoraproject.org/tests/1403294

Comment 3 Geoffrey Marr 2022-08-29 21:37:43 UTC
Discussed during the 2022-08-29 blocker review meeting: [0]

The decision to classify this bug as both an "AcceptedBlocker (Final)" and an "AcceptedFreezeException (Final)" was made as this is a clear violation of IoT criterion "The Greenboot service must be present on all images and installed by default when using the ISO installer. The service must be enabled to run on boot and function as intended" and the generic "all services must start successfully" criterion. It is also accepted as a Beta FE as it's a significant issue we would like to be fixed.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-08-29/f37-blocker-review.2022-08-29-16.01.txt

Comment 4 Geoffrey Marr 2022-08-29 21:39:45 UTC
Edit to comment 3: "AcceptedFreezeException (Beta)"

Comment 5 Adam Williamson 2022-09-19 19:18:23 UTC
Ping? This is an F37 Final blocker, it needs addressing. Peter, can you do anything to move this along?

Comment 6 Peter Robinson 2022-09-20 11:02:15 UTC
(In reply to Adam Williamson from comment #5)
> Ping? This is an F37 Final blocker, it needs addressing. Peter, can you do
> anything to move this along?

Are we sure this is greenboot and not grub2, there's been a lot of changes in grub2 of late and I suspect that might be more likely.

Comment 7 Christian Glombek 2022-09-20 12:24:03 UTC
Unless there was a change to mount /boot ro in Fedora IoT, I'd also think this is a grub2 issue.

The service that failed here has not had any changes to it in 1 year.
It calls `grub2-editenv list` and conditionally `grub2-editenv - unset boot_counter`.

I wonder why grub-editenv is looking at `/boot/grub2/grubenv.new', though. IIRC that path used to be `/boot/grub2/grubenv'.
Has anything relating to that changed in grub2 recently?

Comment 8 Christian Glombek 2022-09-20 12:30:15 UTC
I was looking at the wrong service, the set-counter one actually does:

grub2-editenv - set boot_counter=3
grub2-editenv - set boot_success=0

but that service also hasn't changed in 10 months. So this still points to grub2.

Comment 9 Adam Williamson 2022-09-20 19:21:29 UTC
"Unless there was a change to mount /boot ro in Fedora IoT"

Uh...we *did* do that, didn't we? Isn't that the source of all the other fallout we've been dealing with too?

Comment 10 Ben Cotton 2022-09-22 16:48:20 UTC
I think it's just `/sysroot` that got changed to be read-only: https://fedoraproject.org/wiki/Changes/Silverblue_Kinoite_readonly_sysroot

Comment 11 Adam Williamson 2022-09-22 19:25:51 UTC
Christian: the code that makes it look for (actually, create) `grubenv.new` is `grub_util_create_envblk_file`: https://github.com/rhboot/grub2/blob/fedora-38/util/editenv.c#L39 .

      namenew = xasprintf ("%s.new", name);
      fp = grub_util_fopen (namenew, "wb");
      if (! fp)
        grub_util_error (_("cannot open `%s': %s"), namenew,
    		     strerror (errno));

that's where we're failing. (Note: `grub_util_fopen` is really just `fopen`.) That code isn't new, it's been around for at least 9 years according to the git history. That git repo is where RH maintains our changes to grub2, but this code isn't part of our downstream changes, it's part of upstream grub2, and AFAICT it's been there forever.

So it looks a lot like doing `fopen (namenew, "wb");` used to work, but now fails. I've no idea why.

Comment 12 Adam Williamson 2022-09-22 19:31:24 UTC
hmm, well, it could be something else.

An interesting question I guess is how do we wind up in `grub_util_create_envblk_file`. I think we must be getting there from here:

https://github.com/rhboot/grub2/blob/fedora-38/util/grub-editenv.c#L135-L139

i.e. first grub2-editenv tries to open the existing /boot/grub2/grubenv for reading, and if that, it tries to create it (which involves creating /boot/grub2/grubenv.new and then renaming it to /boot/grub2/grubenv). So another question is, why is trying to open the existing file for reading failing...

meh, I guess I need to grab an IoT image and poke around at this a bit myself.

Comment 13 Adam Williamson 2022-09-22 20:58:00 UTC
So I reproduced this locally, and I think I see what's going on.

I think the service is actually racing with the mount of `/boot`. The service has almost no restrictions on ordering: it has `DefaultDependencies=no` and no `Requires=` or `After=`. Its only ordering directive is `Before=ostree-finalize-staged.service`. So systemd could theoretically order it before almost anything else.

If I restart the service after the system boots, and /boot is mounted, it works fine.

Finally, the smoking gun: in the system journal, the failure of greenboot-grub2-set-counter comes just before the line "Mounted boot.mount - /boot." (I'd paste the log extract here, but IoT doesn't have fpaste and it's awkward getting it out any other way).

So I think the fix here is to ensure greenboot-grub2-set-counter.service gets ordered after /boot is mounted. I think the appropriate way to do this would be to set `RequiresMountsFor=/boot`. And indeed if I copy the unit to `/etc/systemd/system` and add that line and reboot, the service starts successfully. I guess I'll send a PR somewhere.

Comment 14 Fedora Update System 2022-09-23 23:41:44 UTC
FEDORA-2022-627769ab98 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-627769ab98

Comment 15 Fedora Update System 2022-09-24 03:00:10 UTC
FEDORA-2022-627769ab98 has been pushed to the Fedora 37 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2022-627769ab98`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-627769ab98

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 16 Christian Glombek 2022-09-26 15:27:26 UTC
Adam, thank you for your thorough investigation and filing the fix!

Comment 17 Fedora Update System 2022-09-28 11:16:38 UTC
FEDORA-2022-627769ab98 has been pushed to the Fedora 37 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.