Bug 1959861

Summary: redboot-auto-reboot: SYSTEM is UNHEALTHY, but boot_counter is unset in grubenv
Product: [Fedora] Fedora Reporter: Paul Whalen <pwhalen>
Component: greenbootAssignee: Christian Glombek <cglombek>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: awilliam, cglombek, jnoguera, pbrobinson, robatino
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: openqa
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-16 16:20:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1891953    
Bug Blocks: 1269538    

Description Paul Whalen 2021-05-12 13:39:56 UTC
Description of problem:

redboot-auto-reboot.service fails to start in rawhide 

× redboot-auto-reboot.service - Reboot on red boot status
     Loaded: loaded (/usr/lib/systemd/system/redboot-auto-reboot.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-05-12 09:27:54 EDT; 3min 12s ago
    Process: 754 ExecStart=/usr/libexec/greenboot/redboot-auto-reboot (code=exited, status=1/FAILURE)
   Main PID: 754 (code=exited, status=1/FAILURE)
        CPU: 95ms

May 12 09:27:54 fedora systemd[1]: Starting Reboot on red boot status...
May 12 09:27:54 fedora redboot-auto-reboot[754]: SYSTEM is UNHEALTHY, but boot_counter is unset in grubenv. Manual intervention nece>
May 12 09:27:54 fedora systemd[1]: redboot-auto-reboot.service: Main process exited, code=exited, status=1/FAILURE
May 12 09:27:54 fedora systemd[1]: redboot-auto-reboot.service: Failed with result 'exit-code'.
May 12 09:27:55 fedora systemd[1]: Failed to start Reboot on red boot status.

[root@fedora ~]# cat /boot/grub2/grubenv
# GRUB Environment Block
# WARNING: Do not edit this file by tools other than grub-editenv!!!
boot_success=1
#####################

Version-Release number of selected component (if applicable):
greenboot-0.11-3.fc34.aarch64
greenboot-grub2-0.11-3.fc34.aarch64
greenboot-rpm-ostree-grub2-0.11-3.fc34.aarch64
greenboot-reboot-0.11-3.fc34.aarch64
greenboot-status-0.11-3.fc34.aarch64


Additional info:

Failure in OpenQA:

https://openqa.fedoraproject.org/tests/883789#step/base_services_start/13

Comment 1 Paul Whalen 2021-05-12 13:41:31 UTC
This is only happening in Rawhide, F34 is working as expected with the same version - https://openqa.fedoraproject.org/tests/884861#step/base_services_start/7

Comment 2 Christian Glombek 2021-05-12 16:48:37 UTC
What's the status of `greenboot-rpm-ostree-grub2-check-fallback.service`? (ref: https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/redboot-auto-reboot#L12-L16)

Comment 3 Paul Whalen 2021-05-17 15:00:35 UTC
[root@fedora ~]# systemctl --all --failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION              
● redboot-auto-reboot.service loaded failed failed Reboot on red boot status

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.

[root@fedora ~]# systemctl status greenboot-rpm-ostree-grub2-check-fallback.service
● greenboot-rpm-ostree-grub2-check-fallback.service - Check for fallback boot
     Loaded: loaded (/usr/lib/systemd/system/greenboot-rpm-ostree-grub2-check-fallback.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2021-05-17 09:52:13 CDT; 6min ago
    Process: 703 ExecStart=/usr/libexec/greenboot/greenboot-rpm-ostree-grub2-check-fallback (code=exited, status=0/SUCCESS)
   Main PID: 703 (code=exited, status=0/SUCCESS)
        CPU: 21ms

May 17 09:52:13 fedora systemd[1]: Starting Check for fallback boot...
May 17 09:52:13 fedora systemd[1]: Finished Check for fallback boot.

Comment 4 Christian Glombek 2021-05-17 16:06:36 UTC
I looked into this some more and to me it seems like systemd is seriously misbehaving here.

Both the redboot-task-runner.service and redboot-auto-reboot.service should only be triggered when redboot.target is activated (which happens OnFailure of greenboot-healthcheck.service),
however they seem to have been triggered somehow anyway, even though the greenboot-healthcheck.service ran successfully, and redboot.target was never activated:

[lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u greenboot-healthcheck.service
-- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. --
May 10 22:11:41 fedora systemd[1]: Starting greenboot Health Checks Runner...
May 10 22:11:41 fedora greenboot[672]: Running Required Health Check Scripts...
May 10 22:11:41 fedora 00_required_scripts_start.sh[684]: Running greenboot Required Health Check Scripts
May 10 22:11:41 fedora greenboot[672]: Script '00_required_scripts_start.sh' SUCCESS
May 10 22:11:41 fedora greenboot[672]: Running Wanted Health Check Scripts...
May 10 22:11:41 fedora 00_wanted_scripts_start.sh[693]: Running greenboot Wanted Health Check Scripts
May 10 22:11:41 fedora greenboot[672]: Script '00_wanted_scripts_start.sh' SUCCESS
May 10 22:11:41 fedora systemd[1]: Finished greenboot Health Checks Runner.

[lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u boot-complete.target
-- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. --
May 10 22:11:41 fedora systemd[1]: Reached target Boot Completion Check.

[lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u redboot.target
-- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. --
-- No entries --

[lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u redboot-task-runner
-- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. --
May 10 22:11:41 fedora systemd[1]: Starting greenboot Failure Scripts Runner...
May 10 22:11:41 fedora greenboot[655]: Boot Status is RED - Health Check FAILURE!
May 10 22:11:41 fedora greenboot[655]: Running Red Scripts...
May 10 22:11:41 fedora systemd[1]: Finished greenboot Failure Scripts Runner.

[lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u redboot-auto-reboot
-- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. --
May 10 22:11:41 fedora systemd[1]: Starting Reboot on red boot status...
May 10 22:11:41 fedora redboot-auto-reboot[675]: SYSTEM is UNHEALTHY, but boot_counter is unset in grubenv. Manual intervention necessary.
May 10 22:11:41 fedora systemd[1]: redboot-auto-reboot.service: Main process exited, code=exited, status=1/FAILURE
May 10 22:11:41 fedora systemd[1]: redboot-auto-reboot.service: Failed with result 'exit-code'.
May 10 22:11:41 fedora systemd[1]: Failed to start Reboot on red boot status.

Comment 5 Christian Glombek 2021-05-20 18:33:27 UTC
@pwhalen could you run `systemctl list-dependencies` on an affected machine. I'd like to figure out why/by what the two services (redboot-task-runner and redboot-auto-rebootwere) were started.

Comment 6 Paul Whalen 2021-05-25 14:49:53 UTC
[root@fedora ~]# systemctl list-dependencies
default.target
● ├─auditd.service
● ├─chronyd.service
● ├─dbus-parsec.service
● ├─firewalld.service
○ ├─greenboot-grub2-set-counter.service
● ├─greenboot-grub2-set-success.service
● ├─greenboot-healthcheck.service
● ├─greenboot-rpm-ostree-grub2-check-fallback.service
● ├─greenboot-status.service
● ├─greenboot-task-runner.service
● ├─ModemManager.service
● ├─NetworkManager.service
● ├─ostree-finalize-staged.path
● ├─ostree-remount.service
● ├─parsec.service
× ├─redboot-auto-reboot.service
● ├─redboot-task-runner.service
● ├─rngd.service
● ├─sshd.service
● ├─systemd-ask-password-wall.path
● ├─systemd-homed.service
● ├─systemd-logind.service
● ├─systemd-oomd.service
● ├─systemd-resolved.service
○ ├─systemd-update-utmp-runlevel.service
● ├─systemd-user-sessions.service
● ├─zezere_ignition.timer
○ ├─zezere_ignition_banner.service
● ├─basic.target
..

Comment 7 Adam Williamson 2021-05-31 21:21:51 UTC
Christian: you filed a bug and PR in 2019 requesting that it be preset to enabled on IoT:

https://bugzilla.redhat.com/show_bug.cgi?id=1733799
https://src.fedoraproject.org/rpms/fedora-release/pull-request/81

The PR was merged. Is that not why it's starting?

Comment 8 Adam Williamson 2021-05-31 21:29:56 UTC
BTW, now I look into this, i think the root issue is that redboot-auto-reboot *always* detects the system as unhealthy for some reason. I think this specific scenario is how that plays it when it decides it's unhealthy on the *first* boot. In another failing test, we reboot after the first boot; in that case, redboot again decides the system is unhealthy, but the result is that it bootloops the system like three times. On the final loop, some mechanism kicks in and selects a different boot menu entry (an older snapshot, I think), and redboot logs:

May 29 08:59:00 fedora redboot-auto-reboot[723]: FALLBACK BOOT DETECTED but SYSTEM is still UNHEALTHY. Manual intervention necessary.

and then stops the reboot loop.

Comment 9 Adam Williamson 2021-05-31 21:32:41 UTC
The other failing test is an ostree overlay test, https://openqa.fedoraproject.org/tests/899155 - it ultimately fails because it tries to install a package via rpm overlay and when it checks whether the package is present, it isn't. But the *reason* that check fails, I think, is because of redboot forcing the boot to the 'fallback' snapshot which is presumably from before the overlay package installation happening. If redboot would just get out of the way and let the first reboot succeed, we'd be in the current snapshot, and that check would probably pass.

Comment 10 Adam Williamson 2021-05-31 21:37:24 UTC
Proposing as a Beta blocker as a violation of "It must be possible to install additional software with the rpm-ostree install command. Software installation must also include dependencies where necessary and installed software should provide the intended functionality." - https://fedoraproject.org/wiki/Basic_Release_Criteria#rpm-ostree_requirements . This effectively makes that impossible. Also note Final criterion "All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present."

Comment 11 Christian Glombek 2021-06-01 11:11:20 UTC
@awilliam: Yes the service is enabled, but it is still not expected to start in this case, because it shouldn't be pulled in by default.target/multi-user.target, but instead by redboot.target only (see https://github.com/fedora-iot/greenboot/blob/main/usr/lib/systemd/system/redboot-auto-reboot.service).

This works well up until F35, and only seems to be breaking in Rawhide.
Therefore, I suspect this is either a systemd issue, or there are wrong symlinks in place linking both redboot-auto-reboot.service and redboot-task-runner.service with multi-user.target.

Comment 12 Paul Whalen 2021-06-01 13:14:26 UTC

(In reply to Adam Williamson from comment #9)
> The other failing test is an ostree overlay test,
> https://openqa.fedoraproject.org/tests/899155 - it ultimately fails because
> it tries to install a package via rpm overlay and when it checks whether the
> package is present, it isn't. But the *reason* that check fails, I think, is
> because of redboot forcing the boot to the 'fallback' snapshot which is
> presumably from before the overlay package installation happening. If
> redboot would just get out of the way and let the first reboot succeed, we'd
> be in the current snapshot, and that check would probably pass.

Package layering works as expected when redboot is disabled.

Comment 13 Jose 2021-06-11 09:05:07 UTC
Issue has been found. It was that redboot.target, redboot-task-runner.service and redboot-auto-reboot.service were included in multi-user.target.wants and therefore called at boot. Deleted the symlinks to redboot from /etc/systemd/system/multi-user.target.wants/ and boot without failing. Tried to force a fallback reboot and it worked perfectly.

@cglombek submitted a PR for fixing this: https://pagure.io/fedora-iot/ostree/pull-request/3

Comment 14 Paul Whalen 2021-06-16 16:20:41 UTC
Verified fixed, closing.