Bug 1959861
Summary: | redboot-auto-reboot: SYSTEM is UNHEALTHY, but boot_counter is unset in grubenv | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Paul Whalen <pwhalen> |
Component: | greenboot | Assignee: | Christian Glombek <cglombek> |
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | rawhide | CC: | awilliam, cglombek, jnoguera, pbrobinson, robatino |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | openqa | ||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-06-16 16:20:41 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1891953 | ||
Bug Blocks: | 1269538 |
Description
Paul Whalen
2021-05-12 13:39:56 UTC
This is only happening in Rawhide, F34 is working as expected with the same version - https://openqa.fedoraproject.org/tests/884861#step/base_services_start/7 What's the status of `greenboot-rpm-ostree-grub2-check-fallback.service`? (ref: https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/redboot-auto-reboot#L12-L16) [root@fedora ~]# systemctl --all --failed UNIT LOAD ACTIVE SUB DESCRIPTION ● redboot-auto-reboot.service loaded failed failed Reboot on red boot status LOAD = Reflects whether the unit definition was properly loaded. ACTIVE = The high-level unit activation state, i.e. generalization of SUB. SUB = The low-level unit activation state, values depend on unit type. 1 loaded units listed. To show all installed unit files use 'systemctl list-unit-files'. [root@fedora ~]# systemctl status greenboot-rpm-ostree-grub2-check-fallback.service ● greenboot-rpm-ostree-grub2-check-fallback.service - Check for fallback boot Loaded: loaded (/usr/lib/systemd/system/greenboot-rpm-ostree-grub2-check-fallback.service; enabled; vendor preset: enabled) Active: active (exited) since Mon 2021-05-17 09:52:13 CDT; 6min ago Process: 703 ExecStart=/usr/libexec/greenboot/greenboot-rpm-ostree-grub2-check-fallback (code=exited, status=0/SUCCESS) Main PID: 703 (code=exited, status=0/SUCCESS) CPU: 21ms May 17 09:52:13 fedora systemd[1]: Starting Check for fallback boot... May 17 09:52:13 fedora systemd[1]: Finished Check for fallback boot. I looked into this some more and to me it seems like systemd is seriously misbehaving here. Both the redboot-task-runner.service and redboot-auto-reboot.service should only be triggered when redboot.target is activated (which happens OnFailure of greenboot-healthcheck.service), however they seem to have been triggered somehow anyway, even though the greenboot-healthcheck.service ran successfully, and redboot.target was never activated: [lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u greenboot-healthcheck.service -- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. -- May 10 22:11:41 fedora systemd[1]: Starting greenboot Health Checks Runner... May 10 22:11:41 fedora greenboot[672]: Running Required Health Check Scripts... May 10 22:11:41 fedora 00_required_scripts_start.sh[684]: Running greenboot Required Health Check Scripts May 10 22:11:41 fedora greenboot[672]: Script '00_required_scripts_start.sh' SUCCESS May 10 22:11:41 fedora greenboot[672]: Running Wanted Health Check Scripts... May 10 22:11:41 fedora 00_wanted_scripts_start.sh[693]: Running greenboot Wanted Health Check Scripts May 10 22:11:41 fedora greenboot[672]: Script '00_wanted_scripts_start.sh' SUCCESS May 10 22:11:41 fedora systemd[1]: Finished greenboot Health Checks Runner. [lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u boot-complete.target -- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. -- May 10 22:11:41 fedora systemd[1]: Reached target Boot Completion Check. [lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u redboot.target -- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. -- -- No entries -- [lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u redboot-task-runner -- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. -- May 10 22:11:41 fedora systemd[1]: Starting greenboot Failure Scripts Runner... May 10 22:11:41 fedora greenboot[655]: Boot Status is RED - Health Check FAILURE! May 10 22:11:41 fedora greenboot[655]: Running Red Scripts... May 10 22:11:41 fedora systemd[1]: Finished greenboot Failure Scripts Runner. [lorbus@lenovo ~]$ journalctl --file /home/lorbus/Downloads/base_services_start-var_log/var/log/journal/ddea5a7b2df740dd9aa380b9befce9f3/system.journal -u redboot-auto-reboot -- Journal begins at Mon 2021-05-10 22:11:33 CEST, ends at Mon 2021-05-10 22:13:09 CEST. -- May 10 22:11:41 fedora systemd[1]: Starting Reboot on red boot status... May 10 22:11:41 fedora redboot-auto-reboot[675]: SYSTEM is UNHEALTHY, but boot_counter is unset in grubenv. Manual intervention necessary. May 10 22:11:41 fedora systemd[1]: redboot-auto-reboot.service: Main process exited, code=exited, status=1/FAILURE May 10 22:11:41 fedora systemd[1]: redboot-auto-reboot.service: Failed with result 'exit-code'. May 10 22:11:41 fedora systemd[1]: Failed to start Reboot on red boot status. @pwhalen could you run `systemctl list-dependencies` on an affected machine. I'd like to figure out why/by what the two services (redboot-task-runner and redboot-auto-rebootwere) were started. [root@fedora ~]# systemctl list-dependencies default.target ● ├─auditd.service ● ├─chronyd.service ● ├─dbus-parsec.service ● ├─firewalld.service ○ ├─greenboot-grub2-set-counter.service ● ├─greenboot-grub2-set-success.service ● ├─greenboot-healthcheck.service ● ├─greenboot-rpm-ostree-grub2-check-fallback.service ● ├─greenboot-status.service ● ├─greenboot-task-runner.service ● ├─ModemManager.service ● ├─NetworkManager.service ● ├─ostree-finalize-staged.path ● ├─ostree-remount.service ● ├─parsec.service × ├─redboot-auto-reboot.service ● ├─redboot-task-runner.service ● ├─rngd.service ● ├─sshd.service ● ├─systemd-ask-password-wall.path ● ├─systemd-homed.service ● ├─systemd-logind.service ● ├─systemd-oomd.service ● ├─systemd-resolved.service ○ ├─systemd-update-utmp-runlevel.service ● ├─systemd-user-sessions.service ● ├─zezere_ignition.timer ○ ├─zezere_ignition_banner.service ● ├─basic.target .. Christian: you filed a bug and PR in 2019 requesting that it be preset to enabled on IoT: https://bugzilla.redhat.com/show_bug.cgi?id=1733799 https://src.fedoraproject.org/rpms/fedora-release/pull-request/81 The PR was merged. Is that not why it's starting? BTW, now I look into this, i think the root issue is that redboot-auto-reboot *always* detects the system as unhealthy for some reason. I think this specific scenario is how that plays it when it decides it's unhealthy on the *first* boot. In another failing test, we reboot after the first boot; in that case, redboot again decides the system is unhealthy, but the result is that it bootloops the system like three times. On the final loop, some mechanism kicks in and selects a different boot menu entry (an older snapshot, I think), and redboot logs: May 29 08:59:00 fedora redboot-auto-reboot[723]: FALLBACK BOOT DETECTED but SYSTEM is still UNHEALTHY. Manual intervention necessary. and then stops the reboot loop. The other failing test is an ostree overlay test, https://openqa.fedoraproject.org/tests/899155 - it ultimately fails because it tries to install a package via rpm overlay and when it checks whether the package is present, it isn't. But the *reason* that check fails, I think, is because of redboot forcing the boot to the 'fallback' snapshot which is presumably from before the overlay package installation happening. If redboot would just get out of the way and let the first reboot succeed, we'd be in the current snapshot, and that check would probably pass. Proposing as a Beta blocker as a violation of "It must be possible to install additional software with the rpm-ostree install command. Software installation must also include dependencies where necessary and installed software should provide the intended functionality." - https://fedoraproject.org/wiki/Basic_Release_Criteria#rpm-ostree_requirements . This effectively makes that impossible. Also note Final criterion "All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present." @awilliam: Yes the service is enabled, but it is still not expected to start in this case, because it shouldn't be pulled in by default.target/multi-user.target, but instead by redboot.target only (see https://github.com/fedora-iot/greenboot/blob/main/usr/lib/systemd/system/redboot-auto-reboot.service). This works well up until F35, and only seems to be breaking in Rawhide. Therefore, I suspect this is either a systemd issue, or there are wrong symlinks in place linking both redboot-auto-reboot.service and redboot-task-runner.service with multi-user.target. (In reply to Adam Williamson from comment #9) > The other failing test is an ostree overlay test, > https://openqa.fedoraproject.org/tests/899155 - it ultimately fails because > it tries to install a package via rpm overlay and when it checks whether the > package is present, it isn't. But the *reason* that check fails, I think, is > because of redboot forcing the boot to the 'fallback' snapshot which is > presumably from before the overlay package installation happening. If > redboot would just get out of the way and let the first reboot succeed, we'd > be in the current snapshot, and that check would probably pass. Package layering works as expected when redboot is disabled. Issue has been found. It was that redboot.target, redboot-task-runner.service and redboot-auto-reboot.service were included in multi-user.target.wants and therefore called at boot. Deleted the symlinks to redboot from /etc/systemd/system/multi-user.target.wants/ and boot without failing. Tried to force a fallback reboot and it worked perfectly. @cglombek submitted a PR for fixing this: https://pagure.io/fedora-iot/ostree/pull-request/3 Verified fixed, closing. |