Bug 1803293
| Summary: | Plymouth runs again after it should have quit (bootsplash appears over tty) | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> | ||||||
| Component: | systemd | Assignee: | systemd-maint | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | rawhide | CC: | gmarr, lnykryn, msekleta, pbrobinson, ppywlkiqletw, robatino, ssahani, s, systemd-maint, zbyszek | ||||||
| Target Milestone: | --- | Keywords: | Reopened | ||||||
| Target Release: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | openqa AcceptedBlocker | ||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2020-02-26 16:37:59 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1705303 | ||||||||
| Attachments: |
|
||||||||
|
Description
Adam Williamson
2020-02-15 02:06:51 UTC
#1798956 might be a different manifestation of the same issue, at least plymouth running again would explain why agetty gets a chance to talk to it while it didn't before. Sorry, I'm on PTO and will not look into this before the 24th. Some observations: In the journal I see that plymouth-start.service is started "spontaniously" even if was properly stopped. I can kill the plymouth process and/or stop plymouth-start.service and later it will be started again some time later. I also noticed when runing dnf updates in multi-user mode (runlevel 3) plymouth will be started every time the dnf process completes. It also seems to happen when logging out from the desktop environment (xfce in my case). "systemctl mask plymouth-start.service" will also cure the problem as a work-around. After downgrading to systemd-244.1-2.fc32.x86_64 the problem went away. Proposing as a Beta blocker per Beta criterion "A system installed without a graphical package set must boot to a working login prompt without any unintended user intervention, and all virtual consoles intended to provide a working login prompt must do so." - https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Expected_installed_system_boot_behavior this definitely interferes with that. The installed system does not boot to a working login prompt, it boots to a useless splash screen. Discussed during the 2020-02-17 blocker review meeting: [0] The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criterion: "A system installed without a graphical package set must boot to a working login prompt without any unintended user intervention, and all virtual consoles intended to provide a working login prompt must do so." [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-02-17/f32-blocker-review.2020-02-17-17.02.txt Created attachment 1663815 [details]
systemd debug log extract
Here's an extract from the journal with systemd debug logging enabled. These are the messages from two key points on an affected boot: 08:14:39 is when I switch to tty2, and 08:14:43 is when I successfully login (as root) on tty2 and a session starts. The first plymouth-start.service (the correct, boot-time one) and plymouth-quit.service ran ~20 seconds earlier.
It seems like at both these points, we get this:
Feb 18 08:14:39 localhost.localdomain systemd[1]: Pulling in plymouth-start.service/start from sysinit.target/start
Next I'll do the same thing with systemd 244 to compare the logs.
Created attachment 1663828 [details]
systemd 244 debug log extract
Here's the equivalent extract with systemd 244.
Aha, so, this is interesting. The key difference between 244 and 245 is this: in 244, we see this line: Feb 18 08:28:57 localhost.localdomain systemd[1]: Found redundant job sysinit.target/start, dropping from transaction. which then, I believe, leads to *these* lines: Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-binfmt.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-boot-system-token.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-sysusers.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-machine-id-commit.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-hwdb-update.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job ldconfig.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job import-state.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job plymouth-read-write.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-update-done.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-journal-catalog-update.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-modules-load.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-firstboot.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job plymouth-start.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-vconsole-setup.service/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-ask-password-console.path/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job sys-fs-fuse-connections.mount/start Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job selinux-autorelabel-mark.service/start In systemd 245, sysinit.target/start is *not* recognized as a 'redundant job', and no jobs are 'garbage collected'. Given that systemd-vconsole-setup.service is also affected by this, I suspect that this and https://bugzilla.redhat.com/show_bug.cgi?id=1801272 may be the same bug after all. Heh. And just like that, a suspect emerges:
[adamw@adam systemd (master)]$ git log v244..v245-rc1 src/core/transaction.c
commit 097537f07a2fab3cb73aef7bc59f2a66aa93f533
Author: Kevin Kuehler <keur.edu>
Date: Tue Nov 19 13:43:58 2019 -0800
job: Don't mark as redundant if deps are relevant
In the steps given in #13850, the resulting graph looks like:
C (Anchor) -> B -> A
Since B is inactive, it will be flagged as redundant and removed from
the transaction, causing A to get garbage collected. The proposed fix is
to not mark nodes as redundant if doing so causes a relevant node to be
garbage collected.
Fixes #13850
So, we're looking at https://github.com/systemd/systemd/pull/14086 and https://github.com/systemd/systemd/issues/13850 here.
systemd-245~rc1-3.fc32 and systemd-245~rc1-3.fc33 should fix this. Confirmed fixed with the new systemd build, in Rawhide compose. 32 compose got stuck, but no reason to think the fix won't work there too. IIUC, the upstream patch is correct,and the issue is in the plymouth unit dependency tree and/or the plymouth binary should check if the tty is not already used and exit. So this is just a work-around, and not a real fix. I don't want to carry this patch more than strictly necessary. Zbigniew: well, it affected several other units as well as plymouth-start, at least one of which - systemd-vconsole-setup - seemed to have undesirable effects (discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1801272 ). Please do not restore the change without being *very* sure all possible consequences are dealt with. Also, can you please deal with this via another bug report? The bug reported here - plymouth running again after it should have quit, which is a blocker bug - *is* fixed. Your current issue is "systemd has a downstream reversion patch I want to get rid of", or alternatively, "PartOf= does not work on inactive units" (the bug the problematic patch was trying to fix). Neither of those is the same as this bug and neither of them is a release blocker. (In reply to Zbigniew Jędrzejewski-Szmek from comment #12) > IIUC, the upstream patch is correct,and the issue is in the plymouth unit > dependency tree > and/or the plymouth binary should check if the tty is not already used and > exit. > So this is just a work-around, and not a real fix. I don't want to carry > this patch > more than strictly necessary. How about making plymouth-start a oneshot remain-after-exit service? (In reply to Villy Kruse from comment #15) > (In reply to Zbigniew Jędrzejewski-Szmek from comment #12) > > IIUC, the upstream patch is correct,and the issue is in the plymouth unit > > dependency tree > > and/or the plymouth binary should check if the tty is not already used and > > exit. > > So this is just a work-around, and not a real fix. I don't want to carry > > this patch > > more than strictly necessary. > > How about making plymouth-start a oneshot remain-after-exit service? I submitted bug 1807771 for the plymouth package. |