Bug 1803293 - Plymouth runs again after it should have quit (bootsplash appears over tty)
Summary: Plymouth runs again after it should have quit (bootsplash appears over tty)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: rawhide
Hardware: All
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa AcceptedBlocker
Depends On:
Blocks: F32BetaBlocker
TreeView+ depends on / blocked
 
Reported: 2020-02-15 02:06 UTC by Adam Williamson
Modified: 2020-02-27 08:41 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-26 16:37:59 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
systemd debug log extract (492.46 KB, text/plain)
2020-02-18 16:22 UTC, Adam Williamson
no flags Details
systemd 244 debug log extract (330.63 KB, text/plain)
2020-02-18 16:35 UTC, Adam Williamson
no flags Details

Description Adam Williamson 2020-02-15 02:06:51 UTC
Today's openQA tests of Fedora 32 show a lot of failures all with the same basic cause: shortly after a tty starts up, plymouth seems to run and show the bootsplash over it. See e.g. this test:

https://openqa.stg.fedoraproject.org/tests/723133

if you look at the thumbnails for the 'server_firewall_default' test, you can see we initially get a tty:

https://openqa.stg.fedoraproject.org/tests/723133#step/server_firewall_default/1

 but then (while the test is still working through the login process), the bootsplash suddenly appears over it:

https://openqa.stg.fedoraproject.org/tests/723133#step/server_firewall_default/4

Looking in the system logs, we see this:

Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Started Getty on tty3.
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=getty@tty3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: systemd-vconsole-setup.service: Succeeded.
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Started Setup Virtual Console.
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Show Plymouth Boot Screen...
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Received SIGRTMIN+20 from PID 916 (plymouthd).
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Started Show Plymouth Boot Screen.

That doesn't look right. We shouldn't be running plymouth-start.service (that's 'Show Plymouth Boot Screen') *after* tty3 starts up. We should only see it run *earlier* during boot, then we should see "Terminate Plymouth Boot Screen" (plymouth-quit.service), then after that we shouldn't see it start up again. But in fact we get this:

Feb 14 17:32:38 localhost.localdomain systemd[1]: Starting Show Plymouth Boot Screen...
Feb 14 17:32:38 localhost.localdomain systemd[1]: Started Show Plymouth Boot Screen.
Feb 14 17:32:54 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Terminate Plymouth Boot Screen...
Feb 14 17:32:54 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Started Terminate Plymouth Boot Screen.
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Show Plymouth Boot Screen...
Feb 14 17:33:02 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Started Show Plymouth Boot Screen.

i.e. we get the expected early start, then we get the termination...then we see it run *again* (after tty3 starts up, as we can see from the context above).

I'm filing against systemd as plymouth itself has not changed since October, systemd seems the most likely culprit.

Comment 1 Zbigniew Jędrzejewski-Szmek 2020-02-15 06:47:31 UTC
#1798956 might be a different manifestation of the same issue, at least plymouth running
again would explain why agetty gets a chance to talk to it while it didn't before.

Sorry, I'm on PTO and will not look into this before the 24th.

Comment 2 Villy Kruse 2020-02-16 16:28:20 UTC
Some observations:

In the journal I see that plymouth-start.service is started "spontaniously" even if was properly stopped.
I can kill the plymouth process and/or stop plymouth-start.service  and later it will be started again some time later.

I also noticed when runing dnf updates in multi-user mode (runlevel 3) plymouth will be started every time the dnf process completes.

It also seems to happen when logging out from the desktop environment (xfce in my case).

"systemctl mask plymouth-start.service" will also cure the problem as a work-around.

After downgrading to systemd-244.1-2.fc32.x86_64 the problem went away.

Comment 3 Adam Williamson 2020-02-17 17:32:37 UTC
Proposing as a Beta blocker per Beta criterion "A system installed without a graphical package set must boot to a working login prompt without any unintended user intervention, and all virtual consoles intended to provide a working login prompt must do so." - https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Expected_installed_system_boot_behavior

this definitely interferes with that. The installed system does not boot to a working login prompt, it boots to a useless splash screen.

Comment 4 Geoffrey Marr 2020-02-17 19:39:18 UTC
Discussed during the 2020-02-17 blocker review meeting: [0]

The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criterion:

"A system installed without a graphical package set must boot to a working login prompt without any unintended user intervention, and all virtual consoles intended to provide a working login prompt must do so."

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-02-17/f32-blocker-review.2020-02-17-17.02.txt

Comment 5 Adam Williamson 2020-02-18 16:22:56 UTC
Created attachment 1663815 [details]
systemd debug log extract

Here's an extract from the journal with systemd debug logging enabled. These are the messages from two key points on an affected boot: 08:14:39 is when I switch to tty2, and 08:14:43 is when I successfully login (as root) on tty2 and a session starts. The first plymouth-start.service (the correct, boot-time one) and plymouth-quit.service ran ~20 seconds earlier.

It seems like at both these points, we get this:

Feb 18 08:14:39 localhost.localdomain systemd[1]: Pulling in plymouth-start.service/start from sysinit.target/start

Next I'll do the same thing with systemd 244 to compare the logs.

Comment 6 Adam Williamson 2020-02-18 16:35:31 UTC
Created attachment 1663828 [details]
systemd 244 debug log extract

Here's the equivalent extract with systemd 244.

Comment 7 Adam Williamson 2020-02-18 16:38:53 UTC
Aha, so, this is interesting. The key difference between 244 and 245 is this: in 244, we see this line:

Feb 18 08:28:57 localhost.localdomain systemd[1]: Found redundant job sysinit.target/start, dropping from transaction.

which then, I believe, leads to *these* lines:

Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-binfmt.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-boot-system-token.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-sysusers.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-machine-id-commit.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-hwdb-update.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job ldconfig.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job import-state.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job plymouth-read-write.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-update-done.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-journal-catalog-update.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-modules-load.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-firstboot.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Keeping job systemd-vconsole-setup.service/start because of plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job plymouth-start.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-vconsole-setup.service/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job systemd-ask-password-console.path/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job sys-fs-fuse-connections.mount/start
Feb 18 08:28:57 localhost.localdomain systemd[1]: Garbage collecting job selinux-autorelabel-mark.service/start

In systemd 245, sysinit.target/start is *not* recognized as a 'redundant job', and no jobs are 'garbage collected'.

Comment 8 Adam Williamson 2020-02-18 16:39:41 UTC
Given that systemd-vconsole-setup.service is also affected by this, I suspect that this and https://bugzilla.redhat.com/show_bug.cgi?id=1801272 may be the same bug after all.

Comment 9 Adam Williamson 2020-02-18 16:42:46 UTC
Heh. And just like that, a suspect emerges:

[adamw@adam systemd (master)]$ git log v244..v245-rc1 src/core/transaction.c
commit 097537f07a2fab3cb73aef7bc59f2a66aa93f533
Author: Kevin Kuehler <keur.edu>
Date:   Tue Nov 19 13:43:58 2019 -0800

    job: Don't mark as redundant if deps are relevant
    
    In the steps given in #13850, the resulting graph looks like:
    
        C (Anchor) -> B -> A
    
    Since B is inactive, it will be flagged as redundant and removed from
    the transaction, causing A to get garbage collected. The proposed fix is
    to not mark nodes as redundant if doing so causes a relevant node to be
    garbage collected.
    
    Fixes #13850

So, we're looking at https://github.com/systemd/systemd/pull/14086 and https://github.com/systemd/systemd/issues/13850 here.

Comment 10 Adam Williamson 2020-02-18 18:07:38 UTC
systemd-245~rc1-3.fc32 and systemd-245~rc1-3.fc33 should fix this.

Comment 11 Adam Williamson 2020-02-19 18:42:01 UTC
Confirmed fixed with the new systemd build, in Rawhide compose. 32 compose got stuck, but no reason to think the fix won't work there too.

Comment 12 Zbigniew Jędrzejewski-Szmek 2020-02-26 15:01:31 UTC
IIUC, the upstream patch is correct,and the issue is in the plymouth unit dependency tree
and/or the plymouth binary should check if the tty is not already used and exit.
So this is just a work-around, and not a real fix. I don't want to carry this patch
more than strictly necessary.

Comment 13 Adam Williamson 2020-02-26 16:36:04 UTC
Zbigniew: well, it affected several other units as well as plymouth-start, at least one of which - systemd-vconsole-setup - seemed to have undesirable effects (discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1801272 ). Please do not restore the change without being *very* sure all possible consequences are dealt with.

Comment 14 Adam Williamson 2020-02-26 16:37:44 UTC
Also, can you please deal with this via another bug report? The bug reported here - plymouth running again after it should have quit, which is a blocker bug - *is* fixed. Your current issue is "systemd has a downstream reversion patch I want to get rid of", or alternatively, "PartOf= does not work on inactive units" (the bug the problematic patch was trying to fix). Neither of those is the same as this bug and neither of them is a release blocker.

Comment 15 Villy Kruse 2020-02-27 08:33:11 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #12)
> IIUC, the upstream patch is correct,and the issue is in the plymouth unit
> dependency tree
> and/or the plymouth binary should check if the tty is not already used and
> exit.
> So this is just a work-around, and not a real fix. I don't want to carry
> this patch
> more than strictly necessary.

How about making plymouth-start a oneshot remain-after-exit service?

Comment 16 Villy Kruse 2020-02-27 08:41:27 UTC
(In reply to Villy Kruse from comment #15)
> (In reply to Zbigniew Jędrzejewski-Szmek from comment #12)
> > IIUC, the upstream patch is correct,and the issue is in the plymouth unit
> > dependency tree
> > and/or the plymouth binary should check if the tty is not already used and
> > exit.
> > So this is just a work-around, and not a real fix. I don't want to carry
> > this patch
> > more than strictly necessary.
> 
> How about making plymouth-start a oneshot remain-after-exit service?

I submitted bug 1807771 for the plymouth package.


Note You need to log in before you can comment on or make changes to this bug.