Bug 830482

Summary: Encrypted file systems not mounted in late part of boot because of systemd units missing from plymouth
Product: [Fedora] Fedora Reporter: Bruno Wolff III <bruno>
Component: plymouthAssignee: Ray Strode [halfline] <rstrode>
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: bruno, fedora, harald, johannbg, lnykryn, metherid, mschmidt, msekleta, notting, plautrba, robatino, rstrode, systemd-maint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: plymouth-0.8.6.1-1.fc18 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-07-27 12:35:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 752654    
Attachments:
Description Flags
This includes the initial failure plus stuff I did to get the system usable.
none
log from systemd-185-2 boot
none
dmesg output
none
dmesg output for plymouth-0.8.6.2-0.2012.07.23.fc18.i686
none
dmesg output in permissive mode none

Description Bruno Wolff III 2012-06-10 06:01:03 UTC
Description of problem:
With systemd-185-5.gita2368a3.fc18 and plymouth-0.8.5.1-1.fc18 the boot process stalls when reaching the step where file system devices get mounted. Neither /home nor swap get mounted. After a pause, a number of systemd targets and services fail because of dependency issues. I can get the system up on a vt by entering the root password and then manually starting the graphical target. I get prompted to enter the passwords for /home and swap. gdm never seems to start, but I can switch to a vt to make changes. In this case I downgraded to systemd-185-2.fc18 and plymouth-0.8.4-0.20120319.3.fc18. After doing that then the system booted normally again.

Note that both /home and swap are on encrypted software raid 1 devices.

Version-Release number of selected component (if applicable):


How reproducible:
It seems to happen every boot. I don't know if raid or encryption are needed to trigger the problem. Maybe having any non-root file system will do it.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Michal Schmidt 2012-06-12 19:50:53 UTC
(In reply to comment #0)
> After a pause, a number of systemd targets and services fail because of
> dependency issues.

Can you capture the failures? Maybe it's logged in /var/log/boot.log.
Even better if you could capture the debugging output of systemd by following the instructions here:
http://freedesktop.org/wiki/Software/systemd/Debugging#When_you_can_get_a_shell

Comment 2 Bruno Wolff III 2012-06-13 12:14:21 UTC
Created attachment 591458 [details]
This includes the initial failure plus stuff I did to get the system usable.

This includes the logs until after I had the system usable.

systemctl start home.mount failed after setting up the luks device. I noticed that it didn't set up a symlink like it did for the /boot and swap devices. I manually mounted /home from the luks device.

Stopping and starting the swap.target got swap up.

The problem logging in graphically was not being able to update /var/lib/gdm/.Iceauthority . Setting selinux in permissive mode fixed this. However I don't see this when using systemd 185-2, so it may be that some labelling got messed up when I manually started the graphical.target.

Comment 3 Bruno Wolff III 2012-06-13 12:18:37 UTC
In case anyone else wants to downgrade, I'm noting that you need to do this in yum shell as you need to both erase systemd-libs and downgrade plymouth* and systemd* as systemd-libs is a new subpackage that has file conflicts with the downgraded version of systemd. And of course you'll have needed to put the downgraded versions of the plymouth and systemd packages in a local repo.

Comment 4 Bruno Wolff III 2012-06-13 12:41:50 UTC
Created attachment 591474 [details]
log from systemd-185-2 boot

It looks like the gdm/selinux issue is new and not part of the systemd problem as I am now seeing that even with 185-2. I'll file a separate bug report for that.
I also noticed that I didn't get any vt's when I booted using 185-2. I had to use ssh to set selinux in permissive mode so that I could get logged in graphically.

Comment 5 Michal Schmidt 2012-06-15 00:07:47 UTC
Thanks for the logs. They show that we're missing the systemd-ask-password-plymouth.* units. We must have lost them in the move of the plymouth-related units from systemd to the plymouth package.

Comment 6 Bruno Wolff III 2012-06-19 04:15:12 UTC
Should this stay assigned to systemd or get reassigned to plymouth?

Comment 7 Michal Schmidt 2012-06-20 14:12:29 UTC
Reassigning to plymouth to have the systemd-ask-password-plymouth.* units added there.

Comment 8 Bruno Wolff III 2012-06-27 11:17:28 UTC
This is still broken with plymouth-0.8.5.1-3.fc18.i686.

Comment 9 Bruno Wolff III 2012-06-29 21:47:24 UTC
I filed this at freedesktop.org as well:
https://bugs.freedesktop.org/show_bug.cgi?id=51573

Comment 10 Bruno Wolff III 2012-07-10 10:55:35 UTC
Created attachment 597304 [details]
dmesg output

systemd is still not able to pass along device passwords after plymouth 0.8.6-1 was installed, which is supposed to have the missing units.

I captured some more information. After starting the normal boot, I manually tried to run home.mount and swap.target which unlocked the luks containers but did not finnish mounting home or enabling swap. At that point I captured dmesg output. I was able to manually mount home, turn on swap and finish by starting the graphical.target.

Comment 11 Ray Strode [halfline] 2012-07-10 14:39:47 UTC
I screwed up the 0.8.6 release, i'm doing a 0.8.6.1 now.

Comment 12 Ray Strode [halfline] 2012-07-10 21:05:41 UTC
Reopening, there are additional problems I found in someone's cube today.

Comment 13 Bruno Wolff III 2012-07-10 21:42:21 UTC
I tested plymouth-0.8.6.1-1.fc18 and dracut-020-72.git20120710.fc18 and it still isn't working. I also saw a bit different output when booting with enforcing=0 than without it, so there might be a couple of problems. I am short on time now, but I'll try to collect logs either tonight or tomorrow.

Comment 14 Bruno Wolff III 2012-07-11 12:21:43 UTC
One note I'd like to add is that if the person has an encrypted rootfs and a USB keyboard then they might run across bug 839075 (which prevents entering a password for the rootfs) making it hard to test this bug.

Comment 15 Bruno Wolff III 2012-07-20 15:22:28 UTC
Is this still a plymouth issue? Perhaps this needs to be assigned back to systemd?

Comment 16 Michal Schmidt 2012-07-20 15:37:25 UTC
There are still some parts missing from the transition of the units from the systemd package to plymouth. Frederic Crozat posted a patch for it today:
http://lists.freedesktop.org/archives/plymouth/2012-July/000660.html
Let's see if applying it helps.

Comment 17 Bruno Wolff III 2012-07-20 16:18:26 UTC
I'm going to try a rebuild of plymouth with that patch and see if it fixes the problem.

Comment 18 Bruno Wolff III 2012-07-20 16:59:59 UTC
I tried out the patch and encrypted home and swap are still not getting setup properly.

Comment 19 Bruno Wolff III 2012-07-24 04:04:52 UTC
I tried out plymouth-0.8.6.2-0.2012.07.23.fc18.i686 and it didn't help.

Comment 20 Ray Strode [halfline] 2012-07-24 22:03:38 UTC
Did you rebuild the initrd after installing it?

How is it failing exactly? Same way?

There's still one race condition I've noticed wrt to modesetting drivers loading that causes graphics corruption in some cases, but the main issues should be fixed with plymouth-0.8.6.2-0.2012.07.23.fc18.i686 as far as I know.

Comment 21 Bruno Wolff III 2012-07-25 01:39:18 UTC
Yes I rebuilt the initramfs. There have been some changes since I originally reported the issue, but I don't know that there due to plymouth. The systemd journal service doesn't seem to be always starting. Sometimes when I start home.mount I get asked for the swap device password first. The devices do get luksOpen'd when I do that, but the next step of mounting home or enabling the swap device timeout and then fail. I manually do that part before starting the graphical target. I'll get another log of the boot process.

Comment 22 Bruno Wolff III 2012-07-25 02:05:40 UTC
Created attachment 600202 [details]
dmesg output for plymouth-0.8.6.2-0.2012.07.23.fc18.i686

Comment 23 Harald Hoyer 2012-07-25 10:43:28 UTC
(In reply to comment #21)
> The systemd journal service doesn't seem to be always starting.

Oh! This should be fixed with systemd-187 and dracut-022. If not, can you please post a dmesg, where this is the case?

Comment 24 Bruno Wolff III 2012-07-25 12:31:34 UTC
I think what I saw was included in the last dmesg output I attached to this bug. I was using systemd 187-1 and a 022 dracut.
It looks like the journal gets restarted a few times and there is at least one selinux avc.

Comment 25 Michal Schmidt 2012-07-25 13:35:01 UTC
Anything like the denials here?: https://bugzilla.redhat.com/show_bug.cgi?id=842758#c7

Comment 26 Harald Hoyer 2012-07-25 13:37:10 UTC
[   42.925425] systemd-journald[652]: Unknown socket passed.

Comment 27 Bruno Wolff III 2012-07-25 14:41:10 UTC
As a reply to comment 25, yes I see similar avcs. Though I normally boot in enforcing mode.
I don't think that is likely to be the problem with mounting home or enabling swap though.
I have tried booting in permissive mode and still see pretty much the same thing.
There seems to be two issues. The password entered for the rootfs isn't used to setup the other luks containers (nor is there a prompt for them). And when the luks containers are setup using start home.mount or start swap.target, they aren't used by whatever is supposed to make use of them.

Comment 28 Ray Strode [halfline] 2012-07-25 20:59:09 UTC
This are still pretty hosed, looking at your log. From a plymouth point of view the first thing I see is this:

[    1.693627] systemd[1]: About to execute: /usr/bin/plymouth update-root-fs --read-write

The plymouth client is getting run to tell the plymouth daemon the root filesystem is writable, before the root file system is mounted and before the plymouth daemon is started.

The next thing I see is this:

[    1.709654] systemd[1]: About to execute: /usr/sbin/plymouthd --mode=boot --pid-file=/var/run/plymouth/pid --attach-to-session
[    1.718184] systemd[1]: Forked /usr/sbin/plymouthd as 59

So plymouthd is now started.  Next up:

[    1.727452] systemd[1]: About to execute: /bin/udevadm settle --timeout=30 --exit-if-exists=/sys/class/drm/card0/dev

Plymouth is running udevadm settle to block until the drm device is ready. After that, failure:

[    1.736585] systemd[1]: Child 55 died (code=exited, status=1/FAILURE)
[    1.736592] systemd[1]: Child 55 belongs to plymouth-read-write.service

That weird read-write service that got called way too early finally finished in failure.

Next up, some time later:

[    2.305590] systemd[1]: Got SIGCHLD for process 64 (udevadm)
[    2.305731] systemd[1]: Child 64 died (code=exited, status=0/SUCCESS)

drm device is loaded, now time to check if the fb device is loaded:

[    2.311191] systemd[1]: About to execute: /bin/udevadm settle --timeout=30 --exit-if-exists=/sys/class/graphics/fb0/dev
[    2.320128] systemd[1]: Forked /bin/udevadm as 162
[    2.324792] systemd[1]: Child 162 died (code=exited, status=0/SUCCESS)

Yup, now time to show the plymouth splash:

[    2.330218] systemd[1]: About to execute: /usr/bin/plymouth show-splash
[    2.341169] systemd[1]: Forked /usr/bin/plymouth as 165

So splash screen is about to be shown, but meanwhile:

[    2.362671] systemd[1]: Received SIGRTMIN+20 from PID 62 (plymouthd).
[    2.362689] systemd[1]: Enabling showing of status.

plymouthd told systemd to be verbose so when the user hits escape we have something to show them. And now:

[    2.364596] systemd[1]: Received SIGCHLD from PID 165 (plymouth).
[    2.364676] systemd[1]: Got SIGCHLD for process 165 (plymouth)

splash screen is shown.  What next?

[    2.369333] systemd[1]: systemd-ask-password-plymouth.path changed dead -> waiting
[    2.369354] systemd[1]: Job systemd-ask-password-plymouth.path/start finished, result=done
[    2.369389] systemd[1]: Starting of systemd-ask-password-console.path requested but condition failed. Ignoring.

The systemd agent thing that dispatches password requests to plymouth is started and the agent thing that dispatches directly to console when plymouth isn't around decided not to start since plymouth is around. Next:

[    2.611618] [drm] Initialized drm 1.1.0 20060810

err wait, the drm device just now initialized, so we showed the splash screen too soon.  It probably didn't show up. Our udevadm settle command from 1.72745 isn't working. Next:

[   11.154385] md: bind<sdb1>
[   11.171740] md: bind<sda1>
[   11.188344] md: raid1 personality registered for level 1

Some raid setup.

[   11.401167]  md13: unknown partition table
[   25.422971] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!

Now a mysterious 14 second pause in boot with no messages, followed by something getting mounted!

You have root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 on your kernel command line, so I assume that's what's getting mounted.  I guess that mysterious 14 seconds was you entering your password, but there's no record of you getting asked to enter your password in the log, so that's very strange.

Next:

[   25.833576] systemd[1]: About to execute: /bin/dracut-pre-pivot
[   25.846059] systemd[1]: Forked /bin/dracut-pre-pivot as 536

Okay, dracut thinks your root fs is mounted and is about to swivel to it.  In preparation to swiveling it's trying to remove your device-mapper device for your root fs:

[   26.345595] device-mapper: ioctl: unable to remove open device luks-9a976b86-8aaa-40d9-8039-89d710eac5c9

Anyway it finished its pre-privot service:

[   35.855918] systemd[1]: dracut-pre-pivot.service: main process exited, code=exited, status=0

Now, on to the slaying. It's going to kill everything it can before jumping to the root fs:

[   35.856099] systemd[1]: About to execute: /usr/bin/systemctl --no-block isolate initrd-switch-root.target

And next mention of plymouth:

[   35.884069] systemd[1]: Installed new job plymouth-switch-root.service/start as 36

That service needs to run before the switch root, and of course plymouth shouldn't get slayed. And:

[   35.884151] systemd[1]: Installed new job systemd-ask-password-plymouth.path/stop as 46

The agent that was started before to dispatch password requests to plymouth is set to get stopped now. Plymouth is told about the root switch:

[   35.885856] systemd[1]: About to execute: /usr/bin/plymouth update-root-fs --new-root-dir=/sysroot
[   35.895071] systemd[1]: Forked /usr/bin/plymouth as 563
[   35.932544] systemd[1]: Child 563 died (code=exited, status=0/SUCCESS)

So now we're sitting on your root fs, and then:

[   41.423723] systemd[1]: Looking at job plymouth-quit.service/stop conflicted_by=yes
[   41.423730] systemd[1]: Looking at job plymouth-quit.service/start conflicted_by=no
[   41.423735] systemd[1]: Fixing conflicting jobs by deleting job plymouth-quit.service/start
[   41.423745] systemd[1]: Looking at job prefdm.service/start conflicted_by=no
[   41.423750] systemd[1]: Looking at job prefdm.service/stop conflicted_by=no
[   41.423755] systemd[1]: Fixing conflicting jobs by deleting job prefdm.service/stop

We remove plymouth-quit.service since prefdm is going to handle quitting plymouth at the end of boot up.  We also have:

[   45.330743] systemd[1]: Starting of plymouth-start.service requested but condition failed. Ignoring.
[   45.330753] systemd[1]: Job plymouth-start.service/start finished, result=done
plymouth-start.service again incase you weren't running an initrd.  We ignore it since you were.

Also, again:

[   45.330928] systemd[1]: Starting of systemd-ask-password-console.path requested but condition failed. Ignoring.

We ignore the fallback password service, since we have plymouth, and readd the job to start the agent for dispatching passwords to plymouth:

[   45.333335] systemd[1]: systemd-ask-password-plymouth.path changed dead -> waiting
[   45.333368] systemd[1]: Job systemd-ask-password-plymouth.path/start finished, result=done

Now we run some fedora storage service thing, but it fails:

[   45.459537] systemd[1]: About to execute: /lib/systemd/fedora-storage-init
[   45.462106] systemd[758]: Failed at step STDOUT spawning /lib/systemd/fedora-storage-init: Connection refused

We do notice your swap device though:

[   45.707516] systemd[1]: Job dev-disk-by\x2duuid-987f98d1\x2d0144\x2d4bb4\x2da433\x2df7233784ca40.device/start finished, result=done

and then try to decrypt it:

[   45.707924] systemd[1]: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-987f98d1-0144-4bb4-a433-f7233784ca40 /dev/disk/by-uuid/987f98d1-0144-4bb4-a433-f7233784ca40 none

But fail before asking plymouth:

[   45.717056] systemd[774]: Failed at step STDOUT spawning /usr/lib/systemd/systemd-cryptsetup: Connection refused

And then we try to decrypt you're already decrypted root fs device:

[   45.766620] systemd[1]: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 /dev/disk/by-uuid/9a976b86-8aaa-40d9-8039-89d710eac5c9 none

and fail again:

[   45.769615] systemd[777]: Failed at step STDOUT spawning /usr/lib/systemd/systemd-cryptsetup: Connection refused

Now your home device:

[   46.248281] systemd[1]: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-451045dc-5ca4-49a7-9e2d-95a78e6fabca /dev/disk/by-uuid/451045dc-5ca4-49a7-9e2d-95a78e6fabca none

and fail again:

[   46.251087] systemd[816]: Failed at step STDOUT spawning /usr/lib/systemd/systemd-cryptsetup: Connection refused

Now time to resort to desparate measures:

[   46.404407] systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
[   46.616412] systemd[1]: Enqueued job emergency.target/start as 249

Trying to mount /boot, but failing:

[   47.204481] systemd[1]: About to execute: /bin/mount /dev/md11 /boot -t ext4 -o defaults,noatime,barrier=1
[   47.206958] systemd[825]: Failed at step STDOUT spawning /bin/mount: Connection refused

Plymouthd quit in preparation for emergency mode:

[   47.331673] systemd[1]: Got SIGCHLD for process 62 (plymouthd)
[   47.331816] systemd[1]: Child 62 died (code=exited, status=0/SUCCESS)
[   47.339202] systemd[1]: About to execute: /bin/echo 'Welcome to emergency mode. Use "systemctl default" or ^D to enter default mode.'

And the shell gets started:

[   47.401108] systemd[1]: Forked /sbin/sulogin as 831

40 seconds later, presumably after manual fix ups, we're back to booting:

[   86.461123] systemd[1]: Accepted connection on private bus.

And here we ago again, let's try to decrypt /home:

[   86.470337] systemd[1]: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-451045dc-5ca4-49a7-9e2d-95a78e6fabca /dev/disk/by-uuid/451045dc-5ca4-49a7-9e2d-95a78e6fabca none

Success!
[   96.913152] systemd[1]: Job systemd-cryptsetup@luks\x2d451045dc\x2d5ca4\x2d49a7\x2d9e2d\x2d95a78e6fabca.service/start finished, result=done

10 seconds of reading and typing the password and cryptsetup finished.  Of course, plymouth isn't running at this point, so it must have been some fallback text thing.

Or was it a success? almost a minute and a half later we see:

[  176.462926] systemd[1]: Job dev-mapper-luks\x2d451045dc\x2d5ca4\x2d49a7\x2d9e2d\x2d95a78e6fabca.device/start timed out.

So systemd didn't realize things succeeded and sat there waiting.  Next up, your swap again:

[  207.186811] systemd[1]: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-987f98d1-0144-4bb4-a433-f7233784ca40 /dev/disk/by-uuid/987f98d1-0144-4bb4-a433-f7233784ca40 none

Success!

[  220.686155] systemd[1]: Child 1000 died (code=exited, status=0/SUCCESS)

Err, fail:

[  297.186592] systemd[1]: Job dev-mapper-luks\x2d987f98d1\x2d0144\x2d4bb4\x2da433\x2df7233784ca40.device/start timed out.

Err, success:

[  302.825070] Adding 10482612k swap on /dev/mapper/luks-987f98d1-0144-4bb4-a433-f7233784ca40.  Priority:-1 extents:1 across:10482612k
[  302.841516] systemd[1]: dev-mapper-luks\x2d987f98d1\x2d0144\x2d4bb4\x2da433\x2df7233784ca40.swap changed dead -> active

Comment 29 Bruno Wolff III 2012-07-25 21:42:51 UTC
Just to fill in a couple of blanks, the boot process looks like this from my perspective:

Get prompt for and enter root fs passphrase (which is the same for all there luks containers).

Wait while a bunch of messages are printed that indicate the boot isn't going well.

Get prompt for and enter the root password to get a shell.

systemctl start home.mount

Get prompted for and enter the home fs password (and sometimes the swap container as well).

Wait for timeout and error message (but the luksOpen still happens).

mount /dev/mapper/*f6 /home
systemctl start swap.target

Get prompted for and enter the swap container password.

Wait for timeout and error message (but the luksOpen still happens).

swapon -a
systemctl start graphical.target

System finishes booting up into graphical mode.

Comment 30 Bruno Wolff III 2012-07-25 21:48:13 UTC
I have two machines that I see this on, and I can get you a more up to date log from the second machine if you think it would help. To me, they both look pretty similar.

I am going to be going out of town for a week at the end of this weekend. I won't have access to complete a reboot of either of these machines during that time and won't be able to test fixes until I get back.

Comment 31 Ray Strode [halfline] 2012-07-26 16:48:56 UTC
It would be interesting to see if the various

"Failed at step STDOUT spawning $FOO: Connection refused"

messages are caused by some sort of selinux policy and/or labeling problem.

Comment 32 Bruno Wolff III 2012-07-26 20:24:55 UTC
Created attachment 600604 [details]
dmesg output in permissive mode

The boot worked correctly this time. I hadn't tested a permissive mode boot in a while, so it looks like you fixed plymouth and now there still remains a second bug related to selinux.

This system isn't the same one the last dump was for, because I am at this one now and probably wouldn't get a chance to do a reboot there until late tonight or tomorrow. The disk encryption setup has a very similar pattern and they were having the same problem.

Thanks.

Comment 33 Michal Schmidt 2012-07-27 09:20:14 UTC
(In reply to comment #31)
> It would be interesting to see if the various
> 
> "Failed at step STDOUT spawning $FOO: Connection refused"
> 
> messages are caused by some sort of selinux policy and/or labeling problem.

Yes, these are caused by a broken journald and it is broken because of SELinux issues that are being discussed in bug 842758.

Comment 34 Bruno Wolff III 2012-07-27 12:35:04 UTC
I think it is safe to close this.
I reported all of the avc's in another bug and it was reported back they're all fixed now in bug 843699.

Comment 35 Ray Strode [halfline] 2012-07-27 18:52:15 UTC
Okay,

I committed:

http://cgit.freedesktop.org/plymouth/commit/?id=c8e548d5e4266ce1583e5b532fe3a3559932d3be

to cover the read-write service getting started very very early, and harald pushed some commits to dracut to make it interact better with plymouth getting run from systemd, so hopefully all loose ends are mopped up.