Red Hat Bugzilla – Bug 830482
Encrypted file systems not mounted in late part of boot because of systemd units missing from plymouth
Last modified: 2012-07-27 14:52:15 EDT
Description of problem:
With systemd-185-5.gita2368a3.fc18 and plymouth-0.8.5.1-1.fc18 the boot process stalls when reaching the step where file system devices get mounted. Neither /home nor swap get mounted. After a pause, a number of systemd targets and services fail because of dependency issues. I can get the system up on a vt by entering the root password and then manually starting the graphical target. I get prompted to enter the passwords for /home and swap. gdm never seems to start, but I can switch to a vt to make changes. In this case I downgraded to systemd-185-2.fc18 and plymouth-0.8.4-0.20120319.3.fc18. After doing that then the system booted normally again.
Note that both /home and swap are on encrypted software raid 1 devices.
Version-Release number of selected component (if applicable):
It seems to happen every boot. I don't know if raid or encryption are needed to trigger the problem. Maybe having any non-root file system will do it.
Steps to Reproduce:
(In reply to comment #0)
> After a pause, a number of systemd targets and services fail because of
> dependency issues.
Can you capture the failures? Maybe it's logged in /var/log/boot.log.
Even better if you could capture the debugging output of systemd by following the instructions here:
Created attachment 591458 [details]
This includes the initial failure plus stuff I did to get the system usable.
This includes the logs until after I had the system usable.
systemctl start home.mount failed after setting up the luks device. I noticed that it didn't set up a symlink like it did for the /boot and swap devices. I manually mounted /home from the luks device.
Stopping and starting the swap.target got swap up.
The problem logging in graphically was not being able to update /var/lib/gdm/.Iceauthority . Setting selinux in permissive mode fixed this. However I don't see this when using systemd 185-2, so it may be that some labelling got messed up when I manually started the graphical.target.
In case anyone else wants to downgrade, I'm noting that you need to do this in yum shell as you need to both erase systemd-libs and downgrade plymouth* and systemd* as systemd-libs is a new subpackage that has file conflicts with the downgraded version of systemd. And of course you'll have needed to put the downgraded versions of the plymouth and systemd packages in a local repo.
Created attachment 591474 [details]
log from systemd-185-2 boot
It looks like the gdm/selinux issue is new and not part of the systemd problem as I am now seeing that even with 185-2. I'll file a separate bug report for that.
I also noticed that I didn't get any vt's when I booted using 185-2. I had to use ssh to set selinux in permissive mode so that I could get logged in graphically.
Thanks for the logs. They show that we're missing the systemd-ask-password-plymouth.* units. We must have lost them in the move of the plymouth-related units from systemd to the plymouth package.
Should this stay assigned to systemd or get reassigned to plymouth?
Reassigning to plymouth to have the systemd-ask-password-plymouth.* units added there.
This is still broken with plymouth-0.8.5.1-3.fc18.i686.
I filed this at freedesktop.org as well:
Created attachment 597304 [details]
systemd is still not able to pass along device passwords after plymouth 0.8.6-1 was installed, which is supposed to have the missing units.
I captured some more information. After starting the normal boot, I manually tried to run home.mount and swap.target which unlocked the luks containers but did not finnish mounting home or enabling swap. At that point I captured dmesg output. I was able to manually mount home, turn on swap and finish by starting the graphical.target.
I screwed up the 0.8.6 release, i'm doing a 0.8.6.1 now.
Reopening, there are additional problems I found in someone's cube today.
I tested plymouth-0.8.6.1-1.fc18 and dracut-020-72.git20120710.fc18 and it still isn't working. I also saw a bit different output when booting with enforcing=0 than without it, so there might be a couple of problems. I am short on time now, but I'll try to collect logs either tonight or tomorrow.
One note I'd like to add is that if the person has an encrypted rootfs and a USB keyboard then they might run across bug 839075 (which prevents entering a password for the rootfs) making it hard to test this bug.
Is this still a plymouth issue? Perhaps this needs to be assigned back to systemd?
There are still some parts missing from the transition of the units from the systemd package to plymouth. Frederic Crozat posted a patch for it today:
Let's see if applying it helps.
I'm going to try a rebuild of plymouth with that patch and see if it fixes the problem.
I tried out the patch and encrypted home and swap are still not getting setup properly.
I tried out plymouth-0.8.6.2-0.2012.07.23.fc18.i686 and it didn't help.
Did you rebuild the initrd after installing it?
How is it failing exactly? Same way?
There's still one race condition I've noticed wrt to modesetting drivers loading that causes graphics corruption in some cases, but the main issues should be fixed with plymouth-0.8.6.2-0.2012.07.23.fc18.i686 as far as I know.
Yes I rebuilt the initramfs. There have been some changes since I originally reported the issue, but I don't know that there due to plymouth. The systemd journal service doesn't seem to be always starting. Sometimes when I start home.mount I get asked for the swap device password first. The devices do get luksOpen'd when I do that, but the next step of mounting home or enabling the swap device timeout and then fail. I manually do that part before starting the graphical target. I'll get another log of the boot process.
Created attachment 600202 [details]
dmesg output for plymouth-0.8.6.2-0.2012.07.23.fc18.i686
(In reply to comment #21)
> The systemd journal service doesn't seem to be always starting.
Oh! This should be fixed with systemd-187 and dracut-022. If not, can you please post a dmesg, where this is the case?
I think what I saw was included in the last dmesg output I attached to this bug. I was using systemd 187-1 and a 022 dracut.
It looks like the journal gets restarted a few times and there is at least one selinux avc.
Anything like the denials here?: https://bugzilla.redhat.com/show_bug.cgi?id=842758#c7
[ 42.925425] systemd-journald: Unknown socket passed.
As a reply to comment 25, yes I see similar avcs. Though I normally boot in enforcing mode.
I don't think that is likely to be the problem with mounting home or enabling swap though.
I have tried booting in permissive mode and still see pretty much the same thing.
There seems to be two issues. The password entered for the rootfs isn't used to setup the other luks containers (nor is there a prompt for them). And when the luks containers are setup using start home.mount or start swap.target, they aren't used by whatever is supposed to make use of them.
This are still pretty hosed, looking at your log. From a plymouth point of view the first thing I see is this:
[ 1.693627] systemd: About to execute: /usr/bin/plymouth update-root-fs --read-write
The plymouth client is getting run to tell the plymouth daemon the root filesystem is writable, before the root file system is mounted and before the plymouth daemon is started.
The next thing I see is this:
[ 1.709654] systemd: About to execute: /usr/sbin/plymouthd --mode=boot --pid-file=/var/run/plymouth/pid --attach-to-session
[ 1.718184] systemd: Forked /usr/sbin/plymouthd as 59
So plymouthd is now started. Next up:
[ 1.727452] systemd: About to execute: /bin/udevadm settle --timeout=30 --exit-if-exists=/sys/class/drm/card0/dev
Plymouth is running udevadm settle to block until the drm device is ready. After that, failure:
[ 1.736585] systemd: Child 55 died (code=exited, status=1/FAILURE)
[ 1.736592] systemd: Child 55 belongs to plymouth-read-write.service
That weird read-write service that got called way too early finally finished in failure.
Next up, some time later:
[ 2.305590] systemd: Got SIGCHLD for process 64 (udevadm)
[ 2.305731] systemd: Child 64 died (code=exited, status=0/SUCCESS)
drm device is loaded, now time to check if the fb device is loaded:
[ 2.311191] systemd: About to execute: /bin/udevadm settle --timeout=30 --exit-if-exists=/sys/class/graphics/fb0/dev
[ 2.320128] systemd: Forked /bin/udevadm as 162
[ 2.324792] systemd: Child 162 died (code=exited, status=0/SUCCESS)
Yup, now time to show the plymouth splash:
[ 2.330218] systemd: About to execute: /usr/bin/plymouth show-splash
[ 2.341169] systemd: Forked /usr/bin/plymouth as 165
So splash screen is about to be shown, but meanwhile:
[ 2.362671] systemd: Received SIGRTMIN+20 from PID 62 (plymouthd).
[ 2.362689] systemd: Enabling showing of status.
plymouthd told systemd to be verbose so when the user hits escape we have something to show them. And now:
[ 2.364596] systemd: Received SIGCHLD from PID 165 (plymouth).
[ 2.364676] systemd: Got SIGCHLD for process 165 (plymouth)
splash screen is shown. What next?
[ 2.369333] systemd: systemd-ask-password-plymouth.path changed dead -> waiting
[ 2.369354] systemd: Job systemd-ask-password-plymouth.path/start finished, result=done
[ 2.369389] systemd: Starting of systemd-ask-password-console.path requested but condition failed. Ignoring.
The systemd agent thing that dispatches password requests to plymouth is started and the agent thing that dispatches directly to console when plymouth isn't around decided not to start since plymouth is around. Next:
[ 2.611618] [drm] Initialized drm 1.1.0 20060810
err wait, the drm device just now initialized, so we showed the splash screen too soon. It probably didn't show up. Our udevadm settle command from 1.72745 isn't working. Next:
[ 11.154385] md: bind<sdb1>
[ 11.171740] md: bind<sda1>
[ 11.188344] md: raid1 personality registered for level 1
Some raid setup.
[ 11.401167] md13: unknown partition table
[ 25.422971] EXT4-fs: Warning: mounting with data=journal disables delayed allocation and O_DIRECT support!
Now a mysterious 14 second pause in boot with no messages, followed by something getting mounted!
You have root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 on your kernel command line, so I assume that's what's getting mounted. I guess that mysterious 14 seconds was you entering your password, but there's no record of you getting asked to enter your password in the log, so that's very strange.
[ 25.833576] systemd: About to execute: /bin/dracut-pre-pivot
[ 25.846059] systemd: Forked /bin/dracut-pre-pivot as 536
Okay, dracut thinks your root fs is mounted and is about to swivel to it. In preparation to swiveling it's trying to remove your device-mapper device for your root fs:
[ 26.345595] device-mapper: ioctl: unable to remove open device luks-9a976b86-8aaa-40d9-8039-89d710eac5c9
Anyway it finished its pre-privot service:
[ 35.855918] systemd: dracut-pre-pivot.service: main process exited, code=exited, status=0
Now, on to the slaying. It's going to kill everything it can before jumping to the root fs:
[ 35.856099] systemd: About to execute: /usr/bin/systemctl --no-block isolate initrd-switch-root.target
And next mention of plymouth:
[ 35.884069] systemd: Installed new job plymouth-switch-root.service/start as 36
That service needs to run before the switch root, and of course plymouth shouldn't get slayed. And:
[ 35.884151] systemd: Installed new job systemd-ask-password-plymouth.path/stop as 46
The agent that was started before to dispatch password requests to plymouth is set to get stopped now. Plymouth is told about the root switch:
[ 35.885856] systemd: About to execute: /usr/bin/plymouth update-root-fs --new-root-dir=/sysroot
[ 35.895071] systemd: Forked /usr/bin/plymouth as 563
[ 35.932544] systemd: Child 563 died (code=exited, status=0/SUCCESS)
So now we're sitting on your root fs, and then:
[ 41.423723] systemd: Looking at job plymouth-quit.service/stop conflicted_by=yes
[ 41.423730] systemd: Looking at job plymouth-quit.service/start conflicted_by=no
[ 41.423735] systemd: Fixing conflicting jobs by deleting job plymouth-quit.service/start
[ 41.423745] systemd: Looking at job prefdm.service/start conflicted_by=no
[ 41.423750] systemd: Looking at job prefdm.service/stop conflicted_by=no
[ 41.423755] systemd: Fixing conflicting jobs by deleting job prefdm.service/stop
We remove plymouth-quit.service since prefdm is going to handle quitting plymouth at the end of boot up. We also have:
[ 45.330743] systemd: Starting of plymouth-start.service requested but condition failed. Ignoring.
[ 45.330753] systemd: Job plymouth-start.service/start finished, result=done
plymouth-start.service again incase you weren't running an initrd. We ignore it since you were.
[ 45.330928] systemd: Starting of systemd-ask-password-console.path requested but condition failed. Ignoring.
We ignore the fallback password service, since we have plymouth, and readd the job to start the agent for dispatching passwords to plymouth:
[ 45.333335] systemd: systemd-ask-password-plymouth.path changed dead -> waiting
[ 45.333368] systemd: Job systemd-ask-password-plymouth.path/start finished, result=done
Now we run some fedora storage service thing, but it fails:
[ 45.459537] systemd: About to execute: /lib/systemd/fedora-storage-init
[ 45.462106] systemd: Failed at step STDOUT spawning /lib/systemd/fedora-storage-init: Connection refused
We do notice your swap device though:
[ 45.707516] systemd: Job dev-disk-by\x2duuid-987f98d1\x2d0144\x2d4bb4\x2da433\x2df7233784ca40.device/start finished, result=done
and then try to decrypt it:
[ 45.707924] systemd: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-987f98d1-0144-4bb4-a433-f7233784ca40 /dev/disk/by-uuid/987f98d1-0144-4bb4-a433-f7233784ca40 none
But fail before asking plymouth:
[ 45.717056] systemd: Failed at step STDOUT spawning /usr/lib/systemd/systemd-cryptsetup: Connection refused
And then we try to decrypt you're already decrypted root fs device:
[ 45.766620] systemd: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-9a976b86-8aaa-40d9-8039-89d710eac5c9 /dev/disk/by-uuid/9a976b86-8aaa-40d9-8039-89d710eac5c9 none
and fail again:
[ 45.769615] systemd: Failed at step STDOUT spawning /usr/lib/systemd/systemd-cryptsetup: Connection refused
Now your home device:
[ 46.248281] systemd: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-451045dc-5ca4-49a7-9e2d-95a78e6fabca /dev/disk/by-uuid/451045dc-5ca4-49a7-9e2d-95a78e6fabca none
and fail again:
[ 46.251087] systemd: Failed at step STDOUT spawning /usr/lib/systemd/systemd-cryptsetup: Connection refused
Now time to resort to desparate measures:
[ 46.404407] systemd: Triggering OnFailure= dependencies of local-fs.target.
[ 46.616412] systemd: Enqueued job emergency.target/start as 249
Trying to mount /boot, but failing:
[ 47.204481] systemd: About to execute: /bin/mount /dev/md11 /boot -t ext4 -o defaults,noatime,barrier=1
[ 47.206958] systemd: Failed at step STDOUT spawning /bin/mount: Connection refused
Plymouthd quit in preparation for emergency mode:
[ 47.331673] systemd: Got SIGCHLD for process 62 (plymouthd)
[ 47.331816] systemd: Child 62 died (code=exited, status=0/SUCCESS)
[ 47.339202] systemd: About to execute: /bin/echo 'Welcome to emergency mode. Use "systemctl default" or ^D to enter default mode.'
And the shell gets started:
[ 47.401108] systemd: Forked /sbin/sulogin as 831
40 seconds later, presumably after manual fix ups, we're back to booting:
[ 86.461123] systemd: Accepted connection on private bus.
And here we ago again, let's try to decrypt /home:
[ 86.470337] systemd: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-451045dc-5ca4-49a7-9e2d-95a78e6fabca /dev/disk/by-uuid/451045dc-5ca4-49a7-9e2d-95a78e6fabca none
[ 96.913152] systemd: Job systemd-cryptsetup@luks\x2d451045dc\x2d5ca4\x2d49a7\x2d9e2d\x2d95a78e6fabca.service/start finished, result=done
10 seconds of reading and typing the password and cryptsetup finished. Of course, plymouth isn't running at this point, so it must have been some fallback text thing.
Or was it a success? almost a minute and a half later we see:
[ 176.462926] systemd: Job dev-mapper-luks\x2d451045dc\x2d5ca4\x2d49a7\x2d9e2d\x2d95a78e6fabca.device/start timed out.
So systemd didn't realize things succeeded and sat there waiting. Next up, your swap again:
[ 207.186811] systemd: About to execute: /usr/lib/systemd/systemd-cryptsetup attach luks-987f98d1-0144-4bb4-a433-f7233784ca40 /dev/disk/by-uuid/987f98d1-0144-4bb4-a433-f7233784ca40 none
[ 220.686155] systemd: Child 1000 died (code=exited, status=0/SUCCESS)
[ 297.186592] systemd: Job dev-mapper-luks\x2d987f98d1\x2d0144\x2d4bb4\x2da433\x2df7233784ca40.device/start timed out.
[ 302.825070] Adding 10482612k swap on /dev/mapper/luks-987f98d1-0144-4bb4-a433-f7233784ca40. Priority:-1 extents:1 across:10482612k
[ 302.841516] systemd: dev-mapper-luks\x2d987f98d1\x2d0144\x2d4bb4\x2da433\x2df7233784ca40.swap changed dead -> active
Just to fill in a couple of blanks, the boot process looks like this from my perspective:
Get prompt for and enter root fs passphrase (which is the same for all there luks containers).
Wait while a bunch of messages are printed that indicate the boot isn't going well.
Get prompt for and enter the root password to get a shell.
systemctl start home.mount
Get prompted for and enter the home fs password (and sometimes the swap container as well).
Wait for timeout and error message (but the luksOpen still happens).
mount /dev/mapper/*f6 /home
systemctl start swap.target
Get prompted for and enter the swap container password.
Wait for timeout and error message (but the luksOpen still happens).
systemctl start graphical.target
System finishes booting up into graphical mode.
I have two machines that I see this on, and I can get you a more up to date log from the second machine if you think it would help. To me, they both look pretty similar.
I am going to be going out of town for a week at the end of this weekend. I won't have access to complete a reboot of either of these machines during that time and won't be able to test fixes until I get back.
It would be interesting to see if the various
"Failed at step STDOUT spawning $FOO: Connection refused"
messages are caused by some sort of selinux policy and/or labeling problem.
Created attachment 600604 [details]
dmesg output in permissive mode
The boot worked correctly this time. I hadn't tested a permissive mode boot in a while, so it looks like you fixed plymouth and now there still remains a second bug related to selinux.
This system isn't the same one the last dump was for, because I am at this one now and probably wouldn't get a chance to do a reboot there until late tonight or tomorrow. The disk encryption setup has a very similar pattern and they were having the same problem.
(In reply to comment #31)
> It would be interesting to see if the various
> "Failed at step STDOUT spawning $FOO: Connection refused"
> messages are caused by some sort of selinux policy and/or labeling problem.
Yes, these are caused by a broken journald and it is broken because of SELinux issues that are being discussed in bug 842758.
I think it is safe to close this.
I reported all of the avc's in another bug and it was reported back they're all fixed now in bug 843699.
to cover the read-write service getting started very very early, and harald pushed some commits to dracut to make it interact better with plymouth getting run from systemd, so hopefully all loose ends are mopped up.