Bug 1013767 - rootfs on thinp not found, startup failure
rootfs on thinp not found, startup failure
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: dracut (Show other bugs)
20
Unspecified Unspecified
unspecified Severity high
: ---
: ---
Assigned To: dracut-maint
Fedora Extras Quality Assurance
AcceptedBlocker
:
: 1028616 (view as bug list)
Depends On:
Blocks: F20BetaBlocker 1020855
  Show dependency treegraph
 
Reported: 2013-09-30 14:05 EDT by Chris Murphy
Modified: 2013-11-11 10:07 EST (History)
22 users (show)

See Also:
Fixed In Version: dracut-034-19.git20131021.fc20
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1020855 (view as bug list)
Environment:
Last Closed: 2013-10-23 21:39:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
rdsosreport.txt (166.47 KB, text/plain)
2013-09-30 14:06 EDT, Chris Murphy
no flags Details
anaconda program.log (52.96 KB, text/plain)
2013-09-30 14:12 EDT, Chris Murphy
no flags Details
rdsosreport.txt 2 (64.80 KB, text/plain)
2013-10-15 13:11 EDT, Chris Murphy
no flags Details
Proposed patch (799 bytes, patch)
2013-10-16 04:53 EDT, Harald Hoyer
no flags Details | Diff
Second half of the patch. (1.45 KB, patch)
2013-10-18 07:31 EDT, Marian Csontos
no flags Details | Diff
custom partitioning /root layout (123.81 KB, image/png)
2013-11-11 10:06 EST, IBM Bug Proxy
no flags Details
custom partitioning /boot layout (103.25 KB, image/png)
2013-11-11 10:06 EST, IBM Bug Proxy
no flags Details
Serial console logs from affected machine (16.33 KB, text/plain)
2013-11-11 10:07 EST, IBM Bug Proxy
no flags Details
Complete journalctl logs from the affected machine. (30.42 KB, text/x-log)
2013-11-11 10:07 EST, IBM Bug Proxy
no flags Details
Complete dmesg logs from the affected machine. (12.83 KB, text/x-log)
2013-11-11 10:07 EST, IBM Bug Proxy
no flags Details

  None (edit)
Description Chris Murphy 2013-09-30 14:05:53 EDT
Description of problem: Fedora 20 guided partitioning scheme LVM Thing Provision is selected and system is installed successfully but on reboot, dropped to dracut prompt.


Version-Release number of selected component (if applicable):
dracut-032-1.fc20.x86-64



How reproducible:
Always


Steps to Reproduce:
1. Fedora 20 desktop ISO, boot and install using LVM Thin Provisioning in guided partitioning.
2. Reboot after successful install

Actual results:
dracut-initqueue[196]: Warning: /dev/fedora_f20s/root does not exist


Expected results:
Startup to a login prompt or desktop.

Additional info:
Comment 1 Chris Murphy 2013-09-30 14:06:43 EDT
Created attachment 805434 [details]
rdsosreport.txt
Comment 2 Chris Murphy 2013-09-30 14:08:50 EDT
Regression: booting the F20alpha desktop media, /dev/fedora_f20s/root and /dev/fedora_f20s/home LVs are active and mountable. Both are virtual LVs drawing from /dev/fedora_f20s/pool00 thin pool. So everything appears to have been created by anaconda correctly.

Could this be a systemd bug?
Comment 3 Chris Murphy 2013-09-30 14:12:27 EDT
Created attachment 805436 [details]
anaconda program.log
Comment 4 Chris Murphy 2013-09-30 21:22:51 EDT
Fedora-Live-Desktop-x86_64-20-Alpha-4.iso was used to install. Plus these updates:
anaconda-20.20-1.fc20.x86_64
anaconda-widgets-20.20-1.fc20.x86_64
python-blivet-0.22-1.fc20.noarch

Proposing as beta blocker. Post-install requirement: "No part of any release-blocking desktop's panel (or equivalent) configuration may crash on startup or be entirely non-functional." 

Also, expected to fail with non-desktop install, so this applies: "A system installed without a graphical package set must boot to a working login prompt without any unintended user intervention, and all virtual consoles intended to provide a working login prompt must do so."
Comment 5 Jóhann B. Guðmundsson 2013-10-02 11:58:30 EDT
Sounds like this should have been fixed while back

 * Wed Jun 12 2013 Harald Hoyer <harald@redhat.com> 028-1 - lvm: fixed "thin" recognition

Harald regression?
Comment 6 Mike Ruckman 2013-10-02 16:24:08 EDT
Discussed this in the 2013-10-02 Blocker Review Meeting [1]. This has been voted an AcceptedBlocker. This violates the following F20 alpha release criterion for LVM thinp autopart installs: "A system installed with a release-blocking desktop must boot to a log in screen where it is possible to log in to a working desktop using a user account created during installation or a 'first boot' utility." [2]

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2013-10-02/
[2] http://fedoraproject.org/wiki/Fedora_20_Alpha_Release_Criteria#Expected_installed_system_boot_behavior
Comment 7 Harald Hoyer 2013-10-04 09:53:40 EDT
[   11.731170] f20s.localdomain kernel: apple 0003:05AC:0230.0015: hidraw1: USB HID v1.11 Device [Apple, Inc. Apple Internal Keyboard / Trackpad] on usb-0000:00:1d.2-2/input1
...

seems like automatic lvm assembly did not work. dracut times out and calls lvm directly
can somebody from the lvm team please test?

...
[  127.838961] f20s.localdomain dracut-initqueue[196]: ACTIVE '/dev/fedora_f20s/swap' [3.89 GiB] inherit
[  127.839358] f20s.localdomain dracut-initqueue[196]: inactive '/dev/fedora_f20s/pool00' [421.10 GiB] inherit
[  127.839668] f20s.localdomain dracut-initqueue[196]: inactive '/dev/fedora_f20s/home' [371.10 GiB] inherit
[  127.839953] f20s.localdomain dracut-initqueue[196]: inactive '/dev/fedora_f20s/root' [50.00 GiB] inherit
[  127.842119] f20s.localdomain dracut-initqueue[196]: PARTIAL MODE. Incomplete logical volumes will be processed.
[  127.899855] f20s.localdomain dracut-initqueue[196]: /usr/sbin/thin_check: execvp failed: No such file or directory
[  127.900290] f20s.localdomain dracut-initqueue[196]: Check of thin pool fedora_f20s/pool00 failed (status:2). Manual repair required (thin_dump --repair /dev/mapper/fedora_f20s-pool00_tmeta)!
[  127.900663] f20s.localdomain dracut-initqueue[196]: /usr/sbin/thin_check: execvp failed: No such file or directory
[  127.901050] f20s.localdomain dracut-initqueue[196]: Unable to change internal LV pool00_tdata directly


Hmm... what does thin_check want to execvp?
Comment 8 Jóhann B. Guðmundsson 2013-10-04 10:29:08 EDT
I'm pretty sure the lvm team can correct me if I'm wrong here but is not the check done before activation and after deactivation of thin pool? 

Harald what happens when thin checking has been set to skip the check in lvm ( thin_check_executable = "" to the global{} in lvm.conf )?
Comment 9 Mike Snitzer 2013-10-04 10:36:17 EDT
(In reply to Harald Hoyer from comment #7)
> [   11.731170] f20s.localdomain kernel: apple 0003:05AC:0230.0015: hidraw1:
> USB HID v1.11 Device [Apple, Inc. Apple Internal Keyboard / Trackpad] on
> usb-0000:00:1d.2-2/input1
> ...
> 
> seems like automatic lvm assembly did not work. dracut times out and calls
> lvm directly
...
> [  127.900663] f20s.localdomain dracut-initqueue[196]: /usr/sbin/thin_check:
> execvp failed: No such file or directory
> [  127.901050] f20s.localdomain dracut-initqueue[196]: Unable to change
> internal LV pool00_tdata directly
> 
> 
> Hmm... what does thin_check want to execvp?

thin_check doesn't call execvp, lvm2 calls execvp to run thin_check -- via lib/misc/lvm-exec.c:exec_cmd().  So it would seem the problem is /usr/sbin/thin_check doesn't exist in the initramfs.  Does dracut copy thin_check (and the other thin binaries) to a different path?

dracut/modules.d/90lvm/module_setup.sh has:
    if [[ $_needthin ]]; then
        inst_multiple -o thin_dump thin_restore thin_check thin_repair
    fi

I cannot easily tell from dracut-functions.sh:inst_multiple() where dracut places these binaries in the initramfs.  Please advise, thanks!

As an aside:
Strikes me as a bug in how dracut executes other programs given it hides which binary is actually generating the above errors.  It looks like the above errors are coming from the dracut-initqueue service directly rather than lvm2.
Comment 10 Harald Hoyer 2013-10-08 04:09:48 EDT
(In reply to Mike Snitzer from comment #9)
> (In reply to Harald Hoyer from comment #7)
> > [   11.731170] f20s.localdomain kernel: apple 0003:05AC:0230.0015: hidraw1:
> > USB HID v1.11 Device [Apple, Inc. Apple Internal Keyboard / Trackpad] on
> > usb-0000:00:1d.2-2/input1
> > ...
> > 
> > seems like automatic lvm assembly did not work. dracut times out and calls
> > lvm directly
> ...
> > [  127.900663] f20s.localdomain dracut-initqueue[196]: /usr/sbin/thin_check:
> > execvp failed: No such file or directory
> > [  127.901050] f20s.localdomain dracut-initqueue[196]: Unable to change
> > internal LV pool00_tdata directly
> > 
> > 
> > Hmm... what does thin_check want to execvp?
> 
> thin_check doesn't call execvp, lvm2 calls execvp to run thin_check -- via
> lib/misc/lvm-exec.c:exec_cmd().  So it would seem the problem is
> /usr/sbin/thin_check doesn't exist in the initramfs.  Does dracut copy
> thin_check (and the other thin binaries) to a different path?
> 
> dracut/modules.d/90lvm/module_setup.sh has:
>     if [[ $_needthin ]]; then
>         inst_multiple -o thin_dump thin_restore thin_check thin_repair
>     fi
> 
> I cannot easily tell from dracut-functions.sh:inst_multiple() where dracut
> places these binaries in the initramfs.  Please advise, thanks!

inst_multiple just installs these binaries in the same place, as they are on the host, given they can be found via PATH.

"-o" means "optional", meaning dracut would not error out, if it can't find one of those. 

> 
> As an aside:
> Strikes me as a bug in how dracut executes other programs given it hides
> which binary is actually generating the above errors.  It looks like the
> above errors are coming from the dracut-initqueue service directly rather
> than lvm2.

Well, because the shell script, which executes all these lvm commands, is a systemd service, the logs are prefixed with the service name.
Comment 11 Harald Hoyer 2013-10-08 04:13:40 EDT
Mike, can you tell from the information in the attachment 1 [details] , why the lvm was not assembled automatically?
Comment 12 Mike Ruckman 2013-10-09 17:25:44 EDT
Discussed this in 2013-10-09 Blocker Review Meeting [1]. Will revisit next meeting - nothing new to vote on.

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2013-10-09/
Comment 13 Jaroslav Reznik 2013-10-15 09:06:55 EDT
Mike (or LVM guys), could you take a look on Harald's question in comment #11? It's Beta blocking bug and stuck for a week. Thanks.
Comment 14 Heinz Mauelshagen 2013-10-15 09:11:05 EDT
(In reply to Jaroslav Reznik from comment #13)
> Mike (or LVM guys), could you take a look on Harald's question in comment
> #11? It's Beta blocking bug and stuck for a week. Thanks.

File bogus!?

$ tar tvf DBI-1.11.tar.gz 

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Comment 15 Mike Snitzer 2013-10-15 09:42:17 EDT
I'm having the same problem as Heinz (comment#11) with that attachment.

I still think lvm2 couldn't execute thin_check, would be interesting if the original reporter did what was suggested in comment#8 -- disable thin_check in lvm.conf and then rebuild the initramfs.  If not possible, since this is the root fs we're talking about, I'm not sure what else to try.

When dracut runs in the installer environment, does it actually have the device-mapper-persistent-data package installed and available?
Comment 16 Chris Murphy 2013-10-15 11:47:42 EDT
The link in c11 is bogus, its URL is the first ever redhat bugzilla attachment. It's from the pleistocene. The correct attachment is the first one in the attachment section, rdsosreport.txt (166.47 KB, text/plain), it is not compressed.

I don't know if it'll work, but in-progress is a new thinp install with beta TC4 from DVD. If it doesn't work I'll use DVD rescue mode, chroot the broken install, change the lvm.conf and rebuild the initramfs. It would be useful if someone can tell me exactly what entry and location within lvm.conf to use, in case it's not obvious, once I get to that step.
Comment 17 Chris Murphy 2013-10-15 12:27:07 EDT
Unmodified install from Fedora-20-Beta-TC4-x86_64-DVD.iso still fails (this attempt is on qemu/kvm using a conventional LV on the host as backing; the original description is on baremetal) when using the default kernel option.

However, when using the rescue kernel option, which has a differently produced initramfs, the system is bootable.
Comment 18 Marian Csontos 2013-10-15 12:39:32 EDT
Harald, try applying the following bash-fu, please.

I believe the snippet below (found in 90lvm/module-setup.sh from el7 build) is the culprit. It is setting _needthin variable in a subprocess as there's a pipe used:

    if [[ $hostonly ]]; then
        get_host_lvs | while read line; do
            printf "%s" " rd.lvm.lv=$line"
            if ! [[ $_needthin ]]; then
                [[ "$(lvs --noheadings -o segtype ${line%%/*} 2>/dev/null)" == *thin* ]] && _needthin=1
            fi
        done >> "${initdir}/etc/cmdline.d/90lvm.conf"
        echo >> "${initdir}/etc/cmdline.d/90lvm.conf"
    else
        _needthin=1
    fi

It could be worked around by feeding the get_host_lvs output using `< <()` instead of pipe. I verified this change.

Also there may be another issue: for some reason /dev/VG/POOL and /dev/VG/POOL_tdata are sought for and not found as these are not supposed to be there. They are not to be used except by LVM. Simply exiting the dracut shell system boots just fine.

-- Martian
Comment 19 Marian Csontos 2013-10-15 12:42:35 EDT
Chris, could you please try if the suggested change works for you on F20?
Comment 20 Chris Murphy 2013-10-15 12:50:23 EDT
In /etc/lvm/lvm.conf replaced
# thin_check_executable = "/usr/sbin/thin_check"
with
# thin_check_executable = ""
then
dracut -f initramfs-3.11.4-302.fc20.x86_64.img 3.11.4-302.fc20.x86_64
reboot

Still does not boot.
Comment 21 Marian Csontos 2013-10-15 12:55:17 EDT
Including the hash sign? That's a comment character and has to be removed to take effect.
Comment 22 Chris Murphy 2013-10-15 13:05:55 EDT
In /etc/lvm/lvm.conf replaced
# thin_check_executable = "/usr/sbin/thin_check"
with
thin_check_executable = ""
then
dracut -f initramfs-3.11.4-302.fc20.x86_64.img 3.11.4-302.fc20.x86_64
reboot

Does not boot. The rescue kernel option (unrebuilt initramfs) does boot.
Comment 23 Chris Murphy 2013-10-15 13:11:26 EDT
Created attachment 812632 [details]
rdsosreport.txt 2

Applies to the boot failure in c22.
Comment 24 Marian Csontos 2013-10-15 13:13:48 EDT
See comment #18. Just press Ctrl-D. I have not found why are these checked at all. I will have a look tomorrow if not fixed by then.
Comment 25 Chris Murphy 2013-10-15 13:21:04 EDT
Yes, I understand I can exit from the dracut shell and boot. The beta release requirement is "must boot to a working login prompt without any unintended user intervention" for non-graphical, and "No part of any release-blocking desktop's panel (or equivalent) configuration may crash on startup or be entirely non-functional" for graphical. So in either case it's still a beta release blocking problem, even with exit work around.
Comment 26 Harald Hoyer 2013-10-16 04:42:16 EDT
(In reply to Marian Csontos from comment #18)
> Harald, try applying the following bash-fu, please.
> 
> I believe the snippet below (found in 90lvm/module-setup.sh from el7 build)
> is the culprit. It is setting _needthin variable in a subprocess as there's
> a pipe used:
> 
>     if [[ $hostonly ]]; then
>         get_host_lvs | while read line; do
>             printf "%s" " rd.lvm.lv=$line"
>             if ! [[ $_needthin ]]; then
>                 [[ "$(lvs --noheadings -o segtype ${line%%/*} 2>/dev/null)"
> == *thin* ]] && _needthin=1
>             fi
>         done >> "${initdir}/etc/cmdline.d/90lvm.conf"
>         echo >> "${initdir}/etc/cmdline.d/90lvm.conf"
>     else
>         _needthin=1
>     fi
> 
> It could be worked around by feeding the get_host_lvs output using `< <()`
> instead of pipe. I verified this change.

Oh.. yes.. right! Thanks for the find!


"< <()" uses a fd and requires /dev/fd to be setup as a link to /proc/self/fd.

We can't use that, because all the installer do not setup our chroot installation environment correctly :-(

"<<<$()" could be used instead, though.
Comment 27 Harald Hoyer 2013-10-16 04:53:32 EDT
Created attachment 812831 [details]
Proposed patch
Comment 28 Marian Csontos 2013-10-16 05:30:42 EDT
That's only half of the problem.

The other half is internal LVM devices (for example /dev/VG/POOL) are written to 90lvm.conf. This is less obvious why (at least to me.)
Comment 29 Peter Rajnoha 2013-10-16 10:44:52 EDT
To avoid scanning internal LVM devices, the only thing we need to do to make sure that the /dev/<VGNAME>/<LVNAME> exists. So if the /dev is scanned for suitable devices with certain FS UUID, we need to also take care to look at DM UUID and if it contains "LVM-" prefix, only the devices that have the /dev/<VGNAME>/<LVNAME> symlink should be scanned by blkid (if I understand the problem correctly from discussion with Marian that dracut tries to run the scan for all the /dev content, if not, please, correct me).

The devices that have /dev/<VGNAME>/<LVNAME> symlink are the ones that are suitable for use and which are not internal.
Comment 30 Peter Rajnoha 2013-10-16 10:56:21 EDT
...you can get DM uuid quickly by looking at /sys/block/dm-X/dm/uuid.
Comment 31 Marian Csontos 2013-10-18 07:31:31 EDT
Created attachment 813734 [details]
Second half of the patch.

Harald, review and commit the patch please.

Works fine with tpool on dmraid.
Comment 32 Fedora Update System 2013-10-18 09:24:36 EDT
dracut-034-18.git20131018.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/dracut-034-18.git20131018.fc20
Comment 33 Fedora Update System 2013-10-18 20:09:37 EDT
Package dracut-034-18.git20131018.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing dracut-034-18.git20131018.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-19382/dracut-034-18.git20131018.fc20
then log in and leave karma (feedback).
Comment 34 Michal Schmidt 2013-10-19 04:25:47 EDT
The fix may have caused a regression: bug 1021083.
Comment 35 Chris Murphy 2013-10-20 18:49:14 EDT
dracut-034-18.git20131018.fc20 appears to fix this bug.
1. Install Fedora-20-Beta-TC5-x86_64-DVD.iso using guided partitioning, LVM Thin Provisioning.
2. Reboot using regular kernel = fail.
2. Reboot using rescue kernel = succeed.
3. yum update dracut
4. dracut -f initramfs-3.11.5-302.fc20.x86-64.img 3.11.5-302.fc20.x86-64 
5. Reboot using regular kernel = succeed.


However, the last five lines in this sequence seem strange: monitoring tpool, no longer monitoring tpool, then monitoring tpool?

[root@localhost ~]# journalctl -b | grep -i lvm
Oct 20 16:33:09 localhost.localdomain kernel: Command line: BOOT_IMAGE=/vmlinuz-3.11.5-302.fc20.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=fedora/root vconsole.keymap=us LANG=en_US.UTF-8
Oct 20 16:33:09 localhost.localdomain kernel: Kernel command line: BOOT_IMAGE=/vmlinuz-3.11.5-302.fc20.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=fedora/root vconsole.keymap=us LANG=en_US.UTF-8
Oct 20 16:33:11 localhost.localdomain dracut-initqueue[189]: Scanning devices vda2  for LVM logical volumes fedora/swap fedora/swap fedora/root
Oct 20 16:33:11 localhost.localdomain systemd[1]: Found device /dev/disk/by-id/dm-uuid-LVM-ABK5buBtMllKynbEEyeOCSIl9Yw7M9CW35UpKKbZ2hyApfF0Y2r5EQ8MOtDMacmM.
Oct 20 16:33:14 localhost.localdomain systemd[1]: Starting LVM2 metadata daemon...
Oct 20 16:33:14 localhost.localdomain systemd[1]: Started LVM2 metadata daemon.
Oct 20 16:33:15 localhost.localdomain lvm[417]: Monitoring thin fedora-pool00-tpool.
Oct 20 16:33:15 localhost.localdomain lvm[371]: 5 logical volume(s) in volume group "fedora" monitored
Oct 20 16:33:15 localhost.localdomain systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Oct 20 16:33:15 localhost.localdomain lvm[417]: No longer monitoring thin fedora-pool00-tpool.
Oct 20 16:33:15 localhost.localdomain lvm[417]: Monitoring thin fedora-pool00-tpool.
Comment 36 Fedora Update System 2013-10-21 04:05:13 EDT
dracut-034-19.git20131021.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/dracut-034-19.git20131021.fc20
Comment 37 Fedora Update System 2013-10-23 21:39:51 EDT
dracut-034-19.git20131021.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 38 Chris Murphy 2013-10-27 03:46:25 EDT
Working with Fedora-20-Beta-TC6-x86_64-DVD.iso (install and reboot from this ISO).
Comment 39 Harald Hoyer 2013-11-11 04:50:12 EST
*** Bug 1028616 has been marked as a duplicate of this bug. ***
Comment 40 IBM Bug Proxy 2013-11-11 10:06:35 EST
Created attachment 822449 [details]
custom partitioning /root layout
Comment 41 IBM Bug Proxy 2013-11-11 10:06:53 EST
Created attachment 822450 [details]
custom partitioning /boot layout
Comment 42 IBM Bug Proxy 2013-11-11 10:07:11 EST
Created attachment 822451 [details]
Serial console logs from affected machine
Comment 43 IBM Bug Proxy 2013-11-11 10:07:33 EST
Created attachment 822452 [details]
Complete journalctl logs from the affected machine.
Comment 44 IBM Bug Proxy 2013-11-11 10:07:52 EST
Created attachment 822453 [details]
Complete dmesg logs from the affected machine.

Note You need to log in before you can comment on or make changes to this bug.