Created attachment 954858 [details]
kernel panic after auto-install reboot

Description: Using the edit-node tool to add a plugin causes a kernel panic after auto-install.

Version:
rhev-hypervisor6-6.6-20141106.1.iso
ovirt-node-3.0.1-19.el6.22.noarch.rpm
ovirt-node-tools-3.0.1-19.el6.22.noarch.rpm
ovirt-node-plugin-puppet-3.0.1-19.el6.22.noarch.rpm

Steps:
1. Add the puppet plugin:
   edit-node --install-plugin=ovirt-node-plugin-puppet-3.0.1-19.el6.22.noarch.rpm --repo=edit-node.repo /rhev-hypervisor6-6.6-20141106.1.iso
2. Auto-install with the edited ISO using "storage_init=ata firstboot".

Actual results:
After reboot, the host kernel panics.

Additional info:
This also happens when adding the automation plugin, so it blocks automated testing.
This bug impacts RHEV-H 6.6 for 3.4.z automation testing; more than 100 automation test cases cannot be executed.
Created attachment 955047 [details]
Console output
Harald - this appears to be a problem with devicemapper.

If you drop to a rescue shell on the image (which is only possible by changing root=... to something different - a possible bug in itself? I'd expect rdshell rddebug to give me a shell instead of panicking if it times out waiting for the root device), devicemapper is holding on to the disk.

/dev/mapper/1ATA_QEMU_HARDDISK_QM00001 is present, but none of its partitions are.

/dev/sda3 is present, but cannot be mounted.

dmsetup table shows:
1ATA_QEMU_HARDDISK_QM00001: 0 16777216 multipath 0 0 1 1 round-robin 0 1 1 8:0 1

Removing it (dmsetup remove 1ATA...) lets sda3 mount, and booting continues as normal.

This is unexpected behavior when booting with rd_NO_DM, and a change from the last image. An SRPM diff is below, and it's in one of these packages:

--- rhev-hypervisor6-6.6-20141021.0.iso.d/isolinux/manifest-srpm.txt 2014-10-21 00:35:05.000000000 -0700
+++ rhev-hypervisor6-6.6-20141106.1.iso.d/isolinux/manifest-srpm.txt 2014-11-06 10:24:00.000000000 -0700
-curl-7.19.7-37.el6_5.3.src.rpm
+curl-7.19.7-40.el6_6.1.src.rpm
-initscripts-9.03.46-1.el6.src.rpm
-ipmitool-1.8.11-21.el6.src.rpm
-iproute-2.6.32-32.el6_5.src.rpm
+initscripts-9.03.46-1.el6_6.1.src.rpm
+ipmitool-1.8.11-20.el6.src.rpm
+iproute-2.6.32-33.el6_6.src.rpm
-kernel-2.6.32-504.1.2.el6.src.rpm
+kernel-2.6.32-504.el6.src.rpm
-ovirt-node-3.0.1-19.el6.18.src.rpm
-ovirt-node-plugin-vdsm-0.1.1-26.el6ev.src.rpm
+ovirt-node-3.0.1-19.el6.22.src.rpm
+ovirt-node-plugin-vdsm-0.1.1-27.el6ev.src.rpm
-tzdata-2014h-1.el6.src.rpm
+tzdata-2014i-1.el6.src.rpm
-vdsm-4.14.17-1.el6ev.src.rpm
+vdsm-4.14.13-2.el6ev.src.rpm
-wget-1.12-5.el6.src.rpm
+wget-1.12-5.el6_6.1.src.rpm

Serial output with rdinitdebug rddebug is attached. If I knew how to generate an rdsosreport manually, I'd attach that, too... Let me know if you need anything else.
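For reference, the manual recovery from the rescue shell was roughly this sequence (a sketch; the map name is the one shown by dmsetup above, and /sysroot is assumed as the usual dracut mount point):

dmsetup table                                  # shows the stale multipath map holding sda
dmsetup remove 1ATA_QEMU_HARDDISK_QM00001      # release the disk
mount /dev/sda3 /sysroot                       # now succeeds, and boot can continue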
When installing via the TUI, the kernel still panics after reboot, with the same error output.
(In reply to Ryan Barry from comment #3)
> This is unexpected behavior when booting with rd_NO_DM, and a change from
> the last image. An SRPM diff is below, and it's in one of these packages:

So, dracut gets triggered by the existence of /dev/disk/by-label/Root.

If I understand it correctly, multipath stole /dev/sda and does not provide partitions for the multipath disk.

Is multipath wanted for the disk? If not, I suggest adding rd_NO_MULTIPATH to the kernel command line.
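For illustration, assuming the same auto-install arguments as in the description, the boot line would then be something like:

storage_init=ata firstboot rd_NO_MULTIPATH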
If multipath _is_ wanted, please check that the initramfs image contains 40-multipath.rules:

# lsinitrd <image> | grep 40-multipath.rules

If the initramfs contains this file, please reassign the bug to device-mapper-multipath, because these rules should run kpartx, which should provide the partitions.
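If it helps, one way to pull that rule file out of the initramfs for a closer look is a plain cpio extract (a sketch, assuming a gzip-compressed cpio image; the initrd path is illustrative):

mkdir /tmp/initrd && cd /tmp/initrd
zcat /path/to/initrd0.img | cpio -idmv 'etc/udev/rules.d/40-multipath.rules'
cat etc/udev/rules.d/40-multipath.rules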
(In reply to Harald Hoyer from comment #5)
> (In reply to Ryan Barry from comment #3)
> > This is unexpected behavior when booting with rd_NO_DM, and a change from
> > the last image. An SRPM diff is below, and it's in one of these packages:
>
> So, dracut gets triggered by the existence of /dev/disk/by-label/Root.
>
> If I understand it correctly, multipath stole /dev/sda and does not provide
> partitions for the multipath disk.
>
> Is multipath wanted for the disk? If not, I suggest adding rd_NO_MULTIPATH
> to the kernel command line.

We use rd_NO_MULTIPATH for installing, but the actual boots exclude it.

Your understanding is spot-on: multipath stole /dev/sda, but doesn't provide partitions for the multipath disk. Experimentally, running kpartx from the dracut shell correctly handles this.

This is not reproducible in earlier images. The big change that I'm noticing is to 40-multipath.rules: http://gerrit.ovirt.org/#/c/34792/

Is it possible that this is a regression caused by the patch in bug 1148979 comment 12, which we applied to resolve a different issue?
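For reference, the manual call from the dracut shell was along these lines (a sketch - the device name is the one from comment 3, and "-a -p p" matches what the multipath rule would run):

kpartx -a -p p /dev/mapper/1ATA_QEMU_HARDDISK_QM00001
ls /dev/mapper/    # the partition maps (...QM00001p1, p2, p3) should now appear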
Maybe Peter also knows something about this issue; he was also involved in bug 1148979.
(In reply to Fabian Deutsch from comment #10)
> Maybe Peter also knows something about this issue; he was also involved in
> bug 1148979.

When dropped to the dracut debug shell, what is the udev db content (udevadm info --export-db)? Can you attach it here?

There should be an mpath device set up with the DM_ACTIVATION=1 variable set - this is the variable based on which kpartx is then triggered... Let's check if this is so.
Created attachment 956680 [details]
Booted with udev debug output

Peter, I cannot reproduce this on my local machine, but I've got some logs from a previous attempt - do they help as well? (See attachment.)
This could also be a dupe of bug 1161520
(In reply to Fabian Deutsch from comment #14)
> Created attachment 956680 [details]
> Booted with udev debug output
>
> Peter, I cannot reproduce this on my local machine, but I've got some logs
> from a previous attempt - do they help as well? (See attachment.)

Well, it would be better to grab the complete "udevadm info --export-db". The logs attached mostly contain information about commands executed within udev, where I can see various variables imported from external commands (like blkid or dmsetup), but I can't see the DM_ACTIVATION variable, which is set solely by a udev rule (not imported from any external command).

As such, I can see that kpartx was not triggered, but I don't see why - I think the full udev db content would reveal more, so I can't say from the udev debug logs alone. If you could reproduce once more and grab the udevadm info --export-db output, that would be great.
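A sketch of one way to grab the full db from the dracut shell without mounting anything (so the dm state is left untouched) is to dump it to the serial console and capture it there:

udevadm info --export-db > /dev/console

(This assumes the VM has a serial console attached; writing to a file on a mounted partition works too, but mounting may first require removing the mpath map, which changes the state being debugged.)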
(In reply to Peter Rajnoha from comment #16)
> (In reply to Fabian Deutsch from comment #14)
…
> As such, I can see that kpartx was not triggered, but I don't see why - I
> think the full udev db content would reveal more, so I can't say from the
> udev debug logs alone. If you could reproduce once more and grab the
> udevadm info --export-db output, that would be great.

Right, no problem. I just can't provide it, but I think Ryan can.
Created attachment 956764 [details]
udevadm info --export-db
Workaround: using the "--skip-initramfs" parameter of the edit-node tool when adding the plugin to the ISO, there is no kernel panic on boot after installing this ISO on a host:

edit-node --repo=edit-node.repo --install=ovirt-node-plugin-puppet-3.1.0-0.25.20141107gitf6dc7b9.el6.noarch.rpm --skip-initramfs rhev-hypervisor6-6.6-20141107.0.iso

So we can guess that the root cause is in the "_rebuild_initramfs" function of the edit-node tool code.
(In reply to Ryan Barry from comment #18)
> Created attachment 956764 [details]
> udevadm info --export-db

This log shows that the mpath device was not even set up in this case - there's no dm-* device present. (However, in the previous case, the log from comment #14 shows that there is at least one dm device - presumably the mpath one, judging by the commands executed by udev - just missing the kpartx call.) So it also seems to be a different case from comment #3.

Was the reproducing environment exactly the same as in comment #3?
AFAIK yes, he uses a VM.
Created attachment 957189 [details]
udevadm info console
(In reply to Peter Rajnoha from comment #20)
> (In reply to Ryan Barry from comment #18)
> > Created attachment 956764 [details]
> > udevadm info --export-db
>
> This log shows that the mpath device was not even set up in this case -
> there's no dm-* device present. (However, in the previous case, the log
> from comment #14 shows that there is at least one dm device - presumably
> the mpath one, judging by the commands executed by udev - just missing the
> kpartx call.) So it also seems to be a different case from comment #3.
>
> Was the reproducing environment exactly the same as in comment #3?

Same virtual machine, same cmdline. The only difference in the environment is that once I hit the dracut shell, I manually stopped the dm devices (which was necessary to mount a partition to save the udev info). I don't know if that would affect the output.

I attached a log from the serial console with the DM device left active.
What do you mean by "manually stopped the dm devices"? I'm trying to figure out if the multipath device got removed, or was never created.
(In reply to Ryan Barry from comment #3)
> /dev/mapper/1ATA_QEMU_HARDDISK_QM00001 is present, but none of its
> partitions are.
>
> /dev/sda3 is present, but cannot be mounted.
>
> dmsetup table shows:
> 1ATA_QEMU_HARDDISK_QM00001: 0 16777216 multipath 0 0 1 1 round-robin 0 1 1
> 8:0 1
>
> Removing it (dmsetup remove 1ATA...) lets sda3 mount, and booting continues
> as normal.

The part quoted above is relevant. The multipath device was created, but none of its partitions were. It grabbed sda, and I wasn't able to mount any partitions on sda (to save the udevadm info --export-db output) without "dmsetup remove".

A devicemapper device is created for the disk, but not for any of its partitions. /dev/disk/by-label/Root points to sda3, but sda3 cannot be mounted since multipath is holding onto the disk. It should point to 1ATA_QEMU_HARDDISK_QM00001p3, but that device does not exist.

The most recent attachment (udevadm info console) did not have this step taken, since I grabbed the output by attaching a serial console to the VM and didn't need to mount any partitions to save it.
(In reply to Ryan Barry from comment #22)
> Created attachment 957189 [details]
> udevadm info console

Hmm, the variable based on which kpartx should trigger is there:

P: /devices/virtual/block/dm-0
N: dm-0
...
S: mapper/1ATA_QEMU_HARDDISK_QM00001
...
E: MAJOR=253
E: MINOR=0
E: DEVNAME=/dev/dm-0
E: DEVTYPE=disk
E: SUBSYSTEM=block
...
E: DM_ACTIVATION=1
...
E: ID_PART_TABLE_TYPE=gpt
...

Also, it's clear there's a partition table header (blkid gives E: ID_PART_TABLE_TYPE=gpt). So for some reason the kpartx call is skipped; I'll try to recheck the rules...
(In reply to Peter Rajnoha from comment #26)
> (In reply to Ryan Barry from comment #22)
> > Created attachment 957189 [details]
> > udevadm info console
>
> Hmm, the variable based on which kpartx should trigger is there:

This one exactly:

> E: DM_ACTIVATION=1
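As a quick cross-check from a live dracut shell (a sketch - dm-0 is assumed to be the mpath device, as in the attached db), the relevant variables can be queried for just that device:

udevadm info --query=all --name=/dev/dm-0 | grep -e DM_ACTIVATION -e ID_PART_TABLE_TYPE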
For completeness, the difference between a working and non-working image is exactly the rule which was introduced with bug 1148979.

(a: does not work, b: works)

diff -ur a/etc/udev/rules.d/40-multipath.rules b/etc/udev/rules.d/40-multipath.rules
--- a/etc/udev/rules.d/40-multipath.rules	2014-11-13 16:21:38.965409873 +0100
+++ b/etc/udev/rules.d/40-multipath.rules	2014-11-13 16:22:17.326373921 +0100
@@ -20,5 +20,5 @@
 ENV{DM_UUID}!="mpath-?*", GOTO="end_mpath"
 ENV{DM_SUSPENDED}=="1", GOTO="end_mpath"
 ENV{DM_ACTION}=="PATH_FAILED", GOTO="end_mpath"
-ENV{DM_ACTIVATION}==1, RUN+="$env{MPATH_SBIN_PATH}/kpartx -a -p p $tempnode"
+RUN+="$env{MPATH_SBIN_PATH}/kpartx -a -p p $tempnode"
 LABEL="end_mpath"
(In reply to Fabian Deutsch from comment #28)
> For completeness, the difference between a working and non-working image is
> exactly the rule which was introduced with bug 1148979.
>
> (a: does not work, b: works)
>
> diff -ur a/etc/udev/rules.d/40-multipath.rules
> b/etc/udev/rules.d/40-multipath.rules
> --- a/etc/udev/rules.d/40-multipath.rules 2014-11-13 16:21:38.965409873 +0100
> +++ b/etc/udev/rules.d/40-multipath.rules 2014-11-13 16:22:17.326373921 +0100
> @@ -20,5 +20,5 @@
> ENV{DM_UUID}!="mpath-?*", GOTO="end_mpath"
> ENV{DM_SUSPENDED}=="1", GOTO="end_mpath"
> ENV{DM_ACTION}=="PATH_FAILED", GOTO="end_mpath"
> -ENV{DM_ACTIVATION}==1, RUN+="$env{MPATH_SBIN_PATH}/kpartx -a -p p $tempnode"

(...I'm thinking whether that should be ENV{DM_ACTIVATION}=="1" instead of ...==1 (no quotes))
Yes, it's the quoting :) So it should be ENV{DM_ACTIVATION}=="1"
(so just a typo when copying the original patch from bug #1148979)
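For clarity, the corrected line in 40-multipath.rules (matching the intent of the original patch from bug 1148979) would read:

ENV{DM_ACTIVATION}=="1", RUN+="$env{MPATH_SBIN_PATH}/kpartx -a -p p $tempnode"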
Created attachment 958409 [details]
console_output_after_autoinstall_reboot

Tested on:
rhev-hypervisor6-6.6-20141113.0.iso
ovirt-node-plugin-puppet-3.0.1-19.el6.23.noarch.rpm
ovirt-node-tools-3.0.1-19.el6.23.noarch

Test steps:
Auto-install with "bootif= adminpw= storage_init= firstboot".

After reboot, the kernel still panics.
I would be very surprised if 6.6-20141113 worked, since it does not include the fix attached to this bug. Fabian, was that the intended build?
Tested on:
rhev-hypervisor6-6.6-20141114.0.el6ev
ovirt-node-plugin-puppet-3.0.1-19.el6.23.noarch.rpm
ovirt-node-tools-3.0.1-19.el6.23.noarch

Test steps:
Auto-install with "bootif= adminpw= storage_init= firstboot".

After reboot, I can successfully log in; the issue is fixed in this build.
Fabian, according to comment 37, the patch does fix this issue, so we need to backport it to 3.4.z.
Tested on:
rhev-hypervisor6-6.6-20141119.0.el6ev
ovirt-node-plugin-puppet-3.0.1-19.el6.23.noarch.rpm
ovirt-node-tools-3.0.1-19.el6.23.noarch

Test steps:
Auto-install with "bootif= adminpw= storage_init= firstboot".

After reboot, I can successfully log in and the puppet plugin shows in the TUI menu; the issue is fixed in this build.
Correcting the ovirt-node-plugin-puppet and ovirt-node-tools versions from comment 41; they should be:
ovirt-node-plugin-puppet-3.0.1-19.el6.24.noarch.rpm
ovirt-node-tools-3.0.1-19.el6.24.noarch.rpm

Tested on:
rhev-hypervisor6-6.6-20141119.0.el6ev
ovirt-node-plugin-puppet-3.0.1-19.el6.24.noarch.rpm
ovirt-node-tools-3.0.1-19.el6.24.noarch.rpm

Test steps:
Auto-install with "bootif= adminpw= storage_init= firstboot".

After reboot, I can successfully log in and the puppet plugin shows in the TUI menu; the issue is fixed in this build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0160.html