Created attachment 947164 [details]
dracut.png

Description of problem:
RHEV-H 7 fails to boot and reports "Dracut: FATAL: Failed to mount block device of live image, System halted"

Only the info below is visible on the monitor; I can't provide more log info for debugging because I can't log in to the hypervisor.
=============================================================
Dracut-initqueue[424]: mount: /dev/sda3 is already mounted or /run/initramfs/live busy
Dracut: FATAL: Failed to mount block device of live image
Dracut: Refusing to continue
System halted.

This issue often happens on the machines below:
Dell OPTIPLEX 9010
Dell OPTIPLEX 790
hp-z220 (8G CORE4 ENT01 EPT INTEL QEPDU14 SINGLE VMX)

No such issue on VMs.

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.0-20141006.0.el7ev
ovirt-node-3.1.0-0.20.20141006gitc421e04.el7.noarch.rpm
dracut-033-161.el7.x86_64
kernel-3.10.0-123.8.1.el7.x86_64

How reproducible:
70%

Steps to Reproduce:
1. Install the RHEV-H 7.0 build on one of the machines above.
2. Reboot RHEV-H.

Actual results:
RHEV-H 7 fails to boot and reports "Dracut: FATAL: Failed to mount block device of live image, System halted"

Expected results:
Booting RHEV-H succeeds on all machines.

Additional info:
We can reproduce this issue on 70%+ of the test machines, so I am raising the priority to Urgent to draw more attention to it.
Could you please try to gather more logs from dracut? Please use the rdsosreport tool.
Created attachment 947399 [details]
rdsosreport.txt

Added "rd.debug single" to the grub command line; booting the hypervisor then generates rdsosreport.txt at /run/initramfs/rdsosreport.txt.
Harald, do you have an idea why this could happen? It seems to be happening inside dracut, without any of our code being involved.
Ah, multipath! /sbin/dmsquash-live-root /dev/sda3 is called. So somehow one of the udev rules:

KERNEL=="disk/by-label/Root", RUN+="/sbin/initqueue --settled --onetime --unique /sbin/dmsquash-live-root $env{DEVNAME}"
SYMLINK=="disk/by-label/Root", RUN+="/sbin/initqueue --settled --onetime --unique /sbin/dmsquash-live-root $env{DEVNAME}"

was triggered by sda3 before it was added to the multipath device:

/dev/mapper/ST500DM002-1BD142_W2ACE2N6p3: LABEL="Root" UUID="5b9d68a2-2125-4a81-9684-652967d945e6" TYPE="ext2" PARTLABEL="primary" PARTUUID="8a0b5636-8293-492d-93fe-27ee1b70f87e"

/dev/disk/by-label:
lrwxrwxrwx 1 root 0 10 Oct 16 03:06 Root -> ../../dm-3
Proposed patch: http://git.kernel.org/cgit/boot/dracut/dracut.git/commit/?id=d829e7fce273e9dbd8a71dbf71612556331f28fa
Sorted out on IRC, now the request:
1. boot into grub
2. add rdshell to the cmdline and remove the quiet arg
3. boot
4. get rdsosreport and attach it to this bug
Oops. The last comment was actually for a different bug.
Created attachment 953922 [details]
rdsosreport(1030build).txt

Please refer to the new attachment rdsosreport(1030build).txt. Thanks!
Still encountering this issue.

Test version:
rhev-hypervisor7-7.0-20141107.0
ovirt-node-3.1.0-0.25.20141107gitf6dc7b9.el7.noarch
device-mapper-1.02.84-14.el7.x86_64
dracut-033-161.el7_0.173.x86_64

Command line:
BOOT_IMAGE=/vmlinuz0 root=live:LABEL=Root ro rootfstype=auto rootflags=ro ksdevice=bootif rd.lvm=0 rd.dm=0 elevator=deadline lang= max_loop=256 rd.md=0 rd.live.check rd.luks=0 rd.live.image crashkernel=128M rd.debug single console=ttyS0,115200n8

dracut:/# blkid
/dev/sda3: LABEL="Root" UUID="34a23678-013c-41cc-8b15-d741ab4beb33" TYPE="ext2" PARTLABEL="primary" PARTUUID="c8be010a-f7bd-4439-80a7-fa129fd6dcae"
/dev/sda2: LABEL="RootBackup" UUID="11f00a8e-7977-45fe-9c66-37543738ad27" TYPE="ext2" PARTLABEL="primary" PARTUUID="51e2c4b0-4efa-455d-a762-dea891d9d2a9"
/dev/sda4: UUID="UVqdV1-141O-yQUE-tsc0-ruhh-DcFp-jhTsXc" TYPE="LVM2_member" PARTLABEL="primary" PARTUUID="405c1e02-8b0e-41c5-886c-3e17df1c88c7"
/dev/mapper/TOSHIBA_DT01ACA100_33A7TVBMS2: LABEL="RootBackup" UUID="11f00a8e-7977-45fe-9c66-37543738ad27" TYPE="ext2" PARTLABEL="primary" PARTUUID="51e2c4b0-4efa-455d-a762-dea891d9d2a9"
/dev/mapper/TOSHIBA_DT01ACA100_33A7TVBMS3: LABEL="Root" UUID="34a23678-013c-41cc-8b15-d741ab4beb33" TYPE="ext2" PARTLABEL="primary" PARTUUID="c8be010a-f7bd-4439-80a7-fa129fd6dcae"
/dev/mapper/TOSHIBA_DT01ACA100_33A7TVBMS4: UUID="UVqdV1-141O-yQUE-tsc0-ruhh-DcFp-jhTsXc" TYPE="LVM2_member" PARTLABEL="primary" PARTUUID="405c1e02-8b0e-41c5-886c-3e17df1c88c7"
/dev/mapper/TOSHIBA_DT01ACA100_33A7TVBMS: PTTYPE="gpt"
Created attachment 956265 [details] console_output log
Created attachment 956266 [details] rdsosreport_rhevh720141107.txt
Created attachment 958417 [details]
rdsosreport-1113.txt

(In reply to Fabian Deutsch from comment #22)
> Hey, could you please check if this build fixes the problem:
> 
> https://brewweb.devel.redhat.com/buildinfo?buildID=398693
> 
> This build contains the complete udev fix.

Test version: rhev-hypervisor7-7.0-20141113.0
Test machine: Dell OPTIPLEX 9010
Test result: Still hit the kernel panic issue. Please refer to the new attachment "rdsosreport-1113.txt" for more info.

Thanks!
Created attachment 958479 [details]
rdsosreport-1114.txt

(In reply to Fabian Deutsch from comment #24)
> I apologize, I used the incorrect brew link:-/
> 
> Please use:
> 
> rhev-hypervisor7-7.0-20141114.0.iso
> https://brewweb.devel.redhat.com/taskinfo?taskID=8250792

Test version: rhev-hypervisor7-7.0-20141114.0
Test machines: Dell OPTIPLEX 9010, Dell 790
Test result: Still hit the kernel panic issue. Please refer to the new attachment "rdsosreport-1114.txt" for more info.

Thanks!
Created attachment 958891 [details]
rdsosreport-1118.txt

Test version: rhev-hypervisor7-7.0-20141118.0.iso
How reproducible: 30%
Test machine: Dell 9010
Test result: Still hit the kernel panic issue, with a 30% reproduction rate. Please refer to attachment "rdsosreport-1118.txt" for more details.

Thanks!
Thanks, Chen Shao. Could you please check whether this bug also exists with a stock RHEL 7 livecd, i.e. the boot iso?
Harald, do you see anything suspicious in the rdsosreport in comment 2?
If this is still reproducible with the build in comment 33, please provide details on the hosts:
1. Exact name/manufacturer/type
2. Storage hardware
(In reply to Fabian Deutsch from comment #34)
> If this is still reproducable with the build in comment 33, please provide
> details on the hosts:

Yes, it can still be reproduced, with a low reproduction rate (<20%).

> 1. Exact name/manufacturer/typee
> 2. Storage hardware

Test machine: Dell 9010
Test version:
rhev-hypervisor7-7.0-20141119.0.iso
ovirt-node-3.1.0-0.27.20141119git24e087e.el7.noarch

DMI: Dell Inc. OptiPlex 9010/0H5XPC, BIOS A14 06/11/2013
SMBIOS 2.7 present.

TOSHIBA_DT01ACA100_33A7TVBMS dm-0 ATA ,TOSHIBA DT01ACA1
size=932G features='0' hwhandler='0' wp=rw

Please see attachment "dmesg+cpu+lspci" for more hardware info. Thanks!
Created attachment 959245 [details] rdsosreport-1119.txt
Created attachment 959246 [details] hardware-info.tar.gz
Test version:
rhev-hypervisor7-7.0-20141120.0.iso
ovirt-node-3.1.991-0.0.master.el7.centos.noarch

Download link:
http://download.devel.redhat.com/brewroot/work/tasks/6476/8276476/rhev-hypervisor7-7.0-20141120.0.iso

Test machine: dell-per515-01

I tested several times and didn't hit the system halted issue.

@ycui, how about the Dell 9010?
A build with the final fix is not yet available. So this bug can probably still be reproduced.
> @ycui,
> 
> How about dell 9010?

I tested the build (rhev-hypervisor7-7.0-20141120.0) Ryan provided this morning 3 times; the system halted once on the Dell 9010 machine. So the bug is still here.

------ screen output/display ----
Dracut-initqueue[416]: mount: /dev/sda3 is already mounted or /run/initramfs/live busy
Dracut: FATAL: Failed to mount block device of live image
Dracut: Refusing to continue
System halted.
(In reply to Ying Cui from comment #47)
> > @ycui,
> > 
> > How about dell 9010?
> 
> Tested the build(rhev-hypervisor7-7.0-20141120.0) Ryan provided it this
> morning, test 3 times, 1 time system halted on dell 9010 machine. So the bug
> is still here.
> 
> ------screen output/display----
> Dracut-initqueue[416]: mount:/dev/sda3 is already mounted or
> /run/initramfs/live busy
> Dracut: FATAL:Failed to mount block device of live image
> Dracut: Refusing to continue
> System halted.

Can we grab a new rdsosreport?
Additional info:
1. After communicating with cshao about this bug today: we cannot reproduce it on QE's Dell servers (r515 and r510), so we suspect this bug only occurs on desktops and workstations.
2. After RHEV-H 7 is installed, sometimes I can boot into it correctly with no issue; rebooting the host again and again, the bug sometimes happens. So the workaround is to reboot the RHEV-H 7 host repeatedly until you reach the login page. (In today's testing, rebooting the host 3 times hit the bug once.)

Just for reference.
Please provide all the rdsosreports from all failures. Otherwise we will not be able to nail this bug down.
Created attachment 960054 [details] rdsosreport_20141120.txt
Ying, could you please try the following: Please install RHEV-H with the nompath keyword. Please make sure that it's used during installation and on regular boots.
Workaround for now: boot with "nompath" on the kernel command line.

This has to be fixed in the multipath/udev integration. At no point should udev be settled while the original device still has the /dev/disk symlink. Asynchronously claiming devices for multipath only leads to race conditions, and I will never add a "sleep 10" just to be sure multipathd has kicked in and the device symlinks have been rerouted.
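To make the workaround concrete, a grub kernel line with the flag appended would look roughly like this; the arguments are abbreviated from the command line quoted earlier in this bug, and only the trailing "nompath" is new (a sketch, not the full stanza):

```
# Abbreviated grub entry (sketch); only the final "nompath" is added.
linux /vmlinuz0 root=live:LABEL=Root ro rootfstype=auto rd.live.image nompath
initrd /initrd0.img
```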
Created attachment 961044 [details] fc-lun
Comment 59 confirms that the workaround works, thus it seems to be an issue with device-mapper-multipath.
According to our current understanding the problem is as follows:

During boot, a race occurs between simple udev devices (vd*, sd*, hd*) and assembled mpath devices (/dev/mapper/mpath*). The assembly of the multipath devices happens asynchronously, as noted in comment 57; thus, sometimes they get picked up by the boot logic, but sometimes the mpath devices are not ready in time and the simple udev devices are used instead.

If a simple udev device is used, the described bug appears once the multipath device gets assembled, because from that point on the simple udev device is no longer accessible.

The workaround, until the race has been fixed, is to disable multipath in the initrd using "nompath". The side effect is that no multipath is available for the rootfs.

For bug verification: please verify this bug by verifying the workaround (appending "nompath" to the kernel cmdline).
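The race described above can be sketched as a deterministic toy model. The names here are hypothetical (pick_root_device is not the actual dracut code); the point is only that whichever device publishes the Root label first wins, and if the raw partition wins, dm claims it later and the mount fails:

```python
# Toy model of the boot-time race (hypothetical; not real dracut/udev code).
# Events are (device, filesystem label) pairs in the order udev sees them.

def pick_root_device(events):
    """Return the first device carrying the Root label, in event order."""
    for dev, label in events:
        if label == "Root":
            return dev
    return None

# If the raw partition shows up before multipath assembles, boot picks
# /dev/sda3, which dm later claims, producing the FATAL mount error.
raw_first = [("/dev/sda3", "Root"), ("/dev/mapper/mpatha3", "Root")]
# If multipath wins the race, the mpath device is used and boot succeeds.
mpath_first = [("/dev/mapper/mpatha3", "Root"), ("/dev/sda3", "Root")]

print(pick_root_device(raw_first))    # /dev/sda3
print(pick_root_device(mpath_first))  # /dev/mapper/mpatha3
```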
Moving this bug back to MODIFIED; we will wait for the platform bug 1167620 fix first, then build a new RHEV-H image to check this bug.
Doc text is updated for 3.5 Beta 5. Please make sure you update the doc text for the GA release note once the bug is fixed, or set the 'requires_release_note' flag to '-'.

Cheers,
Julie
Created attachment 963681 [details] rdsosreport_20141201.txt for comment 67
(In reply to Ying Cui from comment #68)
> Created attachment 963681 [details]
> rdsosreport_20141201.txt for comment 67

Hi ycui,

Thank you, I appreciate your help. I am cancelling the needinfo flag accordingly.

Thanks!
Created attachment 965011 [details] iscsi-vs-fc.png
Regarding comment 71, this should have been a very small change which didn't go anywhere near the installer code, but I'll grab the image and see if I can reproduce.

Ben - after booting an image with "find_multipaths yes" in multipath.conf in the initrd, there appears to be a behavior change on images. dmsetup shows multipath grabbing an iscsi device:

HostVG-Logging: 0 4194304 linear 8:20 8128512
HostVG-Swap: 0 8110080 linear 8:20 2048
HostVG-Data: 0 11517952 linear 8:20 12322816
WDC_WD2502ABYS-18B7A0_WD-WCAT19558392: 0 488281250 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
live-base: 0 3145728 linear 7:1 0
36090a038d0f731381e035566b2497f85: 0 62914560 multipath 0 0 1 1 service-time 0 1 2 8:32 1 1
HostVG-Config: 0 16384 linear 8:20 8112128
live-rw: 0 3145728 snapshot 7:1 7:2 P 8

And multipath -ll shows the same:

WDC_WD2502ABYS-18B7A0_WD-WCAT19558392 dm-6 ATA ,WDC WD2502ABYS-1
size=233G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:0:0 sda 8:0 active ready running
36090a038d0f731381e035566b2497f85 dm-7 EQLOGIC ,100E-00
size=30G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 4:0:1:0 sdc 8:32 active ready running

But there's no matching device in /dev/mapper:

crw-------. 1 root root 10, 236 Dec 5 07:40 control
lrwxrwxrwx. 1 root root 7 Dec 5 07:40 HostVG-Config -> ../dm-3
lrwxrwxrwx. 1 root root 7 Dec 5 07:40 HostVG-Data -> ../dm-5
lrwxrwxrwx. 1 root root 7 Dec 5 07:40 HostVG-Logging -> ../dm-4
lrwxrwxrwx. 1 root root 7 Dec 5 07:40 HostVG-Swap -> ../dm-2
lrwxrwxrwx. 1 root root 7 Dec 5 07:40 live-base -> ../dm-1
lrwxrwxrwx. 1 root root 7 Dec 5 07:40 live-rw -> ../dm-0
lrwxrwxrwx. 1 root root 7 Dec 5 07:41 WDC_WD2502ABYS-18B7A0_WD-WCAT19558392 -> ../dm-6

Any idea why this would be?
I don't understand why multipath would grab those devices in the first place, if find_multipaths was enabled. With find_multipaths enabled, multipath should only create a device if one of three things is true:

1. It sees two different paths with the same wwids. Looking at the multipath -l output:

WDC_WD2502ABYS-18B7A0_WD-WCAT19558392 dm-6 ATA ,WDC WD2502ABYS-1
size=233G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:0:0 sda 8:0 active ready running
36090a038d0f731381e035566b2497f85 dm-7 EQLOGIC ,100E-00
size=30G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 4:0:1:0 sdc 8:32 active ready running

Each of these devices only has one path.

2. multipath is specifically told to run on the device by running (for instance)

# multipath /dev/sda

This overrides find_multipaths. Do you know if this is happening? multipath is never called like this during the regular RHEL boot process. I'm not sure if things work differently in RHEV.

3. The wwid for the device already exists in /etc/multipath/wwids. If this is a stock initramfs, and mpath.wwid wasn't on the kernel command line, I don't see how this would be possible.

As to why no device got created, I have no idea. udev is responsible for that. If you remove the devices and recreate them (this should work since they should now be in the wwids file), do the symlinks get created correctly?

Could you post the initramfs file somewhere I can look at it?
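For reference, enabling the behavior described above takes a single setting in the defaults section of /etc/multipath.conf. This is a minimal sketch; a real config would usually also carry blacklist and device sections:

```
defaults {
    # Only create a multipath map when a second path to the same wwid
    # appears, the device is named explicitly on the command line, or
    # its wwid is already recorded in /etc/multipath/wwids.
    find_multipaths yes
}
```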
Created attachment 965205 [details] mulitpath.conf
Created attachment 965207 [details] wwids
I should have clarified -- this behavior is on a running system, not in the initrd.

I attached multipath.conf and wwids from the running system. find_multipaths is not enabled there, but never has been, to my knowledge. Should it be?

Unfortunately, I don't know what steps QE took to get them to their current state. The only references I see to multipath in the code are "multipath -r" and various incantations of "multipath -ll", never directly "multipath /dev/foo".

The initrd is at:

http://rbarry.org/initrd0.img
(In reply to Ryan Barry from comment #80)
> I should have clarified -- this behavior is on a running system, not in the
> initrd.
> 
> I attached multipath.conf and wwids from the running system.
> find_multipaths is not enabled there, but never has been, to my knowledge.
> Should it be?

With find_multipaths enabled (and the wwids removed from /etc/multipath/wwids) you won't be multipathing these single path devices at all. Do you want them multipathed?

At any rate, the /etc/multipath.conf used in the regular filesystem should match the one used in the initramfs. In comment 76, you said that the initramfs used find_multipaths, so it should be enabled in the regular file system as well.

> Unfortunately, I don't know what steps QE took to get them to their current
> state. The only references I see to multipath in the code are "multipath -r"
> and various incantation so "multipath -ll", never directly "multipath
> /dev/foo"

Without find_multipaths enabled, it makes sense that the multipath devices exist. So the only question is, "why didn't the device symlink get created?" Do you know if /dev/dm-7 exists? Like I said before, this is all handled by udev, and I don't know why it would fail to create the symlink.

> The initrd is at:
> 
> http://rbarry.org/initrd0.img
Created attachment 966113 [details]
/var/log

> Ying, it would be important to get a logfile from such a failed attempt, at
> best:
> all of /var/log and journalctl -b

Attachment for comment 83 and bug 1136300.
Created attachment 966114 [details] journalctl -b for comment 83
/tmp/ovirt.log in attachment 966113 [details] as ovirt-tmp.log
While playing with the device which does not show up as an mpath device, I noticed:

[root@dell-pet105-02 ~]# multipath /dev/sdb
Dec 09 14:57:57 | multipath.conf +5, invalid keyword: getuid_callout
Dec 09 14:57:57 | multipath.conf +37, invalid keyword: getuid_callout
Dec 09 14:57:57 | 36090a038d0f721901d033566b2493f23: ignoring map
[root@dell-pet105-02 ~]# dmesg | tail
[31413.413945] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[66997.450122] systemd-journald[1984]: Vacuuming done, freed 0 bytes
[120186.991804] systemd-journald[1984]: Vacuuming done, freed 0 bytes
[173307.255451] systemd-journald[1984]: Vacuuming done, freed 0 bytes
[226529.492222] systemd-journald[1984]: Vacuuming done, freed 0 bytes
[279556.885257] systemd-journald[1984]: Vacuuming done, freed 0 bytes
[332666.181429] systemd-journald[1984]: Vacuuming done, freed 0 bytes
[360952.614330] NFSD: starting 90-second grace period (net ffffffff819a29c0)
[371632.245637] device-mapper: table: 253:8: multipath: error getting device
[371632.245643] device-mapper: ioctl: error adding target to table
[root@dell-pet105-02 ~]#

Ben - can you tell what these errors are about?

(The first "invalid keyword" problem is tracked here: bug 1172186)
This all makes sense: in RHEL 7, multipath no longer uses "getuid_callout" to grab the wwid. It gets the wwid from the udev database or the udev environment variables (if the value is not in the database yet). By default it uses ID_SERIAL; you can change that with the "uid_attribute" option. Having "getuid_callout" in your config file will trigger a warning, but doesn't harm anything.

Also, running multipath will try to multipath /dev/sdb, since find_multipaths is not enabled in /etc/multipath.conf, and the device is not blacklisted. When you run

# multipath /dev/sdb

that will attempt to multipath /dev/sdb even if find_multipaths is enabled, since you explicitly specified the device. Multipath fails on /dev/sdb because the device is already in use: /dev/sdb3 is currently mounted as /run/initramfs/live and /dev/sdb4 is a PV for a currently active volume group. This is all working as designed.
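As a sketch, overriding the udev attribute multipath reads the wwid from would look like this in /etc/multipath.conf. ID_SERIAL is already the default; this only illustrates the option mentioned above:

```
defaults {
    # multipath takes the wwid from this udev property instead of
    # calling out to an external getuid program (RHEL 7 behavior).
    uid_attribute ID_SERIAL
}
```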
I'm most curious why it's /dev/sdb and not /dev/mapper/36090a038d0f721901d033566b2493f23.

From an earlier spin of EL7-based RHEV-H with an iSCSI root:

# dmsetup table
HostVG-Logging: 0 4194304 linear 253:7 5105664
HostVG-Swap: 0 5087232 linear 253:7 2048
HostVG-Data: 0 6471680 linear 253:7 9299968
QEMU_HARDDISK_QM00001p3: 0 499712 linear 253:0 999424
live-base: 0 3145728 linear 7:1 0
QEMU_HARDDISK_QM00001p2: 0 499712 linear 253:0 499712
QEMU_HARDDISK_QM00001p1: 0 497664 linear 253:0 2048
360000000000000000e00000000040001p3: 0 15775744 linear 253:4 999424
360000000000000000e00000000040001p2: 0 499712 linear 253:4 499712
360000000000000000e00000000040001p1: 0 497664 linear 253:4 2048
HostVG-Config: 0 16384 linear 253:7 5089280
live-rw: 0 3145728 snapshot 7:1 7:2 P 8
QEMU_HARDDISK_QM00001: 0 16777216 multipath 0 0 1 1 service-time 0 1 2 8:0 1 1
360000000000000000e00000000040001: 0 16777216 multipath 0 0 1 1 service-time 0 1 2 8:16 1 1

There are dm devices for the partitions.

Our fear at this point is that we are seeing a problem similar to an earlier one which required a udev patch. Namely, our installer doesn't use Anaconda, and we pull a weighted list of disks. But when multipath claims the disk and we try to partition /dev/sd[ab..], it fails, since the disk is already claimed. With the disk claimed but no device-mapper devices created that we can use for installing or partitioning, a number of operations fail...

So we're basically hoping to either:

1. Have device-mapper claim all the devices, including partitions. This would be ideal, since it wouldn't be a behavior change (and while a behavior change is OK in EL7, we're seeing the same things in 6.6).

2. Use /dev/sd* devices, but not have multipath claim them. Another team handles the multipath.conf on running images (synced from the management engine), but we can probably get them to add find_multipaths...
Summarizing the good and the bad from comment 70 onwards:

Progress:
1. Comment 72: THIS bug is fixed with local disk storage
2. Comment 72: THIS bug is fixed with FC disk storage
3. Comment 72: THIS bug is fixed with iSCSI disk storage

The progress shows that the fix really solved the multipath race we saw before. The race was: before multipathd claimed a raw device, dracut decided to use the raw device (/dev/sdb) for booting. When the final switchroot/mounts took place, the raw device got assembled into a dm mpath device (which meant the raw device was claimed by dm). With the find_multipaths directive, multipath does not assemble the mpath device, so dm does not claim the raw device; that is why we can boot.

A note on the specific machine used for testing: the rootfs resides on a non-multipathed device. This means the OS is only seeing one path to the iSCSI target, so no multipath device should be (and none is) created.

Ying, what is intended here? Should the iSCSI target be reachable by multiple paths or a single path? We need to clarify this to know if everything is discovered correctly.

Regressions:
1. Comment 71: bug 1136300 appears again. Partitioning fails; disks claimed?
2. Comment 72: iSCSI devices seem to not use multiple paths
3. Comment 72: the iSCSI device is missing symlinks in /dev/mapper (it looks like a regular disk)

Possible explanations:
* Regression 1: this can be a race when claiming disks or partitions during the installation.
* Regression 2: this can happen because the target is only accessible via one path. Please check that there are really several paths to the target. Looking at the machine, it looks like there is only a single path, which means nothing is wrong.
* Regression 3: the symlinks to the iSCSI device seem to be missing, which is why it appears as a regular disk. We need to check with the udev team why this could be happening.

Ying/Chen, can you please double check what kind of targets are exported from the EQLOGIC?
Harald, can you tell why the symlinks in /dev/mapper/ were not created for the iSCSI device in use (/dev/sdb)? Please see comment 88 for how to access the machine.
adding back needinfo for harald.
Chen, for comment 91, thanks.
> 2. Comment 72:
> iSCSI devices seem to not use multiple paths

This iSCSI machine uses a hardware iSCSI device, not software iSCSI. The iSCSI devices do seem to use multiple paths; the output is below. But I don't know why the boot LUN (36090a038d0f721901d033566b2493f23) can't be listed by running multipath -ll; it actually could be listed with other RHEV-H builds.

# multipath -ll
Dec 10 07:43:26 | multipath.conf +5, invalid keyword: getuid_callout
Dec 10 07:43:26 | multipath.conf +18, invalid keyword: getuid_callout
Dec 10 07:43:26 | multipath.conf +37, invalid keyword: getuid_callout
WDC_WD2502ABYS-18B7A0_WD-WCAT19558392 dm-6 ATA ,WDC WD2502ABYS-1
size=233G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:0:0 sda 8:0 active ready running
36090a038d0f731381e035566b2497f85 dm-7 EQLOGIC ,100E-00
size=30G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 4:0:1:0 sdc 8:32 active ready running

> Ying/Chen, can you please double check what kind of targets are exported
> from the EQLOGIC?

qla4xxx:

# iscsiadm -m session
qla4xxx: [1] 10.66.90.100:3260,1 iqn.2001-05.com.equallogic:0-8a0906-9021f7d03-233f49b26635031d-s1-gouyang-165404-01 (flash)
qla4xxx: [2] 10.66.90.100:3260,1 iqn.2001-05.com.equallogic:0-8a0906-3831f7d03-857f49b26655031e-s1-gouyang-165404-02 (flash)
(In reply to shaochen from comment #94)
> > 2. Comment 72:
> > iSCSI devices seem to not use multiple paths
> 
> This iscsi machine using hardware iscsi device but not software iscsi.
> Seem iscsi devices is use multiple paths, the output as follows:
> But I don't known why the boot lun(36090a038d0f721901d033566b2493f23) can't
> be listed by run multipath -ll. actually it could be listed with other rhevh
> build.
> 
> # multipath -ll
> Dec 10 07:43:26 | multipath.conf +5, invalid keyword: getuid_callout
> Dec 10 07:43:26 | multipath.conf +18, invalid keyword: getuid_callout
> Dec 10 07:43:26 | multipath.conf +37, invalid keyword: getuid_callout
> WDC_WD2502ABYS-18B7A0_WD-WCAT19558392 dm-6 ATA ,WDC WD2502ABYS-1
> size=233G features='0' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=active
>   `- 0:0:0:0 sda 8:0 active ready running
> 36090a038d0f731381e035566b2497f85 dm-7 EQLOGIC ,100E-00
> size=30G features='0' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=active
>   `- 4:0:1:0 sdc 8:32 active ready running

Yes, but there was something wrong with the other RHEV-H 7.0 builds: they _always_ tried to use multipath, which is wrong. With the build used here, multipaths are only used if the same disk (based on the serial) appears at least twice. That is not the case here: the two disks exposed by the HBA use different serials, so they are different disks; see below.

> > Ying/Chen, can you please double check what kind of targets are exported
> > from the EQLOGIC?
> 
> qla4xxx:
> 
> # iscsiadm -m session
> qla4xxx: [1] 10.66.90.100:3260,1
> iqn.2001-05.com.equallogic:0-8a0906-9021f7d03-233f49b26635031d-s1-gouyang-
> 165404-01 (flash)
> qla4xxx: [2] 10.66.90.100:3260,1
> iqn.2001-05.com.equallogic:0-8a0906-3831f7d03-857f49b26655031e-s1-gouyang-
> 165404-02 (flash)

Yes, I see two sessions, but the devices these sessions point to seem to be different ones.
The serials are different:

[root@dell-pet105-02 ~]# lsblk --nodeps -o name,serial
lsblk: dm-7: failed to get device path
NAME SERIAL
sda  WD-WCAT19558392
sdb  6090a038d0f721901d033566b2493f23
sdc  6090a038d0f731381e035566b2497f85
…

I wonder if the EqualLogic is configured correctly?
(In reply to Fabian Deutsch from comment #91)
> Harald, can you tell why the symlinks into /dev/mapper/ were not created for
> the iSCSI device in use (/dev/sdb)? Please look at comment 88 of how to
> access the machine.

Dec 05 07:40:11 localhost systemd-udevd[367]: timeout '/sbin/multipath -c /dev/sdc'
Dec 05 07:40:11 localhost multipathd[142]: sdc: add path (uevent)
Dec 05 07:40:11 localhost systemd-udevd[368]: timeout '/sbin/multipath -c /dev/sdb'
Dec 05 07:40:11 localhost multipathd[142]: sdb: add path (uevent)

most likely because:

Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4:0:0: Abort command issued cmd=ffff8800cf1f0700, cdb=0x28
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4: qla4xxx_mailbox_command: FAILED, MBOX CMD = 00000015, MBOX STS = 00004005 00000001 00000000 00
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4:0:0: Abort command - failed
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4:1:0: Abort command issued cmd=ffff8801232cd500, cdb=0x28
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4: qla4xxx_mailbox_command: FAILED, MBOX CMD = 00000015, MBOX STS = 00004005 00000001 00000000 01
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4:1:0: Abort command - failed
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4:0:0:0: DEVICE RESET ISSUED.
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi(4:0:0:0): DEVICE RESET SUCCEEDED.
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi4:0:1:0: DEVICE RESET ISSUED.
Dec 05 07:40:11 localhost kernel: qla4xxx 0000:06:01.1: scsi(4:0:1:0): DEVICE RESET SUCCEEDED.
(In reply to Ying Cui from comment #93)
> Chen, for comment 91, thanks.

Can you make comment 91 non-private? I can't see it.
Is the whole kernel log from a bad boot available?
(In reply to Ryan Barry from comment #90) > Have device-mapper claim all the devices, including partitions. This would > be ideal, since it wouldn't be a behavior change (and while a behavior > change is ok in EL7, we're seeing the same things in 6.6). If you'd like multipath to claim every device, I can backport the "-i" option from rhel7. That tells multipath to ignore the wwids file and just look at the blacklist. So, if you run "multipath -i -c <dev>" in the udev rules, multipath will immediately claim all non-blacklisted block devices. There won't be a race here, because multipath will claim the device the moment it sees it.
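A hypothetical udev rule using the backported option described above might look roughly like this. This is a sketch only; the rule file name, match keys, and the surrounding rules in the actual dracut multipath module will differ:

```
# Sketch: claim every non-blacklisted whole SCSI disk for multipath as
# soon as it appears, ignoring the wwids file (-i).
KERNEL=="sd*[!0-9]", ACTION=="add", PROGRAM=="/sbin/multipath -i -c %N", \
    ENV{DM_MULTIPATH_DEVICE_PATH}="1"
```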
Created attachment 967091 [details]
dmesg from failed boot

Chad, here you go with a journalctl -k from a failed boot.
Wrapping up once again: we made some multipath changes (using "find_multipaths yes") and included some dracut fixes, which solved this bug.

QE: is the regression mentioned in comment 71 (bug 1136300) still present? And what other problems do we have with the builds from 1212, which contain all necessary fixes for this bug?
Hi fabiand,

Can we request the iSCSI machine (dell-pet105-02) back for testing our new build (1212)? Thanks!
Yes, please do so.
At first we only suspected the issue was not present on Dell _servers_ (comment 51). As testing progressed in the limited time, QE covered more server testing with the previous 1204 builds (hp-dl388g7-01 and hp-dl385pg8-11, 6 boots each), and none of the servers hit the system halted issue. So we are about 90% sure this issue only happens on desktops and workstations. Based on the present testing, and since this bug reproduces only with some probability, I am lowering the priority to High.
Created attachment 969587 [details] 1212.tar.gz
Created attachment 969801 [details] 1204-vs-1212.png
For this _probabilistic_ bug, based on previous testing (before 1204) and communication with shaochen: desktops and workstations are affected, not servers. Summarizing the QE reproduction rates and test machines:

server:
dell-per510-01 - tested about 10 times, never encountered.
dell-per515-01 - tested about 10 times, never encountered.
hp-dl388g7-01 - tested about 6 times, never encountered.
hp-dl385pg8-11 - tested about 6 times, never encountered.

workstation:
hp-z220 - tested about 3 times, encountered this bug 1152948 2 times.

desktop:
dell 9010 - tested about 20+ times, encountered this bug 1152948 10+ times.
dell 790 - tested about 10 times, encountered this bug 1152948 3 times.
The new patch addresses this issue by explicitly naming the wwid of the root device (if it is a multipath device) on the kernel command line, telling multipath to claim it right away.
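As a sketch of what that looks like, here is an abbreviated kernel line using the mpath.wwid mechanism mentioned earlier in this bug, with the wwid of the iSCSI boot LUN quoted in previous comments; the exact set of surrounding arguments is illustrative only:

```
# Sketch: the root device wwid is passed explicitly so multipath
# claims it immediately, instead of racing with udev at boot.
linux /vmlinuz0 root=live:LABEL=Root rd.live.image \
    mpath.wwid=36090a038d0f721901d033566b2493f23
```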
Test version:
rhev-hypervisor7-7.0-20141218.0.el7ev
ovirt-node-3.1.0-0.37.20141218gitcf277e1.el7.noarch

Test results:

server:
dell-per515-02 - tested about 2 times, no system halted issue.
dell-pet105-02 - tested about 10 times, no system halted issue.

workstation:
hp-z800-02 - tested about 2 times, no system halted issue.

desktop:
dell-790 - tested 3+ times, no system halted issue.
dell 9010 - tested 8+ times, no system halted issue.
hp-5850 - tested about 2 times, no system halted issue.

The bug seems to be fixed with the above build. Thanks!
Moving to VERIFIED according to comment 116.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0160.html