Description of problem:
When discovering and logging in to a target, one LUN is missing. There are 3 LUNs altogether in the target. All LUNs have the same settings. The specific LUN is not visible in vdsm.log nor in engine.log. It happens on 2 separate environments. The LUN is correctly shown in 3.5 RHEVM.

The problem here is multipath/udev/lvm - there is no governor who can tell what should (or shouldn't) be done with a discovered device. So if there are labels on the connected device, each 'part' (project - LVM/md/etc...) handles the connection by itself. This is a new approach in RHEL 7(.2 ?) where this functionality is enabled by default.

In /var/log/messages:

Aug 13 15:06:55 vm-1 systemd: Started LVM2 PV scan on device 8:32.
Aug 13 15:06:55 vm-1 multipathd: sdb: add path (uevent)
Aug 13 15:06:55 vm-1 multipathd: 1ps02: failed in domap for addition of new path sdb
Aug 13 15:06:55 vm-1 multipathd: uevent trigger error

Version-Release number of selected component (if applicable):
vdsm-4.17.2-1.el7ev.noarch
device-mapper-multipath-0.4.9-81.el7.x86_64
rhevm-3.6.0-0.11.master.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. create new storage/iscsi
2. discover/log in to target
3.

Actual results:
One LUN is missing

Expected results:
See all LUNs from the target.

Additional info:
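For anyone hitting the same symptom, a minimal diagnostic sketch (assuming sdb is the affected path, as in the log above; output differs per environment) to confirm that LVM claimed the path before multipath could:

lsblk /dev/sdb              (shows LVs stacked directly on sdb if LVM autoactivated it)
multipath -c /dev/sdb       (asks multipath whether it considers sdb a multipath path)
pvs -o pv_name,vg_name      (lists which PV/VG the event-driven pvscan picked up)

If lsblk shows LVs directly on top of sdb while multipath considers it a valid path, the device was grabbed by LVM first, which matches the "failed in domap" message above.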
(In reply to Pavel Stehlik from comment #0)
> The problem here is multipath/udev/lvm - there is no governor who can tell
> what should (or shouldn't) be done with a discovered device. So if there are
> labels on the connected device, each 'part' (project - LVM/md/etc...)
> handles the connection by itself.

As for LVM (and similar behaviour applies to MD), we have been activating all LVs based on incoming events since RHEL 7.0 (for MD, this started even earlier). If a disk is identified as a PV and we gather all PVs making up a VG, we activate the VG by default unless activation/auto_activation_volume_list or devices/global_filter is set to prevent activation of LVs on incoming PVs.

So to prevent this autoactivation, the default LVM configuration needs to be modified (the global_filter and/or auto_activation_volume_list).

As I discussed this with Pavel (and by looking at the machine itself), disks were imported via iSCSI where multipath mappings were expected to be created on top. However, in one case, the disk contained a PV header with VG metadata and LVM did the autoactivation, which then prevented multipath from creating the mapping since the disk was already used by LVM at that point.

Everything here is event-based - when a disk is added to the system, a udev event is generated and all the scans are triggered within udev rule execution based on this event. In the case of multipath/LVM interaction, multipath needs to be set up so that the disk's WWN is identified by multipath as a multipath component - this requires editing the multipath configuration so that the WWN is known to multipath. Once multipath knows which devices are multipath components, it can properly export this information within udev rule execution (via the multipath -c call) and such information is then respected by others (like LVM), which prevents any other actions on such a device (besides multipath's own actions to create the mapping on top of these components).

However, the problem as indicated here goes beyond multipath/LVM - the imported disk can contain ANY other signature or it can belong to any other subsystem which can trigger similar actions and scanning. Currently, as already mentioned in comment #0, there's no governor which directs, from a global perspective, which devices should be handled and which ones should be ignored. What's missing here is a global configuration which would prevent further actions on imported/attached disks - currently, we need to set up each tool/subsystem/signature handler on its own since each one has a separate configuration.

Currently, with the resources we have now, users have two alternative ways to resolve this (concerning only the event-based actions):

A) either they manually write proper configuration (filters etc.) for each tool/subsystem/handler that can trigger actions on incoming events

B) or they create a custom udev rule where they identify the disk to be ignored for automatic actions and set proper flags which are recognized and respected by each tool's/subsystem's/handler's udev rule firing actions on events

It would be good if we had a tool generating the udev rule for B). This is currently an area which is not fully resolved yet and it will require someone's full attention sooner or later - not per project, but at a scale governing all subsystems (there were some attempts in the past, but none was successful in the end, unfortunately).
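To make option A concrete for the LVM side only, a hedged lvm.conf fragment that rejects a single imported disk (the WWN-based path below is a placeholder; a real setup should match the actual stable symlink of the LUN):

devices {
    # reject the imported iSCSI disk so the event-driven pvscan ignores it,
    # accept everything else; the wwn path is a placeholder
    global_filter = [ "r|^/dev/disk/by-id/wwn-0xexample.*|", "a|.*|" ]
}

Equivalent per-subsystem settings would still be needed for MD and anything else that scans signatures, which is exactly the per-tool burden described above.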
Liron, as QA contact, please take a look. Thanks.
Pavel, if you destroy the missing LUN's lvm/partition metadata - for example:

dd if=/dev/zero of=/dev/sdb bs=1M count=1

This should solve your issue - right?

Ben, can we do anything else to make multipath take over the LUN? Maybe add the LUN wwid to /etc/multipath/wwids?

I guess it will be hard to orchestrate the various subsystems, because each subsystem is trying to do the right thing (from its point of view).

From the point of view of vdsm, we need a simple way to find such devices, and to force the system to use the device in the way that fits the user trying to add the device to a storage domain. Currently we do not find such devices, because we use only multipath devices, so the user cannot do anything via the ovirt/rhev gui.
Adding the device wwids to /etc/multipath/wwids, or filtering the scsi devices with lvm.conf will make multipath grab them. Otherwise, lvm will grab them first.
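For the first option, a hedged sketch (the wwid is a placeholder for the affected LUN's wwid):

multipath -a <wwid_of_the_lun>      (records the wwid in /etc/multipath/wwids)
multipath                           (attempts to create maps for all eligible paths)

If LVM already activated LVs on the path, they would have to be deactivated first (vgchange -an <vg>) before multipath can set up the map on top of it.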
Pavel, can you check my questions in comment 6?
(In reply to Peter Rajnoha from comment #3)
> (In reply to Pavel Stehlik from comment #0)
> If a disk is identified as a PV and we gather all PVs making up a VG, we
> activate the VG by default unless activation/auto_activation_volume_list or
> devices/global_filter is set to prevent activation of LVs on incoming PVs.
>
> So to prevent this autoactivation, the default LVM configuration needs to be
> modified (the global_filter and/or auto_activation_volume_list).

If we disable auto activation globally, will it break booting from a LUN with lvm volumes?

It looks like the lvm configuration is too smart; we need to configure lvm to do nothing unless we (vdsm) ask it.
(In reply to Nir Soffer from comment #9)
> Pavel, can you check my questions in comment 6?

Hi Nir, indeed. The metadata are decisive.
btw: I'm not monitoring hundreds of emails from BZ in my mailbox - but NE works :)
(In reply to Nir Soffer from comment #10)
> (In reply to Peter Rajnoha from comment #3)
> > (In reply to Pavel Stehlik from comment #0)
> > If a disk is identified as a PV and we gather all PVs making up a VG, we
> > activate the VG by default unless activation/auto_activation_volume_list or
> > devices/global_filter is set to prevent activation of LVs on incoming PVs.
> >
> > So to prevent this autoactivation, the default LVM configuration needs to be
> > modified (the global_filter and/or auto_activation_volume_list).
>
> If we disable auto activation globally, will it break booting from a LUN with
> lvm volumes?

As for the initramfs, dracut activates the LV on which the root FS is present no matter what the auto_activation_volume_list setting is. For all the other volumes activated during boot (after we switch to the root fs), auto_activation_volume_list is honoured (because all the other non-initramfs bootup scripts call vg/lvchange -aay instead of -ay).

> It looks like the lvm configuration is too smart; we need to configure lvm
> to do nothing unless we (vdsm) ask it.

Just make sure that auto_activation_volume_list contains only the VGs/LVs needed during boot (or none if only the root fs is on LVM and not /home or /var or anything else). Then when VDSM activates VGs or LVs, make sure you're using -ay, not -aay, with vg/lvchange.

Simply, what gets autoactivated is controlled by auto_activation_volume_list - this is honoured by the event-based LVM autoactivation as well as all the (after initramfs) bootup scripts.
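A short illustration of the -ay / -aay distinction described above (the VG names are placeholders):

vgchange -aay boot_vg       (autoactivation mode: honours auto_activation_volume_list, used by boot scripts and event-based pvscan)
vgchange -ay guest_vg       (plain activation: ignores the list, so this is the mode a management daemon like vdsm should use when it activates VGs explicitly)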
Isn't that a rather severe bug, making direct LUNs pretty much unusable in oVirt 4.0? Not only is the fact that some LUNs simply don't show up a source of major confusion (it took me quite some time to find this bug report), but having them disappear after a while seems to be a surefire way to ruin someone's day: have a VM with a disk backed by a direct LUN, eventually write an LVM PV header to it, and suddenly you can't migrate the VM or start it on another host anymore since it won't be able to see the LUN. Reboot the host the LUN was initially added to and now you've "lost" your disk completely and won't be able to restart the VM until you apply some manual hacks to make LVM not grab it first.

That's not exactly an exotic use-case either, IMHO. At least for me it's a show-stopper bug that will keep me from updating any 3.6 installations to 4.0 since we have dozens of VMs that would be affected by this.
(In reply to Markus Oswald from comment #13)
> Isn't that a rather severe bug, making direct LUNs pretty much unusable in
> oVirt 4.0?

This issue is not related to ovirt-4.0 in any way; it exists with any ovirt version on RHEL 7.
(In reply to Peter Rajnoha from comment #12)
> (In reply to Nir Soffer from comment #10)
> > (In reply to Peter Rajnoha from comment #3)
> > > (In reply to Pavel Stehlik from comment #0)
> > > If a disk is identified as a PV and we gather all PVs making up a VG, we
> > > activate the VG by default unless activation/auto_activation_volume_list or
> > > devices/global_filter is set to prevent activation of LVs on incoming PVs.
> > >
> > > So to prevent this autoactivation, the default LVM configuration needs to be
> > > modified (the global_filter and/or auto_activation_volume_list).
> >
> > It looks like the lvm configuration is too smart; we need to configure lvm
> > to do nothing unless we (vdsm) ask it.
>
> Just make sure that auto_activation_volume_list contains only the VGs/LVs needed
> during boot (or none if only the root fs is on LVM and not /home or /var or
> anything else). Then when VDSM activates VGs or LVs, make sure you're
> using -ay, not -aay, with vg/lvchange.

We cannot tell which VGs/LVs are needed during boot. This is not about configuring a single machine, but about configuring any machine vdsm is installed on.

All devices on a hypervisor belong to vdsm, and may be used by a VM. The system should not touch any device it was not instructed to touch.

The behavior that we need is:
- lvm does not auto activate anything, unless the admin activates it manually
- all scsi devices belong to multipath
- the system should still boot :-)

Can we do this by having an empty auto_activation_volume_list?

Or maybe the best way is to override the udev rule responsible for auto activation?
(In reply to Nir Soffer from comment #15)
> All devices on a hypervisor belong to vdsm, and may be used by a VM. The system
> should not touch any device it was not instructed to touch.
>
> The behavior that we need is:
> - lvm does not auto activate anything, unless the admin activates it manually
> - all scsi devices belong to multipath
> - the system should still boot :-)
>
> Can we do this by having an empty auto_activation_volume_list?

Yes, you can disable any LVM autoactivation by using activation/auto_activation_volume_list = [] in your lvm configuration.

> Or maybe the best way is to override the udev rule responsible for auto
> activation?

No! Please do not edit any of the LVM/DM udev rules - they're not intended for custom changes (except 12-dm-permissions.rules, which is for changing permissions).
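For reference, a minimal lvm.conf fragment for that empty-list case, plus one way to verify the running configuration (note that the root LV is still activated by dracut in the initramfs regardless of this setting, as explained above):

activation {
    # nothing is autoactivated on events or by the post-initramfs boot scripts
    auto_activation_volume_list = []
}

lvm dumpconfig activation/auto_activation_volume_list      (verify the effective setting)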
Moving out all non-blockers/exceptions.
This requires applying an lvm filter whitelisting the host devices, so LVM does not scan direct LUNs (or any other LUN which is not needed by the hypervisor).
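A hedged example of what such a whitelist can look like, assuming /dev/sda2 is the only local device holding the hypervisor's own PV (a real host should rather use a stable /dev/disk/by-id path for the device it actually boots from):

devices {
    # accept only the host's own PV, reject everything else
    # (direct LUNs, guest-created PVs on shared storage, etc.)
    filter = [ "a|^/dev/sda2$|", "r|.*|" ]
}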
*** Bug 1426916 has been marked as a duplicate of this bug. ***
I tried to reproduce this issue by simulating discovery of a new LUN used on another system as an lvm physical volume.

Tested on:

# rpm -qa | egrep 'lvm2|multipath|kernel' | sort
device-mapper-multipath-0.4.9-99.el7_3.1.x86_64
device-mapper-multipath-libs-0.4.9-99.el7_3.1.x86_64
kernel-3.10.0-327.el7.x86_64
kernel-3.10.0-514.10.2.el7.x86_64
kernel-3.10.0-514.el7.x86_64
kernel-headers-3.10.0-514.10.2.el7.x86_64
kernel-tools-3.10.0-514.10.2.el7.x86_64
kernel-tools-libs-3.10.0-514.10.2.el7.x86_64
lvm2-2.02.166-1.el7_3.3.x86_64
lvm2-libs-2.02.166-1.el7_3.3.x86_64

I tried this flow:
1. select a FC LUN not used by vdsm
2. create a PV, VG and 2 LVs
   pvcreate /dev/mapper/xxxyyy
   vgcreate guest-vg /dev/mapper/xxxyyy
   lvcreate --name guest-lv-1 --size 10g guest-vg
   lvcreate --name guest-lv-2 --size 10g guest-vg
3. Stop vdsm and multipathd (hoping that multipathd will not update /etc/multipath/wwids when stopped)
4. remove xxxyyy from /etc/multipath/wwids
5. reboot
6. check using multipath -ll if multipath could grab the LUN after boot

I could not reproduce it after 4 tries. In the output of journalctl -b, we can see that both lvm and multipath are trying to grab the devices, but in all cases multipath could grab the device.

Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sde: add path (uevent)
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sde: spurious uevent, path already in pathvec
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com systemd[1]: Starting system-lvm2\x2dpvscan.slice.
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com systemd[1]: Starting LVM2 PV scan on device 8:64...
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com systemd[1]: Starting LVM2 PV scan on device 8:80...
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com kernel: device-mapper: multipath service-time: version 0.3.0 loaded
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: 3600a09803830355a332b47677750717a: load table [0 104857600 multipath 3 pg_init_retries 50 retain_attached_hw_handler 0 1 1 s
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: 3600a09803830355a332b47677750717a: event checker started
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sde [8:64]: path added to devmap 3600a09803830355a332b47677750717a
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sdf: add path (uevent)
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sdf: spurious uevent, path already in pathvec
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: 3600a09803830355a332b476777507230: load table [0 104857600 multipath 3 pg_init_retries 50 retain_attached_hw_handler 0 1 1 s
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: 3600a09803830355a332b476777507230: event checker started
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sdf [8:80]: path added to devmap 3600a09803830355a332b476777507230
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com systemd[1]: Started LVM2 PV scan on device 8:64.
Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com systemd[1]: Started LVM2 PV scan on device 8:80.

Maybe this issue was solved in 7.3? Or maybe there is another thing needed to reproduce this bug?

We also need to test the case when a device is discovered not during boot, but when scanning FC hosts. Maybe timing is different in this case?

Ben, can you advise how to simulate this issue better?
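For anyone repeating this, a hedged translation of steps 3-6 into commands (the wwid is the one shown in the journal output above; adjust for the actual LUN being tested):

systemctl stop vdsmd multipathd
sed -i '/3600a09803830355a332b47677750717a/d' /etc/multipath/wwids      (step 4: drop the LUN's wwid)
reboot
multipath -ll                                                           (step 6: check whether a map was created for the LUN after boot)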
Elad, I will need help from QE for testing this. I will need to be able to map a new LUN with an LVM setup to a running system, and trigger a FC scan.

I will need access to a FC server for mapping a new LUN, or someone from QE that can help me with this. I will need to map and unmap a LUN to a host several times to reproduce this.
So, looking at these messages:

Jul 02 19:30:07 grey-vdsc.eng.lab.tlv.redhat.com multipathd[911]: sde: spurious uevent, path already in pathvec

it looks like the device was already present before the uevent occurred. Perhaps these messages are from late boot, and the device was initially discovered during the initramfs portion of boot. In this case, did you remake the initramfs after editing /etc/multipath/wwids?

Another option would be to try doing this with iscsi devices, instead of FC devices.
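A hedged sketch of both suggestions (the portal address and target IQN below are placeholders):

dracut -f                                                   (rebuild the initramfs so it picks up the edited /etc/multipath/wwids)

iscsiadm -m discovery -t sendtargets -p 10.35.0.1
iscsiadm -m node -T iqn.2017-01.com.example:target1 -l      (log in at runtime, so the paths appear outside the initramfs)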
We cannot reproduce this issue, but it is prevented by applying a proper lvm filter that does not allow lvm to access any devices not required by the host.

We introduced a new vdsm-tool command, "config-lvm-filter", automating the lvm configuration. If you use block storage you should configure the lvm filter properly on all hosts.

See https://ovirt.org/blog/2017/12/lvm-configuration-the-easy-way/
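Roughly, on each host (the tool analyzes the LVs the host actually uses, proposes a matching filter for /etc/lvm/lvm.conf and asks for confirmation before writing it; see the linked blog post for details):

vdsm-tool config-lvm-filter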
Nir, this is targeted to 4.3.0 but it's modified in 4.2.0. Can you please check / fix target milestone?
(In reply to Sandro Bonazzola from comment #24)
> Nir, this is targeted to 4.3.0 but it's modified in 4.2.0.
> Can you please check / fix target milestone?

Same as https://bugzilla.redhat.com/show_bug.cgi?id=1130527#c26
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No external trackers attached]

For more info please contact: infra
Based on comment 20 moving to VERIFIED
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.