Bug 696875 - RHEVH bootup hanging in ovirt-early when doing LVM scanning
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: ovirt-node
Version: 5.6
Hardware: Unspecified OS: Unspecified
Priority: high Severity: high
Assigned To: Mike Burns
QA Contact: Virtualization Bugs
Keywords: TestOnly
Depends On:
Blocks:
Reported: 2011-04-15 01:16 EDT by Mark Huth
Modified: 2016-04-26 11:17 EDT (History)
CC: 20 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Due to a bug in the lvm2 package an lvm scan would find too many devices with the same volume group name. New lvm2 packages in 5.8 fix the problem.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-21 00:03:41 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Legacy) 46124 None None None Never

Description Mark Huth 2011-04-15 01:16:23 EDT
Description of problem:
GSS have seen a couple of issues recently where the bootup of RHEVH would hang during the ovirt-early startup script.  It was found that the pvs command in ovirt-early was hanging.  The lvm2-monitor startup (and shutdown) script also runs vgs, so it would hang too (if ovirt-early were fixed not to hang).  In fact, running any LVM command on these affected machines with the default filter of 'a/.*/' would hang.  vdsm wasn't affected, though, because it runs LVM commands with its own filter.

Version-Release number of selected component (if applicable):
RHEVH 5.6 (9.3.el5_6) ... and earlier versions

How reproducible:
Every time for affected RHEVH hypervisors.

Steps to Reproduce:
1. Boot affected RHEVH with SAN attached and with default LVM filter.
2. Bootup sequence would hang at about "Updating OVIRT_BOOTPARAMS"
3. Subsequent reboots would hang too and so RHEVH could never be brought up.
  
Actual results:
RHEVH could never boot all the way up and was effectively useless.

Expected results:
RHEVH should be able to boot all the way up.

Additional info:
The issue was resolved in one of 3 ways:
1) Disconnect the SAN cables from the RHEVH and it would boot fine.  Attach the SAN cables after boot.  This has to be done on every boot, so it's not a great solution.

2) Boot RHEVH to runlevel 1.  Edit /etc/lvm/lvm.conf to change the filter to scan only the local hard disks, e.g. filter = [ "a|/dev/sda|", "r|.*|" ].  Continue the boot process to runlevel 3 by running init 3.  However, the lvm.conf file wasn't persistable, so it would revert to the default filter on reboot.  Thus lvm.conf would have to be modified on each boot, so this wasn't a great solution either.

3) Modify the ovirt-early and lvm2-monitor startup scripts to add a filter to the LVM commands so they scan only the local disks.  E.g. in ovirt-early:
pvs ... --config 'devices {filter = [ "a|/dev/sda|", "r|.*|" ]}' ...

And in lvm2-monitor:
vgs ... --config 'devices {filter = [ "a|/dev/sda|", "r|.*|" ]}' ...

Persist ovirt-early and lvm2-monitor.  Then the RHEVH could be rebooted and would boot all the way up each time.  This was a better solution.
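The --config override in workaround 3 can be factored into a small helper that builds the inline filter string once and reuses it for each LVM call. This is a minimal sketch of the pattern, not the actual script change; the function name and the /dev/sda path are illustrative assumptions:

```shell
# Hypothetical helper: build the --config argument that limits LVM
# scans to a single local disk (the pattern used in workaround 3).
lvm_local_config() {
    # $1 is the local disk to accept; every other device is rejected.
    printf 'devices { filter = [ "a|%s|", "r|.*|" ] }' "$1"
}

# The init-script calls would then look like:
#   pvs --config "$(lvm_local_config /dev/sda)"
#   vgs --config "$(lvm_local_config /dev/sda)"
```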

The whole problem and current solution is described in this kbase article:
https://access.redhat.com/kb/docs/DOC-46124

I have not been able to reproduce this issue.  For the affected customers/hypervisors it seemed that LVM was hanging when scanning some of the logical volumes for the virtual machines in the storage domain.  Thus removing the SAN cables or modifying the filter to scan only local disks caused LVM not to consider the LVs in the storage domains.  I'm not sure what it was about these LVs that caused LVM to hang when scanning them.

Not sure what the best solution is.  Perhaps the initial configuration of the RHEVH can be scripted to modify the filter in lvm.conf so only local disks are scanned.
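The initial-configuration idea above could be sketched as a small script that rewrites the filter line in lvm.conf during setup. Everything here (the function name, the sed approach, and the /dev/sda path) is an assumption for illustration, not an existing RHEVH mechanism:

```shell
# Hypothetical first-boot snippet: rewrite the device filter in
# lvm.conf so LVM scans only the local disk.
restrict_lvm_filter() {
    conf=$1   # path to lvm.conf
    disk=$2   # local disk to keep scanning, e.g. /dev/sda
    # Replace the existing filter line with one that accepts only
    # $disk and rejects everything else (SAN LUNs included).
    sed -i "s|^\([[:space:]]*filter[[:space:]]*=\).*|\1 [ \"a\|${disk}\|\", \"r\|.*\|\" ]|" "$conf"
}

# e.g.: restrict_lvm_filter /etc/lvm/lvm.conf /dev/sda
```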
Comment 1 Mark Huth 2011-04-15 01:20:16 EDT
I wanted to attach this bug to Red Hat Enterprise Hypervisor but couldn't find that product in the list, so I attached it to the closest thing I could think of - ovirt-node.
Comment 5 Mark Huth 2011-04-28 02:02:03 EDT
I am wondering if this is in any way related to VMs with RAW/sparse disks.  I was trying to set up a reproducer where a VM had multiple disks and at least one of those disks was entirely allocated as a PV.  Unfortunately I'm not able to reproduce the issue because the version I am using has been fixed as of bz649029 and bz639689, so I can't import a VM with RAW/sparse disk(s).

I don't think it's an issue with the SAN because one of the customers affected by this problem has since moved to a whole new hardware platform and SAN, and when they imported their VMs they ran into the exact same problem on the new environment.

I notice this customer has a couple of VMs with multiple RAW/sparse disks and PVs spanning the whole disk.  Since I am not able to reproduce this issue with normal RAW/preallocated or COW/sparse disks, I was wondering if the RAW/sparse disks were the key...

Unfortunately the customer is reluctant to run some tests on their new production environment to try to isolate the problem.
Comment 16 Mike Burns 2011-06-07 09:46:28 EDT
I've tried reproducing this by allocating a large number of LUNs to a RHEV-H host and creating PVs, VGs, and LVs.  I wasn't able to see the hang mentioned, though I did notice a slowdown in some processes due to the large number of disks it had to deal with.

Can you get the following information so I can attempt to reproduce?

Number of LUNs allocated to the RHEV-H host
Size of the LUNs
How they're allocated in storage domains
Number of virtual machines and disk size for each (approximate)
Sparse/RAW for the VMs

Thanks
Comment 19 RHEL Product and Program Management 2011-06-20 18:40:48 EDT
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.
Comment 21 Mark Huth 2011-08-09 00:52:56 EDT
Note, a customer in SFDC said they updated RHEVH to rhevh-5.7 and the problem went away.  I will confirm this with another customer.

Cheers,
Mark
Comment 22 John Brier 2011-08-30 16:23:53 EDT
Are there any confirmations of RHEV-H 5.7 fixing this issue?

I have asked my customer (sfdc 00457479) to try and reproduce.
Comment 30 John Ruemker 2012-01-12 09:39:18 EST
2 new bugs created for the lvm2 issue:

  https://bugzilla.redhat.com/show_bug.cgi?id=773432 (5.8)
  https://bugzilla.redhat.com/show_bug.cgi?id=773587 (5.7.z)

-John
Comment 31 Mike Burns 2012-01-12 19:24:48 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
This issue is caused by a bug in the lvm2 package.  An lvm scan finds too many devices with the same volume group name. New lvm2 packages in 5.8 fix the problem.
Comment 32 Guohua Ouyang 2012-01-21 00:15:10 EST
Verified on the 5.8-20120118.0 build: there is a volume group with 7 LUNs, and RHEVH boots up OK.

Setting bug status to verified for now.

[root@unused ~]# pvs
  PV                                              VG                                   Fmt  Attr PSize PFree
  /dev/mapper/360a9800050334c33424a68583576375ap2 HostVG                               lvm2 a--  9.93G    0 
  /dev/mapper/360a9800050334c33424a685835774f78   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 9.88G
  /dev/mapper/360a9800050334c33424a685835777163   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 9.88G
  /dev/mapper/360a9800050334c33424a685835784979   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 9.88G
  /dev/mapper/360a9800050334c33424a685835786c2d   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 9.88G
  /dev/mapper/360a9800050334c33424a685835794341   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 9.88G
  /dev/mapper/360a9800050334c33424a6858357a4470   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 9.88G
  /dev/mapper/360a9800050334c33424a6858357a6a5a   a380ee52-414c-4b82-9e0c-973adda2b38b lvm2 a--  9.88G 6.00G
[root@unused ~]# vgs
  VG                                   #PV #LV #SN Attr   VSize  VFree 
  HostVG                                 1   6   0 wz--n-  9.93G     0 
  a380ee52-414c-4b82-9e0c-973adda2b38b   7   6   0 wz--n- 69.12G 65.25G
Comment 33 Stephen Gordon 2012-02-09 16:44:44 EST
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-This issue is caused by a bug in the lvm2 package.  An lvm scan finds too many devices with the same volume group name. New lvm2 packages in 5.8 fix the problem.+Due to a bug in the lvm2 package an lvm scan would find too many devices with the same volume group name. New lvm2 packages in 5.8 fix the problem.
Comment 35 errata-xmlrpc 2012-02-21 00:03:41 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0168.html
