Created attachment 969551 [details]
logs

Description of problem:
Live migrating block storage on a mixed-version (3.4 - 3.5) cluster (RHEL 7 hosts) causes vdsm to extend the (leaf) qcow sparse volume until it reaches the limit of free space on the storage domain.

lvextend loop ->

2f8787cb-82d8-4cee-9af5-59cbf28f3a50::DEBUG::2014-12-16 14:34:48,998::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvextend --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/360060160f4a03000d254fda11568e411|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --autobackup n --size 10240m 0e8e761a-34ef-4bfa-bbd7-ba842b516118/fe32c21d-023a-4d2e-a678-0774b831b2f5' (cwd None)
Dummy-103::DEBUG::2014-12-16 14:34:51,019::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) 'dd if=/rhev/data-center/af395fd7-a6c4-436b-8cb7-58462456bc9c/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000' (cwd None)
Dummy-103::DEBUG::2014-12-16 14:34:51,099::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0460881 s, 22.2 MB/s\n'; <rc> = 0

after the "no more space left" exception -->

f306d0fd-8ee5-4bcd-9279-44f20f9ee2b8::ERROR::2014-12-16 14:46:43,188::storage_mailbox::172::Storage.SPM.Messages.Extend::(processRequest) processRequest: Exception caught while trying to extend volume: f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e in domain: 0e8e761a-34ef-4bfa-bbd7-ba842b516118
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 166, in processRequest
    pool.extendVolume(volume['domainID'], volume['volumeID'], size)
  File "/usr/share/vdsm/storage/securable.py", line 75, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1288, in extendVolume
    sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown)
  File "/usr/share/vdsm/storage/blockSD.py", line 1319, in extendVolume
    lvm.extendLV(self.sdUUID, volumeUUID, size)  # , isShuttingDown)
  File "/usr/share/vdsm/storage/lvm.py", line 1151, in extendLV
    _resizeLV("lvextend", vgName, lvName, size)
  File "/usr/share/vdsm/storage/lvm.py", line 1145, in _resizeLV
    free_size / constants.MEGAB))
VolumeGroupSizeError: Volume Group not big enough: ('0e8e761a-34ef-4bfa-bbd7-ba842b516118/f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 47104 > 768 (MiB)',)

lvm (lvs) view (look at the size of f39bc15b-ceb8-4dac-b87a...) --->

1.
# lvs | egrep "0e8e761a-34ef-4bfa-bbd7-ba842b516118|9c19aab4-cd14-4e1f-b3af
     LV                                   VG                                   Attr       LSize
  ** f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----   3.00g
     fe32c21d-023a-4d2e-a678-0774b831b2f5 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  10.00g
     ids                                  0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao---- 128.00m
     inbox                                0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m
     leases                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   2.00g
     master                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   1.00g
     metadata                             0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 512.00m
     outbox                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m

2.
     LV                                   VG                                   Attr       LSize
  ** f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  16.00g
     fe32c21d-023a-4d2e-a678-0774b831b2f5 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  10.00g
     ids                                  0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao---- 128.00m
     inbox                                0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m
     leases                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   2.00g
     master                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   1.00g
     metadata                             0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 512.00m
     outbox                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m

3.
     LV                                   VG                                   Attr       LSize
  ** f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  45.00g
     fe32c21d-023a-4d2e-a678-0774b831b2f5 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  10.00g
     ids                                  0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao---- 128.00m
     inbox                                0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m
     leases                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   2.00g
     master                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   1.00g
     metadata                             0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 512.00m
     outbox                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m

When attempting VM removal after a while, oVirt prompts a pop-up error window:
"Error while executing action: vm_iscsi_test: Cannot remove VM. Disk vm_iscsi_test_Disk1 is being moved or copied."

Version-Release number of selected component (if applicable):
engine 3.4: rhevm-3.4.4-2.2.el6ev.noarch
purple-vds1: vdsm-4.16.8.1-2.el7ev.x86_64
purple-vds2: vdsm-4.14.18-4.el7ev.x86_64

How reproducible:
?

Steps to Reproduce:
1. Live merge an iSCSI disk in a mixed-version environment (I have attempted to live merge twice)

Actual results:
as described

Expected results:
Live migration is a basic operation and should always succeed as long as the versions support the feature.

Additional info:
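The VolumeGroupSizeError at the end of the traceback comes from a simple guard: before running lvextend, vdsm compares the requested extension against the volume group's remaining free space. A minimal illustrative sketch of that check (the function name and message format are modeled on the log above, not vdsm's actual code):

```python
# Illustrative sketch (not vdsm's actual implementation) of the guard that
# produced the VolumeGroupSizeError in the traceback above.

MEGAB = 2 ** 20  # bytes per MiB, mirroring vdsm's constants.MEGAB


class VolumeGroupSizeError(Exception):
    pass


def check_extend(vg_name, lv_name, new_size_mb, vg_free_bytes):
    """Raise if extending the LV to new_size_mb would exceed the VG's free space."""
    free_mb = vg_free_bytes // MEGAB
    if new_size_mb > free_mb:
        raise VolumeGroupSizeError(
            "Volume Group not big enough: %s/%s %d > %d (MiB)"
            % (vg_name, lv_name, new_size_mb, free_mb))


# The failure in the log: a 47104 MiB size requested with only 768 MiB free.
try:
    check_extend("0e8e761a-34ef-4bfa-bbd7-ba842b516118",
                 "f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e",
                 47104, 768 * MEGAB)
except VolumeGroupSizeError as e:
    print(e)
```

The guard only stops the final extension attempt; the bug here is that the extend requests kept being issued until the VG was exhausted.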
There are too many contradicting terms here - are you talking about VM live migration, storage live migration, or live merge?
This looks like a similar (if not identical) issue to bug 1091094, which was resolved in vdsm-4.16.7.6-1.el6ev.

Additional info I'm missing:
1. Which host is the SPM and which is the HSM?
2. What are the libvirt versions on each host?
Hey Allon, the issue is about storage live migration. purple-vds1 was the SPM, purple-vds2 the HSM.

root@purple-vds1 /rhev/data-center/mnt # rpm -qa | grep libvirt
libvirt-client-1.2.8-9.el7.x86_64
libvirt-daemon-driver-interface-1.2.8-9.el7.x86_64
libvirt-daemon-driver-storage-1.2.8-9.el7.x86_64
libvirt-daemon-driver-network-1.2.8-9.el7.x86_64
libvirt-daemon-kvm-1.2.8-9.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.8-9.el7.x86_64
libvirt-daemon-driver-qemu-1.2.8-9.el7.x86_64
libvirt-python-1.2.8-6.el7.x86_64
libvirt-daemon-driver-secret-1.2.8-9.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-1.2.8-9.el7.x86_64
libvirt-lock-sanlock-1.2.8-9.el7.x86_64

root@purple-vds2 ~ # rpm -qa | grep libvirt
libvirt-client-1.2.8-9.el7.x86_64
libvirt-daemon-driver-interface-1.2.8-9.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-driver-storage-1.2.8-9.el7.x86_64
libvirt-python-1.2.8-6.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.8-9.el7.x86_64
libvirt-daemon-kvm-1.2.8-9.el7.x86_64
libvirt-daemon-driver-network-1.2.8-9.el7.x86_64
libvirt-daemon-driver-secret-1.2.8-9.el7.x86_64
libvirt-daemon-driver-qemu-1.2.8-9.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-1.2.8-9.el7.x86_64
libvirt-lock-sanlock-1.2.8-9.el7.x86_64
This gives credence to the aforementioned theory. Bug 1091094 fixed a regression introduced by a behavior change in libvirt 1.2.2, but it was only fixed for 3.5 (vdsm 4.16.0) and never backported to 3.4.z. I.e., if we can confirm this, the patch should be backported to 3.4.z, and that should solve this issue too.

Nir - can we backport this patch and provide Ori with a scratch build to verify?
To clarify - if this theory is true, this bug is NOT a blocker for 3.5.0, as the fix is already merged there, and should only be backported to the next 3.4.z.
After some reproduction efforts (thanks Nir!), this is not related to the mixed env.

In 3.5, with a single host and two block domains, this does not reproduce.
In 3.4, with a single host and two block domains, this reproduces consistently.

In other words, the working theory seems to be correct. Until pm_ack+ and qa_ack+ are given I cannot clone the BZ, but this should be verified as fixed with a "clean" 3.5.0 env, and the fix backported to 3.4.5.
The root cause of this bug was solved upstream in oVirt (see bug 1091094, whose fix was included in RHEV build vt2) and verified there. Moving to ON_QA so QA can decide whether to re-verify with this scenario, which is slightly different, or just close as a duplicate.

Note that while the upgrade and/or mixed environment is immaterial for this BZ, a full-fledged mixed-environment scenario cannot be tested until the 3.4.z issue (tracked by bug 1175686) is fixed.
Nir, installed the packages from the attachment. vdsm failed to start; adding the host failed as well. Tried twice.

Error from engine (DC version 3.4):

2014-12-21 15:20:03,766 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-48) [7740e6e0] Correlation ID: 7740e6e0, Job ID: aceebcfc-4050-422b-b86e-75334513fb6f, Call Stack: null, Custom Event ID: -1, Message: Host 10.35.2.53 is compatible with versions (3.0,3.1,3.2,3.3,3.4) and cannot join Cluster Default which is set to version 3.5.

From the host:

root@rat ~/vdsm-4.14.17-30.gitc89899b.el7_0/noarch # service vdsmd status
Redirecting to /bin/systemctl status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: failed (Result: start-limit) since Sun 2014-12-21 15:01:41 IST; 9min ago

Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: vdsmd.service: control process exited, code...=1
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Unit vdsmd.service entered failed state.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: vdsmd.service holdoff time over, scheduling...t.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Stopping Virtual Desktop Server Manager...
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Starting Virtual Desktop Server Manager...
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: vdsmd.service start request repeated too qu...t.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Unit vdsmd.service entered failed state.
Dec 21 15:04:06 rat.tlv.redhat.com systemd[1]: Stopped Virtual Desktop Server Manager.
Hint: Some lines were ellipsized, use -l to show in full.
(In reply to Ori Gofen from comment #13)
> Nir, installed the packages from the attachment.
> vdsm failed to start, adding host failed as well.
> tried twice.

Ori, the fix is merged in 3.4.5. Please use the next build including this fix for verification.
Created attachment 972030 [details]
logs

(In reply to Nir Soffer from comment #14)
> (In reply to Ori Gofen from comment #13)
> > Nir, installed the packages from the attachment.
> > vdsm failed to start, adding host failed as well.
> > tried twice.
> Ori, the fix is merged in 3.4.5. Please use the next build including this
> fix for verifying.

I will keep that in mind when I verify bz #1175686, but this bug has reproduced on vt13.4 (clean installation).

engine:
rhevm-backend-3.5.0-0.26.el6ev.noarch
rhevm-3.5.0-0.26.el6ev.noarch

vdsm:
vdsm-4.16.8.1-4.el7ev.x86_64
vdsm-cli-4.16.8.1-4.el7ev.noarch
failed QA
(In reply to Ori Gofen from comment #16)
> failed QA

Please describe detailed steps to reproduce this; I could not reproduce it with vdsm-4.16.8.1.
Steps to reproduce:

Setup:
- engine is vt13.4 running on RHEL 6.5
- host is vt13.4 running on RHEL 7
- 2x iSCSI domains

1. Create a VM with a thin disk.
2. Install Fedora; while installation is in progress, wait for the first lvextend command, then live storage migrate the VM's disk.
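When repeating these steps, something like the following can poll lvs and record the leaf volume's growth during the migration (a hypothetical helper, not part of vdsm; `watch_lv` needs root and a live LVM setup, so the parsing is split into a separate function):

```python
# Hypothetical monitoring helper for observing the runaway lvextend loop:
# poll `lvs --noheadings -o lv_name,lv_size --units m` and print the size
# of the volume being migrated over time.
import subprocess
import time


def parse_lv_sizes(lvs_output):
    """Map LV name -> size in MiB from `lvs --noheadings -o lv_name,lv_size --units m`."""
    sizes = {}
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            name, size = fields
            sizes[name] = float(size.rstrip("m"))
    return sizes


def watch_lv(vg, lv, interval=5):
    """Print the LV's size every `interval` seconds (requires root)."""
    while True:
        out = subprocess.check_output(
            ["lvs", "--noheadings", "--units", "m",
             "-o", "lv_name,lv_size", vg]).decode()
        print(lv, parse_lv_sizes(out).get(lv), "MiB")
        time.sleep(interval)
```

On a healthy system the destination volume's size should plateau once it covers the allocated data; in the buggy case it keeps growing until the VG runs out of free extents.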
(In reply to Ori Gofen from comment #18)

The steps to reproduce seem valid, although migrating a disk while installing a VM is unlikely. Basically this is equivalent to migrating a disk while doing a lot of I/O, an operation which may fail depending on how much I/O you do, but we do not expect unlimited extension of the new volume.
Please reproduce with 3.4 - we must know if this is a regression in 3.5 or an existing issue.
Checking the host, it is not running RHEL 7.0:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 Beta (Maipo)

Testing 7.1 beta is important, but we are trying to verify RHEV 3.5 on the current release (7.0). Please re-install this host using the correct repositories and reproduce again.
Nir, you are right, this is a RHEL 7.1 bug; updating the header. Steps to reproduce are the same as described above.
Ori, can you confirm that you cannot reproduce this on RHEL 7.0?
yes, rhel7.0 works with vt13.4.
(In reply to Ori Gofen from comment #24)
> yes, rhel7.0 works with vt13.4.

Then this bug should be closed VERIFIED or CURRENTRELEASE (since this was fixed a long time ago for another bug). For the RHEL 7.1 issue, please open a new bug; converting an existing bug makes it harder to track issues.
verified on vt13.4