Created attachment 969551 [details]
logs

Description of problem:
Live migrating block storage on a mixed-version (3.4 - 3.5) cluster (RHEL 7 hosts) causes vdsm to extend the (leaf) qcow sparse volume until it reaches the limit of free space on the storage domain.

lvextend loop ->

2f8787cb-82d8-4cee-9af5-59cbf28f3a50::DEBUG::2014-12-16 14:34:48,998::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvextend --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/360060160f4a03000d254fda11568e411|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --autobackup n --size 10240m 0e8e761a-34ef-4bfa-bbd7-ba842b516118/fe32c21d-023a-4d2e-a678-0774b831b2f5' (cwd None)
Dummy-103::DEBUG::2014-12-16 14:34:51,019::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) 'dd if=/rhev/data-center/af395fd7-a6c4-436b-8cb7-58462456bc9c/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000' (cwd None)
Dummy-103::DEBUG::2014-12-16 14:34:51,099::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0460881 s, 22.2 MB/s\n'; <rc> = 0

after the "no more space left" exception -->

f306d0fd-8ee5-4bcd-9279-44f20f9ee2b8::ERROR::2014-12-16 14:46:43,188::storage_mailbox::172::Storage.SPM.Messages.Extend::(processRequest) processRequest: Exception caught while trying to extend volume: f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e in domain: 0e8e761a-34ef-4bfa-bbd7-ba842b516118
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 166, in processRequest
    pool.extendVolume(volume['domainID'], volume['volumeID'], size)
  File "/usr/share/vdsm/storage/securable.py", line 75, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1288, in extendVolume
    sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown)
  File "/usr/share/vdsm/storage/blockSD.py", line 1319, in extendVolume
    lvm.extendLV(self.sdUUID, volumeUUID, size)  # , isShuttingDown)
  File "/usr/share/vdsm/storage/lvm.py", line 1151, in extendLV
    _resizeLV("lvextend", vgName, lvName, size)
  File "/usr/share/vdsm/storage/lvm.py", line 1145, in _resizeLV
    free_size / constants.MEGAB))
VolumeGroupSizeError: Volume Group not big enough: ('0e8e761a-34ef-4bfa-bbd7-ba842b516118/f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 47104 > 768 (MiB)',)

lvm (lvs) view (look at the size of f39bc15b-ceb8-4dac-b87a...) --->

1.
# lvs | egrep "0e8e761a-34ef-4bfa-bbd7-ba842b516118|9c19aab4-cd14-4e1f-b3af
     LV                                   VG                                   Attr       LSize
  ** f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----   3.00g
     fe32c21d-023a-4d2e-a678-0774b831b2f5 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  10.00g
     ids                                  0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao---- 128.00m
     inbox                                0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m
     leases                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   2.00g
     master                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   1.00g
     metadata                             0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 512.00m
     outbox                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m

2.
     LV                                   VG                                   Attr       LSize
  ** f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  16.00g
     fe32c21d-023a-4d2e-a678-0774b831b2f5 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  10.00g
     ids                                  0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao---- 128.00m
     inbox                                0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m
     leases                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   2.00g
     master                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   1.00g
     metadata                             0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 512.00m
     outbox                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m

3.
     LV                                   VG                                   Attr       LSize
  ** f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  45.00g
     fe32c21d-023a-4d2e-a678-0774b831b2f5 0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao----  10.00g
     ids                                  0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-ao---- 128.00m
     inbox                                0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m
     leases                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   2.00g
     master                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a-----   1.00g
     metadata                             0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 512.00m
     outbox                               0e8e761a-34ef-4bfa-bbd7-ba842b516118 -wi-a----- 128.00m

When attempting VM removal after a while, oVirt prompts a pop-up error window:
"Error while executing action: vm_iscsi_test: Cannot remove VM. Disk vm_iscsi_test_Disk1 is being moved or copied."

Version-Release number of selected component (if applicable):
engine 3.4: rhevm-3.4.4-2.2.el6ev.noarch
purple-vds1: vdsm-4.16.8.1-2.el7ev.x86_64
purple-vds2: vdsm-4.14.18-4.el7ev.x86_64

How reproducible:
?

Steps to Reproduce:
1. Live merge an iSCSI disk in a mixed-version environment (I have attempted to live merge twice)

Actual results:
as described

Expected results:
Live migration is a basic operation and should always succeed as long as the versions support the feature.

Additional info:
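The VolumeGroupSizeError at the end of the traceback comes from a simple guard: before running lvextend, vdsm compares the requested extension against the volume group's remaining free space. A minimal illustrative sketch of that check (the function name and message format are modeled on the log above, not vdsm's actual code):

```python
# Illustrative sketch (not vdsm's actual implementation) of the guard that
# produced the VolumeGroupSizeError in the traceback above.

MEGAB = 2 ** 20  # bytes per MiB, mirroring vdsm's constants.MEGAB


class VolumeGroupSizeError(Exception):
    pass


def check_extend(vg_name, lv_name, new_size_mb, vg_free_bytes):
    """Raise if extending the LV to new_size_mb would exceed the VG's free space."""
    free_mb = vg_free_bytes // MEGAB
    if new_size_mb > free_mb:
        raise VolumeGroupSizeError(
            "Volume Group not big enough: %s/%s %d > %d (MiB)"
            % (vg_name, lv_name, new_size_mb, free_mb))


# The failure in the log: a 47104 MiB size requested with only 768 MiB free.
try:
    check_extend("0e8e761a-34ef-4bfa-bbd7-ba842b516118",
                 "f39bc15b-ceb8-4dac-b87a-fbf1ab85e84e",
                 47104, 768 * MEGAB)
except VolumeGroupSizeError as e:
    print(e)
```

The guard only stops the final extension attempt; the bug here is that the extend requests kept being issued until the VG was exhausted.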
There are too many contradicting terms here - are you talking about VM live migration, storage live migration, or live merge?
This looks like a similar (if not identical) issue to bug 1091094, which was resolved in vdsm-4.16.7.6-1.el6ev.

Additional info I'm missing:
1. Which host is the SPM and which is the HSM?
2. What are the libvirt versions on each host?
Hey Allon, the issue is about storage live migration. purple-vds1 was the SPM, purple-vds2 the HSM.

root@purple-vds1 /rhev/data-center/mnt # rpm -qa | grep libvirt
libvirt-client-1.2.8-9.el7.x86_64
libvirt-daemon-driver-interface-1.2.8-9.el7.x86_64
libvirt-daemon-driver-storage-1.2.8-9.el7.x86_64
libvirt-daemon-driver-network-1.2.8-9.el7.x86_64
libvirt-daemon-kvm-1.2.8-9.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.8-9.el7.x86_64
libvirt-daemon-driver-qemu-1.2.8-9.el7.x86_64
libvirt-python-1.2.8-6.el7.x86_64
libvirt-daemon-driver-secret-1.2.8-9.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-1.2.8-9.el7.x86_64
libvirt-lock-sanlock-1.2.8-9.el7.x86_64

root@purple-vds2 ~ # rpm -qa | grep libvirt
libvirt-client-1.2.8-9.el7.x86_64
libvirt-daemon-driver-interface-1.2.8-9.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-driver-storage-1.2.8-9.el7.x86_64
libvirt-python-1.2.8-6.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.8-9.el7.x86_64
libvirt-daemon-kvm-1.2.8-9.el7.x86_64
libvirt-daemon-driver-network-1.2.8-9.el7.x86_64
libvirt-daemon-driver-secret-1.2.8-9.el7.x86_64
libvirt-daemon-driver-qemu-1.2.8-9.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.8-9.el7.x86_64
libvirt-daemon-1.2.8-9.el7.x86_64
libvirt-lock-sanlock-1.2.8-9.el7.x86_64
This gives credence to the aforementioned theory. Bug 1091094 fixed a regression introduced by a behavior change in libvirt 1.2.2, but it was only fixed for 3.5 (vdsm 4.16.0) and never backported to 3.4.z. I.e., if we can confirm this, the patch should be backported to 3.4.z, and that should solve this issue too.

Nir - can we backport this patch and provide Ori with a scratch build to verify?
To clarify - if this theory is true, this bug is NOT a blocker for 3.5.0, as the fix is already merged there, and should only be backported to the next 3.4.z.
After some reproduction efforts (thanks Nir!), this is not related to the mixed env.

In 3.5, with a single host and two block domains, this does not reproduce.
In 3.4, with a single host and two block domains, this reproduces consistently.

In other words, the working theory seems to be correct. Until pm_ack+ and qa_ack+ are given I cannot clone the BZ, but this should be verified as fixed with a "clean" 3.5.0 env, and the fix backported to 3.4.5.
The root cause of this bug was solved upstream in oVirt (see bug 1091094, whose fix was included in RHEV build vt2) and verified there. Moving to ON_QA so QA can decide whether to re-verify with this scenario, which is slightly different, or just close as a duplicate.

Note that while the upgrade and/or mixed environment is immaterial for this BZ, a full-fledged mixed-environment scenario cannot be tested until the 3.4.z issue (tracked by bug 1175686) is fixed.
Nir, installed the packages from the attachment. vdsm failed to start; adding the host failed as well. Tried twice.

Error from engine (DC version 3.4):

2014-12-21 15:20:03,766 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-48) [7740e6e0] Correlation ID: 7740e6e0, Job ID: aceebcfc-4050-422b-b86e-75334513fb6f, Call Stack: null, Custom Event ID: -1, Message: Host 10.35.2.53 is compatible with versions (3.0,3.1,3.2,3.3,3.4) and cannot join Cluster Default which is set to version 3.5.

From the host:

root@rat ~/vdsm-4.14.17-30.gitc89899b.el7_0/noarch # service vdsmd status
Redirecting to /bin/systemctl status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: failed (Result: start-limit) since Sun 2014-12-21 15:01:41 IST; 9min ago

Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: vdsmd.service: control process exited, code...=1
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Unit vdsmd.service entered failed state.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: vdsmd.service holdoff time over, scheduling...t.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Stopping Virtual Desktop Server Manager...
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Starting Virtual Desktop Server Manager...
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: vdsmd.service start request repeated too qu...t.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Dec 21 15:01:41 rat.tlv.redhat.com systemd[1]: Unit vdsmd.service entered failed state.
Dec 21 15:04:06 rat.tlv.redhat.com systemd[1]: Stopped Virtual Desktop Server Manager.
Hint: Some lines were ellipsized, use -l to show in full.
(In reply to Ori Gofen from comment #13)
> Nir, installed the packages from the attachment.
> vdsm failed to start, adding host failed as well.
> tried twice.

Ori, the fix is merged in 3.4.5. Please use the next build including this fix for verification.
Created attachment 972030 [details]
logs

(In reply to Nir Soffer from comment #14)
> (In reply to Ori Gofen from comment #13)
> > Nir, installed the packages from the attachment.
> > vdsm failed to start, adding host failed as well.
> > tried twice.
> Ori, the fix is merged in 3.4.5. Please use the next build including this
> fix for verifying.

I will keep that in mind when I verify bz #1175686, but this bug has reproduced on vt13.4 (clean installation).

engine:
rhevm-backend-3.5.0-0.26.el6ev.noarch
rhevm-3.5.0-0.26.el6ev.noarch

vdsm:
vdsm-4.16.8.1-4.el7ev.x86_64
vdsm-cli-4.16.8.1-4.el7ev.noarch
failed QA
(In reply to Ori Gofen from comment #16)
> failed QA

Please describe detailed steps to reproduce this; I could not reproduce it with vdsm-4.16.8.1.
Steps to reproduce:

Setup:
- engine is vt13.4 running on RHEL 6.5
- host is vt13.4 running on RHEL 7
- 2x iSCSI domains

1. Create a VM with a thin disk.
2. Install Fedora; while installation is in progress, wait for the first lvextend command, then live storage migrate the VM's disk.
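When repeating these steps, something like the following can poll lvs and record the leaf volume's growth during the migration (a hypothetical helper, not part of vdsm; `watch_lv` needs root and a live LVM setup, so the parsing is split into a separate function):

```python
# Hypothetical monitoring helper for observing the runaway lvextend loop:
# poll `lvs --noheadings -o lv_name,lv_size --units m` and print the size
# of the volume being migrated over time.
import subprocess
import time


def parse_lv_sizes(lvs_output):
    """Map LV name -> size in MiB from `lvs --noheadings -o lv_name,lv_size --units m`."""
    sizes = {}
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            name, size = fields
            sizes[name] = float(size.rstrip("m"))
    return sizes


def watch_lv(vg, lv, interval=5):
    """Print the LV's size every `interval` seconds (requires root)."""
    while True:
        out = subprocess.check_output(
            ["lvs", "--noheadings", "--units", "m",
             "-o", "lv_name,lv_size", vg]).decode()
        print(lv, parse_lv_sizes(out).get(lv), "MiB")
        time.sleep(interval)
```

On a healthy system the destination volume's size should plateau once it covers the allocated data; in the buggy case it keeps growing until the VG runs out of free extents.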
(In reply to Ori Gofen from comment #18)

The steps to reproduce seem valid, although migrating a disk while installing a VM is unlikely. Basically this is equivalent to migrating a disk while doing a lot of I/O, an operation which may fail depending on how much I/O you do, but we do not expect unlimited extension of the new volume.
Please reproduce with 3.4 - we must know if this is a regression in 3.5 or an existing issue.
Checking the host, it is not running RHEL 7.0:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 Beta (Maipo)

Testing 7.1 beta is important, but we are trying to verify RHEV 3.5 on the current release (7.0). Please re-install this host using the correct repositories and reproduce again.
Nir, you are right, this is a RHEL 7.1 bug; updating the header. Steps to reproduce are the same as described above.
Ori, can you confirm that you cannot reproduce this on RHEL 7.0?
yes, rhel7.0 works with vt13.4.
(In reply to Ori Gofen from comment #24)
> yes, rhel7.0 works with vt13.4.

Then this bug should be closed VERIFIED or CURRENTRELEASE (since this was fixed a long time ago for another bug). For the RHEL 7.1 issue, please open a new bug; converting an existing bug makes it harder to track issues.
verified on vt13.4