Red Hat Bugzilla – 1117179 – The operation of hotplugging virtio block disks to a guest who doesn't have OS installation leads to multiple failures, IRSExceptions and data corruption
Created attachment 916296
vdsm+engine logs
Description of problem:
Deactivating and then activating an LV on a running guest can lead to several failures that appear to originate from a race condition.
When a running guest has deactivated block disks, we expect the
LVs to be deactivated as well, but when this bug appears the host reports the LVs as active.
[root@camel-vdsc ~]# lvs
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
21a9407a-089a-4321-a86c-04eb41b58866 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a----- 1.00g
5c7a16ca-6c01-433f-a402-a176a36e466c 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a----- 2.00g
5e737af0-aa6f-419d-b48e-156d42c83bdf 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a----- 2.00g
lv_root vg0 -wi-ao---- 224.88g
lv_swap vg0 -wi-ao---- 7.81g
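In the Attr column of lvs, the fifth character is the activation state ("a" means active) and the sixth shows whether the device is open ("o"). A minimal Python sketch for listing the active LVs, assuming the whitespace-separated layout shown above; the function name active_lvs is hypothetical and not part of vdsm:

```python
def active_lvs(lvs_output):
    """Return (lv, vg, is_open) for every LV whose attr marks it active."""
    result = []
    for line in lvs_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to hold LV, VG, Attr.
        if len(fields) < 3 or fields[0] == "LV":
            continue
        lv, vg, attr = fields[0], fields[1], fields[2]
        # attr[4] is the state character, attr[5] the "device open" character.
        if len(attr) >= 6 and attr[4] == "a":
            result.append((lv, vg, attr[5] == "o"))
    return result


SAMPLE = """\
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
21a9407a-089a-4321-a86c-04eb41b58866 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a----- 1.00g
lv_root vg0 -wi-ao---- 224.88g
"""

for lv, vg, is_open in active_lvs(SAMPLE):
    print(vg, lv, "open" if is_open else "not open")
```

With the deactivated disks from this report, none of the three disk LVs should appear in the result.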
When trying to remove those disks, their state becomes illegal.
From oVirt's log:
2014-07-08 10:03:53,098 ERROR [org.ovirt.engine.core.bll.RemoveImageCommand] (org.ovirt.thread.pool-8-thread-33) [15e14103] Command org.ovirt.engine.core.bll.RemoveImageCommand throw Vdc Bll exception
Repeating the removal operation corrupts data in the engine's PostgreSQL tables.
engine=# SELECT volume_format,image_group_id,creation_date,_update_date,active,it_guid FROM images;
volume_format | image_group_id | creation_date | _update_date | active | it_guid
---------------+----------------+------------------------+--------------+--------+--------------------------------------
4 | | 2008-04-01 00:00:00+03 | | t | 00000000-0000-0000-0000-000000000000
(1 row)
The images table is empty apart from the single row shown above.
The lvs command shows the LVs as active, and executing fuser gives:
[root@camel-vdsc ~]# fuser -kuc /dev/70dfdfa3-653c-4656-a831-d73a1681b068/21a9407a-089a-4321-a86c-04eb41b58866
/dev/70dfdfa3-653c-4656-a831-d73a1681b068/21a9407a-089a-4321-a86c-04eb41b58866: 11601(qemu)
We see that the qemu process still uses/locks the LV.
[root@camel-vdsc ~]# lvs -o lv_name,lv_tags
LV LV Tags
21a9407a-089a-4321-a86c-04eb41b58866 PU_00000000-0000-0000-0000-000000000000,MD_5,IU__remove_me_d17878dc-55c5-4297-9934-9d2f19adb996
5c7a16ca-6c01-433f-a402-a176a36e466c MD_6,PU_00000000-0000-0000-0000-000000000000,IU__remove_me_029fd296-9588-46ce-95a6-cd365c9df48c
5e737af0-aa6f-419d-b48e-156d42c83bdf MD_4,PU_00000000-0000-0000-0000-000000000000,IU__remove_me_2a382a92-0e4c-473f-b5c8-0540386188e9
Adding the lv_tags field shows that the string "remove_me" has been added to the image ID.
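The check for LVs left in this state can be sketched in Python. This is only an illustration, assuming each tag is a PREFIX_value pair as in the output above and that the marker is the literal "_remove_me_" seen in the IU tag; the helper names are hypothetical and are not vdsm functions:

```python
REMOVED_PREFIX = "_remove_me_"  # marker copied from the IU tags in this report


def parse_lv_tags(tags):
    """Split a comma-separated lv_tags string into a dict keyed by tag prefix."""
    parsed = {}
    for tag in tags.split(","):
        # "IU__remove_me_<uuid>" becomes prefix "IU", value "_remove_me_<uuid>".
        prefix, _, value = tag.partition("_")
        parsed[prefix] = value
    return parsed


def is_marked_for_removal(tags):
    """True if the IU tag value starts with the removal marker."""
    return parse_lv_tags(tags).get("IU", "").startswith(REMOVED_PREFIX)


tags = ("PU_00000000-0000-0000-0000-000000000000,MD_5,"
        "IU__remove_me_d17878dc-55c5-4297-9934-9d2f19adb996")
print(is_marked_for_removal(tags))
```

Because the tags appear in a different order on each LV, the sketch looks them up by prefix rather than by position.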
The async_tasks table is not cleared either:
engine=# SELECT task_type,task_id,command_id,action_type,started_at FROM async_tasks;
task_type | task_id | command_id | action_type | started_at
-----------+--------------------------------------+--------------------------------------+-------------+----------------------------
5 | b5627649-074f-4afc-b3bf-a995fcd7bab3 | 6a32e70d-c5d2-45bf-ab5c-1fe7166efa94 | 230 | 2014-07-08 10:04:00.236+03
5 | 803f412d-4de8-4319-ba3e-51463e8f74f5 | 41e9c6bf-f237-4907-a4dd-4d825bfd1c50 | 230 | 2014-07-08 10:04:12.018+03
5 | 3946f603-68bf-4d2a-a326-f86f94a601ce | 033357a8-8f55-4a31-92c9-9b5f35e09a1b | 230 | 2014-07-08 10:04:23.273+03
(3 rows)
If we then try to remove the block domain, the operation fails with BLL and IRS exceptions.
From the oVirt engine log:
2014-07-08 10:07:18,797 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] Failed in FormatStorageDomainVDS method
2014-07-08 10:07:18,798 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] Command org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand return value
StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=508, mMessage=Volume Group remove error: ('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',)]]
2014-07-08 10:07:18,805 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] HostName = vdsc
2014-07-08 10:07:18,809 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] Command FormatStorageDomainVDSCommand(HostName = vdsc, HostId = e61ee2aa-fa3c-49fc-a803-533663f6b9c1, storageDomainId=70dfdfa3-653c-4656-a831-d73a1681b068) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Volume Group remove error: ('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',), code = 508
2014-07-08 10:07:18,822 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] FINISH, FormatStorageDomainVDSCommand, log id: 19707ea1
2014-07-08 10:07:18,829 ERROR [org.ovirt.engine.core.bll.storage.RemoveStorageDomainCommand] (ajp--127.0.0.1-8702-7) [42184087] Command org.ovirt.engine.core.bll.storage.RemoveStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Volume Group remove error: ('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',), code = 508 (Failed with error VolumeGroupRemoveError and code 508)
2014-07-08 10:07:18,850 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp--127.0.0.1-8702-7) [42184087] Correlation ID: 575f094, Job ID: 89047a49-fbf6-49a5-b740-faf56f3f02c6, Call Stack: null, Custom Event ID: -1, Message: Failed to remove Storage Domain ISCSI. (User: admin)
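When scanning a long engine log for this failure, the error text and numeric code can be pulled out of each line. A small Python sketch, assuming the "error = ..., code = N" wording seen in the lines above; the pattern and the function name extract_vds_error are illustrative only:

```python
import re

# Non-greedy match so the error text stops at the first ", code = <digits>".
ERROR_RE = re.compile(r"error = (?P<error>.+?), code = (?P<code>\d+)")


def extract_vds_error(line):
    """Return (error_text, code) from an engine log line, or None if absent."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return match.group("error"), int(match.group("code"))


line = ("Failed to FormatStorageDomainVDS, error = Volume Group remove error: "
        "('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',), code = 508")
print(extract_vds_error(line))
```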
And from vdsm's log:
Thread-19::ERROR::2014-07-08 10:07:21,304::task::866::Storage.TaskManager.Task::(_setError) Task=`0b751b9e-e9b1-4e1a-8110-7d757c3238e0`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2760, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2706, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 900, in format
    lvm.removeVG(sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 940, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Setup: have a DC with NFS as master, and an iSCSI domain.
1. Create a VM + 4 disks on iSCSI.
2. Run the VM.
3. Deactivate all the disks quickly.
4. After all disks are deactivated, try to activate them.
5. Wait for a UI error box.
6. Remove the disks (they all become illegal), then remove again.
7. Move the iSCSI domain to maintenance and remove it (fails).
Actual results:
Multiple failures and exceptions, which lead to data loss.
Expected results:
Removing an image or a block domain should succeed, according to oVirt's docs.
Additional info:
Important note: another ERROR also appears in vdsm's log every few seconds; please read BZ #1116826 first.
(In reply to Nir Soffer from comment #2)
> What do you mean in steps 3 and 4?
> Hot unplug and plug via the UI?
Yes. Actually, time is not a factor here: the LVs are not activated because of how virtio disks behave (they cannot be hotplugged unless an OS is installed on the guest).
Updating Steps to Reproduce:
Setup: have a DC with NFS as master, and an iSCSI domain.
1. Create a VM + 4 virtio disks on iSCSI.
2. Run the VM.
3. Deactivate all the disks one by one from the UI.
4. After all disks are deactivated, try to activate them (again from the UI).
5. Wait for a UI error box.
6. Remove the disks (they all become illegal), then remove again.
7. Move the iSCSI domain to maintenance and remove it (fails).
The engine has no insight into whether a guest OS exists, or whether it supports hotplugging.
Additionally, there's no business use case for running (and hot [un]plugging) VMs with no guest OS.