Bug 1117179 - Hotplugging virtio block disks to a guest with no OS installed leads to multiple failures, IRSExceptions, and data corruption
Summary: The operation of hotplugging virtio block disks to a guest who doesn't have O...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
: 3.5.0
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2014-07-08 08:23 UTC by Ori Gofen
Modified: 2016-02-10 19:43 UTC (History)
12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-08-05 17:05:06 UTC
oVirt Team: Storage


Attachments (Terms of Use)
vdsm+engine logs (1.20 MB, application/x-tar)
2014-07-08 08:23 UTC, Ori Gofen
no flags Details

Description Ori Gofen 2014-07-08 08:23:37 UTC
Created attachment 916296 [details]
vdsm+engine logs

Description of problem:

Deactivating and then activating an LV on a running guest can lead to several failures that appear to originate from a race condition.

When a running guest has deactivated block disks, we expect the LVs to be deactivated as well, but when this bug appears the host reports the LVs as active.
[root@camel-vdsc ~]# lvs
  LV                                   VG                                   Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  21a9407a-089a-4321-a86c-04eb41b58866 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a-----   1.00g                                             
  5c7a16ca-6c01-433f-a402-a176a36e466c 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a-----   2.00g                                             
  5e737af0-aa6f-419d-b48e-156d42c83bdf 70dfdfa3-653c-4656-a831-d73a1681b068 -wi-a-----   2.00g                                             
  lv_root                              vg0                                  -wi-ao---- 224.88g                                             
  lv_swap                              vg0                                  -wi-ao----   7.81g   
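For reference, the activation state reported above is encoded in the fifth character of the lv_attr field ('a' means active, '-' means inactive). A minimal sketch, not VDSM code, that reads such `lvs --noheadings -o lv_name,lv_attr` output (the sample data below is taken from this report):

```python
# Minimal sketch (not VDSM code): parse `lvs --noheadings -o lv_name,lv_attr`
# output and report which LVs are active. In the lv_attr field, the fifth
# character is 'a' when the LV is active.

SAMPLE = """\
  21a9407a-089a-4321-a86c-04eb41b58866 -wi-a-----
  5c7a16ca-6c01-433f-a402-a176a36e466c -wi-a-----
  lv_root                              -wi-ao----
"""

def active_lvs(lvs_output):
    """Return the names of LVs whose lv_attr marks them as active."""
    active = []
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        name, attr = fields
        if len(attr) >= 5 and attr[4] == "a":  # 5th char: activation state
            active.append(name)
    return active

print(active_lvs(SAMPLE))
```

In the buggy state described here, the image LVs show up as active even though the engine considers the corresponding disks deactivated.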

When trying to remove those disks, their state becomes illegal.
From oVirt's log:

2014-07-08 10:03:53,098 ERROR [org.ovirt.engine.core.bll.RemoveImageCommand] (org.ovirt.thread.pool-8-thread-33) [15e14103] Command org.ovirt.engine.core.bll.RemoveImageCommand throw Vdc Bll exception

Repeating the removal operation causes data corruption in the PostgreSQL tables.

engine=# SELECT volume_format,image_group_id,creation_date,_update_date,active,it_guid FROM images;
 volume_format | image_group_id |     creation_date      | _update_date | active |               it_guid                
---------------+----------------+------------------------+--------------+--------+--------------------------------------
             4 |                | 2008-04-01 00:00:00+03 |              | t      | 00000000-0000-0000-0000-000000000000
(1 row)

The images table is effectively empty (only a corrupt placeholder row remains).
The lvs command still shows active LVs, and when executing fuser:

[root@camel-vdsc ~]#  fuser -kuc /dev/70dfdfa3-653c-4656-a831-d73a1681b068/21a9407a-089a-4321-a86c-04eb41b58866 
/dev/70dfdfa3-653c-4656-a831-d73a1681b068/21a9407a-089a-4321-a86c-04eb41b58866: 11601(qemu)

we see that the qemu process still uses/locks the LV.

[root@camel-vdsc ~]# lvs -o lv_name,lv_tags
  LV                                   LV Tags                                                                                        
  21a9407a-089a-4321-a86c-04eb41b58866 PU_00000000-0000-0000-0000-000000000000,MD_5,IU__remove_me_d17878dc-55c5-4297-9934-9d2f19adb996
  5c7a16ca-6c01-433f-a402-a176a36e466c MD_6,PU_00000000-0000-0000-0000-000000000000,IU__remove_me_029fd296-9588-46ce-95a6-cd365c9df48c
  5e737af0-aa6f-419d-b48e-156d42c83bdf MD_4,PU_00000000-0000-0000-0000-000000000000,IU__remove_me_2a382a92-0e4c-473f-b5c8-0540386188e9

Adding the lv_tags field shows that a "remove_me" marker has been prepended to the image UUID in each LV's IU_ tag.
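For illustration, the tag convention above can be checked mechanically. The following is a minimal sketch (not actual VDSM code; the helper name and sample data are made up for this example) that scans lv_tags values for the "IU__remove_me_" marker:

```python
# Hypothetical helper, not VDSM code: find LVs whose image-UUID tag (IU_)
# carries the "_remove_me" marker that indicates the image is scheduled
# for deletion.

SAMPLE_TAGS = {
    "21a9407a-089a-4321-a86c-04eb41b58866":
        "PU_00000000-0000-0000-0000-000000000000,MD_5,"
        "IU__remove_me_d17878dc-55c5-4297-9934-9d2f19adb996",
    "lv_root": "",
}

def marked_for_removal(lv_tags):
    """Return LV names whose IU_ tag starts with the _remove_me marker."""
    marked = []
    for name, tags in lv_tags.items():
        for tag in tags.split(","):
            if tag.startswith("IU__remove_me_"):
                marked.append(name)
                break
    return marked

print(marked_for_removal(SAMPLE_TAGS))
```

In the state described in this report, all three image LVs would be reported as marked, even though the removal never completed.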

The async_tasks table is not cleared either:

engine=# SELECT task_type,task_id,command_id,action_type,started_at FROM async_tasks;
 task_type |               task_id                |              command_id              | action_type |         started_at         
-----------+--------------------------------------+--------------------------------------+-------------+----------------------------
         5 | b5627649-074f-4afc-b3bf-a995fcd7bab3 | 6a32e70d-c5d2-45bf-ab5c-1fe7166efa94 |         230 | 2014-07-08 10:04:00.236+03
         5 | 803f412d-4de8-4319-ba3e-51463e8f74f5 | 41e9c6bf-f237-4907-a4dd-4d825bfd1c50 |         230 | 2014-07-08 10:04:12.018+03
         5 | 3946f603-68bf-4d2a-a326-f86f94a601ce | 033357a8-8f55-4a31-92c9-9b5f35e09a1b |         230 | 2014-07-08 10:04:23.273+03
(3 rows)

Then, when we try to remove the block domain, the operation fails with BLL and IRS exceptions.

from oVirt-engine logs:

2014-07-08 10:07:18,797 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] Failed in FormatStorageDomainVDS method
2014-07-08 10:07:18,798 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] Command org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=508, mMessage=Volume Group remove error: ('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',)]]
2014-07-08 10:07:18,805 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] HostName = vdsc
2014-07-08 10:07:18,809 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] Command FormatStorageDomainVDSCommand(HostName = vdsc, HostId = e61ee2aa-fa3c-49fc-a803-533663f6b9c1, storageDomainId=70dfdfa3-653c-4656-a831-d73a1681b068) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Volume Group remove error: ('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',), code = 508
2014-07-08 10:07:18,822 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp--127.0.0.1-8702-7) [42184087] FINISH, FormatStorageDomainVDSCommand, log id: 19707ea1
2014-07-08 10:07:18,829 ERROR [org.ovirt.engine.core.bll.storage.RemoveStorageDomainCommand] (ajp--127.0.0.1-8702-7) [42184087] Command org.ovirt.engine.core.bll.storage.RemoveStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Volume Group remove error: ('VG 70dfdfa3-653c-4656-a831-d73a1681b068 remove failed.',), code = 508 (Failed with error VolumeGroupRemoveError and code 508)
2014-07-08 10:07:18,850 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp--127.0.0.1-8702-7) [42184087] Correlation ID: 575f094, Job ID: 89047a49-fbf6-49a5-b740-faf56f3f02c6, Call Stack: null, Custom Event ID: -1, Message: Failed to remove Storage Domain ISCSI. (User: admin)

and from vdsm's log:

Thread-19::ERROR::2014-07-08 10:07:21,304::task::866::Storage.TaskManager.Task::(_setError) Task=`0b751b9e-e9b1-4e1a-8110-7d757c3238e0`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2760, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2706, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 900, in format
    lvm.removeVG(sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 940, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
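The removeVG failure above is consistent with the fuser output earlier: vgremove cannot proceed while LVs in the VG are still held open by qemu. A toy simulation (not VDSM or LVM code; the class and method names are invented for this example) of that ordering constraint:

```python
# Toy simulation, not VDSM/LVM code: a VG cannot be removed while any of
# its LVs is still held open, which is what happens when qemu keeps the
# hotplugged disk open after a failed unplug.

class VolumeGroupRemoveError(Exception):
    pass

class FakeVG:
    def __init__(self, open_lvs):
        self.open_lvs = set(open_lvs)   # LVs currently held open by a process

    def release(self, lv):
        self.open_lvs.discard(lv)       # e.g. qemu finally closes the device

    def remove(self):
        if self.open_lvs:
            raise VolumeGroupRemoveError(
                "VG remove failed: LVs in use: %s" % sorted(self.open_lvs))

vg = FakeVG(open_lvs={"21a9407a-089a-4321-a86c-04eb41b58866"})
try:
    vg.remove()                         # fails: qemu still holds the LV
except VolumeGroupRemoveError as e:
    print(e)

vg.release("21a9407a-089a-4321-a86c-04eb41b58866")
vg.remove()                             # succeeds once nothing holds the LV
```

This is why formatting the domain keeps failing with code 508 until the qemu process releases the devices.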

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
Setup: have a DC with NFS as master, and an iSCSI domain
1. Create a VM + 4 disks on iSCSI
2. Run the VM
3. Deactivate all the disks quickly
4. After all disks are deactivated, try to activate them
5. Wait for a UI error box
6. Remove the disks (they all become illegal), then remove again
7. Put the iSCSI domain into maintenance and remove it (fails)


Actual results:
Multiple failures and exceptions which lead to data loss.

Expected results:
Removing an image or a block domain should succeed, according to oVirt's docs.

Additional info:
Important note: another ERROR also appears in vdsm's logs every few seconds; please read BZ #1116826 first.

Comment 1 Ori Gofen 2014-07-10 17:45:10 UTC
*** Note *** Happens with virtio block disks only.

Comment 2 Nir Soffer 2014-08-05 16:24:47 UTC
What do you mean in steps 3 and 4?
Hot unplug and plug via the UI?

Comment 3 Ori Gofen 2014-08-05 16:42:13 UTC
(In reply to Nir Soffer from comment #2)
> What do you mean in steps 3 and 4?
> Hot unplug and plug via the UI?

Yes. Actually, time is not a factor here; the LVs are not activated because of how virtio disks behave (they cannot be hotplugged unless an OS is installed on the guest).

Updating Steps to Reproduce:
Setup: have a DC with NFS as master, and an iSCSI domain
1. Create a VM + 4 virtio disks on iSCSI
2. Run the VM
3. Deactivate all the disks one by one from the UI
4. After all disks are deactivated, try to activate them (again from the UI)
5. Wait for a UI error box
6. Remove the disks (they all become illegal), then remove again
7. Put the iSCSI domain into maintenance and remove it (fails)

Comment 4 Ori Gofen 2014-08-05 16:53:16 UTC
Expected results:
Hotplugging virtio disks should be blocked with a CNA message when no OS is present on the guest.

Comment 5 Allon Mureinik 2014-08-05 17:05:06 UTC
The engine does not have any insight into whether a guest OS exists, or whether it supports hotplugging.

Additionally, there's no business use case for running (and hot [un]plugging) VMs with no guest OS.

