Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1449944

Summary: remove & format iscsi storage domain fails sometimes - FormatStorageDomainVDS failed
Product: [oVirt] vdsm Reporter: Avihai <aefrat>
Component: General    Assignee: Maor <mlipchuk>
Status: CLOSED CURRENTRELEASE QA Contact: Avihai <aefrat>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.20.0    CC: aefrat, amureini, bugs, lveyde, mlipchuk, nsoffer, teigland
Target Milestone: ovirt-4.1.5    Keywords: Automation, AutomationBlocker, Regression
Target Release: 4.19.26    Flags: rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+
rule-engine: blocker+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: vdsm v4.19.26 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-23 08:03:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine & vdsm logs
none
engine & vdsm logs scenario2
none
engine & vdsm logs scenario3
none
engine & vdsm logs
none
4.1 reproduction engine & vdsm logs
none
4.2.0-0.0.master.20170606 Engine +VDSM logs
none
repro4_2_20170606 TC18215 engine&vdsm logs
none
sanlock_client_status
none
sanlock_client_status
none
sanlock_client_log_dump
none

Description Avihai 2017-05-11 08:34:42 UTC
Created attachment 1277776 [details]
engine & vdsm logs

Description of problem:
remove & format iscsi storage domain fails sometimes.

Version-Release number of selected component (if applicable):
Engine:
ovirt-engine-4.2.0-0.0.master.20170505124438.git61f971b.el7.centos.noarch

VDSM:
4.20.0-751

How reproducible:
Hard to reproduce, roughly 30% of the time.
Occurs 100% of the time when running many tests at once (tier4 qcow2_v3 TP TCs).


Steps to Reproduce:
1. Create DC + cluster on v3 + iSCSI SD + 1 host (host_mixed_1)
2. Create a VM with a thin disk and one snapshot
3. Attempt to amend the disk & snapshot disks on the old DC -> fails as expected
4. Remove the DC
5. Deactivate & move host host_mixed_1 to another cluster of another DC in V4.
6. Remove the iSCSI SD with the format SD option enabled.


Actual results:
remove & format storage domain fails.

Event issued :
May 11, 2017, 10:48:51 AM VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',)

Expected results:
remove & format storage domain succeeds.

Workaround:
Removing the SD without formatting works.


Additional info:
In the event log we see that host_mixed_1 failed with:
VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',)

Timeline:
10:44:03 s2017-05-11 10:44:04,299 INFO       Test Setup   1: Create data-center dc_upgrade_4_0_to_4_1 with cluster cluster_upgrade_4_0_to_4_1 and host host_mixed_1
10:44:04 2017-05-11 10:44:04,299 INFO       Test Step   2: Add data-center dc_upgrade_4_0_to_4_1
10:44:04 2017-05-11 10:44:04,618 INFO       Test Step   3: Add cluster cluster_upgrade_4_0_to_4_1
10:44:05 2017-05-11 10:44:05,070 INFO       Test Step   4: Move host host_mixed_1 to cluster cluster_upgrade_4_0_to_4_1
10:44:05 2017-05-11 10:44:05,073 INFO       Test Setup   5: Switch host host_mixed_1 to different cluster cluster_upgrade_4_0_to_4_1 with {'activate': True}
10:44:05 2017-05-11 10:44:05,229 INFO       Test Setup   6: Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
10:44:17 2017-05-11 10:44:17,072 INFO       Test Setup   7: Activate host host_mixed_1 (set status to UP) with {'wait': True}
10:45:58 2017-05-11 10:45:58,416 INFO       Test Setup   8: Creating VM vm_TestCase18348_REST_ISCS_1110455841
10:45:58 2017-05-11 10:45:58,625 INFO       Test Setup   9: Description vm_TestCase18348_REST_ISCS_1110455841: add new vm (without starting it) with {'name': 'vm_TestCase18348_REST_ISCS_1110455841', 'cpu_socket': 1, 'copy_permissions': False, 'cpu_cores': 1, 'cluster': 'cluster_upgrade_4_0_to_4_1', 'display_type': 'spice', 'template': 'Blank', 'memory': 1073741824, 'async': False, 'os_type': 'RHEL6x64', 'type': 'desktop', 'wait': True}
10:46:00 2017-05-11 10:46:00,427 INFO       Test Setup  10: Add vNIC name nic1 to VM vm_TestCase18348_REST_ISCS_1110455841 with {'vnic_profile': 'ovirtmgmt', 'network': 'ovirtmgmt', 'plugged': 'true', 'interface': 'virtio', 'linked': 'true'}
10:46:24 2017-05-11 10:46:24,117 INFO       Test Setup  11: Creating snapshot snap_TestCase18348_REST_ISCS_11104624110 of VM vm_TestCase18348_REST_ISCS_1110455841
10:46:24 2017-05-11 10:46:24,120 INFO       Test Setup  12: Description snap_TestCase18348_REST_ISCS_11104624110: Add snapshot to VM vm_TestCase18348_REST_ISCS_1110455841 with {'wait': True}
10:46:43 2017-05-11 10:46:43,141 INFO 010: storage/rhevmtests.storage.storage_qcow2_v3.test_qcow2_v3.TestCase18348_REST_ISCSI.test_amend_old_dc
10:46:43 2017-05-11 10:46:43,141 INFO   API: REST
10:46:43 2017-05-11 10:46:43,141 INFO   STORAGE: ISCSI
10:46:43 2017-05-11 10:46:43,141 INFO   POLARION: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-18348
10:46:43 2017-05-11 10:46:43,513 INFO       Test Step  13: Amend qcow disk vm_TestCase18348_REST_ISCS_1110455841_Disk1 on old DC dc_upgrade_4_0_to_4_1 , expected to fail
10:46:44 2017-05-11 10:46:44,509 INFO Result: PASSED
10:46:44 .2017-05-11 10:46:44,510 INFO       Test Teardown  14: Remove VM vm_TestCase18348_REST_ISCS_1110455841
10:46:45 2017-05-11 10:46:45,082 INFO       Test Teardown  15: Remove VM vm_TestCase18348_REST_ISCS_1110455841 with {'wait': False}
10:46:57 2017-05-11 10:46:57,642 INFO       Test Teardown  16: Data center's dc_upgrade_4_0_to_4_1 master domain is upgrade_4_0_to_4_1iscsi0
10:46:58 2017-05-11 10:46:58,307 INFO       Test Teardown  17: Clean data-center dc_upgrade_4_0_to_4_1 and remove it
10:46:58 2017-05-11 10:46:58,490 INFO       Test Step  18: Deactivate storage-domain upgrade_4_0_to_4_1iscsi0 
10:47:43 2017-05-11 10:47:43,012 INFO       Test Step  19: Remove data-center dc_upgrade_4_0_to_4_1
10:47:57 2017-05-11 10:47:57,679 INFO       Test Step  20: Move host host_mixed_1 to cluster cluster_upgrade_4_0_to_4_1
10:47:57 2017-05-11 10:47:57,681 INFO       Test Teardown  21: Switch host host_mixed_1 to different cluster golden_env_mixed_1 with {'activate': True}
10:47:57 2017-05-11 10:47:57,839 INFO       Test Teardown  22: Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
10:48:09 2017-05-11 10:48:09,391 INFO       Test Teardown  23: Activate host host_mixed_1 (set status to UP) with {'wait': True}
10:56:14 2017-05-11 10:56:14,482 INFO       Test Teardown  24: Remove storage domain upgrade_4_0_to_4_1glusterfs0

Tried again manually to remove the SD at 2017-05-11 11:05 and got the same issue.

Engine log:
2017-05-11 10:48:50,429+03 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler2) [8e1d481] Cleared all tasks of pool 'efc98c0e-7caa-44b9-b8eb-0249b30596e5'.
2017-05-11 10:48:51,250+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (default task-16) [] Failed in 'FormatStorageDomainVDS' method
2017-05-11 10:48:51,270+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-16) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',)
2017-05-11 10:48:51,270+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (default task-16) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand' return value 'StatusOnlyReturn [status=Status [code=508, message=Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',)]]'
2017-05-11 10:48:51,270+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (default task-16) [] HostName = host_mixed_1
2017-05-11 10:48:51,270+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (default task-16) [] Command 'FormatStorageDomainVDSCommand(HostName = host_mixed_1, FormatStorageDomainVDSCommandParameters:{runAsync='true', hostId='5a915aa1-65c2-44d9-8371-36ba5798a8f0', storageDomainId='8ad06eb6-e517-4c2d-9e0a-1ff267a15715'})' execution failed: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',), code = 508
2017-05-11 10:48:51,270+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (default task-16) [] FINISH, FormatStorageDomainVDSCommand, log id: 4ba15325
2017-05-11 10:48:51,271+03 ERROR [org.ovirt.engine.core.bll.storage.domain.RemoveStorageDomainCommand] (default task-16) [] Command 'org.ovirt.engine.core.bll.storage.domain.RemoveStorageDomainCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',), code = 508 (Failed with error VolumeGroupRemoveError and code 508)
2017-05-11 10:48:51,282+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-16) [] EVENT_ID: USER_REMOVE_STORAGE_DOMAIN_FAILED(961), Failed to remove Storage Domain upgrade_4_0_to_4_1iscsi0. (User: admin@internal-authz)
2017-05-11 10:48:51,291+03 INFO  [org.ovirt.engine.core.bll.storage.domain.RemoveStorageDomainCommand] (default task-16) [] Lock freed to object 'EngineLock:{exclusiveLocks='[8ad06eb6-e517-4c2d-9e0a-1ff267a15715=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-05-11 10:48:51,292+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-16) [] Operation Failed: [Volume Group remove error]

 
VDSM log :
2017-05-11 10:48:51,179+0300 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='2779718a-edc3-415c-b17d-a96b701269b0') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
    lvm.removeVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG 8ad06eb6-e517-4c2d-9e0a-1ff267a15715 remove failed.',)
2017-05-11 10:48:51,239+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='2779718a-edc3-415c-b17d-a96b701269b0') Task._run: 2779718a-edc3-415c-b17d-a96b701269b0 (u'8ad06eb6-e517-4c2d-9e0a-1ff267a15715', False) {} failed - stopping task (task:890)
2017-05-11 10:48:51,240+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='2779718a-edc3-415c-b17d-a96b701269b0') stopping in state preparing (force False) (task:1251)
2017-05-11 10:48:51,240+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='2779718a-edc3-415c-b17d-a96b701269b0') ref 1 aborting True (task:998)
2017-05-11 10:48:51,241+0300 INFO  (jsonrpc/0) [storage.TaskManager.Task] (Task='2779718a-edc3-415c-b17d-a96b701269b0') aborting: Task is aborted: 'Volume Group remove error' - code 508 (task:1176)
2017-05-11 10:48:51,241+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='2779718a-edc3-415c-b17d-a96b701269b0') Prepare: aborted: Volume Group remove error (task:1181)

Comment 1 Avihai 2017-05-11 10:55:16 UTC
More info:

1) The issue is seen only on 4.2 and only on iSCSI (the same scenario also runs on NFS & gluster and the issue is not seen there).

2) I ran the same test suite on 4.1.2.1 multiple times and the issue does not reproduce.

3) Another scenario (automation test case) where I saw this issue occur is detailed below. Logs from this case are attached as well.

Details :
1. Create DC + cluster on v3 + iSCSI storage domain
2. Create a VM with a thin disk and one snapshot
3. Move the storage domain to maintenance
4. Attempt to amend the disk & snapshot disks on the old DC -> should fail
5. Reactivate the iSCSI storage domain
6. Remove the DC
7. Deactivate & move host host_mixed_1 to another cluster of another DC in V4.
8. Remove the original cluster (created at step 1)
9. Remove the iSCSI SD with the format SD option enabled.

Error seen in VDSM log: 
2017-05-11 12:52:05,181+0300 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='5b612dc4-40ea-44fb-a3a8-6a23b0bfa3ae') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
    lvm.removeVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG d2e4878c-fb00-4488-934d-c546b1146211 remove failed.',)
2017-05-11 12:52:05,183+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='5b612dc4-40ea-44fb-a3a8-6a23b0bfa3ae') Task._run: 5b612dc4-40ea-44fb-a3a8-6a23b0bfa3ae (u'd2e4878c-fb00-4488-934d-c546b1146211', False) {} failed - stopping task (task:890)

Comment 2 Avihai 2017-05-11 10:56:14 UTC
Created attachment 1277828 [details]
engine & vdsm logs scenario2

Comment 3 Avihai 2017-05-11 12:12:08 UTC
The issue is seen again in another scenario (#3), similar to the scenario 2 automation run (after blocking scenarios 1 & 2).

Raising severity to high.

Scenario 3 Details:
1. Create DC + cluster on v3 + iSCSI storage domain
2. Create a VM with a thin disk and one snapshot
3. Attempt to amend the disk & snapshot disks on the old DC -> should fail
4. Remove the DC
5. Deactivate & move host host_mixed_1 to another cluster of another DC in V4.
6. Remove the original cluster (created at step 1)
7. Remove the iSCSI SD with the format SD option enabled -> the issue occurs.

Timestamp when issue occurred - May 11, 2017, 2:34:41 PM. 

VDSM log:
2017-05-11 14:34:41,966+0300 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='5ce71b7a-5964-45bc-9933-cf098fe6fd38') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
    lvm.removeVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG 2a3d0825-c3e9-4bac-a4b8-438eab1f552b remove failed.',)

Comment 4 Avihai 2017-05-11 12:16:10 UTC
Created attachment 1277870 [details]
engine & vdsm logs scenario3

Added the attachment with the scenario 3 logs.

This blocks automation TCs 18348, 18350 and 18351, matching scenarios 1, 2 and 3.

It also blocks the following TCs: since the storage domain is not removed, subsequent tests fail in their SETUP stage because they try to create a storage domain with the same name & path (for iSCSI).

Comment 5 Allon Mureinik 2017-05-14 10:27:51 UTC
The interesting part of the error seems to be that the "ids" LV remains active. E.g., on the line prior to the given stacktrace:

2017-05-11 12:52:05,170+0300 DEBUG (jsonrpc/0) [storage.Misc.excCmd] FAILED: <err> = '  Logical volume d2e4878c-fb00-4488-934d-c546b1146211/ids in use.\n'; <rc> = 5 (commands:93)
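
For context, a minimal sketch of why the format path ends in VolumeGroupRemoveError (this is standard LVM tooling driven from Python, not vdsm's actual code path; the VG name is taken from the log line above): the VG cannot be removed while one of its LVs is still held open.

import subprocess

VG = "d2e4878c-fb00-4488-934d-c546b1146211"  # SD/VG UUID from the log above

def run(*cmd):
    # Run an LVM command and report its exit status, like the excCmd line above.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    print("rc=%s err=%s" % (p.returncode, err.strip()))
    return p.returncode

# Deactivating the VG's LVs is where it breaks: while sanlock keeps
# /dev/<VG>/ids open, LVM answers "Logical volume <VG>/ids in use" with rc=5.
if run("vgchange", "--available", "n", VG) == 0:
    # Only once nothing holds the LVs open can the VG itself be removed.
    run("vgremove", "-f", VG)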

Comment 6 Red Hat Bugzilla Rules Engine 2017-05-14 10:47:28 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 7 Maor 2017-05-15 03:00:57 UTC
Avihai,

Does it reproduce also if you don't call amend?

With the following steps:
1. Create DC + cluster on v3 + iscsi storage domain
2. Create a VM with thin disk and one snapshot
4. remove DC
5. Deactivate & Move host host_mixed_1 to another cluster of another DC in V4.
6. Remove original cluster (created at step 1) 
7. Remove iscsi SD  with format SD option enabled

Comment 8 Avihai 2017-05-15 10:15:30 UTC
(In reply to Maor from comment #7)
> Avihai,
> 
> Does it reproduce also if you don't call amend?
> 
> With the following steps:
> 1. Create DC + cluster on v3 + iscsi storage domain
> 2. Create a VM with thin disk and one snapshot
> 4. remove DC
> 5. Deactivate & Move host host_mixed_1 to another cluster of another DC in
> V4.
> 6. Remove original cluster (created at step 1) 
> 7. Remove iscsi SD  with format SD option enabled

I ran the scenario without amend 10 times & the issue did not reproduce.

Comment 9 Avihai 2017-05-15 11:29:59 UTC
Ran on v4.1; now running on 4.2 (where the original issue occurred). Will update when the 10x runs are done.

Comment 10 Avihai 2017-05-15 15:01:46 UTC
Ran on v4.2 without amend and the issue reproduced.

2017-05-15 17:47:20,615+0300 ERROR (jsonrpc/1) [storage.Dispatcher] {'status': {'message': "Volume Group remove error: ('VG 3048ed4c-ed05-464e-95f7-75b5c73aee54 remove failed.',)", 'code': 508}} (dispatcher:78)


Attached logs as usual.

Comment 11 Avihai 2017-05-15 15:02:32 UTC
Created attachment 1279034 [details]
engine & vdsm logs

Comment 13 Maor 2017-05-23 12:16:26 UTC
It seems like a sanlock sync issue...
I was not able to reproduce it on master, and it does not seem to fail consistently. I will try to dig into the logs once more to try to find any clues.

Based on Avihai's comments it seems that this issue only reproduces in 4.2 and does not occur in 4.1, so I believe that the target version should be changed to 4.2.

Comment 14 Avihai 2017-05-23 12:40:46 UTC
(In reply to Maor from comment #13)
> It seems like a sanlock sync issue...
> I was not able to reproduce it on master, and it does not seem to fail
> consistently. I will try to dig into the logs once more to try to find any
> clues.
> 
> Based on Avihai's comments it seems that this issue only reproduces in 4.2
> and does not occur in 4.1, so I believe that the target version should be
> changed to 4.2

As we discussed, this issue occurred only on 4.2 & is very hard to reproduce; moving the version to 4.2.

Comment 15 Avihai 2017-06-07 06:31:11 UTC
The same issue also reproduced recently on 4.1.3 in a tier2 run.

https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.1v4-ge-runner-tier2/50/

Test :
test_qcow2_v3.TestCase18215_REST_ISCSI.test_upgrade_dc

Logs attached.

Comment 16 Avihai 2017-06-07 06:33:29 UTC
Created attachment 1285663 [details]
4.1 reproduction engine & vdsm logs

Comment 17 Avihai 2017-06-09 04:12:58 UTC
As requested by Maor, I checked whether this issue still occurs on the latest 4.2 - the answer is YES :)

The same issue also occurred on the latest 4.2 in another new scenario (#4), as all the prior scenarios are blocked from running due to this bug.

oVirt Engine Version: 4.2.0-0.0.master.20170606135217.gita02bf9d.el7.centos


Event issued:
Jun 8, 2017, 6:01:46 PM VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 11506d82-1c10-4ac1-83c9-32aee93eda7c remove failed.',)

VDSM log:
2017-06-08 18:01:46,868+0300 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH formatStorageDomain error=Volume Group remove error: ('VG 11506d82-1c10-4ac1-83c9-32aee93eda7c remove failed.',) (dispatcher:79)


Attached Engine & VDSM logs.

* The issue is VERY HARD to reproduce & does not occur in one specific scenario but across a bundle of scenarios, each time in a different place (it looks like a purely statistical/timing bug) -> so there is no simple step-by-step reproduction scenario that Dev can easily reproduce in a debug environment.

* IMHO, a debug VDSM build with all the needed hooks/debugging is needed, to effectively turn my environment into a debug environment.

Comment 18 Avihai 2017-06-09 04:14:21 UTC
Created attachment 1286300 [details]
4.2.0-0.0.master.20170606 Engine +VDSM logs

Comment 19 Avihai 2017-06-10 04:01:48 UTC
Another failure of the same issue on the latest 4.2. This time it occurred in the TestCase18215 teardown (scenario #5; the comment above occurred at TestCase18346), which starts the test plan suite, so this issue now seems to block more tests than before.


Test - TestCase18215_REST_ISCSI

Timeline & scenario:
18:00:07 2017-06-09 18:00:06,748 INFO       Test Setup   1: Create data-center dc_upgrade_4_0_to_4_1 with cluster cluster_upgrade_4_0_to_4_1 and host host_mixed_1
18:00:07 2017-06-09 18:00:06,748 INFO       Test Step   2: Add data-center dc_upgrade_4_0_to_4_1
18:00:07 2017-06-09 18:00:07,081 INFO       Test Step   3: Add cluster cluster_upgrade_4_0_to_4_1
18:00:07 2017-06-09 18:00:07,519 INFO       Test Step   4: Move host host_mixed_1 to cluster cluster_upgrade_4_0_to_4_1
18:00:07 2017-06-09 18:00:07,523 INFO       Test Setup   5: Switch host host_mixed_1 to different cluster cluster_upgrade_4_0_to_4_1 with {'activate': True}
18:00:07 2017-06-09 18:00:07,685 INFO       Test Setup   6: Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
18:00:19 2017-06-09 18:00:18,704 INFO       Test Setup   7: Update properties of host host_mixed_1 (provided in parameters) with {'cluster': 'cluster_upgrade_4_0_to_4_1'}
18:00:19 2017-06-09 18:00:19,282 INFO       Test Setup   8: Activate host host_mixed_1 (set status to UP) with {'wait': True}
18:02:06 2017-06-09 18:02:06,004 INFO 001: storage/rhevmtests.storage.storage_qcow2_v3.test_qcow2_v3.TestCase18215_REST_ISCSI.test_upgrade_dc
18:02:06 2017-06-09 18:02:06,004 INFO   API: REST
18:02:06 2017-06-09 18:02:06,004 INFO   STORAGE: ISCSI
18:02:06 2017-06-09 18:02:06,004 INFO   POLARION: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-18215
18:02:06 2017-06-09 18:02:06,004 INFO       Test Setup   9: Upgrading cluster cluster_upgrade_4_0_to_4_1 from version 4.0 to version 4.1 
18:02:06 2017-06-09 18:02:06,007 INFO       Test Step  10: Update cluster cluster_upgrade_4_0_to_4_1 with {'version': '4.1'}
18:02:06 2017-06-09 18:02:06,335 INFO       Test Setup  11: Upgrading data center dc_upgrade_4_0_to_4_1 from version 4.0 to version 4.1 
18:02:11 2017-06-09 18:02:11,756 INFO Result: PASSED
18:02:11 .2017-06-09 18:02:11,960 INFO       Test Teardown  12: Data center's dc_upgrade_4_0_to_4_1 master domain is upgrade_4_0_to_4_1iscsi0
18:02:12 2017-06-09 18:02:12,526 INFO       Test Teardown  13: Clean data-center dc_upgrade_4_0_to_4_1 and remove it
18:02:12 2017-06-09 18:02:12,773 INFO       Test Step  14: Deactivate storage-domain upgrade_4_0_to_4_1iscsi0 
18:02:57 2017-06-09 18:02:57,814 INFO       Test Step  15: Remove data-center dc_upgrade_4_0_to_4_1
18:03:14 2017-06-09 18:03:14,407 INFO       Test Step  16: Move host host_mixed_1 to cluster cluster_upgrade_4_0_to_4_1
18:03:14 2017-06-09 18:03:14,409 INFO       Test Teardown  17: Switch host host_mixed_1 to different cluster golden_env_mixed_1 with {'activate': True}
18:03:14 2017-06-09 18:03:14,582 INFO       Test Teardown  18: Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
18:03:25 2017-06-09 18:03:25,682 INFO       Test Teardown  19: Update properties of host host_mixed_1 (provided in parameters) with {'cluster': 'golden_env_mixed_1'}
18:03:26 2017-06-09 18:03:26,317 INFO       Test Teardown  20: Activate host host_mixed_1 (set status to UP) with {'wait': True}
18:03:47 2017-06-09 18:03:47,408 INFO       Test Teardown  21: Remove storage domain upgrade_4_0_to_4_1iscsi0

ISSUE occurs at 18:04:20 2017-06-09: VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 828770f4-66a1-4404-999a-2dc7a1badaf6 remove failed.',)


From VDSM log:
2017-06-09 18:04:19,997+0300 ERROR (jsonrpc/5) [storage.TaskManager.Task] (Task='c551c692-a56c-4e63-bd59-02c95cad59ba') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in formatStorageDomain
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2628, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2571, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
    lvm.removeVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)

2017-06-09 18:04:20,010+0300 ERROR (jsonrpc/5) [storage.Dispatcher] FINISH formatStorageDomain error=Volume Group remove error: ('VG 828770f4-66a1-4404-999a-2dc7a1badaf6 remove failed.',) (dispatcher:79)

Comment 20 Avihai 2017-06-10 04:05:35 UTC
Created attachment 1286574 [details]
repro4_2_20170606 TC18215 engine&vdsm logs

Comment 21 Avihai 2017-06-28 03:58:08 UTC
The issue still occurs on the latest 4.1 & 4.2; any news on this issue?
The latest reproduction was on a 4.1 TC.

Comment 22 Avihai 2017-06-28 05:14:14 UTC
The issue started recurring more frequently on the latest 4.1.3.4 & still occurs sporadically, at an even higher frequency, on the latest 4.2.

On 4.1 it was most recently seen in these TCs:
TestCase18336
TestCase18305
TestCase18348
TestCase18350
TestCase18351

Please give this priority; it has been blocking many tests on both 4.1 & 4.2 for a long time now.

There is no single reproducible scenario that causes this; maybe we should consider a debugger or a special debug build if the logs are not enough.

Comment 25 Maor 2017-06-28 10:03:06 UTC
I was looking into the host Avihai worked with; at first sight it seems that sanlock does not release the resource.

David,
I was wondering if you might have any insights about this.

It looks like any attempt to remove a VG will show the following error in VDSM:

2017-06-28 11:44:52,430+0300 DEBUG (jsonrpc/1) [storage.Misc.excCmd] FAILED: <err> = '  Logical volume 25fffa21-b6b5-45ed-ac85-cc540638488d/ids in use.\n'; <rc> = 5 (commands:94)

When I try to deactivate the LV from the shell I can see that it is still in use:
# lvchange -an 25fffa21-b6b5-45ed-ac85-cc540638488d/ids
  Logical volume 25fffa21-b6b5-45ed-ac85-cc540638488d/ids in use.

It looks like sanlock is holding the volume open:
# fuser -vam /dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids
                     USER        PID ACCESS COMMAND
/dev/dm-32:          sanlock     683 F.... sanlock

Although, looking into the sanlock logs, I haven't seen any errors, and IINM it looks like the "ids" LV was released:

2017-06-28 11:41:55 150466 [13292]: s307 lockspace 25fffa21-b6b5-45ed-ac85-cc540638488d:250:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0
and
2017-06-28 11:42:24 150496 [13524]: s308 lockspace 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0

Any help would be appreciated.

Comment 26 David Teigland 2017-06-28 14:29:02 UTC
Those log messages are created when the lockspace is added, not released.  I've not seen any info from sanlock itself above.  When this happens I'd need to get 'sanlock client status', 'sanlock client log_dump' output.  If that's not possible, configure sanlock to save all debugging in a log file (see http://people.redhat.com/teigland/sanlock-messages.txt)
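
If it helps, a small hook of this sort could capture that output automatically whenever FormatStorageDomainVDS fails. This is an assumed helper for the automation teardown, not part of vdsm; the two commands are exactly the ones named above.

import subprocess
import time

def dump_sanlock_state(prefix="/tmp/sanlock"):
    # Save 'sanlock client status' and 'sanlock client log_dump' to files so
    # the state is preserved even if the environment is recycled afterwards.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for sub in ("status", "log_dump"):
        out = subprocess.check_output(["sanlock", "client", sub])
        with open("%s-%s-%s.txt" % (prefix, sub, stamp), "wb") as f:
            f.write(out)

# Call this right after a failed formatStorageDomain / FormatStorageDomainVDS.
dump_sanlock_state()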

Is sanlock_rem_lockspace() being called with the ASYNC or UNUSED flag?

Comment 27 Maor 2017-06-28 14:44:49 UTC
Created attachment 1292663 [details]
sanlock_client_status

Comment 28 Maor 2017-06-28 14:45:18 UTC
Created attachment 1292664 [details]
sanlock_client_status

Comment 29 Maor 2017-06-28 14:46:00 UTC
Created attachment 1292665 [details]
sanlock_client_log_dump

Comment 30 Maor 2017-06-28 14:52:18 UTC
> Is sanlock_rem_lockspace() being called with the ASYNC or UNUSED flag?

I'm not sure how to check this?

Comment 31 Avihai 2017-06-28 15:09:56 UTC
Maor, the issue reproduced again on the same setup.


VDSM (storage-ge1-vdsm1.scl.lab.tlv.redhat.com):
2017-06-28 18:01:40,055+0300 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH formatStorageDomain error=Volume Group remove error: ('VG df672555-20de-4a26-89ca-39f1fb2e767d remove failed.',) (dispatcher:79)

Comment 32 Avihai 2017-06-28 15:10:47 UTC
Maor, the setup is yours if you need it; ping me when done.

Comment 33 David Teigland 2017-06-28 15:38:19 UTC
sanlock client status shows one lockspace still running using an LV:

s 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0

sanlock client log_dump does not show any rem_lockspace call for this lockspace.

So, sanlock still has that "ids" LV open because the lockspace hasn't been stopped.

(rem_lockspace() flags don't matter as it's not been called.)
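
A quick scripted version of that check follows. It is an illustrative helper, assuming only the 'sanlock client status' output format shown above, where joined lockspaces appear as lines of the form "s <lockspace>:<host_id>:<path>:<offset>".

import subprocess

SD_UUID = "25fffa21-b6b5-45ed-ac85-cc540638488d"  # the domain from the status output above

status = subprocess.check_output(["sanlock", "client", "status"]).decode()
# A lockspace that is still joined shows up as "s <lockspace>:<host_id>:<path>:<offset>".
joined = any(line.startswith("s " + SD_UUID + ":") for line in status.splitlines())
print("lockspace still joined: %s" % joined)
# While this prints True, the "ids" LV stays open and vgremove keeps failing.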

Comment 36 Maor 2017-06-29 12:57:22 UTC
Nir,

I do see from the logs that the host id is being released:
   "2017-05-11 14:31:26,965+0300 INFO  (jsonrpc/3) [storage.SANLock] Releasing host id for domain 2a3d0825-c3e9-4bac-a4b8-438eab1f552b (id: 250)" 
I guess that should release the lockspace from the storage domain.
David indicated in his previous comments that there is no rem_lockspace in the sanlock logs.
Is there a way to know at which point the release is being called?

Comment 37 David Teigland 2017-06-29 15:40:26 UTC
Keep in mind that you need to look at the host_id being used with the lockspace.
The same lockspace name is being removed with host_id 250, and then added with host_id 1.

rem_lockspace 25fff with host_id 250:

2017-06-28 11:42:18 150490 [700]: cmd_rem_lockspace 3,10 25fffa21-b6b5-45ed-ac85-cc540638488d flags 0
2017-06-28 11:42:20 150492 [10932]: s307 delta_release begin 25fffa21-b6b5-45ed-ac85-cc540638488d:250
2017-06-28 11:42:20 150492 [10932]: s307 delta_release done 250 1 0
2017-06-28 11:42:20 150492 [683]: s307 free lockspace
2017-06-28 11:42:20 150492 [700]: cmd_rem_lockspace 3,10 done 0

add_lockspace 25fff with host_id 1:

2017-06-28 11:42:24 150496 [13524]: cmd_add_lockspace 3,10 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0 flags 0 timeout 0
2017-06-28 11:42:24 150496 [13524]: s308 lockspace 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0
2017-06-28 11:42:24 150496 [11163]: s308 delta_acquire begin 25fffa21-b6b5-45ed-ac85-cc540638488d:1
2017-06-28 11:42:24 150496 [11163]: s308 delta_acquire write 1 1 150496 0c7e9f21-82a9-4379-a320-f0c41f202d32.storage-ge
2017-06-28 11:42:24 150496 [11163]: s308 delta_acquire delta_short_delay 20
2017-06-28 11:42:44 150516 [11163]: s308 delta_acquire done 1 1 150496
2017-06-28 11:42:45 150517 [13524]: s308 add_lockspace done
2017-06-28 11:42:45 150517 [13524]: cmd_add_lockspace 3,10 done 0

Comment 38 Avihai 2017-07-04 10:49:48 UTC
Guys, do we have any update on this issue?

Comment 39 Avihai 2017-07-11 05:38:08 UTC
This issue blocks the qcow2_v3 test plan automation on both 4.1 & 4.2.
Please give this higher priority.

Any ETA for a fix?

Comment 40 Maor 2017-07-11 12:21:47 UTC
If there is a way to reproduce this consistently it would be helpful.
Nir, here is the filtered VDSM log (see [1]) with all the engine calls; do you see any point where the release of the host id could be missed?

[1]
2017-05-11 14:33:12,894+03 INFO  [org.ovirt.engine.core.bll.storage.domain.ActivateStorageDomainCommand]

2017-05-11 14:33:12,125+0300 INFO  (jsonrpc/0) [dispatcher] Run and protect: deactivateStorageDomain(sdUUID=u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b', spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323
eb58', msdUUID=u'00000000-0000-0000-0000-000000000000', masterVersion=1, options=None) (logUtils:51)
  2017-05-11 14:33:12,134+0300 INFO  (jsonrpc/0) [dispatcher] Run and protect: deactivateStorageDomain, Return response: None (logUtils:54)

2017-05-11 14:33:12,190+0300 INFO  (jsonrpc/1) [dispatcher] Run and protect: spmStop(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', options=None) (logUtils:51
  2017-05-11 14:33:12,678+0300 INFO  (jsonrpc/1) [dispatcher] Run and protect: spmStop, Return response: None (logUtils:54)

2017-05-11 14:33:12,701+0300 INFO  (jsonrpc/4) [dispatcher] Run and protect: disconnectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, remove=False, options=None) (logU
tils:51)
  2017-05-11 14:33:16,315+0300 INFO  (jsonrpc/4) [dispatcher] Run and protect: disconnectStoragePool, Return response: True (logUtils:54)

2017-05-11 14:33:16,333+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: disconnectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, remove=False, options=None) (logU
tils:51)
  2017-05-11 14:33:16,334+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: disconnectStoragePool, Return response: None (logUtils:54)

2017-05-11 14:33:16,416+0300 INFO  (jsonrpc/6) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[], options=None) (log
Utils:51)
  2017-05-11 14:33:16,733+0300 INFO  (jsonrpc/6) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)

2017-05-11 14:33:16,769+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[], options=None) (log
Utils:51)
  2017-05-11 14:33:17,056+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)

2017-05-11 14:33:24,996+0300 INFO  (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[{u'id': u'34c89bc7-23e1-
402d-a070-467825318765', u'connection': u'10.35.146.161', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u
'3260'}, {u'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be', u'connection': u'10.35.146.193', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u'3260'}, {u'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e', u'connection': u'10.35.146.129', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00'
, u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434', u'connection': u'10.35.146.225', u'iqn': u'iqn.2008-05.com.xtremio
:xio00153500071-514f0c50023f6c05', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], options=None) (logUtils:51)
  2017-05-11 14:33:27,676+0300 INFO  (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'34c89bc7-23e1-402d-a070-467825318765
'}, {'status': 0, 'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be'}, {'status': 0, 'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e'}, {'status': 0, 'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434'}]}
 (logUtils:54)

Engine log: 
    2017-05-11 14:40:21,893+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]

2017-05-11 14:33:27,691+0300 INFO  (jsonrpc/4) [dispatcher] Run and protect: connectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, msdUUID=u'2a3d0825-c3e9-4bac-a4b8-43
8eab1f552b', masterVersion=1, domainsMap={u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b': u'active'}, options=None) (logUtils:51)
  2017-05-11 14:33:29,757+0300 INFO  (jsonrpc/4) [dispatcher] Run and protect: connectStoragePool, Return response: True (logUtils:54)

2017-05-11 14:33:29,852+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: connectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, msdUUID=u'2a3d0825-c3e9-4bac-a4b8-43
8eab1f552b', masterVersion=1, domainsMap={u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b': u'active'}, options=None) (logUtils:51)
  2017-05-11 14:33:31,741+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: connectStoragePool, Return response: True (logUtils:54)

2017-05-11 14:33:31,945+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: spmStart(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', prevID=-1, prevLVER=u'-1', maxHostID=250, domVersion=u'3
', options=None) (logUtils:51)
  2017-05-11 14:33:31,950+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: spmStart, Return response: None (logUtils:54)

2017-05-11 14:33:34,331+0300 INFO  (jsonrpc/4) [dispatcher] Run and protect: destroyStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, options=None) (logUtils:51)
  2017-05-11 14:33:38,735+0300 INFO  (jsonrpc/4) [dispatcher] Run and protect: destroyStoragePool, Return response: True (logUtils:54)

2017-05-11 14:33:38,939+0300 INFO  (jsonrpc/6) [dispatcher] Run and protect: disconnectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, remove=False, options=None) (logU
tils:51)
  2017-05-11 14:33:38,941+0300 INFO  (jsonrpc/6) [dispatcher] Run and protect: disconnectStoragePool, Return response: None (logUtils:54)


2017-05-11 14:33:39,004+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[], options=None) (log
Utils:51)
  2017-05-11 14:33:39,306+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)


Engine: 2017-05-11 14:31:23,189+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand

2017-05-11 14:33:54,380+0300 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer(domType=7, spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', conList=[{u'id': u'785d9f22-26be-
42e5-9a4f-c6306db74831', u'connection': u'gluster-server01.qa.lab.tlv.redhat.com:/storage_local_ge4_volume_0', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password
': '********', u'port': u''}, {u'id': u'4eb9dab9-0b8d-441e-b86e-0f7ac2835ee3', u'connection': u'gluster-server01.qa.lab.tlv.redhat.com:/storage_local_ge4_volume_1', u'iqn': u'', u'user': u''
, u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}, {u'id': u'fbdc338d-d9ad-41e4-8dc6-dfb18a2dea18', u'connection': u'gluster-server01.qa.lab.tlv.redhat.com:/
storage_local_ge4_volume_2', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], options=None) (logUtils:51)
2017-05-11 14:33:54,427+0300 DEBUG (jsonrpc/6) [storage.fileUtils] Creating directory: /rhev/data-center/mnt/glusterSD/gluster-server01.qa.lab.tlv.redhat.com:_storage__local__ge4__volume__0 
mode: None (fileUtils:197
  2017-05-11 14:33:56,481+0300 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'785d9f22-26be-42e5-9a4f-c6306db74831
'}, {'status': 0, 'id': u'4eb9dab9-0b8d-441e-b86e-0f7ac2835ee3'}, {'status': 0, 'id': u'fbdc338d-d9ad-41e4-8dc6-dfb18a2dea18'}]} (logUtils:54)


2017-05-11 14:33:56,500+0300 INFO  (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', conList=[{u'id': u'6f5283c8-8360-
4a2a-a561-50178ce35782', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/storage_local_ge4_nfs_0', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': '********', u'port':
 u''}, {u'id': u'0e89156a-01ed-4a9b-8735-d028847768a6', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/storage_local_ge4_nfs_1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u''}, {u'id': u'8b90456d-4748-4ab6-af95-8238b0d803d4', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/storage_local_ge4_nfs_2', u'iqn': u'', 
u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u''}, {u'id': u'c3bc79f1-428e-4ea1-adf6-90159781e68f', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/export_d
omain_storage_local_ge4', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u''}], options=None) (logUtils:51)
  2017-05-11 14:33:57,025+0300 INFO  (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'6f5283c8-8360-4a2a-a561-50178ce35782
'}, {'status': 0, 'id': u'0e89156a-01ed-4a9b-8735-d028847768a6'}, {'status': 0, 'id': u'8b90456d-4748-4ab6-af95-8238b0d803d4'}, {'status': 0, 'id': u'c3bc79f1-428e-4ea1-adf6-90159781e68f'}]}
 (logUtils:54)


2017-05-11 14:33:57,054+0300 INFO  (jsonrpc/0) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', conList=[{u'id': u'bf6d36c6-df06-
47b8-a0da-d22560e20b7e', u'connection': u'10.35.146.129', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u
'3260'}, {u'id': u'34c89bc7-23e1-402d-a070-467825318765', u'connection': u'10.35.146.161', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u'3260'}, {u'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be', u'connection': u'10.35.146.193', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04'
, u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434', u'connection': u'10.35.146.225', u'iqn': u'iqn.2008-05.com.xtremio
:xio00153500071-514f0c50023f6c05', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], options=None) (logUtils:51)
2017-05-11 14:33:57,055+0300 DEBUG (jsonrpc/0) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /usr/sbin/iscsiadm -m node -T iqn
  2017-05-11 14:33:59,756+0300 INFO  (jsonrpc/0) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e
'}, {'status': 0, 'id': u'34c89bc7-23e1-402d-a070-467825318765'}, {'status': 0, 'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be'}, {'status': 0, 'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434'}]}
 (logUtils:54)


2017-05-11 14:33:59,799+0300 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStoragePool(spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', hostID=1, msdUUID=u'8837a191-d72c-42bb-925f-07
99adfaa1d6', masterVersion=1, domainsMap={u'd20921cb-df0c-4c2b-8347-7d0b82b4daa8': u'active', u'24b505ca-3827-4d1d-9309-0f559f977f0c': u'active', u'c05d32d5-bc3a-4845-b0e5-16a1f2dd8ef1': u'a
ctive', u'f1cac489-d7a8-4054-a093-c128600d8ed0': u'active', u'95324129-bfea-4ce4-aaf9-7ec75288292b': u'active', u'74b94d28-49b5-4bad-8512-a103591efcb4': u'active', u'70eebd02-3a20-45f1-be26-
b364a41cdce6': u'active', u'8837a191-d72c-42bb-925f-0799adfaa1d6': u'active', u'6957070f-6ce3-4e5c-8bc7-cc0edcb3ccaf': u'active', u'd6c4bbce-47ab-4e81-8ff0-3c51dca0fa6c': u'active'}, options
=None) (logUtils:51)
  2017-05-11 14:34:02,697+0300 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStoragePool, Return response: True (logUtils:54)


2017-05-11 14:34:12,340+0300 INFO  (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'34c89bc7-23e1-
402d-a070-467825318765', u'connection': u'10.35.146.161', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u
'3260'}, {u'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be', u'connection': u'10.35.146.193', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u'3260'}, {u'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e', u'connection': u'10.35.146.129', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00'
, u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434', u'connection': u'10.35.146.225', u'iqn': u'iqn.2008-05.com.xtremio
:xio00153500071-514f0c50023f6c05', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], options=None) (logUtils:51)
  2017-05-11 14:34:16,596+0300 INFO  (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'34c89bc7-23e1-402d-a070-467825318765
'}, {'status': 0, 'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be'}, {'status': 0, 'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e'}, {'status': 0, 'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434'}]}
 (logUtils:54)


2017-05-11 14:34:16,618+0300 INFO  (jsonrpc/0) [dispatcher] Run and protect: getStorageDomainInfo(sdUUID=u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b', options=None) (logUtils:51)
  2017-05-11 14:34:20,768+0300 INFO  (jsonrpc/0) [dispatcher] Run and protect: getStorageDomainInfo, Return response: {'info': {'uuid': '2a3d0825-c3e9-4bac-a4b8-438eab1f552b', 'vgMetadataDevic
e': '3514f0c5a51600268', 'vguuid': 'NEckcd-72gk-stzw-Xckv-t0yR-u95W-yyReib', 'metadataDevice': '3514f0c5a51600268', 'state': 'OK', 'version': '3', 'role': 'Regular', 'type': 'ISCSI', 'class'
: 'Data', 'pool': [], 'name': 'upgrade_4_0_to_4_1iscsi0'}} (logUtils:54)


2017-05-11 14:34:20,867+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[], options=None) (log
Utils:51)
  2017-05-11 14:34:21,171+0300 INFO  (jsonrpc/5) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)

2017-05-11 14:34:23,831+0300 INFO  (jsonrpc/7) [dispatcher] Run and protect: formatStorageDomain(sdUUID=u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b', autoDetach=False, options=None) (logUtils:51)

2017-05-11 14:34:41,966+0300 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='5ce71b7a-5964-45bc-9933-cf098fe6fd38') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
    lvm.removeVG(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
    raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG 2a3d0825-c3e9-4bac-a4b8-438eab1f552b remove failed.',)

Comment 41 Maor 2017-07-11 15:42:44 UTC
We suspect that the root cause of this bug is a race: the client (VDSM) calls acquire with sync=false, and before that call completes it calls release.
We are currently checking whether there is a way to cancel the acquire once release
is called.
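
To make the suspected sequence concrete, a minimal sketch, assuming the sanlock Python bindings of that era expose add_lockspace/rem_lockspace with an 'async' keyword (see comments 43-45 below); the UUID is illustrative and this is not vdsm's actual code:

import sanlock

SD_UUID = "2a3d0825-c3e9-4bac-a4b8-438eab1f552b"   # illustrative domain UUID
IDS_LV = "/dev/%s/ids" % SD_UUID
HOST_ID = 1

# Acquire the host id asynchronously: the call returns immediately while the
# sanlock daemon is still joining the lockspace in the background.
sanlock.add_lockspace(SD_UUID, HOST_ID, IDS_LV, **{"async": True})

# Teardown then asks to release while the join may still be in flight. If the
# remove reaches sanlock it cancels the in-progress add with -201 (comment 43),
# which nobody observes in async mode; in the failing runs the log_dump shows
# no rem_lockspace at all for the leaked lockspace (comment 33), so the
# lockspace stays joined, the "ids" LV stays open, and the later vgremove fails.
sanlock.rem_lockspace(SD_UUID, HOST_ID, IDS_LV)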

Comment 42 Nir Soffer 2017-07-11 15:56:46 UTC
(In reply to Maor from comment #40)
We debugged this offline, removing the need info.

Comment 43 David Teigland 2017-07-11 16:07:05 UTC
To be clear, you are talking about calling rem_lockspace() while an async add_lockspace() is still running?  If you do that, sanlock currently cancels the in-progress add_lockspace() which returns -201.  Here is an example of how that looks in the sanlock logs:

2017-07-11 04:45:44 4829695 [40331]: cmd_add_lockspace 2,10 LS:1:/dev/loop9:0 flags 1 timeout 0
2017-07-11 04:45:44 4829695 [40331]: s3 lockspace LS:1:/dev/loop9:0
2017-07-11 04:45:44 4829695 [40331]: cmd_add_lockspace 2,10 async done 0
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire begin LS:1
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire fast reacquire
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire write 1 2 4829695 null-02
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire delta_short_delay 20
2017-07-11 04:45:46 4829696 [40332]: cmd_rem_lockspace 2,10 LS flags 0
2017-07-11 04:45:46 4829697 [40360]: s3 delta_acquire abort2 remove 1 shutdown 0
2017-07-11 04:45:47 4829698 [40331]: s3 add_lockspace fail result -201
2017-07-11 04:45:48 4829698 [40332]: cmd_rem_lockspace 2,10 done 0

Comment 44 Nir Soffer 2017-07-11 17:24:41 UTC
(In reply to David Teigland from comment #43)
> To be clear, you are talking about calling rem_lockspace() while an async
> add_lockspace() is still running?  If you do that, sanlock currently cancels
> the in-progress add_lockspace() which returns -201.

David, just to be sure:

- If add_lockspace is called with the async=True flag (in the python bindings), the 
  call will be canceled silently (nobody is waiting on it).

- If add_lockspace is called with async=False, the call will return a -201 error.

Right?

Comment 45 David Teigland 2017-07-11 17:35:14 UTC
Yes, good point, when async there would be nothing to see the -201 result, it would only be visible in the logs.

Comment 46 Maor 2017-07-12 07:48:14 UTC
Avihai, would it be possible to test again the failing automation tests with the following patch:
   https://gerrit.ovirt.org/70443

Comment 47 Avihai 2017-07-12 08:51:50 UTC
(In reply to Maor from comment #46)
> Avihai, would it be possible to test again the failing automation tests with
> the following patch:
>    https://gerrit.ovirt.org/70443

Sure, I'll test the qcow2_v3 automation TP on this patch & update.

Comment 51 Nir Soffer 2017-07-13 11:21:54 UTC
Avihay, are you sure this issue is gone with https://gerrit.ovirt.org/70443?

We don't fully understand the race causing this bug. Did we have enough to check
this issue?

Comment 52 Avihai 2017-07-13 12:19:38 UTC
(In reply to Nir Soffer from comment #51)
> Avihay, are you sure this issue is gone with https://gerrit.ovirt.org/70443?
> 
> We don't fully understand the race causing this bug. Did we have enough to
> check
> this issue?

100% sure? Well, no, as this bug was not 100% reproducible.
To begin with, the issue happened each time in a different test case of the qcow2_v3 TP.
However, I remember that each time I ran the 4.2 qcow2_v3 TP it occurred at least once, so this is a very good sign indeed.

I think we should rerun this test several more times just to increase confidence in the fix, especially given the sporadic nature of this bug and considering that the root cause/fix is still not fully clear.

Comment 53 Avihai 2017-07-13 12:43:52 UTC
Started running the TP once more, will update when done.

Comment 55 Nir Soffer 2017-07-13 14:26:45 UTC
Avihay, sounds like a good plan.

Comment 56 Avihai 2017-07-15 05:31:12 UTC
Another 2 runs so far; the issue did not reproduce.

A 3rd run has started (after patching the glance image issue which failed many tests at the setup stage).

Comment 57 Nir Soffer 2017-07-16 16:23:55 UTC
We want to backport the fix to 4.1, needed for another 4.1 bug.

Comment 58 Avihai 2017-08-07 07:53:02 UTC
Verified on 4.1.5.