Bug 1449944
| Summary: | remove & format iscsi storage domain fails sometimes - FormatStorageDomainVDS failed | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Avihai <aefrat> |
| Component: | General | Assignee: | Maor <mlipchuk> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Avihai <aefrat> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.20.0 | CC: | aefrat, amureini, bugs, lveyde, mlipchuk, nsoffer, teigland |
| Target Milestone: | ovirt-4.1.5 | Keywords: | Automation, AutomationBlocker, Regression |
| Target Release: | 4.19.26 | Flags: | rule-engine: ovirt-4.1+, rule-engine: ovirt-4.2+, rule-engine: blocker+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm v4.19.26 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-23 08:03:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Avihai
2017-05-11 08:34:42 UTC
More info:
1) The issue is seen only in 4.2 and only on iSCSI (the same scenario also runs on NFS & Gluster and the issue is not seen there)
2) I ran the same test suite in 4.1.2.1 multiple times and the issue does not reproduce.
3) Another scenario (automation test case) where I saw this issue occur is detailed below. Added logs from this case as well.
Details:
1. Create DC + cluster on v3 + iscsi storage domain
2. Create a VM with thin disk and one snapshot
3. Move storage domain to maintenance
4. Attempt to amend the disk & snapshot disks on the old DC -> should fail
5. Reactivate the iSCSI storage domain
6. Remove the DC
7. Deactivate & move host host_mixed_1 to another cluster of another DC in V4.
8. Remove original cluster (created at step 1)
9. Remove the iSCSI SD with the format SD option enabled.
Error seen in VDSM log:
2017-05-11 12:52:05,181+0300 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='5b612dc4-40ea-44fb-a3a8-6a23b0bfa3ae') Unexpected error (task:871)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
return fn(*args, **kargs)
File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
self._recycle(sd)
File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
dom.format(dom.sdUUID)
File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
lvm.removeVG(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG d2e4878c-fb00-4488-934d-c546b1146211 remove failed.',)
2017-05-11 12:52:05,183+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='5b612dc4-40ea-44fb-a3a8-6a23b0bfa3ae') Task._run: 5b612dc4-40ea-44fb-a3a8-6a23b0bfa3ae (u'd2e4878c-fb00-4488-934d-c546b1146211', False) {} failed - stopping task (task:890)
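To help orient readers of the traceback above, here is a minimal, hypothetical Python sketch of the failing call path; the vgremove invocation and the simplified function bodies are assumptions for illustration, not the actual vdsm implementation.

```python
# Hypothetical sketch of the call path in the traceback above - not vdsm code.
import subprocess

class VolumeGroupRemoveError(Exception):
    pass

def remove_vg(vg_name):
    # lvm.removeVG raises when the underlying LVM removal fails, e.g. because
    # "Logical volume <vg>/ids in use" - the sanlock lease LV is still open.
    rc = subprocess.call(["vgremove", "--force", vg_name])
    if rc != 0:
        raise VolumeGroupRemoveError("VG %s remove failed." % vg_name)

def format_storage_domain(sd_uuid):
    # hsm.formatStorageDomain -> hsm._recycle -> blockSD.format -> lvm.removeVG
    remove_vg(sd_uuid)
```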
Created attachment 1277828 [details]
engine & vdsm logs scenario2
The issue is seen again in another scenario (#3), which is similar to the scenario 2 automation run (after blocking scenarios 1 & 2).
Raising severity to high.
Scenario 3 details:
1. Create DC + cluster on v3 + iscsi storage domain
2. Create a VM with thin disk and one snapshot
3. Attempt to amend the disk & snapshot disks on the old DC -> should fail
4. Remove the DC
5. Deactivate & move host host_mixed_1 to another cluster of another DC in V4.
6. Remove original cluster (created at step 1)
7. Remove the iSCSI SD with the format SD option enabled -> issue occurs.
Timestamp when issue occurred - May 11, 2017, 2:34:41 PM.
VDSM log:
2017-05-11 14:34:41,966+0300 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='5ce71b7a-5964-45bc-9933-cf098fe6fd38') Unexpected error (task:871)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
return fn(*args, **kargs)
File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
self._recycle(sd)
File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
dom.format(dom.sdUUID)
File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
lvm.removeVG(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG 2a3d0825-c3e9-4bac-a4b8-438eab1f552b remove failed.',)
Created attachment 1277870 [details]
engine & vdsm logs scenario3
Added attachment for the scenario 3 log.
Blocking automation TCs 18348, 18350, 18351, matching scenarios 1, 2, 3.
This also blocks the following TCs: since the storage domain is not removed, the subsequent tests will fail at the SETUP stage, as they try to create a storage domain with the same name & path (if iSCSI).
The interesting part of the error seems to be that the "ids" LV remains active. E.g., on the line prior to the given stacktrace:
2017-05-11 12:52:05,170+0300 DEBUG (jsonrpc/0) [storage.Misc.excCmd] FAILED: <err> = ' Logical volume d2e4878c-fb00-4488-934d-c546b1146211/ids in use.\n'; <rc> = 5 (commands:93)

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Avihai,
Does it reproduce also if you don't call amend? With the following steps:
1. Create DC + cluster on v3 + iscsi storage domain
2. Create a VM with thin disk and one snapshot
4. remove DC
5. Deactivate & Move host host_mixed_1 to another cluster of another DC in V4.
6. Remove original cluster (created at step 1)
7. Remove iscsi SD with format SD option enabled

(In reply to Maor from comment #7)
> Avihai,
>
> Does it reproduce also if you don't call amend?
>
> With the following steps:
> 1. Create DC + cluster on v3 + iscsi storage domain
> 2. Create a VM with thin disk and one snapshot
> 4. remove DC
> 5. Deactivate & Move host host_mixed_1 to another cluster of another DC in V4.
> 6. Remove original cluster (created at step 1)
> 7. Remove iscsi SD with format SD option enabled

I ran the scenario without amend 10 times and the issue did not reproduce. That was in v4.1; now running in 4.2 (in which the original issue occurred), will update when the 10 runs are done.

Ran in v4.2 without amend and the issue reproduced.
2017-05-15 17:47:20,615+0300 ERROR (jsonrpc/1) [storage.Dispatcher] {'status': {'message': "Volume Group remove error: ('VG 3048ed4c-ed05-464e-95f7-75b5c73aee54 remove failed.',)", 'code': 508}} (dispatcher:78)
Attached logs as usual.
Created attachment 1279034 [details]
engine & vdsm logs
It seems like a sanlock sync issue... I was not able to reproduce it on master, and it seems like it does not fail consistently. I will try to dig into the logs once more to try to find any clues.
Based on Avihai's comments it seems that this issue only reproduces in 4.2 and does not occur in 4.1, so I believe that the target version should be changed to 4.2.

(In reply to Maor from comment #13)
> It seems like a sanlock sync issue...
> I could not able to reproduce it on master and seems like it does not fail
> constantly. I will try to dig in the logs once more try to find any clues.
>
> based on Avihai's comments it seems that this issue only reproduces in 4.2
> and does not occur in 4.1, so I believe that the target version should be
> changed to 4.2

As we discussed, this issue occurred only in 4.2 and is very hard to reproduce; moving the version to 4.2.

The same issue also reproduces in a recent 4.1.3 tier2 run.
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.1v4-ge-runner-tier2/50/
Test: test_qcow2_v3.TestCase18215_REST_ISCSI.test_upgrade_dc
Logs attached.

Created attachment 1285663 [details]
4.1 reproduction engine & vdsm logs
As requested by Maor, I checked whether this issue still occurs in the latest 4.2 - well, the answer is YES :)
The same issue also occurred in the latest 4.2 in another new scenario (#4), as all prior scenarios are blocked from running due to this bug.
oVirt Engine Version: 4.2.0-0.0.master.20170606135217.gita02bf9d.el7.centos
Event issued:
Jun 8, 2017, 6:01:46 PM VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 11506d82-1c10-4ac1-83c9-32aee93eda7c remove failed.',)
VDSM log:
2017-06-08 18:01:46,868+0300 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH formatStorageDomain error=Volume Group remove error: ('VG 11506d82-1c10-4ac1-83c9-32aee93eda7c remove failed.',) (dispatcher:79)
Attached Engine & VDSM logs.
* The issue is VERY HARD to reproduce and does not occur in one fixed scenario but across a bundle of scenarios, each time in a different place (it looks like a purely statistical/timing bug) -> so there is no simple step-by-step reproduction scenario that Dev can easily reproduce in your debug environment.
* IMHO, a debug VDSM build with all the needed hooks/debug tooling is needed to turn my environment into a debug environment.
Created attachment 1286300 [details]
4.2.0-0.0.master.20170606 Engine +VDSM logs
Another failure of the same issue on the latest 4.2. This time it occurred while running the TestCase18215 teardown (scenario #5; the previous comment occurred at TestCase18346), which starts the test plan suite, so this issue seems to block more tests than before.
Test - TestCase18215_REST_ISCSI
Timeline & scenario:
18:00:07 2017-06-09 18:00:06,748 INFO Test Setup 1: Create data-center dc_upgrade_4_0_to_4_1 with cluster cluster_upgrade_4_0_to_4_1 and host host_mixed_1
18:00:07 2017-06-09 18:00:06,748 INFO Test Step 2: Add data-center dc_upgrade_4_0_to_4_1
18:00:07 2017-06-09 18:00:07,081 INFO Test Step 3: Add cluster cluster_upgrade_4_0_to_4_1
18:00:07 2017-06-09 18:00:07,519 INFO Test Step 4: Move host host_mixed_1 to cluster cluster_upgrade_4_0_to_4_1
18:00:07 2017-06-09 18:00:07,523 INFO Test Setup 5: Switch host host_mixed_1 to different cluster cluster_upgrade_4_0_to_4_1 with {'activate': True}
18:00:07 2017-06-09 18:00:07,685 INFO Test Setup 6: Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
18:00:19 2017-06-09 18:00:18,704 INFO Test Setup 7: Update properties of host host_mixed_1 (provided in parameters) with {'cluster': 'cluster_upgrade_4_0_to_4_1'}
18:00:19 2017-06-09 18:00:19,282 INFO Test Setup 8: Activate host host_mixed_1 (set status to UP) with {'wait': True}
18:02:06 2017-06-09 18:02:06,004 INFO 001: storage/rhevmtests.storage.storage_qcow2_v3.test_qcow2_v3.TestCase18215_REST_ISCSI.test_upgrade_dc
18:02:06 2017-06-09 18:02:06,004 INFO API: REST
18:02:06 2017-06-09 18:02:06,004 INFO STORAGE: ISCSI
18:02:06 2017-06-09 18:02:06,004 INFO POLARION: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-18215
18:02:06 2017-06-09 18:02:06,004 INFO Test Setup 9: Upgrading cluster cluster_upgrade_4_0_to_4_1 from version 4.0 to version 4.1
18:02:06 2017-06-09 18:02:06,007 INFO Test Step 10: Update cluster cluster_upgrade_4_0_to_4_1 with {'version': '4.1'}
18:02:06 2017-06-09 18:02:06,335 INFO Test Setup 11: Upgrading data center dc_upgrade_4_0_to_4_1 from version 4.0 to version 4.1
18:02:11 2017-06-09 18:02:11,756 INFO Result: PASSED
18:02:11 2017-06-09 18:02:11,960 INFO Test Teardown 12: Data center's dc_upgrade_4_0_to_4_1 master domain is upgrade_4_0_to_4_1iscsi0
18:02:12 2017-06-09 18:02:12,526 INFO Test Teardown 13: Clean data-center dc_upgrade_4_0_to_4_1 and remove it
18:02:12 2017-06-09 18:02:12,773 INFO Test Step 14: Deactivate storage-domain upgrade_4_0_to_4_1iscsi0
18:02:57 2017-06-09 18:02:57,814 INFO Test Step 15: Remove data-center dc_upgrade_4_0_to_4_1
18:03:14 2017-06-09 18:03:14,407 INFO Test Step 16: Move host host_mixed_1 to cluster cluster_upgrade_4_0_to_4_1
18:03:14 2017-06-09 18:03:14,409 INFO Test Teardown 17: Switch host host_mixed_1 to different cluster golden_env_mixed_1 with {'activate': True}
18:03:14 2017-06-09 18:03:14,582 INFO Test Teardown 18: Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
18:03:25 2017-06-09 18:03:25,682 INFO Test Teardown 19: Update properties of host host_mixed_1 (provided in parameters) with {'cluster': 'golden_env_mixed_1'}
18:03:26 2017-06-09 18:03:26,317 INFO Test Teardown 20: Activate host host_mixed_1 (set status to UP) with {'wait': True}
18:03:47 2017-06-09 18:03:47,408 INFO Test Teardown 21: Remove storage domain upgrade_4_0_to_4_1iscsi0
ISSUE occurs at 18:04:20 2017-06-09:
VDSM host_mixed_1 command FormatStorageDomainVDS failed: Volume Group remove error: ('VG 828770f4-66a1-4404-999a-2dc7a1badaf6 remove failed.',)
2017-06-09 18:04:20,010+0300 ERROR (jsonrpc/5)
From VDSM log:
2017-06-09 18:04:19,997+0300 ERROR (jsonrpc/5) [storage.TaskManager.Task] (Task='c551c692-a56c-4e63-bd59-02c95cad59ba') Unexpected error (task:871)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
return fn(*args, **kargs)
File "<string>", line 2, in formatStorageDomain
File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
ret = func(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 2628, in formatStorageDomain
self._recycle(sd)
File "/usr/share/vdsm/storage/hsm.py", line 2571, in _recycle
dom.format(dom.sdUUID)
File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
lvm.removeVG(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
[storage.Dispatcher] FINISH formatStorageDomain error=Volume Group remove error: ('VG 828770f4-66a1-4404-999a-2dc7a1badaf6 remove failed.',) (dispatcher:79)
Created attachment 1286574 [details]
repro4_2_20170606 TC18215 engine&vdsm logs
The issue still occurs on the latest 4.1 & 4.2 - any news on this issue? The latest reproduction was on a 4.1 TC.

The issue started reoccurring more frequently on the latest 4.1.3.4 and still occurs sporadically, at an even higher frequency, on the latest 4.2.
On 4.1 it was seen most recently in these TCs:
TestCase18336
TestCase18305
TestCase18348
TestCase18350
TestCase18351
Please give this priority; it blocks many tests in both 4.1 & 4.2 and has for a long time now. There is no single reproducible scenario that causes this; maybe we should consider a debugger or a special debug build if logs are not enough.

I was looking into the host which Avihai worked with; at first sight it seems that sanlock does not release the resource.
David,
I was wondering if maybe you will have any insights about this.
It looks like any attempt to remove a VG will show the following error in VDSM:
2017-06-28 11:44:52,430+0300 DEBUG (jsonrpc/1) [storage.Misc.excCmd] FAILED: <err> = ' Logical volume 25fffa21-b6b5-45ed-ac85-cc540638488d/ids in use.\n'; <rc> = 5 (commands:94)
Once I try to deactivate the LV from shell I can see that it is still being used:
# lvchange -an 25fffa21-b6b5-45ed-ac85-cc540638488d/ids
Logical volume 25fffa21-b6b5-45ed-ac85-cc540638488d/ids in use.
It looks like sanlock occupies the volume:
# fuser -vam /dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids
USER PID ACCESS COMMAND
/dev/dm-32: sanlock 683 F.... sanlock
Although looking into the sanlock logs I haven't seen any errors, and IINM it looks like the "ids" LV was released:
2017-06-28 11:41:55 150466 [13292]: s307 lockspace 25fffa21-b6b5-45ed-ac85-cc540638488d:250:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0
and
2017-06-28 11:42:24 150496 [13524]: s308 lockspace 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0
Any help would be appreciated.
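For convenience, the manual checks above can be wrapped in a small script; a minimal sketch, assuming only the lvchange and fuser invocations already shown in this comment (the helper name and structure are hypothetical):

```python
# Illustrative helper that automates the manual diagnosis above.
import subprocess

def who_holds_ids_lv(vg_name):
    lv = "%s/ids" % vg_name
    lv_path = "/dev/%s" % lv
    # Try to deactivate the lease LV, as done above with lvchange -an.
    rc = subprocess.call(["lvchange", "-an", lv])
    if rc == 0:
        print("%s deactivated cleanly" % lv_path)
        return
    # Deactivation failed ("Logical volume ... in use"); show which process
    # still holds the device - in this bug it is sanlock.
    subprocess.call(["fuser", "-vam", lv_path])

who_holds_ids_lv("25fffa21-b6b5-45ed-ac85-cc540638488d")
```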
Those log messages are created when the lockspace is added, not released. I've not seen any info from sanlock itself above. When this happens I'd need to get 'sanlock client status' and 'sanlock client log_dump' output. If that's not possible, configure sanlock to save all debugging in a log file (see http://people.redhat.com/teigland/sanlock-messages.txt).
Is sanlock_rem_lockspace() being called with the ASYNC or UNUSED flag?

Created attachment 1292663 [details]
sanlock_client_status
Created attachment 1292664 [details]
sanlock_client_status
Created attachment 1292665 [details]
sanlock_client_log_dump
> Is sanlock_rem_lockspace() being called with the ASYNC or UNUSED flag?
I'm not sure how to check this.

Maor, the issue reproduced again on the same setup.
VDSM (storage-ge1-vdsm1.scl.lab.tlv.redhat.com):
2017-06-28 18:01:40,055+0300 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH formatStorageDomain error=Volume Group remove error: ('VG df672555-20de-4a26-89ca-39f1fb2e767d remove failed.',) (dispatcher:79)
Maor, the setup is yours if you need it; ping me when done.

sanlock client status shows one lockspace still running using an LV:
s 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0
sanlock client log_dump does not show any rem_lockspace call for this lockspace. So, sanlock still has that "ids" LV open because the lockspace hasn't been stopped. (rem_lockspace() flags don't matter as it's not been called.)

Nir, I do see from the logs that the host id is being released:
"2017-05-11 14:31:26,965+0300 INFO (jsonrpc/3) [storage.SANLock] Releasing host id for domain 2a3d0825-c3e9-4bac-a4b8-438eab1f552b (id: 250)"
I guess that should release the lockspace from the storage domain. David indicated in his previous comments that there is no rem_lockspace in the sanlock logs. Is there a way to know at which point the release is being called?

Keep in mind that you need to look at the host_id being used with the lockspace. The same lockspace name is being removed with host_id 250, and then added with host_id 1.

rem_lockspace 25fff with host_id 250:
2017-06-28 11:42:18 150490 [700]: cmd_rem_lockspace 3,10 25fffa21-b6b5-45ed-ac85-cc540638488d flags 0
2017-06-28 11:42:20 150492 [10932]: s307 delta_release begin 25fffa21-b6b5-45ed-ac85-cc540638488d:250
2017-06-28 11:42:20 150492 [10932]: s307 delta_release done 250 1 0
2017-06-28 11:42:20 150492 [683]: s307 free lockspace
2017-06-28 11:42:20 150492 [700]: cmd_rem_lockspace 3,10 done 0

add_lockspace 25fff with host_id 1:
2017-06-28 11:42:24 150496 [13524]: cmd_add_lockspace 3,10 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0 flags 0 timeout 0
2017-06-28 11:42:24 150496 [13524]: s308 lockspace 25fffa21-b6b5-45ed-ac85-cc540638488d:1:/dev/25fffa21-b6b5-45ed-ac85-cc540638488d/ids:0
2017-06-28 11:42:24 150496 [11163]: s308 delta_acquire begin 25fffa21-b6b5-45ed-ac85-cc540638488d:1
2017-06-28 11:42:24 150496 [11163]: s308 delta_acquire write 1 1 150496 0c7e9f21-82a9-4379-a320-f0c41f202d32.storage-ge
2017-06-28 11:42:24 150496 [11163]: s308 delta_acquire delta_short_delay 20
2017-06-28 11:42:44 150516 [11163]: s308 delta_acquire done 1 1 150496
2017-06-28 11:42:45 150517 [13524]: s308 add_lockspace done
2017-06-28 11:42:45 150517 [13524]: cmd_add_lockspace 3,10 done 0

Guys, do we have any update on this issue? It blocks the qcow2_v3 test plan automation on both 4.1 & 4.2. Please give this higher priority.

Any ETA on a fix for this issue?

If there is a way to reproduce this consistently it could be helpful.
Nir, here is the filtered VDSM log (see [1]) with all the engine calls, do you see any way that the release of the host id is missed?
[1]
2017-05-11 14:33:12,894+03 INFO [org.ovirt.engine.core.bll.storage.domain.ActivateStorageDomainCommand]
2017-05-11 14:33:12,125+0300 INFO (jsonrpc/0) [dispatcher] Run and protect: deactivateStorageDomain(sdUUID=u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b', spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323
eb58', msdUUID=u'00000000-0000-0000-0000-000000000000', masterVersion=1, options=None) (logUtils:51)
2017-05-11 14:33:12,134+0300 INFO (jsonrpc/0) [dispatcher] Run and protect: deactivateStorageDomain, Return response: None (logUtils:54)
2017-05-11 14:33:12,190+0300 INFO (jsonrpc/1) [dispatcher] Run and protect: spmStop(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', options=None) (logUtils:51
2017-05-11 14:33:12,678+0300 INFO (jsonrpc/1) [dispatcher] Run and protect: spmStop, Return response: None (logUtils:54)
2017-05-11 14:33:12,701+0300 INFO (jsonrpc/4) [dispatcher] Run and protect: disconnectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, remove=False, options=None) (logU
tils:51)
2017-05-11 14:33:16,315+0300 INFO (jsonrpc/4) [dispatcher] Run and protect: disconnectStoragePool, Return response: True (logUtils:54)
2017-05-11 14:33:16,333+0300 INFO (jsonrpc/5) [dispatcher] Run and protect: disconnectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, remove=False, options=None) (logU
tils:51)
2017-05-11 14:33:16,334+0300 INFO (jsonrpc/5) [dispatcher] Run and protect: disconnectStoragePool, Return response: None (logUtils:54)
2017-05-11 14:33:16,416+0300 INFO (jsonrpc/6) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[], options=None) (log
Utils:51)
2017-05-11 14:33:16,733+0300 INFO (jsonrpc/6) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)
2017-05-11 14:33:16,769+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[], options=None) (log
Utils:51)
2017-05-11 14:33:17,056+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)
2017-05-11 14:33:24,996+0300 INFO (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[{u'id': u'34c89bc7-23e1-
402d-a070-467825318765', u'connection': u'10.35.146.161', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u
'3260'}, {u'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be', u'connection': u'10.35.146.193', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u'3260'}, {u'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e', u'connection': u'10.35.146.129', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00'
, u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434', u'connection': u'10.35.146.225', u'iqn': u'iqn.2008-05.com.xtremio
:xio00153500071-514f0c50023f6c05', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], options=None) (logUtils:51)
2017-05-11 14:33:27,676+0300 INFO (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'34c89bc7-23e1-402d-a070-467825318765
'}, {'status': 0, 'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be'}, {'status': 0, 'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e'}, {'status': 0, 'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434'}]}
(logUtils:54)
Engine log:
2017-05-11 14:40:21,893+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
2017-05-11 14:33:27,691+0300 INFO (jsonrpc/4) [dispatcher] Run and protect: connectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, msdUUID=u'2a3d0825-c3e9-4bac-a4b8-43
8eab1f552b', masterVersion=1, domainsMap={u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b': u'active'}, options=None) (logUtils:51)
2017-05-11 14:33:29,757+0300 INFO (jsonrpc/4) [dispatcher] Run and protect: connectStoragePool, Return response: True (logUtils:54)
2017-05-11 14:33:29,852+0300 INFO (jsonrpc/5) [dispatcher] Run and protect: connectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, msdUUID=u'2a3d0825-c3e9-4bac-a4b8-43
8eab1f552b', masterVersion=1, domainsMap={u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b': u'active'}, options=None) (logUtils:51)
2017-05-11 14:33:31,741+0300 INFO (jsonrpc/5) [dispatcher] Run and protect: connectStoragePool, Return response: True (logUtils:54)
2017-05-11 14:33:31,945+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: spmStart(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', prevID=-1, prevLVER=u'-1', maxHostID=250, domVersion=u'3
', options=None) (logUtils:51)
2017-05-11 14:33:31,950+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: spmStart, Return response: None (logUtils:54)
2017-05-11 14:33:34,331+0300 INFO (jsonrpc/4) [dispatcher] Run and protect: destroyStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, options=None) (logUtils:51)
2017-05-11 14:33:38,735+0300 INFO (jsonrpc/4) [dispatcher] Run and protect: destroyStoragePool, Return response: True (logUtils:54)
2017-05-11 14:33:38,939+0300 INFO (jsonrpc/6) [dispatcher] Run and protect: disconnectStoragePool(spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', hostID=1, remove=False, options=None) (logU
tils:51)
2017-05-11 14:33:38,941+0300 INFO (jsonrpc/6) [dispatcher] Run and protect: disconnectStoragePool, Return response: None (logUtils:54)
2017-05-11 14:33:39,004+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'b2f1b54b-65e7-4ed4-914f-a79d6323eb58', conList=[], options=None) (log
Utils:51)
2017-05-11 14:33:39,306+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)
Engine: 2017-05-11 14:31:23,189+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand
2017-05-11 14:33:54,380+0300 INFO (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer(domType=7, spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', conList=[{u'id': u'785d9f22-26be-
42e5-9a4f-c6306db74831', u'connection': u'gluster-server01.qa.lab.tlv.redhat.com:/storage_local_ge4_volume_0', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password
': '********', u'port': u''}, {u'id': u'4eb9dab9-0b8d-441e-b86e-0f7ac2835ee3', u'connection': u'gluster-server01.qa.lab.tlv.redhat.com:/storage_local_ge4_volume_1', u'iqn': u'', u'user': u''
, u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}, {u'id': u'fbdc338d-d9ad-41e4-8dc6-dfb18a2dea18', u'connection': u'gluster-server01.qa.lab.tlv.redhat.com:/
storage_local_ge4_volume_2', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], options=None) (logUtils:51)
2017-05-11 14:33:54,427+0300 DEBUG (jsonrpc/6) [storage.fileUtils] Creating directory: /rhev/data-center/mnt/glusterSD/gluster-server01.qa.lab.tlv.redhat.com:_storage__local__ge4__volume__0
mode: None (fileUtils:197
2017-05-11 14:33:56,481+0300 INFO (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'785d9f22-26be-42e5-9a4f-c6306db74831
'}, {'status': 0, 'id': u'4eb9dab9-0b8d-441e-b86e-0f7ac2835ee3'}, {'status': 0, 'id': u'fbdc338d-d9ad-41e4-8dc6-dfb18a2dea18'}]} (logUtils:54)
2017-05-11 14:33:56,500+0300 INFO (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', conList=[{u'id': u'6f5283c8-8360-
4a2a-a561-50178ce35782', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/storage_local_ge4_nfs_0', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': '********', u'port':
u''}, {u'id': u'0e89156a-01ed-4a9b-8735-d028847768a6', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/storage_local_ge4_nfs_1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u''}, {u'id': u'8b90456d-4748-4ab6-af95-8238b0d803d4', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/storage_local_ge4_nfs_2', u'iqn': u'',
u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u''}, {u'id': u'c3bc79f1-428e-4ea1-adf6-90159781e68f', u'connection': u'yellow-vdsb.qa.lab.tlv.redhat.com:/Storage_NFS/export_d
omain_storage_local_ge4', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u''}], options=None) (logUtils:51)
2017-05-11 14:33:57,025+0300 INFO (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'6f5283c8-8360-4a2a-a561-50178ce35782
'}, {'status': 0, 'id': u'0e89156a-01ed-4a9b-8735-d028847768a6'}, {'status': 0, 'id': u'8b90456d-4748-4ab6-af95-8238b0d803d4'}, {'status': 0, 'id': u'c3bc79f1-428e-4ea1-adf6-90159781e68f'}]}
(logUtils:54)
2017-05-11 14:33:57,054+0300 INFO (jsonrpc/0) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', conList=[{u'id': u'bf6d36c6-df06-
47b8-a0da-d22560e20b7e', u'connection': u'10.35.146.129', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u
'3260'}, {u'id': u'34c89bc7-23e1-402d-a070-467825318765', u'connection': u'10.35.146.161', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u'3260'}, {u'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be', u'connection': u'10.35.146.193', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04'
, u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434', u'connection': u'10.35.146.225', u'iqn': u'iqn.2008-05.com.xtremio
:xio00153500071-514f0c50023f6c05', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], options=None) (logUtils:51)
2017-05-11 14:33:57,055+0300 DEBUG (jsonrpc/0) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /usr/sbin/iscsiadm -m node -T iqn
2017-05-11 14:33:59,756+0300 INFO (jsonrpc/0) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e
'}, {'status': 0, 'id': u'34c89bc7-23e1-402d-a070-467825318765'}, {'status': 0, 'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be'}, {'status': 0, 'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434'}]}
(logUtils:54)
2017-05-11 14:33:59,799+0300 INFO (jsonrpc/1) [dispatcher] Run and protect: connectStoragePool(spUUID=u'08b0f4d6-b413-4991-a469-241ad41d027d', hostID=1, msdUUID=u'8837a191-d72c-42bb-925f-07
99adfaa1d6', masterVersion=1, domainsMap={u'd20921cb-df0c-4c2b-8347-7d0b82b4daa8': u'active', u'24b505ca-3827-4d1d-9309-0f559f977f0c': u'active', u'c05d32d5-bc3a-4845-b0e5-16a1f2dd8ef1': u'a
ctive', u'f1cac489-d7a8-4054-a093-c128600d8ed0': u'active', u'95324129-bfea-4ce4-aaf9-7ec75288292b': u'active', u'74b94d28-49b5-4bad-8512-a103591efcb4': u'active', u'70eebd02-3a20-45f1-be26-
b364a41cdce6': u'active', u'8837a191-d72c-42bb-925f-0799adfaa1d6': u'active', u'6957070f-6ce3-4e5c-8bc7-cc0edcb3ccaf': u'active', u'd6c4bbce-47ab-4e81-8ff0-3c51dca0fa6c': u'active'}, options
=None) (logUtils:51)
2017-05-11 14:34:02,697+0300 INFO (jsonrpc/1) [dispatcher] Run and protect: connectStoragePool, Return response: True (logUtils:54)
2017-05-11 14:34:12,340+0300 INFO (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'34c89bc7-23e1-
402d-a070-467825318765', u'connection': u'10.35.146.161', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c01', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u
'3260'}, {u'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be', u'connection': u'10.35.146.193', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04', u'user': u'', u'tpgt': u'1', u'
password': '********', u'port': u'3260'}, {u'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e', u'connection': u'10.35.146.129', u'iqn': u'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00'
, u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434', u'connection': u'10.35.146.225', u'iqn': u'iqn.2008-05.com.xtremio
:xio00153500071-514f0c50023f6c05', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], options=None) (logUtils:51)
2017-05-11 14:34:16,596+0300 INFO (jsonrpc/2) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'34c89bc7-23e1-402d-a070-467825318765
'}, {'status': 0, 'id': u'4eba3d50-20f1-4bfe-99ec-18af62f968be'}, {'status': 0, 'id': u'bf6d36c6-df06-47b8-a0da-d22560e20b7e'}, {'status': 0, 'id': u'dcb9e9c4-2100-432d-a716-2f84f83ac434'}]}
(logUtils:54)
2017-05-11 14:34:16,618+0300 INFO (jsonrpc/0) [dispatcher] Run and protect: getStorageDomainInfo(sdUUID=u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b', options=None) (logUtils:51)
2017-05-11 14:34:20,768+0300 INFO (jsonrpc/0) [dispatcher] Run and protect: getStorageDomainInfo, Return response: {'info': {'uuid': '2a3d0825-c3e9-4bac-a4b8-438eab1f552b', 'vgMetadataDevic
e': '3514f0c5a51600268', 'vguuid': 'NEckcd-72gk-stzw-Xckv-t0yR-u95W-yyReib', 'metadataDevice': '3514f0c5a51600268', 'state': 'OK', 'version': '3', 'role': 'Regular', 'type': 'ISCSI', 'class'
: 'Data', 'pool': [], 'name': 'upgrade_4_0_to_4_1iscsi0'}} (logUtils:54)
2017-05-11 14:34:20,867+0300 INFO (jsonrpc/5) [dispatcher] Run and protect: disconnectStorageServer(domType=3, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[], options=None) (log
Utils:51)
2017-05-11 14:34:21,171+0300 INFO (jsonrpc/5) [dispatcher] Run and protect: disconnectStorageServer, Return response: {'statuslist': []} (logUtils:54)
2017-05-11 14:34:23,831+0300 INFO (jsonrpc/7) [dispatcher] Run and protect: formatStorageDomain(sdUUID=u'2a3d0825-c3e9-4bac-a4b8-438eab1f552b', autoDetach=False, options=None) (logUtils:51)
2017-05-11 14:34:41,966+0300 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='5ce71b7a-5964-45bc-9933-cf098fe6fd38') Unexpected error (task:871)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
return fn(*args, **kargs)
File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 2626, in formatStorageDomain
self._recycle(sd)
File "/usr/share/vdsm/storage/hsm.py", line 2569, in _recycle
dom.format(dom.sdUUID)
File "/usr/share/vdsm/storage/blockSD.py", line 1219, in format
lvm.removeVG(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1019, in removeVG
raise se.VolumeGroupRemoveError("VG %s remove failed." % vgName)
VolumeGroupRemoveError: Volume Group remove error: ('VG 2a3d0825-c3e9-4bac-a4b8-438eab1f552b remove failed.',)

(In reply to Avihai from comment #39)
We suspect that the root cause of this bug is a race in which the client (VDSM) calls acquire with sync=false and, before it ends, calls release. Currently checking if there is a way to cancel the acquire once release is called.

(In reply to Maor from comment #40)
We debugged this offline, removing the needinfo.

To be clear, you are talking about calling rem_lockspace() while an async add_lockspace() is still running? If you do that, sanlock currently cancels the in-progress add_lockspace(), which returns -201. Here is an example of how that looks in the sanlock logs:
2017-07-11 04:45:44 4829695 [40331]: cmd_add_lockspace 2,10 LS:1:/dev/loop9:0 flags 1 timeout 0
2017-07-11 04:45:44 4829695 [40331]: s3 lockspace LS:1:/dev/loop9:0
2017-07-11 04:45:44 4829695 [40331]: cmd_add_lockspace 2,10 async done 0
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire begin LS:1
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire fast reacquire
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire write 1 2 4829695 null-02
2017-07-11 04:45:44 4829695 [40360]: s3 delta_acquire delta_short_delay 20
2017-07-11 04:45:46 4829696 [40332]: cmd_rem_lockspace 2,10 LS flags 0
2017-07-11 04:45:46 4829697 [40360]: s3 delta_acquire abort2 remove 1 shutdown 0
2017-07-11 04:45:47 4829698 [40331]: s3 add_lockspace fail result -201
2017-07-11 04:45:48 4829698 [40332]: cmd_rem_lockspace 2,10 done 0

(In reply to David Teigland from comment #43)
> To be clear, you are talking about calling rem_lockspace() while an async
> add_lockspace() is still running? If you do that, sanlock currently cancels
> the in-progress add_lockspace() which returns -201.

David, just to be sure:
- If add_lockspace is called with the async=True flag (in the python bindings), the call will be canceled silently (nobody is waiting on it).
- If add_lockspace is called with the async=False flag, the call will return a -201 error.
Right?

Yes, good point, when async there would be nothing to see the -201 result, it would only be visible in the logs.

Avihai, would it be possible to test again the failing automation tests with the following patch:
https://gerrit.ovirt.org/70443

(In reply to Maor from comment #46)
> Avihai, would it be possible to test again the failing automation tests with
> the following patch:
> https://gerrit.ovirt.org/70443

Sure, I'll test the qcow2_v3 automation TP on this patch & update.

Avihai, are you sure this issue is gone with https://gerrit.ovirt.org/70443?
We don't fully understand the race causing this bug. Did we have enough to check this issue?

(In reply to Nir Soffer from comment #51)
> Avihai, are you sure this issue is gone with https://gerrit.ovirt.org/70443?
>
> We don't fully understand the race causing this bug. Did we have enough to
> check this issue?

100% sure -> well, no, as this bug was not 100% reproducible. To begin with, the issue happened each time at a different test case running the qcow2_v3 TP. However, I remember that each time I ran the 4.2 qcow2_v3 TP it occurred at least once, so this is a very good sign indeed.
I think we should rerun this test several more times just to increase confidence in this fix, especially given the sporadic nature of this bug and considering that the root cause of this issue is still not fully clear.
Started running the TP once more, will update when done.
Link to run: https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.2-ge-runner-storage/336/console

Avihai, sounds like a good plan.

Another 2 runs so far; the issue did not reproduce.
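To make the suspected interaction concrete, here is a minimal sketch of the race discussed in comments #39-#45 above: an asynchronous add_lockspace raced by an early rem_lockspace. The binding signatures and standalone calls are assumptions for illustration only; the real calls happen inside vdsm's sanlock-based cluster lock code, and this is not that implementation.

```python
# Illustrative only - NOT vdsm code. Assumes the python sanlock bindings'
# add_lockspace/rem_lockspace usage described in this thread (async flag on
# add_lockspace); exact signatures may differ between sanlock versions.
import sanlock

LOCKSPACE = "25fffa21-b6b5-45ed-ac85-cc540638488d"
IDS_PATH = "/dev/%s/ids" % LOCKSPACE
HOST_ID = 1

# Start acquiring the host id without waiting for the delta lease I/O.
# ("async" is passed via a dict because it is a reserved word in newer Python.)
sanlock.add_lockspace(LOCKSPACE, HOST_ID, IDS_PATH, **{"async": True})

# The release is requested before the asynchronous add has completed.
# Per comment #43, sanlock then cancels the in-progress add_lockspace with
# result -201, but with an async add there is no caller left to see that
# result, so the failure goes unnoticed. Depending on timing the lockspace
# can remain held, which keeps the "ids" LV open and makes the later
# vgremove fail with "Logical volume <vg>/ids in use".
sanlock.rem_lockspace(LOCKSPACE, HOST_ID, IDS_PATH)
```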
3rd run started (after patching the glance image issue which failed many tests at the setup stage).

We want to backport the fix to 4.1; it is needed for another 4.1 bug.

Verified on 4.1.5