Bug 1469175 - vdsm path checking stops
Status: CLOSED NEXTRELEASE
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.1.5
Target Release: 4.19.27
Assigned To: Nir Soffer
QA Contact: Raz Tamir
Keywords: Automation, Regression
Depends On:
Blocks: 1422508
Reported: 2017-07-10 11:17 EDT by Raz Tamir
Modified: 2017-08-09 12:37 EDT
CC: 7 users

See Also:
Fixed In Version: v4.19.26
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-08 03:55:05 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
vdsm logs (5.25 MB, application/x-gzip)
2017-07-10 11:17 EDT, Raz Tamir
engine and vdsm logs (2.80 MB, application/x-gzip)
2017-07-10 11:18 EDT, Raz Tamir


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 70443 master MERGED monitor: Fix domain monitoring isolation 2017-07-13 18:15 EDT
oVirt gerrit 79241 master MERGED concurrent: Log BaseException when threads finish 2017-07-11 16:33 EDT
oVirt gerrit 79242 master MERGED asyncevent: Fix double complete calls on errors 2017-07-13 10:32 EDT
oVirt gerrit 79261 master MERGED asyncevent: Add failing test for handle_error failures 2017-07-13 10:33 EDT
oVirt gerrit 79290 master MERGED asyncevent: Log errors in handle_error() 2017-07-13 10:33 EDT
oVirt gerrit 79291 master MERGED check: Handle errors the complete callback 2017-07-13 10:33 EDT
oVirt gerrit 79297 master MERGED monitor: _notifyStatusChange should not be a cancelpoint 2017-07-13 18:15 EDT
oVirt gerrit 79458 ovirt-4.1 MERGED utils: Log BaseException in utils.traceback() 2017-08-01 08:39 EDT
oVirt gerrit 79460 ovirt-4.1 MERGED asyncevent: Fix double complete calls on errors 2017-08-01 08:39 EDT
oVirt gerrit 79461 ovirt-4.1 MERGED asyncevent: Add failing test for handle_error failures 2017-08-01 08:40 EDT
oVirt gerrit 79462 ovirt-4.1 MERGED asyncevent: Log errors in handle_error() 2017-08-01 08:40 EDT
oVirt gerrit 79463 ovirt-4.1 MERGED check: Handle errors the complete callback 2017-08-01 08:40 EDT
oVirt gerrit 79464 ovirt-4.1 MERGED monitor: Fix domain monitoring isolation 2017-08-01 08:40 EDT
oVirt gerrit 79465 ovirt-4.1 MERGED monitor: _notifyStatusChange should not be a cancelpoint 2017-08-01 08:40 EDT

Description Raz Tamir 2017-07-10 11:17:46 EDT
Created attachment 1295854 [details]
vdsm logs

Description of problem:
vdsm path checking stops

2017-07-09 10:17:13,765+0300 INFO  (check/loop) [storage.SANLock] Host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f already acquired (id=2, async=True) (clusterlock:313)
2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)
2017-07-09 10:17:13,771+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f canceled (monitor:309)
2017-07-09 10:17:13,772+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f stopped (shutdown=False) (monitor:312)
2017-07-09 10:17:13,772+0300 INFO  (monitor/b7ba16a) [storage.check] Stop checking '/dev/b7ba16a4-fd9f-49aa-94d3-e12fbf00740f/metadata' (check:136)
2017-07-09 10:17:14,143+0300 DEBUG (monitor/6baec58) [storage.Safelease] Host id for domain 6baec589-6add-4868-9cd3-0da93f1d0b24 released successfully (id: 2) (clusterlock:163)
2017-07-09 10:17:14,144+0300 DEBUG (monitor/6baec58) [storage.Monitor] FINISH thread <Thread(monitor/6baec58, started daemon 139655861008128)> (concurrent:192)
2017-07-09 10:17:14,753+0300 DEBUG (monitor/f722591) [storage.Monitor] FINISH thread <Thread(monitor/f722591, started daemon 139655701546752)> (concurrent:192)
2017-07-09 10:17:14,755+0300 INFO  (monitor/1320c57) [storage.SANLock] Releasing host id for domain 1320c570-c46f-476d-b84f-6ef92c50acce (id: 2) (clusterlock:326)
2017-07-09 10:17:14,757+0300 INFO  (monitor/e8cfeff) [storage.SANLock] Releasing host id for domain e8cfeffc-ae87-4dc5-948d-042d7befc4ff (id: 2) (clusterlock:326)
2017-07-09 10:17:14,759+0300 INFO  (monitor/7a613f6) [storage.SANLock] Releasing host id for domain 7a613f66-5ec2-417a-8a89-5989030baabf (id: 2) (clusterlock:326)
2017-07-09 10:17:14,760+0300 INFO  (monitor/a0f88d0) [storage.SANLock] Releasing host id for domain a0f88d0b-fdbf-4f85-8a76-8471b612be5f (id: 2) (clusterlock:326)
2017-07-09 10:17:14,762+0300 INFO  (monitor/93d5dc6) [storage.SANLock] Releasing host id for domain 93d5dc65-31e4-444a-b46b-ce2307f0a4c1 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,764+0300 INFO  (monitor/9e3e414) [storage.SANLock] Releasing host id for domain 9e3e4147-de96-4d74-a6de-3a79c50302e7 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,765+0300 INFO  (monitor/34b3752) [storage.SANLock] Releasing host id for domain 34b37522-6418-429a-9a7a-c30c88b8d817 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,773+0300 INFO  (monitor/b7ba16a) [storage.SANLock] Releasing host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f (id: 2) (clusterlock:326)

The result is a timeout acquiring leases:

2017-07-09 10:34:13,134+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveOrCopyDisk' (id: '525986ac-19b0-4509-872d-412326cda8a4') waiting on child command id: '7df669f2-f033-4151-b67c-1bd392017bb3' type:'MoveImageGroup' to complete
2017-07-09 10:34:13,157+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveImageGroup' (id: '7df669f2-f033-4151-b67c-1bd392017bb3') waiting on child command id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' type:'CopyImageGroupWithData' to complete
2017-07-09 10:34:18,172+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' (id: '92bd660f-06ac-44a4-8c62-36f378bbdbee') waiting on child command id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' type:'CopyImageGroupVolumesData' to complete
2017-07-09 10:34:19,181+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler6) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' (id: 'e2c9242a-ac92-4909-9e3b-58466c28b701') waiting on child command id: '72e5677b-28b9-4944-9b48-da00b780fee3' type:'CopyData' to complete
2017-07-09 10:34:20,198+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] START, GetHostJobsVDSCommand(HostName = host_mixed_2, GetHostJobsVDSCommandParameters:{hostId='7ab3e3f7-1352-46cc-b590-82eb30e35dfc', type='storage', jobIds='[f641b9df-8c64-4b57-919e-a5cf5b052274]'}), log id: 49e5b5ed
2017-07-09 10:34:20,207+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] FINISH, GetHostJobsVDSCommand, return: {f641b9df-8c64-4b57-919e-a5cf5b052274=HostJobInfo:{id='f641b9df-8c64-4b57-919e-a5cf5b052274', type='storage', description='copy_data', status='failed', progress='null', error='VDSError:{code='AcquireHostIdFailure', message='Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5', path=u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge8__nfs__0/7a613f66-5ec2-417a-8a89-5989030baabf/images/d0bad82e-633d-462d-ab4d-f44e21d7a2c0/fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5.lease', offset=0) (id=2)",)'}'}}, log id: 49e5b5ed
2017-07-09 10:34:20,215+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': job 'f641b9df-8c64-4b57-919e-a5cf5b052274' execution was completed with VDSM job status 'failed'
2017-07-09 10:34:20,223+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': execution was completed, the command status is 'FAILED'
2017-07-09 10:34:21,236+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2017-07-09 10:34:21,258+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' child commands '[72e5677b-28b9-4944-9b48-da00b780fee3]' executions were completed, status 'FAILED'
2017-07-09 10:34:22,293+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2017-07-09 10:34:22,318+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' child commands '[002f899e-9bf0-499a-acd8-a3b8b556a4d6, e2c9242a-ac92-4909-9e3b-58466c28b701]' executions were completed, status 'FAILED'


vdsm.log:

2017-07-09 14:19:10,405+0300 ERROR (tasks/8) [root] Job '9b3a948f-5c16-48bc-a551-51343c34989f' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/copy_data.py", line 61, in _run
    with guarded.context(self._source.locks + self._dest.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1519, in acquire
    dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 478, in acquireVolumeLease
    self._domainLock.acquire(hostId, lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 390, in acquire
    % (lease, hostId))
AcquireHostIdFailure: Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='9ad82bab-6e6e-4250-9757-e94b01f89736', path='/dev/e8cfeffc-ae87-4dc5-948d-042d7befc4ff/leases', offset=114294784) (id=2)",)




Version-Release number of selected component (if applicable):
vdsm-4.20.1-157.git79aca9e.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170707124946.gitf15a6d9.el7.centos.noarch


How reproducible:
don't know

Steps to Reproduce:
1. Move disk
2.
3.

Actual results:
all domains are reported with actual=False

Expected results:


Additional info:
Comment 1 Raz Tamir 2017-07-10 11:18 EDT
Created attachment 1295855 [details]
engine and vdsm logs
Comment 2 Nir Soffer 2017-07-10 14:28:12 EDT
This is the main issue:

2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)

The event loop used by the path checking thread should never exit. If it fails, we must have a traceback showing the underlying error.

As a first step in handling this, we will enhance logging so such a silent death is not possible. This will expose the real error in the next automation run.
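The logging enhancement described here (matching the attached patch "concurrent: Log BaseException when threads finish") can be sketched as a small wrapper around a thread's target function. This is a minimal illustration, not vdsm's actual code; the names `traced` and `check_loop` are invented for the example:

```python
import logging
import threading

log = logging.getLogger("storage.check")

def traced(func):
    """Log any exception, including BaseException, before a thread
    function returns, so a thread can never die silently."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException:
            log.exception("Unhandled exception in thread %s",
                          threading.current_thread().name)
            raise
    return wrapper

@traced
def check_loop():
    # Stand-in for the path checking event loop; here it fails
    # immediately to show that the failure is logged, not swallowed.
    raise RuntimeError("simulated event loop failure")

t = threading.Thread(target=check_loop, name="check/loop")
t.daemon = True
t.start()
t.join()
```

Catching `BaseException` rather than `Exception` matters: a "special" cancellation error derived directly from `BaseException` would otherwise slip past the handler and kill the thread with no trace in the log.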
Comment 3 Allon Mureinik 2017-07-10 15:48:23 EDT
Does this happen on 4.1.z too?
Comment 4 Nir Soffer 2017-07-10 19:46:08 EDT
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

It should happen with 4.1 and 4.0. I think the trigger is the patches waiting for
the host id, but they just reveal the underlying problem.

Raz, can you run the failing test with the 4.1 backports?
https://gerrit.ovirt.org/79166
Comment 5 Nir Soffer 2017-07-11 16:43:07 EDT
Raz, this issue should be fixed by the attached bugs. Can you run the failing
automation test again with the patches?
Comment 6 Raz Tamir 2017-07-12 11:16:53 EDT
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

No, because this issue is hidden by a different bug, bug #1422508, whose fix is still not merged
Comment 7 Raz Tamir 2017-07-12 11:41:58 EDT
(In reply to Nir Soffer from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Does this happen on 4.1.z too?
> 
> It should happen with 4.1 and 4.0. I think the trigger is the patches
> waiting for
> the host id, but they just reveal the underlying problem.
> 
> Raz, can you run the failing test with the 4.1 backports?
> https://gerrit.ovirt.org/79166

Sure,
testing in progress
Comment 8 Raz Tamir 2017-07-12 11:42:47 EDT
(In reply to Nir Soffer from comment #5)
> Raz, this issue should be fixed by the attached bugs. Can you run the failing
> automation test again with the patches?

Sure,

testing in progress
Comment 9 Avihai 2017-07-13 00:06:08 EDT
(In reply to Raz Tamir from comment #8)
> (In reply to Nir Soffer from comment #5)
> > Raz, this issue should be fixed by the attached bugs. Can you run the failing
> > automation test again with the patches?
> 
> Sure,
> 
> testing in progress

Testing of https://gerrit.ovirt.org/#/c/70443/ was successful.
Link to run:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/4.2_dev/job/rhv-4.2-ge-runner-storage/318/console
Comment 11 Raz Tamir 2017-08-05 04:08:08 EDT
Nir, 
How can we reproduce this?
Comment 12 Nir Soffer 2017-08-07 14:07:03 EDT
Raz, this issue is impossible to reproduce.

Basically you need to have a host in the start SPM flow trying to acquire a host id while the storage check loop thread is trying to acquire it at the same time, blocked on the clusterlock internal lock. Then you need to deactivate this storage domain, which raises a special error in the storage check loop, effectively aborting the thread. I don't know how to reproduce this.

Since you reproduced this with one of your automation tests, running this test
many times should be good enough.

The main fix for this issue was to remove the call in the check loop thread that could raise the special error killing the thread, so this error is no longer possible.
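The isolation principle behind this fix (and the attached patch "check: Handle errors the complete callback") can be illustrated with a minimal sketch: the shared check loop must survive a failure in any single domain's completion callback. The class and method names below are illustrative only, not vdsm's real API:

```python
import logging

log = logging.getLogger("storage.check")

class CheckLoop:
    """Toy model of a shared path-checking loop serving many domain
    monitors. Illustrative only; not vdsm's real implementation."""

    def __init__(self):
        self._checkers = []

    def add_checker(self, path, complete):
        """Register a path and the callback to receive its result."""
        self._checkers.append((path, complete))

    def run_once(self):
        """One loop iteration: check every path and deliver results.

        A raising callback is logged and isolated, so one broken
        domain monitor cannot kill checking for all other domains.
        """
        for path, complete in self._checkers:
            result = ("ok", path)  # a real check would read the path
            try:
                complete(result)
            except Exception:
                log.exception("Error in complete callback for %r", path)

loop = CheckLoop()
delivered = []
loop.add_checker("/dev/a/metadata", delivered.append)
loop.add_checker("/dev/b/metadata", lambda res: 1 / 0)  # broken callback
loop.add_checker("/dev/c/metadata", delivered.append)
loop.run_once()  # survives the broken callback; a and c still delivered
```

In the actual fix the problematic call was removed from the loop thread entirely; the sketch shows the complementary safeguard of never letting a callback's error propagate into the shared loop.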
Comment 13 Raz Tamir 2017-08-08 03:55:05 EDT
According to comment #12, closing with resolution NEXTRELEASE.
