Created attachment 1295854 [details]
vdsm logs

Description of problem:
vdsm path checking stops

2017-07-09 10:17:13,765+0300 INFO (check/loop) [storage.SANLock] Host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f already acquired (id=2, async=True) (clusterlock:313)
2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)
2017-07-09 10:17:13,771+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f canceled (monitor:309)
2017-07-09 10:17:13,772+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f stopped (shutdown=False) (monitor:312)
2017-07-09 10:17:13,772+0300 INFO (monitor/b7ba16a) [storage.check] Stop checking '/dev/b7ba16a4-fd9f-49aa-94d3-e12fbf00740f/metadata' (check:136)
2017-07-09 10:17:14,143+0300 DEBUG (monitor/6baec58) [storage.Safelease] Host id for domain 6baec589-6add-4868-9cd3-0da93f1d0b24 released successfully (id: 2) (clusterlock:163)
2017-07-09 10:17:14,144+0300 DEBUG (monitor/6baec58) [storage.Monitor] FINISH thread <Thread(monitor/6baec58, started daemon 139655861008128)> (concurrent:192)
2017-07-09 10:17:14,753+0300 DEBUG (monitor/f722591) [storage.Monitor] FINISH thread <Thread(monitor/f722591, started daemon 139655701546752)> (concurrent:192)
2017-07-09 10:17:14,755+0300 INFO (monitor/1320c57) [storage.SANLock] Releasing host id for domain 1320c570-c46f-476d-b84f-6ef92c50acce (id: 2) (clusterlock:326)
2017-07-09 10:17:14,757+0300 INFO (monitor/e8cfeff) [storage.SANLock] Releasing host id for domain e8cfeffc-ae87-4dc5-948d-042d7befc4ff (id: 2) (clusterlock:326)
2017-07-09 10:17:14,759+0300 INFO (monitor/7a613f6) [storage.SANLock] Releasing host id for domain 7a613f66-5ec2-417a-8a89-5989030baabf (id: 2) (clusterlock:326)
2017-07-09 10:17:14,760+0300 INFO (monitor/a0f88d0) [storage.SANLock] Releasing host id for domain a0f88d0b-fdbf-4f85-8a76-8471b612be5f (id: 2) (clusterlock:326)
2017-07-09 10:17:14,762+0300 INFO (monitor/93d5dc6) [storage.SANLock] Releasing host id for domain 93d5dc65-31e4-444a-b46b-ce2307f0a4c1 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,764+0300 INFO (monitor/9e3e414) [storage.SANLock] Releasing host id for domain 9e3e4147-de96-4d74-a6de-3a79c50302e7 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,765+0300 INFO (monitor/34b3752) [storage.SANLock] Releasing host id for domain 34b37522-6418-429a-9a7a-c30c88b8d817 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,773+0300 INFO (monitor/b7ba16a) [storage.SANLock] Releasing host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f (id: 2) (clusterlock:326)

The result is a timeout acquiring leases:

2017-07-09 10:34:13,134+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveOrCopyDisk' (id: '525986ac-19b0-4509-872d-412326cda8a4') waiting on child command id: '7df669f2-f033-4151-b67c-1bd392017bb3' type:'MoveImageGroup' to complete
2017-07-09 10:34:13,157+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveImageGroup' (id: '7df669f2-f033-4151-b67c-1bd392017bb3') waiting on child command id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' type:'CopyImageGroupWithData' to complete
2017-07-09 10:34:18,172+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' (id: '92bd660f-06ac-44a4-8c62-36f378bbdbee') waiting on child command id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' type:'CopyImageGroupVolumesData' to complete
2017-07-09 10:34:19,181+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler6) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' (id: 'e2c9242a-ac92-4909-9e3b-58466c28b701') waiting on child command id: '72e5677b-28b9-4944-9b48-da00b780fee3' type:'CopyData' to complete
2017-07-09 10:34:20,198+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] START, GetHostJobsVDSCommand(HostName = host_mixed_2, GetHostJobsVDSCommandParameters:{hostId='7ab3e3f7-1352-46cc-b590-82eb30e35dfc', type='storage', jobIds='[f641b9df-8c64-4b57-919e-a5cf5b052274]'}), log id: 49e5b5ed
2017-07-09 10:34:20,207+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] FINISH, GetHostJobsVDSCommand, return: {f641b9df-8c64-4b57-919e-a5cf5b052274=HostJobInfo:{id='f641b9df-8c64-4b57-919e-a5cf5b052274', type='storage', description='copy_data', status='failed', progress='null', error='VDSError:{code='AcquireHostIdFailure', message='Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5', path=u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge8__nfs__0/7a613f66-5ec2-417a-8a89-5989030baabf/images/d0bad82e-633d-462d-ab4d-f44e21d7a2c0/fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5.lease', offset=0) (id=2)",)'}'}}, log id: 49e5b5ed
2017-07-09 10:34:20,215+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': job 'f641b9df-8c64-4b57-919e-a5cf5b052274' execution was completed with VDSM job status 'failed'
2017-07-09 10:34:20,223+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': execution was completed, the command status is 'FAILED'
2017-07-09 10:34:21,236+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2017-07-09 10:34:21,258+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' child commands '[72e5677b-28b9-4944-9b48-da00b780fee3]' executions were completed, status 'FAILED'
2017-07-09 10:34:22,293+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2017-07-09 10:34:22,318+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' child commands '[002f899e-9bf0-499a-acd8-a3b8b556a4d6, e2c9242a-ac92-4909-9e3b-58466c28b701]' executions were completed, status 'FAILED'

vdsm.log:

2017-07-09 14:19:10,405+0300 ERROR (tasks/8) [root] Job '9b3a948f-5c16-48bc-a551-51343c34989f' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/copy_data.py", line 61, in _run
    with guarded.context(self._source.locks + self._dest.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1519, in acquire
    dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 478, in acquireVolumeLease
    self._domainLock.acquire(hostId, lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 390, in acquire
    % (lease, hostId))
AcquireHostIdFailure: Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='9ad82bab-6e6e-4250-9757-e94b01f89736', path='/dev/e8cfeffc-ae87-4dc5-948d-042d7befc4ff/leases', offset=114294784) (id=2)",)

Version-Release number of selected component (if applicable):
vdsm-4.20.1-157.git79aca9e.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170707124946.gitf15a6d9.el7.centos.noarch

How reproducible:
Don't know

Steps to Reproduce:
1. Move disk
2.
3.

Actual results:
All domains are in actual=False

Expected results:

Additional info:
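The traceback goes through vdsm's guarded.context, which takes all of the source and destination volume locks (including the sanlock volume lease) before the copy starts, and fails the whole job if any of them cannot be acquired. The following is only a rough sketch of that acquire-all-or-fail pattern, not vdsm's actual guarded.py, and the lock objects here are hypothetical:

# Minimal sketch (NOT vdsm's guarded.py) of the pattern seen in the traceback.
# Each lock's acquire() may end up in clusterlock.acquire(), which raises
# AcquireHostIdFailure when the host id cannot be acquired within the timeout.
from contextlib import contextmanager

@contextmanager
def guarded_context(locks):
    acquired = []
    try:
        for lock in locks:
            lock.acquire()          # may raise, e.g. AcquireHostIdFailure
            acquired.append(lock)
        yield
    finally:
        # Release whatever was acquired, in reverse order, whether the body
        # succeeded or an acquire failed part way through.
        for lock in reversed(acquired):
            lock.release()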
Created attachment 1295855 [details] engine and vdsm logs
This is the main issue:

2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)

The event loop used by the path checking thread should never exit. If it fails, we must have a traceback showing the underlying error. As a first step, we will enhance the logging so that such a silent death is not possible; this will expose the real error in the next automation run.
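As a minimal sketch of the kind of logging guard meant here (the names are illustrative, not the actual vdsm asyncevent code), the thread running the check event loop should log a full traceback whenever the loop returns or raises, instead of dying silently:

import logging
import threading

log = logging.getLogger("storage.asyncevent")

def run_event_loop(loop):
    # The check loop is expected to run until vdsm shuts down; any other exit
    # is a bug and must leave a traceback in the log.
    try:
        loop.run_forever()   # illustrative name for the loop's main call
    except Exception:
        log.exception("Event loop %s failed", loop)
        raise
    finally:
        log.debug("Event loop %s stopped", loop)

# Started as a daemon thread, for example:
# t = threading.Thread(target=run_event_loop, args=(loop,), name="check/loop")
# t.daemon = True
# t.start()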
Does this happen on 4.1.z too?
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

It should happen with 4.1 and 4.0. I think the trigger is the patches waiting for the host id, but they just reveal the underlying problem.

Raz, can you run the failing test with the 4.1 backports?
https://gerrit.ovirt.org/79166
Raz, this issue should be fixed by the attached bugs. Can you run the failing automation test again with the patches?
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

No, because this issue is hidden by a different bug, bug #1422508, whose fix is still not merged.
(In reply to Nir Soffer from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Does this happen on 4.1.z too?
>
> It should happen with 4.1 and 4.0. I think the trigger is the patches
> waiting for the host id, but they just reveal the underlying problem.
>
> Raz, can you run the failing test with the 4.1 backports?
> https://gerrit.ovirt.org/79166

Sure, testing in progress
(In reply to Nir Soffer from comment #5)
> Raz, this issue should be fixed by the attached bugs. Can you run the failing
> automation test again with the patches?

Sure, testing in progress
(In reply to Raz Tamir from comment #8)
> (In reply to Nir Soffer from comment #5)
> > Raz, this issue should be fixed by the attached bugs. Can you run the failing
> > automation test again with the patches?
>
> Sure, testing in progress

Testing of https://gerrit.ovirt.org/#/c/70443/ was successful.

Link to run:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/4.2_dev/job/rhv-4.2-ge-runner-storage/318/console
Testing of https://gerrit.ovirt.org/79166 on 4.1 started.

Test Link:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/4.1_dev/job/rhv-4.1v4-ge-runner-storage/1588/
Nir, how can we reproduce this?
Raz, this issue is practically impossible to reproduce on demand. Basically you need a host in the start-SPM flow trying to acquire a host id, while the storage check loop thread is trying to acquire it at the same time and is blocked on the clusterlock's internal lock. Then you need to deactivate that storage domain, which raises a special error in the storage check loop, effectively aborting the thread.

I don't know how to reproduce this deliberately. Since you reproduced it with one of your automation tests, running that test many times should be good enough.

The main fix for this issue was to remove the call in the check loop thread that could raise the special error and kill the check loop thread, so this failure is no longer possible.
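For illustration only, here is a minimal sketch of that defensive pattern (hypothetical class and method names, not the actual vdsm check.py): everything invoked from the shared check-loop thread catches and logs its own errors, so one misbehaving domain or callback cannot kill path checking for all domains:

import logging

log = logging.getLogger("storage.check")

class PathChecker(object):
    # Hypothetical periodic checker; the real vdsm classes differ.

    def __init__(self, path, complete):
        self._path = path
        self._complete = complete   # callback invoked with the check result

    def check(self):
        # Called periodically from the shared check-loop thread. Nothing in
        # here may raise, or the loop thread dies and all path checking stops.
        try:
            result = self._read(self._path)
        except Exception:
            log.exception("Checking %r failed", self._path)
            return
        try:
            self._complete(result)
        except Exception:
            # A misbehaving callback (for example, one raising a special error
            # while its domain is being deactivated) must not kill the loop
            # either.
            log.exception("Delivering result for %r failed", self._path)

    def _read(self, path):
        # Placeholder for the real O_DIRECT read of the first block.
        with open(path, "rb") as f:
            return f.read(4096)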
According to comment #12, closing with resolution NEXTRELEASE.