Bug 1469175

Summary: vdsm path checking stops
Product: [oVirt] vdsm
Reporter: Raz Tamir <ratamir>
Component: Core
Assignee: Nir Soffer <nsoffer>
Status: CLOSED NEXTRELEASE
QA Contact: Raz Tamir <ratamir>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.20.0
CC: aefrat, amureini, bugs, lveyde, nsoffer, ratamir, ylavi
Target Milestone: ovirt-4.1.5
Keywords: Automation, Regression
Target Release: 4.19.27
Flags: rule-engine: ovirt-4.1+, rule-engine: blocker+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: v4.19.26
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-08 07:55:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1422508

Attachments:
vdsm logs (flags: none)
engine and vdsm logs (flags: none)

Description Raz Tamir 2017-07-10 15:17:46 UTC
Created attachment 1295854 [details]
vdsm logs

Description of problem:
vdsm path checking stops

2017-07-09 10:17:13,765+0300 INFO  (check/loop) [storage.SANLock] Host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f already acquired (id=2, async=True) (clusterlock:313)
2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)
2017-07-09 10:17:13,771+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f canceled (monitor:309)
2017-07-09 10:17:13,772+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f stopped (shutdown=False) (monitor:312)
2017-07-09 10:17:13,772+0300 INFO  (monitor/b7ba16a) [storage.check] Stop checking '/dev/b7ba16a4-fd9f-49aa-94d3-e12fbf00740f/metadata' (check:136)
2017-07-09 10:17:14,143+0300 DEBUG (monitor/6baec58) [storage.Safelease] Host id for domain 6baec589-6add-4868-9cd3-0da93f1d0b24 released successfully (id: 2) (clusterlock:163)
2017-07-09 10:17:14,144+0300 DEBUG (monitor/6baec58) [storage.Monitor] FINISH thread <Thread(monitor/6baec58, started daemon 139655861008128)> (concurrent:192)
2017-07-09 10:17:14,753+0300 DEBUG (monitor/f722591) [storage.Monitor] FINISH thread <Thread(monitor/f722591, started daemon 139655701546752)> (concurrent:192)
2017-07-09 10:17:14,755+0300 INFO  (monitor/1320c57) [storage.SANLock] Releasing host id for domain 1320c570-c46f-476d-b84f-6ef92c50acce (id: 2) (clusterlock:326)
2017-07-09 10:17:14,757+0300 INFO  (monitor/e8cfeff) [storage.SANLock] Releasing host id for domain e8cfeffc-ae87-4dc5-948d-042d7befc4ff (id: 2) (clusterlock:326)
2017-07-09 10:17:14,759+0300 INFO  (monitor/7a613f6) [storage.SANLock] Releasing host id for domain 7a613f66-5ec2-417a-8a89-5989030baabf (id: 2) (clusterlock:326)
2017-07-09 10:17:14,760+0300 INFO  (monitor/a0f88d0) [storage.SANLock] Releasing host id for domain a0f88d0b-fdbf-4f85-8a76-8471b612be5f (id: 2) (clusterlock:326)
2017-07-09 10:17:14,762+0300 INFO  (monitor/93d5dc6) [storage.SANLock] Releasing host id for domain 93d5dc65-31e4-444a-b46b-ce2307f0a4c1 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,764+0300 INFO  (monitor/9e3e414) [storage.SANLock] Releasing host id for domain 9e3e4147-de96-4d74-a6de-3a79c50302e7 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,765+0300 INFO  (monitor/34b3752) [storage.SANLock] Releasing host id for domain 34b37522-6418-429a-9a7a-c30c88b8d817 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,773+0300 INFO  (monitor/b7ba16a) [storage.SANLock] Releasing host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f (id: 2) (clusterlock:326)

The result is a timeout when acquiring leases:

2017-07-09 10:34:13,134+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveOrCopyDisk' (id: '525986ac-19b0-4509-872d-412326cda8a4') waiting on child command id: '7df669f2-f033-4151-b67c-1bd392017bb3' type:'MoveImageGroup' to complete
2017-07-09 10:34:13,157+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveImageGroup' (id: '7df669f2-f033-4151-b67c-1bd392017bb3') waiting on child command id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' type:'CopyImageGroupWithData' to complete
2017-07-09 10:34:18,172+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' (id: '92bd660f-06ac-44a4-8c62-36f378bbdbee') waiting on child command id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' type:'CopyImageGroupVolumesData' to complete
2017-07-09 10:34:19,181+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler6) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' (id: 'e2c9242a-ac92-4909-9e3b-58466c28b701') waiting on child command id: '72e5677b-28b9-4944-9b48-da00b780fee3' type:'CopyData' to complete
2017-07-09 10:34:20,198+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] START, GetHostJobsVDSCommand(HostName = host_mixed_2, GetHostJobsVDSCommandParameters:{hostId='7ab3e3f7-1352-46cc-b590-82eb30e35dfc', type='storage', jobIds='[f641b9df-8c64-4b57-919e-a5cf5b052274]'}), log id: 49e5b5ed
2017-07-09 10:34:20,207+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] FINISH, GetHostJobsVDSCommand, return: {f641b9df-8c64-4b57-919e-a5cf5b052274=HostJobInfo:{id='f641b9df-8c64-4b57-919e-a5cf5b052274', type='storage', description='copy_data', status='failed', progress='null', error='VDSError:{code='AcquireHostIdFailure', message='Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5', path=u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge8__nfs__0/7a613f66-5ec2-417a-8a89-5989030baabf/images/d0bad82e-633d-462d-ab4d-f44e21d7a2c0/fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5.lease', offset=0) (id=2)",)'}'}}, log id: 49e5b5ed
2017-07-09 10:34:20,215+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': job 'f641b9df-8c64-4b57-919e-a5cf5b052274' execution was completed with VDSM job status 'failed'
2017-07-09 10:34:20,223+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': execution was completed, the command status is 'FAILED'
2017-07-09 10:34:21,236+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2017-07-09 10:34:21,258+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' child commands '[72e5677b-28b9-4944-9b48-da00b780fee3]' executions were completed, status 'FAILED'
2017-07-09 10:34:22,293+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2017-07-09 10:34:22,318+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' child commands '[002f899e-9bf0-499a-acd8-a3b8b556a4d6, e2c9242a-ac92-4909-9e3b-58466c28b701]' executions were completed, status 'FAILED'


vdsm.log:

2017-07-09 14:19:10,405+0300 ERROR (tasks/8) [root] Job '9b3a948f-5c16-48bc-a551-51343c34989f' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/copy_data.py", line 61, in _run
    with guarded.context(self._source.locks + self._dest.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1519, in acquire
    dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 478, in acquireVolumeLease
    self._domainLock.acquire(hostId, lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 390, in acquire
    % (lease, hostId))
AcquireHostIdFailure: Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='9ad82bab-6e6e-4250-9757-e94b01f89736', path='/dev/e8cfeffc-ae87-4dc5-948d-042d7befc4ff/leases', offset=114294784) (id=2)",)




Version-Release number of selected component (if applicable):
vdsm-4.20.1-157.git79aca9e.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170707124946.gitf15a6d9.el7.centos.noarch


How reproducible:
don't know

Steps to Reproduce:
1. Move disk
2.
3.

Actual results:
All storage domains are reported with actual=False.

Expected results:


Additional info:

Comment 1 Raz Tamir 2017-07-10 15:18:24 UTC
Created attachment 1295855 [details]
engine and vdsm logs

Comment 2 Nir Soffer 2017-07-10 18:28:12 UTC
This is the main issue:

2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)

The event loop used by the path checking thread should never exit. If it fails, we must have a traceback showing the underlying error.

As a first step in handling this, we will enhance logging so that such a silent death is not
possible. This will expose the real error in the next automation run.
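
Something along these lines, as a minimal sketch only (not the actual vdsm asyncevent code; the function and thread names below are placeholders): the thread that drives the check event loop should log a full traceback on any exit instead of dying silently.

import logging
import threading

log = logging.getLogger("storage.check")


def run_check_loop(run_loop):
    # run_loop is assumed to be the blocking call that drives the event
    # loop (a run_forever-style method); during normal operation it
    # should never return.
    try:
        run_loop()
    except Exception:
        # An error escaped the loop - log the full traceback so the
        # failure is not silent.
        log.exception("check event loop failed unexpectedly")
        raise
    # Even a "clean" return is abnormal for this loop; record it loudly.
    log.error("check event loop stopped unexpectedly")


def start_check_loop(run_loop):
    t = threading.Thread(target=run_check_loop, args=(run_loop,),
                         name="check/loop")
    t.daemon = True
    t.start()
    return t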

Comment 3 Allon Mureinik 2017-07-10 19:48:23 UTC
Does this happen on 4.1.z too?

Comment 4 Nir Soffer 2017-07-10 23:46:08 UTC
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

It should happen with 4.1 and 4.0. I think the trigger is the patches waiting for
the host id, but they just reveal the underlying problem.

Raz, can you run the failing test with the 4.1 backports?
https://gerrit.ovirt.org/79166

Comment 5 Nir Soffer 2017-07-11 20:43:07 UTC
Raz, this issue should be fixed by the attached bugs. Can you run the failing
automation test again with the patches?

Comment 6 Raz Tamir 2017-07-12 15:16:53 UTC
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

No, because this issue is hidden by a different bug (bug 1422508) whose fix is not merged yet.

Comment 7 Raz Tamir 2017-07-12 15:41:58 UTC
(In reply to Nir Soffer from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Does this happen on 4.1.z too?
> 
> It should happen with 4.1 and 4.0. I think the trigger is the patches
> waiting for
> the host id, but they just reveal the underlying problem.
> 
> Raz, can you run the failing test with the 4.1 backports?
> https://gerrit.ovirt.org/79166

Sure,
testing in progress

Comment 8 Raz Tamir 2017-07-12 15:42:47 UTC
(In reply to Nir Soffer from comment #5)
> Raz, this issue should be fixed by the attached bugs. Can you run the failing
> automation test again with the patches?

Sure,

testing in progress

Comment 9 Avihai 2017-07-13 04:06:08 UTC
(In reply to Raz Tamir from comment #8)
> (In reply to Nir Soffer from comment #5)
> > Raz, this issue should be fixed by the attached bugs. Can you run the failing
> > automation test again with the patches?
> 
> Sure,
> 
> testing in progress

Testing of https://gerrit.ovirt.org/#/c/70443/ was successful.
Link to run:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/4.2_dev/job/rhv-4.2-ge-runner-storage/318/console

Comment 11 Raz Tamir 2017-08-05 08:08:08 UTC
Nir, 
How can we reproduce this?

Comment 12 Nir Soffer 2017-08-07 18:07:03 UTC
Raz, this issue is impossible to reproduce.

Basically, you need a host in the start-SPM flow trying to acquire a host id while the
storage check loop thread is trying to acquire it at the same time and is blocked on the
clusterlock internal lock. Then you need to deactivate this storage domain, which raises
a special error in the storage check loop and practically aborts that thread. I don't
know how to reproduce this.
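
Roughly, as a toy sketch only (not vdsm code; every name below is made up), the failure mode is a single shared loop thread servicing checks for many domains, where an error raised while handling one domain escapes and stops checking for all of them:

import time


class DomainDeactivated(Exception):
    # Stand-in for the special error raised when a domain is deactivated.
    pass


def check_domain(domain, active_domains):
    # Stand-in for reading the domain's metadata path.
    if domain not in active_domains:
        raise DomainDeactivated(domain)


def check_loop(domains, active_domains):
    # Buggy pattern: no per-domain error handling, so deactivating one
    # domain at the wrong moment kills the loop that checks every domain.
    while True:
        for domain in list(domains):
            check_domain(domain, active_domains)
        time.sleep(10)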

Since you reproduced this with one of your automation tests, running this test
many times should be good enough.

The main fix for this issue was to remove the call in the check loop thread that could
raise the special error and kill the check loop thread, so this error can no longer
happen.
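
The sketch below (again with made-up names, self-contained like the toy example above) only shows the general defensive idea of not letting a per-domain error escape the shared loop; the actual fix, as described here, removed the offending call from the loop thread entirely:

import logging
import time

log = logging.getLogger("storage.check")


def check_domain(domain, active_domains):
    # Same hypothetical per-domain check as in the sketch above.
    if domain not in active_domains:
        raise RuntimeError("domain %s was deactivated" % domain)


def safe_check_loop(domains, active_domains):
    # Defensive variant: an error while checking one domain is logged and
    # skipped, so the shared loop keeps running for the other domains.
    while True:
        for domain in list(domains):
            try:
                check_domain(domain, active_domains)
            except Exception:
                log.exception("error checking domain %s", domain)
        time.sleep(10)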

Comment 13 Raz Tamir 2017-08-08 07:55:05 UTC
According to comment #12, closing with resolution NEXTRELEASE.