Bug 1469175 - vdsm path checking stops
Status: CLOSED NEXTRELEASE
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.1.5
Target Release: 4.19.27
Assigned To: Nir Soffer
QA Contact: Raz Tamir
Keywords: Automation, Regression
Depends On:
Blocks: 1422508
Reported: 2017-07-10 11:17 EDT by Raz Tamir
Modified: 2017-08-09 12:37 EDT
CC: 7 users

See Also:
Fixed In Version: v4.19.26
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-08 03:55:05 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
vdsm logs (5.25 MB, application/x-gzip)
2017-07-10 11:17 EDT, Raz Tamir
engine and vdsm logs (2.80 MB, application/x-gzip)
2017-07-10 11:18 EDT, Raz Tamir


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 70443 master MERGED monitor: Fix domain monitoring isolation 2017-07-13 18:15 EDT
oVirt gerrit 79241 master MERGED concurrent: Log BaseException when threads finish 2017-07-11 16:33 EDT
oVirt gerrit 79242 master MERGED asyncevent: Fix double complete calls on errors 2017-07-13 10:32 EDT
oVirt gerrit 79261 master MERGED asyncevent: Add failing test for handle_error failures 2017-07-13 10:33 EDT
oVirt gerrit 79290 master MERGED asyncevent: Log errors in handle_error() 2017-07-13 10:33 EDT
oVirt gerrit 79291 master MERGED check: Handle errors the complete callback 2017-07-13 10:33 EDT
oVirt gerrit 79297 master MERGED monitor: _notifyStatusChange should not be a cancelpoint 2017-07-13 18:15 EDT
oVirt gerrit 79458 ovirt-4.1 MERGED utils: Log BaseException in utils.traceback() 2017-08-01 08:39 EDT
oVirt gerrit 79460 ovirt-4.1 MERGED asyncevent: Fix double complete calls on errors 2017-08-01 08:39 EDT
oVirt gerrit 79461 ovirt-4.1 MERGED asyncevent: Add failing test for handle_error failures 2017-08-01 08:40 EDT
oVirt gerrit 79462 ovirt-4.1 MERGED asyncevent: Log errors in handle_error() 2017-08-01 08:40 EDT
oVirt gerrit 79463 ovirt-4.1 MERGED check: Handle errors the complete callback 2017-08-01 08:40 EDT
oVirt gerrit 79464 ovirt-4.1 MERGED monitor: Fix domain monitoring isolation 2017-08-01 08:40 EDT
oVirt gerrit 79465 ovirt-4.1 MERGED monitor: _notifyStatusChange should not be a cancelpoint 2017-08-01 08:40 EDT

Description Raz Tamir 2017-07-10 11:17:46 EDT
Created attachment 1295854 [details]
vdsm logs

Description of problem:
vdsm path checking stops

2017-07-09 10:17:13,765+0300 INFO  (check/loop) [storage.SANLock] Host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f already acquired (id=2, async=True) (clusterlock:313)
2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)
2017-07-09 10:17:13,771+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f canceled (monitor:309)
2017-07-09 10:17:13,772+0300 DEBUG (monitor/b7ba16a) [storage.Monitor] Domain monitor for b7ba16a4-fd9f-49aa-94d3-e12fbf00740f stopped (shutdown=False) (monitor:312)
2017-07-09 10:17:13,772+0300 INFO  (monitor/b7ba16a) [storage.check] Stop checking '/dev/b7ba16a4-fd9f-49aa-94d3-e12fbf00740f/metadata' (check:136)
2017-07-09 10:17:14,143+0300 DEBUG (monitor/6baec58) [storage.Safelease] Host id for domain 6baec589-6add-4868-9cd3-0da93f1d0b24 released successfully (id: 2) (clusterlock:163)
2017-07-09 10:17:14,144+0300 DEBUG (monitor/6baec58) [storage.Monitor] FINISH thread <Thread(monitor/6baec58, started daemon 139655861008128)> (concurrent:192)
2017-07-09 10:17:14,753+0300 DEBUG (monitor/f722591) [storage.Monitor] FINISH thread <Thread(monitor/f722591, started daemon 139655701546752)> (concurrent:192)
2017-07-09 10:17:14,755+0300 INFO  (monitor/1320c57) [storage.SANLock] Releasing host id for domain 1320c570-c46f-476d-b84f-6ef92c50acce (id: 2) (clusterlock:326)
2017-07-09 10:17:14,757+0300 INFO  (monitor/e8cfeff) [storage.SANLock] Releasing host id for domain e8cfeffc-ae87-4dc5-948d-042d7befc4ff (id: 2) (clusterlock:326)
2017-07-09 10:17:14,759+0300 INFO  (monitor/7a613f6) [storage.SANLock] Releasing host id for domain 7a613f66-5ec2-417a-8a89-5989030baabf (id: 2) (clusterlock:326)
2017-07-09 10:17:14,760+0300 INFO  (monitor/a0f88d0) [storage.SANLock] Releasing host id for domain a0f88d0b-fdbf-4f85-8a76-8471b612be5f (id: 2) (clusterlock:326)
2017-07-09 10:17:14,762+0300 INFO  (monitor/93d5dc6) [storage.SANLock] Releasing host id for domain 93d5dc65-31e4-444a-b46b-ce2307f0a4c1 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,764+0300 INFO  (monitor/9e3e414) [storage.SANLock] Releasing host id for domain 9e3e4147-de96-4d74-a6de-3a79c50302e7 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,765+0300 INFO  (monitor/34b3752) [storage.SANLock] Releasing host id for domain 34b37522-6418-429a-9a7a-c30c88b8d817 (id: 2) (clusterlock:326)
2017-07-09 10:17:14,773+0300 INFO  (monitor/b7ba16a) [storage.SANLock] Releasing host id for domain b7ba16a4-fd9f-49aa-94d3-e12fbf00740f (id: 2) (clusterlock:326)

The result is a timeout acquiring leases:

2017-07-09 10:34:13,134+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveOrCopyDisk' (id: '525986ac-19b0-4509-872d-412326cda8a4') waiting on child command id: '7df669f2-f033-4151-b67c-1bd392017bb3' type:'MoveImageGroup' to complete
2017-07-09 10:34:13,157+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [disks_syncAction_b62b08ce-44de-418e] Command 'MoveImageGroup' (id: '7df669f2-f033-4151-b67c-1bd392017bb3') waiting on child command id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' type:'CopyImageGroupWithData' to complete
2017-07-09 10:34:18,172+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' (id: '92bd660f-06ac-44a4-8c62-36f378bbdbee') waiting on child command id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' type:'CopyImageGroupVolumesData' to complete
2017-07-09 10:34:19,181+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler6) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' (id: 'e2c9242a-ac92-4909-9e3b-58466c28b701') waiting on child command id: '72e5677b-28b9-4944-9b48-da00b780fee3' type:'CopyData' to complete
2017-07-09 10:34:20,198+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] START, GetHostJobsVDSCommand(HostName = host_mixed_2, GetHostJobsVDSCommandParameters:{hostId='7ab3e3f7-1352-46cc-b590-82eb30e35dfc', type='storage', jobIds='[f641b9df-8c64-4b57-919e-a5cf5b052274]'}), log id: 49e5b5ed
2017-07-09 10:34:20,207+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] FINISH, GetHostJobsVDSCommand, return: {f641b9df-8c64-4b57-919e-a5cf5b052274=HostJobInfo:{id='f641b9df-8c64-4b57-919e-a5cf5b052274', type='storage', description='copy_data', status='failed', progress='null', error='VDSError:{code='AcquireHostIdFailure', message='Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5', path=u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge8__nfs__0/7a613f66-5ec2-417a-8a89-5989030baabf/images/d0bad82e-633d-462d-ab4d-f44e21d7a2c0/fb6ce1fc-a7ee-437c-b535-5c7d1c1b03c5.lease', offset=0) (id=2)",)'}'}}, log id: 49e5b5ed
2017-07-09 10:34:20,215+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': job 'f641b9df-8c64-4b57-919e-a5cf5b052274' execution was completed with VDSM job status 'failed'
2017-07-09 10:34:20,223+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler3) [disks_syncAction_b62b08ce-44de-418e] Command CopyData id: '72e5677b-28b9-4944-9b48-da00b780fee3': execution was completed, the command status is 'FAILED'
2017-07-09 10:34:21,236+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2017-07-09 10:34:21,258+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler1) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupVolumesData' id: 'e2c9242a-ac92-4909-9e3b-58466c28b701' child commands '[72e5677b-28b9-4944-9b48-da00b780fee3]' executions were completed, status 'FAILED'
2017-07-09 10:34:22,293+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2017-07-09 10:34:22,318+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [disks_syncAction_b62b08ce-44de-418e] Command 'CopyImageGroupWithData' id: '92bd660f-06ac-44a4-8c62-36f378bbdbee' child commands '[002f899e-9bf0-499a-acd8-a3b8b556a4d6, e2c9242a-ac92-4909-9e3b-58466c28b701]' executions were completed, status 'FAILED'


vdsm.log:

2017-07-09 14:19:10,405+0300 ERROR (tasks/8) [root] Job '9b3a948f-5c16-48bc-a551-51343c34989f' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/copy_data.py", line 61, in _run
    with guarded.context(self._source.locks + self._dest.locks):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1519, in acquire
    dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 478, in acquireVolumeLease
    self._domainLock.acquire(hostId, lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 390, in acquire
    % (lease, hostId))
AcquireHostIdFailure: Cannot acquire host id: ("Timeout acquiring host id, cannot acquire Lease(name='9ad82bab-6e6e-4250-9757-e94b01f89736', path='/dev/e8cfeffc-ae87-4dc5-948d-042d7befc4ff/leases', offset=114294784) (id=2)",)




Version-Release number of selected component (if applicable):
vdsm-4.20.1-157.git79aca9e.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170707124946.gitf15a6d9.el7.centos.noarch


How reproducible:
don't know

Steps to Reproduce:
1. Move disk
2.
3.

Actual results:
all domains are reported with actual=False

Expected results:


Additional info:
Comment 1 Raz Tamir 2017-07-10 11:18 EDT
Created attachment 1295855 [details]
engine and vdsm logs
Comment 2 Nir Soffer 2017-07-10 14:28:12 EDT
This is the main issue:

2017-07-09 10:17:13,766+0300 DEBUG (check/loop) [storage.asyncevent] <EventLoop running=False closed=False at 0x56283600> stopped (asyncevent:130)

The event loop used by the path checking thread should never exit. If it fails, we must have a traceback showing the underlying error.

As a first step in handling this, we will enhance logging so such a silent death is not possible. This will expose the real error in the next automation run.
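The logging enhancement described here (matching the attached patch "concurrent: Log BaseException when threads finish") can be sketched as a small wrapper around a thread's target function. This is a minimal illustration, not vdsm's actual code; the names `traced` and `check_loop` are invented for the example:

```python
import logging
import threading

log = logging.getLogger("storage.check")

def traced(func):
    """Log any exception, including BaseException, before a thread
    function returns, so a thread can never die silently."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException:
            log.exception("Unhandled exception in thread %s",
                          threading.current_thread().name)
            raise
    return wrapper

@traced
def check_loop():
    # Stand-in for the path checking event loop; here it fails
    # immediately to show that the failure is logged, not swallowed.
    raise RuntimeError("simulated event loop failure")

t = threading.Thread(target=check_loop, name="check/loop")
t.daemon = True
t.start()
t.join()
```

Catching `BaseException` rather than `Exception` matters: a "special" cancellation error derived directly from `BaseException` would otherwise slip past the handler and kill the thread with no trace in the log.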
Comment 3 Allon Mureinik 2017-07-10 15:48:23 EDT
Does this happen on 4.1.z too?
Comment 4 Nir Soffer 2017-07-10 19:46:08 EDT
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

It should happen with 4.1 and 4.0. I think the trigger is the patches waiting for
the host id, but they just reveal the underlying problem.

Raz, can you run the failing test with the 4.1 backports?
https://gerrit.ovirt.org/79166
Comment 5 Nir Soffer 2017-07-11 16:43:07 EDT
Raz, this issue should be fixed by the attached bugs. Can you run the failing
automation test again with the patches?
Comment 6 Raz Tamir 2017-07-12 11:16:53 EDT
(In reply to Allon Mureinik from comment #3)
> Does this happen on 4.1.z too?

No, because this issue is hidden by a different bug, bug #1422508, whose fix is still not merged
Comment 7 Raz Tamir 2017-07-12 11:41:58 EDT
(In reply to Nir Soffer from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Does this happen on 4.1.z too?
> 
> It should happen with 4.1 and 4.0. I think the trigger is the patches
> waiting for
> the host id, but they just reveal the underlying problem.
> 
> Raz, can you run the failing test with the 4.1 backports?
> https://gerrit.ovirt.org/79166

Sure,
testing in progress
Comment 8 Raz Tamir 2017-07-12 11:42:47 EDT
(In reply to Nir Soffer from comment #5)
> Raz, this issue should be fixed by the attached bugs. Can you run the failing
> automation test again with the patches?

Sure,

testing in progress
Comment 9 Avihai 2017-07-13 00:06:08 EDT
(In reply to Raz Tamir from comment #8)
> (In reply to Nir Soffer from comment #5)
> > Raz, this issue should be fixed by the attached bugs. Can you run the failing
> > automation test again with the patches?
> 
> Sure,
> 
> testing in progress

Testing of https://gerrit.ovirt.org/#/c/70443/ was successful.
Link to run:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/4.2_dev/job/rhv-4.2-ge-runner-storage/318/console
Comment 11 Raz Tamir 2017-08-05 04:08:08 EDT
Nir, 
How can we reproduce this?
Comment 12 Nir Soffer 2017-08-07 14:07:03 EDT
Raz, this issue is impossible to reproduce.

Basically you need to have a host in the start SPM flow trying to acquire a host id while the storage check loop thread is trying to acquire it at the same time, blocked on the clusterlock internal lock. Then you need to deactivate this storage domain, which raises a special error in the storage check loop, effectively aborting the thread. I don't know how to reproduce this.

Since you reproduced this with one of your automation tests, running this test
many times should be good enough.

The main fix for this issue was to remove the call in the check loop thread that could raise the special error killing the thread, so this error is no longer possible.
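The isolation principle behind this fix (and the attached patch "check: Handle errors the complete callback") can be illustrated with a minimal sketch: the shared check loop must survive a failure in any single domain's completion callback. The class and method names below are illustrative only, not vdsm's real API:

```python
import logging

log = logging.getLogger("storage.check")

class CheckLoop:
    """Toy model of a shared path-checking loop serving many domain
    monitors. Illustrative only; not vdsm's real implementation."""

    def __init__(self):
        self._checkers = []

    def add_checker(self, path, complete):
        """Register a path and the callback to receive its result."""
        self._checkers.append((path, complete))

    def run_once(self):
        """One loop iteration: check every path and deliver results.

        A raising callback is logged and isolated, so one broken
        domain monitor cannot kill checking for all other domains.
        """
        for path, complete in self._checkers:
            result = ("ok", path)  # a real check would read the path
            try:
                complete(result)
            except Exception:
                log.exception("Error in complete callback for %r", path)

loop = CheckLoop()
delivered = []
loop.add_checker("/dev/a/metadata", delivered.append)
loop.add_checker("/dev/b/metadata", lambda res: 1 / 0)  # broken callback
loop.add_checker("/dev/c/metadata", delivered.append)
loop.run_once()  # survives the broken callback; a and c still delivered
```

In the actual fix the problematic call was removed from the loop thread entirely; the sketch shows the complementary safeguard of never letting a callback's error propagate into the shared loop.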
Comment 13 Raz Tamir 2017-08-08 03:55:05 EDT
According to comment #12, closing with resolution NEXTRELEASE.
