Bug 1430122 - SPM start task won't end before host jobs involving the master domain are complete
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.2
Target Release: 4.19.11
Assignee: Liron Aravot
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-07 22:56 UTC by Raz Tamir
Modified: 2017-05-23 08:11 UTC (History)
4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-05-23 08:11:14 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+


Attachments
engine and vdsm logs (2.83 MB, application/x-gzip)
2017-03-07 22:56 UTC, Raz Tamir


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 74534 0 master MERGED sp: _upgradePool - msd update when needed 2017-04-07 08:05:27 UTC
oVirt gerrit 74536 0 master MERGED sp: _upgradePoolDomain - upgrade only when needed 2017-04-07 08:05:31 UTC
oVirt gerrit 75321 0 ovirt-4.1 MERGED sp: _upgradePool - msd update when needed 2017-04-10 06:52:25 UTC
oVirt gerrit 75322 0 ovirt-4.1 MERGED sp: _upgradePoolDomain - upgrade only when needed 2017-04-10 06:52:09 UTC

Description Raz Tamir 2017-03-07 22:56:43 UTC
Created attachment 1261007 [details]
engine and vdsm logs

Description of problem:
When cloning a VM from a template and killing the 'vdsmd' service on the SPM after 'CopyVolumeDataVDSCommand' is called, the SPM host becomes non-responsive and a new SPM is elected only after 5 minutes.
SPM election after killing the SPM with no job running takes about 30 seconds.
During those 5 minutes the DC status is invalid and all storage domains are inaccessible.

- Call from CopyVolumeDataVDSCommand:
2017-03-08 00:25:00,177+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CopyVolumeDataVDSCommand] (DefaultQuartzScheduler5) [52903ff] START, CopyVolumeDataVDSCommand(HostName = host_mixed_2, CopyVolumeDataVDSCommandParameters:{runAsync='true', hostId='9be06994-165f-4d27-b704-c7bb684edaa8', storageDomainId='null', jobId='65ccc8bf-b192-4727-8b8f-c1d5a7c18f1c', srcInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=02238ce6-d46c-452c-bf2e-3cda74803eff, imageId=59579514-891d-4c38-9e83-896e33c0ddd8, generation=null]', dstInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=6ec80213-db1b-4528-9dd5-56a2a78ac642, imageId=828e427c-4815-400b-9f86-1b546d0c8c1e, generation=0]', collapse='true'}), log id: 69af280c

- Kill 'vdsmd':
2017-03-08 00:25:07,108+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer

- New SPM:
2017-03-08 00:30:19,139+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [2e4f8e50] EVENT_ID: IRS_HOSTED_ON_VDS(204), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host host_mixed_2 (Address: storage-ge5-vdsm2.qa.lab.tlv.redhat.com).



Version-Release number of selected component (if applicable):
rhevm-4.1.1.3-0.1.el7
vdsm-4.19.7-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Clone VM from template
2. During step 1, run the following on the SPM:
   # kill $(systemctl show vdsmd -p MainPID | awk -F= '{print $2}')
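The kill pipeline in step 2 is easy to get wrong (the awk quoting in particular), so here is the PID extraction in isolation, run against a simulated `systemctl show vdsmd -p MainPID` line rather than a live vdsmd so the snippet is self-contained:

```shell
# Stand-in for: mainpid_line=$(systemctl show vdsmd -p MainPID)
mainpid_line="MainPID=12345"
# Split on '=' and take the second field, i.e. the PID itself
pid=$(printf '%s\n' "$mainpid_line" | awk -F= '{print $2}')
echo "$pid"   # this is the value step 2 passes to kill
```

On a real SPM, passing this PID to `kill` terminates the vdsm main process, which is what triggers the failover described above.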

Actual results:


Expected results:


Additional info:

Comment 1 Allon Mureinik 2017-03-12 10:33:32 UTC
IIUC, that's the time it takes sanlock to make sure a new SPM can be reelected, so I doubt there's anything intelligent we can do here.

Liron, any insight?

Comment 2 Liron Aravot 2017-03-20 18:33:23 UTC
The new host on which we attempt to start the SPM is running a copyData job. When executing the job, a shared lock on the storage domain is acquired; during spmStart we attempt to upgrade the master domain (for which we also attempt to acquire a lock), so currently the SPM cannot be started while the copy job (or any other job holding that lock) is running.
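The linked gerrit patches ("msd update when needed", "upgrade only when needed") resolve this by making the upgrade conditional, so spmStart does not contend for the domain lock at all when no upgrade is required. The sketch below illustrates that idea only; the class, function names, and lock model are illustrative stand-ins, not vdsm's actual API:

```python
import threading

class Domain:
    """Illustrative stand-in for a storage domain with versioned metadata."""
    def __init__(self, version):
        self.version = version
        self.lock = threading.Lock()  # stands in for the shared/exclusive domain lock

def upgrade_pool_domain(domain, target_version):
    """Upgrade the domain's metadata version, taking its lock only when needed.

    If the domain is already at the target version, return immediately
    without touching the lock, so SPM start does not block behind
    long-running jobs (e.g. copy_data) that hold the same lock.
    """
    if domain.version >= target_version:
        return False  # nothing to do; the lock is never acquired
    with domain.lock:
        # Re-check under the lock in case another thread upgraded first
        if domain.version >= target_version:
            return False
        domain.version = target_version
        return True
```

With this shape, the common case (domain already at the current version) never touches the lock, so a running copy job cannot delay SPM election.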

Comment 3 Liron Aravot 2017-03-29 20:02:17 UTC
Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say that yes).

Comment 4 Allon Mureinik 2017-04-09 15:28:57 UTC
(In reply to Liron Aravot from comment #3)
> Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say that yes).
4.1.1 is already out there, and we won't respin for this bug.
Let's solve it for 4.1.2.

Comment 5 rhev-integ 2017-04-26 10:51:38 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[NO RELEVANT PATCHES FOUND]

For more info please contact: infra

Comment 6 Sandro Bonazzola 2017-04-27 11:18:26 UTC
Moving back to POST since a patch has not been merged yet

Comment 7 Red Hat Bugzilla Rules Engine 2017-04-27 11:18:32 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 8 Liron Aravot 2017-04-27 13:58:54 UTC
The patch that appeared in POST ended up not being included in the fix for this bz. Removed it from the tracker; changing back to ON_QA.

Comment 9 Kevin Alon Goldblatt 2017-04-30 11:14:09 UTC
Verified with the following code:
------------------------------------------------
ovirt-engine-4.1.2-0.1.el7.noarch
rhevm-4.1.2-0.1.el7.noarch
vdsm-4.19.11-1.el7ev.x86_64

Verified with the following scenario:
------------------------------------------------
Create a Cloned VM from a template
After the CopyVolumeDataVDSCommand is called kill the vdsm process on the SPM
>>>>> The SPM becomes non-operational and the new SPM takes over in around 30 seconds.

Moving to VERIFIED!

