1430122 – SPM start task won't end before host jobs involving the master domain are complete

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	vdsm
Classification:	oVirt
Component:	Core
Sub Component:
Version:	4.19.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	ovirt-4.1.2
Target Release:	4.19.11
Assignee:	Liron Aravot
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-03-07 22:56 UTC by Raz Tamir
Modified:	2017-05-23 08:11 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-05-23 08:11:14 UTC
oVirt Team:	Storage
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.1+

Attachments
engine and vdsm logs (2.83 MB, application/x-gzip) 2017-03-07 22:56 UTC, Raz Tamir

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	74534	master	MERGED	sp: _upgradePool - msd update when needed	2017-04-07 08:05:27 UTC
oVirt gerrit	74536	master	MERGED	sp: _upgradePoolDomain - upgrade only when needed	2017-04-07 08:05:31 UTC
oVirt gerrit	75321	ovirt-4.1	MERGED	sp: _upgradePool - msd update when needed	2017-04-10 06:52:25 UTC
oVirt gerrit	75322	ovirt-4.1	MERGED	sp: _upgradePoolDomain - upgrade only when needed	2017-04-10 06:52:09 UTC

Description Raz Tamir 2017-03-07 22:56:43 UTC

Created attachment 1261007
engine and vdsm logs

Description of problem:
When cloning a VM from a template, killing the 'vdsmd' service on the SPM after 'CopyVolumeDataVDSCommand' has been called leaves the SPM host non-responsive, and a new SPM is elected only after 5 minutes.
When the SPM is killed with no job running, SPM election takes 30 seconds.
During those 5 minutes the DC status is invalid and all storage domains are inaccessible.

- Call from CopyVolumeDataVDSCommand:
2017-03-08 00:25:00,177+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CopyVolumeDataVDSCommand] (DefaultQuartzScheduler5) [52903ff] START, CopyVolumeDataVDSCommand(HostName = host_mixed_2, CopyVolumeDataVDSCommandParameters:{runAsync='true', hostId='9be06994-165f-4d27-b704-c7bb684edaa8', storageDomainId='null', jobId='65ccc8bf-b192-4727-8b8f-c1d5a7c18f1c', srcInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=02238ce6-d46c-452c-bf2e-3cda74803eff, imageId=59579514-891d-4c38-9e83-896e33c0ddd8, generation=null]', dstInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=6ec80213-db1b-4528-9dd5-56a2a78ac642, imageId=828e427c-4815-400b-9f86-1b546d0c8c1e, generation=0]', collapse='true'}), log id: 69af280c

- Kill 'vdsmd':
2017-03-08 00:25:07,108+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer

- New SPM:
2017-03-08 00:30:19,139+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [2e4f8e50] EVENT_ID: IRS_HOSTED_ON_VDS(204), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host host_mixed_2 (Address: storage-ge5-vdsm2.qa.lab.tlv.redhat.com).
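The gap between the connection reset and the new SPM event can be worked out from the two engine log timestamps above. This is a minimal Python sketch; the helper and variable names are illustrative only, and the '+02' offset is padded to '+0200' because strptime's %z directive expects minutes.

```python
from datetime import datetime


def parse_engine_ts(ts):
    """Parse an engine.log timestamp such as '2017-03-08 00:25:07,108+02'."""
    # Pad the hour-only UTC offset so that %z can parse it.
    return datetime.strptime(ts + "00", "%Y-%m-%d %H:%M:%S,%f%z")


vdsmd_killed = parse_engine_ts("2017-03-08 00:25:07,108+02")
new_spm = parse_engine_ts("2017-03-08 00:30:19,139+02")

gap_seconds = (new_spm - vdsmd_killed).total_seconds()
print(f"{gap_seconds:.3f} s")  # 312.031 s, i.e. about 5 min 12 s
```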



Version-Release number of selected component (if applicable):
rhevm-4.1.1.3-0.1.el7
vdsm-4.19.7-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Clone a VM from a template
2. While step 1 is running, run the following on the SPM:
# kill `systemctl show vdsmd -p MainPID | awk -F '=' '{print $2}'`

Actual results:


Expected results:


Additional info:

Comment 1 Allon Mureinik 2017-03-12 10:33:32 UTC

IIUC, that's the time it takes sanlock to make sure a new SPM can be reelected, so I doubt there's anything intelligent we can do here.

Liron, any insight?

Comment 2 Liron Aravot 2017-03-20 18:33:23 UTC

The new host on which we attempt to start the SPM is running a copyData job. While the job executes, a shared lock on the storage domain is held. During spmStart we attempt to upgrade the master domain, which requires acquiring a lock on it as well, so currently the SPM cannot be started while the copy job (or any other job using that lock) is running.
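The contention described in this comment can be illustrated with a toy shared/exclusive lock. This is only a sketch and not vdsm code: the class, the start_spm function and the version numbers are hypothetical. It also assumes the upgrade takes the lock exclusively, and that the "upgrade only when needed" idea from the linked patch titles means not requesting the lock when the domain is already at the target version.

```python
import threading


class SharedExclusiveLock:
    """Toy shared/exclusive lock; NOT vdsm's actual locking code."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared_holders = 0
        self._exclusive = False

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared_holders += 1

    def release_shared(self):
        with self._cond:
            self._shared_holders -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout):
        """Return True if acquired within `timeout` seconds, else False."""
        with self._cond:
            acquired = self._cond.wait_for(
                lambda: not self._exclusive and self._shared_holders == 0,
                timeout=timeout)
            if acquired:
                self._exclusive = True
            return acquired

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


def start_spm(domain_lock, domain_version, target_version,
              only_when_needed, timeout=0.1):
    """Hypothetical SPM start; returns True if it completes in time."""
    if only_when_needed and domain_version == target_version:
        return True  # nothing to upgrade, so no lock is requested
    if not domain_lock.acquire_exclusive(timeout):
        return False  # stuck behind the job holding the shared lock
    try:
        return True  # the upgrade itself is elided in this sketch
    finally:
        domain_lock.release_exclusive()


lock = SharedExclusiveLock()
lock.acquire_shared()  # a copy job is running on this host

blocked = start_spm(lock, 4, 4, only_when_needed=False)
skipped = start_spm(lock, 4, 4, only_when_needed=True)
lock.release_shared()  # the copy job finishes
after_job = start_spm(lock, 4, 4, only_when_needed=False)
print(blocked, skipped, after_job)
```

With the unconditional upgrade, the SPM start times out for as long as the shared holder exists; with the version check it returns immediately because no lock is requested.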

Comment 3 Liron Aravot 2017-03-29 20:02:17 UTC

Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say yes).

Comment 4 Allon Mureinik 2017-04-09 15:28:57 UTC

(In reply to Liron Aravot from comment #3)
> Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say yes).
4.1.1 is already out there, and we won't respin for this bug.
Let's solve it for 4.1.2.

Comment 5 rhev-integ 2017-04-26 10:51:38 UTC

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[NO RELEVANT PATCHES FOUND]

For more info please contact: infra

Comment 6 Sandro Bonazzola 2017-04-27 11:18:26 UTC

Moving back to POST since a patch has not been merged yet

Comment 7 Red Hat Bugzilla Rules Engine 2017-04-27 11:18:32 UTC

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 8 Liron Aravot 2017-04-27 13:58:54 UTC

The patch that appeared as being in POST was ultimately not included in the fix for this BZ. I removed it from the tracker and am changing the status back to ON_QA.

Comment 9 Kevin Alon Goldblatt 2017-04-30 11:14:09 UTC

Verified with the following code:
------------------------------------------------
ovirt-engine-4.1.2-0.1.el7.noarch
rhevm-4.1.2-0.1.el7.noarch
vdsm-4.19.11-1.el7ev.x86_64

Verified with the following scenario:
------------------------------------------------
Create a cloned VM from a template.
After CopyVolumeDataVDSCommand is called, kill the vdsm process on the SPM.
>>>>> The SPM becomes non-operational and the new SPM takes over in around 30 seconds.

Moving to VERIFIED!
