Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1430122

Summary: SPM start task won't end before host jobs involving the master domain are complete
Product: [oVirt] vdsm
Reporter: Raz Tamir <ratamir>
Component: Core
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.19.7
CC: amureini, bugs, laravot, lveyde
Target Milestone: ovirt-4.1.2
Keywords: Automation
Target Release: 4.19.11
Flags: rule-engine: ovirt-4.1+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-23 08:11:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
engine and vdsm logs (flags: none)

Description Raz Tamir 2017-03-07 22:56:43 UTC
Created attachment 1261007 [details]
engine and vdsm logs

Description of problem:
When cloning a VM from a template and killing the 'vdsmd' service on the SPM after 'CopyVolumeDataVDSCommand' is called, the SPM host becomes non-responsive and a new SPM is elected only after 5 minutes.
SPM election when the SPM is killed with no jobs running takes 30 seconds.
During those 5 minutes the DC status is Invalid and all storage domains are inaccessible.

- Call from CopyVolumeDataVDSCommand:
2017-03-08 00:25:00,177+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CopyVolumeDataVDSCommand] (DefaultQuartzScheduler5) [52903ff] START, CopyVolumeDataVDSCommand(HostName = host_mixed_2, CopyVolumeDataVDSCommandParameters:{runAsync='true', hostId='9be06994-165f-4d27-b704-c7bb684edaa8', storageDomainId='null', jobId='65ccc8bf-b192-4727-8b8f-c1d5a7c18f1c', srcInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=02238ce6-d46c-452c-bf2e-3cda74803eff, imageId=59579514-891d-4c38-9e83-896e33c0ddd8, generation=null]', dstInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=6ec80213-db1b-4528-9dd5-56a2a78ac642, imageId=828e427c-4815-400b-9f86-1b546d0c8c1e, generation=0]', collapse='true'}), log id: 69af280c

- Kill 'vdsmd':
2017-03-08 00:25:07,108+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer

- New SPM:
2017-03-08 00:30:19,139+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [2e4f8e50] EVENT_ID: IRS_HOSTED_ON_VDS(204), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host host_mixed_2 (Address: storage-ge5-vdsm2.qa.lab.tlv.redhat.com).



Version-Release number of selected component (if applicable):
rhevm-4.1.1.3-0.1.el7
vdsm-4.19.7-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Clone a VM from a template
2. While step 1 is running, kill vdsmd on the SPM:
   # kill $(systemctl show vdsmd -p MainPID | awk -F '=' '{print $2}')
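The PID lookup in step 2 can be sketched in Python as well (a hedged illustration, not part of the original report: the `MainPID=` line is canned here so the snippet is self-contained; on a real SPM host it would come from `systemctl show vdsmd -p MainPID`, and the actual kill is shown only as a comment):

```python
# Hypothetical sketch of step 2: find vdsmd's main PID via systemd and SIGKILL it.
# systemd prints the property as a single "MainPID=<pid>" line.

def parse_main_pid(line: str) -> int:
    """Extract the PID from a systemd 'MainPID=<pid>' property line."""
    key, _, value = line.partition("=")
    if key != "MainPID":
        raise ValueError("unexpected property line: %r" % line)
    return int(value)

# Canned input so the sketch runs anywhere; on the SPM host, read the real line
# from `systemctl show vdsmd -p MainPID` and then: os.kill(pid, signal.SIGKILL)
pid = parse_main_pid("MainPID=1234")
print(pid)  # → 1234
```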

Actual results:
The SPM host becomes non-responsive and a new SPM is elected only after ~5 minutes; during that time the DC status is Invalid and all storage domains are inaccessible.


Expected results:
A new SPM is elected within ~30 seconds, as happens when the SPM is killed with no jobs running.

Additional info:

Comment 1 Allon Mureinik 2017-03-12 10:33:32 UTC
IIUC, that's the time it takes sanlock to make sure a new SPM can be reelected, so I doubt there's anything intelligent we can do here.

Liron, any insight?

Comment 2 Liron Aravot 2017-03-20 18:33:23 UTC
The new host on which we attempt to start the SPM is running a copyData job. When executing the job, a shared lock on the storage domain is acquired; during spmStart we attempt to upgrade the master domain, for which we attempt to acquire a lock as well. So currently the SPM cannot be started while the copy job (or other jobs using that lock) is running.
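The blocking described above can be illustrated with a minimal shared/exclusive lock sketch (this is not vdsm's actual locking code; the class and names are hypothetical): a running copyData job holds the shared lock on the master domain, so the exclusive acquisition attempted during the spmStart upgrade must wait until the job releases it.

```python
# Minimal sketch (hypothetical, not vdsm code) of why spmStart blocks:
# the copy job holds a *shared* lock on the master storage domain, while the
# master-domain upgrade done during spmStart needs an *exclusive* lock.
import threading

class SharedExclusiveLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of shared holders (running jobs)
        self._exclusive = False   # exclusive holder present?

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout=None):
        # Returns False if the lock could not be taken within `timeout`.
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._shared == 0 and not self._exclusive, timeout)
            if ok:
                self._exclusive = True
            return ok

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()

master_domain = SharedExclusiveLock()
master_domain.acquire_shared()                               # copyData job starts
blocked = not master_domain.acquire_exclusive(timeout=0.1)   # spmStart's upgrade
print(blocked)                                               # True: spmStart waits
master_domain.release_shared()                               # job completes
print(master_domain.acquire_exclusive(timeout=0.1))          # True: SPM can start
```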

Comment 3 Liron Aravot 2017-03-29 20:02:17 UTC
Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say that yes).

Comment 4 Allon Mureinik 2017-04-09 15:28:57 UTC
(In reply to Liron Aravot from comment #3)
> Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say that yes).
4.1.1 is already out there, and we won't respin for this bug.
Let's solve it for 4.1.2.

Comment 5 rhev-integ 2017-04-26 10:51:38 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[NO RELEVANT PATCHES FOUND]

For more info please contact: infra

Comment 6 Sandro Bonazzola 2017-04-27 11:18:26 UTC
Moving back to POST since a patch has not been merged yet

Comment 7 Red Hat Bugzilla Rules Engine 2017-04-27 11:18:32 UTC
Target release should be set only once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 8 Liron Aravot 2017-04-27 13:58:54 UTC
The patch that appeared to be in POST was ultimately not included in the fix for this BZ. I removed it from the tracker and am changing the bug back to ON_QA.

Comment 9 Kevin Alon Goldblatt 2017-04-30 11:14:09 UTC
Verified with the following code:
------------------------------------------------
ovirt-engine-4.1.2-0.1.el7.noarch
rhevm-4.1.2-0.1.el7.noarch
vdsm-4.19.11-1.el7ev.x86_64

Verified with the following scenario:
------------------------------------------------
Create a cloned VM from a template.
After CopyVolumeDataVDSCommand is called, kill the vdsm process on the SPM.
>>>>> The SPM becomes non-operational and the new SPM takes over in around 30 seconds.

Moving to VERIFIED!