Created attachment 1261007
engine and vdsm logs

Description of problem:
When cloning a VM from a template and killing the 'vdsmd' service on the SPM after 'CopyVolumeDataVDSCommand' has been called, the SPM host becomes non-responsive and a new SPM is elected only after about 5 minutes. When the SPM is killed without any job running, SPM election takes 30 seconds. During those 5 minutes the DC status is invalid and all storage domains are inaccessible.

- Call from CopyVolumeDataVDSCommand:
2017-03-08 00:25:00,177+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CopyVolumeDataVDSCommand] (DefaultQuartzScheduler5) [52903ff] START, CopyVolumeDataVDSCommand(HostName = host_mixed_2, CopyVolumeDataVDSCommandParameters:{runAsync='true', hostId='9be06994-165f-4d27-b704-c7bb684edaa8', storageDomainId='null', jobId='65ccc8bf-b192-4727-8b8f-c1d5a7c18f1c', srcInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=02238ce6-d46c-452c-bf2e-3cda74803eff, imageId=59579514-891d-4c38-9e83-896e33c0ddd8, generation=null]', dstInfo='VdsmImageLocationInfo [storageDomainId=b6d5f4bc-6de3-4e5f-af83-af3373421cfb, imageGroupId=6ec80213-db1b-4528-9dd5-56a2a78ac642, imageId=828e427c-4815-400b-9f86-1b546d0c8c1e, generation=0]', collapse='true'}), log id: 69af280c

- Kill 'vdsmd':
2017-03-08 00:25:07,108+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer

- New SPM:
2017-03-08 00:30:19,139+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [2e4f8e50] EVENT_ID: IRS_HOSTED_ON_VDS(204), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host host_mixed_2 (Address: storage-ge5-vdsm2.qa.lab.tlv.redhat.com).

Version-Release number of selected component (if applicable):
rhevm-4.1.1.3-0.1.el7
vdsm-4.19.7-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Clone VM from template
2. Run # kill `systemctl show vdsmd -p MainPID | awk -F '=' {'print $2'}` on the SPM during step 1

Actual results:

Expected results:

Additional info:
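For reference, the delay between the connection reset and the IRS_HOSTED_ON_VDS event can be computed directly from the two engine log timestamps quoted in the description. The following is a minimal sketch, assuming GNU date (coreutils) is available; elapsed_seconds is a hypothetical helper name and not part of oVirt or VDSM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole seconds between two engine.log timestamps.
# Assumes GNU date. Everything from the first comma on (milliseconds and
# the "+02" offset) is dropped; both stamps share the same offset, so the
# difference is unaffected.
elapsed_seconds() {
    local start="${1%%,*}" end="${2%%,*}"
    echo $(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
}

# Timestamps copied from the "Kill 'vdsmd'" and "New SPM" log lines.
elapsed_seconds "2017-03-08 00:25:07,108+02" "2017-03-08 00:30:19,139+02"
```

With the timestamps above this prints 312, i.e. 5 minutes and 12 seconds between the kill and the new SPM being reported.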
IIUC, that's the time it takes sanlock to make sure a new SPM can be reelected, so I doubt there's anything intelligent we can do here. Liron, any insight?
The new host on which we attempt to start the SPM is running a copyData job. While the job executes, a shared lock on the storage domain is held. During spmStart we attempt to upgrade the master domain (for which we attempt to acquire a lock as well), so currently the SPM cannot be started while the copy job (or any other job using that lock) is running.
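The conflict described above can be illustrated with ordinary file locks. This is an analogy only: it uses flock(1) from util-linux rather than the locks VDSM actually takes on the storage domain, and the function name and temporary lock file are hypothetical.

```shell
#!/usr/bin/env bash
# Analogy only: flock(1) stands in for the storage domain lock.
# A "copy job" holds a shared lock; "spmStart" then wants an exclusive one.
try_spm_start_during_copy() {
    local lockfile copy_pid
    lockfile="$(mktemp)"

    # Copy job: hold a shared lock for a few seconds in the background.
    flock -s "$lockfile" sleep 3 &
    copy_pid=$!
    sleep 1   # give the background job time to take the lock

    # spmStart: non-blocking attempt at an exclusive lock on the same file.
    if flock -x -n "$lockfile" true; then
        echo "spm started"
    else
        echo "spm start blocked by running job"
    fi

    # Once the job has finished, the same attempt succeeds.
    wait "$copy_pid"
    if flock -x -n "$lockfile" true; then
        echo "spm started after job finished"
    fi

    rm -f "$lockfile"
}

try_spm_start_during_copy
```

The first attempt is refused because the shared lock is still held, which mirrors why the SPM could not start on the host until the running job released its lock.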
Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say that yes).
(In reply to Liron Aravot from comment #3)
> Moving to 4.1.2 - Allon, should I target it to 4.1.1? (I'd say that yes).

4.1.1 is already out there, and we won't respin for this bug. Let's solve it for 4.1.2.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [NO RELEVANT PATCHES FOUND] For more info please contact: infra
Moving back to POST since a patch has not been merged yet
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
The patch that appeared as POST was ultimately not included in the fix for this bz. Removed it from the tracker and changing the bug back to ON_QA.
Verified with the following code:
------------------------------------------------
ovirt-engine-4.1.2-0.1.el7.noarch
rhevm-4.1.2-0.1.el7.noarch
vdsm-4.19.11-1.el7ev.x86_64

Verified with the following scenario:
------------------------------------------------
Create a cloned VM from a template
After the CopyVolumeDataVDSCommand is called, kill the vdsm process on the SPM

>>>>> The SPM becomes non-operational and the new SPM takes over in around 30 seconds.

Moving to VERIFIED!