Bug 1400707

Summary: Live merge failed on "timeout which can be caused by communication issues"
Product: [oVirt] ovirt-engine
Reporter: Raz Tamir <ratamir>
Component: BLL.Storage
Assignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE
QA Contact: Raz Tamir <ratamir>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: bugs, gklein, ratamir, tnisan, ylavi
Target Milestone: ovirt-4.1.0-beta
Keywords: Automation, Regression, Reopened
Target Release: 4.1.0.2
Flags: rule-engine: ovirt-4.1+, rule-engine: blocker+, tnisan: devel_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-01 14:37:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description            Flags
spm and engine logs    none
hsm and engine logs    none

Description Raz Tamir 2016-12-01 21:47:57 UTC
Created attachment 1227035 [details]
spm and engine logs

Description of problem:
When trying to perform a live merge (VM running on either SPM or HSM), the operation fails:
2016-12-01 23:28:24,616+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-6) [68218035] Command 'MergeVDSCommand(HostName = host_mixed_3, MergeVDSCommandParameters:{runAsync='true', hostId='88d0d698-e962-4d4c-b333-3667a678c580', vmId='ea659a41-088f-4521-a09d-abe4a9802f73', storagePoolId='5ef2e0f0-1bba-45b0-ab2f-6c51ba0692f9', storageDomainId='e7826af8-fe1c-44af-8cef-7e7c7af67d5e', imageGroupId='30ee327a-e5e7-44be-b9aa-a0ee11916eab', imageId='bbb0f647-ebc0-4a2c-9b4e-340a799322e0', baseImageId='8672013b-a877-43b0-9d95-9379b53ae1dd', topImageId='bbb0f647-ebc0-4a2c-9b4e-340a799322e0', bandwidth='0'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

Followed by:

2016-12-01 23:28:24,616+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-48) [68218035] Host 'host_mixed_3' is not responding.
2016-12-01 23:28:24,616+02 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-5-thread-6) [68218035] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)

There is no error in vdsm, and it never becomes non-responsive.
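
To line up the engine-side failure with the vdsm logs, it can help to pull every engine.log entry that carries the same correlation ID (68218035 in the excerpt above). The following is a minimal Python sketch, assuming the line layout seen in that excerpt (timestamp, level, logger in square brackets, thread in parentheses, correlation ID in square brackets). The function name entries_for_flow and the regular expression are illustrative and are not part of oVirt.

```python
import re

# Layout assumed from the engine.log excerpt above:
# <date> <time>,<ms><tz> <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<flow>[^\]]*)\]\s+"
    r"(?P<msg>.*)$"
)


def entries_for_flow(lines, flow_id, levels=("WARN", "ERROR")):
    """Return parsed entries for one correlation ID at the given levels.

    Hypothetical helper: lines that do not match the assumed layout
    (for example stack-trace continuation lines) are skipped.
    """
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match["flow"] == flow_id and match["level"] in levels:
            found.append(match.groupdict())
    return found


if __name__ == "__main__":
    sample = [
        "2016-12-01 23:28:24,616+02 WARN  "
        "[org.ovirt.engine.core.vdsbroker.VdsManager] "
        "(org.ovirt.thread.pool-6-thread-48) [68218035] "
        "Host 'host_mixed_3' is not responding.",
    ]
    for entry in entries_for_flow(sample, "68218035"):
        short_logger = entry["logger"].rsplit(".", 1)[-1]
        print(entry["ts"], entry["level"], short_logger, entry["msg"])
```

Passing an open engine.log file object as lines works as well, since the function only iterates over it. Comparing the timestamps it returns with the vdsm log for the same window is one way to confirm that the host side stayed healthy while the engine reported a timeout.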




Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161126211319.gitae69c34.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM with an existing snapshot
2. Remove the snapshot

Actual results:
The live merge fails with the timeout error described above.


Expected results:
The live merge flow should finish successfully.

Additional info:

Comment 1 Raz Tamir 2016-12-01 21:48:30 UTC
Created attachment 1227036 [details]
hsm and engine logs

Comment 2 Allon Mureinik 2016-12-02 00:56:44 UTC
Tentatively targeting 4.1.
Raz - does this reproduce in 4.0.z too?

Comment 3 Raz Tamir 2016-12-02 09:28:29 UTC
Allon,
In 4.0.z we have a different bug, bug #1400137.
I checked that the results are not the same before opening this bug against 4.1.

Comment 5 Tal Nisan 2016-12-08 15:01:39 UTC
Reproduced by Ala; it is a duplicate of bug 1400137.

*** This bug has been marked as a duplicate of bug 1400137 ***

Comment 6 Tal Nisan 2016-12-08 15:09:33 UTC
Correction: while the attached patch fixes part of bug 1400137, this bug is not a duplicate, since bug 1400137 was affected by another bug in the z-stream.
Reopening this bug to track the issue.

Comment 7 Francesco Romani 2016-12-14 16:41:46 UTC
This bug was caused by internal refactoring and affects unreleased software (meaning no official release); it is fixed in 4.1.0 beta.
So it doesn't deserve doc_text.

Comment 8 Raz Tamir 2017-01-02 11:51:15 UTC
Verified using automation: tier 1 and tier 2 passed on all storage types (NFS, iSCSI, GlusterFS).