Bug 969121

Summary: [engine-backend] host stuck in non-operational and SDs remain active while Data center is Non-responsive
Product: Red Hat Enterprise Virtualization Manager Reporter: Elad <ebenahar>
Component: ovirt-engine Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED CURRENTRELEASE QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: acanan, acathrow, amureini, ebenahar, iheim, jkt, laravot, lpeer, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---   
Target Release: 3.3.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: is2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none
logs none

Description Elad 2013-05-30 17:17:34 UTC
Created attachment 754989 [details]
logs

Description of problem:
After an interruption during reconstruct, the SPM tries to connect to the pool and fails. After that, the host is stuck in non-operational and the domains remain active.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-22.0.el6ev.x86_64
rhevm-3.2.0-11.29.el6ev.noarch

How reproducible:
50%

Steps to Reproduce: on 1 host and 2 SDs from different storage servers:
1. Move the master domain to maintenance and, while that is in progress, stop vdsmd
2. reconstruct will fail
3. start vdsmd

Actual results:
1) host will become non-operational and stuck. 
2) the 2 SDs will remain active even though there are no active hosts in the setup.

Expected results:
1) host should not get stuck in non-operational
2) storage domains should become inactive


Additional info: logs

Comment 1 Elad 2013-06-02 06:27:27 UTC
CORRECTION: reproduction steps: happened to me with 2 storage domains from the same server

Comment 2 Liron Aravot 2013-07-07 17:09:19 UTC
Elad ,
1. The vdsm and engine logs do not match: the engine logs run until 27/5 while the vdsm logs start on 29/5. Please try to reproduce and attach the correct logs, covering a shorter timeframe if possible, thanks.

2. Please point to the time in the logs at which the scenario you referred to happened; I didn't see it in the logs. I think the best option is to reproduce the issue.

Regardless, there were two possibly related issues that remind me of the issue you described:
1. This bug (from the text) - Domain statuses aren't changed. (Recent)
 https://bugzilla.redhat.com/show_bug.cgi?id=977169 

2. When deactivating a domain, its saved status in the compensation is set to UNKNOWN instead of Active (compensation doesn't appear in the engine log)
https://bugzilla.redhat.com/show_bug.cgi?id=920694#c6

#2 was merged, while #1 wasn't.

Comment 3 Elad 2013-07-08 07:34:52 UTC
Created attachment 770314 [details]
logs

Managed to reproduce on 3.2: rhevm-3.2.1-0.39.el6ev.noarch

attached engine.log and vdsm.log

Comment 4 Liron Aravot 2013-07-08 08:27:58 UTC
The issue in the new log should be solved by #2 - moving to MODIFIED

Comment 5 Elad 2013-07-15 12:08:59 UTC
Does not reproduce on RHEVM 3.3 (is5):
rhevm-3.3.0-0.6.master.el6ev.noarch
vdsm-4.11.0-121.git082925a.el6.x86_64

After an interruption during reconstruct, the host becomes active and reconstruct ends successfully.

Comment 6 Itamar Heim 2014-01-21 22:28:20 UTC
Closing - RHEV 3.3 Released

Comment 7 Itamar Heim 2014-01-21 22:28:24 UTC
Closing - RHEV 3.3 Released

Comment 8 Itamar Heim 2014-01-21 22:31:15 UTC
Closing - RHEV 3.3 Released