Bug 969121 - [engine-backend] host stuck in non-operational and SDs remain active while Data center is Non-responsive
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Hardware: x86_64 Unspecified
Priority: unspecified, Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Nobody's working on this, feel free to take it
Depends On:
Reported: 2013-05-30 13:17 EDT by Elad
Modified: 2016-02-10 12:08 EST

See Also:
Fixed In Version: is2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
logs (10.21 MB, application/x-gzip)
2013-05-30 13:17 EDT, Elad
logs (804.69 KB, application/x-gzip)
2013-07-08 03:34 EDT, Elad

External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 14767 None None None Never

Description Elad 2013-05-30 13:17:34 EDT
Created attachment 754989 [details]

Description of problem:
After an interruption during reconstruct, the SPM tries to connect to the pool and fails. After that, the host is stuck in non-operational and the storage domains remain active.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce (with 1 host and 2 SDs from different storage servers):
1. Move the master domain to maintenance and, while that is in progress, stop vdsmd.
2. Reconstruct fails.
3. Start vdsmd.

Actual results:
1) The host becomes non-operational and stays stuck in that state.
2) The 2 SDs remain active even though there are no active hosts in the setup.

Expected results:
1) The host should not get stuck in non-operational.
2) The storage domains should become inactive.
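
The expected behaviour can be sketched as a small invariant check. This is only an illustrative model, not ovirt-engine code: the names `HostStatus`, `DomainStatus`, `StorageDomain` and `reconcile_domains` are hypothetical and were made up for this example. The assumption is simply that no storage domain should be reported as active when no host in the setup is up.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    UP = "up"
    NON_OPERATIONAL = "non_operational"


class DomainStatus(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


@dataclass
class StorageDomain:
    name: str
    status: DomainStatus


def reconcile_domains(host_statuses, domains):
    """Hypothetical helper: return the domains unchanged if any host is UP,
    otherwise return copies of them marked INACTIVE."""
    if any(status is HostStatus.UP for status in host_statuses):
        return domains
    return [StorageDomain(d.name, DomainStatus.INACTIVE) for d in domains]
```

With one non-operational host and two active domains, as in this report, the sketch would return both domains as inactive, which is the state the expected results describe.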

Additional info: logs
Comment 1 Elad 2013-06-02 02:27:27 EDT
CORRECTION to the reproduction steps: this happened to me with 2 storage domains from the same server.
Comment 2 Liron Aravot 2013-07-07 13:09:19 EDT
Elad,
1. The vdsm and engine logs do not match: the engine logs run until 27/5 while the vdsm logs start on 29/5. Please try to reproduce and attach the correct logs, covering a shorter timeframe if possible. Thanks.

2. Please point to the time in the logs at which the scenario you referred to happened; I didn't see it in the logs. I think the best option is to reproduce the issue.

Regardless, there were two possibly related issues that remind me of the issue you described:
1. This bug (from the text) - Domain statuses aren't changed. (Recent)

2. When deactivating a domain, its saved status in the compensation is set to UNKNOWN instead of Active (compensation doesn't appear in the engine log).

#2 was merged, while #1 wasn't.
Comment 3 Elad 2013-07-08 03:34:52 EDT
Created attachment 770314 [details]

Managed to reproduce on 3.2: rhevm-3.2.1-0.39.el6ev.noarch

Attached engine.log and vdsm.log.
Comment 4 Liron Aravot 2013-07-08 04:27:58 EDT
The issue in the new log should be solved by #2 - moving to MODIFIED.
Comment 5 Elad 2013-07-15 08:08:59 EDT
Does not reproduce on RHEVM 3.3 (is5):

After an interruption during reconstruct, the host becomes active and reconstruct ends successfully.
Comment 6 Itamar Heim 2014-01-21 17:28:20 EST
Closing - RHEV 3.3 Released
Comment 7 Itamar Heim 2014-01-21 17:28:24 EST
Closing - RHEV 3.3 Released
Comment 8 Itamar Heim 2014-01-21 17:31:15 EST
Closing - RHEV 3.3 Released
