Bug 1026811

Summary: [engine] vm appears in 2 different rhev-h hosts (split brain)
Product: Red Hat Enterprise Virtualization Manager Reporter: Eyal Edri <eedri>
Component: ovirt-engine Assignee: Roy Golan <rgolan>
Status: CLOSED CURRENTRELEASE QA Contact: Pavel Novotny <pnovotny>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.2.0 CC: acathrow, iheim, lpeer, lyarwood, mavital, michal.skrivanek, mkalinin, Rhev-m-bugs, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: is24 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1029035 (view as bug list) Environment:
Last Closed: 2014-01-21 22:14:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1029035, 1038284    
Attachments:
Description Flags
engine.log from the live system none

Description Eyal Edri 2013-11-05 13:33:56 UTC
Created attachment 819741 [details]
engine.log from the live system

Description of problem:





Version-Release number of selected component (if applicable):
engine: 3.2.4 rpm: 3.2.4-0.44.el6ev.noarch


How reproducible:
 
The problem was seen when a host (RHEV-H) went into non-responding mode.
A VM that was running on it stopped responding, and SPICE could not be used to access that VM's console.

While trying to stop/reboot the problematic host, the VM appeared on 2 different hosts (both in the UI and on the virsh command line).

Eventually this caused the FS on that VM to become corrupted, which resulted in reinstalling that VM.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
A VM should only run on one host at a time.

Additional info:

Comment 1 Roy Golan 2013-11-06 13:44:10 UTC
The reason for this split brain is 2 fence flows running in parallel: the 1st is manual intervention and the 2nd is VdsNonRespondingTreatment.

manual intervention correlation id      : 3badd494 aka MANUAL
non-responding treatment correlation id : 6d6643f2 aka AUTO

MANUAL sees that the host is OFF before AUTO does, so it sets all VMs to DOWN and restarts them.

VMs go up...

On its 2nd status check, after seeing the host ON on the 1st, AUTO finds the host OFF and sets the VMs to DOWN in the DB (it does not kill them, because it assumes the host is down).

AUTO then creates the VMs again, so now we have 2 copies of them.
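
The interleaving described above can be replayed with a small toy model. This is only an illustrative sketch, not engine code: `ToyCluster`, `fence_flow_step`, and the host names `host-a` and `host-b` are hypothetical, and the only behaviour modelled is "a flow that reads the host as off marks the VM Down in the DB and starts it elsewhere, without killing anything".

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model (hypothetical): tracks where copies of one VM are running."""
    running_on: list = field(default_factory=list)
    db_status: str = "Up"

    def set_vm_down_in_db(self):
        # Only the DB record changes; no running VM process is killed.
        self.db_status = "Down"

    def run_vm(self, host):
        # Start a copy of the VM on the given host.
        self.running_on.append(host)
        self.db_status = "Up"


def fence_flow_step(cluster, host_power_state, target_host):
    """One fence flow reacting to a single power status reading."""
    if host_power_state != "off":
        return False  # host still on; the flow will check again later
    cluster.set_vm_down_in_db()
    cluster.run_vm(target_host)
    return True


cluster = ToyCluster()
# Order of status readings as described in this comment:
fence_flow_step(cluster, "on", "host-b")   # AUTO, 1st check: host still on
fence_flow_step(cluster, "off", "host-a")  # MANUAL: host off, VM restarted
fence_flow_step(cluster, "off", "host-b")  # AUTO, 2nd check: host off, VM started again
print(cluster.running_on)  # prints ['host-a', 'host-b']
```

Because neither flow knows about the other, both act on the same "off" state and the VM ends up running twice.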

log breakdown:

# the host is being stopped by AUTO
2013-11-05 12:00:01,539 INFO  [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-39) [6d6643f2] Running command: StopVdsCommand internal: true. Entities affected :  ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS

# host being restarted by MANUAL
2013-11-05 12:04:21,602 INFO  [org.ovirt.engine.core.bll.RestartVdsCommand] (pool-4-thread-40) [3badd494] Running command: RestartVdsCommand internal: false. Entities affected :  ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS

# fence stop action by AUTO
2013-11-05 12:05:00,392 INFO  [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-39) [6d6643f2] Using Host buri01 from CLUSTER as proxy to execute Stop command on Host buri02

# AUTO sees the host still on
2013-11-05 12:05:07,575 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-39) [6d6643f2] FINISH, FenceVdsVDSCommand, return: Test Succeeded, on, log id: 491784ea

# 258 milliseconds later MANUAL sees the host is off
2013-11-05 12:05:07,833 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-40) [3badd494] FINISH, FenceVdsVDSCommand, return: Test Succeeded, off, log id: 7ec4b382

# MANUAL puts the VM status DOWN
2013-11-05 12:05:08,253 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-40) [3badd494] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 36d1459a

# MANUAL starts VM on other host
2013-11-05 12:05:10,333 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-49) [3badd494] Running command: RunVmCommand internal: true. Entities affected :  ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM

# VM is powering up
2013-11-05 12:05:14,656 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-16) [2e909e29] VM gerrit 6b118364-812a-458c-88dc-e90a34d44817 moved from WaitForLaunch --> PoweringUp

# AUTO's 2nd attempt to get the host status, to see if it is down already
2013-11-05 12:05:17,575 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] Attempt 2 to get vds buri02 status

# AUTO sees it's off
2013-11-05 12:05:18,416 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] vds buri02 status is off

# AUTO sets the VM down in DB
2013-11-05 12:05:18,618 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-39) [6d6643f2] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 1940d4d1

# AUTO starts the VM on a 3rd host
2013-11-05 12:05:20,127 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-35) [6d6643f2] Running command: RunVmCommand internal: true. Entities affected :  ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM
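
A breakdown like the one above can be produced by splitting engine.log lines on their correlation id. The following is a minimal sketch that assumes the line layout seen in the excerpts in this comment (timestamp, level, logger, thread, correlation id, message); `LINE_RE`, `FLOWS`, and `parse` are names made up for illustration, and the two sample lines are copied from the log above.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the excerpts above:
# <timestamp> <LEVEL>  [<logger>] (<thread>) [<correlation id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>\w+)\s+\[(?P<logger>[^\]]+)\] "
    r"\((?P<thread>[^)]+)\) \[(?P<corr>[0-9a-f]+)\] (?P<msg>.*)$"
)

# Correlation ids named in this comment.
FLOWS = {"3badd494": "MANUAL", "6d6643f2": "AUTO"}


def parse(line):
    """Return (timestamp, flow label, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return ts, FLOWS.get(m["corr"], m["corr"]), m["msg"]


SAMPLE = [
    "2013-11-05 12:05:07,575 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] "
    "(pool-4-thread-39) [6d6643f2] FINISH, FenceVdsVDSCommand, return: Test Succeeded, on, log id: 491784ea",
    "2013-11-05 12:05:07,833 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] "
    "(pool-4-thread-40) [3badd494] FINISH, FenceVdsVDSCommand, return: Test Succeeded, off, log id: 7ec4b382",
]

events = [parse(line) for line in SAMPLE]
# Whole milliseconds between the two fence status results.
gap_ms = (events[1][0] - events[0][0]) // timedelta(milliseconds=1)
for ts, flow, msg in events:
    print(ts.time(), flow, msg)
print("gap between status results:", gap_ms, "ms")
```

On the two sample lines this labels the first result as AUTO ("on") and the second as MANUAL ("off"), 258 ms apart.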

Comment 3 Pavel Novotny 2013-11-26 17:17:21 UTC
Verified in rhevm-3.3.0-0.35.beta1.el6ev.noarch (is24).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems by blocking the VDSM port 54321 from the host.
  - Now RHEVM tries to connect to the host and after a while it initiates reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, also try to reboot the host manually in Webadmin via Power Management -> Restart.

Results:
Error dialog: "Another power management action, restart, is already in progress."

Subsequent manual reboot attempts will then return error dialog:
"Another Power Management operation is still running, please retry in 128 Sec.
 Cannot restart Host. Fence operation failed."

No strange behavior has been seen on the affected VM. The VM can be run again and is fully operational.
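
The bug does not describe how the fix is implemented. As a rough illustration of the behaviour verified above, a per-host guard that admits only one power management action at a time could look like the sketch below. `PowerManagementGuard`, `try_begin`, and `end` are hypothetical names, not engine APIs, and the retry countdown from the second dialog is not modelled.

```python
import threading


class PowerManagementGuard:
    """Hypothetical guard: at most one power management action in flight per host."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._in_progress = {}  # host id -> name of the running action

    def try_begin(self, host_id, action):
        """Register the action, or reject it if another one is running on this host."""
        with self._mutex:
            running = self._in_progress.get(host_id)
            if running is not None:
                return False, (
                    f"Another power management action, {running}, "
                    "is already in progress."
                )
            self._in_progress[host_id] = action
            return True, ""

    def end(self, host_id):
        """Mark the host as free for the next power management action."""
        with self._mutex:
            self._in_progress.pop(host_id, None)


guard = PowerManagementGuard()
auto_ok, _ = guard.try_begin("buri02", "restart")             # automatic flow gets in first
manual_ok, manual_msg = guard.try_begin("buri02", "restart")  # manual attempt is rejected
guard.end("buri02")                                           # automatic flow finishes
retry_ok, _ = guard.try_begin("buri02", "restart")            # a later attempt is accepted
print(auto_ok, manual_ok, retry_ok)
print(manual_msg)
```

With a guard of this kind the second flow never reaches the point where it would set the VM Down and start it again.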

Comment 4 Itamar Heim 2014-01-21 22:14:29 UTC
Closing - RHEV 3.3 Released

Comment 5 Itamar Heim 2014-01-21 22:22:09 UTC
Closing - RHEV 3.3 Released