Bug 1026811 - [engine] vm appears in 2 different rhev-h hosts (split brain)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified OS: Unspecified
Priority: urgent Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Roy Golan
QA Contact: Pavel Novotny
Whiteboard: virt
Keywords: ZStream
Depends On:
Blocks: 1029035 3.3snap3
 
Reported: 2013-11-05 08:33 EST by Eyal Edri
Modified: 2014-01-21 17:22 EST
CC List: 9 users

See Also:
Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1029035
Environment:
Last Closed: 2014-01-21 17:14:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine.log from the live system (3.04 MB, text/x-log)
2013-11-05 08:33 EST, Eyal Edri


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 21049 None None None Never
oVirt gerrit 21143 None None None Never
oVirt gerrit 21169 None None None Never

Description Eyal Edri 2013-11-05 08:33:56 EST
Created attachment 819741 [details]
engine.log from the live system

Description of problem:





Version-Release number of selected component (if applicable):
engine: 3.2.4 rpm: 3.2.4-0.44.el6ev.noarch


How reproducible:
 
The problem was seen when a host (RHEV-H) went into non-responsive mode.
A VM that was running on it stopped responding, and SPICE could not access that VM's console.

While trying to stop/reboot the problematic host, the VM appeared on 2 different hosts (both in the UI and on the virsh command line).

Eventually this caused the filesystem on that VM to become corrupted, which resulted in reinstalling that VM.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
vm should only run on one host at a time.

Additional info:
Comment 1 Roy Golan 2013-11-06 08:44:10 EST
The reason for this split brain is 2 fence flows running in parallel - the 1st is manual intervention and the 2nd is VdsNonRespondingTreatment.

manual intervention correlation id      : 3badd494 aka MANUAL
non-responding treatment correlation id : 6d6643f2 aka AUTO

MANUAL sees the host is OFF before AUTO does, so it sets all the VMs DOWN and restarts them.

VMs go up...

AUTO's 2nd attempt to STOP the host, after seeing it ON on the 1st attempt, sets the VMs DOWN in the db (not killing them, because it assumes the host is down).

AUTO starts the VMs again - now we have 2 copies of them.
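The race above can be modeled with a minimal Python sketch (purely illustrative - the engine itself is Java, and the `Engine` class and restart-host names here are made up): each flow independently concludes the fenced host is off, marks the VM Down in the DB, and starts it again, so two unsynchronized flows leave two running copies.

```python
# Hypothetical model of the two fence flows (MANUAL and AUTO) from the log.
# Neither flow checks whether the other has already handled the VM.
class Engine:
    def __init__(self):
        self.vm_status = "Up"
        self.vm_copies = []      # hosts on which the VM is actually running

    def fence_flow(self, flow, restart_host):
        # The flow sees the fenced host's power status as "off" ...
        # ... so it marks the VM Down in the DB ...
        self.vm_status = "Down"
        # ... and restarts the VM on another host, without re-checking
        # whether a parallel flow already did so.
        self.vm_copies.append((flow, restart_host))
        self.vm_status = "Up"

engine = Engine()
engine.fence_flow("MANUAL", "hostA")  # MANUAL sees the host off first, restarts the VM
engine.fence_flow("AUTO", "hostB")    # AUTO's 2nd status attempt also sees off, restarts again
print(len(engine.vm_copies))          # 2 running copies of the same VM: split brain
```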

log breakdown:

# the host is being stopped by AUTO
2013-11-05 12:00:01,539 INFO  [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-39) [6d6643f2] Running command: StopVdsCommand internal: true. Entities affected :  ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4
c Type: VDS

# host being restarted by MANUAL
2013-11-05 12:04:21,602 INFO  [org.ovirt.engine.core.bll.RestartVdsCommand] (pool-4-thread-40) [3badd494] Running command: RestartVdsCommand internal: false. Entities affected :  ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS

# fence stop action by AUTO
2013-11-05 12:05:00,392 INFO  [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-39) [6d6643f2] Using Host buri01 from CLUSTER as proxy to execute Stop command on Host buri02

# AUTO sees the host still on
2013-11-05 12:05:07,575 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-39) [6d6643f2] FINISH, FenceVdsVDSCommand, return: Test Succeeded, on, log id: 491784ea

# a fraction of a second later, MANUAL sees the host is down
2013-11-05 12:05:07,833 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-40) [3badd494] FINISH, FenceVdsVDSCommand, return: Test Succeeded, off, log id: 7ec4b382

# MANUAL puts the VM status DOWN
2013-11-05 12:05:08,253 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-40) [3badd494] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 36d1459a

# MANUAL starts VM on other host
2013-11-05 12:05:10,333 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-49) [3badd494] Running command: RunVmCommand internal: true. Entities affected :  ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM

# VM is powering up
2013-11-05 12:05:14,656 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-16) [2e909e29] VM gerrit 6b118364-812a-458c-88dc-e90a34d44817 moved from WaitForLaunch --> PoweringUp

# AUTO's 2nd attempt to get the host status, to see if it's down already
2013-11-05 12:05:17,575 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] Attempt 2 to get vds buri02 status

# AUTO sees it's off
2013-11-05 12:05:18,416 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] vds buri02 status is off

# AUTO sets the VM down in DB
2013-11-05 12:05:18,618 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-39) [6d6643f2] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 1940d4d1

# AUTO starts the VM on a 3rd host
2013-11-05 12:05:20,127 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-35) [6d6643f2] Running command: RunVmCommand internal: true. Entities affected :  ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM
Comment 3 Pavel Novotny 2013-11-26 12:17:21 EST
Verified in rhevm-3.3.0-0.35.beta1.el6ev.noarch (is24).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems by blocking the VDSM port 54321 on the host.
  - Now RHEVM tries to connect to the host, and after a while it initiates a reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, try to reboot the host manually in Webadmin via Power Management -> Restart.

Results:
Error dialog: "Another power management action, restart, is already in progress."

Subsequent manual reboot attempts will then return error dialog:
"Another Power Management operation is still running, please retry in 128 Sec.
 Cannot restart Host. Fence operation failed."

No strange behavior was seen on the affected VM. The VM can be run again and is fully operational.
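The verified behavior is consistent with serializing power management actions per host, so a second fence flow is rejected while one is running. A minimal Python sketch of that idea (hypothetical - the actual fix is in the Java engine, and its locking details are in the linked gerrit patches):

```python
import threading

class FenceCoordinator:
    """Sketch: allow only one power management flow per host at a time."""
    def __init__(self):
        self._guard = threading.Lock()   # protects the per-host lock table
        self._locks = {}

    def try_start(self, host_id):
        # Returns True if this flow may fence the host, False if another
        # power management action on it is already in progress.
        with self._guard:
            lock = self._locks.setdefault(host_id, threading.Lock())
        return lock.acquire(blocking=False)

    def finish(self, host_id):
        self._locks[host_id].release()

coord = FenceCoordinator()
auto = coord.try_start("buri02")    # the non-responding treatment starts first
manual = coord.try_start("buri02")  # a manual restart is rejected while it runs
print(auto, manual)                 # True False
coord.finish("buri02")
print(coord.try_start("buri02"))    # True: manual fencing works once AUTO is done
```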
Comment 4 Itamar Heim 2014-01-21 17:14:29 EST
Closing - RHEV 3.3 Released
Comment 5 Itamar Heim 2014-01-21 17:22:09 EST
Closing - RHEV 3.3 Released
