Red Hat Bugzilla – Bug 1026811
[engine] vm appears in 2 different rhev-h hosts (split brain)
Last modified: 2014-01-21 17:22:09 EST
Created attachment 819741 [details]
engine.log from the live system
Description of problem:
Version-Release number of selected component (if applicable):
engine: 3.2.4 rpm: 3.2.4-0.44.el6ev.noarch
The problem was seen when a host (RHEV-H) went into non-responsive mode.
A VM that was running on it stopped responding, and SPICE could not access that VM's console.
While trying to stop/reboot the problematic host, the VM appeared on 2 different hosts (in the UI and on the virsh command line).
Eventually this caused the FS on that VM to become corrupted, which required reinstalling that VM.
Steps to Reproduce:
A VM should only run on one host at a time.
The reason for this split brain is two fence flows running in parallel: the 1st is manual intervention and the 2nd is VdsNonRespondingTreatment.
Manual intervention correlation ID: 3badd494 (aka MANUAL)
Non-responding treatment correlation ID: 6d6643f2 (aka AUTO)
MANUAL sees that the host is OFF before AUTO does, so it sets all the VMs to DOWN and restarts them.
The VMs go up...
On its 2nd attempt to STOP the host (after seeing it ON on the 1st attempt), AUTO sets the VMs to DOWN in the DB (not killing them, because it assumes the host is down).
AUTO then starts the VMs again - now we have 2 copies of them.
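The race above is a classic check-then-act problem: each fence flow independently polls the host's power status and, on seeing OFF, marks the VMs Down in the DB and restarts them, with nothing serializing the two flows. A minimal Java sketch of that unsynchronized logic (names like `fenceFlow` and `PowerState` are illustrative only, not actual engine code):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBrainRace {
    enum PowerState { ON, OFF }

    // Each flow independently: observe power status -> if OFF, assume the
    // host is dead, mark its VMs Down, and start them on another host.
    static void fenceFlow(String flowId, PowerState observed, List<String> vmStarts) {
        if (observed == PowerState.OFF) {
            // setVmStatus(vm, Down) happens here without killing the VM,
            // then RunVmCommand starts a second copy elsewhere.
            vmStarts.add(flowId);
        }
    }

    public static void main(String[] args) {
        List<String> vmStarts = new ArrayList<>();
        // MANUAL sees OFF at 12:05:07,833 and restarts the VM.
        fenceFlow("MANUAL/3badd494", PowerState.OFF, vmStarts);
        // AUTO's 2nd status poll at 12:05:18 also sees OFF and, with no
        // shared lock, starts the same VM again on a 3rd host.
        fenceFlow("AUTO/6d6643f2", PowerState.OFF, vmStarts);
        System.out.println(vmStarts.size()); // two live instances -> split brain
    }
}
```

With a shared per-host lock, the second flow would have been rejected instead of reaching the start step.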
# The host is being stopped by AUTO
2013-11-05 12:00:01,539 INFO [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-39) [6d6643f2] Running command: StopVdsCommand internal: true. Entities affected : ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS
# The host is being restarted by MANUAL
2013-11-05 12:04:21,602 INFO [org.ovirt.engine.core.bll.RestartVdsCommand] (pool-4-thread-40) [3badd494] Running command: RestartVdsCommand internal: false. Entities affected : ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS
# fence stop action by AUTO
2013-11-05 12:05:00,392 INFO [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-39) [6d6643f2] Using Host buri01 from CLUSTER as proxy to execute Stop command on Host buri02
# AUTO sees the host is still on
2013-11-05 12:05:07,575 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-39) [6d6643f2] FINISH, FenceVdsVDSCommand, return: Test Succeeded, on, log id: 491784ea
# 258 ms later, MANUAL sees the host is down
2013-11-05 12:05:07,833 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-40) [3badd494] FINISH, FenceVdsVDSCommand, return: Test Succeeded, off, log id: 7ec4b382
# MANUAL sets the VM status to DOWN
2013-11-05 12:05:08,253 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-40) [3badd494] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 36d1459a
# MANUAL starts the VM on another host
2013-11-05 12:05:10,333 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-49) [3badd494] Running command: RunVmCommand internal: true. Entities affected : ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM
# VM is powering up
2013-11-05 12:05:14,656 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-16) [2e909e29] VM gerrit 6b118364-812a-458c-88dc-e90a34d44817 moved from WaitForLaunch --> PoweringUp
# AUTO's 2nd attempt to get the host status, to check whether it is down yet
2013-11-05 12:05:17,575 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] Attempt 2 to get vds buri02 status
# AUTO sees it is off
2013-11-05 12:05:18,416 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] vds buri02 status is off
# AUTO sets the VM status to DOWN in the DB
2013-11-05 12:05:18,618 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-39) [6d6643f2] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 1940d4d1
# AUTO starts the VM on a 3rd host
2013-11-05 12:05:20,127 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-35) [6d6643f2] Running command: RunVmCommand internal: true. Entities affected : ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM
Verified in rhevm-3.3.0-0.35.beta1.el6ev.noarch (is24).
1. Have a VM running on a host with power management configured.
2. Simulate host problems by blocking the VDSM port (54321) on the host.
- RHEVM now tries to reconnect to the host and, after a while, initiates a reboot via power management.
3. As soon as the host reboot is initiated by RHEVM, also try to reboot the host manually in Webadmin via Power Management -> Restart.
Error dialog: "Another power management action, restart, is already in progress."
Subsequent manual reboot attempts will then return error dialog:
"Another Power Management operation is still running, please retry in 128 Sec.
Cannot restart Host. Fence operation failed."
No strange behavior has been seen on the affected VM. The VM can be run again and is fully operational.
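The error dialogs observed during verification suggest the fix serializes power management actions per host, rejecting a second flow while one is in progress. A hypothetical Java sketch of such a per-host guard, using `ConcurrentHashMap.putIfAbsent` keyed by host and holding the owning flow's correlation ID (an illustration of the idea, not the engine's actual implementation):

```java
import java.util.concurrent.ConcurrentHashMap;

public class FenceGuard {
    // host ID -> correlation ID of the fence flow currently operating on it
    private final ConcurrentHashMap<String, String> inProgress = new ConcurrentHashMap<>();

    // Atomically claim the host for one fence flow; returns false if another
    // flow already holds it ("Another power management action is already in progress").
    public boolean tryStartFence(String hostId, String correlationId) {
        return inProgress.putIfAbsent(hostId, correlationId) == null;
    }

    // Release the host, but only if this flow is the one holding it;
    // ConcurrentMap.remove(key, value) is a no-op for a non-owner.
    public void finishFence(String hostId, String correlationId) {
        inProgress.remove(hostId, correlationId);
    }
}
```

Under this scheme, AUTO (6d6643f2) claims buri02 first, MANUAL (3badd494) is rejected until AUTO finishes, and only one flow ever reaches the restart-VMs step.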
Closing - RHEV 3.3 Released