Created attachment 819741 [details]
engine.log from the live system

Description of problem:

Version-Release number of selected component (if applicable):
engine: 3.2.4
rpm: 3.2.4-0.44.el6ev.noarch

How reproducible:
The problem was seen when a host (RHEV-H) went into non-responsive mode. A VM that was running on it stopped responding, and SPICE access to that VM's console did not work. While the problematic host was being stopped/rebooted, the VM appeared on two different hosts (both in the UI and on the virsh command line). Eventually this corrupted the VM's filesystem, which required reinstalling the VM.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
A VM should only run on one host at a time.

Additional info:
The reason for this split brain is two fence flows running in parallel: the first is a manual intervention and the second is VdsNonRespondingTreatment.

manual intervention correlation id: 3badd494, aka MANUAL
non-responding treatment correlation id: 6d6643f2, aka AUTO

MANUAL sees the host is OFF before AUTO does, so it sets all the VMs to Down and restarts them. The VMs go up.
AUTO's second attempt to stop the host, after seeing it ON on the first attempt, sets the VMs to Down in the DB (without killing them, because it assumes the host is down).
AUTO then starts the VMs - now we have two copies of them.

Log breakdown:

# the host is being stopped by AUTO
2013-11-05 12:00:01,539 INFO [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-39) [6d6643f2] Running command: StopVdsCommand internal: true. Entities affected : ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS

# host being restarted by MANUAL
2013-11-05 12:04:21,602 INFO [org.ovirt.engine.core.bll.RestartVdsCommand] (pool-4-thread-40) [3badd494] Running command: RestartVdsCommand internal: false. Entities affected : ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS

# fence stop action by AUTO
2013-11-05 12:05:00,392 INFO [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-39) [6d6643f2] Using Host buri01 from CLUSTER as proxy to execute Stop command on Host buri02

# AUTO sees the host is still on
2013-11-05 12:05:07,575 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-39) [6d6643f2] FINISH, FenceVdsVDSCommand, return: Test Succeeded, on, log id: 491784ea

# 258 ms later, MANUAL sees the host is down
2013-11-05 12:05:07,833 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-40) [3badd494] FINISH, FenceVdsVDSCommand, return: Test Succeeded, off, log id: 7ec4b382

# MANUAL sets the VM status to Down
2013-11-05 12:05:08,253 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-40) [3badd494] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 36d1459a

# MANUAL starts the VM on another host
2013-11-05 12:05:10,333 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-49) [3badd494] Running command: RunVmCommand internal: true. Entities affected : ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM

# VM is powering up
2013-11-05 12:05:14,656 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-16) [2e909e29] VM gerrit 6b118364-812a-458c-88dc-e90a34d44817 moved from WaitForLaunch --> PoweringUp

# AUTO's second attempt to get the host status, to see if it is down yet
2013-11-05 12:05:17,575 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] Attempt 2 to get vds buri02 status

# AUTO sees it is off
2013-11-05 12:05:18,416 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] vds buri02 status is off

# AUTO sets the VM Down in the DB
2013-11-05 12:05:18,618 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-39) [6d6643f2] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 1940d4d1

# AUTO starts the VM on a third host
2013-11-05 12:05:20,127 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-35) [6d6643f2] Running command: RunVmCommand internal: true. Entities affected : ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM
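For illustration only, here is a minimal Java sketch of the kind of per-host mutual exclusion that would prevent MANUAL and AUTO from fencing the same host at the same time. All class and method names are hypothetical and do not reflect the engine's actual implementation; the point is that the second flow is rejected rather than racing the first, which matches the behavior verified in the next comment.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: serialize fence flows per host so a manual restart and the
// automatic non-responding treatment cannot both mark the host's VMs Down and
// restart them elsewhere.
public class FenceLockDemo {

    // One lock per host id; names here are illustrative, not the engine's API.
    private static final ConcurrentHashMap<String, ReentrantLock> HOST_LOCKS =
            new ConcurrentHashMap<>();

    static boolean tryFence(String hostId, String flow) {
        ReentrantLock lock = HOST_LOCKS.computeIfAbsent(hostId, id -> new ReentrantLock());
        if (!lock.tryLock()) {
            // The second flow is refused instead of racing the first one.
            System.out.println(flow + ": another power management action is already in progress");
            return false;
        }
        try {
            System.out.println(flow + ": fencing host " + hostId
                    + ", marking its VMs Down and restarting them elsewhere");
            Thread.sleep(100); // stand-in for the stop / status / start sequence
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String host = "493a2d38-d653-11e1-9cfb-78e7d1e48c4c";
        Thread auto = new Thread(() -> tryFence(host, "AUTO (VdsNonRespondingTreatment)"));
        Thread manual = new Thread(() -> tryFence(host, "MANUAL (RestartVdsCommand)"));
        auto.start();
        manual.start();
        auto.join();
        manual.join();
    }
}

Running it, whichever flow acquires the lock first performs the fence; the other prints the "already in progress" message, analogous to the error dialog reported in the verification below.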
Verified in rhevm-3.3.0-0.35.beta1.el6ev.noarch (is24).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems by blocking the VDSM port 54321 on the host.
   - RHEVM now tries to connect to the host and, after a while, initiates a reboot via power management.
3. As soon as the host reboot is initiated by RHEVM, try to reboot the host manually in Webadmin via Power Management -> Restart.

Results:
Error dialog: "Another power management action, restart, is already in progress."
Subsequent manual reboot attempts then return the error dialog: "Another Power Management operation is still running, please retry in 128 Sec. Cannot restart Host. Fence operation failed."
No strange behavior was seen on the affected VM. The VM can be run again and is fully operational.
Closing - RHEV 3.3 Released