Bug 1026811 - [engine] vm appears in 2 different rhev-h hosts (split brain)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified OS: Unspecified
Priority: urgent Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Roy Golan
QA Contact: Pavel Novotny
Whiteboard: virt
Keywords: ZStream
Depends On:
Blocks: 1029035 3.3snap3
 
Reported: 2013-11-05 08:33 EST by Eyal Edri
Modified: 2014-01-21 17:22 EST
CC List: 9 users

See Also:
Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1029035
Environment:
Last Closed: 2014-01-21 17:14:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine.log from the live system (3.04 MB, text/x-log)
2013-11-05 08:33 EST, Eyal Edri


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 21049 None None None Never
oVirt gerrit 21143 None None None Never
oVirt gerrit 21169 None None None Never

Description Eyal Edri 2013-11-05 08:33:56 EST
Created attachment 819741 [details]
engine.log from the live system

Description of problem:





Version-Release number of selected component (if applicable):
engine: 3.2.4 rpm: 3.2.4-0.44.el6ev.noarch


How reproducible:
 
The problem was seen when a host (RHEV-H) went into non-responsive mode.
A VM that was running on it stopped responding, and SPICE could not access that VM's console.

While trying to stop/reboot the problematic host, the VM appeared on 2 different hosts (both in the UI and on the virsh command line).

Eventually this caused the filesystem on that VM to become corrupted, which resulted in reinstalling that VM.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
vm should only run on one host at a time.

Additional info:
Comment 1 Roy Golan 2013-11-06 08:44:10 EST
The reason for this split brain is 2 fence flows running in parallel - the 1st is manual intervention and the 2nd is VdsNonRespondingTreatment.

manual intervention correlation id      : 3badd494 aka MANUAL
non-responding treatment correlation id : 6d6643f2 aka AUTO

MANUAL sees the host is OFF before AUTO does, so it sets all the VMs DOWN and restarts them.

VMs go up...

AUTO's 2nd attempt to STOP the host, after seeing it ON on the 1st attempt, sets the VMs DOWN in the db (not killing them, because it assumes the host is down).

AUTO starts the VMs again - now we have 2 copies of them.
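The race above can be modeled with a minimal Python sketch (purely illustrative - the engine itself is Java, and the `Engine` class and restart-host names here are made up): each flow independently concludes the fenced host is off, marks the VM Down in the DB, and starts it again, so two unsynchronized flows leave two running copies.

```python
# Hypothetical model of the two fence flows (MANUAL and AUTO) from the log.
# Neither flow checks whether the other has already handled the VM.
class Engine:
    def __init__(self):
        self.vm_status = "Up"
        self.vm_copies = []      # hosts on which the VM is actually running

    def fence_flow(self, flow, restart_host):
        # The flow sees the fenced host's power status as "off" ...
        # ... so it marks the VM Down in the DB ...
        self.vm_status = "Down"
        # ... and restarts the VM on another host, without re-checking
        # whether a parallel flow already did so.
        self.vm_copies.append((flow, restart_host))
        self.vm_status = "Up"

engine = Engine()
engine.fence_flow("MANUAL", "hostA")  # MANUAL sees the host off first, restarts the VM
engine.fence_flow("AUTO", "hostB")    # AUTO's 2nd status attempt also sees off, restarts again
print(len(engine.vm_copies))          # 2 running copies of the same VM: split brain
```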

log breakdown:

# the host is being stopped by AUTO
2013-11-05 12:00:01,539 INFO  [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-39) [6d6643f2] Running command: StopVdsCommand internal: true. Entities affected :  ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4
c Type: VDS

# host being restarted by MANUAL
2013-11-05 12:04:21,602 INFO  [org.ovirt.engine.core.bll.RestartVdsCommand] (pool-4-thread-40) [3badd494] Running command: RestartVdsCommand internal: false. Entities affected :  ID: 493a2d38-d653-11e1-9cfb-78e7d1e48c4c Type: VDS

# fence stop action by AUTO
2013-11-05 12:05:00,392 INFO  [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-39) [6d6643f2] Using Host buri01 from CLUSTER as proxy to execute Stop command on Host buri02

# AUTO sees the host still on
2013-11-05 12:05:07,575 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-39) [6d6643f2] FINISH, FenceVdsVDSCommand, return: Test Succeeded, on, log id: 491784ea

# a fraction of a second later, MANUAL sees the host is down
2013-11-05 12:05:07,833 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-40) [3badd494] FINISH, FenceVdsVDSCommand, return: Test Succeeded, off, log id: 7ec4b382

# MANUAL puts the VM status DOWN
2013-11-05 12:05:08,253 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-40) [3badd494] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 36d1459a

# MANUAL starts VM on other host
2013-11-05 12:05:10,333 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-49) [3badd494] Running command: RunVmCommand internal: true. Entities affected :  ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM

# VM is powering up
2013-11-05 12:05:14,656 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-16) [2e909e29] VM gerrit 6b118364-812a-458c-88dc-e90a34d44817 moved from WaitForLaunch --> PoweringUp

# AUTO's 2nd attempt to get the host status, to see if it's down already
2013-11-05 12:05:17,575 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] Attempt 2 to get vds buri02 status

# AUTO sees it's off
2013-11-05 12:05:18,416 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (pool-4-thread-39) [6d6643f2] vds buri02 status is off

# AUTO sets the VM down in DB
2013-11-05 12:05:18,618 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-39) [6d6643f2] START, SetVmStatusVDSCommand( vmId = 6b118364-812a-458c-88dc-e90a34d44817, status = Down), log id: 1940d4d1

# AUTO starts the VM on a 3rd host
2013-11-05 12:05:20,127 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-35) [6d6643f2] Running command: RunVmCommand internal: true. Entities affected :  ID: 6b118364-812a-458c-88dc-e90a34d44817 Type: VM
Comment 3 Pavel Novotny 2013-11-26 12:17:21 EST
Verified in rhevm-3.3.0-0.35.beta1.el6ev.noarch (is24).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems by blocking the VDSM port 54321 on the host.
  - Now RHEVM tries to connect to the host, and after a while it initiates a reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, try to reboot the host manually in Webadmin via Power Management -> Restart.

Results:
Error dialog: "Another power management action, restart, is already in progress."

Subsequent manual reboot attempts will then return error dialog:
"Another Power Management operation is still running, please retry in 128 Sec.
 Cannot restart Host. Fence operation failed."

No strange behavior was seen on the affected VM. The VM can be run again and is fully operational.
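The verified behavior is consistent with serializing power management actions per host, so a second fence flow is rejected while one is running. A minimal Python sketch of that idea (hypothetical - the actual fix is in the Java engine, and its locking details are in the linked gerrit patches):

```python
import threading

class FenceCoordinator:
    """Sketch: allow only one power management flow per host at a time."""
    def __init__(self):
        self._guard = threading.Lock()   # protects the per-host lock table
        self._locks = {}

    def try_start(self, host_id):
        # Returns True if this flow may fence the host, False if another
        # power management action on it is already in progress.
        with self._guard:
            lock = self._locks.setdefault(host_id, threading.Lock())
        return lock.acquire(blocking=False)

    def finish(self, host_id):
        self._locks[host_id].release()

coord = FenceCoordinator()
auto = coord.try_start("buri02")    # the non-responding treatment starts first
manual = coord.try_start("buri02")  # a manual restart is rejected while it runs
print(auto, manual)                 # True False
coord.finish("buri02")
print(coord.try_start("buri02"))    # True: manual fencing works once AUTO is done
```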
Comment 4 Itamar Heim 2014-01-21 17:14:29 EST
Closing - RHEV 3.3 Released
Comment 5 Itamar Heim 2014-01-21 17:22:09 EST
Closing - RHEV 3.3 Released
