Bug 1090799

Summary: [RFE] engine networking went down, 90% of hosts were fenced causing a massive outage
Product: [Retired] oVirt Reporter: Oved Ourfali <oourfali>
Component: ovirt-engine-core Assignee: Martin Perina <mperina>
Status: CLOSED CURRENTRELEASE QA Contact: sefi litmanovich <slitmano>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.5 CC: bugs, gklein, iheim, mperina, rbalakri, s.kieske, yeylon
Target Milestone: --- Keywords: FutureFeature
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: ovirt-3.5.0_rc1.1 Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:22:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1084611    

Description Oved Ourfali 2014-04-24 08:04:57 UTC
The switch that connects their engine to the hosts had hardware issues and has since been replaced. While it was failing, the NIC on the engine flapped between up and down, so the engine believed it had lost all connection to the hosts. The hosts went into Non-Responsive status, as did the Data Center. As a result, the engine sent fence commands to a majority of the hosts. This ultimately caused an outage of "~90% of the virtual environment", as we understand it.

The storage is connected via fibre, so the switch should not have directly caused any issues there.

How reproducible:
Unknown how frequently

Steps to Reproduce:
1. Cause the switch that connects the engine to the hosts to flap up/down
2. Make sure power management for the hosts is configured
3. Watch for the hosts to be set to Non-Responsive
4. Observe if the hosts are fenced
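
To illustrate the kind of safeguard this RFE is asking for, here is a minimal sketch in Python. It is not the engine's actual code or the fix that was shipped. The `Host` class, the `hosts_to_fence` function, and the `max_unresponsive_ratio` threshold are all hypothetical names chosen for this example. The assumption is that when a large share of hosts look Non-Responsive at the same moment, the engine's own connectivity is the more likely culprit, so nothing should be fenced.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical model of a host as seen by the engine.
    name: str
    responsive: bool
    power_management: bool


def hosts_to_fence(hosts, max_unresponsive_ratio=0.5):
    """Return the names of hosts that may be fenced.

    Hypothetical policy: if more than max_unresponsive_ratio of all hosts
    appear Non-Responsive at once, assume the problem is on the engine side
    (for example a flapping NIC or switch) and fence nothing.
    """
    if not hosts:
        return []
    unresponsive = [h for h in hosts if not h.responsive]
    if len(unresponsive) / len(hosts) > max_unresponsive_ratio:
        # Too many hosts look down at the same time: skip fencing entirely.
        return []
    # Only hosts with power management configured can be fenced at all.
    return [h.name for h in unresponsive if h.power_management]
```

With ten hosts of which nine look Non-Responsive, as in this report, the function returns an empty list. With a single Non-Responsive host that has power management configured, it returns only that host.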

Comment 1 Sandro Bonazzola 2014-10-17 12:22:19 UTC
oVirt 3.5 has been released and should include the fix for this issue.