Bug 1090799 - [RFE] engine networking went down, 90% of hosts were fenced causing a massive outage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.5.0
Assignee: Martin Perina
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks: 1084611
Reported: 2014-04-24 08:04 UTC by Oved Ourfali
Modified: 2016-02-10 19:32 UTC (History)

Fixed In Version: ovirt-3.5.0_rc1.1
Clone Of:
Environment:
Last Closed: 2014-10-17 12:22:19 UTC
oVirt Team: Infra
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 29482 0 master ABANDONED [WIP]core: add option not to fence if SD is active Never
oVirt gerrit 30193 0 master MERGED core: Add config value to enable to skip fencing if SD is active Never
oVirt gerrit 30194 0 master MERGED core: Introduce FencingPolicy Never
oVirt gerrit 30195 0 master MERGED webadmin: Add FencingPolicy to cluster configuration Never
oVirt gerrit 30582 0 master MERGED core: Refactor FenceVdsActionParameters Never
oVirt gerrit 30583 0 master MERGED core: Introduce createFenceExecutor method Never
oVirt gerrit 30584 0 master MERGED core: Add fencing policy as parameter to fenceNode VDSM verb Never
oVirt gerrit 30762 0 master MERGED fencing: Skip fencing if host is maintaining its lease Never
oVirt gerrit 30937 0 master MERGED core: Include VdsSpmIdMapDAODbFacadeImpl.get(vdsId) in VdsSpmIdMapDAO Never
oVirt gerrit 31087 0 master MERGED core: Delay fencing until host storage lease renewal interval passed Never
oVirt gerrit 31227 0 master MERGED core: Fix displaying result of VdsNotRespondingTreatment Never
oVirt gerrit 31230 0 ovirt-3.5 MERGED fencing: Skip fencing if host is maintaining its lease Never
oVirt gerrit 31231 0 ovirt-engine-3.5 MERGED core: Add config value to enable to skip fencing if SD is active Never
oVirt gerrit 31232 0 ovirt-engine-3.5 MERGED core: Introduce FencingPolicy Never
oVirt gerrit 31233 0 ovirt-engine-3.5 MERGED webadmin: Add FencingPolicy to cluster configuration Never
oVirt gerrit 31234 0 ovirt-engine-3.5 MERGED core: Refactor FenceVdsActionParameters Never
oVirt gerrit 31235 0 ovirt-engine-3.5 MERGED core: Introduce createFenceExecutor method Never
oVirt gerrit 31236 0 ovirt-engine-3.5 MERGED core: Include VdsSpmIdMapDAODbFacadeImpl.get(vdsId) in VdsSpmIdMapDAO Never
oVirt gerrit 31237 0 ovirt-engine-3.5 MERGED core: Add fencing policy as parameter to fenceNode VDSM verb Never
oVirt gerrit 31238 0 ovirt-engine-3.5 MERGED core: Delay fencing until host storage lease renewal interval passed Never
oVirt gerrit 31239 0 ovirt-engine-3.5 MERGED core: Fix displaying result of VdsNotRespondingTreatment Never
oVirt gerrit 31641 0 master MERGED webadmin: Fix Fencing Policy look Never
oVirt gerrit 31655 0 ovirt-engine-3.5 MERGED webadmin: Fix Fencing Policy look Never

Description Oved Ourfali 2014-04-24 08:04:57 UTC
The switch that connects the engine to the hosts had hardware issues. The switch has since been replaced, but while it was failing the engine's NIC flapped between up and down, and the engine believed it had lost all connection to the hosts. The hosts, and the Data Center, went into a Non-Responsive state. As a result, the engine sent fence commands to a majority of the hosts. As we understand it, this ultimately caused an outage of "~90% of the virtual environment".

The storage is connected via fibre, so the faulty switch should not have affected storage connectivity directly.

How reproducible:
Unknown how frequently

Steps to Reproduce:
1. Cause the switch that connects the engine to the hosts to flap up/down
2. Make sure power management for the hosts is configured
3. Watch for the hosts to be set to Non-Responsive
4. Observe if the hosts are fenced

Comment 1 Sandro Bonazzola 2014-10-17 12:22:19 UTC
oVirt 3.5 has been released and should include the fix for this issue.
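
For readers who only see the patch titles under Links ("Skip fencing if host is maintaining its lease", "Delay fencing until host storage lease renewal interval passed"), the decision they describe can be sketched roughly as below. This is an illustration only, not the engine code: the class names, field names, and the 60-second interval are assumptions made for the example, and the real FencingPolicy in ovirt-engine may look quite different.

```python
from dataclasses import dataclass


@dataclass
class FencingPolicy:
    # Hypothetical fields; names and defaults are assumptions for this sketch.
    skip_if_sd_active: bool = True
    lease_renewal_interval_s: float = 60.0  # assumed value, not taken from the bug


@dataclass
class HostState:
    # Hypothetical view of a host as seen by the engine.
    name: str
    seconds_non_responsive: float
    seconds_since_lease_renewal: float


def should_fence(host: HostState, policy: FencingPolicy) -> bool:
    """Return True if a non-responsive host should be fenced."""
    if not policy.skip_if_sd_active:
        # Policy disabled: behave as before and fence on non-responsiveness.
        return True
    if host.seconds_non_responsive < policy.lease_renewal_interval_s:
        # Too early to judge: the host has not had a full renewal
        # interval in which to prove it is still alive on storage.
        return False
    if host.seconds_since_lease_renewal < policy.lease_renewal_interval_s:
        # The host is still renewing its storage lease, so it is alive
        # and only the engine-to-host network is broken. Do not fence.
        return False
    # No network contact and no lease renewal: treat the host as dead.
    return True
```

In the scenario from the description, the hosts kept access to storage over fibre while the engine's network flapped, so under a policy like this they would have kept renewing their leases and would not have been fenced.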

