Bug 1380295

Summary: RFE: IPaddr2 agent should make efforts to detect configured but inactive interfaces
Product: Red Hat Enterprise Linux 7
Reporter: David Juran <djuran>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.2
CC: abeekhof, agk, cfeist, cluster-maint, djuran, fdinitto, hjensas, jraju, mnovacek, mschuppe, nchandek, nkrishna, oalbrigt, royoung, vaggarwa
Target Milestone: rc
Keywords: FutureFeature, Reopened
Target Release: 7.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-01 14:49:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description David Juran 2016-09-29 08:37:14 UTC
Description of problem:
During HA testing of our OSP-d deployed OpenStack, we found that a VIP does not fail over when the interface it resides on is disconnected.

In our environment, our provisioning interface is not bonded; it uses a single NIC. During HA testing, we disabled the switch port (which I guess is equivalent to pulling the cable) of the provisioning interface on the controller hosting the ctlplane VIP, which is the one the keystone admin endpoint is on.

Sep 28 09:48:26 overcloud-controller-0.localdomain kernel: be2net 0000:11:00.1 enp17s0f1: Link is Down
Sep 28 09:48:26 overcloud-controller-0.localdomain NetworkManager[1344]: <info> (enp17s0f1): link disconnected
Sep 28 09:48:28 overcloud-controller-0.localdomain kernel: be2net 0000:11:00.1 enp17s0f1: Link is Down

And that's pretty much all that happened. The VIP did _not_ fail over and the server did _not_ get fenced. Needless to say, the OverCloud was less than fully operational...



Version-Release number of selected component (if applicable):
OSP8, deployed using OSP-d
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch 
resource-agents-3.9.5-54.el7_2.16

How reproducible:
Every time


Steps to Reproduce:
1. Pull the cable on the admin interface on the controller hosting the ctlplane VIP
2. Try accessing the keystone admin interface


Additional info:
One way around this might be to set up an ethmonitor resource, as described in https://access.redhat.com/solutions/2044713

Comment 1 Udi Shkalim 2016-09-29 08:42:17 UTC
Hi David,

Can you elaborate on the way you "disabled" the interface?

The way the interface is disconnected can affect the HA action.

Comment 2 Fabio Massimo Di Nitto 2016-09-29 08:44:19 UTC
The VIP resource agent does not monitor eth status. This is by design, and that's why there is an ethmonitor agent.

Also please note that we cannot deploy ethmonitor automatically either. Some environments (for instance virt environments) won't notice a cable pull (the host doesn't propagate eth status to the VM attached to a given eth).

This is expected behaviour that can be changed by using the ethmonitor or pingd agent.

Comment 3 Fabio Massimo Di Nitto 2016-09-29 08:45:57 UTC
(In reply to Fabio Massimo Di Nitto from comment #2)
> The VIP resource agent does not monitor eth status. This is by design, and
> that's why there is an ethmonitor agent.
> 
> Also please note that we cannot deploy ethmonitor automatically either. Some
> environments (for instance virt environments) won't notice a cable pull (the
> host doesn't propagate eth status to the VM attached to a given eth).
> 
> This is expected behaviour that can be changed by using the ethmonitor or
> pingd agent.

Forgot to mention that the ethtool / mii-tool status detection is strictly dependent on the kernel driver. If the kernel driver doesn't export link status, the output is moot.
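
To make the driver dependency concrete: on Linux, the kernel exposes whatever link state the driver reports in /sys/class/net/<interface>/carrier. The following is a minimal Python sketch, not code from ethtool, mii-tool or any resource agent; the function name link_carrier and the sysfs_root parameter are made up for illustration. It treats an unreadable or unexpected value as "unknown" rather than "down".

```python
from pathlib import Path
from typing import Optional


def link_carrier(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[bool]:
    """Return True/False for carrier up/down, or None when the state is unknown.

    Hypothetical helper for illustration; sysfs_root is only a parameter so
    the function can be pointed at a fake directory tree.
    """
    carrier = Path(sysfs_root) / iface / "carrier"
    try:
        value = carrier.read_text().strip()
    except OSError:
        # The interface does not exist, or the kernel refused the read
        # (for example because the interface is administratively down).
        return None
    if value == "1":
        return True
    if value == "0":
        return False
    # Anything else is not a state this sketch knows how to interpret.
    return None
```

In a virt environment of the kind described in comment #2, a check like this may keep returning True after the host's cable is pulled, because the guest's virtual NIC never sees the change.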

Comment 4 Andrew Beekhof 2016-09-29 12:23:41 UTC
There may be corner cases we can't handle, but I think it's reasonable to expect that the IPaddr2 agent can handle common situations where an address has been configured but the interface is not available.

Comment 6 David Juran 2016-09-29 16:06:19 UTC
In response to #1: I had help from the network guy, and I believe he did the equivalent of pulling the cable, but inside a blade chassis.

Also, in response to #2 and #3, I agree with Andrew in #4. 
Even if link status is not fool-proof, I don't see that it would do any harm to _try_ to take action based on link-status. 

Further, from a quick read of the ethmonitor agent man-page, it apparently also can react based on whether an arping is successful or not, which would make it independent of whether link status can be detected.
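
The "no harm in trying" argument can be sketched as a decision rule: fail only on a definite negative, and treat an undetectable link status as healthy. This is a hypothetical Python illustration, not the logic of IPaddr2 or ethmonitor. The function monitor_result and its parameters are invented here; only the numeric values are the standard OCF return codes. probe_ok stands in for the outcome of something like an arping.

```python
from typing import Optional

OCF_SUCCESS = 0       # standard OCF return code: resource is running
OCF_ERR_GENERIC = 1   # standard OCF return code: generic failure
OCF_NOT_RUNNING = 7   # standard OCF return code: resource is not running


def monitor_result(address_configured: bool,
                   carrier: Optional[bool],
                   probe_ok: Optional[bool] = None) -> int:
    """Hypothetical monitor policy: only a definite negative fails the resource.

    carrier  -- True/False if link status is known, None if the driver or
                environment cannot report it
    probe_ok -- True/False if a reachability probe was run, None if it was not
    """
    if not address_configured:
        return OCF_NOT_RUNNING
    if carrier is False:
        # The link is known to be down, so the address is unreachable.
        return OCF_ERR_GENERIC
    if probe_ok is False:
        # The probe works independently of whether link status is detectable.
        return OCF_ERR_GENERIC
    # Unknown link status and a skipped probe are both treated as healthy,
    # so the check never does worse than having no check at all.
    return OCF_SUCCESS
```

With this rule, a driver that exports no link status (carrier is None) leaves the behaviour unchanged, while a reported link loss or a failed probe would give the cluster a reason to move the VIP.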

Also, I noticed the BZ was moved from OpenStack to RHEL, so I guess it's now about modifying the IPaddr2 agent to include the functionality of ethmonitor. That would of course be fine, and would save us from complicating the OSP cluster layout even further.