Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1219429

Summary:

F21: dhcp-client-identifier != hardware makes bridge receive a new address and lose connectivity

Product:

[oVirt] vdsm

Reporter:

Sahina Bose <sabose>

Component:

General

Assignee:

Ondřej Svoboda <osvoboda>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Meni Yakove <myakove>

Severity:

high

Docs Contact:

Priority:

medium

Version:

---

CC:

bazulay, bugs, danken, dcaroest, dgeevarg, didi, ecohen, fdeutsch, lsurette, mgoldboi, osvoboda, rbalakri, sabose, sbonazzo, shtripat, yeylon, ylavi

Target Milestone:

ovirt-3.6.0-rc

Flags:

ylavi: ovirt-3.6.0?
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack?

Target Release:

4.17.9

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

network

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-11-18 10:43:51 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Network

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
vdsm	none
supervdsm	none
messages	none

Description Sahina Bose 2015-05-07 10:28:34 UTC

Created attachment 1023038 [details]
vdsm

Description of problem:

I have a Fedora 21 node on which I configured the ovirt-master repo and installed vdsm.

When adding this node to engine, the package installation succeeds, but after SetupNetworks is called to create the bridge, the node goes to the Non-Operational state (the required network is not created).

vdsm.log ->
Thread-16::ERROR::2015-05-07 15:45:18,382::API::1560::vds::(_rollback) connectivity check failed
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1558, in _rollback
    yield rollbackCtx
  File "/usr/share/vdsm/API.py", line 1421, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 774, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (10, 'connectivity check failed')

Version-Release number of selected component (if applicable):
3.6-master


Additional info:
Attached vdsm and supervdsm log

Comment 1 Sahina Bose 2015-05-07 10:30:42 UTC

Created attachment 1023049 [details]
supervdsm

Comment 2 Dan Kenigsberg 2015-05-11 00:20:15 UTC

Vdsm is asked to define the network with dhcp on top of eth0.

MainProcess|Thread-16::DEBUG::2015-05-07 15:43:11,532::supervdsmServer::105::SuperVdsm.ServerCallback::(wrapper) call setupNetworks with ({u'ovirtmgmt': {u'nic': u'eth0', u'bootproto': u'dhcp', u'STP': u'no', u'bridged': u'true', u'mtu': u'1500'}}, {}, {u'connectivityCheck': u'true', u'connectivityTimeout': 120}) {}

It got a DHCP lease:

sourceRoute::INFO::2015-05-07 15:43:14,684::sourceroute::74::root::(configure) Configuring gateway - ip: 10.70.43.65, network: 10.70.40.0/22, subnet: 255.255.252.0, gateway: 10.70.43.254, table: 172370753, device: ovirtmgmt
sourceRoute::DEBUG::2015-05-07 15:43:14,705::utils::678::root::(execCmd) 

but there was no ping from Engine for 2 minutes:

MainProcess|Thread-16::INFO::2015-05-07 15:45:13,694::api::723::setupNetworks::(_check_connectivity) Connectivity check failed, rolling back


Sahina, can you attach ifcfg-eth0 as it was prior to the installation? Was the host added to Engine with an IP address other than 10.70.43.65?

Comment 3 Sahina Bose 2015-05-11 08:30:32 UTC

Dan, retried again.
Before install :

# Generated by dracut initrd
DEVICE="eth0"
ONBOOT=yes
NETBOOT=yes
UUID="ea0771d6-e4f1-4ccf-933f-340908632df9"
IPV6INIT=yes
BOOTPROTO=dhcp
TYPE=Ethernet
NAME="eth0"
/etc/sysconfig/network-scripts/ifcfg-eth0

Result of ip addr show:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:15:1e:00:6c:4e brd ff:ff:ff:ff:ff:ff
    inet 10.70.42.253/22 brd 10.70.43.255 scope global dynamic eth0
       valid_lft 70525sec preferred_lft 70525sec
    inet6 fe80::215:1eff:fe00:6c4e/64 scope link 
       valid_lft forever preferred_lft forever


From logs:
sourceRoute::INFO::2015-05-11 12:54:42,401::sourceroute::74::root::(configure) Configuring gateway - ip: 10.70.43.65, network: 10.70.40.0/22, subnet: 255.255.252.0, gateway: 10.70.43.254, table: 172370753, device: ovirtmgmt



So it looks like ovirtmgmt gets a different IP address from the one that was used to add the host to engine.
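
A quick way to confirm that both addresses belong to the network reported in the log, and that only the leased address changed, is sketched below. It uses Python's standard ipaddress module; the variable names are illustrative only.

```python
import ipaddress

# Network reported by the sourceRoute log line in this comment.
network = ipaddress.ip_network("10.70.40.0/22")

# Address held by eth0 before the bridge was created, and the address
# later leased to ovirtmgmt.
nic_address = ipaddress.ip_address("10.70.42.253")
bridge_address = ipaddress.ip_address("10.70.43.65")

# Both addresses sit in the same /22, so routing is not the difference.
same_network = nic_address in network and bridge_address in network
# The lease itself changed, which is what breaks the engine's connection.
address_changed = nic_address != bridge_address

print(same_network, address_changed)
```

Since the engine keeps talking to the address it was given when the host was added, a changed lease inside the same network is enough to fail the connectivity check.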

Comment 4 Dan Kenigsberg 2015-05-11 08:50:12 UTC

On reboot, do you keep getting 10.70.42.253 ?

Can you attach /var/log/messages so we can see whether you happen to have two DHCP servers on your network?

Comment 5 Sahina Bose 2015-05-11 11:57:15 UTC

Yes, on reboot and on deletion of ovirtmgmt bridge, I do get back 10.70.42.253

There's only one DHCP server on the network; I checked with our sysadmin, who confirmed that both of these leases are provided by the same DHCP server.

I have attached the stripped down version of /var/log/messages. Let me know if it suffices.

Comment 6 Sahina Bose 2015-05-11 11:57:46 UTC

Created attachment 1024198 [details]
messages

Comment 7 Dan Kenigsberg 2015-05-11 16:06:14 UTC

Clearly, the DHCP process on boot is very different from the later one.
The first starts with

  DHCPREQUEST on eth0 to 10.70.34.2 port 67

May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPREQUEST on eth0 to 10.70.34.2 port 67 (xid=0x673c93ea)
May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPACK from 10.70.34.2 (xid=0x673c93ea)

and the second with

May 11 12:54:39 dhcp43-73 dhclient[12614]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 5 (xid=0x3040f1ea)
May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPREQUEST on ovirtmgmt to 255.255.255.255 port 67 (xid=0x3040f1ea)
May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPOFFER from 10.70.43.254

Note that you have two DHCP servers offering addresses: one at 10.70.34.2 and the other at 10.70.43.254.

I'd consider this a network misconfiguration. However, please share your

/var/lib/dhclient/dhclient-*.leases

to understand the origin of the difference.
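
The comparison of the two dhclient runs can be repeated mechanically. The following is a minimal sketch, assuming syslog lines shaped like the excerpts quoted in this comment; the regular expression and the function name are illustrative. Note that the address printed after "from" may belong to a relay rather than to the server itself, so two addresses do not by themselves prove two servers.

```python
import re

# Matches the "DHCPACK from <ip>" and "DHCPOFFER from <ip>" messages
# that dhclient writes to syslog (format taken from the excerpts above).
REPLY_RE = re.compile(
    r"dhclient\[\d+\]: (DHCPACK|DHCPOFFER) from (\d+\.\d+\.\d+\.\d+)"
)


def responders(log_lines):
    """Return the set of (message type, source address) pairs seen in the log."""
    found = set()
    for line in log_lines:
        match = REPLY_RE.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return found


sample = [
    "May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPACK from 10.70.34.2 (xid=0x673c93ea)",
    "May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPOFFER from 10.70.43.254",
]
print(sorted(responders(sample)))
```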

Comment 8 Sahina Bose 2015-05-12 06:22:46 UTC

According to our sysadmin 10.70.43.254 - is the gateway and not the DHCP server.

Dominic, could you provide the lease info from the DHCP server?

Comment 9 Dan Kenigsberg 2015-05-12 09:49:40 UTC

There might be a DHCP repeater on the gateway. Your host is within hearing range of both.

Comment 10 Dominic Geevarghese 2015-05-12 10:55:15 UTC

I have tried to reproduce the issue on rhel6/7 without luck. However, the issue is always reproducible in f21. The difference I have noticed is that both rhel6 and rhel7 clients use the MAC address as 'dhclient-identifier', while f21 uses 'uid'. Whenever I change the config (add/remove the bridge interface), I see a new lease entry in the DHCP lease db with a different uid. So the server treats each request as coming from a new client and supplies a new IP; hence f21 gets a new IP. The behavior change in f21 is deliberate per bz560361, and I observed no difference when I tried the "send dhcp-client-identifier = hardware" parameter in dhclient.conf. We need to check why the client sends a new uid every time the user brings the interface offline/online.

Comment 11 Dominic Geevarghese 2015-05-12 11:25:43 UTC

OK, it seems I missed something while testing the dhcp-client-identifier parameter. I added "send dhcp-client-identifier = hardware" to dhclient.conf and repeated the steps (with and without the bridge), and the same IP was pulled every time.
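
The difference between the two identifier styles can be sketched as follows. Per RFC 2132, a hardware-based client identifier is the hardware type byte followed by the MAC address; per RFC 4361, the identifier is type 255 followed by a 4-byte IAID and the DUID. The helper names and the sample IAID and DUID values are hypothetical and chosen only for illustration; the sketch also assumes that the bridge carries eth0's MAC address.

```python
def hardware_client_id(mac):
    """Client identifier built from the MAC address (hardware type 1 = Ethernet)."""
    return bytes([1]) + bytes.fromhex(mac.replace(":", ""))


def duid_client_id(iaid, duid):
    """RFC 4361 style identifier: type 255, then a 4-byte IAID, then the DUID."""
    if len(iaid) != 4:
        raise ValueError("IAID must be exactly 4 bytes")
    return bytes([255]) + iaid + duid


nic_mac = "00:15:1e:00:6c:4e"  # eth0's MAC from comment 3
bridge_mac = nic_mac           # assumption: the bridge takes over eth0's MAC

# Hypothetical values: a DUID generated for the NIC and a different one
# generated later for the bridge.
iaid = bytes.fromhex("1e006c4e")
nic_duid = bytes.fromhex("000100011ce0a1b200151e006c4e")
bridge_duid = bytes.fromhex("000100011ce1c3d400151e006c4e")

# Hardware-based: identical for NIC and bridge, so the server sees one client.
print(hardware_client_id(nic_mac) == hardware_client_id(bridge_mac))
# DUID-based: a new DUID yields a new identifier, so the server sees a new client.
print(duid_client_id(iaid, nic_duid) == duid_client_id(iaid, bridge_duid))
```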

Comment 12 Sahina Bose 2015-05-12 13:38:50 UTC

Setting this in /etc/dhcp/dhclient.conf and restarting the network gave me a new IP address.
I used this address to add the host to oVirt, and ovirtmgmt got the same address as well.

Comment 13 Shubhendu Tripathi 2015-06-09 07:24:34 UTC

I have also faced this issue recently while adding f21 nodes to oVirt.

Comment 14 Dan Kenigsberg 2015-08-19 09:34:04 UTC

A slightly different scenario: define the management network on top of a bond that includes the original communicating nic as one of its slaves.


QE: please check this flow as well.

Comment 15 Dan Kenigsberg 2015-08-27 11:39:29 UTC

(In reply to Dan Kenigsberg from comment #14)
> A slightly different scenario: define the management network on top of a
> bond that includes the original communicating nic as one of its slaves.
> 
> 
> QE: please check this flow as well.

QE: please ignore my comment 14. If we build our network on top of a bond on top of two nics with two dhclients, we cannot really tell which of the identities we should inherit, so there's not much to do.

Comment 16 Ondřej Svoboda 2015-08-27 22:02:18 UTC

A fix covering the transition from a DHCP-configured NIC to a DHCP-configured bridge has been merged to the stable branch (3.6) and shall be a part of VDSM 4.17.4, when released.

A DHCP Unique Identifier (DUID), taken from a dhclient lease file belonging to the NIC, is reused by a dhclient run on the bridge so the same address is acquired.
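
The idea of reusing the DUID can be illustrated with a short sketch. This is not the merged VDSM code; it only assumes that the lease file contains a line beginning with "default-duid", and the function names and sample lease contents are hypothetical.

```python
# Hypothetical contents of the NIC's dhclient lease file. The raw string
# keeps the octal escapes exactly as they would appear in the file.
SAMPLE_NIC_LEASE = r'''lease {
  interface "eth0";
  fixed-address 10.70.42.253;
}
default-duid "\000\001\000\001\035\001\002\003\000\025\036\000\154\116";
'''


def find_default_duid(lease_text):
    """Return the first line starting with 'default-duid', or None if absent."""
    for line in lease_text.splitlines():
        if line.startswith("default-duid"):
            return line
    return None


def bridge_lease_seed(nic_lease_text):
    """Build initial lease-file contents for the bridge from the NIC's DUID."""
    duid_line = find_default_duid(nic_lease_text)
    return "" if duid_line is None else duid_line + "\n"


print(bridge_lease_seed(SAMPLE_NIC_LEASE), end="")
```

Searching for the first matching line, rather than reading the first line of the file, avoids relying on where dhclient happens to write the statement.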

Comment 17 David Caro 2015-09-02 15:11:14 UTC

The issue is happening again on vdsm 4.17.4

Comment 18 David Caro 2015-09-02 15:12:02 UTC

* It happens again on el7 (forgot to mention)

Comment 19 Yedidyah Bar David 2015-09-03 08:20:09 UTC

Only now noticed that this bug exists.

Not sure if it's a good solution, but I use the following as a workaround, for now:

# Seed the bridge's lease file with an existing DUID, unless the file already exists.
lf=/var/lib/dhclient/dhclient--ovirtmgmt.lease
[ -e "$lf" ] || cat /var/lib/NetworkManager/*.lease /var/lib/dhclient/*.lease 2>/dev/null | grep '^default-duid' | head -1 > "$lf"

You have to run this between getting your first lease and asking for a lease for the bridge.

Another obvious workaround is allocating static leases on the server based on MAC addresses. This is not always applicable and has maintenance costs.

See [1] for details.

[1] https://tools.ietf.org/html/rfc4361

Comment 20 Yedidyah Bar David 2015-09-03 09:45:39 UTC

(In reply to David Caro from comment #18)
> * It happens again on el7 (forgot to mention)

I asked David in private and he acked that the first lease was acquired by NetworkManager, which keeps its lease files in /var/lib/NetworkManager .
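
Because the first lease may have been written by either dhclient or NetworkManager, a lookup has to consider both directories. The sketch below is illustrative only: the two directories are the ones named in this bug, while the file-name pattern ("*-<interface>.lease") and the function name are assumptions.

```python
import glob
import os

# Directories named in this bug; the file-name pattern below is an assumption.
LEASE_DIRS = ("/var/lib/dhclient", "/var/lib/NetworkManager")


def lease_files_for(iface, lease_dirs=LEASE_DIRS):
    """List lease files whose names end in '-<iface>.lease' in the given directories."""
    paths = []
    for directory in lease_dirs:
        pattern = os.path.join(directory, "*-%s.lease" % glob.escape(iface))
        paths.extend(sorted(glob.glob(pattern)))
    return paths


# On a host without these directories this simply prints an empty list.
print(lease_files_for("eth0"))
```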

Comment 21 Ondřej Svoboda 2015-09-03 10:20:04 UTC

Didi, thanks for suggesting that the first line of a lease might be sufficient.

I have posted a quick fix that avoids using the non-existent -df on EL7 for now and will get to falling back to -lf in the coming days.

Comment 22 Yedidyah Bar David 2015-09-03 11:41:20 UTC

(In reply to Ondřej Svoboda from comment #21)
> Didi, thanks for suggesting that the first line of a lease might be
> sufficient.

I used the first _occurrence_ of default-duid, not the first line.

Since it's a workaround, I just grep all files (my machines have just one nic). In real code you should check the file for the interface you are going to use (which iiuc you do), and will probably find only one occurrence anyway.

It's probably always written first, but I wouldn't count on that, unless it's documented (unlikely).

> 
> I have posted a quick fix that avoids using the non-existent -df on EL7 for
> now and will get to falling back to -lf in the coming days.

+1

Also note the issue with NM.

Comment 24 Meni Yakove 2015-11-11 10:18:44 UTC

Can you verify this bug?