Created attachment 1023038
vdsm

Description of problem:
I have a Fedora 21 node on which I configured the ovirt-master repo and installed vdsm. When adding this node to the engine, the package installation succeeds, but after SetupNetworks is called to create the bridge, the node goes to the Non-Operational state (the required network is not created).

vdsm.log:
Thread-16::ERROR::2015-05-07 15:45:18,382::API::1560::vds::(_rollback) connectivity check failed
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1558, in _rollback
    yield rollbackCtx
  File "/usr/share/vdsm/API.py", line 1421, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 774, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (10, 'connectivity check failed')

Version-Release number of selected component (if applicable):
3.6-master

Additional info:
Attached vdsm and supervdsm logs.
Created attachment 1023049
supervdsm
Vdsm is asked to define the network with dhcp on top of eth0.

MainProcess|Thread-16::DEBUG::2015-05-07 15:43:11,532::supervdsmServer::105::SuperVdsm.ServerCallback::(wrapper) call setupNetworks with ({u'ovirtmgmt': {u'nic': u'eth0', u'bootproto': u'dhcp', u'STP': u'no', u'bridged': u'true', u'mtu': u'1500'}}, {}, {u'connectivityCheck': u'true', u'connectivityTimeout': 120}) {}

It got a dhcp lease

sourceRoute::INFO::2015-05-07 15:43:14,684::sourceroute::74::root::(configure) Configuring gateway - ip: 10.70.43.65, network: 10.70.40.0/22, subnet: 255.255.252.0, gateway: 10.70.43.254, table: 172370753, device: ovirtmgmt
sourceRoute::DEBUG::2015-05-07 15:43:14,705::utils::678::root::(execCmd)

but no ping from engine for 2 minutes:

MainProcess|Thread-16::INFO::2015-05-07 15:45:13,694::api::723::setupNetworks::(_check_connectivity) Connectivity check failed, rolling back

Sahina, can you attach ifcfg-eth0 as it was prior to the installation? Was the host added to Engine with an IP address other than 10.70.43.65?
Dan, I retried.

Before install:

# Generated by dracut initrd
DEVICE="eth0"
ONBOOT=yes
NETBOOT=yes
UUID="ea0771d6-e4f1-4ccf-933f-340908632df9"
IPV6INIT=yes
BOOTPROTO=dhcp
TYPE=Ethernet
NAME="eth0"
/etc/sysconfig/network-scripts/ifcfg-eth0

Result of ip addr show:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:15:1e:00:6c:4e brd ff:ff:ff:ff:ff:ff
    inet 10.70.42.253/22 brd 10.70.43.255 scope global dynamic eth0
       valid_lft 70525sec preferred_lft 70525sec
    inet6 fe80::215:1eff:fe00:6c4e/64 scope link
       valid_lft forever preferred_lft forever

From logs:

sourceRoute::INFO::2015-05-11 12:54:42,401::sourceroute::74::root::(configure) Configuring gateway - ip: 10.70.43.65, network: 10.70.40.0/22, subnet: 255.255.252.0, gateway: 10.70.43.254, table: 172370753, device: ovirtmgmt

So it looks like ovirtmgmt gets a different IP address from the one that was used to add the host to the engine.
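For anyone following along, the two addresses above can be compared with a few lines of Python. This is only an illustrative check using the standard ipaddress module; the function name compare_leases is made up for this comment and is not part of vdsm.

```python
import ipaddress


def compare_leases(before_cidr, after_ip):
    """Compare the address eth0 had before install with the bridge's address.

    before_cidr: address/prefix as printed by `ip addr show`.
    after_ip: address reported in the sourceroute log line.
    """
    before = ipaddress.ip_interface(before_cidr)
    after = ipaddress.ip_address(after_ip)
    return {
        "changed": before.ip != after,            # did the host get a new address?
        "same_subnet": after in before.network,   # is it still on the same network?
        "network": str(before.network),
    }


if __name__ == "__main__":
    # Values copied from the comment above.
    print(compare_leases("10.70.42.253/22", "10.70.43.65"))
```

Both addresses fall inside 10.70.40.0/22, so the subnet did not change; only the lease handed to the bridge differs from the one eth0 had.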
On reboot, do you keep getting 10.70.42.253? Can you attach /var/log/messages so we can see whether you happen to have two dhcp servers on your network?
Yes, on reboot and on deletion of the ovirtmgmt bridge, I do get back 10.70.42.253.

There's only one dhcp server on the network; I checked with our sysadmin, who confirmed that both of these leases are provided by the same dhcp server.

I have attached a stripped-down version of /var/log/messages. Let me know if it suffices.
Created attachment 1024198
messages
Clearly, the dhcp process on boot is very different from the later one. The first starts with DHCPREQUEST on eth0 to 10.70.34.2 port 67

May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPREQUEST on eth0 to 10.70.34.2 port 67 (xid=0x673c93ea)
May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPACK from 10.70.34.2 (xid=0x673c93ea)

and the second with

May 11 12:54:39 dhcp43-73 dhclient[12614]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 5 (xid=0x3040f1ea)
May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPREQUEST on ovirtmgmt to 255.255.255.255 port 67 (xid=0x3040f1ea)
May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPOFFER from 10.70.43.254

Note that you have two DHCP servers offering addresses: one at 10.70.34.2 and the other at 10.70.43.254. I'd consider this a network misconfiguration. However, please share your /var/lib/dhclient/dhclient-*.leases so we can understand the origin of the difference.
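To make the comparison above repeatable on a longer /var/log/messages, here is a minimal Python sketch that collects, per dhclient process, the addresses that answered with DHCPOFFER or DHCPACK. The regular expression is an assumption based only on the log lines quoted in this comment, and the names REPLY_RE and responders are made up for illustration. The address after "from" is whatever dhclient logged; it may be a relay rather than the DHCP server itself.

```python
import re

# Matches dhclient syslog messages such as
# "dhclient[11852]: DHCPACK from 10.70.34.2 (xid=0x673c93ea)".
# The pattern is derived from the lines quoted in this bug only.
REPLY_RE = re.compile(
    r"dhclient\[(?P<pid>\d+)\]: (?:DHCPOFFER|DHCPACK) "
    r"from (?P<server>\d+(?:\.\d+){3})"
)


def responders(log_lines):
    """Return {dhclient pid: sorted list of addresses that sent OFFER/ACK}."""
    seen = {}
    for line in log_lines:
        match = REPLY_RE.search(line)
        if match:
            seen.setdefault(match.group("pid"), set()).add(match.group("server"))
    return {pid: sorted(servers) for pid, servers in seen.items()}


# Log lines copied from the comment above.
SAMPLE = [
    "May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPREQUEST on eth0 to 10.70.34.2 port 67 (xid=0x673c93ea)",
    "May 10 10:35:17 dhcp43-73 dhclient[11852]: DHCPACK from 10.70.34.2 (xid=0x673c93ea)",
    "May 11 12:54:39 dhcp43-73 dhclient[12614]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 5 (xid=0x3040f1ea)",
    "May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPREQUEST on ovirtmgmt to 255.255.255.255 port 67 (xid=0x3040f1ea)",
    "May 11 12:54:40 dhcp43-73 dhclient[12614]: DHCPOFFER from 10.70.43.254",
]

if __name__ == "__main__":
    for pid, servers in responders(SAMPLE).items():
        print("dhclient[%s] heard from: %s" % (pid, ", ".join(servers)))
```

On the sample, dhclient 11852 (eth0) heard from 10.70.34.2 while dhclient 12614 (ovirtmgmt) heard from 10.70.43.254.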
According to our sysadmin, 10.70.43.254 is the gateway and not the DHCP server. Dominic, could you provide the lease info from the DHCP server?
There might be a DHCP repeater on the gateway. Your host is within hearing range of both.
I have tried to reproduce the issue on rhel6/7 without luck. However, the issue is always reproducible on f21.

The difference I have noticed is that both rhel6 and rhel7 clients use the MAC address as 'dhclient-identifier', while f21 uses 'uid'.

Whenever I change the config (add/remove the bridge interface), I see a new lease entry in the dhcp lease db with a different uid. So the server treats each request as coming from a new client and supplies a new IP. Hence f21 gets a new IP.

The behavior change in f21 is deliberate per bz560361, and I observed no difference when I tried the "send dhcp-client-identifier = hardware" parameter in dhclient.conf.

Need to check why the client sends a new uid every time the user brings the interface offline/online.
OK, it seems I missed something while testing the dhcp-client-identifier parameter.

I added "send dhcp-client-identifier = hardware" to dhclient.conf and repeated the steps (with and without the bridge); the client pulled the same IP every time.
Setting this in /etc/dhcp/dhclient.conf and restarting the network gave me a new IP address. I used this address to add the host to ovirt, and ovirtmgmt got the same address as well.
I have also faced this issue recently while adding f21 nodes to oVirt.
A slightly different scenario: define the management network on top of a bond that includes the original communicating nic as one of its slaves. QE: please check this flow as well.
(In reply to Dan Kenigsberg from comment #14)
> A slightly different scenario: define the management network on top of a
> bond that includes the original communicating nic as one of its slaves.
>
> QE: please check this flow as well.

QE: please ignore my comment 14. If we build our network on top of a bond on top of two nics with two dhclients, we cannot really tell which of the identities we should inherit, so there's nothing much to do.
A fix covering the transition from a DHCP-configured NIC to a DHCP-configured bridge has been merged to the stable branch (3.6) and shall be a part of VDSM 4.17.4, when released. A DHCP Unique Identifier (DUID), taken from a dhclient lease file belonging to the NIC, is reused by a dhclient run on the bridge so the same address is acquired.
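For illustration only, the idea can be sketched in Python as below. This is not the merged VDSM code: the function names, the regular expression, and the choice to write just the default-duid statement into a not-yet-existing bridge lease file are assumptions made for this comment.

```python
import os
import re

# A dhclient lease file may contain a statement like:
#   default-duid "\000\001...";
# We search for its first occurrence anywhere in the file rather than
# assuming it is the first line.
DUID_RE = re.compile(r"^\s*default-duid\s+.*;\s*$", re.MULTILINE)


def find_default_duid(lease_text):
    """Return the first default-duid statement in lease_text, or None."""
    match = DUID_RE.search(lease_text)
    return match.group(0).strip() if match else None


def seed_bridge_lease(nic_lease_path, bridge_lease_path):
    """Copy the NIC's DUID into a new lease file for the bridge.

    Returns True if the bridge lease file was written, False if it already
    existed or the NIC lease file carries no default-duid statement.
    """
    if os.path.exists(bridge_lease_path):
        return False  # never overwrite an identity the bridge already has
    with open(nic_lease_path) as nic_file:
        duid = find_default_duid(nic_file.read())
    if duid is None:
        return False
    with open(bridge_lease_path, "w") as bridge_file:
        bridge_file.write(duid + "\n")
    return True
```

A caller would run seed_bridge_lease() with the lease file of the NIC and the lease file dhclient is going to use for the bridge, before dhclient is started on the bridge. Both paths depend on the distribution and on whether NetworkManager acquired the first lease, so they are left to the caller here.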
The issue is happening again on vdsm 4.17.4
* It happens again on el7 (forgot to mention)
Only now noticed that this bug exists.

Not sure if it's a good solution, but I use the following as a workaround, for now:

lf=/var/lib/dhclient/dhclient--ovirtmgmt.lease
[ -e $lf ] || cat /var/lib/NetworkManager/*.lease /var/lib/dhclient/*.lease 2>/dev/null | grep ^default-duid | head -1 > $lf

You have to run this between getting your first lease and asking for a lease for the bridge.

Another obvious workaround is allocating static leases on the server based on MAC addresses. Not always applicable and has maintenance costs.

See [1] for details.

[1] https://tools.ietf.org/html/rfc4361
(In reply to David Caro from comment #18)
> * It happens again on el7 (forgot to mention)

Asked David in private and he acked that the first lease was acquired by NetworkManager, which keeps its lease files in /var/lib/NetworkManager.
Didi, thanks for suggesting that the first line of a lease might be sufficient. I have posted a quick fix that avoids using the non-existent -df on EL7 for now and will get to falling back to -lf in the coming days.
(In reply to Ondřej Svoboda from comment #21)
> Didi, thanks for suggesting that the first line of a lease might be
> sufficient.

I used the first _occurrence_ of default-duid, not the first line. Since it's a workaround, I just grep all files (my machines have just one nic). In real code you should check the file for the interface you are going to use (which iiuc you do), and will probably find only one occurrence anyway. It's probably always written first, but I wouldn't count on that, unless it's documented (unlikely).

> I have posted a quick fix that avoids using the non-existent -df on EL7 for
> now and will get to falling back to -lf in the coming days.

+1

Also note the issue with NM.
Can you verify this bug?