Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1588052

Summary: DHCP-configured hosts permanently lose the management bridge if the DHCP server is down for a significant (20 minutes?) time frame at host boot time
Product: [oVirt] vdsm
Component: SuperVDSM
Status: CLOSED DEFERRED
Severity: high
Priority: medium
Version: 4.20.23
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Reporter: Simone Tiraboschi <stirabos>
Assignee: Edward Haas <edwardh>
QA Contact: Meni Yakove <myakove>
Docs Contact:
CC: bugs, danken, dholler, edwardh, elv1313, pagranat, ylavi
Flags: sbonazzo: ovirt-4.2-
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-10 11:50:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments (flags):
- supervdsm logs (none)
- agent.log (none)
- broker.log (none)

Description Simone Tiraboschi 2018-06-06 14:20:06 UTC
Description of problem:
The issue was first reported by Polina regarding an HE environment with three DHCP-configured hosts.

Due to a power outage we lost all three HE hosts plus the DHCP server at the same time.

At power recovery the three hosts came up before the DHCP server.

ovirtmgmt didn't come up on the hosts at boot due to the lack of an IPv4 address:
restore-net::DEBUG::2018-06-04 08:48:06,869::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:08,029::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:09,202::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:10,580::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:11,681::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::WARNING::2018-06-04 08:48:12,980::restore_net_config::341::root::(_wait_for_for_all_devices_up) Not all devices are up. VDSM might restore them although they were not changed since they were persisted.
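The restore loop above polls until every expected device is up and gives up with a warning once a timeout expires. A minimal sketch of that polling pattern (illustrative only, not vdsm's actual implementation; `is_up` is a hypothetical up-state callback):

```python
import time

def wait_for_devices_up(devices, is_up, timeout=5.0, interval=1.0):
    """Poll until every device in `devices` reports up or `timeout`
    expires; return the set of devices still down (empty on success)."""
    pending = set(devices)
    deadline = time.monotonic() + timeout
    while pending:
        pending = {dev for dev in pending if not is_up(dev)}
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(interval)
    return pending

# A check that never reports the bridge as up mimics the DHCP-less boot
# seen in the log above: the loop gives up once the timeout expires.
still_down = wait_for_devices_up(['ovirtmgmt'], lambda dev: False,
                                 timeout=0.2, interval=0.05)
if still_down:
    print('Not all devices are up:', still_down)
```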


So vdsm tried a recovery procedure from the persisted configuration:
restore-net::DEBUG::2018-06-04 08:54:58,371::ifcfg::569::root::(writeConfFile) Writing to file /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt configuration:
# Generated by VDSM version 4.20.29-1.el7ev
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
BOOTPROTO=dhcp
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=yes
IPV6_AUTOCONF=yes

restore-net::DEBUG::2018-06-04 08:54:58,395::cmdutils::150::root::(exec_cmd) /sbin/ip addr flush dev eno1 scope global (cwd None)
restore-net::DEBUG::2018-06-04 08:54:58,401::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
restore-net::DEBUG::2018-06-04 08:54:58,401::ifcfg::569::root::(writeConfFile) Writing to file /etc/sysconfig/network-scripts/ifcfg-eno1 configuration:
# Generated by VDSM version 4.20.29-1.el7ev
DEVICE=eno1
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
DEFROUTE=no
NM_CONTROLLED=no
IPV6INIT=no


but also in that case the bridge didn't come up, for the same reason:
restore-net::DEBUG::2018-06-04 08:59:15,248::ifacetracking::32::root::(add) Add iface tracking for device ovirtmgmt
netlink/events::DEBUG::2018-06-04 08:59:15,249::concurrent::192::root::(run) START thread <Thread(netlink/events, started daemon 140147681171200)> (func=<bound method Monitor._scan of <vdsm.network.netlink.monitor.Monitor object at 0x7f76acfe65d0>>, args=(), kwargs={})
restore-net::DEBUG::2018-06-04 08:59:15,250::cmdutils::150::root::(exec_cmd) /usr/bin/systemd-run --scope --unit=b1511f47-331e-4152-beb4-06a307ded036 --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
netlink/events::DEBUG::2018-06-04 08:59:25,485::concurrent::195::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 140147681171200)>
restore-net::DEBUG::2018-06-04 09:00:17,891::cmdutils::158::root::(exec_cmd) FAILED: <err> = 'Running scope as unit b1511f47-331e-4152-beb4-06a307ded036.scope.\n'; <rc> = 1
restore-net::DEBUG::2018-06-04 09:00:17,891::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 78 seconds (max pending: 1)
restore-net::WARNING::2018-06-04 09:00:17,892::waitfor::116::root::(_wait_for_event) Expected event "{'scope': 'global', 'family': 'inet', 'label': u'ovirtmgmt'}" of interface "ovirtmgmt" was not caught within 10sec. Caught events: []
restore-net::DEBUG::2018-06-04 09:00:17,892::legacy_switch::437::root::(add_missing_networks) Adding network u'ovirtmgmt' failed. Running orphan-devices cleanup


and so the rollback procedure completely removed the management bridge from the host:
restore-net::ERROR::2018-06-04 09:06:02,282::__init__::54::root::(__exit__) Failed rollback transaction last known good network.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 160, in _setup_legacy
    bondings, _netinfo)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 433, in add_missing_networks
    _add_network(network, configurator, _netinfo, bondattr, **attrs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 168, in wrapped
    return func(network, configurator, net_info, bondattr, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 224, in _add_network
    net_ent_to_configure.configure(**options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/models.py", line 187, in configure
    self.configurator.configureBridge(self, **opts)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 114, in configureBridge
    _ifup(bridge)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 977, in _ifup
    _exec_ifup(iface, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 942, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 928, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, '\n')
restore-net::INFO::2018-06-04 09:06:02,283::netconfpersistence::62::root::(removeNetwork) Removing network ovirtmgmt


After some time the DHCP server came back and eno1 correctly got an IPv4 address from it, but ovirtmgmt was already gone.

[root@alma05 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 2c:59:e5:37:cf:f0 brd ff:ff:ff:ff:ff:ff
    inet 10.35.92.5/22 brd 10.35.95.255 scope global dynamic eno1
       valid_lft 24297sec preferred_lft 24297sec
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 2c:59:e5:37:cf:f4 brd ff:ff:ff:ff:ff:ff
21: ;vdsmdummy;: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:ad:f0:27:1d:63 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::6cad:f0ff:fe27:1d63/64 scope link 
       valid_lft forever preferred_lft forever


So ovirt-ha-broker was able to reconnect the NFS storage domain, but it was not possible to start the engine VM due to the missing ovirtmgmt bridge.

2018-06-06 11:08:46,994+0300 ERROR (vm/cfc361da) [virt.vm] (vmId='cfc361da-e3fc-4a11-9674-42d9108d2776') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2789, in _run
    self._setup_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2613, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup
    displaynetwork.create_network(display_network, self.vmid)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 26, in create_network
    display_device = _display_device(netname)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 35, in _display_device
    return net_api.network_northbound(netname)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 100, in network_northbound
    return netswitch.configurator.net2northbound(network_name)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 438, in net2northbound
    is_legacy = net_attr['switch'] == legacy_switch.SWITCH_TYPE
TypeError: 'NoneType' object has no attribute '__getitem__'

This happened at the same time on all three hosts, so we completely lost the environment until we manually recovered ovirtmgmt, and the cause wasn't obvious.


How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on a DHCP-configured host.
2. Stop the DHCP server.
3. Reboot the host and wait at least 20 minutes after host boot.

Actual results:
After N minutes the management bridge disappears.
When the DHCP server comes back, the IPv4 address is assigned to the base interface and the engine VM fails to start due to the missing ovirtmgmt bridge.

Expected results:
ovirtmgmt stays on the host and the host keeps waiting for DHCP.

Additional info:

Comment 1 Simone Tiraboschi 2018-06-06 14:23:09 UTC
Created attachment 1448355 [details]
supervdsm logs

Comment 2 Edward Haas 2018-07-23 05:19:14 UTC
This looks more like an RFE than a bug.
VDSM attempts to recover the host networks automatically, but if it fails, it is left to the operator/admin to handle the recovery.

I will start with my usual statement: an enterprise server should not use DHCP if you want stability. Losing a portion of the network, or the DHCP server itself, will take down such servers, and that is not reasonable.

That said, how to recover the host networks on boot is open for discussion. Currently, the recovery runs in a "blocking" mode: it waits for the DHCP response until a timeout and obviously fails if the server is not responsive. The other, non-blocking mode applies during regular runtime, when Engine controls the setup: VDSM does not wait for the DHCP response but returns beforehand, giving the server a chance to respond.
If I'm not mistaken, in both cases the dhclient is cancelled if the DHCP server does not respond within the timeout.

In other words, we are not using the PERSISTENT_DHCLIENT option, neither during recovery nor during regular runtime. It is left up to the operator/admin to handle it.

```
    PERSISTENT_DHCLIENT=yes|no|1|0                                              
      Without this option, or if it is 'no'/'0', and BOOTPROTO=dhcp,            
      dhclient is run for the interface in "one-shot" mode; if the              
      dhcp server does not respond for a configurable timeout, then             
      dhclient exits and the interface is not brought up -                      
      the '-1' option is given to dhclient.                                     
      If PERSISTENT_DHCLIENT=yes, then dhclient will keep on trying             
      to contact the dhcp server when it does not respond - no '-1'             
      option is given to dhclient. Note: this disables the automatic            
      checking for the presence of a link before starting dhclient. 
```

I personally think it is more stable to keep the behaviour as is: it has fewer surprises (and more field experience behind it) and gives more visibility to the operator/admin.
And as general advice to customers: production hosts should not use DHCP in the server cluster unless it is in a transition stage (deployments, for example).
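For operators who accept the trade-off, the option quoted above lives in the persisted ifcfg file itself. An illustrative fragment (not something vdsm generates by default; vdsm rewrites these files on setup, so a manual edit may not survive):

```
# /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (fragment)
DEVICE=ovirtmgmt
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=dhcp
# Keep dhclient retrying instead of exiting in one-shot mode (drops the
# '-1' flag); note this also disables the link-presence check.
PERSISTENT_DHCLIENT=yes
```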

Comment 3 Edward Haas 2018-10-02 09:36:27 UTC
Can we close this one with WONTFIX or DEFERRED?

Comment 4 Dan Kenigsberg 2018-10-10 11:50:14 UTC
Indeed, we have no capacity to handle the case of a missing DHCP server during boot time. In such cases, we ask the user to wait for an active DHCP server and then reboot the hypervisor.

Comment 5 elv1313 2019-05-24 21:07:25 UTC
Created attachment 1573047 [details]
agent.log

agent.log with the error

Comment 6 elv1313 2019-05-24 21:09:00 UTC
Created attachment 1573048 [details]
broker.log

broker.log

Comment 7 elv1313 2019-05-24 21:09:47 UTC
oops, sorry, I posted the attachment here instead of in the potential duplicate, please remove

https://bugzilla.redhat.com/show_bug.cgi?id=1707086