Bug 1588052
| Summary: | DHCP-configured hosts are going to permanently lose the management bridge if the DHCP server is down for a sufficiently long (20 minutes?) time frame at host boot time | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Simone Tiraboschi <stirabos> | ||||||||
| Component: | SuperVDSM | Assignee: | Edward Haas <edwardh> | ||||||||
| Status: | CLOSED DEFERRED | QA Contact: | Meni Yakove <myakove> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | medium | ||||||||||
| Version: | 4.20.23 | CC: | bugs, danken, dholler, edwardh, elv1313, pagranat, ylavi | ||||||||
| Target Milestone: | --- | Flags: | sbonazzo: ovirt-4.2- ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2018-10-10 11:50:14 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: | ||||||||||
Created attachment 1448355 [details]
supervdsm logs
This looks more like an RFE and not a bug.
VDSM attempts to recover the host networks automatically, but if it fails, it is left to the operator/admin to handle the recovery.
I will start with my usual statement: an enterprise server should not use DHCP if you want stability. Losing a portion of the network or the DHCP server itself will take down such servers, and that is not reasonable.
That said, how to recover the host networks on boot is open for discussion. Currently, the recovery is issued in a "blocking" mode, waiting for the DHCP response until the timeout and obviously failing if the server is not responsive. The other, non-blocking mode is used during regular runtime, when Engine controls the setup: VDSM does not wait for the DHCP response but returns beforehand, giving the server a chance to respond.
If I'm not mistaken, in both cases, if the DHCP server does not respond before the timeout, dhclient is cancelled.
In other words, we are not using the PERSISTENT_DHCLIENT option, either during recovery or during regular runtime. It is left up to the operator/admin to handle it.
```
PERSISTENT_DHCLIENT=yes|no|1|0
Without this option, or if it is 'no'/'0', and BOOTPROTO=dhcp,
dhclient is run for the interface in "one-shot" mode; if the
dhcp server does not respond for a configurable timeout, then
dhclient exits and the interface is not brought up -
the '-1' option is given to dhclient.
If PERSISTENT_DHCLIENT=yes, then dhclient will keep on trying
to contact the dhcp server when it does not respond - no '-1'
option is given to dhclient. Note: this disables the automatic
checking for the presence of a link before starting dhclient.
```
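For illustration only, here is a minimal sketch of what a persistent-mode bridge configuration could look like. This is an assumption-laden example, not what VDSM currently writes; only `PERSISTENT_DHCLIENT=yes` differs from the one-shot behaviour described above.

```
# Sketch of /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (illustrative,
# not VDSM-generated): with PERSISTENT_DHCLIENT=yes, dhclient keeps retrying
# instead of exiting when the DHCP server is unreachable at boot.
DEVICE=ovirtmgmt
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=dhcp
PERSISTENT_DHCLIENT=yes
NM_CONTROLLED=no
```

Note the caveat quoted above: persistent mode also disables the automatic link-presence check before dhclient starts.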
I personally think it is more stable to keep the behaviour as is: it has fewer surprises (and more field experience behind it) and gives more visibility to the operator/admin.
And as general advice to customers: production hosts should not use DHCP in the server cluster unless they are in a transition stage (deployments, for example).
Can we close this one with WONTFIX or DEFERRED? Indeed, we have no capacity to handle the case of no DHCP server during boot time. In such cases, we ask to wait for an active DHCP server and reboot the hypervisor.
Created attachment 1573047 [details]
agent.log
agent.log with the error
Created attachment 1573048 [details]
broker.log
broker.log
Oops, sorry, I posted the attachments here instead of on the potential duplicate; please remove. https://bugzilla.redhat.com/show_bug.cgi?id=1707086
Description of problem:

The issue was first reported by Polina regarding an HE environment with three DHCP-configured hosts. Due to a power outage we lost all three HE hosts plus the DHCP server at the same time. At power recovery the three hosts came up before the DHCP server. ovirtmgmt didn't come up on the hosts at boot due to the lack of an IPv4 address:

restore-net::DEBUG::2018-06-04 08:48:06,869::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:08,029::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:09,202::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:10,580::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::DEBUG::2018-06-04 08:48:11,681::restore_net_config::336::root::(_wait_for_for_all_devices_up) waiting for set(['ovirtmgmt']) to be up.
restore-net::WARNING::2018-06-04 08:48:12,980::restore_net_config::341::root::(_wait_for_for_all_devices_up) Not all devices are up. VDSM might restore them although they were not changed since they were persisted.

So VDSM tried a recovery procedure from the persisted configuration.
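As a side note, the `_wait_for_for_all_devices_up` step seen in the log behaves roughly like the following sketch. The names, timeout, and polling interval are illustrative assumptions, not VDSM's actual implementation:

```python
import time


def wait_for_devices_up(expected, is_up, timeout=60.0, interval=1.0):
    """Poll until every device in `expected` reports up, or the timeout
    expires. Returns the set of devices that never came up (empty on
    success). Illustrative sketch only, not VDSM code."""
    deadline = time.monotonic() + timeout
    pending = set(expected)
    while pending and time.monotonic() < deadline:
        # Keep only the devices that are still down.
        pending = {dev for dev in pending if not is_up(dev)}
        if pending:
            time.sleep(interval)
    return pending


# With DHCP down, 'ovirtmgmt' never comes up and the call times out:
missing = wait_for_devices_up(['ovirtmgmt'], lambda dev: False,
                              timeout=0.1, interval=0.01)
```

In the failure above this is exactly the outcome: the poll expires with `ovirtmgmt` still pending, and the fallback recovery from the persisted configuration starts.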
restore-net::DEBUG::2018-06-04 08:54:58,371::ifcfg::569::root::(writeConfFile) Writing to file /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt configuration:
# Generated by VDSM version 4.20.29-1.el7ev
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
BOOTPROTO=dhcp
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
restore-net::DEBUG::2018-06-04 08:54:58,395::cmdutils::150::root::(exec_cmd) /sbin/ip addr flush dev eno1 scope global (cwd None)
restore-net::DEBUG::2018-06-04 08:54:58,401::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
restore-net::DEBUG::2018-06-04 08:54:58,401::ifcfg::569::root::(writeConfFile) Writing to file /etc/sysconfig/network-scripts/ifcfg-eno1 configuration:
# Generated by VDSM version 4.20.29-1.el7ev
DEVICE=eno1
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
DEFROUTE=no
NM_CONTROLLED=no
IPV6INIT=no

But also in that case the bridge didn't come up, for the same reason:

restore-net::DEBUG::2018-06-04 08:59:15,248::ifacetracking::32::root::(add) Add iface tracking for device ovirtmgmt
netlink/events::DEBUG::2018-06-04 08:59:15,249::concurrent::192::root::(run) START thread <Thread(netlink/events, started daemon 140147681171200)> (func=<bound method Monitor._scan of <vdsm.network.netlink.monitor.Monitor object at 0x7f76acfe65d0>>, args=(), kwargs={})
restore-net::DEBUG::2018-06-04 08:59:15,250::cmdutils::150::root::(exec_cmd) /usr/bin/systemd-run --scope --unit=b1511f47-331e-4152-beb4-06a307ded036 --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
netlink/events::DEBUG::2018-06-04 08:59:25,485::concurrent::195::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 140147681171200)>
restore-net::DEBUG::2018-06-04 09:00:17,891::cmdutils::158::root::(exec_cmd) FAILED: <err> = 'Running scope as unit b1511f47-331e-4152-beb4-06a307ded036.scope.\n'; <rc> = 1
restore-net::DEBUG::2018-06-04 09:00:17,891::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 78 seconds (max pending: 1)
restore-net::WARNING::2018-06-04 09:00:17,892::waitfor::116::root::(_wait_for_event) Expected event "{'scope': 'global', 'family': 'inet', 'label': u'ovirtmgmt'}" of interface "ovirtmgmt" was not caught within 10sec. Caught events: []
restore-net::DEBUG::2018-06-04 09:00:17,892::legacy_switch::437::root::(add_missing_networks) Adding network u'ovirtmgmt' failed. Running orphan-devices cleanup

And so the rollback procedure completely removed the management bridge from the host:

restore-net::ERROR::2018-06-04 09:06:02,282::__init__::54::root::(__exit__) Failed rollback transaction last known good network.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 160, in _setup_legacy
    bondings, _netinfo)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 433, in add_missing_networks
    _add_network(network, configurator, _netinfo, bondattr, **attrs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 168, in wrapped
    return func(network, configurator, net_info, bondattr, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 224, in _add_network
    net_ent_to_configure.configure(**options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/models.py", line 187, in configure
    self.configurator.configureBridge(self, **opts)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 114, in configureBridge
    _ifup(bridge)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 977, in _ifup
    _exec_ifup(iface, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 942, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 928, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, '\n')
restore-net::INFO::2018-06-04 09:06:02,283::netconfpersistence::62::root::(removeNetwork) Removing network ovirtmgmt

After some time the DHCP server came back and eno1 correctly got an IPv4 address from it, but ovirtmgmt was already gone.

[root@alma05 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 2c:59:e5:37:cf:f0 brd ff:ff:ff:ff:ff:ff
    inet 10.35.92.5/22 brd 10.35.95.255 scope global dynamic eno1
       valid_lft 24297sec preferred_lft 24297sec
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 2c:59:e5:37:cf:f4 brd ff:ff:ff:ff:ff:ff
21: ;vdsmdummy;: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:ad:f0:27:1d:63 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::6cad:f0ff:fe27:1d63/64 scope link
       valid_lft forever preferred_lft forever

So ovirt-ha-broker was able to reconnect the NFS storage domain, but it was not possible to start the engine VM due to the lack of the ovirtmgmt bridge.
2018-06-06 11:08:46,994+0300 ERROR (vm/cfc361da) [virt.vm] (vmId='cfc361da-e3fc-4a11-9674-42d9108d2776') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2789, in _run
    self._setup_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2613, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup
    displaynetwork.create_network(display_network, self.vmid)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 26, in create_network
    display_device = _display_device(netname)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 35, in _display_device
    return net_api.network_northbound(netname)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 100, in network_northbound
    return netswitch.configurator.net2northbound(network_name)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 438, in net2northbound
    is_legacy = net_attr['switch'] == legacy_switch.SWITCH_TYPE
TypeError: 'NoneType' object has no attribute '__getitem__'

This happened at the same time on all three hosts, so we completely lost the environment until we manually recovered ovirtmgmt, which wasn't that evident.

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on a DHCP-configured host.
2. Stop the DHCP server.
3. Reboot the host and wait at least 20 minutes after host boot.

Actual results:
After N minutes the management bridge disappears; when the DHCP server comes back, the IPv4 address is assigned to the base interface and the engine VM will not start due to the lack of the ovirtmgmt bridge.

Expected results:
ovirtmgmt stays on the host and the host waits for DHCP.

Additional info:
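As additional context on the TypeError in the traceback: `net2northbound` subscripts the network's attribute dict, which is None once the rollback has removed ovirtmgmt. A hedged sketch of a more defensive lookup follows; the data structure and helper names are illustrative assumptions, not VDSM's actual code:

```python
def net2northbound(running_nets, network_name):
    """Return the northbound device for a network, tolerating a missing
    entry. `running_nets` maps network names to attribute dicts; this is an
    illustrative sketch, not VDSM's real data model."""
    net_attr = running_nets.get(network_name)
    if net_attr is None:
        # The network was removed (e.g. by a failed rollback); fail with a
        # clear error instead of a TypeError on None['switch'].
        raise LookupError(
            'network %r is not configured on this host' % network_name)
    is_legacy = net_attr['switch'] == 'legacy'
    # Legacy networks expose the bridge itself; otherwise fall back to the
    # recorded northbound device (hypothetical 'nic' key for illustration).
    return network_name if is_legacy else net_attr['nic']


nets = {'ovirtmgmt': {'switch': 'legacy', 'nic': 'eno1'}}
device = net2northbound(nets, 'ovirtmgmt')
```

The point is not this exact signature but that a missing network should surface as an explicit "network not configured" error rather than an opaque NoneType failure deep in the VM start path.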