Created attachment 1141577 [details]
journal entries filtered on 'dhcp'

Description of problem:
Trying to set up a host (in Lago) failed, leaving the host without any network.

jsonrpc.Executor/4::DEBUG::2016-03-29 23:24:33,675::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'ovirtmgmt': {u'nic': u'eth0', u'mtu': u'1500', u'bootproto': u'dhcp', u'STP': u'no', u'bridged': u'true', u'defaultRoute': True}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120}}
Reactor thread::INFO::2016-03-29 23:24:46,714::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:50678
Reactor thread::DEBUG::2016-03-29 23:24:46,718::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2016-03-29 23:24:46,718::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:50678
Reactor thread::DEBUG::2016-03-29 23:24:46,718::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 50678)
BindingXMLRPC::INFO::2016-03-29 23:24:46,718::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:50678
Thread-13::INFO::2016-03-29 23:24:46,719::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:50678 started
Thread-13::INFO::2016-03-29 23:24:46,720::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:50678 stopped
Reactor thread::INFO::2016-03-29 23:24:46,799::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:50679
Reactor thread::DEBUG::2016-03-29 23:24:46,803::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2016-03-29 23:24:46,804::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:50679
Reactor thread::DEBUG::2016-03-29 23:24:46,804::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 50679)
...
jsonrpc.Executor/4::ERROR::2016-03-29 23:26:36,219::API::1643::vds::(_rollback) connectivity check failed
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1641, in _rollback
    yield rollbackCtx
  File "/usr/share/vdsm/API.py", line 1493, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (10, 'connectivity check failed')
Reactor thread::INFO::2016-03-29 23:26:46,867::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:50687

Attached journal dump with DHCP-related issues (note that at the end I ran dhclient manually to get an IP so I could collect the logs).

Version-Release number of selected component (if applicable):
4.17.23-2 on EL7 (CentOS 7.1)

How reproducible:
Sometimes.
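For readability, here are the setupNetworks parameters from the logged call above, re-indented (same content as in the log line, nothing added):

{
    "bondings": {},
    "networks": {
        "ovirtmgmt": {
            "nic": "eth0",
            "mtu": "1500",
            "bootproto": "dhcp",
            "STP": "no",
            "bridged": "true",
            "defaultRoute": true
        }
    },
    "options": {
        "connectivityCheck": "true",
        "connectivityTimeout": 120
    }
}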
Created attachment 1141578 [details] vdsm.log
Note that the host had a DHCP address previously, and when dhclient was run manually it got the DHCP address correctly too.
The interesting networking bits hide in supervdsm.log. Would you attach it?
Created attachment 1142209 [details] supervdsm log
Attached supervdsm log (from a different run, but very similar scenario, apparently). Again, running 'dhclient' manually provided the IP.
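For reference, the manual recovery mentioned above was roughly the following (a sketch, assuming eth0 as in the setup request):

# run the DHCP client by hand on the management NIC and verify the lease
dhclient eth0
ip addr show eth0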
*** Bug 1323143 has been marked as a duplicate of this bug. ***
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
From the supervdsm log, the following scenario is seen:
1. A request to set up the ovirtmgmt network with dhcp enabled.
2. The request is handled, and dhcp responds successfully.
3. No ping/s are seen from the Engine side for 120sec, therefore the connectivity check fails and a rollback is issued.
4. Following the restoration of the previous ifcfg files, the last ifup command for eth0 fails with the error: 'Device does not seem to be present, delaying initialization.'

It is unclear why the kernel cannot see eth0 any more.

MainProcess|jsonrpc.Executor/4::DEBUG::2016-03-31 08:23:56,413::utils::671::root::(execCmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/systemd-run --scope --unit=419814f5-74d4-4677-802d-8576ccfff7e8 --slice=vdsm-dhclient /usr/sbin/ifup eth0 (cwd None)
MainProcess|jsonrpc.Executor/4::DEBUG::2016-03-31 08:23:56,449::utils::689::root::(execCmd) FAILED: <err> = 'Running scope as unit 419814f5-74d4-4677-802d-8576ccfff7e8.scope.\n'; <rc> = 1
MainProcess|jsonrpc.Executor/4::ERROR::2016-03-31 08:23:56,449::ifcfg::753::root::(start_devices) Failed to ifup device eth0 during rollback.
Traceback (most recent call last):
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 750, in start_devices
    _exec_ifup(dev)
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 812, in _exec_ifup
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, 'ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Device does not seem to be present, delaying initialization.')
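For anyone trying to reproduce the rollback step by hand, the failing command is the same ifup invocation seen in the log above (a sketch only; the scope unit name here is arbitrary, and it assumes the restored ifcfg-eth0 is already in place):

# re-run the ifup that VDSM issues during rollback, in a transient scope as in the log
/usr/bin/systemd-run --scope --unit=test-rollback-ifup --slice=vdsm-dhclient /usr/sbin/ifup eth0
# in the failing case this exits with rc=1 and prints:
# ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Device does not seem to be present, delaying initialization.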
supervdsm & vdsm logs are not in sync and we see different network subnets in them: 192.168.200.0/24 in supervdsm log and 192.168.201.0/24 in vdsm log. It will be helpful if we can see the reported network information before the setup request. Mainly to understand why the connectivity check fails (IP changes after the dhcp is issued?). Regarding the eth0 disappearance: Is the eth0 a virtio device?
(In reply to Edward Haas from comment #10)
> supervdsm & vdsm logs are not in sync and we see different network subnets
> in them: 192.168.200.0/24 in supervdsm log and 192.168.201.0/24 in vdsm log.
>
> It will be helpful if we can see the reported network information before the
> setup request. Mainly to understand why the connectivity check fails (IP
> changes after the dhcp is issued?).

Will provide better logs - possibly they were not from the same run!

> Regarding the eth0 disappearance:
> Is the eth0 a virtio device?

Indeed.
Created attachment 1148068 [details] vdsm.log
Created attachment 1148069 [details] supervdsm.log
Created attachment 1148070 [details] journalctl -r |grep -i dhcp
Note for the above reproduction - I've set vdsConnectionTimeout on the engine to 20 seconds (to try to prevent the infamous race there).
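For reference, the timeout change was applied with the standard engine-config tool, roughly like this (a sketch; the key name is the one mentioned above, and the restart command assumes an el6 engine):

# set the connection timeout to 20 seconds and restart the engine to apply it
engine-config -s vdsConnectionTimeout=20
service ovirt-engine restart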
In the last logs there seems to be a different problem: no setupNetworks is seen from Engine, but there are still a lot of ping/s arriving. We suspect this is a problem already detected and handled in Engine in the last week.

Could you please mention which Engine and vdsm-jsonrpc-java versions are used?

We will need to catch the original problem with the connectivity check failure.
(In reply to Edward Haas from comment #16)
> In the last logs there seems to be a different problem: no setupNetworks is
> seen from Engine, but there are still a lot of ping/s arriving.
> We suspect this is a problem already detected and handled in Engine in the
> last week.
>
> Could you please mention which Engine and vdsm-jsonrpc-java versions are used?

vdsm-jsonrpc-java-1.1.9-1.el6.noarch
ovirt-engine-3.6.4.1-1.el6.noarch

> We will need to catch the original problem with the connectivity check
> failure.

I have a running setup with the failure available for analysis if needed.
So a workaround for this is disabling ipv6 on the hosts; see https://gerrit.ovirt.org/56511
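For hosts where the workaround is applied by hand, disabling ipv6 typically looks like the sketch below (my assumption; the exact change made by the gerrit patch above may differ):

# disable IPv6 on all interfaces at runtime
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
# persist across reboots
echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.d/99-disable-ipv6.conf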
Two problems were originally seen: missing pings from Engine, and a failure to roll back with the 'Device does not seem to be present, delaying initialization.' error.

With the help of Gal Ben Haim (who recreated the issue) and Dan, the rollback issue has been understood. It seems the ifcfg-eth0 file was missing the 'DEVICE' line (DEVICE="eth0"); I suspect it is missing from the image. This is not a problem until problem #1 appears, because VDSM takes over and re-writes the files as needed. The reason the host initially comes up at all is that NetworkManager takes over the interface (before we disable it in the host deploy scripts).

Adding DEVICE="eth0" to the ifcfg-eth0 script file resolves the issue.
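For completeness, a minimal /etc/sysconfig/network-scripts/ifcfg-eth0 with the missing line added would look roughly like this (only the DEVICE line is the actual fix; the other fields are assumptions matching the dhcp setup request):

# the line that was missing from the image
DEVICE="eth0"
BOOTPROTO=dhcp
ONBOOT=yes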
Moving to 3.6.7, as the no-IP issue is now understood to be a Lago-only one. We still need to research the reports, as it seems pings are more likely to be consumed by Vdsm if ipv6 is disabled on the host.
I've just had, after tens of successful host installations, the same failure, with 3.6.6-1 engine and 1.1.9-1 json-rpc. Will try to reproduce in newer versions. Host was also stuck on 'Installing', which I thought should have been resolved by now.
Can this be closed?
(In reply to Yaniv Dary from comment #22)
> Can this be closed?

Yes, but I would give QE the honours.
QE declined the honours.