Bug 1322257 - VDSM: network setup fails as ping calls do not arrive.
Summary: VDSM: network setup fails as ping calls do not arrive.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.23.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.6.7
Target Release: ---
Assignee: Edward Haas
QA Contact: Meni Yakove
URL:
Whiteboard:
Duplicates: 1323143
Depends On:
Blocks:
 
Reported: 2016-03-30 07:16 UTC by Yaniv Kaul
Modified: 2016-05-30 06:42 UTC (History)
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-30 06:42:17 UTC
oVirt Team: Network
Embargoed:
danken: ovirt-3.6.z?
danken: ovirt-4.0.0?
rule-engine: blocker?
rule-engine: planning_ack?
danken: devel_ack+
rule-engine: testing_ack?


Attachments
journal entries filtered on 'dhcp' (11.15 KB, text/plain), 2016-03-30 07:16 UTC, Yaniv Kaul
vdsm.log (216.36 KB, text/plain), 2016-03-30 07:17 UTC, Yaniv Kaul
supervdsm log (23.85 KB, text/plain), 2016-03-31 12:56 UTC, Yaniv Kaul
vdsm.log (1.21 MB, text/plain), 2016-04-17 13:53 UTC, Yaniv Kaul
supervdsm.log (3.53 KB, text/plain), 2016-04-17 13:53 UTC, Yaniv Kaul
journalctl -r |grep -i dhcp (3.96 KB, text/plain), 2016-04-17 13:57 UTC, Yaniv Kaul

Description Yaniv Kaul 2016-03-30 07:16:29 UTC
Created attachment 1141577 [details]
journal entries filtered on 'dhcp'

Description of problem:
Trying to set up a host (in Lago) failed, leaving the host without any network.

jsonrpc.Executor/4::DEBUG::2016-03-29 23:24:33,675::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'ovirtmgmt': {u'nic': u'eth0', u'mtu': u'1500', u'bootproto': u'dhcp', u'STP': u'no', u'bridged': u'true', u'defaultRoute': True}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120}}
Reactor thread::INFO::2016-03-29 23:24:46,714::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:50678
Reactor thread::DEBUG::2016-03-29 23:24:46,718::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2016-03-29 23:24:46,718::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:50678
Reactor thread::DEBUG::2016-03-29 23:24:46,718::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 50678)
BindingXMLRPC::INFO::2016-03-29 23:24:46,718::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:50678
Thread-13::INFO::2016-03-29 23:24:46,719::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:50678 started
Thread-13::INFO::2016-03-29 23:24:46,720::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:50678 stopped
Reactor thread::INFO::2016-03-29 23:24:46,799::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:50679
Reactor thread::DEBUG::2016-03-29 23:24:46,803::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2016-03-29 23:24:46,804::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:50679
Reactor thread::DEBUG::2016-03-29 23:24:46,804::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 50679)

...


jsonrpc.Executor/4::ERROR::2016-03-29 23:26:36,219::API::1643::vds::(_rollback) connectivity check failed
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1641, in _rollback
    yield rollbackCtx
  File "/usr/share/vdsm/API.py", line 1493, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (10, 'connectivity check failed')
Reactor thread::INFO::2016-03-29 23:26:46,867::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:50687


Attached is a journal dump with the DHCP-related issues (note that at the end I ran dhclient manually to obtain an IP so I could retrieve the logs).

Version-Release number of selected component (if applicable):
4.17.23-2 on EL7 (CentOS 7.1)

How reproducible:
Sometimes.

Comment 1 Yaniv Kaul 2016-03-30 07:17:02 UTC
Created attachment 1141578 [details]
vdsm.log

Comment 2 Yaniv Kaul 2016-03-30 07:17:59 UTC
Note that the host had obtained a DHCP address previously, and when dhclient was run manually it got the DHCP address correctly as well.

Comment 3 Dan Kenigsberg 2016-03-30 09:52:03 UTC
The interesting networking bits hide in supervdsm.log. Would you attach it?

Comment 4 Yaniv Kaul 2016-03-31 12:56:09 UTC
Created attachment 1142209 [details]
supervdsm log

Comment 5 Yaniv Kaul 2016-03-31 12:57:17 UTC
Attached the supervdsm log (from a different run, but apparently a very similar scenario). Again, running 'dhclient' manually provided the IP.

Comment 6 David Caro 2016-04-01 21:46:22 UTC
*** Bug 1323143 has been marked as a duplicate of this bug. ***

Comment 7 Red Hat Bugzilla Rules Engine 2016-04-14 07:15:31 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 8 Red Hat Bugzilla Rules Engine 2016-04-14 07:23:26 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 9 Edward Haas 2016-04-17 07:00:15 UTC
From the supervdsm log, the following scenario is seen:
1. A request arrives to set up the ovirtmgmt network with DHCP enabled.
2. The request is handled; DHCP responds successfully.
3. No pings are seen from the Engine side for 120 seconds, so the connectivity check fails and a rollback is issued.
4. Following the restoration of the previous ifcfg files, the final ifup command for eth0 fails with the error: 'Device  does not seem to be present, delaying initialization.'

It is unclear why the kernel cannot see eth0 any more.
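The connectivity-check-and-rollback flow in steps 1-4 can be sketched roughly as below. This is illustrative Python only, not VDSM's actual implementation; all function and variable names here are hypothetical.

```python
class ConfigNetworkError(Exception):
    """Mirrors the error seen in the traceback above (code 10)."""

def connectivity_confirmed(ping_timestamps, setup_time, timeout):
    # The check passes if Engine pinged the host at least once within
    # `timeout` seconds after the setup request was applied.
    return any(setup_time < ts <= setup_time + timeout
               for ts in ping_timestamps)

def setup_networks(apply_config, rollback, ping_timestamps,
                   setup_time, timeout=120):
    apply_config()
    if not connectivity_confirmed(ping_timestamps, setup_time, timeout):
        # No ping arrived in time: restore the previous ifcfg files,
        # then surface the failure to the caller.
        rollback()
        raise ConfigNetworkError(10, 'connectivity check failed')
```

In this bug, the rollback step itself then failed because ifup could not bring eth0 back (the second traceback, error code 29).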

MainProcess|jsonrpc.Executor/4::DEBUG::2016-03-31 08:23:56,413::utils::671::root::(execCmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/systemd-run --scope --unit=419814f5-74d4-4677-802d-8576ccfff7e8 --slice=vdsm-dhclient /usr/sbin/ifup eth0 (cwd None)
MainProcess|jsonrpc.Executor/4::DEBUG::2016-03-31 08:23:56,449::utils::689::root::(execCmd) FAILED: <err> = 'Running scope as unit 419814f5-74d4-4677-802d-8576ccfff7e8.scope.\n'; <rc> = 1
MainProcess|jsonrpc.Executor/4::ERROR::2016-03-31 08:23:56,449::ifcfg::753::root::(start_devices) Failed to ifup device eth0 during rollback.
Traceback (most recent call last):
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 750, in start_devices
    _exec_ifup(dev)
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 812, in _exec_ifup
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, 'ERROR    : [/etc/sysconfig/network-scripts/ifup-eth] Device  does not seem to be present, delaying initialization.')

Comment 10 Edward Haas 2016-04-17 07:05:28 UTC
The supervdsm and vdsm logs are not in sync, and we see different network subnets in them: 192.168.200.0/24 in the supervdsm log and 192.168.201.0/24 in the vdsm log.

It would be helpful to see the reported network information from before the setup request, mainly to understand why the connectivity check fails (does the IP change after the DHCP lease is issued?).


Regarding the eth0 disappearance:
Is the eth0 a virtio device?

Comment 11 Yaniv Kaul 2016-04-17 08:33:50 UTC
(In reply to Edward Haas from comment #10)
> supervdsm & vdsm logs are not in sync and we see different network subnets
> in them: 192.168.200.0/24 in supervdsm log and 192.168.201.0/24 in vdsm log.
> 
> It will be helpful if we can see the reported network information before the
> setup request. Mainly to understand why the connectivity check fails (IP
> changes after the dhcp is issued?).

Will provide better logs - possibly they were not from the same run!

> 
> 
> Regarding the eth0 disappearance:
> Is the eth0 a virtio device?

Indeed.

Comment 12 Yaniv Kaul 2016-04-17 13:53:23 UTC
Created attachment 1148068 [details]
vdsm.log

Comment 13 Yaniv Kaul 2016-04-17 13:53:46 UTC
Created attachment 1148069 [details]
supervdsm.log

Comment 14 Yaniv Kaul 2016-04-17 13:57:11 UTC
Created attachment 1148070 [details]
journalctl -r |grep -i dhcp

Comment 15 Yaniv Kaul 2016-04-17 14:00:49 UTC
Note for the reproduction above: I set vdsConnectionTimeout on the Engine to 20 seconds (to try to prevent the infamous race there).

Comment 16 Edward Haas 2016-04-19 05:55:53 UTC
The latest logs appear to show a different problem: no setupNetworks call is seen from Engine, yet plenty of pings are still arriving.
We suspect this is a problem already detected and handled in Engine in the last week.
Could you please mention which Engine and vdsm-jsonrpc-java versions are in use?

We will need to catch the original problem with the connectivity check failure.

Comment 17 Yaniv Kaul 2016-04-19 08:26:41 UTC
(In reply to Edward Haas from comment #16)
> In the last logs there seems to be a different problem: No setupNetworks is
> seen from Engine, but there are still a lot of ping/s arriving.
> We suspect this is a problem already detected and handled in Engine in the
> last week.
> Could you please mention what Engine and vdsm-jsonrpc-java version is used?
> 

vdsm-jsonrpc-java-1.1.9-1.el6.noarch
ovirt-engine-3.6.4.1-1.el6.noarch

> We will need to catch the original problem with the connectivity check
> failure.

I have a running setup with the failure available for analysis if needed.

Comment 18 David Caro 2016-04-22 15:57:19 UTC
A workaround for this is disabling IPv6 on the hosts; see https://gerrit.ovirt.org/56511
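For reference, disabling IPv6 on an EL7 host is typically done via sysctl; a minimal sketch follows (the file name is hypothetical, and whether your environment should do this at all is the question the gerrit change above answers for Lago):

```
# /etc/sysctl.d/99-disable-ipv6.conf  (hypothetical filename)
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```

Apply with `sysctl --system` or a reboot.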

Comment 19 Edward Haas 2016-04-25 05:24:53 UTC
Two problems were originally seen: missing pings from Engine, and a rollback failure with the 'Device  does not seem to be present, delaying initialization.' error.

With the help of Gal Ben Haim in recreating the issue, and of Dan, the rollback issue is now understood.
It seems the ifcfg-eth0 file was missing the 'DEVICE' line (DEVICE="eth0"); I suspect it is missing from the image.

This is not a problem until problem #1 appears, because VDSM takes over and rewrites the files as needed.
The reason the host initially comes up at all is NetworkManager, which takes over the interface (before we disable it in the host-deploy scripts).

Adding DEVICE="eth0" to the ifcfg-eth0 file resolves the issue.
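For the record, a minimal ifcfg-eth0 including the DEVICE line would look something like the sketch below; BOOTPROTO matches the DHCP setup request in this bug, but the exact set of keys in the Lago image is an assumption.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"       # the line that was missing from the image
BOOTPROTO="dhcp"
ONBOOT="yes"
```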

Comment 20 Dan Kenigsberg 2016-04-26 12:25:34 UTC
Moving to 3.6.7, as the no-IP issue is now understood to be Lago-only. We still need to investigate the reports that pings are more likely to be consumed by Vdsm when IPv6 is disabled on the host.

Comment 21 Yaniv Kaul 2016-05-04 09:17:36 UTC
After tens of successful host installations, I have just hit the same failure again, with a 3.6.6-1 Engine and 1.1.9-1 json-rpc. I will try to reproduce it on newer versions.
The host was also stuck on 'Installing', which I thought had been resolved by now.

Comment 22 Yaniv Lavi 2016-05-16 12:42:46 UTC
Can this be closed?

Comment 23 Dan Kenigsberg 2016-05-17 15:03:44 UTC
(In reply to Yaniv Dary from comment #22)
> Can this be closed?

Yes, but I would give QE the honours.

Comment 24 Dan Kenigsberg 2016-05-30 06:42:17 UTC
QE declined the honours.

