Bug 1479666

Summary: Add host failing every few attempts, failing to renew dhcp address on top of ovirtmgmt
Product: [oVirt] ovirt-engine Reporter: Michael Burman <mburman>
Component: BLL.InfraAssignee: Edward Haas <edwardh>
Status: CLOSED CANTFIX QA Contact: Meni Yakove <myakove>
Severity: high Docs Contact:
Priority: low    
Version: 4.2.0CC: bugs, danken, edwardh, mburman, mperina, oourfali, talayan, ylavi
Target Milestone: ovirt-4.2.0Keywords: Automation
Target Release: ---Flags: rule-engine: ovirt-4.2+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-11-09 10:28:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
add host failed1 logs
none
add host failed2 logs
none
new logs from 4.1.6.2 failure
none
new logs from 4.1.7-3 failure none

Description Michael Burman 2017-08-09 06:16:07 UTC
Created attachment 1311006 [details]
add host failed1 logs

Description of problem:
Add host failing every few attempts - $NetworkConfiguratorException: Failed to configure management network.

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170807080527.git5e7f780.el7.centos

How reproducible:
Every few attempts (4/6)

Steps to Reproduce:
1. Add host to master

Actual results:
Sometimes failed, host non-operational

Expected results:
Should work

Comment 1 Michael Burman 2017-08-09 06:16:32 UTC
Created attachment 1311007 [details]
add host failed2 logs

Comment 2 Oved Ourfali 2017-08-09 06:23:03 UTC
Moving to network for initial investigation.
If it turns out as infra, contact us.

Comment 3 Red Hat Bugzilla Rules Engine 2017-08-09 07:27:51 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 4 Dan Kenigsberg 2017-09-13 15:29:25 UTC
Bug 1463629 may be related.

Comment 5 Alona Kaplan 2017-10-16 13:47:20 UTC
Hi Michael,

Do you know what attribute on the host is out-of-sync? (During the 1min the host is out-of-sync, you can open the setup networks dialog and check it).

Comment 6 Alona Kaplan 2017-10-16 13:48:48 UTC
(In reply to Alona Kaplan from comment #5)
> Hi Michael,
> 
> Do you know what attribute on the host is out-of-sync? (During the 1min the
> host is out-of-sync, you can open the setup networks dialog and check it).

Please ignore.

Comment 7 Alona Kaplan 2017-10-16 15:04:44 UTC
this add-host failed because ifup ovirtmgmt has failed. most probably because the dhcp server did not respond on time.

ifup/ovirtmgmt::DEBUG::2017-08-09 08:51:47,404::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/systemd-run --scope --unit=484f8b16-35ed-4ec7-988f-c0bb0eed21b1 --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,408::ifcfg::426::root::(_atomicNetworkBackup) Backed up ovirtmgmt
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,537::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,545::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,545::connectivity::42::root::(check) Checking connectivity...
MainProcess|jsonrpc/0::DEBUG::2017-08-09 08:51:57,810::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call ksmTune with () {}
MainProcess|jsonrpc/0::DEBUG::2017-08-09 08:51:57,811::supervdsmServer::100::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
ifup/ovirtmgmt::DEBUG::2017-08-09 08:52:48,351::commands::93::root::(execCmd) FAILED: <err> = 'Running scope as unit 484f8b16-35ed-4ec7-988f-c0bb0eed21b1.scope.\n'; <rc> = 1
ifup/ovirtmgmt::ERROR::2017-08-09 08:52:48,352::utils::371::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 180, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 926, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 912, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, 'Determining IP information for ovirtmgmt... failed.')

Comment 8 Dan Kenigsberg 2017-10-16 15:09:10 UTC
Same problem here: ifup fails after 1 minute of waiting.

ifup/ovirtmgmt::DEBUG::2017-08-09 09:03:15,637::cmd::56::root::(exec_sync_bytes) /usr/bin/systemd-run --scope --unit=930d1da0-74ec-4226-8687-8543bf07fe8a --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,682::cmd::56::root::(exec_sync_bytes) /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,687::cmd::63::root::(exec_sync_bytes) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,688::connectivity::44::root::(check) Checking connectivity...
MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::63::root::(exec_sync_bytes) FAILED: <err> = ''; <rc> = 1
MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::56::root::(exec_sync_bytes) /usr/sbin/lldptool set-lldp -i enp6s0 adminStatus=rx (cwd None)
MainThread::DEBUG::2017-08-09 09:03:18,482::cmd::63::root::(exec_sync_bytes) FAILED: <err> = ''; <rc> = 1
MainThread::ERROR::2017-08-09 09:03:18,483::initializer::53::root::(_lldp_init) Failed to enable LLDP on enp6s0
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init
    Lldp.enable_lldp_on_iface(device)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface
    lldptool.enable_lldp_on_iface(iface, rx_only)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface
    raise EnableLldpError(rc, out, err, iface)
EnableLldpError: (1, "timeout\n'M00000001C3040000000c06enp6s0000badminStatus0002rx' command timed out.\n", '', 'enp6s0')
MainProcess|jsonrpc/2::DEBUG::2017-08-09 09:03:29,175::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call ksmTune with () {}
MainProcess|jsonrpc/2::DEBUG::2017-08-09 09:03:29,175::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
ifup/ovirtmgmt::DEBUG::2017-08-09 09:04:16,261::cmd::63::root::(exec_sync_bytes) FAILED: <err> = 'Running scope as unit 930d1da0-74ec-4226-8687-8543bf07fe8a.scope.\n'; <rc> = 1
ifup/ovirtmgmt::ERROR::2017-08-09 09:04:16,262::concurrent::198::root::(run) FINISH thread <Thread(ifup/ovirtmgmt, started daemon 140692038866688)> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 884, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 870, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, '\n')


Burman, can you see this regression when you are adding 4.1 hosts to a 4.2 manager? I would like to narrow down the problematic area; is it something we did in vdsm, or is it a slower DHCP server.

Comment 9 Michael Burman 2017-10-17 06:23:49 UTC
(In reply to Dan Kenigsberg from comment #8)
> Same problem here: ifup fails after 1 minute of waiting.
> 
> ifup/ovirtmgmt::DEBUG::2017-08-09
> 09:03:15,637::cmd::56::root::(exec_sync_bytes) /usr/bin/systemd-run --scope
> --unit=930d1da0-74ec-4226-8687-8543bf07fe8a --slice=vdsm-dhclient /sbin/ifup
> ovirtmgmt (cwd None)
> MainProcess|jsonrpc/3::DEBUG::2017-08-09
> 09:03:15,682::cmd::56::root::(exec_sync_bytes) /sbin/tc qdisc show (cwd None)
> MainProcess|jsonrpc/3::DEBUG::2017-08-09
> 09:03:15,687::cmd::63::root::(exec_sync_bytes) SUCCESS: <err> = ''; <rc> = 0
> MainProcess|jsonrpc/3::DEBUG::2017-08-09
> 09:03:15,688::connectivity::44::root::(check) Checking connectivity...
> MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::63::root::(exec_sync_bytes)
> FAILED: <err> = ''; <rc> = 1
> MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::56::root::(exec_sync_bytes)
> /usr/sbin/lldptool set-lldp -i enp6s0 adminStatus=rx (cwd None)
> MainThread::DEBUG::2017-08-09 09:03:18,482::cmd::63::root::(exec_sync_bytes)
> FAILED: <err> = ''; <rc> = 1
> MainThread::ERROR::2017-08-09
> 09:03:18,483::initializer::53::root::(_lldp_init) Failed to enable LLDP on
> enp6s0
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line
> 51, in _lldp_init
>     Lldp.enable_lldp_on_iface(device)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line
> 30, in enable_lldp_on_iface
>     lldptool.enable_lldp_on_iface(iface, rx_only)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py",
> line 46, in enable_lldp_on_iface
>     raise EnableLldpError(rc, out, err, iface)
> EnableLldpError: (1,
> "timeout\n'M00000001C3040000000c06enp6s0000badminStatus0002rx' command timed
> out.\n", '', 'enp6s0')
> MainProcess|jsonrpc/2::DEBUG::2017-08-09
> 09:03:29,175::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call
> ksmTune with () {}
> MainProcess|jsonrpc/2::DEBUG::2017-08-09
> 09:03:29,175::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper)
> return ksmTune with None
> ifup/ovirtmgmt::DEBUG::2017-08-09
> 09:04:16,261::cmd::63::root::(exec_sync_bytes) FAILED: <err> = 'Running
> scope as unit 930d1da0-74ec-4226-8687-8543bf07fe8a.scope.\n'; <rc> = 1
> ifup/ovirtmgmt::ERROR::2017-08-09 09:04:16,262::concurrent::198::root::(run)
> FINISH thread <Thread(ifup/ovirtmgmt, started daemon 140692038866688)> failed
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line
> 191, in run
>     ret = func(*args, **kwargs)
>   File
> "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line
> 884, in _exec_ifup
>     _exec_ifup_by_name(iface.name, cgroup)
>   File
> "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line
> 870, in _exec_ifup_by_name
>     raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
> ConfigNetworkError: (29, '\n')
> 
> 
> Burman, can you see this regression when you are adding 4.1 hosts to a 4.2
> manager? I would like to narrow down the problematic area; is it something
> we did in vdsm, or is it a slower DHCP server.

Dan, i can't do that as i'm blocked by BZ 1490274

Comment 10 Michael Burman 2017-10-18 08:39:19 UTC
Dan, not sure if it's relevant, but i just managed to reproduce this bug on 4.1.6.2 with vdsm-4.19.34-1.el7ev.x86_64

Comment 11 Michael Burman 2017-10-18 08:40:14 UTC
Created attachment 1340042 [details]
new logs from 4.1.6.2 failure

Comment 12 Dan Kenigsberg 2017-10-18 18:58:55 UTC
Given the recent reports, this is not a regression nor blocker. though we'd like to understand why this happens more often.

Comment 13 Dan Kenigsberg 2017-10-24 07:39:30 UTC
I don't think we can do anything about this, or do we Edy?

Comment 14 Michael Burman 2017-10-24 11:43:14 UTC
Another reproduction from rhv-qe on 4.1.7-3
Host remain in non-responsive state for ever.

Comment 15 Michael Burman 2017-10-24 11:44:21 UTC
Created attachment 1342699 [details]
new logs from 4.1.7-3 failure

Comment 16 Edward Haas 2017-10-29 15:15:49 UTC
(In reply to Dan Kenigsberg from comment #13)
> I don't think we can do anything about this, or do we Edy?

If the dhcp response takes too long, we do not have many options.
At least not with the architecture and tools we got today.

The only future idea I can come with is something like this...
For the management network, with an original dynamic IP address, the move to a bridge base can be done as follow:
- Read the IP settings defined.
- Teardown the interface and create over it the bridge.
- Define the saved IP address manually (through iproute2 or netlink) and mark it as "dynamic" with the same lft it was before the teardown.
- Run dhclient on the bridge.


But it is tricky enough to want to avoid it.
The disadvantage of DHCP on a management network is exactly this scenario.
On one hand the addressing is managed by the DHCP server, but on the other hand its failure serve the network makes it a vulnerability of the network.

Thanks,
Edy.

Comment 17 Edward Haas 2017-11-08 16:35:45 UTC
Are you fine with closing this one?

Comment 18 Michael Burman 2017-11-09 09:01:56 UTC
(In reply to Edward Haas from comment #17)
> Are you fine with closing this one?

Hi Edy,
In comments# 7,8 and 16 you saying that there is no much to do with this bug in case of slow dhcp response. Given that, we are fine with closing it.

Comment 19 Edward Haas 2017-11-09 10:28:01 UTC
If anyone will have an idea on how to handle this scenario please raise it.
Closing the BZ for now.