Bug 1479666 - Add host failing every few attempts, failing to renew dhcp address on top of ovirtmgmt
Status: CLOSED CANTFIX
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.2.0
Hardware: x86_64 Linux
Priority: low
Severity: high
Target Milestone: ovirt-4.2.0
Assigned To: Edward Haas
QA Contact: Meni Yakove
Whiteboard: Automation
Reported: 2017-08-09 02:16 EDT by Michael Burman
Modified: 2017-11-09 05:28 EST

Last Closed: 2017-11-09 05:28:01 EST
Type: Bug
oVirt Team: Network
Flags: rule-engine: ovirt-4.2+


Attachments:
- add host failed1 logs (167.30 KB, application/x-gzip), 2017-08-09 02:16 EDT, Michael Burman
- add host failed2 logs (188.38 KB, application/x-gzip), 2017-08-09 02:16 EDT, Michael Burman
- new logs from 4.1.6.2 failure (114.57 KB, application/x-gzip), 2017-10-18 04:40 EDT, Michael Burman
- new logs from 4.1.7-3 failure (7.11 MB, application/x-gzip), 2017-10-24 07:44 EDT, Michael Burman

Description Michael Burman 2017-08-09 02:16:07 EDT
Created attachment 1311006 [details]
add host failed1 logs

Description of problem:
Add host failing every few attempts - $NetworkConfiguratorException: Failed to configure management network.

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170807080527.git5e7f780.el7.centos

How reproducible:
Every few attempts (4/6)

Steps to Reproduce:
1. Add host to master

Actual results:
The add-host flow sometimes fails and the host becomes non-operational.

Expected results:
Adding the host should succeed.
Comment 1 Michael Burman 2017-08-09 02:16 EDT
Created attachment 1311007 [details]
add host failed2 logs
Comment 2 Oved Ourfali 2017-08-09 02:23:03 EDT
Moving to network for initial investigation.
If it turns out to be an infra issue, contact us.
Comment 3 Red Hat Bugzilla Rules Engine 2017-08-09 03:27:51 EDT
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Comment 4 Dan Kenigsberg 2017-09-13 11:29:25 EDT
Bug 1463629 may be related.
Comment 5 Alona Kaplan 2017-10-16 09:47:20 EDT
Hi Michael,

Do you know what attribute on the host is out-of-sync? (During the 1min the host is out-of-sync, you can open the setup networks dialog and check it).
Comment 6 Alona Kaplan 2017-10-16 09:48:48 EDT
(In reply to Alona Kaplan from comment #5)
> Hi Michael,
> 
> Do you know what attribute on the host is out-of-sync? (During the 1min the
> host is out-of-sync, you can open the setup networks dialog and check it).

Please ignore.
Comment 7 Alona Kaplan 2017-10-16 11:04:44 EDT
This add-host attempt failed because ifup ovirtmgmt failed, most probably because the DHCP server did not respond in time.

ifup/ovirtmgmt::DEBUG::2017-08-09 08:51:47,404::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/systemd-run --scope --unit=484f8b16-35ed-4ec7-988f-c0bb0eed21b1 --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,408::ifcfg::426::root::(_atomicNetworkBackup) Backed up ovirtmgmt
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,537::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,545::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,545::connectivity::42::root::(check) Checking connectivity...
MainProcess|jsonrpc/0::DEBUG::2017-08-09 08:51:57,810::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call ksmTune with () {}
MainProcess|jsonrpc/0::DEBUG::2017-08-09 08:51:57,811::supervdsmServer::100::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
ifup/ovirtmgmt::DEBUG::2017-08-09 08:52:48,351::commands::93::root::(execCmd) FAILED: <err> = 'Running scope as unit 484f8b16-35ed-4ec7-988f-c0bb0eed21b1.scope.\n'; <rc> = 1
ifup/ovirtmgmt::ERROR::2017-08-09 08:52:48,352::utils::371::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 180, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 926, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 912, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, 'Determining IP information for ovirtmgmt... failed.')
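
For whoever wants to confirm the slow-DHCP theory by hand, a minimal check (outside vdsm, and disruptive to the management network, so only on a host that is safe to disturb) is to time the same ifup that vdsm runs, or a one-shot dhclient:

# Hedged sketch, not vdsm code: time the ifup that the log above shows failing.
time /sbin/ifup ovirtmgmt
# Or time only the DHCP exchange itself with a single-attempt, verbose dhclient run.
time dhclient -1 -v ovirtmgmt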
Comment 8 Dan Kenigsberg 2017-10-16 11:09:10 EDT
Same problem here: ifup fails after 1 minute of waiting.

ifup/ovirtmgmt::DEBUG::2017-08-09 09:03:15,637::cmd::56::root::(exec_sync_bytes) /usr/bin/systemd-run --scope --unit=930d1da0-74ec-4226-8687-8543bf07fe8a --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,682::cmd::56::root::(exec_sync_bytes) /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,687::cmd::63::root::(exec_sync_bytes) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,688::connectivity::44::root::(check) Checking connectivity...
MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::63::root::(exec_sync_bytes) FAILED: <err> = ''; <rc> = 1
MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::56::root::(exec_sync_bytes) /usr/sbin/lldptool set-lldp -i enp6s0 adminStatus=rx (cwd None)
MainThread::DEBUG::2017-08-09 09:03:18,482::cmd::63::root::(exec_sync_bytes) FAILED: <err> = ''; <rc> = 1
MainThread::ERROR::2017-08-09 09:03:18,483::initializer::53::root::(_lldp_init) Failed to enable LLDP on enp6s0
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init
    Lldp.enable_lldp_on_iface(device)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface
    lldptool.enable_lldp_on_iface(iface, rx_only)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface
    raise EnableLldpError(rc, out, err, iface)
EnableLldpError: (1, "timeout\n'M00000001C3040000000c06enp6s0000badminStatus0002rx' command timed out.\n", '', 'enp6s0')
MainProcess|jsonrpc/2::DEBUG::2017-08-09 09:03:29,175::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call ksmTune with () {}
MainProcess|jsonrpc/2::DEBUG::2017-08-09 09:03:29,175::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
ifup/ovirtmgmt::DEBUG::2017-08-09 09:04:16,261::cmd::63::root::(exec_sync_bytes) FAILED: <err> = 'Running scope as unit 930d1da0-74ec-4226-8687-8543bf07fe8a.scope.\n'; <rc> = 1
ifup/ovirtmgmt::ERROR::2017-08-09 09:04:16,262::concurrent::198::root::(run) FINISH thread <Thread(ifup/ovirtmgmt, started daemon 140692038866688)> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 884, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 870, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, '\n')


Burman, can you see this regression when you are adding 4.1 hosts to a 4.2 manager? I would like to narrow down the problematic area: is it something we did in vdsm, or is it a slower DHCP server?
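
(A hedged suggestion for telling the two apart: capture the DHCP traffic on the NIC behind ovirtmgmt while the host is being added; enp6s0 is the NIC name from the log above.)

# If the DHCPDISCOVER/REQUEST packets go out but the OFFER/ACK comes back late
# (or not at all), the delay is on the DHCP server side rather than in vdsm.
tcpdump -nni enp6s0 -vv 'port 67 or port 68'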
Comment 9 Michael Burman 2017-10-17 02:23:49 EDT
(In reply to Dan Kenigsberg from comment #8)
> Same problem here: ifup fails after 1 minute of waiting.
> [...]
> Burman, can you see this regression when you are adding 4.1 hosts to a 4.2
> manager? I would like to narrow down the problematic area: is it something
> we did in vdsm, or is it a slower DHCP server?

Dan, I can't do that as I'm blocked by BZ 1490274.
Comment 10 Michael Burman 2017-10-18 04:39:19 EDT
Dan, not sure if it's relevant, but I just managed to reproduce this bug on 4.1.6.2 with vdsm-4.19.34-1.el7ev.x86_64.
Comment 11 Michael Burman 2017-10-18 04:40 EDT
Created attachment 1340042 [details]
new logs from 4.1.6.2 failure
Comment 12 Dan Kenigsberg 2017-10-18 14:58:55 EDT
Given the recent reports, this is neither a regression nor a blocker, though we would like to understand why it happens more often.
Comment 13 Dan Kenigsberg 2017-10-24 03:39:30 EDT
I don't think we can do anything about this, or can we, Edy?
Comment 14 Michael Burman 2017-10-24 07:43:14 EDT
Another reproduction from rhv-qe on 4.1.7-3.
The host remains in a non-responsive state forever.
Comment 15 Michael Burman 2017-10-24 07:44 EDT
Created attachment 1342699 [details]
new logs from 4.1.7-3 failure
Comment 16 Edward Haas 2017-10-29 11:15:49 EDT
(In reply to Dan Kenigsberg from comment #13)
> I don't think we can do anything about this, or can we, Edy?

If the DHCP response takes too long, we do not have many options,
at least not with the architecture and tools we have today.

The only future idea I can come up with is something like the following.
For a management network that starts out with a dynamic IP address, the move onto a bridge could be done as follows (a rough sketch follows the list):
- Read the currently defined IP settings.
- Tear down the interface and create the bridge on top of it.
- Define the saved IP address manually (through iproute2 or netlink) and mark it as "dynamic" with the same lifetime it had before the teardown.
- Run dhclient on the bridge.
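
A rough sketch of that flow with plain iproute2/dhclient commands; the NIC name, address and lifetimes are placeholders for illustration only, not values from this setup:

# Illustration only; 192.0.2.10/24 and the lifetimes are made-up examples.
ip -o -4 addr show dev enp6s0    # note the address and its remaining valid_lft/preferred_lft
ip link set enp6s0 down
ip link add ovirtmgmt type bridge
ip link set enp6s0 master ovirtmgmt
ip link set enp6s0 up
ip link set ovirtmgmt up
# A finite valid_lft keeps the kernel treating the address as dynamic (non-permanent).
ip addr add 192.0.2.10/24 dev ovirtmgmt valid_lft 3200 preferred_lft 3200
# Finally, let dhclient take over renewals on the bridge.
dhclient -pf /run/dhclient-ovirtmgmt.pid ovirtmgmt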


But it is tricky enough that we would rather avoid it.
The disadvantage of DHCP on a management network is exactly this scenario:
on the one hand the addressing is managed by the DHCP server, but on the other hand the server's failure to serve the network leaves the whole network vulnerable.

Thanks,
Edy.
Comment 17 Edward Haas 2017-11-08 11:35:45 EST
Are you fine with closing this one?
Comment 18 Michael Burman 2017-11-09 04:01:56 EST
(In reply to Edward Haas from comment #17)
> Are you fine with closing this one?

Hi Edy,
In comments #7, #8 and #16 you say that there is not much that can be done about this bug in the case of a slow DHCP response. Given that, we are fine with closing it.
Comment 19 Edward Haas 2017-11-09 05:28:01 EST
If anyone has an idea on how to handle this scenario, please raise it.
Closing the BZ for now.
