Bug 1479666
| Summary: | Add host failing every few attempts, failing to renew dhcp address on top of ovirtmgmt |  |  |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Michael Burman <mburman> |
| Component: | BLL.Infra | Assignee: | Edward Haas <edwardh> |
| Status: | CLOSED CANTFIX | QA Contact: | Meni Yakove <myakove> |
| Severity: | high | Priority: | low |
| Version: | 4.2.0 | CC: | bugs, danken, edwardh, mburman, mperina, oourfali, talayan, ylavi |
| Target Milestone: | ovirt-4.2.0 | Keywords: | Automation |
| Target Release: | --- | Flags: | rule-engine: ovirt-4.2+ |
| Hardware: | x86_64 | OS: | Linux |
| Doc Type: | If docs needed, set a value | Story Points: | --- |
| Last Closed: | 2017-11-09 10:28:01 UTC | Type: | Bug |
| Regression: | --- | oVirt Team: | Network |
Created attachment 1311007 [details]
add host failed2 logs
Moving to network for initial investigation. If it turns out to be infra, contact us.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Bug 1463629 may be related.

Hi Michael,
Do you know what attribute on the host is out-of-sync? (During the 1 min the host is out-of-sync, you can open the Setup Networks dialog and check it.)

(In reply to Alona Kaplan from comment #5)
> Hi Michael,
>
> Do you know what attribute on the host is out-of-sync? (During the 1min the
> host is out-of-sync, you can open the setup networks dialog and check it).

Please ignore.

This add-host attempt failed because ifup ovirtmgmt failed, most probably because the DHCP server did not respond in time.

ifup/ovirtmgmt::DEBUG::2017-08-09 08:51:47,404::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/systemd-run --scope --unit=484f8b16-35ed-4ec7-988f-c0bb0eed21b1 --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,408::ifcfg::426::root::(_atomicNetworkBackup) Backed up ovirtmgmt
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,537::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,545::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/2::DEBUG::2017-08-09 08:51:47,545::connectivity::42::root::(check) Checking connectivity...
MainProcess|jsonrpc/0::DEBUG::2017-08-09 08:51:57,810::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call ksmTune with () {}
MainProcess|jsonrpc/0::DEBUG::2017-08-09 08:51:57,811::supervdsmServer::100::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
ifup/ovirtmgmt::DEBUG::2017-08-09 08:52:48,351::commands::93::root::(execCmd) FAILED: <err> = 'Running scope as unit 484f8b16-35ed-4ec7-988f-c0bb0eed21b1.scope.\n'; <rc> = 1
ifup/ovirtmgmt::ERROR::2017-08-09 08:52:48,352::utils::371::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 180, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 926, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 912, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, 'Determining IP information for ovirtmgmt... failed.')

Same problem here: ifup fails after 1 minute of waiting.

ifup/ovirtmgmt::DEBUG::2017-08-09 09:03:15,637::cmd::56::root::(exec_sync_bytes) /usr/bin/systemd-run --scope --unit=930d1da0-74ec-4226-8687-8543bf07fe8a --slice=vdsm-dhclient /sbin/ifup ovirtmgmt (cwd None)
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,682::cmd::56::root::(exec_sync_bytes) /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,687::cmd::63::root::(exec_sync_bytes) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/3::DEBUG::2017-08-09 09:03:15,688::connectivity::44::root::(check) Checking connectivity...
MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::63::root::(exec_sync_bytes) FAILED: <err> = ''; <rc> = 1
MainThread::DEBUG::2017-08-09 09:03:16,476::cmd::56::root::(exec_sync_bytes) /usr/sbin/lldptool set-lldp -i enp6s0 adminStatus=rx (cwd None)
MainThread::DEBUG::2017-08-09 09:03:18,482::cmd::63::root::(exec_sync_bytes) FAILED: <err> = ''; <rc> = 1
MainThread::ERROR::2017-08-09 09:03:18,483::initializer::53::root::(_lldp_init) Failed to enable LLDP on enp6s0
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init
    Lldp.enable_lldp_on_iface(device)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface
    lldptool.enable_lldp_on_iface(iface, rx_only)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface
    raise EnableLldpError(rc, out, err, iface)
EnableLldpError: (1, "timeout\n'M00000001C3040000000c06enp6s0000badminStatus0002rx' command timed out.\n", '', 'enp6s0')
MainProcess|jsonrpc/2::DEBUG::2017-08-09 09:03:29,175::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call ksmTune with () {}
MainProcess|jsonrpc/2::DEBUG::2017-08-09 09:03:29,175::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
ifup/ovirtmgmt::DEBUG::2017-08-09 09:04:16,261::cmd::63::root::(exec_sync_bytes) FAILED: <err> = 'Running scope as unit 930d1da0-74ec-4226-8687-8543bf07fe8a.scope.\n'; <rc> = 1
ifup/ovirtmgmt::ERROR::2017-08-09 09:04:16,262::concurrent::198::root::(run) FINISH thread <Thread(ifup/ovirtmgmt, started daemon 140692038866688)> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 884, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 870, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, '\n')

Burman, can you see this regression when you are adding 4.1 hosts to a 4.2 manager? I would like to narrow down the problematic area: is it something we did in vdsm, or is it a slower DHCP server?

(In reply to Dan Kenigsberg from comment #8)
> Same problem here: ifup fails after 1 minute of waiting.
> [ifup/ovirtmgmt failure log quoted in full above; snipped]
> Burman, can you see this regression when you are adding 4.1 hosts to a 4.2
> manager? I would like to narrow down the problematic area; is it something
> we did in vdsm, or is it a slower DHCP server.

Dan, I can't do that, as I'm blocked by BZ 1490274.

Dan, not sure if it's relevant, but I just managed to reproduce this bug on 4.1.6.2 with vdsm-4.19.34-1.el7ev.x86_64.

Created attachment 1340042 [details]
new logs from 4.1.6.2 failure
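The tracebacks above show vdsm's exec pattern for bringing up the management network: /sbin/ifup is launched inside a transient systemd scope on the vdsm-dhclient slice, and a non-zero exit (here caused by the DHCP timeout) surfaces as ConfigNetworkError(ERR_FAILED_IFUP, ...). The following is a rough, simplified sketch of that pattern, reduced from the tracebacks; it is an illustration, not vdsm's actual code:

```python
# Simplified illustration of the failure path seen in the tracebacks above.
# NOT vdsm's actual implementation; names are reduced from the log output.
import subprocess
import uuid

ERR_FAILED_IFUP = 29  # error code carried by ConfigNetworkError in the logs


class ConfigNetworkError(Exception):
    def __init__(self, errcode, message):
        self.errcode = errcode
        self.message = message
        super(ConfigNetworkError, self).__init__(errcode, message)


def exec_ifup_by_name(iface_name, slice_name='vdsm-dhclient'):
    """Run /sbin/ifup inside a transient systemd scope; raise on failure.

    If the DHCP server does not answer before ifup's internal timeout
    (about one minute in the logs above), ifup exits non-zero and the
    last line of its output becomes the error message.
    """
    unit = str(uuid.uuid4())
    cmd = ['/usr/bin/systemd-run', '--scope', '--unit=' + unit,
           '--slice=' + slice_name, '/sbin/ifup', iface_name]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    if proc.returncode != 0:
        out = proc.stdout.splitlines()
        raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
```

The key point for this bug is that vdsm itself never sees the DHCP exchange: it only observes ifup's exit code, so a slow DHCP server and a genuinely broken network are indistinguishable at this layer.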
Given the recent reports, this is neither a regression nor a blocker, though we would like to understand why it happens so often.

I don't think there is anything we can do about this. Or is there, Edy?

Another reproduction from rhv-qe on 4.1.7-3. The host remains in a non-responsive state forever.

Created attachment 1342699 [details]
new logs from 4.1.7-3 failure

(In reply to Dan Kenigsberg from comment #13)
> I don't think we can do anything about this, or do we Edy?

If the DHCP response takes too long, we do not have many options, at least not with the current architecture and tools. The only future idea I can think of is something like this. For a management network with an originally dynamic IP address, the move to a bridge could be done as follows:
- Read the IP settings currently defined.
- Tear down the interface and create the bridge on top of it.
- Define the saved IP address manually (through iproute2 or netlink) and mark it as "dynamic" by giving it the same lifetime (lft) it had before the teardown.
- Run dhclient on the bridge.
But it is tricky enough that we would want to avoid it.

The disadvantage of DHCP on a management network is exactly this scenario: on one hand the addressing is managed by the DHCP server, but on the other hand a failure to serve the network becomes a vulnerability of the network.

Thanks, Edy. Are you fine with closing this one?

(In reply to Edward Haas from comment #17)
> Are you fine with closing this one?

Hi Edy,
In comments 7, 8 and 16 you say that there is not much we can do about this bug in the case of a slow DHCP response. Given that, we are fine with closing it. If anyone has an idea on how to handle this scenario, please raise it. Closing the BZ for now.
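The teardown-and-restore idea sketched in the closing comments could, in principle, be expressed as a short sequence of iproute2 commands: flush the address, create the bridge, enslave the NIC, re-add the saved address with finite lifetimes so the kernel still treats it as dynamic, and restart dhclient. The sketch below only builds the command lists; interface names, the address, and the lifetime values are illustrative, and nothing like this was ever implemented in vdsm:

```python
# Hypothetical sketch of the bridge-takeover idea from the comment above.
# It only constructs iproute2 command lines; none of this exists in vdsm.

def bridge_takeover_commands(nic, bridge, addr, prefixlen, lft_seconds):
    """Return the command sequence for moving a dynamic IP onto a new bridge."""
    return [
        ['ip', 'addr', 'flush', 'dev', nic],                 # teardown old IP
        ['ip', 'link', 'add', 'name', bridge, 'type', 'bridge'],
        ['ip', 'link', 'set', nic, 'master', bridge],        # enslave the NIC
        ['ip', 'link', 'set', bridge, 'up'],
        # re-add the saved address; finite lifetimes mark it as dynamic
        ['ip', 'addr', 'add', '%s/%d' % (addr, prefixlen), 'dev', bridge,
         'valid_lft', str(lft_seconds), 'preferred_lft', str(lft_seconds)],
        ['dhclient', bridge],                                # resume the lease
    ]
```

Each list could be passed to subprocess.run() in turn. The fragile part, as the comment notes, is racing the lease lifetime: if the saved lft expires before dhclient re-binds on the bridge, the host drops off the network again.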
Created attachment 1311006 [details]
add host failed1 logs

Description of problem:
Add host failing every few attempts - $NetworkConfiguratorException: Failed to configure management network.

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170807080527.git5e7f780.el7.centos

How reproducible:
Every few attempts (4/6)

Steps to Reproduce:
1. Add host to master

Actual results:
Sometimes fails; the host becomes non-operational.

Expected results:
Adding the host should succeed consistently.