Bug 1078312
Summary: | F20 ovirt host installation fails due to python-ethtool SystemError | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Retired] oVirt | Reporter: | Martin Perina <mperina> | ||||||
Component: | vdsm | Assignee: | Antoni Segura Puimedon <asegurap> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Gil Klein <gklein> | ||||||
Severity: | unspecified | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 3.4 | CC: | acathrow, acme, asegurap, bazulay, daniel.guettes, danken, davids, gklein, iheim, mgoldboi, mperina, penguin.wrangler, remoford, sbonazzo, yeylon | ||||||
Target Milestone: | --- | ||||||||
Target Release: | 3.4.1 | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | network | ||||||||
Fixed In Version: | ovirt-3.4.1 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2014-05-02 09:57:02 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1060198 | ||||||||
Attachments: |
|
ipv6addrs = devInfo.get_ipv6_addresses() SystemError: error return without exception set is a known python-ethtool bug 1057772. David, would you please have a Fedora build with it? (In reply to Dan Kenigsberg from comment #1) > ipv6addrs = devInfo.get_ipv6_addresses() > SystemError: error return without exception set > > is a known python-ethtool bug 1057772. > > David, would you please have a Fedora build with it? Here you go: http://koji.fedoraproject.org/koji/taskinfo?taskID=6655658 I'll push it out as an admin once the build is done. Martin, would you retry, with http://kojipkgs.fedoraproject.org//work/tasks/5669/6655669/python-ethtool-0.10-2.fc20.x86_64.rpm pre-installed on the host before installation? No, it doesn't help to upgrade to python-ethtool-0.10-2.fc20.x86_64, following error appeared in vdsm.log: Traceback (most recent call last): File "/usr/share/vdsm/BindingXMLRPC.py", line 1058, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/BindingXMLRPC.py", line 394, in getCapabilities self.cif.threadLocal.server) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 746, in getDeviceByIP for ipv4addr in info.get_ipv4_addresses(): SystemError: error return without exception set However, when I downgrade to python-ethtool-0.8-2.fc20.x86_64, everything works fine. *** Bug 1082391 has been marked as a duplicate of this bug. *** (In reply to Martin Perina from comment #4) > No, it doesn't help to upgrade to python-ethtool-0.10-2.fc20.x86_64, > following error appeared in vdsm.log: > > Traceback (most recent call last): > File "/usr/share/vdsm/BindingXMLRPC.py", line 1058, in wrapper > res = f(*args, **kwargs) > File "/usr/share/vdsm/BindingXMLRPC.py", line 394, in getCapabilities > self.cif.threadLocal.server) > File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 746, in > getDeviceByIP > for ipv4addr in info.get_ipv4_addresses(): > SystemError: error return without exception set And here's the downside of not having a full crash, I don't see exactly what is causing this error. But I've trying to capture most possible places related to retrieving interface information in this last build: http://koji.fedoraproject.org/koji/taskinfo?taskID=6697579 x86_64 RPM: http://kojipkgs.fedoraproject.org//work/tasks/7593/6697593/python-ethtool-0.10-3.fc20.x86_64.rpm Can you please try that build? Also, if someone can please point me at the proper source tree for this code base, that'd be great. I would just like to try to understand more of the failing situation. Created attachment 881643 [details] vdsm-4.14.6-0.fc20.x86_64 logs >And here's the downside of not having a full crash, I don't see exactly what is >causing this error. But I've trying to capture most possible places related >to retrieving interface information in this last build: >http://koji.fedoraproject.org/koji/taskinfo?taskID=6697579 > >x86_64 RPM: http://kojipkgs.fedoraproject.org//work/tasks/7593/6697593/python-ethtool-0.10-3.fc20.x86_64.rpm > >Can you please try that build? Sorry, but it still doesn't work: Thread-13::ERROR::2014-04-02 08:28:26,540::BindingXMLRPC::1086::vds::(wrapper) unexpected error Traceback (most recent call last): File "/usr/share/vdsm/BindingXMLRPC.py", line 1070, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/BindingXMLRPC.py", line 393, in getCapabilities ret = api.getCapabilities() File "/usr/share/vdsm/API.py", line 1185, in getCapabilities c = caps.get() File "/usr/share/vdsm/caps.py", line 369, in get caps.update(netinfo.get()) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 557, in get netAttr.get('qosOutbound')) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 487, in _getNetInfo ipv4addr, ipv4netmask, ipv6addrs = getIpInfo(iface) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 317, in getIpInfo ipv6addrs = devInfo.get_ipv6_addresses() OSError: Bad socket Full vdsm logs attached. Sorry, I wanted to clear just my needinfo (In reply to Martin Perina from comment #7) > > Thread-13::ERROR::2014-04-02 > 08:28:26,540::BindingXMLRPC::1086::vds::(wrapper) unexpected error > Traceback (most recent call last): > File "/usr/share/vdsm/BindingXMLRPC.py", line 1070, in wrapper > res = f(*args, **kwargs) > File "/usr/share/vdsm/BindingXMLRPC.py", line 393, in getCapabilities > ret = api.getCapabilities() > File "/usr/share/vdsm/API.py", line 1185, in getCapabilities > c = caps.get() > File "/usr/share/vdsm/caps.py", line 369, in get > caps.update(netinfo.get()) > File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 557, in get > netAttr.get('qosOutbound')) > File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 487, in > _getNetInfo > ipv4addr, ipv4netmask, ipv6addrs = getIpInfo(iface) > File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 317, in > getIpInfo > ipv6addrs = devInfo.get_ipv6_addresses() > OSError: Bad socket > > Full vdsm logs attached. This actually makes a lot more sense, kind of. For some reason, it's not possible to query IPv6 addresses. py-ethtool gets a "Bad socket" error from the kernel (via libnl3), which it now proxies to the user. I believe this issue actually needs to be tackled in vdsm, at least catching the OSError exception. The reason it "worked" earlier py-ethtool versions is that it simply ignored many errors and just kept going. In addition, earlier versions used libnl-1, which can also behave slightly different; so in other words, errors where hidden. I can dig a little bit further if I can be given access to a system where this issue happens, with a reproducing routine. I've done some more deeper debugging on a box with this issue now. What is happening is that the socket libnl uses to communicate with the kernel gets invalidated somehow. I've not figured out if it just gets "corrupted" or if it gets closed. But the error libnl reports to python-ethtool is "Bad Socket".
I modified getIpInfo() in vdsm/netinfo.py to look like this:
----------------------------------------------------------------------
def getIpInfo(dev):
dbg = open("/tmp/pyeth.dbg", "a+")
dbg.write(">>>> Query: %s\n" % dev)
dbg.close()
devInfo = ethtool.get_interfaces_info(dev.encode('utf8'))[0]
addr = devInfo.ipv4_address
netmask = devInfo.ipv4_netmask
try:
ipv6addrs = devInfo.get_ipv6_addresses()
return (addr if addr else '',
prefix2netmask(netmask) if netmask else '',
[addr6.address + '/' + str(addr6.netmask) for addr6 in ipv6addrs])
except OSError as e:
dbg = open("/tmp/pyeth.dbg", "a+")
dbg.write(">>>> EXCEPTION: %s\n" % str(e))
dbg.close()
return (addr if addr else '',
prefix2netmask(netmask) if netmask else '',
[])
----------------------------------------------------------------------
The debug log which it now writes to reports this:
>>>> Query: ovirtmgmt
>>>> EXCEPTION: Bad socket
>>>> Query: em1
>>>> EXCEPTION: Bad socket
>>>> Query: em2
>>>> EXCEPTION: Bad socket
>>>> Query: ovirtmgmt
>>>> EXCEPTION: Bad socket
>>>> Query: bond0
>>>> EXCEPTION: Bad socket
>>>> Query: ;vdsmdummy;
>>>> EXCEPTION: Bad socket
Which means the libnl3 socket is already in a bad shape when getIpInfo() is called. vdsdm still fails, but now another place (getDeviceByIP(), vdsm/netinfo.py:647) ... with "Bad Socket" as well.
Running a "manual" python-ethtool test works fine:
--------------------------------------------------
import ethtool
devlist = ethtool.get_devices()
devs = ethtool.get_interfaces_info(devlist)
for d in devs:
print "*** Device: %s" % d.device
for ip in d.get_ipv4_addresses():
print "\t" + str(ip)
for ip in d.get_ipv6_addresses():
print "\t" + str(ip)
print "--------------"
-------------------------------------------
To ensure that it's not all the vdsm threads which causes the libnl socket to get closed, I build a special version of python-ethtool, which never closes the libnl socket. That patch also removes FD_CLOEXEC on the socket too. But vdsm behaves the same.
So currently, I'm not able to understand why this libnl socket gets in a bad state.
If this is not fixed, why is it on MODIFIED and not in ASSIGNED or POST? (In reply to Sandro Bonazzola from comment #11) > If this is not fixed, why is it on MODIFIED and not in ASSIGNED or POST? Just a pure glitch. The bug was due to the interaction between vdsm (using libnl-1) and python-ethtool (using libnl-3). Thanks to Thomas Haller for finding the reason. |
Created attachment 876394 [details] engine & vdsm logs Description of problem: Fedora 20 host installation failed with error: Host dev-18.rhev.lab.eng.brq.redhat.com installation failed. Failed to configure management network on the host. In vdsm.log this error appeared: Thread-13::ERROR::2014-03-19 15:57:28,038::BindingXMLRPC::1086::vds::(wrapper) unexpected error Traceback (most recent call last): File "/usr/share/vdsm/BindingXMLRPC.py", line 1070, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/BindingXMLRPC.py", line 393, in getCapabilities ret = api.getCapabilities() File "/usr/share/vdsm/API.py", line 1185, in getCapabilities c = caps.get() File "/usr/share/vdsm/caps.py", line 369, in get caps.update(netinfo.get()) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 557, in get netAttr.get('qosOutbound')) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 487, in _getNetInfo ipv4addr, ipv4netmask, ipv6addrs = getIpInfo(iface) File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 317, in getIpInfo ipv6addrs = devInfo.get_ipv6_addresses() SystemError: error return without exception set Network is working properly: # ifconfig em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet6 fe80::82c1:6eff:fe6b:7e6c prefixlen 64 scopeid 0x20<link> ether 80:c1:6e:6b:7e:6c txqueuelen 1000 (Ethernet) RX packets 10679 bytes 3395134 (3.2 MiB) RX errors 0 dropped 121 overruns 0 frame 0 TX packets 3612 bytes 2150205 (2.0 MiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 device memory 0xfbe60000-fbe80000 lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10<host> loop txqueuelen 0 (Local Loopback) RX packets 4 bytes 440 (440.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 4 bytes 440 (440.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 ovirtmgmt: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 10.34.63.178 netmask 255.255.252.0 broadcast 10.34.63.255 inet6 2620:52:0:223c:82c1:6eff:fe6b:7e6c prefixlen 64 scopeid 0x0<global> inet6 fe80::82c1:6eff:fe6b:7e6c prefixlen 64 scopeid 0x20<link> ether 80:c1:6e:6b:7e:6c txqueuelen 0 (Ethernet) RX packets 8003 bytes 3045025 (2.9 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 2626 bytes 2085009 (1.9 MiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 Version-Release number of selected component (if applicable): vdsm-4.14.5-0.fc20.x86_64 ovirt-engine-3.4.0-0.13.rc.fc19.noarch How reproducible: 100% Steps to Reproduce: 1. Install host Actual results: Host failed to install Expected results: Host installed successfully