Bug 1078312 - F20 ovirt host installation fails due to python-ethtool SystemError
Summary: F20 ovirt host installation fails due to python-ethtool SystemError
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.4
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: 3.4.1
Assignee: Antoni Segura Puimedon
QA Contact: Gil Klein
URL:
Whiteboard: network
: 1082391 (view as bug list)
Depends On:
Blocks: 1060198
TreeView+ depends on / blocked
 
Reported: 2014-03-19 15:13 UTC by Martin Perina
Modified: 2014-05-02 09:57 UTC (History)
15 users (show)

Fixed In Version: ovirt-3.4.1
Clone Of:
Environment:
Last Closed: 2014-05-02 09:57:02 UTC
oVirt Team: ---
Embargoed:


Attachments (Terms of Use)
engine & vdsm logs (224.26 KB, application/gzip)
2014-03-19 15:13 UTC, Martin Perina
no flags Details
vdsm-4.14.6-0.fc20.x86_64 logs (3.78 KB, application/gzip)
2014-04-02 06:36 UTC, Martin Perina
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 26514 0 None MERGED netlink: User libnl-3 when available Never
oVirt gerrit 26993 0 None MERGED netlink: Use libnl-3 when available Never

Description Martin Perina 2014-03-19 15:13:30 UTC
Created attachment 876394 [details]
engine & vdsm logs

Description of problem:

Fedora 20 host installation failed with error:

Host dev-18.rhev.lab.eng.brq.redhat.com installation failed. Failed to configure management network on the host.

In vdsm.log this error appeared:

Thread-13::ERROR::2014-03-19 15:57:28,038::BindingXMLRPC::1086::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 1070, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 393, in getCapabilities
    ret = api.getCapabilities()
  File "/usr/share/vdsm/API.py", line 1185, in getCapabilities
    c = caps.get()
  File "/usr/share/vdsm/caps.py", line 369, in get
    caps.update(netinfo.get())
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 557, in get
    netAttr.get('qosOutbound'))
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 487, in _getNetInfo
    ipv4addr, ipv4netmask, ipv6addrs = getIpInfo(iface)
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 317, in getIpInfo
    ipv6addrs = devInfo.get_ipv6_addresses()
SystemError: error return without exception set

Network is working properly:

# ifconfig 
em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::82c1:6eff:fe6b:7e6c  prefixlen 64  scopeid 0x20<link>
        ether 80:c1:6e:6b:7e:6c  txqueuelen 1000  (Ethernet)
        RX packets 10679  bytes 3395134 (3.2 MiB)
        RX errors 0  dropped 121  overruns 0  frame 0
        TX packets 3612  bytes 2150205 (2.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xfbe60000-fbe80000  

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 4  bytes 440 (440.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4  bytes 440 (440.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ovirtmgmt: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.34.63.178  netmask 255.255.252.0  broadcast 10.34.63.255
        inet6 2620:52:0:223c:82c1:6eff:fe6b:7e6c  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::82c1:6eff:fe6b:7e6c  prefixlen 64  scopeid 0x20<link>
        ether 80:c1:6e:6b:7e:6c  txqueuelen 0  (Ethernet)
        RX packets 8003  bytes 3045025 (2.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2626  bytes 2085009 (1.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Version-Release number of selected component (if applicable):

vdsm-4.14.5-0.fc20.x86_64

ovirt-engine-3.4.0-0.13.rc.fc19.noarch

How reproducible:

100%

Steps to Reproduce:
1. Install host

Actual results:

Host failed to install

Expected results:

Host installed successfully

Comment 1 Dan Kenigsberg 2014-03-19 15:49:01 UTC
ipv6addrs = devInfo.get_ipv6_addresses()
SystemError: error return without exception set

is a known python-ethtool bug 1057772.

David, would you please have a Fedora build with it?

Comment 2 David Sommerseth 2014-03-20 18:08:29 UTC
(In reply to Dan Kenigsberg from comment #1)
> ipv6addrs = devInfo.get_ipv6_addresses()
> SystemError: error return without exception set
> 
> is a known python-ethtool bug 1057772.
> 
> David, would you please have a Fedora build with it?

Here you go:
http://koji.fedoraproject.org/koji/taskinfo?taskID=6655658

I'll push it out as an admin once the build is done.

Comment 3 Dan Kenigsberg 2014-03-21 09:44:14 UTC
Martin, would you retry, with http://kojipkgs.fedoraproject.org//work/tasks/5669/6655669/python-ethtool-0.10-2.fc20.x86_64.rpm pre-installed on the host before installation?

Comment 4 Martin Perina 2014-03-21 14:07:51 UTC
No, it doesn't help to upgrade to python-ethtool-0.10-2.fc20.x86_64, following error appeared in vdsm.log:

Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 1058, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 394, in getCapabilities
    self.cif.threadLocal.server)
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 746, in getDeviceByIP
    for ipv4addr in info.get_ipv4_addresses():
SystemError: error return without exception set


However, when I downgrade to python-ethtool-0.8-2.fc20.x86_64, everything works fine.

Comment 5 Dan Kenigsberg 2014-03-31 08:51:17 UTC
*** Bug 1082391 has been marked as a duplicate of this bug. ***

Comment 6 David Sommerseth 2014-04-01 19:48:50 UTC
(In reply to Martin Perina from comment #4)
> No, it doesn't help to upgrade to python-ethtool-0.10-2.fc20.x86_64,
> following error appeared in vdsm.log:
> 
> Traceback (most recent call last):
>   File "/usr/share/vdsm/BindingXMLRPC.py", line 1058, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/BindingXMLRPC.py", line 394, in getCapabilities
>     self.cif.threadLocal.server)
>   File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 746, in
> getDeviceByIP
>     for ipv4addr in info.get_ipv4_addresses():
> SystemError: error return without exception set

And here's the downside of not having a full crash, I don't see exactly what is causing this error.  But I've trying to capture most possible places related to retrieving interface information in this last build: 
http://koji.fedoraproject.org/koji/taskinfo?taskID=6697579

x86_64 RPM: http://kojipkgs.fedoraproject.org//work/tasks/7593/6697593/python-ethtool-0.10-3.fc20.x86_64.rpm

Can you please try that build?

Also, if someone can please point me at the proper source tree for this code base, that'd be great.  I would just like to try to understand more of the failing situation.

Comment 7 Martin Perina 2014-04-02 06:36:50 UTC
Created attachment 881643 [details]
vdsm-4.14.6-0.fc20.x86_64 logs

>And here's the downside of not having a full crash, I don't see exactly what is 
>causing this error.  But I've trying to capture most possible places related
>to retrieving interface information in this last build: 
>http://koji.fedoraproject.org/koji/taskinfo?taskID=6697579
>
>x86_64 RPM: http://kojipkgs.fedoraproject.org//work/tasks/7593/6697593/python-ethtool-0.10-3.fc20.x86_64.rpm
>
>Can you please try that build?

Sorry, but it still doesn't work:

Thread-13::ERROR::2014-04-02 08:28:26,540::BindingXMLRPC::1086::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 1070, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 393, in getCapabilities
    ret = api.getCapabilities()
  File "/usr/share/vdsm/API.py", line 1185, in getCapabilities
    c = caps.get()
  File "/usr/share/vdsm/caps.py", line 369, in get
    caps.update(netinfo.get())
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 557, in get
    netAttr.get('qosOutbound'))
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 487, in _getNetInfo
    ipv4addr, ipv4netmask, ipv6addrs = getIpInfo(iface)
  File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 317, in getIpInfo
    ipv6addrs = devInfo.get_ipv6_addresses()
OSError: Bad socket

Full vdsm logs attached.

Comment 8 Martin Perina 2014-04-02 06:38:15 UTC
Sorry, I wanted to clear just my needinfo

Comment 9 David Sommerseth 2014-04-02 09:35:21 UTC
(In reply to Martin Perina from comment #7)
> 
> Thread-13::ERROR::2014-04-02
> 08:28:26,540::BindingXMLRPC::1086::vds::(wrapper) unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/BindingXMLRPC.py", line 1070, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/BindingXMLRPC.py", line 393, in getCapabilities
>     ret = api.getCapabilities()
>   File "/usr/share/vdsm/API.py", line 1185, in getCapabilities
>     c = caps.get()
>   File "/usr/share/vdsm/caps.py", line 369, in get
>     caps.update(netinfo.get())
>   File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 557, in get
>     netAttr.get('qosOutbound'))
>   File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 487, in
> _getNetInfo
>     ipv4addr, ipv4netmask, ipv6addrs = getIpInfo(iface)
>   File "/usr/lib64/python2.7/site-packages/vdsm/netinfo.py", line 317, in
> getIpInfo
>     ipv6addrs = devInfo.get_ipv6_addresses()
> OSError: Bad socket
> 
> Full vdsm logs attached.

This actually makes a lot more sense, kind of.  For some reason, it's not possible to query IPv6 addresses.  py-ethtool gets a "Bad socket" error from the kernel (via libnl3), which it now proxies to the user.

I believe this issue actually needs to be tackled in vdsm, at least catching the OSError exception.  The reason it "worked" earlier py-ethtool versions is that it simply ignored many errors and just kept going.  In addition, earlier versions used libnl-1, which can also behave slightly different; so in other words, errors where hidden.

I can dig a little bit further if I can be given access to a system where this issue happens, with a reproducing routine.

Comment 10 David Sommerseth 2014-04-02 12:45:21 UTC
I've done some more deeper debugging on a box with this issue now.  What is happening is that the socket libnl uses to communicate with the kernel gets invalidated somehow.  I've not figured out if it just gets "corrupted" or if it gets closed.  But the error libnl reports to python-ethtool is "Bad Socket".

I modified getIpInfo() in vdsm/netinfo.py to look like this:
----------------------------------------------------------------------
def getIpInfo(dev):
    dbg = open("/tmp/pyeth.dbg", "a+")
    dbg.write(">>>> Query: %s\n" % dev)
    dbg.close()

    devInfo = ethtool.get_interfaces_info(dev.encode('utf8'))[0]
    addr = devInfo.ipv4_address
    netmask = devInfo.ipv4_netmask
    try:
        ipv6addrs = devInfo.get_ipv6_addresses()
        return (addr if addr else '',
            prefix2netmask(netmask) if netmask else '',
            [addr6.address + '/' + str(addr6.netmask) for addr6 in ipv6addrs])

    except OSError as e:
        dbg = open("/tmp/pyeth.dbg", "a+")
        dbg.write(">>>> EXCEPTION: %s\n" % str(e))
        dbg.close()

        return (addr if addr else '',
                prefix2netmask(netmask) if netmask else '',
                [])
----------------------------------------------------------------------

The debug log which it now writes to reports this:
>>>> Query: ovirtmgmt
>>>> EXCEPTION: Bad socket
>>>> Query: em1
>>>> EXCEPTION: Bad socket
>>>> Query: em2
>>>> EXCEPTION: Bad socket
>>>> Query: ovirtmgmt
>>>> EXCEPTION: Bad socket
>>>> Query: bond0
>>>> EXCEPTION: Bad socket
>>>> Query: ;vdsmdummy;
>>>> EXCEPTION: Bad socket

Which means the libnl3 socket is already in a bad shape when getIpInfo() is called.  vdsdm still fails, but now another place (getDeviceByIP(), vdsm/netinfo.py:647) ... with "Bad Socket" as well.

Running a "manual" python-ethtool test works fine:
--------------------------------------------------
import ethtool
devlist = ethtool.get_devices()
devs = ethtool.get_interfaces_info(devlist)
for d in devs:
    print "*** Device: %s" % d.device
    for ip in d.get_ipv4_addresses():
            print "\t" + str(ip)
    for ip in d.get_ipv6_addresses():
            print "\t" + str(ip)
    print "--------------"
-------------------------------------------

To ensure that it's not all the vdsm threads which causes the libnl socket to get closed, I build a special version of python-ethtool, which never closes the libnl socket.  That patch also removes FD_CLOEXEC on the socket too.  But vdsm behaves the same.

So currently, I'm not able to understand why this libnl socket gets in a bad state.

Comment 11 Sandro Bonazzola 2014-04-03 09:43:51 UTC
If this is not fixed, why is it on MODIFIED and not in ASSIGNED or POST?

Comment 12 David Sommerseth 2014-04-03 15:05:51 UTC
(In reply to Sandro Bonazzola from comment #11)
> If this is not fixed, why is it on MODIFIED and not in ASSIGNED or POST?

Just a pure glitch.

Comment 13 Antoni Segura Puimedon 2014-04-07 23:38:54 UTC
The bug was due to the interaction between vdsm (using libnl-1) and python-ethtool (using libnl-3). Thanks to Thomas Haller for finding the reason.


Note You need to log in before you can comment on or make changes to this bug.