Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1157224

Summary: vdsm sometimes reports an invalid NIC speed of 2**32-1
Product: Red Hat Enterprise Virtualization Manager Reporter: GenadiC <gcheresh>
Component: vdsm Assignee: Petr Horáček <phoracek>
Status: CLOSED CURRENTRELEASE QA Contact: GenadiC <gcheresh>
Severity: urgent Docs Contact:
Priority: medium    
Version: 3.5.0 CC: aberezin, bazulay, danken, ecohen, gcheresh, gklein, iheim, lpeer, lsurette, lvernia, myakove, oourfali, rbalakri, Rhev-m-bugs, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: network
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-16 13:40:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description            Flags
engine log             none
engine and vdsm logs   none

Description GenadiC 2014-10-26 10:36:02 UTC
Created attachment 950768
engine log

Description of problem:
Moving the host between different DCs/clusters results in an "Incorrect vdsm version for cluster" error.
It happens when running the automation tests (both locally and on Jenkins).

Version-Release number of selected component (if applicable):


How reproducible:
Only through automation at this point.

Steps to Reproduce:
1. Run Network Label automation Cases 15 or 16.
2. Alternatively, move the host between DCs/clusters of different versions while labels are attached to the host interface.

Actual results:
ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-96) Failure to refresh Vds rose09.qa.lab.tlv.redhat.com runtime info. Incorrect vdsm version for cluster Global_Cluster0: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer


Expected results:
Moving the host between supported clusters should work.

Additional info:

Comment 1 Lior Vernia 2014-10-26 19:14:08 UTC
How is this an automation blocker? It's an automated test that fails; it shouldn't block any other automated test.

Comment 2 Lior Vernia 2014-10-26 20:36:10 UTC
It's quite strange that this only happened after the host had been moved around, and not always. However, what happens seems possible only if vdsm started returning the NIC speed as a long as part of getVdsCaps (at least in some scenario). Dan, could this be the case? Possibly related to jsonrpc?

Comment 3 Dan Kenigsberg 2014-10-26 21:16:42 UTC
Sorry Lior, I don't understand your getVdsCaps hypothesis (or the bug). Could you elaborate?

Which Vdsm version is it? Which cluster level is involved?

Comment 4 Lior Vernia 2014-10-26 21:47:38 UTC
The ClassCastException in the engine log appears to refer to code that casts the NIC speed, extracted from the getVdsCaps dictionary, to an Integer (apparently the value is deserialized as a Long).
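To illustrate why the cast can fail, here is a minimal Python sketch. It assumes the engine's JSON layer picks the smallest boxed Java type that fits each JSON integer, as Jackson's default untyped deserialization does (Integer, then Long, then BigInteger). The helper `boxed_type_for` is hypothetical and only mirrors the Java numeric ranges: a value above `Integer.MAX_VALUE` (2**31-1) would arrive as a `Long`, and casting that object to `Integer` throws.

```python
# Java boxed integer ranges: int is signed 32-bit, long is signed 64-bit.
JAVA_INT_MIN, JAVA_INT_MAX = -(2**31), 2**31 - 1
JAVA_LONG_MIN, JAVA_LONG_MAX = -(2**63), 2**63 - 1


def boxed_type_for(n: int) -> str:
    """Hypothetical helper: name the smallest boxed Java type that can hold n.

    Assumes a deserializer that chooses Integer, then Long, then BigInteger.
    """
    if JAVA_INT_MIN <= n <= JAVA_INT_MAX:
        return "Integer"
    if JAVA_LONG_MIN <= n <= JAVA_LONG_MAX:
        return "Long"
    return "BigInteger"


if __name__ == "__main__":
    # Typical NIC speeds in Mb/s, plus the invalid value from this bug.
    for speed in (1000, 10000, 2**32 - 1):
        print(speed, boxed_type_for(speed))
```

Under that assumption, only the 2**32-1 speed maps to `Long`, which matches the `java.lang.Long cannot be cast to java.lang.Integer` message in the engine log.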

Comment 5 GenadiC 2014-10-27 07:29:01 UTC
VDSM version - vdsm-4.16.7.1-1.el7.x86_64
Engine version 3.5.0-0.17.beta.el6ev
I was trying to move the host from one 3.5 cluster to another 3.5 cluster.

Comment 6 Gil Klein 2014-10-27 13:43:59 UTC
Ok, will remove the flag from this one.

Comment 7 Lior Vernia 2014-10-27 13:46:25 UTC
Doesn't sound like high priority due to the difficulty in reproduction.

Comment 8 Lior Vernia 2014-10-27 15:39:22 UTC
Genadi, could you please run the tests using xmlrpc to communicate with the host and let us know if that works?

Comment 9 GenadiC 2014-10-28 15:02:05 UTC
Lior, indeed it works without any problem with xmlrpc.
I tested it twice and it worked; before that it failed every time.

Comment 10 Lior Vernia 2014-10-28 15:16:51 UTC
Great, so it's either a general issue with how the engine now deserializes numbers passed from vdsm, or a specific issue with what vdsm passes as the interface speed in getVdsCaps (over jsonrpc). Still waiting to hear from Dan about the latter hypothesis.

Comment 11 Lior Vernia 2014-10-28 15:17:32 UTC
...and from Oved about the former :)

Comment 12 Oved Ourfali 2014-10-29 09:42:51 UTC
It seems like a specific issue with this one. For some reason the exception is caught in VdsUpdateRunTimeInfo and results in the "Incorrect vdsm version" error, although the two are unrelated. I don't think it is a "generic" issue, but I guess Dan can respond on both hypotheses...

Comment 13 Dan Kenigsberg 2014-10-29 09:58:51 UTC
Could you provide vdsm.log (particularly the response to getCapabilities after adding the host to the new cluster)?

Comment 14 GenadiC 2014-10-29 11:23:56 UTC
Created attachment 951753
engine and vdsm logs

Comment 15 Dan Kenigsberg 2014-10-30 21:16:32 UTC
Eureka: on one occasion, Vdsm returned speed=2**32-1 for some reason. XMLRPC cannot carry this as a number (its integers are signed 32-bit), so in that case we would have seen a vdsm-side exception. jsonrpc lets it through, and it explodes within Engine.

Thread-160::DEBUG::2014-10-29 07:13:32,809::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.getCapabilities' in bridge with .... 'eno2': {'addr': '', 'cfg': {'DEVICE': 'eno2', 'HWADDR': 'd4:ae:52:b9:c0:c6', 'ONBOOT': 'yes', 'NM_CONTROLLED': 'no', 'MTU': '1500'}, 'ipv6addrs': [], 'mtu': '1500', 'netmask': '', 'ipv4addrs': [], 'hwaddr': 'd4:ae:52:b9:c0:c6', 'speed': 4294967295}

Can you attach connectivity.log? I wonder what shows up there.
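The protocol difference described in comment 15 can be reproduced with Python 3's standard library. This is an approximation: vdsm ran on Python 2's `xmlrpclib` at the time, but the 32-bit limit is the same. `xmlrpc.client` refuses integers outside the signed 32-bit range, while `json` serializes any integer. `try_xmlrpc` and `try_json` are illustrative helpers, not vdsm functions.

```python
import json
import xmlrpc.client

# The speed value seen for eno2 in the getCapabilities response.
SPEED = 2**32 - 1


def try_xmlrpc(value):
    """Return the XML-RPC encoding of {'speed': value}, or None if it cannot be encoded."""
    try:
        return xmlrpc.client.dumps(({"speed": value},))
    except OverflowError:
        # XML-RPC <int>/<i4> is a signed 32-bit integer, so 2**32-1 is rejected.
        return None


def try_json(value):
    """Return the JSON encoding of {'speed': value}; JSON places no size limit on integers."""
    return json.dumps({"speed": value})


if __name__ == "__main__":
    print(try_xmlrpc(SPEED))
    print(try_json(SPEED))
```

Over XML-RPC the bad value fails on the sending side; over JSON it is delivered intact and only fails later, when the receiver assumes it fits in 32 bits.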

Comment 16 Dan Kenigsberg 2014-10-30 21:34:10 UTC
http://gerrit.ovirt.org/4320 introduced this bug: if 2**32-1 is read from /sys/class/net/%s/speed, it is passed on as-is, since it is bigger than 0 (!)
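Here is a minimal sketch of the kind of validation this comment implies. It assumes the faulty logic amounted to a bare `> 0` check on the value read from sysfs. `parse_speed_buggy`, `parse_speed_fixed`, and `read_speed` are hypothetical names, not vdsm's actual code. Treating 2**16-1 as an "unknown" sentinel alongside 2**32-1, and reporting unknown speeds as 0, are also assumptions.

```python
# Assumed "speed unknown" sentinels: 2**32-1 is the value seen in this bug;
# 2**16-1 is included on the assumption that it is used the same way.
UNKNOWN_SPEEDS = frozenset({2**16 - 1, 2**32 - 1})


def parse_speed_buggy(raw: str) -> int:
    """Accept anything positive: 4294967295 > 0, so the sentinel leaks through."""
    s = int(raw)
    return s if s > 0 else 0


def parse_speed_fixed(raw: str) -> int:
    """Accept only positive values that are not known sentinels; otherwise report 0."""
    s = int(raw)  # int() tolerates the trailing newline sysfs emits
    if s > 0 and s not in UNKNOWN_SPEEDS:
        return s
    return 0


def read_speed(dev: str) -> int:
    """Read a NIC speed (Mb/s) from sysfs, treating any failure as unknown (0)."""
    try:
        with open("/sys/class/net/%s/speed" % dev) as f:
            return parse_speed_fixed(f.read())
    except (OSError, ValueError):
        # e.g. missing device, link down, or unparsable content
        return 0
```

Rejecting sentinels explicitly, rather than relying only on `> 0`, keeps the reported speed within the signed 32-bit range that both XML-RPC and the engine's Integer cast expect.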

Comment 17 GenadiC 2014-12-10 11:05:04 UTC
Verified in vt13.1