Bug 1157224
| Summary: | vdsm sometimes reports an invalid nic speed of 2**32-1 | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | GenadiC <gcheresh> | ||||||
| Component: | vdsm | Assignee: | Petr Horáček <phoracek> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | GenadiC <gcheresh> | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | 3.5.0 | CC: | aberezin, bazulay, danken, ecohen, gcheresh, gklein, iheim, lpeer, lsurette, lvernia, myakove, oourfali, rbalakri, Rhev-m-bugs, yeylon | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | 3.5.0 | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | network | ||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2015-02-16 13:40:26 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
How is this an automation blocker? It's an automated test that fails; it shouldn't block any other automated test. It's quite strange that this only happened after some moving around of the host, and not always. However, what happens seems possible only if vdsm started returning the NIC speed as a long as part of getVdsCaps (at least in some scenario) - Dan, could this be the case? Possibly related to jsonrpc?
Sorry Lior, I don't understand your getVdsCaps hypothesis (or the bug). Could you elaborate? Which Vdsm version is it? Which cluster level is involved?
The ClassCastException in the engine log appears to refer to code casting the NIC speed, extracted from the getVdsCaps dictionary, to an Integer (and apparently it's deserialized to a Long).
VDSM version - vdsm-4.16.7.1-1.el7.x86_64
Engine version 3.5.0-0.17.beta.el6ev
I was trying to move to 3.5 version Cluster from 3.5 Cluster
Ok, will remove the flag from this one. Doesn't sound like high priority due to the difficulty in reproduction.
Genadi, could you please run the tests using xmlrpc to communicate with the host and let us know if that works?
Lior, indeed it works without any problem with xmlrpc. Tested it twice and it worked; before that it failed every time.
Great, so it's either a general issue with how the engine now deserializes numbers passed from vdsm, or a specific issue with what's passed to vdsm as part of the interface speed in getVdsCaps (on jsonrpc). Still waiting to hear from Dan about the latter hypothesis.
...and from Oved about the former :)
It seems like a specific issue with this one. For some reason it is caught in VdsUpdateRunTimeInfo and results in the "Incorrect vdsm version" error, although it isn't related at all. I don't think it is a "generic" issue, but I guess Dan can respond on both hypotheses.
Could you provide vdsm.log (particularly the response to getCapabilities after adding the host to the new cluster)?
Created attachment 951753: engine and vdsm logs
Eureka: on one occasion, Vdsm returned speed=2**32-1 for some reason. XMLRPC cannot carry this as a number, so we would have seen a vdsm-side exception in that case. jsonrpc lets it through, and it explodes within Engine.
Thread-160::DEBUG::2014-10-29 07:13:32,809::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.getCapabilities' in bridge with .... 'eno2': {'addr': '', 'cfg': {'DEVICE': 'eno2', 'HWADDR': 'd4:ae:52:b9:c0:c6', 'ONBOOT': 'yes', 'NM_CONTROLLED': 'no', 'MTU': '1500'}, 'ipv6addrs': [], 'mtu': '1500', 'netmask': '', 'ipv4addrs': [], 'hwaddr': 'd4:ae:52:b9:c0:c6', 'speed': 4294967295}
Can you attach connectivity.log? I wonder what shows up there.
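The transport difference described above can be illustrated with a minimal sketch using only the Python 3 standard library (vdsm at the time ran on Python 2, where the module was named xmlrpclib). The helper names `xmlrpc_can_carry` and `json_can_carry` are hypothetical and exist only for this illustration; they are not vdsm or Engine code.

```python
import json
import xmlrpc.client

# The speed value seen in the Host.getCapabilities log line above.
REPORTED_SPEED = 2**32 - 1  # 4294967295

# Largest value a Java Integer can hold; anything bigger arrives as a Long.
JAVA_INT_MAX = 2**31 - 1


def xmlrpc_can_carry(value):
    """Return True if the standard XML-RPC marshaller accepts the integer."""
    try:
        xmlrpc.client.dumps((value,))
        return True
    except OverflowError:
        # XML-RPC integers are limited to the signed 32-bit range.
        return False


def json_can_carry(value):
    """Return True if the integer survives a JSON round trip unchanged."""
    return json.loads(json.dumps({'speed': value}))['speed'] == value


if __name__ == '__main__':
    print('xmlrpc accepts it:', xmlrpc_can_carry(REPORTED_SPEED))
    print('json accepts it:', json_can_carry(REPORTED_SPEED))
    print('fits a Java Integer:', REPORTED_SPEED <= JAVA_INT_MAX)
```

Under these assumptions the XML-RPC path fails on the sending side, while the JSON path delivers a number too large for a 32-bit signed integer, which is consistent with the ClassCastException reported by Engine.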
http://gerrit.ovirt.org/4320 introduced this bug: if 2**32-1 is read from /sys/class/net/%s/speed it is passed through as it is, since it's bigger than 0 (!)
Verified in vt13.1
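A rough sketch of the kind of check that would stop the value described above from reaching Engine. This is not the actual vdsm patch: the function name `sanitize_speed`, the choice to report 0 for an unknown speed, and the inclusion of 2**16-1 as a second "unknown" sentinel are all assumptions made for illustration.

```python
# Values assumed to mean "speed unknown" when read from the speed file.
# 2**32-1 is the value from this bug; 2**16-1 is an assumed additional sentinel.
UNKNOWN_SPEEDS = (2**16 - 1, 2**32 - 1)


def sanitize_speed(raw):
    """Convert raw text read from a NIC speed file into a usable speed.

    Returns 0 when the value is missing, malformed, non-positive, or one of
    the assumed "unknown" sentinels. A bare `speed > 0` test is not enough,
    because 2**32-1 is positive.
    """
    try:
        speed = int(raw)
    except (TypeError, ValueError):
        return 0
    if speed <= 0 or speed in UNKNOWN_SPEEDS:
        return 0
    return speed


if __name__ == '__main__':
    for sample in ('1000\n', '4294967295\n', '-1\n', 'garbage'):
        print(repr(sample), '->', sanitize_speed(sample))
```

Rejecting the sentinels explicitly, rather than relying on a positivity test alone, keeps every reported speed within the range a 32-bit signed integer can hold.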
Created attachment 950768: engine log
Description of problem:
Moving the Host between different DC/Cluster results in "Incorrect vdsm version for cluster" error. It happened when running the automation tests (locally and jenkins)
Version-Release number of selected component (if applicable):
How reproducible:
Only by automation at this point
Steps to Reproduce:
1. Run Network Label automation Cases 15 or 16
2. Or try to move the host between different DC/Cluster version when there are labels on the Host interface
Actual results:
ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-96) Failure to refresh Vds rose09.qa.lab.tlv.redhat.com runtime info. Incorrect vdsm version for cluster Global_Cluster0: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
Expected results:
Moving the Host between supported Cluster should work
Additional info: