Created attachment 1134894 [details]
Logs

Description of problem:
[Host QoS] - Configuring 'Rate Limit' and 'Committed Rate' values higher than the default (2048) on the management network ends up with the host losing connectivity with the engine. The host loses its IP, and only rebooting the server recovers it (restarting the network service or running dhclient manually on 'ovirtmgmt' doesn't help).

"Error while executing action HostSetupNetworks: Network error during communication with the Host."

- On other networks, which are not the management one, such an operation ends up with -->
'Error while executing action HostSetupNetworks: Unexpected exception'

The default values for 'Rate Limit' and 'Committed Rate' in the engine-config are 2048:
MaxAverageNetworkQoSValue: 2048 version: general
MaxPeakNetworkQoSValue: 2048 version: general

If we have a wide-bandwidth NIC (10Gb) and would like to set a QoS cap higher than 2048, we need to configure these values via engine-config and restart the engine.
Once values higher than 2048 (rate limit, committed rate) are entered in the UI on the management network, we lose connectivity with the host and its IP is gone.

From supervdsm.log it seems the values that are sent are negative -->

raise TrafficControlException(retcode, err, command)
TrafficControlException: (1, 'HFSC: Illegal "m2"\nHFSC: Illegal "rt"\n', ['/usr/sbin/tc', 'class', 'add', 'dev', u'enp7s0f0', 'parent', '1389:', 'classid', '1389:1388', 'hfsc', u'rt', 'm2', '-1294967296bit', u'ul', 'm2', '-1294967296bit', u'ls', 'm2', '800bit'])

Version-Release number of selected component (if applicable):
vdsm-4.17.23-0.el7ev.noarch
rhevm-3.6.3.4-0.1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Increase the default values in the engine-config to:
MaxAverageNetworkQoSValue=10000
MaxPeakNetworkQoSValue=10000
and restart the engine
2. Via the Setup Networks dialog, override QoS on the management network 'ovirtmgmt' with:
Link Share=100
Rate Limit=3000
Committed Rate=3000
and try to approve the operation

Actual results:
The Setup Networks operation hangs for a long time, and then the host loses connectivity with the engine and has no IP. Rebooting the server brings it back up.

Expected results:
Configuring QoS values higher than the default should be possible on the management network and on other networks. We shouldn't lose connectivity and the IP with such an operation.

Additional info:
On regular networks we get an error in vdsm.log:

jsonrpc.Executor/6::ERROR::2016-03-10 15:18:47,700::__init__::526::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 521, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 279, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1498, in setupNetworks
    except hooks.HookError as e:
AttributeError: 'module' object has no attribute 'HookError'

supervdsm.log:

raise TrafficControlException(retcode, err, command)
TrafficControlException: (1, 'HFSC: Illegal "m2"\nHFSC: Illegal "rt"\n', ['/usr/sbin/tc', 'class', 'add', 'dev', 'eno2', 'parent', '1389:', 'classid', '1389:1388', 'hfsc', 'rt', 'm2', '-1294967296bit', 'ul', 'm2', '-1294967296bit', 'ls', 'm2', '800bit'])

engine - 'Error while executing action HostSetupNetworks: Unexpected exception'

Attaching logs.
Jiri, do you know if tc and hfsc are expected to support bitrates higher than 2Gbps?
I'm told by Hannes Sowa that the upstream kernel and tc accept a 64-bit rate limit. It would be nice if this is easily backportable to el7.
Hi,

(In reply to Michael Burman from comment #0)
> From supervdsm.log it seems like the values that are sent are negative -->
>
> raise TrafficControlException(retcode, err, command)
> TrafficControlException: (1, 'HFSC: Illegal "m2"\nHFSC: Illegal "rt"\n',
> ['/usr/sbin/tc', 'class', 'add', 'dev', u'enp7s0f0', 'parent', '1389:',
> 'classid', '1389:1388', 'hfsc', u'rt', 'm2', '-1294967296bit', u'ul', 'm2',
> '-1294967296bit', u'ls', 'm2', '800bit'])

This really looks like tc being called with negative values in the first place, so the error message it prints is expected behaviour. Not sure if I am stating the obvious here, though; that's not clear from reading the comments above.

(In reply to Dan Kenigsberg from comment #2)
> I'm told by Hannes Sowa that upstream kernel and tc accept 64bit rate limit.
> If would be nice if this is easily backportable to el7.

Is this a separate request? If so, how is it related to this ticket? Or are you just hijacking the ticket? :)

Looking at tc/q_hfsc.c in upstream iproute2, I don't see any differences from RHEL7 in parameter parsing. It just uses get_rate(), which interprets the values as unsigned integers. This should already allow for values up to about 4Gbit.
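For reference, the ~4Gbit ceiling follows from the rate being parsed into an unsigned 32-bit field. A quick arithmetic sketch (Python here just for the numbers, not the actual iproute2 code):

```python
# get_rate() parses the rate into an unsigned 32-bit value, so the
# largest representable rate is 2**32 - 1 bits per second.
U32_MAX = 2**32 - 1

max_gbit = U32_MAX / 10**9   # largest rate in Gbit/s
print(round(max_gbit, 2))    # 4.29

# The 3000 Mbit/s requested in the reproduction (3e9 bit/s) still fits
# in that unsigned range...
assert 3000 * 10**6 <= U32_MAX
# ...so the negative '-1294967296bit' seen in the tc command must have
# been produced before tc was ever invoked.
```

In other words, tc itself could have accepted this particular rate; the corruption happens upstream of it.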
Hi,

For hfsc I was wrong; it seems we don't yet support 64-bit numbers. But that shouldn't be a hard problem to fix upstream. We simply have to add a new netlink attribute and a new definition of tc_service_curve, plus enhance some helpers. In tc we then need to switch over to get_rate64 in some places, while still sending compatibility values in the old attribute.

Sorry, I missed that this is hfsc, but we could still do this before 7.3.

Sorry,
Hannes
Created attachment 1151633 [details] support u64 parameters in hfsc Preliminary patch for 64 bit parameters to hfsc kernel support.
Yes, Phil, this was a (failed) hijack attempt. For a minute I thought there was no need for any change at the Vdsm application level, but we'd need changes in libnl3, the kernel and vdsm, besides iproute.
Hi,

(In reply to Dan Kenigsberg from comment #6)
> Yes, Phil, this was a (failed) hijack attempt. For a minute I thought that
> there is no need for any change on the Vdsm application level, but we'd need
> changes in libnl3, kernel and vdsm, besides of iproute.

Thanks for being honest with us. ;)

So is Michael's problem already resolved? If not, can we keep this ticket about the integer overflow issue in vdsm and have a dedicated one for 64bit support in HFSC? Otherwise we might lose track of that vdsm issue, no?

Cheers, Phil
Michael,

Does your problem with calling tc from vdsm still exist? Can you validate whether my assumption is correct and a negative value is passed to tc? If so, which component should this ticket be moved to?

Thanks, Phil
Hi Phil,

Yes, it still exists -->

jsonrpc.Executor/2::ERROR::2016-06-22 08:24:47,284::__init__::543::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 538, in _serveRequest
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 195, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1459, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
TrafficControlException: (1, 'HFSC: Illegal "m2"\nHFSC: Illegal "rt"\n', ['/usr/sbin/tc', 'class', 'add', 'dev', 'enp4s0f1', 'parent', '1389:', 'classid', '1389:1388', 'hfsc', 'rt', 'm2', '-1294967296bit', 'ul', 'm2', '-1294967296bit', 'ls', 'm2', '800bit'])

Yes, it does seem that negative values are passed to tc indeed^^
I really don't know to which component this report should be moved.
Hi Michael,

(In reply to Michael Burman from comment #9)
> yes, it is seems like negative values passed to tc indeed^^
> I really don't know to which component this report should be moved.

OK, thanks. I'll just move it back to where it came from then. :)

Dan, if you're still interested in those 64bit counters for HFSC, please clone this ticket, set the component back to iproute and adjust the description accordingly.

Thanks, Phil
Indeed, there's an integer overflow on the Engine side: vdsm receives negative values.

jsonrpc.Executor/7::DEBUG::2016-07-04 11:27:31,177::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'net1': {'ipv6autoconf': True, 'hostQos': {'out': {'rt': {'m2': -1294967296}, 'ul': {'m2': -1294967296}, 'ls': {'m2': 100}}}, 'nic': 'enp1s0f1', 'mtu': 1500, 'dhcpv6': False, 'STP': 'no', 'bridged': 'true'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}
Looks like the overflow is on the engine side: it holds the values as Integers, which are signed 32-bit in Java. The ul.m2 and rt.m2 values are multiplied by 8 before being sent to VDSM, so that could be an additional limitation.
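The observed -1294967296 is consistent with a signed 32-bit wrap of 3,000,000,000. A minimal sketch (Python emulating Java's two's-complement int; the exact engine-side conversion path, including the ×8 mentioned above, is not visible in the logs, so the bit/s conversion here is an assumption that merely reproduces the observed value):

```python
def as_java_int(value):
    """Emulate storing `value` in a signed 32-bit Java int (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31

# Hypothetical conversion: the 3000 Mbps Committed Rate expressed in bit/s.
committed_rate_mbps = 3000
rate_bits = committed_rate_mbps * 10**6   # 3,000,000,000 bit/s > 2**31 - 1

print(as_java_int(rate_bits))  # -1294967296, the value seen in the tc command

# The 2048 default stays below the signed 32-bit ceiling, which is why
# default-sized rates never triggered the wrap.
print(as_java_int(2048 * 10**6))  # 2048000000, unchanged
```

This matches the negative m2 values in both the Host.setupNetworks payload and the failing tc invocation.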
*** Bug 1464734 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1515877 ***