Bug 1294833

Summary: XMLRPC API of mom breaks on host with 193270 MiB ram
Product: [oVirt] mom
Reporter: Oved Ourfali <oourfali>
Component: Core
Assignee: Martin Sivák <msivak>
Status: CLOSED CURRENTRELEASE
QA Contact: Shira Maximov <mshira>
Severity: high
Docs Contact:
Priority: unspecified
Version: 0.5.1
CC: bugs, danken, dfediuck, dpaz, dprezhev, mavital, mgoldboi, msivak, sbonazzo, s.kieske, yliberma
Target Milestone: ovirt-3.6.3
Keywords: Regression
Target Release: 0.5.2
Flags: rule-engine: ovirt-3.6.z+
rule-engine: blocker+
mgoldboi: planning_ack+
dfediuck: devel_ack+
mavital: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: mom-0.5.2-1
Doc Type: Bug Fix
Doc Text:
Cause: VDSM uses XML-RPC to communicate with MoM in oVirt 3.6, and XML-RPC only supports int32 for numbers. Consequence: A sufficiently large amount of memory overflows the int32 type, and XML-RPC reports an error. Fix: MoM was configured to use the i8 XML-RPC extension for transferring big numbers (see the sketch after this header block). Result: VDSM can properly retrieve statistics from MoM.
Story Points: ---
Clone Of:
: 1302001 (view as bug list)
Environment:
Last Closed: 2016-02-18 11:12:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1302001    
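
For context on the fix described in the Doc Text above, here is a minimal sketch, assuming only stock Python 2.7 xmlrpclib, of how a marshaller can be taught to emit the non-standard <i8> tag for values outside the signed 32-bit range. This illustrates the general technique; it is not the actual mom-0.5.2 patch:

    import xmlrpclib

    def dump_int_i8(marshaller, value, write):
        # Values inside the signed 32-bit range keep the standard <int> tag;
        # anything larger is written with the non-standard but widely
        # supported <i8> extension tag instead of raising OverflowError.
        if -2147483648 <= value <= 2147483647:
            write("<value><int>%d</int></value>" % value)
        else:
            write("<value><i8>%d</i8></value>" % value)

    # Register the handler for both Python 2 integer types.
    xmlrpclib.Marshaller.dispatch[int] = dump_int_i8
    xmlrpclib.Marshaller.dispatch[long] = dump_int_i8

The stock Python 2.7 Unmarshaller already maps an incoming <i8> element back to an integer, so only the sending side needs patching.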

Description Oved Ourfali 2015-12-30 15:00:33 UTC
Description of problem:
On bigmem hosts, getVdsStats fails due to an issue in mom.
See additional info

Version-Release number of selected component (if applicable):
3.6.1

How reproducible:
danken claims 100%

Steps to Reproduce:
1. Install and run Vdsm+mom on a bigmem host
2. Run vdsClient -s 0 getVdsStats


Actual results:
Fails

Expected results:
Works

Additional info:

vdsm.log trace:
jsonrpc.Executor/7::ERROR::2015-12-30 16:35:17,706::__init__::526::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 521, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 277, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1384, in getStats
    stats.update(self._cif.mom.getKsmStats())
  File "/usr/share/vdsm/momIF.py", line 68, in getKsmStats
    stats = self._mom.getStatistics()['host']
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1587, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1306, in single_request
    return self.parse_response(response)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1482, in parse_response
    return u.close()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close
    raise Fault(**self._stack[0])
Fault: <Fault 1: "<type 'exceptions.OverflowError'>:int exceeds XML-RPC limits">


After logging host_stats in mom/MOMFuncs we see:

2015-12-30 16:47:54,796 - mom.RPCServer - INFO - host_stats {'swap_out': 0, 'swap_usage': 0, 'ksmd_cpu_usage': 0, 'anon_pages': 18073368, 'ksm_shareable': 20518684, 'ksm_pages_unshared': 0, 'swap_total': 16383996, 'ksm_pages_sharing': 0, 'cpu_count': 16, 'swap_in': 0, 'ksm_pages_to_scan': 100, 'mem_free': 178111592, 'ksm_merge_across_nodes': 1, 'ksm_pages_volatile': 0, 'mem_available': 197909444, 'ksm_pages_shared': 0, 'ksm_full_scans': 0, 'ksm_run': 0, 'ksm_sleep_millisecs': 20, 'mem_unused': 177132524}

Comment 2 Dan Kenigsberg 2016-01-03 15:50:58 UTC
As a quick hack to get the host running, I've edited /usr/lib/python2.7/site-packages/mom/MOMFuncs.py:

    def getStatistics(self):
        self.logger.info("getStatistics()")
        host_stats = self.threads['host_monitor'].interrogate().statistics[-1]
        # Stringify every host stat so the XML-RPC marshaller never sees
        # an integer above the int32 limit.
        host_stats = dict((k, str(v)) for (k, v) in host_stats.iteritems())
        guest_stats = {}
        guest_entities = self.threads['guest_manager'].interrogate().values()
        for entity in guest_entities:
            # Same treatment for the per-guest stats; non-integer values
            # pass through unchanged.
            d = dict((k, str(v) if isinstance(v, int) else v) for (k, v) in entity.statistics[-1].iteritems())
            guest_stats[entity.properties['name']] = d
        ret = {'host': host_stats, 'guests': guest_stats}
        return ret

and /usr/share/vdsm/momIF.py:

    def getKsmStats(self):
        """
        Get information about KSM and convert memory data from page
        based values to MiB.
        """

        ret = {}

        try:
            stats = self._mom.getStatistics()['host']
            # The MOMFuncs hack above sends the values as strings;
            # convert them back to integers before doing arithmetic.
            stats = dict((k, int(v)) for (k, v) in stats.iteritems())
            ret['ksmState'] = bool(stats['ksm_run'])
            ret['ksmPages'] = stats['ksm_pages_to_scan']
            ret['ksmMergeAcrossNodes'] = bool(stats['ksm_merge_across_nodes'])
            # PAGE_SIZE_BYTES and Mbytes are constants defined elsewhere in vdsm.
            ret['memShared'] = stats['ksm_pages_sharing'] * PAGE_SIZE_BYTES
            ret['memShared'] /= Mbytes
            ret['ksmCpu'] = stats['ksmd_cpu_usage']
        except (AttributeError, socket.error):
            self.log.warning("MOM not available, KSM stats will be missing.")

        return ret
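
The two edits pair up: MOMFuncs stringifies integers on the way out and momIF converts them back on the way in, so XML-RPC only ever carries strings. A minimal round-trip check of the idea (the value below is made up for illustration):

    # mom side: stringify so the XML-RPC marshaller never sees a big integer
    stats = {'mem_available': 202659270656}
    wire = dict((k, str(v)) for (k, v) in stats.iteritems())
    # vdsm side: convert back to integers before doing arithmetic
    back = dict((k, int(v)) for (k, v) in wire.iteritems())
    assert back == stats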

Comment 3 Martin Sivák 2016-01-04 11:38:56 UTC
I do not see any big enough number in the log. XML-RPC supports signed 32-bit ints; the max value is about two billion (ten digits).

We will obviously have to stringify it.
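
As an illustration of that limit, with nothing but stock Python 2.7:

    import xmlrpclib

    xmlrpclib.dumps((2147483647,))  # int32 maximum: marshals fine
    xmlrpclib.dumps((2147483648,))  # one more raises OverflowError:
                                    # "int exceeds XML-RPC limits"

Indeed, none of the values in the comment 0 log exceed that bound; the largest, mem_available = 197909444, is roughly a tenth of it.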

Comment 4 Dan Kenigsberg 2016-01-04 14:07:13 UTC
The long integers of comment 0 showed up only after I added logging to getStatistics().

Comment 5 Shira Maximov 2016-02-07 15:07:34 UTC
Verified on:
Red Hat Enterprise Virtualization Manager Version: 3.6.3-0.1.el6
mom-0.5.2-1.el7ev.noarch

The verification ran on a PPC host with 251 GB of memory.

Verification steps:
Run vdsClient -s 0 getVdsStats; the command worked.

Comment 6 Sven Kieske 2016-02-11 14:30:57 UTC
Did this work in 3.6.2, or was it broken? I think I need this function, I have some hosts with considerably more RAM, and I'm currently in the process of deploying 3.6.2.

So if this does not work in 3.6.2, I would have to wait for the 3.6.3 release.

Can someone confirm or deny whether this bug is present in 3.6.2?

Thanks!