Bug 1440683
Summary: | libvirt error "Unable to encode message payload" when using virConnectGetAllDomainStats API | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Richard W.M. Jones <rjones>
Component: | libvirt | Assignee: | Peter Krempa <pkrempa>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high | Priority: | high
Version: | 7.4 | Target Release: | 7.4
Hardware: | Unspecified | OS: | Unspecified
Target Milestone: | rc | Keywords: | OtherQA
Whiteboard: | QJ170216-012/74Z | Fixed In Version: | libvirt-3.2.0-7.el7
Clone Of: | 1422795 | Type: | Bug
Last Closed: | 2017-08-02 00:05:54 UTC | Bug Blocks: | 1298243, 1317092, 1422795, 1449577
CC: | chhu, chorn, dyuan, fj-lsoft-bm, fj-lsoft-ofuku, jsuchane, juzhou, lmen, mprivozn, mzhan, pkrempa, rjones, tumeya, tzheng, virt-bugs, xiaodwan, xuzhang, yoguma | |
Description
Richard W.M. Jones
2017-04-10 09:25:27 UTC
The problem is that even if we increase the limit arbitrarily, the user can then use an even more insane configuration which will still hit the problem. The question is what counts as a sane configuration in this respect. Can we get an idea of how large the buffer would need to be to make 494 disks work? (For me, 255 disks work.)

The current limit is 16MB, and if (for example) we had to increase that to 24MB or 32MB it doesn't seem like a great hardship. The limit is only there to stop someone exhausting memory or other resources in either the client or the daemon, but 32MB doesn't seem like it would cause problems.

Yes, it's exactly there to prevent overly large messages. I think even 128 or 256MiB would allow a similar kind of protection, and recent machines have lots of RAM anyway. The thing is that you still don't have the certainty that $big_number_of_disks will work. The bulk stats API is unique in this regard, since it transports a lot of text data (field names) and thus was the first to hit this limit. It also might be possible to split the returned data into multiple messages, at least for the case where you specify a list of domains. I'm not sure whether we can do this at the RPC level without breaking compatibility, though.

That does suggest another possibility -- using a simple, fixed Huffman encoding on the strings. (I guess this would need to be a different procedure number for backwards compatibility, though.) For a similar bug (in a different place) see bug 1443066.

(In reply to Peter Krempa from comment #6)
> Yes, it's exactly to prevent too large messages. I think even 128 or 256MiB
> will allow a similar kind of protection and recent machines have lots of ram
> anyways.
>
> The thing is that you still don't have the certainity that
> $big_number_of_disks will work.
>
> The bulk stats API is unique in this regard since it transports a lot of
> text data (field names) and thus was the first to hit this limit.
>
> It also might be possible to split the returned data into multiple messages,
> at least for the case where you specify a list of domains. I'm not sure
> though whether we can do this on the RPC level without breaking compat
> though.

This problem has been dealt with in the past in exactly this manner: a bug report comes in that the message limits are too small, we review the case, confirm that this is the problem, and increase the limits. I think we should do the same here as well.

So I've tried to figure out how much data you can actually return via this mechanism. The RPC packet payload has roughly a 16M limit (minus a little overhead). The data entries for a single disk take roughly 500 bytes, a network interface 400 bytes, one vCPU ~150 bytes, and the remainder, which does not scale with domain variables, is ~400 bytes. Taking a disk as 1k, this means we can return data for roughly 16k disks/interfaces. Unfortunately, if the call returns data for multiple VMs, the 16M allowance is split among them, since the data is returned in a single RPC call. This means that a VM with ~500 disks, plus the other data, will use ~1MiB of stats data when overestimating fairly strongly. This should still fit very well when returning data for a single VM. It looks like the call was used to return data for a larger number of VMs, which scales rather badly. Increasing RPC buffer sizes, while possible, won't help much for hosts with a lot of big VMs. A suggestion would be to use the statistics API to retrieve the information for only a limited number of VMs at a time, iteratively.
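For illustration only (this sketch is not part of the bug report, and the batch size of 16 is an arbitrary example value), one way to batch the retrieval is to list the domains once and then call virDomainListGetStats() on small slices of the list, so that each RPC reply stays well below the message-size limit:

```c
/* Sketch: retrieve bulk stats in small batches rather than with a single
 * virConnectGetAllDomainStats() call.  BATCH_SIZE is an arbitrary example. */
#include <stdlib.h>
#include <libvirt/libvirt.h>

#define BATCH_SIZE 16

int
fetch_stats_batched(virConnectPtr conn)
{
    virDomainPtr *doms = NULL;
    int ndoms = virConnectListAllDomains(conn, &doms,
                                         VIR_CONNECT_LIST_DOMAINS_ACTIVE);
    if (ndoms < 0)
        return -1;

    for (int i = 0; i < ndoms; i += BATCH_SIZE) {
        int n = (ndoms - i < BATCH_SIZE) ? ndoms - i : BATCH_SIZE;

        /* virDomainListGetStats() expects a NULL-terminated domain list. */
        virDomainPtr *batch = calloc(n + 1, sizeof(*batch));
        if (!batch)
            break;
        for (int j = 0; j < n; j++)
            batch[j] = doms[i + j];

        virDomainStatsRecordPtr *records = NULL;
        int nrecords = virDomainListGetStats(batch,
                                             VIR_DOMAIN_STATS_BLOCK |
                                             VIR_DOMAIN_STATS_INTERFACE,
                                             &records, 0);
        if (nrecords >= 0) {
            /* ... consume records[0 .. nrecords-1] here ... */
            virDomainStatsRecordListFree(records);
        }
        free(batch);
    }

    for (int i = 0; i < ndoms; i++)
        virDomainFree(doms[i]);
    free(doms);
    return 0;
}
```

The trade-off, as noted below, is more round trips per polling cycle.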
That way, while losing some programmer comfort, the RPC message size should be large enough to accommodate VMs with nearly insane configurations. It's also possible to retrieve the statistics in parallel, provided the sets of VMs don't overlap. Given the scenario above, I think the problem is the size explosion from returning data for multiple VMs while the data for each VM is huge due to the number of disks. The current configuration would allow retrieving statistics for roughly 16-32 guests. Doubling or quadrupling the message size would help slightly, but going beyond that won't be possible upstream. I can propose a patch to increase the buffer size, but as said, it will be hard to get past 100 VMs with such a configuration, so I'd suggest batching the calls to the stats API.

I think that making the 500 disks x 1 VM case work will be sufficient for this bug in RHEL 7.4. We might need to look at this again for 7.5, but I hope that limit shouldn't be too invasive for libvirt right now. However, I will add (as background) that the reason we changed virt-top to retrieve as many stats as possible in a single API call was that we previously had serious scalability problems due to the round-trip time of individual libvirt requests. So fewer round trips is better from our point of view.

I've doubled the upstream RPC message size:

```
commit 97863780e8846277842e746186a54c4d2941755b
Author: Peter Krempa <pkrempa>
Date:   Mon May 22 17:48:09 2017 +0200

    rpc: Bump maximum message size to 32M

    While most of the APIs are okay with 16M messages, the bulk stats
    API can run into the limit in big configurations. Before we devise
    a new plan for this, bump this limit slightly to accomodate some
    more configs.
```

I also wanted to add a suggestion on how to use the API for big configurations, but there's an ongoing discussion: https://www.redhat.com/archives/libvir-list/2017-May/msg00853.html

I have to set this back to ASSIGNED / FailedQA because, for reasons that I don't understand, this does not change the number of disks we can add to a guest before virConnectGetAllDomainStats throws "Unable to encode message payload".

```
10:44 < danpb> rjones: hmm, there's some dubious logic in virNetMessageEncodePayload() for increasing the buffer length
10:44 < danpb>     unsigned int newlen = (msg->bufferLength - VIR_NET_MESSAGE_LEN_MAX) * 4;
10:44 < danpb>     if (newlen > VIR_NET_MESSAGE_MAX) {
10:44 < danpb>         virReportError(VIR_ERR_RPC, "%s", _("Unable to encode message payload"));
10:44 < danpb>         goto error;
10:44 < danpb>     }
10:45 < danpb> we're taking the existing buffer length and multiplying it by 4 and then checking against the upper limit
10:46 < danpb> we start with 65 KB, that jumps to 256 KB, then 1 MB, then 4 MB, then 16 MB
10:47 < danpb> if you multiply 16 mb by 4, you'll get 64 mb which is bigger than the 32mb limit
```

I changed the loop so it doubles instead of multiplying by 4 on each iteration. However, this did not fix the problem.
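A minimal sketch of that doubling growth strategy, for illustration only (this is not the actual libvirt code, and the EXAMPLE_* constants are assumed example values):

```c
/* Illustrative sketch: grow the payload buffer by doubling rather than
 * quadrupling before checking the ceiling.  Not the real libvirt code;
 * the constants below are assumptions. */
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_MSG_MAX   (32u * 1024 * 1024)  /* assumed 32M message ceiling */
#define EXAMPLE_MSG_START (64u * 1024)         /* assumed initial payload size */

static bool
grow_payload_buffer(size_t *len)
{
    size_t newlen = *len ? *len * 2 : EXAMPLE_MSG_START;  /* double, not * 4 */
    if (newlen > EXAMPLE_MSG_MAX)
        return false;  /* growing further would exceed the RPC message limit */
    *len = newlen;
    return true;       /* caller reallocates the buffer and retries the XDR encode */
}
```

Doubling lets the final step land on the limit itself (16M -> 32M) instead of overshooting it (16M -> 64M), which is why the quadrupling version tripped the check in the IRC excerpt above.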
Next I investigated with gdb, and the behaviour is very strange. With 310 disks, the buffer grows to only 131072 bytes and the call then succeeds:

```
Breakpoint 1, virNetMessageEncodePayload (msg=msg@entry=0x562db2f24650,
    filter=0x562db21fd1b0 <xdr_remote_connect_get_all_domain_stats_ret>,
    data=0x7f9014000b50) at rpc/virnetmessage.c:362
362         newlen *= 2;
(gdb) print newlen
$15 = 131072
(gdb) cont
```

With 320 disks, the buffer doubles all the way past 33554432 and the call then fails:

```
Breakpoint 1, virNetMessageEncodePayload (msg=msg@entry=0x562db2f22ec0,
    filter=0x562db21fd1b0 <xdr_remote_connect_get_all_domain_stats_ret>,
    data=0x7f90200008e0) at rpc/virnetmessage.c:362
362         newlen *= 2;
(gdb) print newlen
$8 = 8388608
(gdb) cont
Continuing.

Breakpoint 1, virNetMessageEncodePayload (msg=msg@entry=0x562db2f22ec0,
    filter=0x562db21fd1b0 <xdr_remote_connect_get_all_domain_stats_ret>,
    data=0x7f90200008e0) at rpc/virnetmessage.c:362
362         newlen *= 2;
(gdb) print newlen
$9 = 16777216
(gdb) cont
Continuing.

Breakpoint 1, virNetMessageEncodePayload (msg=msg@entry=0x562db2f22ec0,
    filter=0x562db21fd1b0 <xdr_remote_connect_get_all_domain_stats_ret>,
    data=0x7f90200008e0) at rpc/virnetmessage.c:362
362         newlen *= 2;
(gdb) print newlen
$10 = 33554432
(gdb) cont
```

[virt-top then fails with the "Unable to encode message payload" error.]

I don't understand how there can be such a huge discontinuity between 310 and 320 disks, but I suspect there must be some further bug either in the loop or elsewhere in the libvirt code.

OK, I understand what the problem is. We're hitting this limit:

```
struct remote_connect_get_all_domain_stats_ret {
    remote_domain_stats_record retStats<REMOTE_DOMAIN_LIST_MAX>;
};
```

I'm fairly sure that the use of REMOTE_DOMAIN_LIST_MAX is simply wrong there. It should be another REMOTE_*_MAX limit, although I'm not sure which -- possibly we need to define a new limit. We don't need to increase VIR_NET_MESSAGE_MAX at all, but I am going to submit a patch upstream so that we double the size of the array instead of quadrupling it (though that won't be relevant to the eventual fix for this bug).

Verify this BZ per the following BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1422795#c26

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1846