Bug 869361

Summary: UV: KVM virt-manager fails to launch on large memory systems (>8TB)
Product: Red Hat Enterprise Linux 6
Reporter: George Beshers <gbeshers>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED DUPLICATE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6.4
CC: acathrow, ajia, berrange, ctatman, dallan, dfults, dyasny, dyuan, gbeshers, gsun, honzhang, leiwang, loriann, mprivozn, qguan, randerso, rja, tee, wshi, xuzhang
Target Milestone: rc
Keywords: OtherQA
Target Release: 6.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-07 14:46:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 844783

Description George Beshers 2012-10-23 16:59:21 UTC
Description of problem:

The problem occurs when trying to launch virt-manager on a
large memory system.  The smallest I've seen so far was uv48-sys,
with 8TB of memory.

When running virt-manager you'll see the error:

     libvirtError: Unable to encode message payload.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Richard W.M. Jones 2012-10-23 17:26:11 UTC
Does this bug have the wrong component?

What is "UV"?

What is the output of the following commands when run as root?

  virsh list --all
  virsh capabilities

Comment 3 George Beshers 2012-10-23 17:46:22 UTC
UV = Ultra Violet -- SGI's x86_64 supercomputers.

There are actually two systems where this fails,

   UV1000 w/ 2048cpus and 8TB of memory
   UV2000 w/ 2048cpus and 16TB of memory

In both cases that is 2048 cores -- we are not currently
enabling HyperThreading.

I will add the requested information soon.

Comment 4 Derek Fults 2012-10-29 15:27:54 UTC
I have the output from 'strace -o virsh-out -ff virsh capabilities'
if that is helpful.


# virsh list --all
 Id    Name                           State
----------------------------------------------------


# virsh capabilities
error: failed to get capabilities
error: Unable to encode message payload
error: Reconnected to the hypervisor

[root@uv48-sys ~]# topology
System type: UV100/1000
System name: uv48-sys
Serial number: UV-00000048
Partition number: 0
     128 Blades
      64 Routers
    4096 CPUs
     128 Nodes
 9084.85 GB Memory Total
  128.00 GB Max Memory on any Node
       1 BASE I/O Riser
       2 Network Controllers
       1 Storage Controller
       8 USB Controllers
       1 VGA GPU

Comment 5 Richard W.M. Jones 2013-06-05 11:20:27 UTC
This has the wrong component, which is why no one was looking at it.

Comment 6 Daniel Berrangé 2013-06-05 14:13:00 UTC
From the info in comment #4, I'd guess the trigger is probably not the amount of RAM but rather the size of the NUMA topology, which produces a very large capabilities XML.

Comment 7 Daniel Berrangé 2013-06-05 14:14:43 UTC
Please provide the version of the libvirt RPM that is installed when you see this behaviour.

Comment 8 Michal Privoznik 2013-06-05 14:22:19 UTC
George, I think this is the very same bug that we chased a while ago. Let me find it.

Comment 9 Michal Privoznik 2013-06-05 14:44:47 UTC
Found it:

https://bugzilla.redhat.com/show_bug.cgi?id=797279

Comment 10 Russ Anderson 2013-06-07 02:27:55 UTC
Move to rhel6.5 tracker.

Comment 11 Michal Privoznik 2013-06-07 06:14:25 UTC
George,

can you please provide both server & client side debug logs, as well as the version requested in comment 7?

http://wiki.libvirt.org/page/DebugLogs

Thanks.

Comment 12 Russ Anderson 2013-06-07 14:45:10 UTC
Michal, that info is in BZ 960683.

This BZ should get closed out as replaced by BZ 960683.
Sorry for the confusion.

Comment 13 Russ Anderson 2013-06-07 14:46:13 UTC

*** This bug has been marked as a duplicate of bug 960683 ***

Comment 14 Xuesong Zhang 2013-10-15 09:08:05 UTC
Hi, Michal Privoznik,

   I'm verifying this bug in the latest libvirt 6.5 build.
   First, I need to reproduce this bug in the old build; if it can be reproduced, we then test the latest build to verify the fix. The problem is that we don't have a machine with more than 8TB of memory.
   Since this bug is a duplicate of bug 960683, and there is an attachment that simulates huge CPU counts on small boxes, I applied that patch to the old build and tried to reproduce this bug. The result is:
   This bug (869361) can't be reproduced with the simulation patch.
   Bug 960683 can be reproduced with the simulation patch.
   PS. Here is the link to the simulation patch: https://bugzilla.redhat.com/attachment.cgi?id=756168

   Would you please give me some advice on how I can simulate an environment to reproduce this bug, or how I can verify this bug in the latest build? Thanks very much.

Comment 15 Michal Privoznik 2013-10-15 09:30:42 UTC
Well, I don't think this one needs to be reproduced. It is a duplicate. The original problem for this bug was encoding the NUMA topology into the capabilities XML. The encoded XML was too big for a libvirt packet. However, we've since fixed it, and now even huge XML can be sent through.
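
To illustrate why the topology size rather than the RAM size matters here, the following is a rough Python sketch. It builds a simplified NUMA topology fragment and compares its encoded size against a payload limit. The element layout only loosely follows the shape of the capabilities XML, and the 64 KiB limit is an assumption chosen for illustration, not the limit libvirt actually enforced. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: an illustrative per-message payload limit, NOT libvirt's real value.
ASSUMED_MAX_PAYLOAD_BYTES = 64 * 1024


def build_topology_xml(num_cells, cpus_per_cell):
    """Return a simplified NUMA <topology> fragment as UTF-8 bytes.

    Hypothetical helper: real capabilities XML carries more per-cell and
    per-CPU detail, so real documents would be larger than this estimate.
    """
    topology = ET.Element("topology")
    cells = ET.SubElement(topology, "cells", num=str(num_cells))
    cpu_id = 0
    for cell_id in range(num_cells):
        cell = ET.SubElement(cells, "cell", id=str(cell_id))
        cpus = ET.SubElement(cell, "cpus", num=str(cpus_per_cell))
        for _ in range(cpus_per_cell):
            ET.SubElement(cpus, "cpu", id=str(cpu_id))
            cpu_id += 1
    return ET.tostring(topology, encoding="utf-8")


def fits_in_payload(xml_bytes, limit=ASSUMED_MAX_PAYLOAD_BYTES):
    """True if the encoded XML is no larger than the assumed limit."""
    return len(xml_bytes) <= limit


if __name__ == "__main__":
    # (2, 8) stands in for a small box; (128, 32) mirrors the 128 nodes and
    # 4096 CPUs reported by the topology output in comment 4.
    for num_cells, cpus_per_cell in [(2, 8), (128, 32)]:
        xml_bytes = build_topology_xml(num_cells, cpus_per_cell)
        print(num_cells, cpus_per_cell, len(xml_bytes), fits_in_payload(xml_bytes))
```

Under these assumptions the small layout fits comfortably while the 128-node layout does not, and memory size does not enter the calculation at all.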

Comment 16 Xuesong Zhang 2013-10-15 09:49:21 UTC
OK, I got it. Thanks for your quick reply.

(In reply to Michal Privoznik from comment #15)
> Well, I don't think this one needs to be reproduced. It is a duplicate. The
> original problem for this bug was encoding the NUMA topology into the
> capabilities XML. The encoded XML was too big for a libvirt packet. However,
> we've since fixed it, and now even huge XML can be sent through.