Bug 1323988

Summary: Ballooning doesn't work on Power with 4k guests
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.3
Hardware: ppc64le
OS: Linux
Reporter: Dr. David Alan Gilbert <dgilbert>
Assignee: Thomas Huth <thuth>
QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Type: Bug
Doc Type: Bug Fix
CC: chayang, dgibson, hannsj_uhl, juzhang, knoel, mdeng, michen, qzhang, thuth, virt-maint
Clones: 1324092 (view as bug list)
Bug Blocks: 1308609, 1359843
Last Closed: 2016-04-29 14:43:49 UTC
Attachments:
  KernelPanic (flags: none)
  screenshotfor64kitB (flags: none)
  64kib (flags: none)

Description Dr. David Alan Gilbert 2016-04-05 09:43:47 UTC
Description of problem:
This is based on an observation looking at the ballooning code.
(and discussion with Amit and Laurent)

The 'balloon_page' code does:

       qemu_madvise(addr, TARGET_PAGE_SIZE,
               deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);

Thus, if the host page size is larger than TARGET_PAGE_SIZE (which I think is the case when Power is configured for 16 KiB or 64 KiB pages), that qemu_madvise should always fail and never actually discard any memory from the host.
So ballooning will appear to work, but you just never get any RAM back on the host.
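
For illustration, a minimal stand-alone sketch (not QEMU code; BALLOON_PAGE_SIZE is just a stand-in for the 4 KiB granularity the balloon device works in) that shows the size mismatch this report is about:

    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for the 4 KiB granularity of the balloon device
     * (TARGET_PAGE_SIZE in QEMU). */
    #define BALLOON_PAGE_SIZE 4096L

    int main(void)
    {
        long host_page_size = sysconf(_SC_PAGESIZE);  /* 65536 on typical ppc64 hosts */

        printf("host page size: %ld, balloon granularity: %ld\n",
               host_page_size, BALLOON_PAGE_SIZE);

        if (host_page_size > BALLOON_PAGE_SIZE) {
            /* The situation described above: an madvise() of
             * BALLOON_PAGE_SIZE bytes cannot release just one balloon chunk. */
            printf("host pages are larger than balloon chunks\n");
        }
        return 0;
    }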

Version-Release number of selected component (if applicable):
2.6 and older

How reproducible:
theoretically 100%

Steps to Reproduce:
1. Create a large VM using lots of RAM
2. Use that RAM in the guest by something that dirties a lot of it
3. Record the amount of host RAM used
4. Inflate the balloon in the guest to free most of the RAM
5. Record the amount of host RAM used

Actual results:
From looking at the code, I reckon the values recorded in steps 3 and 5 will be similar and will not reflect the ballooned RAM.

Expected results:
The host RAM in (5) should decrease by an amount similar to the amount of ballooned memory.

Additional info:

Comment 3 David Gibson 2016-04-06 01:11:45 UTC
Drat.  I thought I'd fixed the balloon on Power way back at IBM, but looks like that was just fixing it to the point of not blowing up when trying to use it - not actually ballooning usefully.

So... the virtio-balloon is kinda broken by design, but we'll have to do what we can with what we have for now (virtio standardization efforts apparently have a new, better balloon design).

IIRC, the balloon is described by spec as working in 4kiB chunks, so TARGET_PAGE_SIZE is Just Plain Wrong on that front.

I think what we need to do is to batch contiguous 4kiB chunks listed by the guest until they make a whole host page, then do the madvise().
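
Something along these lines, as a rough sketch only (names invented here, not an actual patch): collect the guest's 4 KiB chunks and only call madvise() once one whole, naturally aligned host page has been assembled.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define BALLOON_CHUNK_SIZE 4096UL      /* granularity the balloon spec uses */

    /* Hypothetical batching state: one host page being assembled from
     * contiguous 4 KiB chunks reported by the guest. */
    struct balloon_batch {
        uintptr_t host_page;       /* start of the host page being assembled */
        size_t    filled;          /* bytes of it reported by the guest so far */
        size_t    host_page_size;  /* e.g. 65536 on ppc64 */
    };

    /* Called once per 4 KiB chunk the guest puts into the balloon. */
    static void balloon_chunk(struct balloon_batch *b, uintptr_t addr)
    {
        uintptr_t page   = addr & ~((uintptr_t)b->host_page_size - 1);
        size_t    offset = addr &  ((uintptr_t)b->host_page_size - 1);

        if (page == b->host_page && offset == b->filled) {
            /* Next contiguous chunk of the page we are already assembling. */
            b->filled += BALLOON_CHUNK_SIZE;
        } else if (offset == 0) {
            /* A new run can only start at the beginning of a host page. */
            b->host_page = page;
            b->filled = BALLOON_CHUNK_SIZE;
        } else {
            /* Out-of-order or partial request: drop the run and discard
             * nothing rather than risk throwing away live guest data. */
            b->host_page = 0;
            b->filled = 0;
            return;
        }

        if (b->filled == b->host_page_size) {
            /* The whole host page has been ballooned, so the madvise() is
             * both aligned and full-sized and can really release memory. */
            madvise((void *)b->host_page, b->host_page_size, MADV_DONTNEED);
            b->host_page = 0;
            b->filled = 0;
        }
    }

On hosts where the page size already is 4 KiB, every chunk completes a "run" immediately, so the existing behaviour would be unchanged.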

Comment 4 Dr. David Alan Gilbert 2016-04-06 09:59:22 UTC
(In reply to David Gibson from comment #3)
> Drat.  I thought I'd fixed the balloon on Power way back at IBM, but looks
> like that was just fixing it to the point of not blowing up when trying to
> use it - not actually ballooning usefully.
> 
> So... the virtio-balloon is kinda broken by design, but we'll have to do
> what we can with what we have for now (virtio standardization efforts
> apparently have a new, better balloon design).
> 
> IIRC, the balloon is described by spec as working in 4kiB chunks, so
> TARGET_PAGE_SIZE is Just Plain Wrong on that front.
> 
> I think what we need to do is to batch contiguous 4kiB chunks listed by the
> guest until they make a whole host page, then do the madvise().

Yes, you'd have to be careful about anything that doesn't start on a host page boundary, and similar edge cases.
I'm not sure if there's anything the guest could know that would help it make only sane inflation requests.

Comment 5 Thomas Huth 2016-04-11 18:09:16 UTC
I've now tested the balloon on our P8 server, and at first glance it seems to be working - I can see the amount of free memory in the host going up when I decrease the memory of the guest via the balloon.
However, after looking at the code of QEMU and the kernel a little more closely, it is clear that this is not working as expected *and might even cause memory corruption in some cases*:

- The madvise syscall in the kernel rounds up the length parameter to a multiple of the host PAGE_SIZE = 65536 (see mm/madvise.c):

	len = (len_in + ~PAGE_MASK) & PAGE_MASK;

- The madvise syscall in the kernel returns with an error if the address is not aligned to the host PAGE_SIZE.

So for the very first 4k chunk of the 64k page, the madvise() succeeds, but for all other chunks, the call is rejected. This means that if QEMU tries to free the whole 64k page, it accidentally works right. But if it only tries to free the 4k chunks that are not aligned to 64k, then it silently fails. *And if it only tries to free the first 4k chunk without all the others, then even guest memory corruption might occur!*
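
For what it's worth, this behaviour can be probed from plain user space with a little test program like the following (illustrative only, nothing QEMU-specific), when run on a 64 KiB-page host:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);    /* 65536 on a 64 KiB-page host */
        char *buf = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(buf, 0xaa, psize);              /* dirty the whole host page */

        /* Aligned address, 4 KiB length: the kernel rounds the length up to
         * PAGE_SIZE, so this succeeds - and throws away all 64 KiB, not just
         * the first 4 KiB chunk (the corruption case). */
        int r1 = madvise(buf, 4096, MADV_DONTNEED);
        printf("aligned 4k:   ret=%d, last byte of the page is now 0x%02x\n",
               r1, (unsigned char)buf[psize - 1]);

        /* Address not aligned to the host page size: rejected with EINVAL,
         * so the remaining 4 KiB chunks are never released at all. */
        int r2 = madvise(buf + 4096, 4096, MADV_DONTNEED);
        printf("unaligned 4k: ret=%d, errno=%s\n",
               r2, r2 ? strerror(errno) : "0");

        munmap(buf, psize);
        return 0;
    }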

We definitely need to fix this...

Comment 6 David Gibson 2016-04-12 01:28:18 UTC
Ah!  So it's kind of the reverse of what we first thought.  In fact, in the common configuration (64kiB pages on both host and guest) it will work (by accident).  But if you have a 4kiB page guest on a 64kiB page host, you can get data corruption.  Nasty.

Same solution, AFAICT, though.

Comment 7 Min Deng 2016-04-12 07:00:25 UTC
Hi developers,
   As we all know, a ppc host supports 64 KiB guests. Is it possible for QE to create a 4 KiB guest on a ppc host? If there is a way, please tell us. Thanks in advance.

Min

Comment 8 David Gibson 2016-04-13 05:40:29 UTC
Min,

Yes, it's possible to create a guest using 4kiB pages, but it will require building a custom kernel.  AFAIK all current distributions supporting Power use 64kiB pages by default.

Comment 9 Min Deng 2016-04-13 06:09:03 UTC
(In reply to David Gibson from comment #8)
> Min,
> 
> Yes, it's possible to create a guest using 4kiB pages, but it will require
> building a custom kernel.  AFAIK all current distributions supporting Power
> use 64kiB pages by default.

 Could you please help to build such a custom kernel? Thanks a lot.
Min

Comment 10 Thomas Huth 2016-04-13 06:23:41 UTC
(In reply to dengmin from comment #9)
>  Could you please help to build such a custom kernel? Thanks a lot.

I can try to help here. FWIW, I already tried to compile an upstream kernel with 4k pages, but I got an error while trying to compile it... I'll have a try with a downstream kernel next and will then let you know the results.

Comment 12 Thomas Huth 2016-04-13 11:54:18 UTC
Suggested patch upstream: https://patchwork.ozlabs.org/patch/609982/

Comment 14 Min Deng 2016-04-14 06:31:43 UTC
Created attachment 1147048 [details]
KernelPanic

Failure on my guest.

Comment 16 Thomas Huth 2016-04-14 06:52:11 UTC
(In reply to dengmin from comment #14)
> Failure on my guest.

Looking at that screenshot, the only thing I could imagine is that you accidentally tried to install my 4k kernel on a little-endian guest. Could you please verify that your guest is a big-endian installation, not a little-endian one?

Comment 17 Thomas Huth 2016-04-14 09:14:50 UTC
Decreased the priority/severity a little bit since the problem only occurs with 4k guests on 64k hosts - and 4k (server) guests are rather hard to find in the wild nowadays.

Comment 19 Min Deng 2016-04-14 09:59:08 UTC
Created attachment 1147105 [details]
screenshotfor64kitB

Comment 20 Min Deng 2016-04-14 09:59:54 UTC
Created attachment 1147106 [details]
64kib

Comment 23 Thomas Huth 2016-04-29 14:43:49 UTC
I'm closing this ticket for PPC, since there are no 4k guests available in the wild (all major distros seem to use a 64k page size in their guests nowadays), and we thus don't have a real problem here. We can still track the issue itself in BZ 1324092 (but we certainly do not need two tickets to track this).

*** This bug has been marked as a duplicate of bug 1324092 ***