Bug 1123274

Summary: CentOS 6.5 VMs crash after VM Migration (after upgrading oVirt from 3.4.2 to 3.4.3)
Product: [Retired] oVirt Reporter: Sokratis <sokratis123k>
Component: ovirt-engine-core Assignee: Martin Sivák <msivak>
Status: CLOSED CURRENTRELEASE QA Contact: Pavel Stehlik <pstehlik>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.4 CC: amedeo, bugs, ecohen, fromani, gklein, iheim, michal.skrivanek, msivak, nugtz.n1gz, ofrenkel, rbalakri, redhat-bugzilla, sokratis123k, yeylon
Target Milestone: ---   
Target Release: 3.5.1   
Hardware: x86_64   
OS: Linux   
Whiteboard: sla
Fixed In Version: ovirt-3.5.1_rc1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-01-21 16:03:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
source_and_destination_hosts_logs none
vm_kernel_panic_during_live_migration none
CentOS 5 VM Migration with ballooning enabled none
failed migration destination host logs - 07102014 none
source_and_destination_host_details_07102014 none

Description Sokratis 2014-07-25 08:25:44 UTC
Description of problem:

I upgraded oVirt yesterday from 3.4.2 to 3.4.3 and performed a yum upgrade on all ovirt nodes as well without rebooting any of them.


After completing the upgrade, I tried to perform manual migration of VMs to a different host. The result was that all Windows 2003/2008  and CentOS 5 VMs were migrated successfully without any issues but all CentOS 6.5 VMs crashed although migration was completed successfully.


By looking at the console of the CentOS 6.5 VMs I noticed that a few seconds after migration was complete, all the processes were killed, many out of memory errors were thrown as well as messages like "virtio_balloon virtio3: Out of puff! Can't get 256 pages" and the VM ended up in kernel panic so I had to perform a manual poweroff and start the VM again.

Version-Release number of selected component (if applicable): 3.4.3

How reproducible:

Steps to Reproduce:
1. Perform VM migration
2.
3.

Actual results:

Migration completes successfully and VM crashes after a few seconds

Expected results:

Migration completes successfully and VM continues working normally

Additional info:

This issue occurs only on CentOS 6.5 VMs.

Comment 1 Michal Skrivanek 2014-08-15 13:57:46 UTC
This might be an issue with hosts being configured or upgraded differently. Which cluster level is it?
Can you verify the actual qemu and libvirt versions on both sides and, if you can still reproduce the problem, attach the qemu, libvirt, and vdsm logs?

Comment 2 Sokratis 2014-09-02 12:25:01 UTC
Created attachment 933754 [details]
source_and_destination_hosts_logs

Comment 3 Sokratis 2014-09-02 12:28:04 UTC
Cluster Level is 3.4

Hosts were added to ovirt-engine via the gui and then were put in maintenance mode.

Then a yum upgrade was performed from the CLI on each host.

The two hosts I'm using for the tests have the same versions as below:

Kernel Version: 2.6.32 - 431.23.3.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.14
LIBVIRT Version: libvirt-0.10.2-29.el6_5.12
VDSM Version: vdsm-4.14.11.2-0.el6

I have attached the logs from a reproduction of the problem on these two hosts.

Comment 4 Francesco Romani 2014-09-15 07:42:12 UTC
Investigation in progress. In the meantime, this bug may be relevant:

https://bugzilla.redhat.com/show_bug.cgi?id=1002360

Comment 5 Francesco Romani 2014-09-15 15:28:26 UTC
(In reply to Sokratis from comment #0)
[...]
> By looking at the console of the CentOS 6.5 VMs I noticed that a few seconds
> after migration was complete, all the processes were killed, many out of
> memory errors were thrown as well as messages like "virtio_balloon virtio3:
> Out of puff! Can't get 256 pages" and the VM ended up in kernel panic so I
> had to perform a manual poweroff and start the VM again.

This matches what I can read here:

https://bugzilla.redhat.com/show_bug.cgi?id=1002360#c7

So, let's check whether oVirt/VDSM misconfigured the balloon on the destination host.
Was ballooning enabled? If so, how is it configured?

Does migration of CentOS 6.5 VMs work OK if you disable ballooning?

Comment 6 Sokratis 2014-09-18 08:16:42 UTC
Ballooning was enabled (with 1MB reservation).

I disabled it and launched a migration and it failed with the same errors.

Comment 7 Sokratis 2014-10-01 07:14:02 UTC
I upgraded oVirt to 3.4.4 and performed the migration on 2 different VMs.

Both are running CentOS 6.5 with latest kernel (2.6.32-431.29.2).

The migration was successful  on one VM but failed on the other one with the same errors (both with ballooning enabled and disabled).

Comment 8 Francesco Romani 2014-10-01 07:17:18 UTC
(In reply to Sokratis from comment #7)
> I upgraded oVirt to 3.4.4 and performed the migration on 2 different VMs.
> 
> Both are running CentOS 6.5 with latest kernel (2.6.32-431.29.2).
> 
> The migration was successful  on one VM but failed on the other one with the
> same errors (both with ballooning enabled and disabled).

It looks more and more like a qemu issue. Can you provide the error the guest kernel reports when it crashes after a migration?

Comment 9 Sokratis 2014-10-01 10:44:14 UTC
The output during the kernel panic is very big and I could only capture the last part which I have uploaded.

Is there a way to capture the whole console output (I'm using virt-viewer)?

Comment 10 Sokratis 2014-10-01 10:45:02 UTC
Created attachment 943014 [details]
vm_kernel_panic_during_live_migration

Comment 11 Francesco Romani 2014-10-01 10:47:30 UTC
(In reply to Sokratis from comment #9)
> The output during the kernel panic is very big and I could only capture the
> last part which I have uploaded.
> 
> Is there a way to capture the whole console output (I'm using virt-viewer)?

It is a good enough start, thanks.

Comment 12 Francesco Romani 2014-10-03 10:29:26 UTC
I checked with qemu developers locally.

This
"""
By looking at the console of the CentOS 6.5 VMs I noticed that a few seconds after migration was complete, all the processes were killed, many out of memory errors were thrown as well as messages like "virtio_balloon virtio3: Out of puff! Can't get 256 pages" and the VM ended up in kernel panic so I had to perform a manual poweroff and start the VM again.
"""

is strictly tied to ballooning, so it cannot happen when ballooning is disabled.

If migrations fail anyway even with ballooning disabled, it must be a different issue, and we need logs for that.

Comment 13 Sokratis 2014-10-03 12:47:45 UTC
Indeed the migration succeeds with ballooning disabled.

I also noticed that the migration on a CentOS 5 VM succeeds with ballooning enabled but some errors are thrown which can be seen in the file I just uploaded. Should I worry about these errors or not?


To sum up:

1) CentOS 5 VMs migrate successfully with ballooning enabled but some errors are thrown

2) CentOS 6 VMs fail to migrate with ballooning enabled but succeed with ballooning disabled.

I can perform a migration again and attach the logs. Which logs do you need?

Comment 14 Sokratis 2014-10-03 12:48:45 UTC
Created attachment 943711 [details]
CentOS 5 VM Migration with ballooning enabled

Comment 15 Francesco Romani 2014-10-06 13:23:37 UTC
(In reply to Sokratis from comment #13)
> 2) CentOS 6 VMs fail to migrate with ballooning enabled but succeed with
> ballooning disabled.
> 
> I can perform a migration again and attach the logs. Which logs do you need?

VDSM and libvirt logs.

But unless VDSM badly mishandles ballooning during migrations, which is unlikely, this issue has to be moved to qemu.

Comment 17 Francesco Romani 2014-10-06 14:15:57 UTC
To wrap up (until new logs change the scenario, but this is unlikely)

- migration succeeds
- after migration, the destination VM progressively inflates the balloon to reach the configured target
- the guest CentOS 6.x kernel can't keep enough memory free, then crashes with OOM
- I can't recall, nor find evidence of, VDSM changes in balloon handling in the 3.4.2/3.4.3 timeframe, so the upgrade should not be relevant

will move to qemu unless new evidence shows up

Comment 18 Francesco Romani 2014-10-06 14:17:26 UTC
workaround exists (disable ballooning) -> moving to "high"

Comment 19 Sokratis 2014-10-07 07:33:03 UTC
Created attachment 944460 [details]
failed migration destination host logs - 07102014

Comment 20 Sokratis 2014-10-07 07:36:01 UTC
I have attached the logs from a new migration that I performed.

Can you elaborate on the following comment:

"after migration, the destination VM progressively inflates the balloon to reach the configured target"

The memory reservation is set to 1MB. Does this affect ballooning? What do you mean by "to reach the configured target"?

What about the error screenshot that I attached regarding the CentOS 5 VM?

Comment 21 Francesco Romani 2014-10-07 11:12:47 UTC
(In reply to Sokratis from comment #20)
> I have attached the logs from a new migration that I performed.
> 
> Can you elaborate on the following comment:
> 
> "after migration, the destination VM progressively inflates the balloon to reach the
> configured target"
> The memory reservation is set to 1MB. Does this affect ballooning? What do
> you mean by "to reach the configured target"?

That was what I inferred from the logs and the reported errors, but I'll move this to SLA to get a better understanding, because migration-wise everything looks OK.


> What about the error screenshot that I attached regarding the CentOS 5 VM?

It seems that CentOS 5 copes better with OOM, but even in that case the guest is low on resources.

Comment 22 Francesco Romani 2014-10-07 11:14:08 UTC
Hi Martin,

can you take a look at the behaviour of the balloon here?

Comment 23 Martin Sivák 2014-10-07 11:47:09 UTC
Hi,

I need mom.log for that.

Can you please attach the logs for mom as well? Should be in the same directory as the vdsm.log file.

Comment 24 Martin Sivák 2014-10-07 11:51:52 UTC
Also what does Reservation: 1 MB mean? Is that the amount of guaranteed memory configured for the guests in the webadmin?

If that is so then this is a misconfiguration as well. The guest can't work with just 1MB of RAM, so it basically returns all free memory and crashes when kernel tries to allocate some internal buffer.

Can I also get the full output of

free
vdsClient -s 0 getVdsStats 
vdsClient -s 0 getAllVmStats

from the source host before migration is attempted and from the destination after the migration finishes?

Thanks
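
A minimal Python sketch for gathering the three outputs requested above in one go. The command lines are the ones listed in this comment; the helper names (collect, write_report) and the report file path are hypothetical, and the script assumes a Python 3 interpreter is available on the host, which may not be the case on an EL6 hypervisor. Run it once on the source host before the migration and once on the destination afterwards.

```python
import subprocess

# Commands requested in the comment above; vdsClient is only present
# on a host where vdsm is installed.
COMMANDS = [
    ["free"],
    ["vdsClient", "-s", "0", "getVdsStats"],
    ["vdsClient", "-s", "0", "getAllVmStats"],
]


def collect(commands=COMMANDS):
    """Run each command and return {command line: captured output}."""
    results = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            results[key] = proc.stdout
        except OSError as exc:
            # The tool is missing on this machine; record that instead of failing.
            results[key] = "could not run: %s" % exc
    return results


def write_report(path, results):
    """Write all captured outputs to one text file (path is the caller's choice)."""
    with open(path, "w") as handle:
        for key, output in results.items():
            handle.write("### %s\n%s\n" % (key, output))
```

For example, write_report("before_migration.txt", collect()) on the source host produces a single file that can be attached to the bug.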

Comment 25 Sokratis 2014-10-07 14:18:52 UTC
Created attachment 944608 [details]
source_and_destination_host_details_07102014

Comment 26 Sokratis 2014-10-07 14:20:04 UTC
Yes the 1MB reservation is configured from the webadmin portal. What should be the minimum value for the migration to succeed and ballooning to work properly?

I have also attached the logs you asked for.

Comment 27 Sandro Bonazzola 2014-10-17 12:14:23 UTC
Moving pending bugs not fixed in 3.5.0 to 3.5.1.

Comment 28 Sokratis 2014-11-06 10:00:31 UTC
After powering off the same VM and enabling ballooning with 384MB reservation (total configured RAM is 2GB) I was able to perform a successful migration.

It looks like in CentOS 6 the minimum reservation of 1MB isn't enough for the migration to succeed. However since it works on CentOS 5 and Windows VMs it should work on CentOS 6 as well. Furthermore it's important to be able to set a very low reservation to increase VM density on a host.

Comment 29 Martin Sivák 2014-11-06 10:52:36 UTC
The reserved memory should be set to an amount of memory that allows the guest OS to run properly. The OS will never return memory that it needs, but it sometimes returns enough memory to make kernel buffer allocations impossible.

I see a bug happening on the destination. The first balloon status after the migration returned 0.

The logic increased it to be at least the minimum memory (1024 kB) and sent it to the VM, which obeyed and returned all memory, and then crashed when a kernel buffer was needed.
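
The failure mode described above can be illustrated with a small Python sketch. This is not MOM or VDSM code: the function names are hypothetical, and the guard in guarded_target is only one plausible way to avoid acting on a balloon status of 0, not necessarily the fix that was merged. The values are in kB, with the 1024 kB minimum taken from this comment.

```python
MIN_GUARANTEED_KB = 1024  # the 1 MB reservation from this report


def naive_target(balloon_cur_kb, min_guaranteed_kb):
    """Clamp the reported balloon size up to the guaranteed minimum."""
    return max(balloon_cur_kb, min_guaranteed_kb)


def guarded_target(balloon_cur_kb, min_guaranteed_kb):
    """Return None (send nothing) while the balloon status is still unknown."""
    if balloon_cur_kb <= 0:
        return None
    return max(balloon_cur_kb, min_guaranteed_kb)


configured_kb = 2 * 1024 * 1024  # a 2 GB VM, as in comment 28

# A status of 0 right after migration turns into a 1024 kB target,
# which asks the guest to give back almost all of its memory.
print(naive_target(0, MIN_GUARANTEED_KB))                 # 1024
print(guarded_target(0, MIN_GUARANTEED_KB))               # None
print(guarded_target(configured_kb, MIN_GUARANTEED_KB))   # 2097152
```

With a realistic guaranteed-memory value the naive clamp would be far less destructive, which is why the 1 MB reservation and the 0 status only crash the guest in combination.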

Comment 30 Sokratis 2014-11-06 13:34:32 UTC
Since CentOS 5 VMs are able to complete the migration even with 1MB reservation, there must be a difference in the way the CentOS 6 kernel handles the ballooning process. Unless the problem is in the way qemu/vdsm handles the migration.

Comment 31 Martin Sivák 2014-11-18 11:18:33 UTC
I am pretty sure that CentOS 5 and CentOS 6 handle the ballooning requests differently.

Reserving just 1 MB for a CentOS VM is a misconfiguration for sure though. The OS won't be able to boot with that amount of memory if we ever move to a memory hotplug approach (giving the VM more memory on demand instead of asking the VMs to return memory).

But there was a bug as well that was fixed in the master branch and proposed for 3.5.1.

Comment 32 Sokratis 2014-11-18 11:33:41 UTC
So if the bug is fixed will we be able to migrate VMs with 1MB reservation or not?

If not, what should we do to fix this? Is there a minimum amount of RAM (regardless of the total configured RAM) that should be reserved on a CentOS 6 VM to be able to migrate properly?

I tested the same scenario on the same VM I did before (which succeeded with 384 MB reservation) with 100MB reservation and it failed to migrate.

The problem is that currently there are many CentOS 6 VMs running in our cluster and it will be very difficult to shutdown all of them in order to configure the memory reservation.

Comment 33 Martin Sivák 2014-11-18 12:44:18 UTC
MOM won't try to balloon your VM to 0 after migration. It will leave the VM some free memory and so the VM probably won't crash.

The memory is not needed specifically for migration. The migration was successful and the hypervisor then decided to ask for all the memory that was reported as not needed. Since you told it that the VM can run with just 1 MB, it tried to do that, and the kernel inside the VM crashed.

Treat the guaranteed memory as the amount that is needed for the OS inside the VM to boot and run. I advise you to update the configuration of your VMs so they get the proper value once they are rebooted in the future.

CentOS 6 specifies the recommended minimal amount of memory for CLI operation as 392 MB.
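
As a rough aid for auditing existing VMs, here is a minimal Python sketch that flags guaranteed-memory values below a chosen floor. The 392 MB default is the figure quoted above; the function name and the input dictionary are hypothetical (the values would have to be copied from the webadmin portal or exported by other means), and a guest running real workloads may well need a higher floor.

```python
CENTOS6_CLI_MIN_MB = 392  # figure quoted in the comment above


def underprovisioned(vms, floor_mb=CENTOS6_CLI_MIN_MB):
    """Return names of VMs whose guaranteed memory is below floor_mb.

    vms maps a VM name to its guaranteed memory in MB (hypothetical input,
    e.g. copied by hand from the webadmin portal).
    """
    return sorted(name for name, guaranteed_mb in vms.items()
                  if guaranteed_mb < floor_mb)


example = {"vm-a": 1, "vm-b": 384, "vm-c": 512}
print(underprovisioned(example))  # ['vm-a', 'vm-b']
```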

Comment 34 Sokratis 2014-11-18 14:02:33 UTC
I tested the migration on another VM with 4096MB configured RAM and 512MB reservation.

The migration was successful but 'free -m' doesn't report the configured RAM.

Before the migration the output of 'free -m' was:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:          3829       1355       2473          0         21        341
-/+ buffers/cache:        992       2836
Swap:         3894          0       3894

Right after the migration the output of 'free -m' was:


[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           431        343         88          0          0         23
-/+ buffers/cache:        319        112
Swap:         3894        691       3203
[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           245        150         94          0          0         15
-/+ buffers/cache:        133        111
Swap:         3894        870       3024
[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           245        152         92          0          0         19
-/+ buffers/cache:        132        112
Swap:         3894        868       3026
[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           245        148         96          0          1         21
-/+ buffers/cache:        125        119
Swap:         3894        875       3019

The migration was completed 10 minutes ago and the output of 'free -m' is:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:          2319        589       1729          0         11        135
-/+ buffers/cache:        441       1877
Swap:         3894        726       3168

and it still hasn't reported the configured size (4096MB). 

Is this happening because of the balloon driver? How is the output of free related with the ballooning and the configured RAM? 

By the way, I configured the reservation to 392MB initially and the migration failed. It seems that the minimum amount of 392MB is only applicable to low usage OSes without any applications running. This particular VM has a JVM running and the migration was successful only after reserving 512MB, so the required reservation depends on the system load, right?

Comment 35 Sokratis 2014-11-18 14:09:56 UTC
The correct amount of configured RAM is now reported by free, 20 minutes after the migration:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:          3829        794       3035          0         18        233
-/+ buffers/cache:        542       3287
Swap:         3894        622       3272
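
The 'free -m' outputs above can be turned into an estimate of how much memory the balloon was holding at each point. The sketch below is an illustration only: it assumes that the guest's balloon driver subtracts ballooned pages from the total that 'free' reports, so the drop in the Mem: total relative to the pre-migration baseline approximates the balloon size. The function name is hypothetical; the sample rows are copied from the outputs in this bug.

```python
def mem_total_mb(free_output):
    """Return the 'total' column of the Mem: row from `free -m` output."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            return int(fields[1])
    raise ValueError("no 'Mem:' row found")


BEFORE = """\
             total       used       free     shared    buffers     cached
Mem:          3829       1355       2473          0         21        341
"""
AFTER = """\
             total       used       free     shared    buffers     cached
Mem:           245        150         94          0          0         15
"""

baseline = mem_total_mb(BEFORE)
current = mem_total_mb(AFTER)
print(baseline - current)  # 3584 (MB taken by the balloon, approximately)
```

By the same arithmetic, the reading of 2319 MB taken 10 minutes after the migration corresponds to roughly 1510 MB still ballooned, and the final reading of 3829 MB to a fully deflated balloon.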

Comment 36 Martin Sivák 2014-11-18 15:55:39 UTC
> The migration was successful but 'free -m' doesn't report the configured RAM.

That is what the fix should solve.

> By the way, I configured the reservation to 392MB initially and the migration 
> failed. It seems that the minimum amount of 392MB is only applicable to low 
> usage OSes without any applications running.

Right. The migration crash was probably triggered by the issue we solved as part of this bug. We normally account for the amount of used memory and try to not starve the VM that much.

Comment 37 Sokratis 2014-11-19 08:12:58 UTC
>> By the way, I configured the reservation to 392MB initially and the migration 
>> failed. It seems that the minimum amount of 392MB is only applicable to low 
>> usage OSes without any applications running.

> Right. The migration crash was probably triggered by the issue we solved as part of
> this bug. We normally account for the amount of used memory and try to not starve
> the VM that much.

So will we be able to migrate VMs with 1MB reservation successfully with this patch or not? What will this bug actually fix?

Comment 38 Sokratis 2014-12-10 17:06:01 UTC
Is this still considered to be included in oVirt 3.5.1?

Comment 39 Martin Sivák 2014-12-10 17:24:09 UTC
Yes, it is already merged to the proper branches.

Comment 40 Sandro Bonazzola 2015-01-15 14:25:43 UTC
This is an automated message: 
This bug should be fixed in oVirt 3.5.1 RC1, moving to QA

Comment 41 Sandro Bonazzola 2015-01-21 16:03:13 UTC
oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.

Comment 42 Alisson Savoini Dias 2015-02-04 12:34:13 UTC
We are on oVirt 3.5.1 with CentOS 7 hosts, and this still seems to be happening. At the moment it seems to happen only on CentOS 6.6 servers; all the others migrate OK. The funny thing is that the interface shows the migration was successful, but the server has completely crashed: the CPU is at 100% and the console is not responding, just black.