Bug 1123274
Summary: | CentOS 6.5 VMs crash after VM migration (after upgrading oVirt from 3.4.2 to 3.4.3) | ||
---|---|---|---|
Product: | [Retired] oVirt | Reporter: | Sokratis <sokratis123k> |
Component: | ovirt-engine-core | Assignee: | Martin Sivák <msivak> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Pavel Stehlik <pstehlik> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.4 | CC: | amedeo, bugs, ecohen, fromani, gklein, iheim, michal.skrivanek, msivak, nugtz.n1gz, ofrenkel, rbalakri, redhat-bugzilla, sokratis123k, yeylon |
Target Milestone: | --- | ||
Target Release: | 3.5.1 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | sla | ||
Fixed In Version: | ovirt-3.5.1_rc1 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-01-21 16:03:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Sokratis
2014-07-25 08:25:44 UTC
Might be an issue with the hosts being configured or upgraded differently. Which cluster level is it? Can you verify the actual qemu and libvirt versions on both sides, and if you still reproduce the problem, attach the qemu, libvirt, and vdsm logs?

Created attachment 933754 [details]
source_and_destination_hosts_logs
Cluster Level is 3.4. Hosts were added to ovirt-engine via the GUI and then put in maintenance mode. Then a yum upgrade was performed on the CLI on each host. The two hosts I'm using for the tests have the same versions:

Kernel Version: 2.6.32-431.23.3.el6.x86_64
KVM Version: 0.12.1.2-2.415.el6_5.14
LIBVIRT Version: libvirt-0.10.2-29.el6_5.12
VDSM Version: vdsm-4.14.11.2-0.el6

I have attached the logs from a reproduction of the problem on these two hosts.

Investigation in progress. In the meantime, this bug may be relevant: https://bugzilla.redhat.com/show_bug.cgi?id=1002360

(In reply to Sokratis from comment #0)
[...]
> By looking at the console of the CentOS 6.5 VMs I noticed that a few seconds
> after migration was complete, all the processes were killed, many out of
> memory errors were thrown as well as messages like "virtio_balloon virtio3:
> Out of puff! Can't get 256 pages", and the VM ended up in kernel panic so I
> had to perform a manual poweroff and start the VM again.

This matches what I can read here: https://bugzilla.redhat.com/show_bug.cgi?id=1002360#c7

So, let's check whether oVirt/VDSM misconfigured the balloon on the destination host. Was ballooning enabled? If so, how is it configured? Do migrations of CentOS 6.5 guests work OK if you disable ballooning?

Ballooning was enabled (with a 1MB reservation). I disabled it and launched a migration, and it failed with the same errors.

I upgraded oVirt to 3.4.4 and performed the migration on 2 different VMs. Both are running CentOS 6.5 with the latest kernel (2.6.32-431.29.2). The migration was successful on one VM but failed on the other one with the same errors (both with ballooning enabled and disabled).

(In reply to Sokratis from comment #7)
> I upgraded oVirt to 3.4.4 and performed the migration on 2 different VMs.
>
> Both are running CentOS 6.5 with latest kernel (2.6.32-431.29.2).
>
> The migration was successful on one VM but failed on the other one with the
> same errors (both with ballooning enabled and disabled).

This looks more and more like a qemu issue. Can you provide the error the guest kernel reports when it crashes after a migration?

The output during the kernel panic is very big and I could only capture the last part, which I have uploaded. Is there a way to capture the whole console output (I'm using virt-viewer)?

Created attachment 943014 [details]
vm_kernel_panic_during_live_migration
(In reply to Sokratis from comment #9)
> The output during the kernel panic is very big and I could only capture the
> last part which I have uploaded.
>
> Is there a way to capture the whole console output (I'm using virt-viewer)?

It is a good enough start, thanks. I checked with qemu developers locally. This:

"""
By looking at the console of the CentOS 6.5 VMs I noticed that a few seconds after migration was complete, all the processes were killed, many out of memory errors were thrown as well as messages like "virtio_balloon virtio3: Out of puff! Can't get 256 pages", and the VM ended up in kernel panic so I had to perform a manual poweroff and start the VM again.
"""

is strictly tied to ballooning, so it must not happen when ballooning is disabled. If migrations fail anyway even with ballooning disabled, it must be a different issue, and we need logs for that.

Indeed the migration succeeds with ballooning disabled. I also noticed that the migration of a CentOS 5 VM succeeds with ballooning enabled, but some errors are thrown, which can be seen in the file I just uploaded. Should I worry about these errors or not?

To sum up:
1) CentOS 5 VMs migrate successfully with ballooning enabled, but some errors are thrown.
2) CentOS 6 VMs fail to migrate with ballooning enabled, but succeed with ballooning disabled.

I can perform a migration again and attach the logs. Which logs do you need?

Created attachment 943711 [details]
CentOS 5 VM Migration with ballooning enabled
(In reply to Sokratis from comment #13)
> 2) CentOS 6 VMs fail to migrate with ballooning enabled but succeed with
> ballooning disabled.
>
> I can perform a migration again and attach the logs. Which logs do you need?

VDSM and libvirt logs. But unless VDSM makes a bad mess with ballooning in migrations - which is unlikely - this issue has to be moved to qemu.

To wrap up (until new logs change the scenario, but this is unlikely):
- migration succeeds
- after migration, the destination VM progressively inflates the balloon to reach the configured target
- the guest CentOS 6.x kernel can't keep up with freeing memory, then crashes with OOM
- I can't recall, nor find evidence of, VDSM changes in balloon handling in the 3.4.2/3.4.3 timeframe, so the upgrade should not be relevant

Will move to qemu unless new evidence shows up. A workaround exists (disable ballooning) -> moving to "high".

Created attachment 944460 [details]
failed migration destination host logs - 07102014
I have attached the logs from a new migration that I performed.

Can you elaborate on the following comment:

"after migration, destination VM progressively inflate balloon to reach the configured target"

The memory reservation is set to 1MB. Does this affect ballooning? What do you mean by "to reach the configured target"?

What about the error screenshot that I attached regarding the CentOS 5 VM?

(In reply to Sokratis from comment #20)
> I have attached the logs from a new migration that I performed.
>
> Can you elaborate on the following comment:
>
> "after migration, destination VM progressively inflate balloon to reach the
> configured target"
>
> The memory reservation is set to 1MB. Does this affect ballooning? What do
> you mean "to reach the configured target"?

That was what I inferred from the logs and the reported errors, but I'll move this to SLA to get a better understanding, because migration-wise everything looks OK.

> What about the error screenshot that I attached regarding the CentOS 5 VM?

It seems that CentOS 5 copes with OOM in a better way, but even in that case the guest is low on resources.

Hi Martin, can you take a look at the behaviour of the balloon here?

Hi, I need mom.log for that. Can you please attach the logs for mom as well? They should be in the same directory as the vdsm.log file.

Also, what does "Reservation: 1 MB" mean? Is that the amount of guaranteed memory configured for the guests in the webadmin? If so, then this is a misconfiguration as well. The guest can't work with just 1MB of RAM, so it basically returns all free memory and crashes when the kernel tries to allocate some internal buffer.

Can I also get the full output of:

free
vdsClient -s 0 getVdsStats
vdsClient -s 0 getAllVmStats

from the source host before migration is attempted and from the destination after the migration finishes? Thanks

Created attachment 944608 [details]
source_and_destination_host_details_07102014
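Martin's point about the 1MB guarantee can be put in numbers with a small sketch. This is illustrative only: the function name and the 2GB guest size (taken from the successful test reported below) are assumptions, not MOM's actual code.

```python
MB = 1024  # kB per MB

def max_reclaimable_kb(configured_kb, guaranteed_kb):
    """Upper bound on what ballooning may take from a guest (kB).

    Simplified model: the policy never shrinks the guest below its
    guaranteed ("reserved") memory, so everything above it is fair game.
    """
    return max(0, configured_kb - guaranteed_kb)

configured = 2048 * MB  # a 2 GB guest, as in this report

# With the 1 MB guarantee, practically all guest RAM may be reclaimed,
# leaving nothing for kernel buffer allocations:
print(max_reclaimable_kb(configured, 1 * MB) // MB)    # 2047 (MB)

# With a sane guarantee the guest keeps enough to stay alive:
print(max_reclaimable_kb(configured, 384 * MB) // MB)  # 1664 (MB)
```

In this simplified model the 1MB setting tells the policy it may take back all but 1MB of the guest's memory, which matches the crash Martin describes.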
Yes, the 1MB reservation is configured from the webadmin portal. What should the minimum value be for the migration to succeed and ballooning to work properly? I have also attached the logs you asked for.

Moving pending bugs not fixed in 3.5.0 to 3.5.1.

After powering off the same VM and enabling ballooning with a 384MB reservation (total configured RAM is 2GB), I was able to perform a successful migration. It looks like on CentOS 6 the minimum reservation of 1MB isn't enough for the migration to succeed. However, since it works on CentOS 5 and Windows VMs, it should work on CentOS 6 as well. Furthermore, it's important to be able to set a very low reservation to increase VM density on a host.

The reserved memory should be set to an amount of memory that allows the guest OS to run properly. The OS will never return memory that it needs, but it sometimes returns enough memory to make kernel buffer allocations impossible.

I see a bug happening on the destination. The first balloon status after the migration returned 0. The logic increased it to be at least the minimum memory (1024 kB) and sent it to the VM. Which obeyed and returned all memory... and then crashed when a kernel buffer was needed.

Since CentOS 5 VMs are able to complete the migration even with a 1MB reservation, there must be a difference in the way the CentOS 6 kernel handles the ballooning process. Unless the problem is in the way qemu/vdsm handles the migration.

I am pretty sure that CentOS 5 and CentOS 6 handle the ballooning requests differently. Reserving just 1 MB for a CentOS VM is a misconfiguration for sure, though. The OS won't be able to boot with that amount of memory if we ever move to a memory hotplug approach (giving the VM more memory on demand instead of asking the VMs to return memory). But there was a bug as well that was fixed in the master branch and proposed for 3.5.1.

So if the bug is fixed, will we be able to migrate VMs with a 1MB reservation or not? If not, what should we do to fix this?
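The destination-side bug Martin describes (trusting the first, bogus, zero balloon status after migration) can be sketched as follows. All names, the headroom value, and the guard logic are hypothetical illustrations of the described failure mode and fix, not MOM's real implementation.

```python
def balloon_target_kb(reported_free_kb, used_kb, guaranteed_kb,
                      headroom_kb=64 * 1024):
    """Pick a balloon target for a guest; all values in kB.

    The buggy path trusted the first post-migration report of 0 free
    memory and clamped the target only to the configured minimum
    (1024 kB in this bug), starving the guest. This sketch skips the
    round on a bogus report and otherwise accounts for memory in use,
    as the fix is described to do.
    """
    if reported_free_kb <= 0:
        return None  # stats not trustworthy yet: do not balloon
    # Never shrink below the guarantee, and keep what is in use plus
    # some headroom for kernel buffer allocations.
    return max(guaranteed_kb, used_kb + headroom_kb)

# Bogus first report right after migration: skip instead of
# targeting ~1 MB and crashing the guest kernel.
print(balloon_target_kb(0, 500 * 1024, 1024))            # None

# Normal report: the target respects used memory plus headroom
# even when the configured guarantee is an absurd 1 MB.
print(balloon_target_kb(1500 * 1024, 500 * 1024, 1024))  # 577536
```

The design point is that a guaranteed-memory floor alone is not a safe clamp when the input stats can be stale; the target must also track what the guest actually has in use.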
Is there a minimum amount of RAM (regardless of the total configured RAM) that should be reserved on a CentOS 6 VM for it to migrate properly? I tested the same scenario on the same VM as before (which succeeded with a 384 MB reservation) with a 100MB reservation, and it failed to migrate. The problem is that currently there are many CentOS 6 VMs running in our cluster, and it will be very difficult to shut down all of them in order to configure the memory reservation.

MOM won't try to balloon your VM to 0 after migration. It will leave the VM some free memory, so the VM probably won't crash. The memory is not needed specifically for migration. The migration was successful, and the hypervisor then decided to ask for all the memory that was reported as not needed. Since you told it that the VM can run with just 1 MB, it tried to do that... and the kernel inside the VM crashed.

Treat the guaranteed memory as the amount that is needed for the OS inside the VM to boot and run. I advise you to update the configuration of your VMs so they get the proper value once they are rebooted in the future. CentOS 6 specifies the recommended minimal amount of memory for CLI operation as 392 MB.

I tested the migration on another VM with 4096MB configured RAM and a 512MB reservation. The migration was successful, but 'free -m' doesn't report the configured RAM.
Before the migration the output of 'free -m' was:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:          3829       1355       2473          0         21        341
-/+ buffers/cache:        992       2836
Swap:         3894          0       3894

Right after the migration the output of 'free -m' was:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           431        343         88          0          0         23
-/+ buffers/cache:        319        112
Swap:         3894        691       3203

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           245        150         94          0          0         15
-/+ buffers/cache:        133        111
Swap:         3894        870       3024

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           245        152         92          0          0         19
-/+ buffers/cache:        132        112
Swap:         3894        868       3026

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:           245        148         96          0          1         21
-/+ buffers/cache:        125        119
Swap:         3894        875       3019

The migration was completed 10 minutes ago and the output of 'free -m' is:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:          2319        589       1729          0         11        135
-/+ buffers/cache:        441       1877
Swap:         3894        726       3168

and it still hasn't reported the configured size (4096MB). Is this happening because of the balloon driver? How is the output of free related to the ballooning and the configured RAM?

By the way, I configured the reservation to 392MB initially and the migration failed. It seems that the minimum amount of 392MB is only applicable to low-usage OSes without any applications running. This particular VM has a JVM running, and the migration was successful only after reserving 512MB, so the amount of reservation depends on the system load, right?

The correct amount of configured RAM is now reported by free, 20 minutes after the migration:

[root@host03 ]# free -m
             total       used       free     shared    buffers     cached
Mem:          3829        794       3035          0         18        233
-/+ buffers/cache:        542       3287
Swap:         3894        622       3272

> The migration was successful but 'free -m' doesn't report the configured RAM.

That is what the fix should solve.

> By the way, I configured the reservation to 392MB initially and the migration
> failed. It seems that the minimum amount of 392MB is only applicable to low
> usage OSes without any applications running.

Right. The migration crash was probably triggered by the issue we solved as part of this bug. We normally account for the amount of used memory and try not to starve the VM that much.

>> By the way, I configured the reservation to 392MB initially and the migration
>> failed. It seems that the minimum amount of 392MB is only applicable to low
>> usage OSes without any applications running.
>
> Right. The migration crash was probably triggered by the issue we solved as
> part of this bug. We normally account for the amount of used memory and try
> to not starve the VM that much.

So will we be able to migrate VMs with a 1MB reservation successfully with this patch or not? What will this bug actually fix? Is this still considered for inclusion in oVirt 3.5.1?

Yes, it is already merged to the proper branches.

This is an automated message: This bug should be fixed in oVirt 3.5.1 RC1, moving to QA.

oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.

We are on oVirt 3.5.1 with CentOS 7 hosts, and this seems to be happening. At the moment it seems to happen only on CentOS 6.6 servers; all the others migrate OK. The funny thing is that the interface shows the migration as successful, but the server is completely crashed: the CPU is at 100% and the console is not responding, just black.