Bug 1474704

Summary: VM instabilities with Cisco vPC-DI instances
Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Reporter: Pierre-Andre MOREY <pmorey>
Assignee: Eoghan Glynn <eglynn>
QA Contact: Joe H. Rahme <jhakimra>
CC: areis, awaugama, berrange, dasmith, dorian.grandsire, eglynn, kchamart, pmorey, saime, sbauza, sferdjao, sgordon, skhodri, srevivo, vromanso
Target Milestone: ---
Target Release: ---
Keywords: Triaged
Hardware: x86_64
OS: Unspecified
Last Closed: 2017-08-08 11:19:48 UTC
Type: Bug

Comment 1 Sahid Ferdjaoui 2017-07-25 09:07:01 UTC
If I understand the description correctly, you are saying that a different guest OS (Ubuntu) works as expected, so the problem only happens with the Cisco OS?

- We need a sosreport to investigate the logs. Running nova and libvirt in debug mode would also help in finding the root cause.
- What version of Ubuntu (i.e. which kernel) is used? Same question for the Cisco OS.

Comment 16 skhodri 2017-08-03 13:05:59 UTC
Yes, the problem seems to happen only with the Cisco OS. If they start a Mininet image with 22 CPUs and 114 GB of RAM it works without any issue, while it fails when the Cisco OS is used.

Info from the customer regarding the kernel versions used:
Cisco OS (N5.5): kernel 2.6.3.8
Mininet used by ATOS: kernel 3.16.0-30-generic

Comment 17 Ademar Reis 2017-08-03 13:33:46 UTC
(In reply to skhodri from comment #16)
> Yes, the problem seems to happen only with the Cisco OS. If they start
> a Mininet image with 22 CPUs and 114 GB of RAM it works without any
> issue, while it fails when the Cisco OS is used.
> 
> Info from the customer regarding the kernel versions used:
> Cisco OS (N5.5): kernel 2.6.3.8
> Mininet used by ATOS: kernel 3.16.0-30-generic

If you're talking about the guest OS, this is also expected. In Linux, memory overcommitment is enabled by default, and processes are only killed by the OOM killer when the memory is actually *used*.

So the Linux kernel (on the host) will happily let qemu-kvm start with more memory than the machine has, until that memory is actually used, either by qemu-kvm itself or, in the case of memory allocated for a guest, by the guest OS. Different guest OSes have different memory usage patterns, which is why the behavior differs.
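To illustrate the lazy allocation (a minimal sketch, not from this report; the 8 GiB size is arbitrary): an anonymous mapping succeeds immediately under the default overcommit policy, and physical pages are only committed as they are written.

import mmap

# Map 8 GiB of anonymous memory. Under the default overcommit policy
# (vm.overcommit_memory=0) this succeeds even if the host has less
# free RAM, because no physical pages are committed yet.
size = 8 * 1024**3
region = mmap.mmap(-1, size)

# Physical memory is consumed only as pages are touched. Writing to
# every page of an oversized mapping is what eventually triggers the
# OOM killer, exactly as with a guest that starts using its RAM.
page_size = 4096
for offset in range(0, size, page_size):
    region[offset] = 1  # commits one page at a time

region.close()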

Potential solutions:

First of all, there should be some swap space (I'm not a memory management expert, but for a machine with 130 GB of RAM I would have, at the very least, 64 GB of swap). I see no reason to run a machine with zero swap space. If someone thinks they're doing that as a performance optimization, they're doing it wrong.

Second, they should fine-tune their memory usage and guest size with the explanations above in mind: a proper amount of RAM and swap space, depending on the memory usage patterns of the host and the guest. A sketch of such a sanity check follows.
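A minimal sketch, assuming stdlib-only access to /proc/meminfo (the guest size is this bug's 114 GiB; the 4 GiB safety margin is an arbitrary assumption, not a rule):

def meminfo(field):
    """Return the numeric value of a field from /proc/meminfo.

    Sizes are reported in KiB; HugePages_Total is a page count.
    """
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

# Illustrative numbers from this bug: 130 GiB host, 12 GiB reserved
# as hugepages, no swap, and a 114 GiB small-page guest.
guest_kib = 114 * 1024**2
hugepages_kib = meminfo("HugePages_Total") * meminfo("Hugepagesize")
small_pages_kib = meminfo("MemTotal") - hugepages_kib
swap_kib = meminfo("SwapTotal")

# What is left for QEMU overhead and the rest of the host once the
# guest has touched all of its RAM.
headroom_kib = small_pages_kib + swap_kib - guest_kib
if headroom_kib < 4 * 1024**2:  # arbitrary 4 GiB safety margin
    print("Guest will likely be OOM-killed once it touches its RAM")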

Fixing the problems from comment #14 would also be a good idea.

Comment 18 Sahid Ferdjaoui 2017-08-04 08:42:24 UTC
I think we should close this issue as NOTABUG. 

On a host with 130 GiB of RAM, 12 GiB reserved for hugepages, and no swap, the customer was trying to boot a 114 GiB VM backed by small pages. That leaves only about 4 GiB for QEMU's own overhead and the rest of the host, so it is expected that the OOM killer kills the process at some point.

During the call the customer wanted to consider using only hugepages for the VM, so we assisted them in reserving 114 GiB of hugepages on the host and enabling their use through Nova. The VM then spawned successfully. Since they are now using hugepages, and that memory is locked/reserved for the QEMU process, they should no longer hit the scenario where the process gets killed.
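For reference, Nova requests hugepage backing through the flavor extra spec hw:mem_page_size. A minimal sketch of checking the host-side reservation before booting (the sizes are this bug's numbers; the check itself is my illustration, not part of the support call):

# Verify that enough hugepages are reserved and free for a 114 GiB
# guest before booting it.
fields = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        fields[key] = int(value.split()[0])

page_kib = fields["Hugepagesize"]       # e.g. 2048 for 2 MiB pages
free_kib = fields["HugePages_Free"] * page_kib
needed_kib = 114 * 1024**2              # 114 GiB expressed in KiB

if free_kib < needed_kib:
    raise SystemExit("not enough free hugepages for the guest")
print(f"OK: {free_kib // 1024**2} GiB of hugepages free")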

Comment 19 Sahid Ferdjaoui 2017-08-08 11:19:48 UTC
As per my previous comment, I'm closing this as NOTABUG. Please feel free to reopen if needed.

Comment 20 Stephen Gordon 2017-08-17 13:48:21 UTC
(In reply to Sahid Ferdjaoui from comment #18)
> I think we should close this issue as NOTABUG.
> 
> On a host with 130 GiB of RAM, 12 GiB reserved for hugepages, and no
> swap, the customer was trying to boot a 114 GiB VM backed by small
> pages. That leaves only about 4 GiB for QEMU's own overhead and the
> rest of the host, so it is expected that the OOM killer kills the
> process at some point.
> 
> During the call the customer wanted to consider using only hugepages
> for the VM, so we assisted them in reserving 114 GiB of hugepages on
> the host and enabling their use through Nova. The VM then spawned
> successfully. Since they are now using hugepages, and that memory is
> locked/reserved for the QEMU process, they should no longer hit the
> scenario where the process gets killed.

What's the default swap setup configured by director? Does it follow the guidelines in the documentation:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/ch-swapspace.html

Though that would only result in 4 GB of swap for a machine with more than 64 GB of RAM.
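For reference, a sketch of that sizing table (my paraphrase of the RHEL 7 guidance; treat the thresholds as assumptions and check them against the linked page):

def recommended_swap_gib(ram_gib):
    """Approximate RHEL 7 swap sizing guidance, hibernation excluded."""
    if ram_gib <= 2:
        return 2 * ram_gib   # up to 2 GiB of RAM: twice the RAM
    if ram_gib <= 8:
        return ram_gib       # 2 to 8 GiB of RAM: equal to the RAM
    return 4                 # above 8 GiB of RAM: at least 4 GiB

# The compute node in this bug (130 GiB of RAM) would get only 4 GiB:
print(recommended_swap_gib(130))  # -> 4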

Comment 21 Stephen Gordon 2017-08-17 21:12:34 UTC
(In reply to Stephen Gordon from comment #20)
> (In reply to Sahid Ferdjaoui from comment #18)
> > I think we should close this issue as NOTABUG.
> > 
> > On a host with 130 GiB of RAM, 12 GiB reserved for hugepages, and no
> > swap, the customer was trying to boot a 114 GiB VM backed by small
> > pages. That leaves only about 4 GiB for QEMU's own overhead and the
> > rest of the host, so it is expected that the OOM killer kills the
> > process at some point.
> > 
> > During the call the customer wanted to consider using only hugepages
> > for the VM, so we assisted them in reserving 114 GiB of hugepages on
> > the host and enabling their use through Nova. The VM then spawned
> > successfully. Since they are now using hugepages, and that memory is
> > locked/reserved for the QEMU process, they should no longer hit the
> > scenario where the process gets killed.
> 
> What's the default swap setup configured by director? Does it follow
> the guidelines in the documentation:
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/ch-swapspace.html
> 
> Though that would only result in 4 GB of swap for a machine with more
> than 64 GB of RAM.

Filed Bug 1482681.