Bug 1300680

Summary: Failed to boot more than 1 instance on an OVS-DPDK setup
Product: Red Hat OpenStack Reporter: Eran Kuris <ekuris>
Component: openstack-nova Assignee: Sahid Ferdjaoui <sferdjao>
Status: CLOSED ERRATA QA Contact: Prasanth Anbalagan <panbalag>
Severity: high Docs Contact:
Priority: unspecified    
Version: 8.0 (Liberty) CC: amuller, berrange, chrisw, dasmith, dshaks, edannon, editucci, eglynn, ekuris, fpan, jdonohue, jean-mickael.guerin, joycej, jschluet, kchamart, mlopes, nyechiel, pcm, samuel.gauthier, sbauza, sferdjao, sgordon, srevivo, twilson, vincent.jardin, vromanso
Target Milestone: --- Keywords: TechPreview, ZStream
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: hot
Fixed In Version: openstack-nova-12.0.4-5.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-31 17:36:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1194008, 1295530, 1300693    
Attachments:
  vm failed (flags: none)

Description Eran Kuris 2016-01-21 12:50:16 UTC
Created attachment 1116923 [details]
vm failed

Description of problem:
On an OVS-DPDK setup with 1 controller and 1 compute node (tenant/data network type VLAN), I failed to boot instances. Creating 1 VM succeeds, but every VM after the first fails. When I created multiple VMs at once (Instance Count = 4, for example), a few of the instances booted as ACTIVE and the rest failed.
When using DPDK, VMs must be booted with a flavor that uses hugepages:
$ nova flavor-create  m1.medium_dpdk 6 2048 20 2
$ nova flavor-key m1.medium_dpdk set "hw:mem_page_size=large" 
All logs are attached.
nic type: Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter 
driver: ixgbe
Version-Release number of selected component (if applicable):
[root@puma48 ~]# rpm -qa |grep neutro
openstack-neutron-openvswitch-7.0.1-6.el7ost.noarch
python-neutron-7.0.1-6.el7ost.noarch
openstack-neutron-common-7.0.1-6.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-7.0.1-6.el7ost.noarch
[root@puma48 ~]# rpm -qa |grep dpd
openvswitch-dpdk-2.4.0-0.10346.git97bab959.2.el7.x86_64
dpdk-2.1.0-5.el7.x86_64

How reproducible:
always 

Steps to Reproduce:
1. https://docs.google.com/document/d/1K_ku6_08ooq46dFLiE7fAJ0ByFdPCb0W_q6kKqF3Y0o/edit
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Assaf Muller 2016-01-21 13:09:52 UTC
"os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM", seems like a pretty straight forward error message. Can you paste 'cat /proc/meminfo | grep HugePages'?

Comment 2 Eran Kuris 2016-01-21 13:18:54 UTC
[root@puma48 ~]# cat /proc/meminfo | grep HugePages
AnonHugePages:    376832 kB
HugePages_Total:       8
HugePages_Free:        5
HugePages_Rsvd:        0
HugePages_Surp:        0

Comment 4 Terry Wilson 2016-01-25 21:30:27 UTC
If you are trying to boot 4 instances, each with 2 GB of RAM, and (at least) 1 GB reserved for OVS, that would definitely fail if you only have eight 1 GB hugepages to begin with. I would expect more than 1 to work, though.
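
Roughly, assuming 1 GB pages throughout:

  1 GB (OVS) + 4 instances x 2 GB = 9 GB  >  8 GB available as 1 GB hugepages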

I logged into the test VM and can verify that it fails to boot the second VM, whether booted separately or via --num-instances=2.

Since this is a memory-related issue, it isn't neutron-related, and I'm not sure what the underlying problem is. Moving to openstack-nova.

Comment 5 Eran Kuris 2016-01-26 07:19:32 UTC
Terry, with a regular flavor that does not use hugepages I can create several VMs. When I use the "dpdk-flavor" I cannot create more than 1 VM. I tried both booting them separately and via --num-instances=2.

Comment 6 John Shakshober 2016-01-27 15:15:47 UTC
In order to boot additional VMs that are each 2 GB, the blueprint will need to configure more than just 8 GB of hugepages on the host. The output shows that 5 GB of hugepages were in use, leaving only 3 GB of hugepages left. '# numastat -c qemu' will dynamically show the number of hugepages in use by KVM.
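
For 1 GB pages, the per-NUMA-node free counts can also be read directly from sysfs, e.g.:

  # grep . /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/free_hugepages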

Comment 7 Eran Kuris 2016-01-28 06:03:34 UTC
I am not sure that is accurate: first of all, each VM is 1 GB, and I tried to increase the memory size and saw the same issue.

Comment 8 Stephen Gordon 2016-01-29 21:29:16 UTC
(In reply to Eran Kuris from comment #5)
> Terry, with a regular flavor that does not use hugepages I can create several
> VMs. When I use the "dpdk-flavor" I cannot create more than 1 VM. I tried both
> booting them separately and via --num-instances=2.

What you can do with regular VMs isn't really directly relevant here: when you aren't using huge pages for the VM, memory overcommit is allowed and the default ratio is 16:1, so to the nova-scheduler your 8 GB machine actually looks like 128 GB in that case.

(In reply to Eran Kuris from comment #7)
> I am not sure that is accurate: first of all, each VM is 1 GB, and I tried to
> increase the memory size and saw the same issue.

This is very confusing: you say each VM is 1 GB, but the VMs in your reproducer steps are 2 GB with 2 vCPUs. That is what Terry, Shak, and I are referring to, and as Shak said in comment #6 there are only 3 x 1 GB pages free based on the output you provided in comment #2. That is not enough to boot the 2 x 2 GB huge-page-backed VMs requested.

If you want to try to reproduce this again, it would be good to capture the /proc/meminfo content before and after each boot request.

Comment 9 Terry Wilson 2016-01-29 22:30:01 UTC
Stephen: When I logged into his machine, it showed five 1 GB hugepages free out of 8. 1 GB was used by OVS and 2 GB was used by the first VM. Booting a *single* VM (which should use 2 GB of memory) failed despite 5 GB being available.

Comment 10 Terry Wilson 2016-01-29 22:30:53 UTC
(a single extra VM, bringing the total to 2 that is)

Comment 12 Eran Kuris 2016-02-02 06:38:33 UTC
Eoghan, I would like to set up a session so we can debug the setup and move this task forward. :-)

Comment 13 Sahid Ferdjaoui 2016-02-08 09:15:22 UTC
- I can confirm that the flavor is 1 GB and not 2 GB.

[root@puma53 ~(keystone_admin)]# nova flavor-show m1.medium_dpdk
+----------------------------+--------------------------------------+
| Property                   | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 15                                   |
| extra_specs                | {"hw:mem_page_size": "large"}        |
| id                         | 2dde415a-267c-4fdc-a244-08f9a634ade5 |
| name                       | m1.medium_dpdk                       |
| os-flavor-access:is_public | True                                 |
| ram                        | 1024                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 2                                    |
+----------------------------+--------------------------------------+


- We have a bug in our current implementation of hugepages in Nova: it does not seem to take into account which NUMA node should back the guest memory and always uses NUMA node 0.

This compute host provides 8 hugepages shared between 2 NUMA nodes. On NUMA node 0, 3 pages are available (1 is already used by OVS), and on NUMA node 1, 4 pages are available.

[root@puma48 ~]# virsh freepages --all
Node 0:
4KiB: 6907982
1048576KiB: 3

Node 1:
4KiB: 7054079
1048576KiB: 4

Unfortunately, because of that bug we can only boot 3 guests configured with 'm1.medium_dpdk'; they will all have memory backed by node 0. (It's easy to confirm by looking at the XML.)

[root@puma48 ~]# virsh dumpxml 7 | grep page
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>



[root@puma53 ~(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+--------------------+
| ID                                   | Name | Status | Task State | Power State | Networks           |
+--------------------------------------+------+--------+------------+-------------+--------------------+
| 9b6f268f-a373-41ee-9dc0-040b2dd2b401 | i1   | ACTIVE | -          | Running     | net1=192.168.99.57 |
| db2fbacd-4521-4620-8fa0-97232fe464ba | i2   | ACTIVE | -          | Running     | net1=192.168.99.58 |
| 7438518b-1e91-4151-a302-ca8e5d150847 | i3   | ACTIVE | -          | Running     | net1=192.168.99.59 |
| 7f64fa95-39f2-46be-affd-b0af01b7af8d | i4   | ERROR  | -          | NOSTATE     | net1=192.168.99.60 |
+--------------------------------------+------+--------+------------+-------------+--------------------+


[root@puma48 ~]# virsh freepages --all
Node 0:
4KiB: 6885635
1048576KiB: 0

Node 1:
4KiB: 7055161
1048576KiB: 4

Comment 14 Sahid Ferdjaoui 2016-02-08 09:56:20 UTC
(In reply to Sahid Ferdjaoui from comment #13)
> [root@puma48 ~]# virsh dumpxml 7 | grep page
>     <hugepages>
>       <page size='1048576' unit='KiB' nodeset='0'/>
>     </hugepages>
> 

I made a mistake: the 'nodeset' attribute here indicates which guest NUMA nodes the pages are backing.

We should configure the 'memnode' element of 'numatune', probably with the union of host NUMA nodes when the guest does not have a specific NUMA requirement.

  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0-1'/>
  </numatune>

Comment 15 Sahid Ferdjaoui 2016-02-08 11:25:08 UTC
(In reply to Sahid Ferdjaoui from comment #14)
> (In reply to Sahid Ferdjaoui from comment #13)
> > [root@puma48 ~]# virsh dumpxml 7 | grep page
> >     <hugepages>
> >       <page size='1048576' unit='KiB' nodeset='0'/>
> >     </hugepages>
> > 
> 
> I made a mistake: the 'nodeset' attribute here indicates which guest NUMA
> nodes the pages are backing.
> 
> We should configure the 'memnode' element of 'numatune', probably with the
> union of host NUMA nodes when the guest does not have a specific NUMA
> requirement.
> 
>   <numatune>
>     <memory mode='strict' nodeset='0'/>
>     <memnode cellid='0' mode='strict' nodeset='0-1'/>
>   </numatune>

OK, I made a wrong assumption: the code handles NUMA node placement correctly, but the computation of available resources is wrong.

The current code computes available pages as the total allocated minus the pages used by instances. So it does not take into account the 1 page used by OVS and continues to think it can fit an instance on NUMA node 0.
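
To illustrate the accounting (a sketch only; illustrative names, not the actual Nova code):

  # Sketch only: "free" pages are derived from pages claimed by Nova's own
  # instances, so a page pinned by anything else (OVS here) stays invisible
  # to the scheduler.
  def nova_view_free_pages(total_pages, pages_used_by_instances):
      return total_pages - sum(pages_used_by_instances)

  # Host NUMA node 0: 4 x 1G pages in total, 1 pinned by OVS (not tracked),
  # 3 guests already backed there with 1 page each.
  print(nova_view_free_pages(4, [1, 1, 1]))  # -> 1, while virsh freepages shows 0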

Comment 16 Sahid Ferdjaoui 2016-02-08 14:44:54 UTC
I provided an upstream patch [1] to fix this issue.

This change provides a new option, 'reserved_memory_pages', which is used to reserve, from Nova's point of view, an amount of pages for third-party components. In our use case it will be 1 page for OVS. So this fixes how Nova computes free pages per host NUMA node.

[1] https://review.openstack.org/277422

Comment 17 Eran Kuris 2016-02-09 13:18:57 UTC
I tried to apply the fix manually but I didn't find the files on my setup.
Can you explain how I can verify the fix?

Comment 18 Sahid Ferdjaoui 2016-02-09 13:47:58 UTC
(In reply to Eran Kuris from comment #17)
> I tried to apply the fix manually but I didn't find the files on my setup.
> Can you explain how I can verify the fix?

I had a review from Daniel Berrangé who asked me to change something related to the option. Let me update this upstream; then, once I have his ACK, I will backport it for OSP 8 and provide test packages to you.

We can expect to have these test packages by tomorrow or the day after; does that sound good to you?

Comment 19 Eran Kuris 2016-02-09 14:07:20 UTC
Yes, that sounds OK.

Comment 20 Sahid Ferdjaoui 2016-02-11 09:49:44 UTC
(In reply to Eran Kuris from comment #19)
> Yes, that sounds OK.

You can find a scratch build at [1]. Please restart the OpenStack services after installing the packages; you will have to configure the compute nodes and services with this new option:

  reserved_memory_pages = ["0:1G:1"]

i.e. reserve one 1G page on host NUMA node 0 for OVS.
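
Assuming the openstack-utils helper is installed on the compute node, the Nova services can be restarted with, for example:

  # openstack-service restart nova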

Please let me know any feedback,
s

[1] https://brewweb.devel.redhat.com/taskinfo?taskID=10470071

Comment 21 Eran Kuris 2016-02-14 08:00:10 UTC
Sahid, I would like to get more info on how to install those packages so I can test this fix.
When I run yum install on those packages I get this error:
Error: Package: 1:openstack-nova-scheduler-12.0.1-2bz1300680v1.el7ost.noarch (/openstack-nova-scheduler-12.0.1-2bz1300680v1.el7ost.noarch)
           Requires: openstack-nova-common = 1:12.0.1-2bz1300680v1.el7ost
           Installed: 1:openstack-nova-common-12.0.1-2.el7ost.noarch (@rhelosp-8.0-puddle)
               openstack-nova-common = 1:12.0.1-2.el7ost
Error: Package: 1:openstack-nova-objectstore-12.0.1-2bz1300680v1.el7ost.noarch (/openstack-nova-objectstore-12.0.1-2bz1300680v1.el7ost.noarch)
           Requires: openstack-nova-common = 1:12.0.1-2bz1300680v1.el7ost
           Installed: 1:openstack-nova-common-12.0.1-2.el7ost.noarch (@rhelosp-8.0-puddle)
               openstack-nova-common = 1:12.0.1-2.el7ost
 You could try using --skip-broken to work around the problem
** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:
openvswitch-dpdk-2.4.0-0.10346.git97bab959.2.el7.x86_64 has installed conflicts openvswitch: openvswitch-dpdk-2.4.0-0.10346.git97bab959.2.el7.x86_64

Comment 22 Sahid Ferdjaoui 2016-02-15 09:15:37 UTC
(In reply to Eran Kuris from comment #21)
> Sahid, I would like to get more info on how to install those packages so I
> can test this fix.
> When I run yum install on those packages I get this error:
> Error: Package: 1:openstack-nova-scheduler-12.0.1-2bz1300680v1.el7ost.noarch
> (/openstack-nova-scheduler-12.0.1-2bz1300680v1.el7ost.noarch)
>            Requires: openstack-nova-common = 1:12.0.1-2bz1300680v1.el7ost
>            Installed: 1:openstack-nova-common-12.0.1-2.el7ost.noarch
> (@rhelosp-8.0-puddle)
>                openstack-nova-common = 1:12.0.1-2.el7ost
> Error: Package:
> 1:openstack-nova-objectstore-12.0.1-2bz1300680v1.el7ost.noarch
> (/openstack-nova-objectstore-12.0.1-2bz1300680v1.el7ost.noarch)
>            Requires: openstack-nova-common = 1:12.0.1-2bz1300680v1.el7ost
>            Installed: 1:openstack-nova-common-12.0.1-2.el7ost.noarch
> (@rhelosp-8.0-puddle)
>                openstack-nova-common = 1:12.0.1-2.el7ost
>  You could try using --skip-broken to work around the problem
> ** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:
> openvswitch-dpdk-2.4.0-0.10346.git97bab959.2.el7.x86_64 has installed
> conflicts openvswitch:
> openvswitch-dpdk-2.4.0-0.10346.git97bab959.2.el7.x86_64

It's because the version number of the packages was not incremented. You can use:

  rpm -ivh --force *.rpm

Comment 24 Eran Kuris 2016-02-15 13:08:54 UTC
The fix resolves the issue. Tested on a VLAN environment.

Comment 27 Stephen Gordon 2016-03-22 13:10:35 UTC
The current fix has been reverted upstream. Rather than carry a forked patch that may be incompatible with the ultimate upstream solution, we must also revert the fix from RHOSP 8. When the final upstream fix is available we will re-assess backportability.

Comment 29 Stephen Gordon 2016-03-22 13:29:27 UTC
Sahid, please process the revert under this BZ. We will need to create a clone for the long-term resolution.

Comment 30 Sahid Ferdjaoui 2016-03-22 17:01:36 UTC
New packages with the option to reserve memory pages on compute nodes reverted:

  openstack-nova-12.0.2-3.el7ost

Comment 31 Jon Schlueter 2016-03-23 12:30:50 UTC
Dropping from the advisory.

Comment 33 Jon Schlueter 2016-04-05 14:23:50 UTC
(In reply to Stephen Gordon from comment #27)
> The current fix has been reverted upstream. Rather than carry a forked patch
> that may be incompatible with the ultimate upstream solution, we must also
> revert the fix from RHOSP 8. When the final upstream fix is available we
> will re-assess backportability.

The changes have been reverted, but this bug is still open and has been dropped from the advisory. Can we push it out as a z-stream candidate since upstream hasn't reached a solution yet?

Comment 34 Sahid Ferdjaoui 2016-06-24 15:00:25 UTC
*** Bug 1348732 has been marked as a duplicate of this bug. ***

Comment 35 joycej 2016-06-24 15:15:06 UTC
There has been no meaningful activity on this bug in almost 3 months. Can someone provide an update on when this is targeted to be fixed?

Comment 36 Sahid Ferdjaoui 2016-06-24 15:51:48 UTC
(In reply to Sahid Ferdjaoui from comment #4 of bug 1348732)
> One possible solution (the easy one) is to have that option
> reserved_huge_pages set for all services, so the scheduler will know about
> the number of pages reserved. But the problem here is that all compute nodes
> are going to share the same number of pages reserved.
> 
> Another solution (probably better) would be to have a fix specific to the
> libvirt driver: the reserved_huge_pages option would be read when the driver
> computes available resources, and the number of pages available would be
> stored minus the number of pages reserved.
> 
> I'm closing this one as a duplicate since we do not want to track 2 BZs for
> the same problem.
> 
> *** This bug has been marked as a duplicate of bug 1300680 ***

Comment 42 Feng Pan 2016-07-02 14:43:24 UTC
Private build provided to Cisco for testing, waiting for feedback.

Comment 43 Paul Michali 2016-07-11 13:46:19 UTC
Tested the private build and was able to reserve 256 huge pages for two NUMA nodes using the 2048 kB page size. It worked just fine for our application needs.

Please advise on how we move forward so that we can get this integrated into the Nova RPMs.

Thanks!

Comment 46 joycej 2016-07-22 13:59:32 UTC
Paul provided the update on the Cisco side, but I guess I also need to reply so the bug system knows it doesn't need info from me anymore.

Comment 50 errata-xmlrpc 2016-08-31 17:36:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1794.html

Comment 51 Martin Lopes 2017-07-24 00:28:05 UTC
Note: the Google Doc guide mentioned in the description has now been published here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/network_functions_virtualization_configuration_guide/