Bug 1647498

Summary:

Open vSwitch ovs-vswitchd startup time increased significantly in minor versions larger than -63

Product:

Red Hat Enterprise Linux 7

Reporter:

Andreas Karis <akaris>

Component:

openvswitch

Assignee:

Eelco Chaudron <echaudro>

Status:

CLOSED ERRATA

QA Contact:

Rick Alongi <ralongi>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

7.5

CC:

alejandro.lucero, atragler, ctrautma, dalvarez, echaudro, fiezzi, kfida, pvauter, qding, ralongi, tredaelli

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

openvswitch-2.9.0-85.el7fdn.x86_64.rpm

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1658320 (view as bug list)

Environment:

Last Closed:

2019-01-02 17:54:40 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
files referenced in comment 19	none

Description Andreas Karis 2018-11-07 15:46:09 UTC

Description of problem:
After upgrading Open vSwitch to the `-71` minor version for testing, one can notice a significant increase in startup time when executing `systemctl restart openvswitch`.

Before:
~~~
[root@overcloud-compute-dpdk-1 ovs]# time systemctl restart openvswitch

real    1m19.464s
user    0m0.009s
sys     0m0.026s
~~~

After upgrade:
~~~
[root@overcloud-compute-dpdk-1 ovs]# time systemctl restart openvswitch
A dependency job for openvswitch.service failed. See 'journalctl -xe' for details.

real    4m2.941s
user    0m0.018s
sys     0m0.057s
~~~

After testing with Red Hat provided test builds, minor versions `-56` to `-71`, the culprit was identified as a patch that was introduced with `openvswitch-2.9.0-63.el7fdn.x86_64`.

Version-Release number of selected component (if applicable):
Open vSwitch 2.9.0-63 and beyond

How reproducible:
Install -56 and run:
time systemctl restart openvswitch

Upgrade from -56 to -63 and run:
time systemctl restart openvswitch

Additional info:

Comment 3 Andreas Karis 2018-11-07 15:49:11 UTC

From the changelog:
~~~
* Thu Sep 13 2018 Timothy Redaelli <tredaelli> - 2.9.0-63
- FIXME: If make check fails try another time
- Backport NFP PMD's non-root related commits (#1568301):
  - net/nfp: support IOVA VA mode
  - bus/pci: forbid IOVA mode if IOMMU address width too small
  - net/nfp: check hugepages IOVAs based on DMA mask
  - mem: use address hint for mapping hugepages
  - bus/pci: use IOVAs check when setting IOVA mode
  - mem: add function for checking memsegs IOVAs addresses
  - mem: fix max DMA maskbit size
~~~

Comment 4 Eelco Chaudron 2018-11-08 20:36:03 UTC

The first step for me is to see if I can replicate the issue in my setup. I was successful and in my simple setup it went from roughly 7 seconds to 18.

Here some output of my tests:


strace -c /usr/sbin/ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.70   13.047093       13382       975           mmap
 19.71    3.227044        3735       864           munmap


strace -ttt -T /usr/sbin/ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid
541708874.712769 execve("/usr/sbin/ovs-vswitchd", ["/usr/sbin/ovs-vswitchd", "unix:/var/run/openvswitch/db.soc"..., "-vconsole:emer", "-vsyslog:err", "-vfile:info", "--mlockall", "--no-chdir", "--log-file=/var/log/openvswitch/"..., "--pidfile=/var/run/openvswitch/o"...], [/* 46 vars */]) = 0 <0.000114>
1541708874.713019 brk(NULL)             = 0x14bf000 <0.000009>
1541708874.713068 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3fe5dc000 <0.000010>
...
...
1541708878.219169 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3fe5da000 <0.000012>
1541708878.219204 read(17, "MemTotal:       131809796 kB\nMem"..., 1024) = 1024 <0.000038>
1541708878.219271 close(17)             = 0 <0.000013>
1541708878.219304 munmap(0x7fa3fe5da000, 4096) = 0 <0.000011>
1541708878.219341 get_mempolicy(NULL, NULL, 0, NULL, 0) = 0 <0.000009>
1541708878.219371 open("/dev/zero", O_RDONLY) = 17 <0.000012>
1541708878.219405 mmap(0x100000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.199856>
1541708879.419300 munmap(0x7f9bbab94000, 35433480192) = 0 <0.399859>
1541708879.819193 mmap(0x200000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.198899>
1541708881.018131 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401643>
1541708881.419804 mmap(0x300000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.198989>
1541708882.618828 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401483>
1541708883.020345 mmap(0x400000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.198770>
1541708884.219157 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401635>
1541708884.620824 mmap(0x500000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.199043>
1541708885.819908 munmap(0x7f9bbab94000, 35433480192) = 0 <0.399615>
1541708886.219555 mmap(0x600000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.200235>
1541708887.419831 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401402>
1541708887.821297 mmap(0x700000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.200288>
1541708889.021619 munmap(0x7f9bbab94000, 35433480192) = 0 <0.399871>
1541708889.421521 mmap(0x800000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x800000000 <1.198881>
1541708890.620436 munmap(0x800000000, 35433480192) = 0 <0.401636>
1541708891.022106 close(17)             = 0 <0.000011>


Next step would be to dig into the why part...

Comment 5 Alejandro 2018-11-09 10:24:23 UTC

I can not replicate this with testpmd and 24GB for hugepages. The memory initialization times is quite similar between pre-dma-mask-patch and current 17.11.4 stable.

I suspect this is something that OVS execution is triggering. I will try to replicate this using OVS now.

Comment 6 Eelco Chaudron 2018-11-09 10:50:01 UTC

(In reply to Alejandro from comment #5)

It's specific to this commit:

293c0c4 mem: use address hint for mapping hugepages

Adding some debug, you can see trying to get the 32G memory is taking a long time, and its need to retry as the returned memory does not match the hint:

2018-11-09T10:02:14.940Z|00016|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x100000000, 34359738368 bytes
2018-11-09T10:02:16.142Z|00017|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x100000000, 34359738368 bytes
2018-11-09T10:02:16.542Z|00018|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x200000000, 34359738368 bytes
2018-11-09T10:02:17.743Z|00019|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x200000000, 34359738368 bytes
2018-11-09T10:02:18.142Z|00020|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x300000000, 34359738368 bytes
2018-11-09T10:02:19.343Z|00021|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x300000000, 34359738368 bytes
2018-11-09T10:02:19.743Z|00022|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x400000000, 34359738368 bytes
2018-11-09T10:02:20.943Z|00023|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x400000000, 34359738368 bytes
2018-11-09T10:02:21.343Z|00024|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x500000000, 34359738368 bytes
2018-11-09T10:02:22.551Z|00025|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x500000000, 34359738368 bytes
2018-11-09T10:02:22.951Z|00026|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x600000000, 34359738368 bytes
2018-11-09T10:02:24.152Z|00027|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x600000000, 34359738368 bytes
2018-11-09T10:02:24.552Z|00028|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x700000000, 34359738368 bytes
2018-11-09T10:02:25.753Z|00029|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x700000000, 34359738368 bytes
2018-11-09T10:02:26.152Z|00030|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x800000000, 34359738368 bytes
2018-11-09T10:02:27.353Z|00031|dpdk|ERR|EAL: Return mmap() addr 0x800000000, addr_hint 0x800000000, 34359738368 bytes

I wanted to try MAP_FIXED_NOREPLACE, but unfortunately, this is only supported in the latest kernels, which we do not have.

Need to dig into how the memory architecture works in DPDK, but it tries the full 32G pages I have while I only request 2G. I assume the customer has tons of more 1G pages and does more retries. 

Maybe we should scan /proc/<pid>/maps for a space to start trying?

Comment 7 Eelco Chaudron 2018-11-09 11:38:15 UTC

(In reply to Eelco Chaudron from comment #6)

Some additional debug data...

I noticed that before get_virtual_area() gets called, some other part of the code already has all huge pages mapped (that’s why it’s failing):

```
40000000-80000000 rw-s 00000000 00:23 206083                             /dev/hugepages/rtemap_1
80000000-c0000000 rw-s 00000000 00:23 206084                             /dev/hugepages/rtemap_2
c0000000-100000000 rw-s 00000000 00:23 206085                            /dev/hugepages/rtemap_3
100000000-140000000 rw-s 00000000 00:23 206086                           /dev/hugepages/rtemap_4
140000000-180000000 rw-s 00000000 00:23 206087                           /dev/hugepages/rtemap_5
180000000-1c0000000 rw-s 00000000 00:23 206088                           /dev/hugepages/rtemap_6
1c0000000-200000000 rw-s 00000000 00:23 206089                           /dev/hugepages/rtemap_7
200000000-240000000 rw-s 00000000 00:23 206090                           /dev/hugepages/rtemap_8
240000000-280000000 rw-s 00000000 00:23 206091                           /dev/hugepages/rtemap_9
280000000-2c0000000 rw-s 00000000 00:23 206092                           /dev/hugepages/rtemap_10
2c0000000-300000000 rw-s 00000000 00:23 206093                           /dev/hugepages/rtemap_11
300000000-340000000 rw-s 00000000 00:23 206094                           /dev/hugepages/rtemap_12
340000000-380000000 rw-s 00000000 00:23 206095                           /dev/hugepages/rtemap_13
380000000-3c0000000 rw-s 00000000 00:23 206096                           /dev/hugepages/rtemap_14
3c0000000-400000000 rw-s 00000000 00:23 206097                           /dev/hugepages/rtemap_15
400000000-440000000 rw-s 00000000 00:23 206098                           /dev/hugepages/rtemap_16
440000000-480000000 rw-s 00000000 00:23 206099                           /dev/hugepages/rtemap_17
480000000-4c0000000 rw-s 00000000 00:23 206100                           /dev/hugepages/rtemap_18
4c0000000-500000000 rw-s 00000000 00:23 206101                           /dev/hugepages/rtemap_19
500000000-540000000 rw-s 00000000 00:23 206102                           /dev/hugepages/rtemap_20
540000000-580000000 rw-s 00000000 00:23 206103                           /dev/hugepages/rtemap_21
580000000-5c0000000 rw-s 00000000 00:23 206104                           /dev/hugepages/rtemap_22
5c0000000-600000000 rw-s 00000000 00:23 206105                           /dev/hugepages/rtemap_23
600000000-640000000 rw-s 00000000 00:23 206106                           /dev/hugepages/rtemap_24
640000000-680000000 rw-s 00000000 00:23 206107                           /dev/hugepages/rtemap_25
680000000-6c0000000 rw-s 00000000 00:23 206108                           /dev/hugepages/rtemap_26
6c0000000-700000000 rw-s 00000000 00:23 206109                           /dev/hugepages/rtemap_27
700000000-740000000 rw-s 00000000 00:23 206110                           /dev/hugepages/rtemap_28
740000000-780000000 rw-s 00000000 00:23 206111                           /dev/hugepages/rtemap_29
780000000-7c0000000 rw-s 00000000 00:23 206112                           /dev/hugepages/rtemap_30
7c0000000-800000000 rw-s 00000000 00:23 206113                           /dev/hugepages/rtemap_31
2aaac0000000-2aab00000000 rw-s 00000000 00:23 206082                     /dev/hugepages/rtemap_0
```

Comment 8 Alejandro 2018-11-09 14:02:36 UTC

What EAL mem init does is first mmap all the hugepages in a first stage. Later it will try to mmap as many physical contiguous hugepages as possible, remapping the hugepages to new virtual addresses. It is in this second stage when the call to get_virtual_area is done. 

The problem is, the first mmaping is also using a virtual address as a hint for the mmap call except in the first call. So the first hugepage is mapped in a high virtual address but not the other ones. From the second hugepage, the virtual address used a a hint is based on increments of hugepage size with an initial address of hugepage size. This implies that if there are more than 4GB of memory backed by hugepages, the second stage starts to have problems because the initial hint is just at that offset, 4GB. Inside get_virtual_address, if the map at the hint fails, the hint is increased by 4GB. So if you map 32 GB, you will get 7 extra mmaps calls. You can do the arithmetics for further amount of memory used. 

We do not care about the virtual addresses used in the first stage, just about the ones done in the second. So I think we should avoid a hint for the first stage what will make the kernel using really high virtual addresses that will not hopefully overlap with the ones used in the second stage.

I'm going to do some testing with those changes to see if this is viable.

Comment 9 Eelco Chaudron 2018-11-12 11:29:23 UTC

Thanks Alejandro to work on this in parallel! Patch was send upstream:

http://mails.dpdk.org/archives/dev/2018-November/118774.html

Waiting for feedback.

Comment 17 Rick Alongi 2018-12-13 12:51:13 UTC

Hi Eelco,

Can you please provide more details regarding a reproducer?  Please list which specific packages need to be installed (openvswitch, DPDK, etc.), the steps you are using to configure huge pages, etc.

Thanks,
Rick

Comment 18 Eelco Chaudron 2018-12-13 13:01:55 UTC

(In reply to Rick Alongi from comment #17)
> Hi Eelco,
> 
> Can you please provide more details regarding a reproducer?  Please list
> which specific packages need to be installed (openvswitch, DPDK, etc.), the
> steps you are using to configure huge pages, etc.

I was using the setup explained here: https://github.com/chaudron/ovs_perf

I'm reserving 32 1G pages as explained in the above link. I do not see the huge increase as the customer, for me it went from roughly 7 seconds to 18.

Comment 19 Rick Alongi 2018-12-16 22:58:23 UTC

Created attachment 1514939 [details]
files referenced in comment 19

Comment 20 Rick Alongi 2018-12-16 23:00:22 UTC

I've attached bz1647498.tar.gz containing the following files to this BZ:

restart_test_setup.txt: steps used to configure for the test
restart_test.sh: shell script to test for the issue
openvswitch-2.9.0-56.el7fdp.x86_64_result.txt: output from testing with openvswitch-2.9.0-56.el7fdp.x86_64 (no problem)
openvswitch-2.9.0-70.el7fdp.x86_64_result.txt: output from testing with openvswitch-2.9.0-70.el7fdp.x86_64 (problem)
openvswitch-2.9.0-83.el7fdp.1.x86_64_result.txt: output from testing with openvswitch-2.9.0-83.el7fdp.1.x86_64 (no problem)

I used a Pass/Fail threshold of 10 seconds for the restart time with this configuration since that was the baseline from builds before the issue was introduced.

Summary of test results:

===========================================================================
Test completed using openvswitch package openvswitch-2.9.0-56.el7fdp.x86_64
Restart iterations: 50, Time threshold specified: 10 seconds
50 of 50 iterations PASSED
0 of 50 iterations FAILED
Average restart time per iteration: 9.16 seconds
===========================================================================
Final Test Result: PASSED

===========================================================================
===========================================================================
Test completed using openvswitch package openvswitch-2.9.0-70.el7fdp.x86_64
Restart iterations: 50, Time threshold specified: 10 seconds
0 of 50 iterations PASSED
50 of 50 iterations FAILED
Average restart time per iteration: 15.12 seconds
===========================================================================
Final Test Result: FAILED

===========================================================================
===========================================================================
Test completed using openvswitch package openvswitch-2.9.0-83.el7fdp.1.x86_64
Restart iterations: 50, Time threshold specified: 10 seconds
50 of 50 iterations PASSED
0 of 50 iterations FAILED
Average restart time per iteration: 9.04 seconds
===========================================================================
Final Test Result: PASSED

Based on the results referenced, I am marking this as Verified.

Comment 22 errata-xmlrpc 2019-01-02 17:54:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0014