Bug 1647498 - Open vSwitch ovs-vswitchd startup time increased significantly in minor versions larger than -63

| Field | Value |
| --- | --- |
| Product | Red Hat Enterprise Linux 7 |
| Component | openvswitch |
| Version | 7.5 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Target Milestone | rc |
| Reporter | Andreas Karis <akaris> |
| Assignee | Eelco Chaudron <echaudro> |
| QA Contact | Rick Alongi <ralongi> |
| CC | alejandro.lucero, atragler, ctrautma, dalvarez, echaudro, fiezzi, kfida, pvauter, qding, ralongi, tredaelli |
| Fixed In Version | openvswitch-2.9.0-85.el7fdn.x86_64.rpm |
| Clones | 1658320 |
| Type | Bug |
| Last Closed | 2019-01-02 17:54:40 UTC |
Description (Andreas Karis, 2018-11-07 15:46:09 UTC)
From the changelog:

~~~
* Thu Sep 13 2018 Timothy Redaelli <tredaelli> - 2.9.0-63
- FIXME: If make check fails try another time
- Backport NFP PMD's non-root related commits (#1568301):
  - net/nfp: support IOVA VA mode
  - bus/pci: forbid IOVA mode if IOMMU address width too small
  - net/nfp: check hugepages IOVAs based on DMA mask
  - mem: use address hint for mapping hugepages
  - bus/pci: use IOVAs check when setting IOVA mode
  - mem: add function for checking memsegs IOVAs addresses
  - mem: fix max DMA maskbit size
~~~

The first step for me is to see if I can replicate the issue in my setup. I was successful: in my simple setup the startup time went from roughly 7 seconds to 18. Here is some output from my tests:

~~~
strace -c /usr/sbin/ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.70   13.047093       13382       975           mmap
 19.71    3.227044        3735       864           munmap
~~~

~~~
strace -ttt -T /usr/sbin/ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid

1541708874.712769 execve("/usr/sbin/ovs-vswitchd", ["/usr/sbin/ovs-vswitchd", "unix:/var/run/openvswitch/db.soc"..., "-vconsole:emer", "-vsyslog:err", "-vfile:info", "--mlockall", "--no-chdir", "--log-file=/var/log/openvswitch/"..., "--pidfile=/var/run/openvswitch/o"...], [/* 46 vars */]) = 0 <0.000114>
1541708874.713019 brk(NULL) = 0x14bf000 <0.000009>
1541708874.713068 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3fe5dc000 <0.000010>
...
1541708878.219169 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3fe5da000 <0.000012>
1541708878.219204 read(17, "MemTotal: 131809796 kB\nMem"..., 1024) = 1024 <0.000038>
1541708878.219271 close(17) = 0 <0.000013>
1541708878.219304 munmap(0x7fa3fe5da000, 4096) = 0 <0.000011>
1541708878.219341 get_mempolicy(NULL, NULL, 0, NULL, 0) = 0 <0.000009>
1541708878.219371 open("/dev/zero", O_RDONLY) = 17 <0.000012>
1541708878.219405 mmap(0x100000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.199856>
1541708879.419300 munmap(0x7f9bbab94000, 35433480192) = 0 <0.399859>
1541708879.819193 mmap(0x200000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.198899>
1541708881.018131 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401643>
1541708881.419804 mmap(0x300000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.198989>
1541708882.618828 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401483>
1541708883.020345 mmap(0x400000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.198770>
1541708884.219157 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401635>
1541708884.620824 mmap(0x500000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.199043>
1541708885.819908 munmap(0x7f9bbab94000, 35433480192) = 0 <0.399615>
1541708886.219555 mmap(0x600000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.200235>
1541708887.419831 munmap(0x7f9bbab94000, 35433480192) = 0 <0.401402>
1541708887.821297 mmap(0x700000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x7f9bbab94000 <1.200288>
1541708889.021619 munmap(0x7f9bbab94000, 35433480192) = 0 <0.399871>
1541708889.421521 mmap(0x800000000, 35433480192, PROT_READ, MAP_PRIVATE, 17, 0) = 0x800000000 <1.198881>
1541708890.620436 munmap(0x800000000, 35433480192) = 0 <0.401636>
1541708891.022106 close(17) = 0 <0.000011>
~~~

Next step would be to dig into the why part...
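For reference, each slow round in the trace above is one probe of the same shape: a ~33 GiB read-only private mapping of /dev/zero at a hinted address (the 35433480192-byte size is presumably the 32 GiB of hugepage memory plus one extra 1 GiB page for alignment), costing about 1.2 s for the mmap() plus 0.4 s for the munmap(). A minimal standalone sketch that re-issues one such probe, assuming a 64-bit Linux host (the size and first hint are copied from the trace; absolute timings will vary by machine):

```c
/*
 * Hypothetical reproducer for one probe from the strace above: map
 * ~33 GiB of /dev/zero read-only at the hinted address and time the
 * mmap()/munmap() pair. Size and hint are taken from the trace.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    size_t len = 35433480192ULL;         /* 33 GiB, as in the trace */
    void *hint = (void *)0x100000000ULL; /* first hint seen in the trace */

    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    double t0 = now();
    void *addr = mmap(hint, len, PROT_READ, MAP_PRIVATE, fd, 0);
    double t1 = now();
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mmap:   %.3f s, returned %p (hint %p)\n", t1 - t0, addr, hint);

    t0 = now();
    munmap(addr, len);
    t1 = now();
    printf("munmap: %.3f s\n", t1 - t0);

    close(fd);
    return 0;
}
```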
I cannot replicate this with testpmd and 24 GB of hugepages. The memory initialization time is quite similar between pre-DMA-mask-patch and the current 17.11.4 stable. I suspect this is something that the OVS execution is triggering. I will try to replicate this using OVS now.

(In reply to Alejandro from comment #5)

It's specific to this commit:

293c0c4 mem: use address hint for mapping hugepages

Adding some debug output, you can see that getting the 32G of memory takes a long time, and that it needs to retry because the returned memory does not match the hint:

```
2018-11-09T10:02:14.940Z|00016|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x100000000, 34359738368 bytes
2018-11-09T10:02:16.142Z|00017|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x100000000, 34359738368 bytes
2018-11-09T10:02:16.542Z|00018|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x200000000, 34359738368 bytes
2018-11-09T10:02:17.743Z|00019|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x200000000, 34359738368 bytes
2018-11-09T10:02:18.142Z|00020|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x300000000, 34359738368 bytes
2018-11-09T10:02:19.343Z|00021|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x300000000, 34359738368 bytes
2018-11-09T10:02:19.743Z|00022|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x400000000, 34359738368 bytes
2018-11-09T10:02:20.943Z|00023|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x400000000, 34359738368 bytes
2018-11-09T10:02:21.343Z|00024|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x500000000, 34359738368 bytes
2018-11-09T10:02:22.551Z|00025|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x500000000, 34359738368 bytes
2018-11-09T10:02:22.951Z|00026|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x600000000, 34359738368 bytes
2018-11-09T10:02:24.152Z|00027|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x600000000, 34359738368 bytes
2018-11-09T10:02:24.552Z|00028|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x700000000, 34359738368 bytes
2018-11-09T10:02:25.753Z|00029|dpdk|ERR|EAL: Return mmap() addr 0x7fdfd831c000, addr_hint 0x700000000, 34359738368 bytes
2018-11-09T10:02:26.152Z|00030|dpdk|ERR|EAL: Calling mmap() addr 0xffffffffffffffff, addr_hint 0x800000000, 34359738368 bytes
2018-11-09T10:02:27.353Z|00031|dpdk|ERR|EAL: Return mmap() addr 0x800000000, addr_hint 0x800000000, 34359738368 bytes
```

I wanted to try MAP_FIXED_NOREPLACE, but unfortunately this is only supported in the latest kernels, which we do not have.

I need to dig into how the memory architecture works in DPDK, but it tries the full 32G of pages I have while I only request 2G. I assume the customer has many more 1G pages and does even more retries. Maybe we should scan /proc/<pid>/maps for a space to start trying?
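As a side note on MAP_FIXED_NOREPLACE: a hypothetical probe using it (Linux 4.17 or newer; the probe() helper and the fallback #define below are illustrative, not DPDK code) would fail fast with EEXIST on a clash instead of mapping 32 GB elsewhere and unmapping it again:

```c
/*
 * Hypothetical sketch of the MAP_FIXED_NOREPLACE idea mentioned above
 * (requires Linux >= 4.17). The kernel refuses to place the mapping
 * anywhere but at the requested address, so a clash with an existing
 * mapping (e.g. an rtemap_N hugepage) fails immediately with EEXIST.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* missing from older headers */
#endif

static void *probe(void *hint, size_t len)
{
    void *addr = mmap(hint, len, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                      -1, 0);
    if (addr == MAP_FAILED) {
        if (errno == EEXIST) {
            /* Something already lives in [hint, hint + len);
             * the caller can try the next hint immediately. */
            return NULL;
        }
        perror("mmap");
        return NULL;
    }
    return addr; /* guaranteed to be exactly the hint */
}

int main(void)
{
    size_t len = 1ULL << 30;
    void *a = probe((void *)0x100000000ULL, len);
    printf("probe at 0x100000000: %s\n", a ? "free" : "occupied");
    if (a) {
        munmap(a, len);
    }
    return 0;
}
```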
(In reply to Eelco Chaudron from comment #6)

Some additional debug data. I noticed that before get_virtual_area() gets called, some other part of the code already has all the hugepages mapped (that's why it's failing):

```
40000000-80000000 rw-s 00000000 00:23 206083 /dev/hugepages/rtemap_1
80000000-c0000000 rw-s 00000000 00:23 206084 /dev/hugepages/rtemap_2
c0000000-100000000 rw-s 00000000 00:23 206085 /dev/hugepages/rtemap_3
100000000-140000000 rw-s 00000000 00:23 206086 /dev/hugepages/rtemap_4
140000000-180000000 rw-s 00000000 00:23 206087 /dev/hugepages/rtemap_5
180000000-1c0000000 rw-s 00000000 00:23 206088 /dev/hugepages/rtemap_6
1c0000000-200000000 rw-s 00000000 00:23 206089 /dev/hugepages/rtemap_7
200000000-240000000 rw-s 00000000 00:23 206090 /dev/hugepages/rtemap_8
240000000-280000000 rw-s 00000000 00:23 206091 /dev/hugepages/rtemap_9
280000000-2c0000000 rw-s 00000000 00:23 206092 /dev/hugepages/rtemap_10
2c0000000-300000000 rw-s 00000000 00:23 206093 /dev/hugepages/rtemap_11
300000000-340000000 rw-s 00000000 00:23 206094 /dev/hugepages/rtemap_12
340000000-380000000 rw-s 00000000 00:23 206095 /dev/hugepages/rtemap_13
380000000-3c0000000 rw-s 00000000 00:23 206096 /dev/hugepages/rtemap_14
3c0000000-400000000 rw-s 00000000 00:23 206097 /dev/hugepages/rtemap_15
400000000-440000000 rw-s 00000000 00:23 206098 /dev/hugepages/rtemap_16
440000000-480000000 rw-s 00000000 00:23 206099 /dev/hugepages/rtemap_17
480000000-4c0000000 rw-s 00000000 00:23 206100 /dev/hugepages/rtemap_18
4c0000000-500000000 rw-s 00000000 00:23 206101 /dev/hugepages/rtemap_19
500000000-540000000 rw-s 00000000 00:23 206102 /dev/hugepages/rtemap_20
540000000-580000000 rw-s 00000000 00:23 206103 /dev/hugepages/rtemap_21
580000000-5c0000000 rw-s 00000000 00:23 206104 /dev/hugepages/rtemap_22
5c0000000-600000000 rw-s 00000000 00:23 206105 /dev/hugepages/rtemap_23
600000000-640000000 rw-s 00000000 00:23 206106 /dev/hugepages/rtemap_24
640000000-680000000 rw-s 00000000 00:23 206107 /dev/hugepages/rtemap_25
680000000-6c0000000 rw-s 00000000 00:23 206108 /dev/hugepages/rtemap_26
6c0000000-700000000 rw-s 00000000 00:23 206109 /dev/hugepages/rtemap_27
700000000-740000000 rw-s 00000000 00:23 206110 /dev/hugepages/rtemap_28
740000000-780000000 rw-s 00000000 00:23 206111 /dev/hugepages/rtemap_29
780000000-7c0000000 rw-s 00000000 00:23 206112 /dev/hugepages/rtemap_30
7c0000000-800000000 rw-s 00000000 00:23 206113 /dev/hugepages/rtemap_31
2aaac0000000-2aab00000000 rw-s 00000000 00:23 206082 /dev/hugepages/rtemap_0
```

What EAL memory init does is first mmap all the hugepages in a first stage. Later it tries to mmap as many physically contiguous hugepages as possible, remapping the hugepages to new virtual addresses. It is in this second stage that the call to get_virtual_area() is made.

The problem is that the first mapping also uses a virtual address as a hint for the mmap call, except in the first call. So the first hugepage is mapped at a high virtual address, but the other ones are not: from the second hugepage on, the virtual address used as a hint is based on increments of the hugepage size, with an initial address of the hugepage size. This implies that if there is more than 4 GB of memory backed by hugepages, the second stage starts to have problems, because its initial hint is just at that offset, 4 GB. Inside get_virtual_area(), if the mapping at the hint fails (the kernel returns a different address), the hint is increased by 4 GB. So if you map 32 GB, you get 7 extra mmap calls: the hints at 4 GB through 28 GB all collide with the hugepage mappings above, and only the probe at 32 GB (0x800000000) succeeds. You can do the arithmetic for larger amounts of memory. We do not care about the virtual addresses used in the first stage, just about the ones used in the second.
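To make the described retry behaviour concrete, here is a simplified, hypothetical model of the second-stage probing (patterned after, but not copied from, DPDK's get_virtual_area(); the function and constant names are invented):

```c
/*
 * Simplified, hypothetical model of the second-stage probing described
 * above (not the literal DPDK source). mmap() treats the address only
 * as a hint: when the hinted range is already occupied by a first-stage
 * rtemap_N mapping, the kernel places the region somewhere else, so the
 * code has to unmap it and retry 4 GB higher. Each failed round is one
 * ~1.2 s mmap() plus one ~0.4 s munmap() in the traces above.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HINT_START (1ULL << 32) /* 0x100000000: 4 GB, as in the logs */
#define HINT_STEP  (1ULL << 32) /* hint advances in 4 GB increments  */

static void *get_contiguous_area(size_t len)
{
    uintptr_t hint = HINT_START;
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) {
        return NULL;
    }

    for (int tries = 0; tries < 64; tries++) {
        void *addr = mmap((void *)hint, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) {
            break;
        }
        if ((uintptr_t)addr == hint) {
            /* Found a free region exactly where we asked for it. */
            close(fd);
            return addr;
        }
        /*
         * The hint clashed with an existing mapping. With 32 GB of
         * hugepages mapped at 0x40000000-0x800000000, the hints at
         * 4, 8, ..., 28 GB all fail and only 0x800000000 succeeds:
         * 7 wasted map/unmap rounds, matching the strace.
         */
        munmap(addr, len);
        hint += HINT_STEP;
    }
    close(fd);
    return NULL;
}

int main(void)
{
    void *area = get_contiguous_area(1ULL << 30);
    printf("got virtual area at %p\n", area);
    return 0;
}
```

If the first-stage mappings are placed at high addresses instead (the fix proposed below), the very first second-stage probe at 4 GB succeeds and the whole retry loop disappears.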
So I think we should avoid a hint for the first stage, which will make the kernel use really high virtual addresses that will hopefully not overlap with the ones used in the second stage. I'm going to do some testing with those changes to see if this is viable.

Thanks, Alejandro, for working on this in parallel!

The patch was sent upstream: http://mails.dpdk.org/archives/dev/2018-November/118774.html

Waiting for feedback.

Hi Eelco,

Can you please provide more details regarding a reproducer? Please list which specific packages need to be installed (openvswitch, DPDK, etc.), the steps you are using to configure huge pages, etc.

Thanks,
Rick

(In reply to Rick Alongi from comment #17)
> Hi Eelco,
>
> Can you please provide more details regarding a reproducer? Please list
> which specific packages need to be installed (openvswitch, DPDK, etc.), the
> steps you are using to configure huge pages, etc.

I was using the setup explained here: https://github.com/chaudron/ovs_perf

I'm reserving 32 1G pages as explained in the above link. I do not see as huge an increase as the customer; for me it went from roughly 7 seconds to 18.

Created attachment 1514939 [details]
files referenced in comment 19

I've attached bz1647498.tar.gz containing the following files to this BZ:

- restart_test_setup.txt: steps used to configure for the test
- restart_test.sh: shell script to test for the issue
- openvswitch-2.9.0-56.el7fdp.x86_64_result.txt: output from testing with openvswitch-2.9.0-56.el7fdp.x86_64 (no problem)
- openvswitch-2.9.0-70.el7fdp.x86_64_result.txt: output from testing with openvswitch-2.9.0-70.el7fdp.x86_64 (problem)
- openvswitch-2.9.0-83.el7fdp.1.x86_64_result.txt: output from testing with openvswitch-2.9.0-83.el7fdp.1.x86_64 (no problem)

I used a Pass/Fail threshold of 10 seconds for the restart time with this configuration, since that was the baseline from builds before the issue was introduced.
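The attached restart_test.sh itself is not reproduced in the comments; purely as a hypothetical illustration of what a restart-timing loop of this shape does (the `openvswitch` systemd unit name, the 50 iterations, and the 10-second threshold are assumptions taken from the text above):

```c
/*
 * Hypothetical restart-timing harness, NOT the attached restart_test.sh
 * (which is only available as a Bugzilla attachment). It restarts the
 * service a fixed number of times and checks each restart against the
 * pass/fail threshold described above. Run as root on a host with the
 * openvswitch service installed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERATIONS 50
#define THRESHOLD_SECS 10.0

int main(void)
{
    int passed = 0;

    for (int i = 1; i <= ITERATIONS; i++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("systemctl restart openvswitch") != 0) {
            fprintf(stderr, "iteration %d: restart failed\n", i);
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                      + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        int ok = secs < THRESHOLD_SECS;
        passed += ok;
        printf("iteration %2d: %.2f s (%s)\n", i, secs, ok ? "PASS" : "FAIL");
    }

    printf("%d of %d iterations PASSED\n", passed, ITERATIONS);
    printf("Final Test Result: %s\n",
           passed == ITERATIONS ? "PASSED" : "FAILED");
    return passed == ITERATIONS ? 0 : 1;
}
```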
Summary of test results:

```
===========================================================================
Test completed using openvswitch package openvswitch-2.9.0-56.el7fdp.x86_64
Restart iterations: 50, Time threshold specified: 10 seconds
50 of 50 iterations PASSED
0 of 50 iterations FAILED
Average restart time per iteration: 9.16 seconds
===========================================================================
Final Test Result: PASSED
===========================================================================

===========================================================================
Test completed using openvswitch package openvswitch-2.9.0-70.el7fdp.x86_64
Restart iterations: 50, Time threshold specified: 10 seconds
0 of 50 iterations PASSED
50 of 50 iterations FAILED
Average restart time per iteration: 15.12 seconds
===========================================================================
Final Test Result: FAILED
===========================================================================

===========================================================================
Test completed using openvswitch package openvswitch-2.9.0-83.el7fdp.1.x86_64
Restart iterations: 50, Time threshold specified: 10 seconds
50 of 50 iterations PASSED
0 of 50 iterations FAILED
Average restart time per iteration: 9.04 seconds
===========================================================================
Final Test Result: PASSED
```

Based on the results referenced, I am marking this as Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0014