Description of problem:
Under heavy network and CPU load (VM-related traffic or regular physical NIC traffic), Open vSwitch is observed to hang. It becomes totally unresponsive and fails to handle packets flowing to and from VMs. In this state, the OVS processes are stuck in an uninterruptible state: OVS does not respond to kill or SIGKILL, and we cannot restart any OVS services. Nothing is logged to the OVS logs once this happens, and all ovs commands hang. The only workaround is to reboot the host.

This is with Icehouse + RDO. I built the latest Open vSwitch source code (2.3.0.1), built an RPM out of it, and tried it on the system. The problem seems to be fixed in the latest version of OVS. This bug is primarily a request to add support for the latest version of OVS in RDO going forward, for Icehouse and future OpenStack releases.

Version-Release number of selected component (if applicable):
Icehouse with RDO; Open vSwitch version 1.11, git commit id 8ce28d

How reproducible:
About 20% of the time

Steps to Reproduce:
1. On a multi-node physical setup, ensure there is some reasonable VM network traffic.
2. In parallel, start a copy of a large file (50 GB).
3. After a while, the ovsdb-server and ovs-vswitchd processes start taking up 100% CPU and never yield.

Actual results:
Open vSwitch gets stuck

Expected results:
Open vSwitch should not get stuck

Additional info:
Some error logs for Open vSwitch:

2014-10-24T20:17:33Z|00020|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/openvswitch/db.sock<->) at lib/stream-fd.c:142 (99% CPU usage)
2014-10-24T20:17:33Z|00021|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00022|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00023|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00024|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00025|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00026|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00027|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00028|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00029|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:39Z|00030|poll_loop|INFO|Dropped 10014 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2014-10-24T20:17:39Z|00031|poll_loop|INFO|wakeup due to 0-ms timeout at unix (98% CPU usage)
2014-10-24T20:17:42Z|00032|jsonrpc|WARN|unix: send error: Broken pipe
2014-10-24T20:17:42Z|00033|reconnect|WARN|unix: connection dropped (Broken pipe)
2014-10-24T20:18:49Z|00034|fatal_signal|WARN|terminating with signal 15 (Terminated)
Yash,

Indeed, 1.11 is really old. I suspect that there is a newer Open vSwitch package available (openvswitch-2.1.2-1.el6 [1]). Could you see if that fixes the issue?

[1] https://repos.fedorapeople.org/repos/openstack/openstack-icehouse/epel-6/
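A hedged sketch of one way to pull and install the suggested package from [1] on the affected host (the exact file path within the repo is an assumption; adjust to where the RPM actually lives):

wget https://repos.fedorapeople.org/repos/openstack/openstack-icehouse/epel-6/openvswitch-2.1.2-1.el6.x86_64.rpm   # assumed path
rpm -Uvh openvswitch-2.1.2-1.el6.x86_64.rpm    # upgrade the user-space package in place
service openvswitch restart                    # restart the daemons on the new version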
Flavio,

Thanks for looking at this issue. When I tested OVS 2.3, I installed 2 RPMs:
- openvswitch-2.3.0-1.x86_64.rpm
- openvswitch-kmod-2.3.0-1.el6.x86_64.rpm

I could locate 'openvswitch-2.1.2-1.el6.x86_64.rpm' but could not locate the corresponding kmod RPM. Do we need that kmod RPM to test the newer OVS version?
(In reply to Yash from comment #3)
> I could locate 'openvswitch-2.1.2-1.el6.x86_64.rpm' but could not locate the
> corresponding kmod RPM. Do we need that kmod RPM to test the newer OVS
> version?

Any reason to avoid the openvswitch module provided by the standard EL6 kernel package? You should be fine without the kmod RPM.
Ok...let me simply load the user-space rpm and get back.
Hi Flavio,

I was able to upgrade to OVS 2.1.2 without any additional kmod modules, but this version of OVS does not seem to fix the problem. The OVS agent is stuck after heavy network activity for about 30 min.

[root@node-114 ~]# ovs-vsctl show
2014-11-11T06:42:18Z|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (Protocol error)
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)

[root@node-114 ~]# ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.1.3
Compiled Oct 10 2014 14:20:08

For OVS 2.3.0, they have added openvswitch-kmod as a dependency:

Error: Package: openvswitch-2.3.0-1.x86_64 (/openvswitch-2.3.0-1.x86_64)
           Requires: openvswitch-kmod
Hi,

Could you describe how you are testing it? I would like to reproduce the same issue here. Any idea of how many flows? Please attach the logs from /var/log/openvswitch as well.

You can either remove the requirement on openvswitch-kmod from the spec and rebuild the package, or install with 'rpm -Uvh --nodeps' to be able to test 2.3.0.
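A minimal sketch of the second option, assuming the 2.3.0 RPM file name built from source earlier in this thread:

rpm -Uvh --nodeps openvswitch-2.3.0-1.x86_64.rpm   # skip the openvswitch-kmod dependency check
service openvswitch restart                        # run the 2.3.0 user space against the stock EL6 kernel module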
To reproduce the problem, I have a 2-physical-host setup with a 6-VM virtual cluster. These VMs talk over a flat DHCP neutron network (no VLANs or any sort of tunneling). The VMs are doing some form of I/O operations (reading and writing to disks on the base host). The typical dataset size is 100 GB.

Whether we see the problem or not depends on how many VMs are on a host. If you end up spawning 3-4 machines (or more) on a single host, the problem is more easily seen. If the VM distribution is more even, then we may have to increase the size of the data or the size of the virtual cluster so that more VMs land on a single host.

I was able to reproduce the problem with OVS 2.3.0 as well. I'm attaching a tar file containing the openvswitch logs. Although the logs say ovsdb-server was terminated (for whatever reason), the process is still hanging around.
Created attachment 956423 [details] Openvswitch logs for OVS 2.3.0
We have been stress testing OVS with traffic and I don't see why it wouldn't work in your setup. Also, the logs look fine and the CPU usage for the OVS thread is about 50-60%. Therefore, I suspect that OVS is a victim of another problem that prevents it from running properly.

Could you attach strace to the OVS processes (ovsdb-server and ovs-vswitchd) and attach the logs? strace will add timestamps and the duration of every call, so we can get a better idea of what is going on.

# strace -f -ttt -T -p <pid> -o <logfile>

Another suggestion is to watch 'perf top -a' while reproducing the issue to see what else is running and where the CPU is spending time. That could give us a clue from the system perspective.
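For example, both daemons could be traced at once along these lines (the pidfile locations are assumed to be the EL6 defaults, and the log paths are only suggestions):

strace -f -ttt -T -p $(cat /var/run/openvswitch/ovsdb-server.pid) -o /tmp/ovsdb-server-strace.log &
strace -f -ttt -T -p $(cat /var/run/openvswitch/ovs-vswitchd.pid) -o /tmp/ovs-vswitchd-strace.log &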
Created attachment 956852 [details] strace logs
I am attaching the strace logs for both processes. This time ovsdb-server was hung (on a write operation, per strace). You will notice that ovs-vswitchd continues to log and be functional while the server is stuck.

One more point of information - the process is hung even after the network traffic stops. However, the process becomes operational again as soon as we delete the virtual cluster. The process starts logging traces again, and I have attached these traces as well.

Trace while leading up to the problem:
ovsdb-server-strace.log
vswitchd-strace.log

Trace after the process resumes:
ovsdb-server-resume-strace.log

Since the attachment size was limited, I have truncated the logs to around the time the problem shows up.
Output of perf top attached
Created attachment 957246 [details] Perf top output
(In reply to Yash from comment #12)
> I am attaching the strace logs for both processes. This time ovsdb-server
> was hung (on a write operation per strace). You will notice the ovs-vswitchd
> continues to log and be functional while the server is stuck.

So, you are saying that the last line in the strace:

338 1415815914.779600 write(17, "{\"method\":\"update\",\"id\":null,\"pa"..., 11189

continues on the other log:

338 1415816370.152027 write(5, "2014-11-12T18:19:30.147Z|02596|j"..., 189) = 189 <0.000131>

? It doesn't look like that is the case, because the second write is on a different file descriptor with different content. Anyway, it is normal to see the syscall interrupted if you stopped or paused strace somehow. If strace was interrupted, then I cannot tell anything. If you had it attached during the whole test, then yes, something is wrong and the write() syscall is stuck somewhere.

Let's assume that the write() syscall is stuck. This is not an OVS fault, though it would make OVS hang as well. In this case, I suggest debugging the kernel side and looking for possible reasons. I'd use sysrq+t and sysrq+l (you need to enable them first) while OVS is reproducing the issue to get stack traces of all processes. The idea is to find where it is stuck inside the kernel and what the other CPUs are running. I suspect that one CPU is holding a resource that prevents the write() syscall from continuing. You will need a few snapshots to get an understanding of what is moving and what is stuck.

> One more point of information - the process is hung even after the network
> traffic stops. However, the process becomes operational again as soon as we
> delete the virtual cluster. The process starts logging traces again and I
> have attached these traces as well.

OK, so that is somehow related to the write() syscall being blocked, but I have no idea how to debug that.

fbl
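For reference, a minimal sketch of how those sysrq dumps can be enabled and triggered from a root shell on an EL6 host:

echo 1 > /proc/sys/kernel/sysrq      # enable all sysrq functions
echo t > /proc/sysrq-trigger         # sysrq+t: dump the state and stack of every task to the kernel log
echo l > /proc/sysrq-trigger         # sysrq+l: dump backtraces of all active CPUs
dmesg | tail -200                    # the traces also end up in /var/log/messages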
(In reply to Yash from comment #13)
> Output of perf top attached

shrink_inactive_list
shrink_mem_cgroup_zone
shrink_zone
zone_reclaim
get_page_from_freelist
__alloc_pages_nodemask
kmem_getpages
cache_grow
cache_alloc_refill
kmem_cache_alloc_node_trace
__kmalloc_node
__alloc_skb
sock_alloc_send_pskb
sock_alloc_send_skb
unix_stream_sendmsg
sock_aio_write
do_sync_write
vfs_write
sys_write
system_call_fastpath

The above indicates the system is running out of memory, so the write() syscall is blocking while it tries to find memory for its socket buffer. The process of freeing memory can take some time, especially if the memory pressure continues on and on. It could be a bug somewhere that is leaking memory, or simply that the amount of memory in the host is not enough to run the workload.

You will need to track down where all the memory is being used (sar, vmstat and /proc/meminfo are a good start) and see whether it is a bug or not. If not, then try to tune the VM (virtual memory) settings to see if you can squeeze more out of the system.

fbl
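One possible way to capture the memory trend while reproducing the issue (intervals and log paths are only suggestions, not part of the original report):

sar -r 10 > /tmp/sar-mem.log &                                        # memory utilization every 10 seconds
vmstat 10 > /tmp/vmstat.log &                                         # free/cache/swap activity over time
while true; do date; cat /proc/meminfo; sleep 10; done > /tmp/meminfo.log &   # periodic full meminfo snapshots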
'free' on the system shows plenty of physical memory available, and so does /proc/meminfo:

[root@node-114 ~]# free
             total       used       free     shared    buffers     cached
Mem:      49416204   23680764   25735440          0      14352    4296996
-/+ buffers/cache:   19369416   30046788
Swap:     16498680        216   16498464

[root@node-114 ~]# cat /proc/meminfo
MemTotal:       49416204 kB
MemFree:        25138260 kB
Buffers:           13960 kB
Cached:          4865980 kB
SwapCached:          216 kB
Active:         13610688 kB
Inactive:        9640136 kB
Active(anon):   12350272 kB
Inactive(anon):  6027880 kB
Active(file):    1260416 kB
Inactive(file):  3612256 kB
Unevictable:      149004 kB
Mlocked:           36584 kB
SwapTotal:      16498680 kB
SwapFree:       16498464 kB
Dirty:               344 kB
Writeback:             0 kB
AnonPages:      18519072 kB
Mapped:           751160 kB
Shmem:               360 kB
Slab:             372964 kB
SReclaimable:     283540 kB
SUnreclaim:        89424 kB
KernelStack:        5984 kB
PageTables:        56876 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    41206780 kB
Committed_AS:   28687064 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      506380 kB
VmallocChunk:   34333292340 kB
HardwareCorrupted:     0 kB
AnonHugePages:  17555456 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        5504 kB
DirectMap2M:     2082816 kB
DirectMap1G:    48234496 kB
Is that while reproducing the issue?
Thanks for all your guidance, Flavio. I captured some memory stats while the workload was running, using sysrq, free and sar.

Initially, there is plenty of memory free and uncached. But 5 or so minutes into the task there is a significant spike in used memory, with as much as 24G showing up under 'cached', while the buffers/cache line still shows plenty of memory available. It's around this time (6 min into the task) that the OVS process hangs on a write. I was dumping 'free' at a 1-minute interval for a workload that lasts 20 minutes, and I captured the output of 'sar -r' at a 10-minute interval.

'sar' output:
               kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
05:10:01 PM     40802688    8613516     17.43      23580   1309156  14684496    22.28
05:20:01 PM     31362120   18054084     36.53      39588   1303616  28902784    43.85
05:30:03 PM     11317292   38098912     77.10      36612  18141900  29806336    45.22
05:40:01 PM      3934840   45481364     92.04       9212  24949092  30016188    45.54
05:50:02 PM     39393044   10023160     20.28       5616    789172   7543364    11.44

Memory used before virtual cluster creation: 17%
Memory used after virtual cluster creation: 36% (usage seems correct)

Output of free:
             total       used       free     shared    buffers     cached
Mem:      49416204   26248800   23167404          0      36272    6796976
-/+ buffers/cache:   19415552   30000652
Swap:     16498680        840   16497840

Memory usage at minute #2 of the workload: 77%

Output of free (in gigs):
             total       used       free     shared    buffers     cached
Mem:            47         34         12          0          0         15
-/+ buffers/cache:          18         28
Swap:           15          0         15

Memory usage at minute #6 of the workload: 92%

Output of free (in gigs):
             total       used       free     shared    buffers     cached
Mem:            47         44          2          0          0         24
-/+ buffers/cache:          19         27
Swap:           15          0         15

Memory consumption stayed close to 90% for 4 more minutes, followed by a sudden drop in consumption/cache, as seen below:

             total       used       free     shared    buffers     cached
Mem:            47         42          4          0          0         23
-/+ buffers/cache:          19         28
Swap:           15          0         15

             total       used       free     shared    buffers     cached
Mem:            47         24         22          0          0          6
-/+ buffers/cache:          18         28
Swap:           15          0         15

After this, the memory oscillated for about 2 minutes and settled at the values below:

             total       used       free     shared    buffers     cached
Mem:            47         26         20          0          0          3
-/+ buffers/cache:          22         24
Swap:           15          0         15

This is the steady state of memory consumption: the workload has finished, but the virtual cluster is still up and OVS is still stuck at 'write'.

One thing that puzzles me: even though the free 'Mem:' value goes down to 2G, there is plenty of buffers/cache, so we should not be starved for memory. Another question is why the cache (disk cache?) builds up to such a large value. My workload writes 100G of data from the VMs to the host (base); could that have something to do with the huge cache?

While running the test I also captured the output of sysrq+t. We had reduced the TCP write/read buffer size to 1M before the test, since more 1M fragments were available than the default of 4M.
Here is a summary of the output at various stages:

Nov 13 17:30:38 yav-114 kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15572kB
Nov 13 17:30:38 yav-114 kernel: Node 0 DMA32: 36*4kB 46*8kB 28*16kB 27*32kB 18*64kB 21*128kB 23*256kB 20*512kB 21*1024kB 31*2048kB 572*4096kB = 2449696kB
Nov 13 17:30:38 yav-114 kernel: Node 0 Normal: 11*4kB 11*8kB 60*16kB 24*32kB 11*64kB 4*128kB 303*256kB 1659*512kB 3486*1024kB 155*2048kB 348*4096kB = 6242564kB
Nov 13 17:30:38 yav-114 kernel: Node 1 Normal: 38*4kB 20*8kB 136*16kB 124*32kB 34*64kB 7*128kB 11*256kB 75*512kB 4*1024kB 1*2048kB 0*4096kB = 56888kB

Nov 13 17:31:56 yav-114 kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15572kB
Nov 13 17:31:56 yav-114 kernel: Node 0 DMA32: 36*4kB 46*8kB 28*16kB 27*32kB 18*64kB 21*128kB 23*256kB 20*512kB 21*1024kB 31*2048kB 572*4096kB = 2449696kB
Nov 13 17:31:56 yav-114 kernel: Node 0 Normal: 10*4kB 178*8kB 79*16kB 51*32kB 17*64kB 5*128kB 99*256kB 118*512kB 2201*1024kB 127*2048kB 348*4096kB = 4031176kB
Nov 13 17:31:56 yav-114 kernel: Node 1 Normal: 1375*4kB 1374*8kB 3997*16kB 2906*32kB 1202*64kB 361*128kB 175*256kB 74*512kB 0*1024kB 1*2048kB 0*4096kB = 381308kB

Nov 13 17:34:56 yav-114 kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15572kB
Nov 13 17:34:56 yav-114 kernel: Node 0 DMA32: 4*4kB 35*8kB 31*16kB 26*32kB 19*64kB 21*128kB 23*256kB 20*512kB 21*1024kB 34*2048kB 569*4096kB = 2443416kB
Nov 13 17:34:56 yav-114 kernel: Node 0 Normal: 153*4kB 156*8kB 84*16kB 45*32kB 21*64kB 10*128kB 4*256kB 2*512kB 2*1024kB 3*2048kB 8*4096kB = 50276kB
Nov 13 17:34:56 yav-114 kernel: Node 1 Normal: 95*4kB 231*8kB 151*16kB 72*32kB 44*64kB 10*128kB 16*256kB 75*512kB 1*1024kB 1*2048kB 0*4096kB = 56612kB
You can run sar more frequently to get a better idea of what is happening - I'd say at most a 1-minute interval - to understand the trends and catch a spike. Also, try looking at 'slabtop' while reproducing the issue, as it shows the slab cache usage. (I can't remember if sar reports that already, but slabtop gives a nice output that makes it easy to spot the top consumers.)

Just keep in mind that on a normally busy system, the write() syscall can get stuck in shrink_inactive_list(). It is only a problem if it remains stuck there for too long. So, I am not sure how you captured that backtrace, but make sure the write() is really stuck there, otherwise we are going in the wrong direction. I suspect you're correct, though, because of the apparent spike in memory consumption. 'dmesg' after reproducing it a few times may also give us a clue.

I suspect that the networking throughput is not enough to keep up with the workload from the VMs, so the host starts caching until it goes above the limits. The cached data remains until you shut down the virtual cluster, blocking any write() to a socket.
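One possible way to log the slab usage over time, along the lines suggested above (sort key, interval and output paths are only suggestions):

while true; do date; slabtop -o -s c | head -20; sleep 30; done >> /tmp/slabtop.log   # top slab caches by size, every 30s
sar -r 60 > /tmp/sar-mem-1min.log &                                                   # memory stats at a 1-minute interval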
OK, I confirmed that ovsdb-server is stuck at the write() syscall when the memory spike happens. In my previous comment, I stated that the cache stays high for about 4-5 min and then suddenly drops. During this period of high caching, 'sar' tells me %memused is always above 90% (most of it used for caching, based on 'free'). I was dumping sar/free at a 30-second interval.

What am I looking for in 'slabtop'? I tried 'slabtop --sort=u -d 30' and here is what it says. Let me know if I should use some other option to slabtop.

 Active / Total Objects (% used)    : 7054695 / 7094042 (99.4%)
 Active / Total Slabs (% used)      : 208078 / 208135 (100.0%)
 Active / Total Caches (% used)     : 137 / 224 (61.2%)
 Active / Total Size (% used)       : 790983.84K / 796880.64K (99.3%)
 Minimum / Average / Maximum Object : 0.02K / 0.11K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
    223     223 100%   32.12K    223        1     14272K kmem_cache
     62      62 100%    8.00K     62        1       496K size-8192
     56      56 100%   16.00K     56        1       896K size-16384
     40      40 100%   32.00K     40        1      1280K size-32768
     31      31 100%   64.00K     31        1      1984K size-65536
      1       1 100%  128.00K      1        1       128K size-131072
      2       2 100%    4.12K      2        1        16K bip-256
     34      34 100%    2.00K     17        2        68K biovec-128
     24      24 100%    4.00K     24        1        96K sgpool-128
      6       6 100%   12.69K      6        1        96K kvm_vcpu
     16      16 100%    0.81K      4        4        16K rpc_inode_cache
      8       8 100%    2.00K      4        2        16K rpc_buffers
     36      36 100%    0.94K      9        4        36K nfs_write_data
 388640  386155  99%    0.03K   3470      112     13880K size-32
    768     767  99%    4.00K    768        1      3072K size-4096
I forgot to mention that dmesg does not show anything.

Also, when I start deleting VMs, large chunks of (contiguous?) memory free up, which could be why the system/process recovers. Each VM takes about 3 GB of memory, so it's significant.

What do you mean by "the networking throughput is not enough"? Can you please elaborate a bit?
This may not be relevant anymore, but in one of my earlier attachments, taken when the process resumed, I might have started stracing too late. This time, the write call matches the socket fd.

Process suspends at:
24726 1415991593.986959 write(17, "{\"method\":\"update\",\"id\":null,\"pa"..., 11186) = 11186 <2007.600985>

Process resumes at:
24726 1415993601.607051 write(17, "{\"id\":217,\"result\":[{},{\"count\":"..., 145) = 145 <0.000198>
24726 1415993601.607486 read(18, 0x17958c8, 312) = -1 EAGAIN (Resource temporarily unavailable) <0.000098>
24726 1415993601.607861 getrusage(RUSAGE_SELF, {ru_utime={0, 429934}, ru_stime={2006, 894905}, ...}) = 0 <0.000011>
24726 1415993601.608015 write(5, "2014-11-14T19:33:21.607Z|00018|t"..., 115) = 115 <0.000025>
24726 1415993601.608100 write(5, "2014-11-14T19:33:21.608Z|00019|t"..., 94) = 94 <0.000015>
24726 1415993601.608182 write(5, "2014-11-14T19:33:21.608Z|00020|c"..., 108) = 108 <0.000013>
24726 1415993601.608228 poll([{fd=11, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}, {fd=17, events=POLLIN}], 5, 5000) = 1 ([{fd=17, revents=POLLIN}]) <0.000010>
This slabtop output might provide more information. When the workload starts, buffer_head occupies a small cache size but grows up to about 700M, where it maxes out. I don't see anything on the order of gigabytes in the slabtop output. buffer_head keeps holding this cache even after the workload is finished; we have to clear the cache manually, and even after that the OVS process is hung on the write.

[root@node-114 ~]# slabtop -s u -d 30
 Active / Total Objects (% used)    : 8009232 / 8033278 (99.7%)
 Active / Total Slabs (% used)      : 224471 / 224611 (99.9%)
 Active / Total Caches (% used)     : 132 / 217 (60.8%)
 Active / Total Size (% used)       : 850446.84K / 855539.33K (99.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.11K / 4096.00K

    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 6695261 6694795  99%    0.10K 180953       37    723812K buffer_head
  117131  115213  98%    0.55K  16733        7     66932K radix_tree_node
  552384  552363  99%    0.02K   3836      144     15344K avtab_node
     216     216 100%   32.12K    216        1     13824K kmem_cache
  381248  380744  99%    0.03K   3404      112     13616K size-32
   13508   13464  99%    0.99K   3377        4     13508K ext4_inode_cache
   50560   50276  99%    0.19K   2528       20     10112K dentry
   13128   13080  99%    0.58K   2188        6      8752K inode_cache
One more point of information: even with fewer VMs on the host we see identical behavior with regard to memory - a lot of memory ends up cached (in the output of free) and slabtop shows more memory against 'buffer_head'. However, OVS does not hang if there is only 1 VM on the system; any more VMs than that (in the same virtual cluster or another virtual cluster) on the same host will lead to the problem.

Memory is definitely playing a role in this problem. However, I strongly feel there is a weakness in the OVS agent's implementation that causes it to hang indefinitely. What should our next step be in trying to identify the memory problem? Can we also reach out to someone in the Open vSwitch community to fix the agent so that it recovers once the memory consumption goes down?

Thanks for all your help and time, Flavio.
Hi Yash,

The problem is that openvswitch needs memory to push data, and the OS is struggling to give memory back to openvswitch (the strace shows that). I bet other daemons in the OS are facing the same issue at the spike time and we are just not noticing it. Once the OS can fulfill the memory request, the write() continues, as the strace shows, and OVS resumes its normal operation.

The slabtop tool shows the slab allocator usage, so it is only part of the whole memory management picture. Usually, if you have heavy filesystem usage, buffer_head grows significantly, but it represents only the metadata, not the full picture of the filesystem's memory usage. /proc/meminfo offers a better overview of the memory allocated system-wide, but without much detail. There are also some sysrqs that show the free pages per size, which could be helpful too.

Anyway, the system seems to be running out of memory, so the solution would be either to protect the core system components from starvation or to somehow limit the side effects of the workload (tuning?). Since I am not seeing any specific problem with OVS, and if you agree, I would like to close this bug.

Thanks,
fbl
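A sketch of how that memory-state sysrq can be triggered from a shell (it produces the same kind of per-order "Node N ... kB" lines quoted earlier in this thread):

echo m > /proc/sysrq-trigger     # sysrq+m: dump current memory info, including free pages per order and per NUMA node, to the kernel log
dmesg | tail -100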
Hi Flavio,

I was able to avoid the problem by turning off the NUMA zone_reclaim behavior. This way, any process (including OVS) is allowed to look for memory outside its assigned NUMA node:

echo 0 > /proc/sys/vm/zone_reclaim_mode

This also seems to improve I/O performance for filesystem-heavy applications. We can go ahead and close this bug. Thanks for your help and pointers.
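To make the workaround persist across reboots, something like the following should work (a sketch; it assumes /etc/sysctl.conf is the active sysctl configuration on this EL6 host):

echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
sysctl -p                        # reload so the setting takes effect immediately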
Hi Yash,

I am glad that you found a way to avoid the problem. Actually, zone_reclaim_mode fits very well with what we have been troubleshooting: the system seemed to have enough free memory, but we didn't know on which NUMA node, and at the same time OVS was blocked waiting for free memory. Most likely the workload didn't fit within a single NUMA node.

I appreciate that you shared your findings, thanks!

fbl