Bug 1158701 - Openvswitch hangs with current RDO version
Summary: Openvswitch hangs with current RDO version
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 5.0 (RHEL 6)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.0 (Juno)
Assignee: Flavio Leitner
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-10-29 21:49 UTC by Yash
Modified: 2016-04-26 15:14 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-03 13:51:26 UTC
Target Upstream Version:
Embargoed:


Attachments
Openvswitch logs for OVS 2.3.0 (30.00 KB, application/octet-stream), 2014-11-11 21:13 UTC, Yash
strace logs (4.58 MB, application/octet-stream), 2014-11-12 18:51 UTC, Yash
Perf top output (344 bytes, text/plain), 2014-11-13 18:14 UTC, Yash

Description Yash 2014-10-29 21:49:21 UTC
Description of problem:
Under heavy network and CPU load (VM-related traffic or regular physical NIC traffic), openvswitch is observed to hang. It becomes totally unresponsive and fails to process packets flowing to and from VMs. In this state, the OVS processes are stuck in an uninterruptible state; OVS does not respond to kill or SIGKILL, and we cannot restart any OVS services. Nothing is logged to the ovs logs once this happens, and all ovs commands hang. The only workaround is to reboot the host.

This is with Icehouse + RDO. I took the latest Openvswitch source code (2.3.0.1), built an RPM out of it, and tried it on the system. The problem seems to be fixed in the latest version of OVS.

This bug is primarily a request to add support for the latest version of OVS in RDO going forward, both for Icehouse and for future OpenStack releases.



Version-Release number of selected component (if applicable):
This is on Icehouse with RDO
Openvswitch version 1.11, git commit id 8ce28d


How reproducible:
About 20% of the time


Steps to Reproduce:
On a multi-node physical setup, ensure there is some reasonable VM network traffic. In parallel, start a copy of a large file (50 gigs). After a while, the ovsdb-server and ovs-vswitchd processes start taking up 100% CPU and never yield.
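(For illustration only, a rough way to approximate this kind of load; the commands and hostnames below are hypothetical, not the exact workload used:)

iperf -c <other-host> -t 1800 &                          # sustained VM/host network traffic (hypothetical endpoint)
dd if=/dev/zero of=/tmp/bigfile bs=1M count=51200        # create a ~50 GB file
scp /tmp/bigfile <other-host>:/tmp/                      # copy it across the network in parallel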

Actual results:
Openvswitch gets stuck


Expected results:
Openvswitch should not get stuck


Additional info:

Some error logs for Openvswitch

2014-10-24T20:17:33Z|00020|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/openvswitch/db.sock<->) at lib/stream-fd.c:142 (99% CPU usage)
2014-10-24T20:17:33Z|00021|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00022|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00023|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00024|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00025|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00026|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00027|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00028|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:33Z|00029|poll_loop|INFO|wakeup due to 0-ms timeout at unix (99% CPU usage)
2014-10-24T20:17:39Z|00030|poll_loop|INFO|Dropped 10014 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2014-10-24T20:17:39Z|00031|poll_loop|INFO|wakeup due to 0-ms timeout at unix (98% CPU usage)
2014-10-24T20:17:42Z|00032|jsonrpc|WARN|unix: send error: Broken pipe
2014-10-24T20:17:42Z|00033|reconnect|WARN|unix: connection dropped (Broken pipe)
2014-10-24T20:18:49Z|00034|fatal_signal|WARN|terminating with signal 15 (Terminated)

Comment 2 Flavio Leitner 2014-11-07 17:25:12 UTC
Yash,

Indeed 1.11 is really old.  I suspect that there is a newer Open vSwitch package available (openvswitch-2.1.2-1.el6 [1]).

Could you see if that fixes the issue?



[1]
https://repos.fedorapeople.org/repos/openstack/openstack-icehouse/epel-6/

Comment 3 Yash 2014-11-07 21:01:35 UTC
Flavio,

Thanks for looking at this issue. When I tested OVS 2.3, I installed 2 RPMs -

- openvswitch-2.3.0-1.x86_64.rpm
- openvswitch-kmod-2.3.0-1.el6.x86_64.rpm

I could locate 'openvswitch-2.1.2-1.el6.x86_64.rpm' but could not locate the corresponding kmod RPM. Do we need that kmod RPM to test the newer OVS version?

Comment 4 Flavio Leitner 2014-11-07 22:04:41 UTC
(In reply to Yash from comment #3)
> I could locate 'openvswitch-2.1.2-1.el6.x86_64.rpm' but could not locate the
> corresponding kmod RPM. Do we need that kmod RPM to test the newer OVS
> version?

Any reason to avoid the openvswitch module provided by the standard EL6 kernel package?  You should be fine without the kmod RPM.
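(For reference, a quick way to confirm which openvswitch kernel module is in use; standard commands, assuming the module is named 'openvswitch':)

# lsmod | grep openvswitch
# modinfo openvswitch | grep -E '^(filename|version|srcversion)'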

Comment 5 Yash 2014-11-07 22:10:03 UTC
Ok...let me simply load the user-space rpm and get back.

Comment 6 Yash 2014-11-11 06:56:50 UTC
Hi Flavio,
I was able to upgrade to OVS 2.1.2 without any additional kmod modules, but this version of OVS does not seem to fix the problem. The OVS agent gets stuck after about 30 minutes of heavy network activity.

[root@node-114 ~]# ovs-vsctl show
2014-11-11T06:42:18Z|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (Protocol error)
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)

[root@node-114 ~]# ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.1.3
Compiled Oct 10 2014 14:20:08

For OVS 2.3.0, they have added openvswitch-kmod as a dependency:

Error: Package: openvswitch-2.3.0-1.x86_64 (/openvswitch-2.3.0-1.x86_64)
           Requires: openvswitch-kmod

Comment 7 Flavio Leitner 2014-11-11 15:19:18 UTC
Hi,

Could you describe how you are testing it?  I would like to reproduce the same issue here.

Any idea of how many flows? 

Please, attach the logs from /var/log/openvswitch as well.

You can either remove the requirement on openvswitch-kmod from the spec and rebuild the package, or install with 'rpm -Uvh --nodeps', to be able to test 2.3.0.

Comment 8 Yash 2014-11-11 21:09:58 UTC
To reproduce the problem, I have a two-physical-host setup with a 6-VM virtual cluster. These VMs talk over a flat DHCP neutron network (no VLANs or any sort of tunneling). The VMs are doing some form of I/O (reading and writing to disks on the base host). The typical dataset size is 100 GB.

Whether we see the problem depends on how many VMs are on a host. If you end up spawning 3-4 machines (or more) on a single host, the problem is more easily seen. If the VM distribution is more even, then we may have to increase the size of the data or increase the virtual cluster size so more VMs land on a single host. I was able to reproduce the problem with OVS 2.3.0 as well.

I'm attaching a tar file containing the openvswitch logs. Although the logs say ovsdb-server was terminated (for whatever reason), the process is still hanging around.

Comment 9 Yash 2014-11-11 21:13:55 UTC
Created attachment 956423 [details]
Openvswitch logs for OVS 2.3.0

Comment 10 Flavio Leitner 2014-11-12 13:13:26 UTC
We have been stress testing OVS with traffic and I don't see why it wouldn't work in your setup. Also, the logs look fine and the CPU usage for the OVS thread is about 50-60%. Therefore, I suspect that OVS is a victim of another problem that prevents it from running properly.

Could you attach strace to the OVS processes (ovsdb-server and ovs-vswitchd)
and attach the resulting logs?  The strace output will include the timestamp and duration of every call, so we can get a better idea of what is going on.

# strace -f -ttt -T -p <pid> -o <logfile> 

Another suggestion is to watch 'perf top -a' while reproducing the issue to see what else is running and where the CPU is spending time.  That could give us a clue from the system perspective.
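For reference, a minimal way to capture both daemons at once (assuming single instances running under the usual process names):

# strace -f -ttt -T -p $(pidof ovsdb-server) -o /tmp/ovsdb-server-strace.log &
# strace -f -ttt -T -p $(pidof ovs-vswitchd) -o /tmp/vswitchd-strace.log &
# perf top -a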

Comment 11 Yash 2014-11-12 18:51:41 UTC
Created attachment 956852 [details]
strace logs

Comment 12 Yash 2014-11-12 19:00:59 UTC
I am attaching the strace logs for both processes. This time ovsdb-server was hung (on a write operation, per strace). You will notice that ovs-vswitchd continues to log and remains functional while the server is stuck.

One more point of information - the process is hung even after the network traffic stops. However, the process becomes operational again as soon as we delete the virtual cluster. The process starts logging traces again and I have attached these traces as well.

Trace leading up to the problem:
ovsdb-server-strace.log
vswitchd-strace.log

Trace after the process resumes:
ovsdb-server-resume-strace.log

Since the attachment size is limited, I have truncated the logs to around the time the problem shows up.

Comment 13 Yash 2014-11-13 18:13:39 UTC
Output of perf top attached

Comment 14 Yash 2014-11-13 18:14:16 UTC
Created attachment 957246 [details]
Perf top output

Comment 15 Flavio Leitner 2014-11-13 22:20:27 UTC
(In reply to Yash from comment #12)
> I am attaching the strace logs for both processes. This time ovsdb-server
> was hung (on a write operation per strace). You will notice the ovs-vswitchd
> continues to log and be functional while the server is stuck.

So, you are saying that the last line in the strace:
338   1415815914.779600 write(17, "{\"method\":\"update\",\"id\":null,\"pa"..., 11189
continues on the other log:
338   1415816370.152027 write(5, "2014-11-12T18:19:30.147Z|02596|j"..., 189) = 189 <0.000131>
?

It doesn't look like that is the case, because that write is on another file descriptor and with different content.
Anyway, it's normal to see the syscall interrupted if you had stopped/paused strace somehow.

If the strace was interrupted, then I cannot tell anything.  If you had it attached during the whole test, then yeah, something is wrong and the syscall write() is stuck somewhere.

Let's assume that the write() syscall is stuck.  This is not an OVS fault, though it would make OVS hang as well.  In this case, I suggest debugging the kernel side and looking for possible reasons.  I'd use sysrq+t and sysrq+l (you need to enable them first) while OVS is reproducing the issue to get stack traces of all processes. The idea is to find where it is stuck inside the kernel and what the other CPUs are running.  I suspect that one CPU is holding a resource that prevents the write() syscall from continuing.  You will need a few captures to get an understanding of what is moving and what is stuck.
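For reference, a minimal way to enable and trigger those dumps from a shell (standard /proc interface; the dumps land in the kernel log):

echo 1 > /proc/sys/kernel/sysrq     # enable all sysrq functions
echo t > /proc/sysrq-trigger        # dump the state/stack of all tasks
echo l > /proc/sysrq-trigger        # backtrace of all active CPUs
dmesg | tail -n 300                 # read the dumps back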

> One more point of information - the process is hung even after the network
> traffic stops. However, the process becomes operational again as soon as we
> delete the virtual cluster. The process starts logging traces again and I
> have attached these traces as well.

OK, so that is somehow related to the write() syscall being blocked, but I have no idea how to debug that.

fbl

Comment 16 Flavio Leitner 2014-11-13 22:38:44 UTC
(In reply to Yash from comment #13)
> Output of perf top attached

shrink_inactive_list
shrink_mem_cgroup_zone
shrink_zone
zone_reclaim
get_page_from_freelist
__alloc_pages_nodemask
kmem_getpages
cache_grow
cache_alloc_refill
kmem_cache_alloc_node_trace
__kmalloc_node
__alloc_skb
sock_alloc_send_pskb
sock_alloc_send_skb
unix_stream_sendmsg
sock_aio_write
do_sync_write
vfs_write
sys_write
system_call_fastpath


The above indicates the system is running out of memory, so the write() syscall is blocking while the kernel tries to find memory for its socket buffer.  The process of freeing memory can take some time, especially if the memory pressure continues on and on.

It could be a bug somewhere that is leaking memory, or simply that the amount of memory in the host is not enough to run the workload.  You will need to track down where all the memory is being used (sar, vmstat and /proc/meminfo are a good start) and see whether it's a bug or not. If not, then try to tune the VM (virtual memory) settings to see if you can squeeze more out of it.
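For reference, a simple way to watch those while reproducing (standard tools; the intervals are arbitrary):

sar -r 60                                                  # memory usage sample every 60 seconds
vmstat 5                                                   # free/cache/swap activity every 5 seconds
watch -n 30 'grep -E "MemFree|Cached|Slab" /proc/meminfo'  # quick view of the key counters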

fbl

Comment 17 Yash 2014-11-13 22:50:04 UTC
'free' on the system shows plenty of physical memory available, and so does /proc/meminfo.

[root@node-114 ~]# free
             total       used       free     shared    buffers     cached
Mem:      49416204   23680764   25735440          0      14352    4296996
-/+ buffers/cache:   19369416   30046788
Swap:     16498680        216   16498464


[root@node-114 ~]# cat /proc/meminfo
MemTotal:       49416204 kB
MemFree:        25138260 kB
Buffers:           13960 kB
Cached:          4865980 kB
SwapCached:          216 kB
Active:         13610688 kB
Inactive:        9640136 kB
Active(anon):   12350272 kB
Inactive(anon):  6027880 kB
Active(file):    1260416 kB
Inactive(file):  3612256 kB
Unevictable:      149004 kB
Mlocked:           36584 kB
SwapTotal:      16498680 kB
SwapFree:       16498464 kB
Dirty:               344 kB
Writeback:             0 kB
AnonPages:      18519072 kB
Mapped:           751160 kB
Shmem:               360 kB
Slab:             372964 kB
SReclaimable:     283540 kB
SUnreclaim:        89424 kB
KernelStack:        5984 kB
PageTables:        56876 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    41206780 kB
Committed_AS:   28687064 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      506380 kB
VmallocChunk:   34333292340 kB
HardwareCorrupted:     0 kB
AnonHugePages:  17555456 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        5504 kB
DirectMap2M:     2082816 kB
DirectMap1G:    48234496 kB

Comment 18 Flavio Leitner 2014-11-13 23:08:39 UTC
Is that while reproducing the issue?

Comment 19 Yash 2014-11-14 06:40:13 UTC
Thanks for all your guidance Flavio. I captured some memory stats while the workload was running. I used sysrq, free and sar. 

Initially, there is plenty of memory free and uncached. But 5 or so minutes into the task, there is a significant spike in used memory, with as much as 24 GB showing up under cached, while buffers/cache shows plenty of memory available. It's around this time (6 minutes into the task) that the OVS process hangs on a write.

I was dumping 'free' at a 1-minute interval for a workload that lasts 20 minutes, and captured the output of 'sar -r' as well at a 10-minute interval.
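(Roughly, the capture amounted to something like the following; the exact commands are assumed:)

while true; do date; free; sleep 60; done >> free.log &   # 'free' snapshot every minute
sar -r 600 >> sar.log &                                    # memory usage every 10 minutes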

'sar' output:

             kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
05:10:01 PM  40802688   8613516     17.43     23580   1309156  14684496     22.28
05:20:01 PM  31362120  18054084     36.53     39588   1303616  28902784     43.85
05:30:03 PM  11317292  38098912     77.10     36612  18141900  29806336     45.22
05:40:01 PM   3934840  45481364     92.04      9212  24949092  30016188     45.54
05:50:02 PM  39393044  10023160     20.28      5616    789172   7543364     11.44

Memory used before virtual cluster creation: 17%
Memory used after virtual cluster creation: 36% (usage seems correct)
Output of free-

             total       used       free     shared    buffers     cached
Mem:      49416204   26248800   23167404          0      36272    6796976
-/+ buffers/cache:   19415552   30000652
Swap:     16498680        840   16497840

Memory usage at minute #2 of workload: 77%
Output of free (in gigs)-

             total       used       free     shared    buffers     cached
Mem:            47         34         12          0          0         15
-/+ buffers/cache:         18         28
Swap:           15          0         15

Memory usage at minute #6 of workload: 92%
Output of free (in gigs)-

             total       used       free     shared    buffers     cached
Mem:            47         44          2          0          0         24
-/+ buffers/cache:         19         27
Swap:           15          0         15

The memory consumption stayed close to 90% for 4 more minutes, followed by a sudden drop in consumption/cache as seen below-


             total       used       free     shared    buffers     cached
Mem:            47         42          4          0          0         23
-/+ buffers/cache:         19         28
Swap:           15          0         15

             total       used       free     shared    buffers     cached
Mem:            47         24         22          0          0          6
-/+ buffers/cache:         18         28
Swap:           15          0         15

After this the memory oscillated for about 2 minutes and settled at the below values-

             total       used       free     shared    buffers     cached
Mem:            47         26         20          0          0          3
-/+ buffers/cache:         22         24
Swap:           15          0         15

This is the steady state of memory consumption after the workload has finished but while the virtual cluster is up and OVS is still stuck in 'write'.

One thing that puzzles me is that even though the free 'Mem:' value goes down to 2 GB, there is plenty of buffers/cache. We should not be crying for memory. Another question is - why does the cache (disk cache?) build up to such a large value? My workload is trying to write 100 GB of data from the VMs to the host (base). Could that have something to do with the huge cache?

While running the test I also captured the output of sysrq+t. We had reduced the TCP write/read buffer size to 1 MB before the test, since there were more 1 MB fragments available than 4 MB ones (the default). Here is a summary of the output at various stages-


Nov 13 17:30:38 yav-114 kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15572kB
Nov 13 17:30:38 yav-114 kernel: Node 0 DMA32: 36*4kB 46*8kB 28*16kB 27*32kB 18*64kB 21*128kB 23*256kB 20*512kB 21*1024kB 31*2048kB 572*4096kB = 2449696kB
Nov 13 17:30:38 yav-114 kernel: Node 0 Normal: 11*4kB 11*8kB 60*16kB 24*32kB 11*64kB 4*128kB 303*256kB 1659*512kB 3486*1024kB 155*2048kB 348*4096kB = 6242564kB
Nov 13 17:30:38 yav-114 kernel: Node 1 Normal: 38*4kB 20*8kB 136*16kB 124*32kB 34*64kB 7*128kB 11*256kB 75*512kB 4*1024kB 1*2048kB 0*4096kB = 56888kB


Nov 13 17:31:56 yav-114 kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15572kB
Nov 13 17:31:56 yav-114 kernel: Node 0 DMA32: 36*4kB 46*8kB 28*16kB 27*32kB 18*64kB 21*128kB 23*256kB 20*512kB 21*1024kB 31*2048kB 572*4096kB = 2449696kB
Nov 13 17:31:56 yav-114 kernel: Node 0 Normal: 10*4kB 178*8kB 79*16kB 51*32kB 17*64kB 5*128kB 99*256kB 118*512kB 2201*1024kB 127*2048kB 348*4096kB = 4031176kB
Nov 13 17:31:56 yav-114 kernel: Node 1 Normal: 1375*4kB 1374*8kB 3997*16kB 2906*32kB 1202*64kB 361*128kB 175*256kB 74*512kB 0*1024kB 1*2048kB 0*4096kB = 381308kB


Nov 13 17:34:56 yav-114 kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15572kB
Nov 13 17:34:56 yav-114 kernel: Node 0 DMA32: 4*4kB 35*8kB 31*16kB 26*32kB 19*64kB 21*128kB 23*256kB 20*512kB 21*1024kB 34*2048kB 569*4096kB = 2443416kB
Nov 13 17:34:56 yav-114 kernel: Node 0 Normal: 153*4kB 156*8kB 84*16kB 45*32kB 21*64kB 10*128kB 4*256kB 2*512kB 2*1024kB 3*2048kB 8*4096kB = 50276kB
Nov 13 17:34:56 yav-114 kernel: Node 1 Normal: 95*4kB 231*8kB 151*16kB 72*32kB 44*64kB 10*128kB 16*256kB 75*512kB 1*1024kB 1*2048kB 0*4096kB = 56612kB

Comment 20 Flavio Leitner 2014-11-14 16:30:04 UTC
You can run sar more frequently to get a better idea of what is happening.  I'd say a 1-minute interval at most, to see the trend and catch a spike.

Also, try looking at 'slabtop' while reproducing the issue, as it shows the slab cache usage. (I can't remember whether sar reports that already, but slabtop has a nice output that makes it easy to spot the top consumers.)

Just keep in mind that on a normally busy system, the write() syscall can get stuck in shrink_inactive_list(). It only becomes a problem if it remains stuck there for too long.  So, I am not sure how you captured that backtrace, but make sure the write is really stuck there, otherwise we are going in the wrong direction. I suspect you're correct though, because of the possible spike in memory consumption.
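One way to double-check that (standard /proc files; <pid> being the ovsdb-server pid):

# grep State /proc/<pid>/status    # 'D' means uninterruptible sleep
# cat /proc/<pid>/wchan; echo      # kernel function the task is sleeping in
# cat /proc/<pid>/stack            # full kernel stack, if available on this kernel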

'dmesg' after reproducing sometimes gives us a clue as well.

I suspect that the network throughput is not enough to keep up with the workload from the VMs, so the host starts caching until it goes above the limits.  The cached data remains until you shut down the virtual cluster, blocking any write() to a socket in the meantime.

Comment 21 Yash 2014-11-14 19:11:49 UTC
OK, I confirmed that ovsdb-server is stuck in the write() syscall when the memory spike happens. In my previous comment, I stated that the cache stays high for about 4-5 minutes and then suddenly drops. During this period of high caching, 'sar' tells me %memused is always above 90% (most of it used for caching, based on 'free'). I was dumping sar/free at a 30-second interval.

What am I looking for in 'slabtop'? I tried 'slabtop --sort=u -d 30' and here is what it says. Let me know if I should use some other slabtop option.

Active / Total Objects (% used)    : 7054695 / 7094042 (99.4%)
 Active / Total Slabs (% used)      : 208078 / 208135 (100.0%)
 Active / Total Caches (% used)     : 137 / 224 (61.2%)
 Active / Total Size (% used)       : 790983.84K / 796880.64K (99.3%)
 Minimum / Average / Maximum Object : 0.02K / 0.11K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
   223    223 100%   32.12K    223        1     14272K kmem_cache
    62     62 100%    8.00K     62        1	  496K size-8192
    56     56 100%   16.00K     56        1	  896K size-16384
    40     40 100%   32.00K     40        1	 1280K size-32768
    31     31 100%   64.00K     31        1	 1984K size-65536
     1      1 100%  128.00K	 1        1	  128K size-131072
     2      2 100%    4.12K      2        1	   16K bip-256
    34     34 100%    2.00K	17        2        68K biovec-128
    24     24 100%    4.00K     24        1	   96K sgpool-128
     6      6 100%   12.69K	 6        1        96K kvm_vcpu
    16     16 100%    0.81K	 4        4        16K rpc_inode_cache
     8      8 100%    2.00K	 4        2        16K rpc_buffers
    36     36 100%    0.94K	 9        4        36K nfs_write_data
388640 386155  99%    0.03K   3470	112     13880K size-32
   768    767  99%    4.00K    768        1	 3072K size-4096

Comment 22 Yash 2014-11-14 19:28:24 UTC
I forgot to mention, dmesg does not show anything. Also, when I start deleting VMs, large chunks of memory (contiguous?) free up. This could be the reason the system/process recovers. Each VM takes about 3 GB of memory, so it's significant.

What do you mean by "network throughput is not enough"? Can you please elaborate a bit?

Comment 23 Yash 2014-11-14 19:48:57 UTC
This may not be relevant anymore, but in one of my earlier attachments I might have started stracing late, after the process had already resumed. This time, the write call matches the socket fd.

Process suspends at:
24726 1415991593.986959 write(17, "{\"method\":\"update\",\"id\":null,\"pa"..., 11186) = 11186 <2007.600985>

Process resumes at:
24726 1415993601.607051 write(17, "{\"id\":217,\"result\":[{},{\"count\":"..., 145) = 145 <0.000198>
24726 1415993601.607486 read(18, 0x17958c8, 312) = -1 EAGAIN (Resource temporarily unavailable) <0.000098>
24726 1415993601.607861 getrusage(RUSAGE_SELF, {ru_utime={0, 429934}, ru_stime={2006, 894905}, ...}) = 0 <0.000011>
24726 1415993601.608015 write(5, "2014-11-14T19:33:21.607Z|00018|t"..., 115) = 115 <0.000025>
24726 1415993601.608100 write(5, "2014-11-14T19:33:21.608Z|00019|t"..., 94) = 94 <0.000015>
24726 1415993601.608182 write(5, "2014-11-14T19:33:21.608Z|00020|c"..., 108) = 108 <0.000013>
24726 1415993601.608228 poll([{fd=11, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}, {fd=17, events=POLLIN}], 5, 5000) = 1 ([{fd=17, revents=POLLIN}]) <0.000010>

Comment 24 Yash 2014-11-15 01:17:12 UTC
This slabtop output might provide more information. When the workload starts, buffer_head occupies a small cache size but grows up to about 700 MB, where it maxes out. I don't see anything on the order of gigabytes in the slabtop output. buffer_head keeps consuming this cache even after the workload is finished. We have to clear this cache manually, and even after that the OVS process is hung on the write.
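(The manual clearing mentioned above refers to dropping the page cache; presumably something along the lines of:)

sync
echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes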

[root@node-114 ~]# slabtop -s u -d 30
Active / Total Objects (% used)    : 8009232 / 8033278 (99.7%)
 Active / Total Slabs (% used)      : 224471 / 224611 (99.9%)
 Active / Total Caches (% used)     : 132 / 217 (60.8%)
 Active / Total Size (% used)       : 850446.84K / 855539.33K (99.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.11K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
6695261 6694795  99%    0.10K 180953	   37    723812K buffer_head
117131 115213  98%    0.55K  16733        7     66932K radix_tree_node
552384 552363  99%    0.02K   3836	144     15344K avtab_node
   216    216 100%   32.12K    216        1     13824K kmem_cache
381248 380744  99%    0.03K   3404	112     13616K size-32
 13508  13464  99%    0.99K   3377        4     13508K ext4_inode_cache
 50560  50276  99%    0.19K   2528	 20     10112K dentry
 13128  13080  99%    0.58K   2188        6	 8752K inode_cache

Comment 25 Yash 2014-11-17 18:20:36 UTC
One more point of information - even with fewer VMs on the host, we see identical behavior with regard to memory: a lot of memory ends up being cached (in the output of free) and slabtop shows more memory against 'buffer_head'.

OVS does not hang if there is only 1 VM on the system. Any more VMs than that (in the same virtual cluster or some other virtual cluster) on the same host will lead to the problem.

Memory is definitely playing a role in this problem. However, I strongly feel there is a poor implementation choice inside the OVS agent that causes it to hang indefinitely.

What should be our next step in trying to identify the memory problem? Can we also reach out to someone in the Openvswitch community to fix the agent so that it recovers once the memory pressure goes down?

Thanks for all your help and time Flavio.

Comment 26 Flavio Leitner 2014-12-02 13:53:34 UTC
Hi Yash,


The problem is that openvswitch needs memory to push data, and the OS is struggling to give memory back to openvswitch (the strace shows that).  I bet other daemons in the OS are facing the same issue at the time of the spike and we are just not noticing it.  Once the OS can fulfill the memory request, the write() continues, as the strace shows, and normal operation resumes.

The slabtop tool shows the slab allocator usage, which is just one part of the whole memory management system.  Usually, with heavy filesystem usage, buffer_head grows significantly, but it represents only the metadata, not the full picture of filesystem memory usage.

/proc/meminfo offers a better overview of overall system memory allocation, but not much detail.  There are also some sysrqs that show per-zone page usage, which could be helpful too.

Anyway, the system seems to be running out of memory, so the solution would be either to protect the core system components from starvation or somehow limit the side effects of the workload (tuning?).

Since I am not seeing any specific problem with OVS, if you agree I would like to close this bug.

Thanks,
fbl

Comment 27 Yash 2014-12-02 20:54:45 UTC
Hi Flavio,
I was able to avoid the problem by turning off NUMA's zone_reclaim behavior. This way any process (including OVS) is allowed to look for memory outside its assigned memory banks.

echo 0 > /proc/sys/vm/zone_reclaim_mode

This seems to improve I/O performance for filesystem type applications as well. We can go ahead and close this bug. Thanks for your help and pointers.
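(For reference, to keep that setting across reboots one would normally also persist it, e.g.:)

echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
sysctl -p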

Comment 28 Flavio Leitner 2014-12-03 13:51:26 UTC
Hi Yash,

I am glad that you found a way to avoid the problem.  Actually, zone_reclaim_mode fits very well with what we have been troubleshooting: the system seemed to have enough free memory, but we didn't know on which NUMA node, and at the same time OVS was blocked waiting for free memory.  Most likely the workload didn't fit within a single NUMA node.

I appreciate that you have shared your findings, thanks!

fbl

