Bug 1835497 - [OVN Scale] ovsdb-server crashes during ovs-appctl cluster/status check
Summary: [OVN Scale] ovsdb-server crashes during ovs-appctl cluster/status check
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Anil Vishnoi
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-13 23:00 UTC by Anil Vishnoi
Modified: 2021-06-09 14:42 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:59:18 UTC
Target Upstream Version:


Attachments:


Links:
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 15:59:36 UTC)

Description Anil Vishnoi 2020-05-13 23:00:22 UTC
Description of problem:
With 500+ ovn-controllers (worker nodes) connected to the sb-db raft cluster and the election timer set to 60 seconds, ovsdb-server crashes with a segmentation fault.

----
    Last State:  Terminated
      Reason:    Error
      Message:   ction dropped (Protocol error)
2020-05-13T05:37:48Z|09393|stream_ssl|WARN|SSL_accept: system error (Success)
2020-05-13T05:37:48Z|09394|reconnect|WARN|ssl:10.0.128.219:35592: connection dropped (Protocol error)
2020-05-13T05:37:48Z|09395|stream_ssl|WARN|SSL_accept: system error (Success)
2020-05-13T05:37:48Z|09396|reconnect|WARN|ssl:10.0.128.219:36050: connection dropped (Protocol error)
2020-05-13T05:37:48Z|09397|stream_ssl|WARN|SSL_accept: system error (Success)
2020-05-13T05:37:48Z|09398|reconnect|WARN|ssl:10.0.128.219:36290: connection dropped (Protocol error)
2020-05-13T05:37:51Z|09399|poll_loop|INFO|Dropped 757 log messages in last 5 seconds (most recently, 0 seconds ago) due to excessive rate
2020-05-13T05:37:51Z|09400|poll_loop|INFO|wakeup due to [POLLOUT] on fd 52 (10.0.128.219:9642<->10.0.161.65:51080) at lib/stream-ssl.c:798 (100% CPU usage)
2020-05-13T05:37:57Z|09401|poll_loop|INFO|Dropped 160 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-13T05:37:57Z|09402|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (0.0.0.0:9642<->) at lib/stream-ssl.c:968 (57% CPU usage)
2020-05-13T05:38:40Z|09403|timeval|WARN|Unreasonably long 1295ms poll interval (1253ms user, 31ms system)
2020-05-13T05:38:40Z|09404|timeval|WARN|context switches: 0 voluntary, 2 involuntary
2020-05-13T05:38:40Z|09405|coverage|INFO|Skipping details of duplicate event coverage for hash=1e253855
2020-05-13T05:42:17Z|09406|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2020-05-13T05:42:17Z|09407|jsonrpc|WARN|Dropped 7 log messages in last 270 seconds (most recently, 269 seconds ago) due to excessive rate
2020-05-13T05:42:17Z|09408|jsonrpc|WARN|ssl:10.0.173.126:39312: receive error: Connection reset by peer
2020-05-13T05:42:17Z|09409|reconnect|WARN|ssl:10.0.173.126:39312: connection dropped (Connection reset by peer)
ovsdb-server: pthread_create failed (Resource temporarily unavailable)
2020-05-13T05:49:01Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 11 (Segmentation fault)

...
...
 Warning  Unhealthy  17m (x383 over 6h32m)  kubelet, ip-10-0-128-219.us-west-2.compute.internal  Readiness probe failed: command timed out
  Warning  Unhealthy  7m51s (x30 over 15m)   kubelet, ip-10-0-128-219.us-west-2.compute.internal  (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = command error: runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailable
pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fd4eec718df m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fd4eec718df
stack: frame={sp:0x7ffe5d3f37f0, fp:0x0} stack=[0x7ffe5cbf4e68,0x7ffe5d3f3ea0)
00007ffe5d3f36f0:  0000000000000000  0000000000000000
00007ffe5d3f3700:  0000000000000000  0000000000000000
00007ffe5d3f3710:  0000000000000000  0000000000000000
00007ffe5d3f3720:  0000000000000000  0000000000000000
00007ffe5d3f3730:  0000000000000000  0000000000000000
00007ffe5d3f3740:  0000000000000000  0000000000000000
00007ffe5d3f3750:  0000000000000000  0000000000000000
00007ffe5d3f3760:  0000000000000000  00007ffe5d3f37b8
00007ffe5d3f3770:  00007ffe5d3f37c8  0000000000000040
00007ffe5d3f3780:  0000000000000040  0000000000000001
00007ffe5d3f3790:  000000006e43a318  000055c91cf2fb8c
00007ffe5d3f37a0:  0000000000000000  000055c91cc4bd8e <runtime.callCgoMmap+62>
00007ffe5d3f37b0:  00007ffe5d3f37b8  0000000000000000
00007ffe5d3f37c0:  0000000000000000  00007ffe5d3f3808
00007ffe5d3f37d0:  000055c91cc43a1a <runtime.mmap.func1+90>  0000000000000000
00007ffe5d3f37e0:  0000000000210800  0000002200000003
00007ffe5d3f37f0: <0000000000000000  00007fd4eca29000
00007ffe5d3f3800:  00007ffe5d3f3848  00007ffe5d3f3880
00007ffe5d3f3810:  000055c91cbf1083 <runtime.mmap+179>  00007ffe5d3f3850
00007ffe5d3f3820:  0000000000000000  00007ffe5d3f3878
00007ffe5d3f3830:  00007ffe5d3f3888  0000000000000040
00007ffe5d3f3840:  0000000000000040  0000000000000001
00007ffe5d3f3850:  000000006e43a318  000055c91cf2fb8c
00007ffe5d3f3860:  0000000000210800  000055c91cc4bd8e <runtime.callCgoMmap+62>
00007ffe5d3f3870:  fffffffe7fffffff  ffffffffffffffff
00007ffe5d3f3880:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f3890:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38a0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38b0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38c0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38d0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38e0:  ffffffffffffffff  ffffffffffffffff
runtime: unknown pc 0x7fd4eec718df
stack: frame={sp:0x7ffe5d3f37f0, fp:0x0} stack=[0x7ffe5cbf4e68,0x7ffe5d3f3ea0)
00007ffe5d3f36f0:  0000000000000000  0000000000000000
00007ffe5d3f3700:  0000000000000000  0000000000000000
00007ffe5d3f3710:  0000000000000000  0000000000000000
00007ffe5d3f3720:  0000000000000000  0000000000000000
00007ffe5d3f3730:  0000000000000000  0000000000000000
00007ffe5d3f3740:  0000000000000000  0000000000000000
00007ffe5d3f3750:  0000000000000000  0000000000000000
00007ffe5d3f3760:  0000000000000000  00007ffe5d3f37b8
00007ffe5d3f3770:  00007ffe5d3f37c8  0000000000000040
00007ffe5d3f3780:  0000000000000040  0000000000000001
00007ffe5d3f3790:  000000006e43a318  000055c91cf2fb8c
00007ffe5d3f37a0:  0000000000000000  000055c91cc4bd8e <runtime.callCgoMmap+62>
00007ffe5d3f37b0:  00007ffe5d3f37b8  0000000000000000
00007ffe5d3f37c0:  0000000000000000  00007ffe5d3f3808
00007ffe5d3f37d0:  000055c91cc43a1a <runtime.mmap.func1+90>  0000000000000000
00007ffe5d3f37e0:  0000000000210800  0000002200000003
00007ffe5d3f37f0: <0000000000000000  00007fd4eca29000
00007ffe5d3f3800:  00007ffe5d3f3848  00007ffe5d3f3880
00007ffe5d3f3810:  000055c91cbf1083 <runtime.mmap+179>  00007ffe5d3f3850
00007ffe5d3f3820:  0000000000000000  00007ffe5d3f3878
00007ffe5d3f3830:  00007ffe5d3f3888  0000000000000040
00007ffe5d3f3840:  0000000000000040  0000000000000001
00007ffe5d3f3850:  000000006e43a318  000055c91cf2fb8c
00007ffe5d3f3860:  0000000000210800  000055c91cc4bd8e <runtime.callCgoMmap+62>
00007ffe5d3f3870:  fffffffe7fffffff  ffffffffffffffff
00007ffe5d3f3880:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f3890:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38a0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38b0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38c0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38d0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38e0:  ffffffffffffffff  ffffffffffffffff

goroutine 1 [chan receive, locked to thread]:
runtime.gopark(0x55c91d30de50, 0xc0000180c8, 0x170e, 0x3)
  /usr/lib/golang/src/runtime/proc.go:304 +0xe6 fp=0xc00004a670 sp=0xc00004a650 pc=0x55c91cc1e376
runtime.goparkunlock(...)
  /usr/lib/golang/src/runtime/proc.go:310
runtime.chanrecv(0xc000018070, 0x0, 0xc000000101, 0x55c91cc07aa0)
  /usr/lib/golang/src/runtime/chan.go:524 +0x2ec fp=0xc00004a700 sp=0xc00004a670 pc=0x55c91cbf40fc
runtime.chanrecv1(0xc000018070, 0x0)
  /usr/lib/golang/src/runtime/chan.go:406 +0x2b fp=0xc00004a730 sp=0xc00004a700 pc=0x55c91cbf3dbb
runtime.gcenable()
  /usr/lib/golang/src/runtime/mgc.go:212 +0x97 fp=0xc00004a760 sp=0xc00004a730 pc=0x55c91cc07ab7
runtime.main()
  /usr/lib/golang/src/runtime/proc.go:166 +0x129 fp=0xc00004a7e0 sp=0xc00004a760 pc=0x55c91cc1de89
runtime.goexit()
  /usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004a7e8 sp=0xc00004a7e0 pc=0x55c91cc49f11

goroutine 2 [runnable]:
runtime.forcegchelper()
  /usr/lib/golang/src/runtime/proc.go:245 fp=0xc00004afe0 sp=0xc00004afd8 pc=0x55c91cc1e160
runtime.goexit()
  /usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004afe8 sp=0xc00004afe0 pc=0x55c91cc49f11
created by runtime.init.6
  /usr/lib/golang/src/runtime/proc.go:242 +0x37

goroutine 3 [runnable]:
runtime.bgsweep(0xc000018070)
  /usr/lib/golang/src/runtime/mgcsweep.go:64 fp=0xc00004b7d8 sp=0xc00004b7d0 pc=0x55c91cc10cb0
runtime.goexit()
  /usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004b7e0 sp=0xc00004b7d8 pc=0x55c91cc49f11
created by runtime.gcenable
  /usr/lib/golang/src/runtime/mgc.go:210 +0x5e

goroutine 4 [runnable]:
runtime.bgscavenge(0xc000018070)
  /usr/lib/golang/src/runtime/mgcscavenge.go:287 fp=0xc00004bfd8 sp=0xc00004bfd0 pc=0x55c91cc102f0
runtime.goexit()
  /usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004bfe0 sp=0xc00004bfd8 pc=0x55c91cc49f11
created by runtime.gcenable
  /usr/lib/golang/src/runtime/mgc.go:211 +0x80

rax    0x0
rbx    0x6
rcx    0x7fd4eec718df
rdx    0x0
rdi    0x2
rsi    0x7ffe5d3f37f0
rbp    0x55c91cfb2f2e
rsp    0x7ffe5d3f37f0
r8     0x0
r9     0x7ffe5d3f37f0
r10    0x8
r11    0x246
r12    0x55c91e5655d0
r13    0x0
r14    0x55c91cfa52b0
r15    0x0
rip    0x7fd4eec718df
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x55c91cc4b8cd m=2 sigcode=0

goroutine 0 [idle]:
runtime.usleep(0x6b8500002710, 0x7fd400000000, 0x0, 0x271000000008, 0x6b857146485c, 0x7fd4eca27e30, 0x55c91cc31dc7, 0x0, 0x4e, 0x6b8564b5012f, ...)
  /usr/lib/golang/src/runtime/sys_linux_amd64.s:131 +0x3d fp=0x7fd4eca27df8 sp=0x7fd4eca27dd8 pc=0x55c91cc4b8cd
runtime.sysmon()
  /usr/lib/golang/src/runtime/proc.go:4296 +0x8f fp=0x7fd4eca27e58 sp=0x7fd4eca27df8 pc=0x55c91cc286bf
runtime.mstart1()
  /usr/lib/golang/src/runtime/proc.go:1201 +0xc7 fp=0x7fd4eca27e80 sp=0x7fd4eca27e58 pc=0x55c91cc20d57
runtime.mstart()
  /usr/lib/golang/src/runtime/proc.go:1167 +0x70 fp=0x7fd4eca27ea8 sp=0x7fd4eca27e80 pc=0x55c91cc20c70
rax    0xfffffffffffffffc
rbx    0x4e20
rcx    0x55c91cc4b8cd
rdx    0x0
rdi    0x7fd4eca27dd8
rsi    0x0
rbp    0x7fd4eca27de8
rsp    0x7fd4eca27dd8
r8     0x0
r9     0x0
r10    0x342b505c6f6c4a
r11    0x202
r12    0x7ffe5d3f3b3e
r13    0x7ffe5d3f3b3f
r14    0x7ffe5d3f3c30
r15    0x7fd4eca27fc0
rip    0x55c91cc4b8cd
rflags 0x202
cs     0x33
fs     0x0
gs     0x0

-----

time="2020-05-13T05:44:26Z" level=error msg="exec failed: container_linux.go:349: starting container process caused \"read init-p: connection reset by peer\""
exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1
----

Version-Release number of selected component (if applicable):


How reproducible:
At higher scale (500+ worker nodes) it is reproducible very frequently.

Steps to Reproduce:
1. Deploy an OpenShift cluster.
2. Set the sb-db raft cluster election timer to 36 seconds.
3. Scale the setup to 400+ nodes.
4. Based on my observations, this failure happens when the sb-db container readiness probe executes (I might be wrong about this):
```
/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
```

Actual results:


Expected results:
ovsdb-server should not crash, and ovn-appctl should ideally time out.

Additional info:
I was not able to collect a coredump during this iteration of the test. I will capture it in the next test if I see the issue again. Meanwhile, if this is an existing bug, please point me to the patch so I can bring it into my deployment for testing.
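
As noted under "Expected results", the check should fail fast rather than hang. A minimal sketch of how the probe command from the steps above could be bounded with coreutils timeout; this is only an illustration, not the actual readiness probe definition or the eventual fix:
```
# Illustration only: bound the cluster/status check so a hung ovn-appctl
# makes the probe fail fast instead of piling up stuck processes.
# timeout exits non-zero (124) if the command does not finish in 10 seconds.
timeout 10 /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
```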

Comment 1 Dan Williams 2020-05-14 02:12:38 UTC
ovsdb-server: pthread_create failed (Resource temporarily unavailable)

^^^ This means resource limits were exceeded and the server couldn't create more threads. It doesn't mean it's ovsdb-server's fault; it may mean that the host or the dbserver pod needs its pthread resource limits raised.
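
One quick check, sketched here under assumptions (cgroup v1 layout, /sys/fs/cgroup visible inside the container, pod name only an example), is whether the container's pids cgroup limit is the resource being hit:
```
# Compare the current task count against the pids cgroup limit for the sbdb container
oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- \
  cat /sys/fs/cgroup/pids/pids.current /sys/fs/cgroup/pids/pids.max
```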

Comment 2 Mark Michelson 2020-05-14 13:02:46 UTC
To add to what Dan said, from ovs_thread_create:

    error = pthread_create(&thread, &attr, ovsthread_wrapper, aux);
    if (error) {
        ovs_abort(error, "pthread_create failed");
    }

All OVS/OVN applications will abort when they cannot create a new thread. I'm not sure why you saw a segmentation fault instead of an abort, though.

I agree with Dan's initial assessment. However, I think it would be worth investigating the thread growth in ovsdb-server in this scenario. One interesting data point here would be the number of threads in use at the time of the abort. If you have a core dump from the crash, you can open it in gdb and run `info threads` as a way to see how many threads were in use. If the process was using thousands of threads for some reason, that likely indicates an issue that needs addressing. However, if not, then it probably does point to some system limitation that needs to be increased.
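
A short sketch of that gdb workflow, assuming the core file and a matching ovsdb-server binary with debuginfo are available (paths are examples):
```
# Open the core dump and count the threads that existed at crash time
gdb /usr/sbin/ovsdb-server /path/to/core
(gdb) info threads          # number of threads in use
(gdb) thread apply all bt   # per-thread backtraces, if more detail is needed
```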

Comment 3 Anil Vishnoi 2020-05-14 23:35:26 UTC
@Mark
I think in this scenario both ovsdb-server and the Go runtime (running the ovs-appctl readiness probe) failed for the same reason. But as you pointed out, ovsdb-server crashed with a segfault, while the readiness probe failed with an abort (SIGABRT):

runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailable
pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fd4eec718df m=0 sigcode=18446744073709551610

I looked at the threads-max value for each master pod that is running the sb-db cluster:
# oc exec -n openshift-ovn-kubernetes ovnkube-master-hhzrx -c sbdb --  cat /proc/sys/kernel/threads-max
1019067
# oc exec -n openshift-ovn-kubernetes ovnkube-master-pfpsz -c sbdb --  cat /proc/sys/kernel/threads-max
1019067
# oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb --  cat /proc/sys/kernel/threads-max
1019067

So the value appears to be set quite high. Here is the ulimit -a output from the pods:

# oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb --  ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 509533
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
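
For comparison with the thread-growth question above: ovsdb-server runs as PID 1 in the sbdb container, so its live thread count can be read directly from /proc (pod name is only an example):
```
# Number of threads ovsdb-server currently has inside the sbdb container
oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- \
  grep Threads /proc/1/status
```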

If I hit the crash again, I will collect the coredump.

Comment 4 Anil Vishnoi 2020-05-28 01:05:46 UTC
@Mark/@Dan,

I dug a bit more into this, and it seems the sb/nb db container readiness probes are creating a lot of zombie (defunct) processes.
Once the sb-db raft cluster goes into a bad state, the readiness probe keeps failing and keeps creating zombie processes.

-----
root@ip-172-31-94-220: ~ # oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb --  ps -elf
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root           1       0  9  80   0 - 731418 x64_sy May14 ?       00:13:52 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-sb.log --remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=/var/run/ovn/ovnsb_db.pid --unixctl=/var/run/ovn/ovnsb_db.ctl --remote=db:OVN_Southbound,SB_Global,connections --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --ca-cert=/ovn-ca/ca-bundle.crt --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /etc/ovn/ovnsb_db.db
0 Z root        6542       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6543       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6544       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6598       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6599       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6600       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6607       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6608       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6609       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6656       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6657       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6658       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6665       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6666       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6667       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6711       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6712       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6713       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6792       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6793       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6794       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6801       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6802       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6803       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6810       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6811       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6812       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6995       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        6996       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        6997       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7004       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        7005       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7006       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7013       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        7014       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7015       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7022       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        7023       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7024       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7031       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        7032       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7033       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7040       1  0  80   0 -     0 -      May14 ?        00:00:00 [ovn-appctl] <defunct>
0 Z root        7041       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
0 Z root        7042       1  0  80   0 -     0 -      May14 ?        00:00:00 [grep] <defunct>
----

At some point CNO realizes that the cluster is not in a healthy state and restarts the sb-db container, but
because the system is hitting the pthread limit, the restart fails and the cluster never recovers. So it is effectively a slow leak.
If the cluster stays busy for more than 10 seconds (and does so more often), the readiness probe will fail more frequently.
Each probe attempt creates 3 zombie processes. In my scale setup I have seen 500+ zombies in 4 hours,
so it is something of a ticking time bomb.
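
A simple way to watch the leak, sketched from the ps output above (pod name is only an example; the 5-minute interval is arbitrary):
```
# Report the number of defunct (zombie) processes in the sbdb container every 5 minutes
watch -n 300 "oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- ps -elf | grep -c '<defunct>'"
```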

I pushed a fix to CNO that resolves this issue. I will move this bug to ovn-kubernetes now.

Comment 5 Anil Vishnoi 2020-06-03 00:58:32 UTC
The following PR is under review for this bug:

https://github.com/openshift/cluster-network-operator/pull/652

Comment 6 Anil Vishnoi 2020-06-19 06:52:46 UTC
The relevant PR has been merged into the master branch.

Comment 9 Mike Fiedler 2020-10-07 19:43:05 UTC
Verified on 4.6.0-0.nightly-2020-10-03-051134. A 500-node AWS cluster with m4.4xlarge masters is stable with no restarts.

Comment 11 errata-xmlrpc 2020-10-27 15:59:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

