Description of problem:

With 500+ ovn-controllers (worker nodes) connected to the sb-db raft cluster and the election timer set to 60 seconds, ovsdb-server crashes with a segmentation fault.

----
```
Last State:  Terminated
  Reason:    Error
  Message:   ction dropped (Protocol error)
2020-05-13T05:37:48Z|09393|stream_ssl|WARN|SSL_accept: system error (Success)
2020-05-13T05:37:48Z|09394|reconnect|WARN|ssl:10.0.128.219:35592: connection dropped (Protocol error)
2020-05-13T05:37:48Z|09395|stream_ssl|WARN|SSL_accept: system error (Success)
2020-05-13T05:37:48Z|09396|reconnect|WARN|ssl:10.0.128.219:36050: connection dropped (Protocol error)
2020-05-13T05:37:48Z|09397|stream_ssl|WARN|SSL_accept: system error (Success)
2020-05-13T05:37:48Z|09398|reconnect|WARN|ssl:10.0.128.219:36290: connection dropped (Protocol error)
2020-05-13T05:37:51Z|09399|poll_loop|INFO|Dropped 757 log messages in last 5 seconds (most recently, 0 seconds ago) due to excessive rate
2020-05-13T05:37:51Z|09400|poll_loop|INFO|wakeup due to [POLLOUT] on fd 52 (10.0.128.219:9642<->10.0.161.65:51080) at lib/stream-ssl.c:798 (100% CPU usage)
2020-05-13T05:37:57Z|09401|poll_loop|INFO|Dropped 160 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-13T05:37:57Z|09402|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (0.0.0.0:9642<->) at lib/stream-ssl.c:968 (57% CPU usage)
2020-05-13T05:38:40Z|09403|timeval|WARN|Unreasonably long 1295ms poll interval (1253ms user, 31ms system)
2020-05-13T05:38:40Z|09404|timeval|WARN|context switches: 0 voluntary, 2 involuntary
2020-05-13T05:38:40Z|09405|coverage|INFO|Skipping details of duplicate event coverage for hash=1e253855
2020-05-13T05:42:17Z|09406|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2020-05-13T05:42:17Z|09407|jsonrpc|WARN|Dropped 7 log messages in last 270 seconds (most recently, 269 seconds ago) due to excessive rate
2020-05-13T05:42:17Z|09408|jsonrpc|WARN|ssl:10.0.173.126:39312: receive error: Connection reset by peer
2020-05-13T05:42:17Z|09409|reconnect|WARN|ssl:10.0.173.126:39312: connection dropped (Connection reset by peer)
ovsdb-server: pthread_create failed (Resource temporarily unavailable)
2020-05-13T05:49:01Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 11 (Segmentation fault)
...
...
```
```
Warning  Unhealthy  17m (x383 over 6h32m)  kubelet, ip-10-0-128-219.us-west-2.compute.internal  Readiness probe failed: command timed out
Warning  Unhealthy  7m51s (x30 over 15m)   kubelet, ip-10-0-128-219.us-west-2.compute.internal  (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = command error:
runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailable
pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fd4eec718df m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fd4eec718df
stack: frame={sp:0x7ffe5d3f37f0, fp:0x0} stack=[0x7ffe5cbf4e68,0x7ffe5d3f3ea0)
00007ffe5d3f36f0:  0000000000000000  0000000000000000
00007ffe5d3f3700:  0000000000000000  0000000000000000
00007ffe5d3f3710:  0000000000000000  0000000000000000
00007ffe5d3f3720:  0000000000000000  0000000000000000
00007ffe5d3f3730:  0000000000000000  0000000000000000
00007ffe5d3f3740:  0000000000000000  0000000000000000
00007ffe5d3f3750:  0000000000000000  0000000000000000
00007ffe5d3f3760:  0000000000000000  00007ffe5d3f37b8
00007ffe5d3f3770:  00007ffe5d3f37c8  0000000000000040
00007ffe5d3f3780:  0000000000000040  0000000000000001
00007ffe5d3f3790:  000000006e43a318  000055c91cf2fb8c
00007ffe5d3f37a0:  0000000000000000  000055c91cc4bd8e <runtime.callCgoMmap+62>
00007ffe5d3f37b0:  00007ffe5d3f37b8  0000000000000000
00007ffe5d3f37c0:  0000000000000000  00007ffe5d3f3808
00007ffe5d3f37d0:  000055c91cc43a1a <runtime.mmap.func1+90>  0000000000000000
00007ffe5d3f37e0:  0000000000210800  0000002200000003
00007ffe5d3f37f0: <0000000000000000  00007fd4eca29000
00007ffe5d3f3800:  00007ffe5d3f3848  00007ffe5d3f3880
00007ffe5d3f3810:  000055c91cbf1083 <runtime.mmap+179>  00007ffe5d3f3850
00007ffe5d3f3820:  0000000000000000  00007ffe5d3f3878
00007ffe5d3f3830:  00007ffe5d3f3888  0000000000000040
00007ffe5d3f3840:  0000000000000040  0000000000000001
00007ffe5d3f3850:  000000006e43a318  000055c91cf2fb8c
00007ffe5d3f3860:  0000000000210800  000055c91cc4bd8e <runtime.callCgoMmap+62>
00007ffe5d3f3870:  fffffffe7fffffff  ffffffffffffffff
00007ffe5d3f3880:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f3890:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38a0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38b0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38c0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38d0:  ffffffffffffffff  ffffffffffffffff
00007ffe5d3f38e0:  ffffffffffffffff  ffffffffffffffff

goroutine 1 [chan receive, locked to thread]:
runtime.gopark(0x55c91d30de50, 0xc0000180c8, 0x170e, 0x3)
	/usr/lib/golang/src/runtime/proc.go:304 +0xe6 fp=0xc00004a670 sp=0xc00004a650 pc=0x55c91cc1e376
runtime.goparkunlock(...)
	/usr/lib/golang/src/runtime/proc.go:310
runtime.chanrecv(0xc000018070, 0x0, 0xc000000101, 0x55c91cc07aa0)
	/usr/lib/golang/src/runtime/chan.go:524 +0x2ec fp=0xc00004a700 sp=0xc00004a670 pc=0x55c91cbf40fc
runtime.chanrecv1(0xc000018070, 0x0)
	/usr/lib/golang/src/runtime/chan.go:406 +0x2b fp=0xc00004a730 sp=0xc00004a700 pc=0x55c91cbf3dbb
runtime.gcenable()
	/usr/lib/golang/src/runtime/mgc.go:212 +0x97 fp=0xc00004a760 sp=0xc00004a730 pc=0x55c91cc07ab7
runtime.main()
	/usr/lib/golang/src/runtime/proc.go:166 +0x129 fp=0xc00004a7e0 sp=0xc00004a760 pc=0x55c91cc1de89
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004a7e8 sp=0xc00004a7e0 pc=0x55c91cc49f11

goroutine 2 [runnable]:
runtime.forcegchelper()
	/usr/lib/golang/src/runtime/proc.go:245 fp=0xc00004afe0 sp=0xc00004afd8 pc=0x55c91cc1e160
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004afe8 sp=0xc00004afe0 pc=0x55c91cc49f11
created by runtime.init.6
	/usr/lib/golang/src/runtime/proc.go:242 +0x37

goroutine 3 [runnable]:
runtime.bgsweep(0xc000018070)
	/usr/lib/golang/src/runtime/mgcsweep.go:64 fp=0xc00004b7d8 sp=0xc00004b7d0 pc=0x55c91cc10cb0
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004b7e0 sp=0xc00004b7d8 pc=0x55c91cc49f11
created by runtime.gcenable
	/usr/lib/golang/src/runtime/mgc.go:210 +0x5e

goroutine 4 [runnable]:
runtime.bgscavenge(0xc000018070)
	/usr/lib/golang/src/runtime/mgcscavenge.go:287 fp=0xc00004bfd8 sp=0xc00004bfd0 pc=0x55c91cc102f0
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00004bfe0 sp=0xc00004bfd8 pc=0x55c91cc49f11
created by runtime.gcenable
	/usr/lib/golang/src/runtime/mgc.go:211 +0x80

rax    0x0
rbx    0x6
rcx    0x7fd4eec718df
rdx    0x0
rdi    0x2
rsi    0x7ffe5d3f37f0
rbp    0x55c91cfb2f2e
rsp    0x7ffe5d3f37f0
r8     0x0
r9     0x7ffe5d3f37f0
r10    0x8
r11    0x246
r12    0x55c91e5655d0
r13    0x0
r14    0x55c91cfa52b0
r15    0x0
rip    0x7fd4eec718df
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x55c91cc4b8cd m=2 sigcode=0

goroutine 0 [idle]:
runtime.usleep(0x6b8500002710, 0x7fd400000000, 0x0, 0x271000000008, 0x6b857146485c, 0x7fd4eca27e30, 0x55c91cc31dc7, 0x0, 0x4e, 0x6b8564b5012f, ...)
	/usr/lib/golang/src/runtime/sys_linux_amd64.s:131 +0x3d fp=0x7fd4eca27df8 sp=0x7fd4eca27dd8 pc=0x55c91cc4b8cd
runtime.sysmon()
	/usr/lib/golang/src/runtime/proc.go:4296 +0x8f fp=0x7fd4eca27e58 sp=0x7fd4eca27df8 pc=0x55c91cc286bf
runtime.mstart1()
	/usr/lib/golang/src/runtime/proc.go:1201 +0xc7 fp=0x7fd4eca27e80 sp=0x7fd4eca27e58 pc=0x55c91cc20d57
runtime.mstart()
	/usr/lib/golang/src/runtime/proc.go:1167 +0x70 fp=0x7fd4eca27ea8 sp=0x7fd4eca27e80 pc=0x55c91cc20c70

rax    0xfffffffffffffffc
rbx    0x4e20
rcx    0x55c91cc4b8cd
rdx    0x0
rdi    0x7fd4eca27dd8
rsi    0x0
rbp    0x7fd4eca27de8
rsp    0x7fd4eca27dd8
r8     0x0
r9     0x0
r10    0x342b505c6f6c4a
r11    0x202
r12    0x7ffe5d3f3b3e
r13    0x7ffe5d3f3b3f
r14    0x7ffe5d3f3c30
r15    0x7fd4eca27fc0
rip    0x55c91cc4b8cd
rflags 0x202
cs     0x33
fs     0x0
gs     0x0

-----

time="2020-05-13T05:44:26Z" level=error msg="exec failed: container_linux.go:349: starting container process caused \"read init-p: connection reset by peer\""
exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer"
, stdout: , stderr: , exit code -1
```
----

Version-Release number of selected component (if applicable):

How reproducible:
At higher scale (500+ worker nodes) it is reproducible very frequently.

Steps to Reproduce:
1. Deploy an OpenShift cluster.
2. Set the sb-db raft cluster election timer to 36 seconds (see the sketch after this comment).
3. Scale the setup to 400+ nodes.
4. In my observation, the failure happens when the sb-db container readiness probe executes (I might be wrong about this):
```
/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
```

Actual results:
ovsdb-server crashes with a segmentation fault.

Expected results:
ovsdb-server should not crash, and ovn-appctl should ideally time out.

Additional info:
I was not able to collect a coredump during this iteration of the test; I will capture one in the next test if I see the issue again. Meanwhile, if this is an existing bug, please point me to the patch so I can bring it into my deployment for testing.
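For reference, the raft election timer mentioned in step 2 is changed through ovsdb-server's unixctl interface on the cluster leader, and each call may at most double the current value, so reaching tens of seconds takes several invocations. A minimal sketch, assuming the southbound control socket path used elsewhere in this report (the timer value is in milliseconds):

```sh
# Walk the election timer up from the 1000 ms default to 36 s.
# cluster/change-election-timer refuses jumps of more than 2x the
# current value, hence the stepwise doubling.
for t in 2000 4000 8000 16000 32000 36000; do
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl \
        cluster/change-election-timer OVN_Southbound "$t"
done
```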
ovsdb-server: pthread_create failed (Resource temporarily unavailable)

^^^ This means a resource limit was exceeded and the server could not create more threads. It doesn't necessarily mean it's ovsdb-server's fault; it may mean that the host or the db-server pod needs its thread/process resource limits raised.
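For context, EAGAIN ("Resource temporarily unavailable") from pthread_create() can come from several distinct limits, so it is worth checking each one. A quick checklist sketch; the cgroup path assumes a cgroup v1 node, as on the OCP 4.x hosts in this report:

```sh
ulimit -u                          # RLIMIT_NPROC: max processes/threads per user
cat /proc/sys/kernel/threads-max   # system-wide thread cap
cat /proc/sys/kernel/pid_max       # size of the PID space
cat /sys/fs/cgroup/pids/pids.max   # pids cgroup limit for this container, if set
```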
To add to what Dan said, from ovs_thread_create():

```
error = pthread_create(&thread, &attr, ovsthread_wrapper, aux);
if (error) {
    ovs_abort(error, "pthread_create failed");
}
```

All OVS/OVN applications abort when they cannot create a new thread. I'm not sure why you saw a segmentation fault instead of an abort, though.

I agree with Dan's initial assessment. However, I think it would be worth investigating the thread growth in ovsdb-server in this scenario. One interesting data point would be the number of threads in use at the time of the abort. If you have a core dump from the crash, you can open it in gdb and run `info threads` to see how many threads were in use. If the process was using thousands of threads for some reason, that likely indicates an issue that needs addressing; if not, then it probably does point to some system limitation that needs to be increased.
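A minimal sketch of that core-dump inspection, assuming the core file and an ovsdb-server binary with matching debug symbols are available (both paths here are hypothetical):

```sh
gdb /usr/sbin/ovsdb-server /path/to/core.ovsdb-server   # hypothetical paths
# Inside gdb:
#   (gdb) info threads          # one line per thread; the count is the data point
#   (gdb) thread apply all bt   # full backtraces, if the count looks suspicious
```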
@Mark, I think in this scenario both ovsdb-server and the Go runtime (running the ovn-appctl readiness probe) failed for the same reason. But as you pointed out, ovsdb-server crashed with a segfault while the readiness probe failed with SIGABRT:

```
runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailable
pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fd4eec718df m=0 sigcode=18446744073709551610
```

I looked at threads-max for each master pod running the sb-db cluster:

```
# oc exec -n openshift-ovn-kubernetes ovnkube-master-hhzrx -c sbdb -- cat /proc/sys/kernel/threads-max
1019067
# oc exec -n openshift-ovn-kubernetes ovnkube-master-pfpsz -c sbdb -- cat /proc/sys/kernel/threads-max
1019067
# oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- cat /proc/sys/kernel/threads-max
1019067
```

So that value is set pretty high. Here is `ulimit -a` from the pods:

```
# oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 509533
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

If I hit the crash again, I will collect the coredump.
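Since the system-wide and per-user limits above look generous, the remaining suspects are per-container limits and actual usage. A sketch of how to watch both in the sbdb container (pod name taken from this report; cgroup v1 paths assumed):

```sh
# Threads currently used by ovsdb-server (PID 1 in the container):
oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- \
    sh -c 'ls /proc/1/task | wc -l'
# Current vs. maximum PIDs allowed by the container's pids cgroup:
oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- \
    sh -c 'cat /sys/fs/cgroup/pids/pids.current /sys/fs/cgroup/pids/pids.max'
```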
@Mark/@Dan, I dug a bit more into this, and it looks like the sb/nb db container readiness probes are creating a lot of zombie (defunct) processes. Once the sb-db raft cluster goes into a bad state, the readiness probe keeps failing and keeps creating zombies.

-----
```
root@ip-172-31-94-220: ~ # oc exec -n openshift-ovn-kubernetes ovnkube-master-w8fv2 -c sbdb -- ps -elf
F S UID   PID  PPID C PRI NI ADDR SZ     WCHAN  STIME TTY  TIME     CMD
4 S root  1    0    9 80  0  -    731418 x64_sy May14 ?    00:13:52 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-sb.log --remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=/var/run/ovn/ovnsb_db.pid --unixctl=/var/run/ovn/ovnsb_db.ctl --remote=db:OVN_Southbound,SB_Global,connections --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --ca-cert=/ovn-ca/ca-bundle.crt --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /etc/ovn/ovnsb_db.db
0 Z root  6542 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6543 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6544 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6598 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6599 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6600 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6607 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6608 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6609 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6656 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6657 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6658 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6665 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6666 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6667 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6711 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6712 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6713 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6792 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6793 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6794 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6801 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6802 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6803 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6810 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6811 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6812 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6995 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  6996 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  6997 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7004 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  7005 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7006 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7013 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  7014 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7015 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7022 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  7023 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7024 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7031 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  7032 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7033 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7040 1    0 80  0  -    0      -      May14 ?    00:00:00 [ovn-appctl] <defunct>
0 Z root  7041 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
0 Z root  7042 1    0 80  0  -    0      -      May14 ?    00:00:00 [grep] <defunct>
```
-----

At some point CNO realizes that the cluster is not in a healthy state and restarts the sb container, but since the system is hitting the pthread limit, the restart fails and the cluster never recovers. So this is a slow process leak: whenever the cluster stays busy for more than 10 seconds (which happens frequently at this scale), the readiness probe fails, and each failed probe attempt leaves 3 zombie processes behind. In my scale setup I have seen 500+ zombies accumulate in 4 hours, so it's a ticking time bomb.

I pushed a fix to CNO that resolves this issue. I will move this bug to ovn-kubernetes now.
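The mechanics, as I understand them: the probe runs a shell pipeline inside the container; when the probe times out, the pipeline's children are orphaned and reparented to PID 1, which here is ovsdb-server rather than an init that reaps orphans, so each dead child keeps holding a PID as a zombie until something calls wait() on it, which never happens. A hedged sketch of a probe script that avoids this (not the actual CNO patch; the 5-second bound and the grep pattern are assumptions):

```sh
#!/bin/bash
# Bound the check *inside* the probe so this shell outlives its children
# and wait()s on them, instead of letting the runtime kill the probe and
# orphan the pipeline onto a PID 1 that never reaps.
if output=$(timeout 5 /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl \
        cluster/status OVN_Southbound); then
    echo "$output" | grep -q 'Status: cluster member' && exit 0
fi
exit 1
```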
The following PR is under review for this bug: https://github.com/openshift/cluster-network-operator/pull/652
The relevant PR has been merged into the master branch.
Verified on 4.6.0-0.nightly-2020-10-03-051134: a 500-node AWS cluster with m4.4xlarge masters was stable with no restarts.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196