Bug 1766482

Summary: When CPU usage spikes on a node host, goroutine stack traces are recorded with "Resource temporarily unavailable"
Product: OpenShift Container Platform Reporter: Daein Park <dapark>
Component: Node    Assignee: Robert Krawitz <rkrawitz>
Status: CLOSED ERRATA QA Contact: Sunil Choudhary <schoudha>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.1.z    CC: aos-bugs, jokerman, rphillips
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1766628 (view as bug list) Environment:
Last Closed: 2020-01-23 11:09:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1766628, 1766629    

Description Daein Park 2019-10-29 08:54:15 UTC
Description of problem:

When CPU usage spikes on an OCP 4.1 master host on AWS, goroutine stack traces like the following are recorded.
What does the goroutine stack trace mean, and how should we interpret it?

 1. Event logs
 2. Failed node event logs
 3. Operator status (Available=False only)


1. Event logs using "oc get events"
~~~
openshift-marketplace                                   41m         Warning   Unhealthy                     pod/marketplace-operator-xxxxxxxxxx-dwvtm                                      Readiness probe failed: Get http://10.130.0.68:8080/healthz: dial tcp 10.130.0.68:8080: connect: connection refused
openshift-service-catalog-apiserver-operator            21m         Normal    Created                       pod/openshift-service-catalog-apiserver-operator-xxxxxxxxxx-mzflx             Created container operator
openshift-service-catalog-apiserver-operator            21m         Normal    Started                       pod/openshift-service-catalog-apiserver-operator-xxxxxxxxxx-mzflx             Started container operator
openshift-apiserver                                     34m         Warning   Unhealthy                     pod/apiserver-aaaaa                                                           Readiness probe failed: HTTP probe failed with statuscode: 500
openshift-marketplace                                   41m         Warning   Unhealthy                     pod/marketplace-operator-xxxxxxxxxx-dwvtm                                      Liveness probe failed: Get http://10.130.0.68:8080/healthz: dial tcp 10.130.0.68:8080: connect: connection refused
openshift-sdn                                           41m         Warning   Unhealthy                     pod/sdn-aaaaa                                                                 Liveness probe failed: Get http://10.0.1.68:10256/healthz: dial tcp 10.0.1.68:10256: connect: connection refused
openshift-apiserver                                     41m         Warning   Unhealthy                     pod/apiserver-aaaaa                                                           Liveness probe failed: HTTP probe failed with statuscode: 500
openshift-machine-config-operator                       41m         Warning   Unhealthy                     pod/etcd-quorum-guard-xxxxxxxxxx-aaaaa                                        Readiness probe errored: rpc error: code = Unknown desc = command error: runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f17532e593f m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f17532e593f
stack: frame={sp:0x7fffa5464480, fp:0x0} stack=[0x7fffa4c65978,0x7fffa54649a0)
00007fffa5464380:  00007fffa5464388  00007fffa54643d8
00007fffa5464390:  00007fffa54643d8  000055f120d679d1 <runtime.persistentalloc1+465>
00007fffa54643a0:  000055f121620990  0000000000000000
00007fffa54643b0:  0000000000010000  0000000000000008
00007fffa54643c0:  0000000000000120  000055f100000000
00007fffa54643d0:  0000000000000120  0000000000000000
00007fffa54643e0:  0000000000000130  0000000000000007
00007fffa54643f0:  0000000000000013  000055f122988098
00007fffa5464400:  000000000000011f  0000000000000150
00007fffa5464410:  ffffffffffffffa8  0000000000000004
00007fffa5464420:  0000003400000013  0000000000000000
00007fffa5464430:  0000000000000000  0000006e0000005b
00007fffa5464440:  0000000000000000  0000007c00000077
00007fffa5464450:  0000000000000120  00007f175366cc60
00007fffa5464460:  0000000000000120  0000000000020a20
00007fffa5464470:  00007f1753671710  000055f1229885e0
00007fffa5464480: <0000000000000000  00007f1753334efa
00007fffa5464490:  00007f175009b700  00007fffa54644d0
00007fffa54644a0:  000055f120da3c47 <runtime.step+279>  000055f1214163a2
00007fffa54644b0:  00000000001ea699  00000000001ea699
00007fffa54644c0:  0000027500000002  00000000001ea699
00007fffa54644d0:  00007fffa5464578  000055f120da300f <runtime.pcvalue+319>
00007fffa54644e0:  000055f121434058  00000000001cc9e3
00007fffa54644f0:  00000000001cc9e3  00007fffa5464538
00007fffa5464500:  fffffffe7fffffff  ffffffffffffffff
00007fffa5464510:  ffffffffffffffff  ffffffffffffffff
00007fffa5464520:  ffffffffffffffff  ffffffffffffffff
00007fffa5464530:  ffffffffffffffff  ffffffffffffffff
00007fffa5464540:  ffffffffffffffff  ffffffffffffffff
00007fffa5464550:  ffffffffffffffff  ffffffffffffffff
00007fffa5464560:  ffffffffffffffff  ffffffffffffffff
00007fffa5464570:  ffffffffffffffff  ffffffffffffffff
runtime: unknown pc 0x7f17532e593f
stack: frame={sp:0x7fffa5464480, fp:0x0} stack=[0x7fffa4c65978,0x7fffa54649a0)
00007fffa5464380:  00007fffa5464388  00007fffa54643d8
00007fffa5464390:  00007fffa54643d8  000055f120d679d1 <runtime.persistentalloc1+465>
00007fffa54643a0:  000055f121620990  0000000000000000
00007fffa54643b0:  0000000000010000  0000000000000008
00007fffa54643c0:  0000000000000120  000055f100000000
00007fffa54643d0:  0000000000000120  0000000000000000
00007fffa54643e0:  0000000000000130  0000000000000007
00007fffa54643f0:  0000000000000013  000055f122988098
00007fffa5464400:  000000000000011f  0000000000000150
00007fffa5464410:  ffffffffffffffa8  0000000000000004
00007fffa5464420:  0000003400000013  0000000000000000
00007fffa5464430:  0000000000000000  0000006e0000005b
00007fffa5464440:  0000000000000000  0000007c00000077
00007fffa5464450:  0000000000000120  00007f175366cc60
00007fffa5464460:  0000000000000120  0000000000020a20
00007fffa5464470:  00007f1753671710  000055f1229885e0
00007fffa5464480: <0000000000000000  00007f1753334efa
00007fffa5464490:  00007f175009b700  00007fffa54644d0
00007fffa54644a0:  000055f120da3c47 <runtime.step+279>  000055f1214163a2
00007fffa54644b0:  00000000001ea699  00000000001ea699
00007fffa54644c0:  0000027500000002  00000000001ea699
00007fffa54644d0:  00007fffa5464578  000055f120da300f <runtime.pcvalue+319>
00007fffa54644e0:  000055f121434058  00000000001cc9e3
00007fffa54644f0:  00000000001cc9e3  00007fffa5464538
00007fffa5464500:  fffffffe7fffffff  ffffffffffffffff
00007fffa5464510:  ffffffffffffffff  ffffffffffffffff
00007fffa5464520:  ffffffffffffffff  ffffffffffffffff
00007fffa5464530:  ffffffffffffffff  ffffffffffffffff
00007fffa5464540:  ffffffffffffffff  ffffffffffffffff
00007fffa5464550:  ffffffffffffffff  ffffffffffffffff
00007fffa5464560:  ffffffffffffffff  ffffffffffffffff
00007fffa5464570:  ffffffffffffffff  ffffffffffffffff

goroutine 1 [running, locked to thread]:
runtime.asmcgocall(0x55f12104a890, 0xc000028720)
                                                        /usr/lib/golang/src/runtime/asm_amd64.s:622 +0x3f fp=0xc0000286f8 sp=0xc0000286f0 pc=0x55f120db2f3f
runtime.newm1(0xc00005e380)
                                                        /usr/lib/golang/src/runtime/proc.go:1935 +0xc8 fp=0xc000028748 sp=0xc0000286f8 pc=0x55f120d8bd28
runtime.newm(0x55f1213cd040, 0x0)
                                                        /usr/lib/golang/src/runtime/proc.go:1919 +0x9f fp=0xc000028778 sp=0xc000028748 pc=0x55f120d8bbaf
runtime.startTemplateThread()
                                                        /usr/lib/golang/src/runtime/proc.go:1955 +0x51 fp=0xc000028798 sp=0xc000028778 pc=0x55f120d8bdf1
runtime.main()
                                                        /usr/lib/golang/src/runtime/proc.go:184 +0x19c fp=0xc0000287e0 sp=0xc000028798 pc=0x55f120d8799c
runtime.goexit()
                                                        /usr/lib/golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0000287e8 sp=0xc0000287e0 pc=0x55f120db3801

goroutine 2 [force gc (idle)]:
runtime.gopark(0x55f1213ccf98, 0x55f121620970, 0x1410, 0x1)
                                                        /usr/lib/golang/src/runtime/proc.go:302 +0xf1 fp=0xc000028f80 sp=0xc000028f60 pc=0x55f120d87e01
runtime.goparkunlock(0x55f121620970, 0x1410, 0x1)
                                                        /usr/lib/golang/src/runtime/proc.go:308 +0x55 fp=0xc000028fb0 sp=0xc000028f80 pc=0x55f120d87eb5
runtime.forcegchelper()
                                                        /usr/lib/golang/src/runtime/proc.go:251 +0xb7 fp=0xc000028fe0 sp=0xc000028fb0 pc=0x55f120d87c77
runtime.goexit()
                                                        /usr/lib/golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000028fe8 sp=0xc000028fe0 pc=0x55f120db3801
created by runtime.init.5
                                                        /usr/lib/golang/src/runtime/proc.go:240 +0x37

goroutine 3 [GC sweep wait]:
runtime.gopark(0x55f1213ccf98, 0x55f121620b00, 0x55f120d7140c, 0x1)
                                                        /usr/lib/golang/src/runtime/proc.go:302 +0xf1 fp=0xc000029780 sp=0xc000029760 pc=0x55f120d87e01
runtime.goparkunlock(0x55f121620b00, 0x55f1210a140c, 0x1)
                                                        /usr/lib/golang/src/runtime/proc.go:308 +0x55 fp=0xc0000297b0 sp=0xc000029780 pc=0x55f120d87eb5
runtime.bgsweep(0xc000052000)

runtime.goexit()
                                                        /usr/lib/golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0000297e0 sp=0xc0000297d8 pc=0x55f120db3801
created by runtime.gcenable
                                                        /usr/lib/golang/src/runtime/mgc.go:216 +0x5a

rax    0x0
rbx    0x6
rcx    0x7f17532e593f
rdx    0x0
rdi    0x2
rsi    0x7fffa5464480
rbp    0x55f1210ae632
rsp    0x7fffa5464480
r8     0x0
r9     0x7fffa5464480
r10    0x8
r11    0x246
r12    0x55f122988720
r13    0x11
r14    0x55f1210a12b8
r15    0x0
rip    0x7f17532e593f
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x55f120db5683 m=2 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x55f121621230, 0x80, 0x7f175109cd50, 0x0, 0x7f1700000000, 0x55f120db52b8, 0x3c, 0x0, 0x7f175109cd68, 0x55f120d65970, ...)
                                                        /usr/lib/golang/src/runtime/sys_linux_amd64.s:532 +0x23 fp=0x7f175109cd20 sp=0x7f175109cd18 pc=0x55f120db5683
runtime.futexsleep(0x55f121621230, 0x0, 0xdf8475800)
                                                        /usr/lib/golang/src/runtime/os_linux.go:63 +0xc0 fp=0x7f175109cd70 sp=0x7f175109cd20 pc=0x55f120d83b10
runtime.notetsleep_internal(0x55f121621230, 0xdf8475800, 0x989680)
                                                        /usr/lib/golang/src/runtime/lock_futex.go:193 +0x130 fp=0x7f175109cda8 sp=0x7f175109cd70 pc=0x55f120d65970
runtime.notetsleep(0x55f121621230, 0xdf8475800, 0x0)
                                                        /usr/lib/golang/src/runtime/lock_futex.go:216 +0x58 fp=0x7f175109cdd0 sp=0x7f175109cda8 pc=0x55f120d65a48
runtime.sysmon()
                                                        /usr/lib/golang/src/runtime/proc.go:4369 +0x175 fp=0x7f175109ce58 sp=0x7f175109cdd0 pc=0x55f120d92655
runtime.mstart1()
                                                        /usr/lib/golang/src/runtime/proc.go:1282 +0xea fp=0x7f175109ce80 sp=0x7f175109ce58 pc=0x55f120d8ab7a
runtime.mstart()
                                                        /usr/lib/golang/src/runtime/proc.go:1248 +0x70 fp=0x7f175109cea8 sp=0x7f175109ce80 pc=0x55f120d8aa70

rax    0xfffffffffffffffc
rbx    0x0
rcx    0x55f120db5683
rdx    0x0
rdi    0x55f121621230
rsi    0x80
rbp    0x7f175109cd60
rsp    0x7f175109cd18
r8     0x0
r9     0x0
r10    0x7f175109cd50
r11    0x246
r12    0x7fffa546462e
r13    0x7fffa546462f
r14    0x7fffa5464720
r15    0x7f175109cfc0
rip    0x55f120db5683
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x55f120db5681 m=3 sigcode=0

goroutine 0 [idle]:
runtime.futex(0xc00002c840, 0x80, 0x0, 0x0, 0x0, 0x55f120d83b5c, 0x0, 0x0, 0x, stdout: , stderr: , exit code -1
openshift-machine-api                                   16m         Warning   BackOff                       pod/cluster-autoscaler-operator-xxxxxxxxxx-aaaaa                              Back-off restarting failed container
openshift-operator-lifecycle-manager                    19m         Warning   BackOff                       pod/catalog-operator-xxxxxxxxxx-xmpmm                                         Back-off restarting failed container
openshift-authentication                                41m         Warning   Unhealthy                     pod/oauth-openshift-xxxxxxxxxx-aaaaa                                          Liveness probe failed: Get https://10.130.0.65:6443/healthz: dial tcp 10.130.0.65:6443: connect: connection refused
openshift-authentication                                34m         Warning   Unhealthy                     pod/oauth-openshift-xxxxxxxxxx-aaaaa                                          Readiness probe failed: Get https://10.130.0.65:6443/healthz: dial tcp 10.130.0.65:6443: connect: connection refused
openshift-apiserver                                     41m         Warning   Unhealthy                     pod/apiserver-aaaaa                                                           Readiness probe failed: Get https://10.130.0.11:8443/healthz: dial tcp 10.130.0.11:8443: connect: connection refused
openshift-marketplace                                   21m         Warning   BackOff                       pod/marketplace-operator-xxxxxxxxxx-dwvtm                                      Back-off restarting failed container
openshift-machine-config-operator                       41m         Warning   Unhealthy                     pod/etcd-quorum-guard-xxxxxxxxxx-aaaaa                                        Readiness probe errored: rpc error: code = Unknown desc = command error: runtime/cgo: pthread_create failed: Resource temporarily unavailable
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fc5a5c5e93f m=3 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fc5a5c5e93f
stack: frame={sp:0x7fc5a3214820, fp:0x0} stack=[0x7fc5a2a15288,0x7fc5a3214e88)
00007fc5a3214720:  0000000000000000  0000000000000000
:
~~~
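
A quick way to pull only the events that carry the thread-creation failure out of the stream above (a generic sketch, not the exact command used for this report; adjust the filter as needed):
~~~
# List recent events in every namespace, newest last, and keep only the lines
# that mention the failed thread creation seen in the probe errors above.
oc get events --all-namespaces --sort-by=.lastTimestamp \
  | grep -E "pthread_create failed|Resource temporarily unavailable"
~~~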

2. Failed node event logs
~~~
  :
  openshift-machine-config-operator                      etcd-quorum-guard-xxxxxxxxxx-aaaaa                                         10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         66d
  :
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1720m (49%)   0 (0%)
  memory                      5315Mi (34%)  1748Mi (11%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type     Reason        Age                From                                                     Message
  ----     ------        ----               ----                                                     -------
  Normal   NodeReady     31m (x3 over 16h)  kubelet, ip-10-0-1-68.ap-northeast-1.compute.internal  Node ip-10-0-1-68.ap-northeast-1.compute.internal status is now: NodeReady
  Normal   NodeNotReady  25m (x4 over 16h)  kubelet, ip-10-0-1-68.ap-northeast-1.compute.internal  Node ip-10-0-1-68.ap-northeast-1.compute.internal status is now: NodeNotReady
  Warning  SystemOOM     25m                kubelet, ip-10-0-1-68.ap-northeast-1.compute.internal  System OOM encountered
~~~

3. Operator status (Available=False only)
~~~
machine-config                       4.1.0     False       False         True       89d
monitoring                           4.1.0     False       True          True       65d
~~~
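
The two rows above are the cluster operators reporting Available=False. A sketch of how such a filtered view can be produced, assuming the default "oc get clusteroperators" column layout (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE); this is illustrative, not the exact command used here:
~~~
# Print the header plus any cluster operator whose AVAILABLE column (field 3)
# reads False.
oc get clusteroperators | awk 'NR==1 || $3 == "False"'
~~~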

Version-Release number of selected component (if applicable):

OCP 4.1.0

How reproducible:

N/A

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
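
"Resource temporarily unavailable" is the strerror text for EAGAIN, so the probe errors above mean pthread_create(3) could not create another OS thread on the node; when that happens inside runtime/cgo, the Go runtime aborts, producing the "SIGABRT: abort" dumps captured in the events. The commands below are a generic sketch for checking the relevant thread/PID limits from a debug shell on the affected node; the cgroup path is an assumption (cgroup v1 layout) and may differ by release:
~~~
# Open a shell on the affected node and switch into the host filesystem.
oc debug node/ip-10-0-1-68.ap-northeast-1.compute.internal
chroot /host

# Threads currently running on the node.
ps -eLf --no-headers | wc -l

# System-wide ceilings.
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# Per-user process/thread limit in the current shell.
ulimit -u

# pids cgroup limit for the kubepods hierarchy (path is an assumption).
cat /sys/fs/cgroup/pids/kubepods.slice/pids.max
~~~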

Comment 11 errata-xmlrpc 2020-01-23 11:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062