Bug 1612006

Summary: node daemon in SIGSEGV loop
Product: OpenShift Container Platform
Reporter: Borja Aranda <farandac>
Component: Storage
Assignee: Tomas Smetana <tsmetana>
Status: CLOSED ERRATA
QA Contact: Liang Xia <lxia>
Severity: medium
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, aos-storage-staff, bbennett, bchilds, lxia, mmariyan, tsmetana, wehe
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2018-12-13 19:27:05 UTC

Description Borja Aranda 2018-08-03 08:47:18 UTC
Description of problem:

- The customer updated the internal master DNS; after the change was propagated, they restarted atomic-openshift-node on the nodes in the environment one by one.
- The affected node went into a SIGSEGV crash loop after that change.
- The service appears to have stabilized after the node was evacuated: it seems it was crashing for each pod that was removed during `oc adm drain`, but it has been up since then.
- A lot of coredumps were generated under /var/lib/origin.
- The other nodes don't show any malfunction.


========= stack =======================
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x43740cd]

### the panicking goroutine was created by volumemanager.(*volumeManager).Run

goroutine 447 [running]: 
panic(0x4d294e0, 0xf162930)
/usr/lib/golang/src/runtime/panic.go:540 +0x45e fp=0xc424609b48 sp=0xc424609aa0 pc=0x42e76e
runtime.panicmem()
/usr/lib/golang/src/runtime/panic.go:63 +0x5e fp=0xc424609b68 sp=0xc424609b48 pc=0x42d49e
runtime.sigpanic()
/usr/lib/golang/src/runtime/signal_unix.go:367 +0x17c fp=0xc424609bb8 sp=0xc424609b68 pc=0x445d5c
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.(*reconciler).syncStates(0xc4223e6180, 0xc4225aa030, 0x2c)

[...]

created by github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager.(*volumeManager).Run
/builddir/build/BUILD/atomic-openshift-git-0.6bc473e/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/volume_manager.go:249 +0x153
=======================================



===== analysis of a coredump =====
(dlv) thread 1076
Switched from 1369 to 1076

(dlv) bt
0  0x0000000000461354 in runtime.memmove at /usr/lib/golang/src/runtime/memmove_amd64.s:277
1  0x0000000000445eeb in runtime.sigtrampgo at /usr/lib/golang/src/runtime/signal_unix.go:774
2  0x000000c424609aa0 in ??? at ?:-1
3  0x000000000042f2c2 in runtime.startpanic_m at /usr/lib/golang/src/runtime/panic.go:658
4  0x01007fce58325dc0 in ??? at ?:-1
5  0x000000000045c2bc in runtime.newdefer.func1 at /usr/lib/golang/src/runtime/panic.go:208

(dlv) up 0
> runtime.memmove() /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 0x461354)
Frame 0: /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 461354)
   272:		MOVOU	X9, 144(DI)
   273:		MOVOU	X10, 160(DI)
   274:		MOVOU	X11, 176(DI)
   275:		MOVOU	X12, 192(DI)
   276:		MOVOU	X13, 208(DI)
=> 277:		MOVOU	X14, 224(DI)
   278:		MOVOU	X15, 240(DI)
   279:		CMPQ	BX, $256
   280:		LEAQ	256(SI), SI
   281:		LEAQ	256(DI), DI
   282:		JGE	move_256through2048

(dlv) up 1
> runtime.memmove() /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 0x461354)
Frame 1: /usr/lib/golang/src/runtime/signal_unix.go:774 (PC: 445eeb)
   769:		}
   770:		stsp := uintptr(unsafe.Pointer(st.ss_sp))
   771:		g.m.gsignal.stack.lo = stsp
   772:		g.m.gsignal.stack.hi = stsp + st.ss_size
   773:		g.m.gsignal.stackguard0 = stsp + _StackGuard
=> 774:		g.m.gsignal.stackguard1 = stsp + _StackGuard
   775:	}
   776:	
   777:	// restoreGsignalStack restores the gsignal stack to the value it had
   778:	// before entering the signal handler.
   779:	//go:nosplit

(dlv) up 3
> runtime.memmove() /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 0x461354)
Frame 4: ?:-1 (PC: 1007fce58325dc0)
======================================

Version-Release number of selected component (if applicable):
atomic-openshift-3.9.25-1.git.0.6bc473e.el7.x86_64

How reproducible:
Not reproducible

Comment 4 Tomas Smetana 2018-08-13 11:13:54 UTC
There's an apparent bug in the reconciler's reconstructVolume method that might have caused this:
https://github.com/openshift/ose/blob/enterprise-3.9/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go#L487

It's already been fixed in 3.10; since it's a simple one-line change, I would consider backporting it to 3.9.
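
For illustration only (this is not the actual patch, and the type and helper names below are simplified stand-ins for the kubelet code), the class of problem and of fix looks roughly like this: a reconstruction helper can hand back a nil volume spec, dereferencing it unguarded produces exactly the nil pointer SIGSEGV in the stack above, and the fix is to bail out so the reconciler logs the error and cleans up the mount instead:

===== illustrative sketch (not the actual patch) =====
// Simplified sketch only; the real code lives in
// pkg/kubelet/volumemanager/reconciler/reconciler.go. The types and helpers
// here are invented for illustration.
package main

import (
	"errors"
	"fmt"
)

type volumeSpec struct{ Name string }

// constructVolumeSpec stands in for a volume plugin's ConstructVolumeSpec.
// For glusterfs it can fail with "impossible to reconstruct glusterfs volume
// spec from volume mountpath", or (the problematic case) yield a nil spec.
func constructVolumeSpec(mountPath string) (*volumeSpec, error) {
	return nil, errors.New("impossible to reconstruct glusterfs volume spec from volume mountpath")
}

// reconstructVolume mirrors the shape of the reconciler's reconstruction path.
func reconstructVolume(mountPath string) (string, error) {
	spec, err := constructVolumeSpec(mountPath)
	if err != nil {
		return "", err
	}
	// Guard of the kind the fix adds: without it, spec.Name on a nil spec
	// panics with "invalid memory address or nil pointer dereference".
	if spec == nil {
		return "", fmt.Errorf("nil volume spec reconstructed from %q", mountPath)
	}
	return spec.Name, nil
}

func main() {
	// Hypothetical glusterfs mount path under the kubelet pods directory.
	if _, err := reconstructVolume("/var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~glusterfs/<pvc>"); err != nil {
		// The reconciler logs this ("Could not construct volume information")
		// and moves on instead of taking down the whole node daemon.
		fmt.Println("Could not construct volume information:", err)
	}
}
=======================================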

Comment 5 Tomas Smetana 2018-08-13 11:33:05 UTC
https://github.com/openshift/ose/pull/1387

Comment 6 Tomas Smetana 2018-08-24 14:32:27 UTC
The process has changed: we should backport to Origin:
https://github.com/openshift/origin/pull/20707

Comment 14 Liang Xia 2018-11-29 04:57:54 UTC
Tried with the OCP version below:
$ oc version
oc v3.9.55
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-lxia-master-etcd-1:8443
openshift v3.9.55
kubernetes v1.9.1+a0ce1bc657


The steps used to verify:
* Set up an OCP cluster with 2 nodes.
* Prepare some pods using gluster volumes and make sure they are running.
* Drain one of the nodes.
* Restart the node service on the drained node.
* Wait some time (20 seconds in my case), then mark the node schedulable again.
* Check the nodes/pods.

Went through the drain node / restart node service / mark node schedulable / check nodes and pods cycle 30+ times and did not see any broken nodes or pods (a rough sketch of automating this cycle is below).
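
For reference, a rough sketch of automating that cycle (these are not the exact commands used for verification; the node name, ssh access to the node, and an oc client logged in as cluster admin are assumptions):

===== hypothetical automation sketch =====
// Hypothetical automation of the drain / restart / uncordon / check cycle.
// Node name, ssh access, and an admin-logged-in oc client are assumptions,
// not part of the original verification run.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// run executes a command and aborts on the first failure.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
	fmt.Printf("%s", out)
}

func main() {
	const node = "qe-node-1" // assumed node name

	for i := 0; i < 30; i++ {
		// Drain the node, evicting the gluster-backed pods.
		run("oc", "adm", "drain", node, "--ignore-daemonsets", "--delete-local-data")

		// Restart the node service on the drained node.
		run("ssh", node, "systemctl", "restart", "atomic-openshift-node")

		// Give the node service time to settle, then mark the node schedulable again.
		time.Sleep(20 * time.Second)
		run("oc", "adm", "uncordon", node)

		// A node stuck in a SIGSEGV loop would show up here as NotReady,
		// or with pods failing to come back to Running.
		run("oc", "get", "nodes")
		run("oc", "get", "pods", "--all-namespaces", "-o", "wide")
	}
}
=======================================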

Moving bug to verified.

Comment 16 errata-xmlrpc 2018-12-13 19:27:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748