Description of problem:

- Customer updated the internal master DNS; after the change was propagated, they restarted atomic-openshift-node in the environment one node at a time.
- This node is in a SIGSEGV crash loop after that change.
- The service appears to have stabilized after the node was evacuated. It appears that it was crashing for each pod that was removed during `oc adm drain`, but it has been up since then.
- There are a lot of coredumps generated under /var/lib/origin.
- The other nodes don't show any malfunction.

========= stack =======================
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath
reconciler.go:376] Could not construct volume information: impossible to reconstruct glusterfs volume spec from volume mountpath

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x43740cd]

### goroutine raising panic leads to volumemanager.(*volumeManager).Run
goroutine 447 [running]:
panic(0x4d294e0, 0xf162930)
	/usr/lib/golang/src/runtime/panic.go:540 +0x45e fp=0xc424609b48 sp=0xc424609aa0 pc=0x42e76e
runtime.panicmem()
	/usr/lib/golang/src/runtime/panic.go:63 +0x5e fp=0xc424609b68 sp=0xc424609b48 pc=0x42d49e
runtime.sigpanic()
	/usr/lib/golang/src/runtime/signal_unix.go:367 +0x17c fp=0xc424609bb8 sp=0xc424609b68 pc=0x445d5c
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.(*reconciler).syncStates(0xc4223e6180, 0xc4225aa030, 0x2c)
[...]
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager.(*volumeManager).Run
	/builddir/build/BUILD/atomic-openshift-git-0.6bc473e/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/volume_manager.go:249 +0x153
=======================================

===== analysis of a coredump =====
(dlv) thread 1076
Switched from 1369 to 1076
(dlv) bt
0  0x0000000000461354 in runtime.memmove
   at /usr/lib/golang/src/runtime/memmove_amd64.s:277
1  0x0000000000445eeb in runtime.sigtrampgo
   at /usr/lib/golang/src/runtime/signal_unix.go:774
2  0x000000c424609aa0 in ???
   at ?:-1
3  0x000000000042f2c2 in runtime.startpanic_m
   at /usr/lib/golang/src/runtime/panic.go:658
4  0x01007fce58325dc0 in ???
   at ?:-1
5  0x000000000045c2bc in runtime.newdefer.func1
   at /usr/lib/golang/src/runtime/panic.go:208
(dlv) up 0
> runtime.memmove() /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 0x461354)
Frame 0: /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 461354)
   272:	MOVOU	X9, 144(DI)
   273:	MOVOU	X10, 160(DI)
   274:	MOVOU	X11, 176(DI)
   275:	MOVOU	X12, 192(DI)
   276:	MOVOU	X13, 208(DI)
=> 277:	MOVOU	X14, 224(DI)
   278:	MOVOU	X15, 240(DI)
   279:	CMPQ	BX, $256
   280:	LEAQ	256(SI), SI
   281:	LEAQ	256(DI), DI
   282:	JGE	move_256through2048
(dlv) up 1
> runtime.memmove() /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 0x461354)
Frame 1: /usr/lib/golang/src/runtime/signal_unix.go:774 (PC: 445eeb)
   769:	}
   770:	stsp := uintptr(unsafe.Pointer(st.ss_sp))
   771:	g.m.gsignal.stack.lo = stsp
   772:	g.m.gsignal.stack.hi = stsp + st.ss_size
   773:	g.m.gsignal.stackguard0 = stsp + _StackGuard
=> 774:	g.m.gsignal.stackguard1 = stsp + _StackGuard
   775:	}
   776:
   777:	// restoreGsignalStack restores the gsignal stack to the value it had
   778:	// before entering the signal handler.
   779:	//go:nosplit
(dlv) up 3
> runtime.memmove() /usr/lib/golang/src/runtime/memmove_amd64.s:277 (PC: 0x461354)
Frame 4: ?:-1 (PC: 1007fce58325dc0)
======================================

Version-Release number of selected component (if applicable):
atomic-openshift-3.9.25-1.git.0.6bc473e.el7.x86_64

How reproducible:
Not reproducible
There's an apparent bug in the reconciler's reconstructVolume method that might have caused this:
https://github.com/openshift/ose/blob/enterprise-3.9/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go#L487

It has already been fixed in 3.10; since it's a simple one-line change, I would consider backporting it to 3.9.
https://github.com/openshift/ose/pull/1387
The process has changed: the backport should go to Origin instead: https://github.com/openshift/origin/pull/20707
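For context, here is a minimal, self-contained sketch of the failure pattern and the kind of one-line guard such a fix adds. This is not the actual patch (see the PRs above); all identifiers are hypothetical stand-ins for the kubelet reconciler code, under the assumption that the panic comes from using a volume spec that the glusterfs plugin failed to reconstruct from the mount path.

// Illustrative sketch only -- not the actual patch (see the PRs above).
// All names here are hypothetical stand-ins for the kubelet reconciler code.
package main

import (
	"errors"
	"fmt"
)

type volumeSpec struct {
	name string
}

// constructVolumeSpec stands in for a volume plugin that cannot rebuild a
// spec from the mount path alone, matching the repeated
// "impossible to reconstruct glusterfs volume spec from volume mountpath"
// messages in the node log. It returns a nil spec.
func constructVolumeSpec(mountPath string) (*volumeSpec, error) {
	return nil, errors.New("impossible to reconstruct glusterfs volume spec from volume mountpath")
}

// reconstructVolume mirrors the shape of the reconciler's reconstruction step.
func reconstructVolume(mountPath string) (string, error) {
	spec, err := constructVolumeSpec(mountPath)
	// The guard below is the kind of one-line check such a fix adds: without
	// it, a nil spec can slip through and a later dereference (spec.name)
	// panics with "invalid memory address or nil pointer dereference", as
	// seen in syncStates on the affected node.
	if err != nil || spec == nil {
		return "", fmt.Errorf("could not construct volume information: %v", err)
	}
	return spec.name, nil
}

func main() {
	// Hypothetical mount path, for illustration only.
	if _, err := reconstructVolume("/var/lib/origin/some-volume-mountpath"); err != nil {
		// With the guard in place, the reconciler logs the error and skips
		// the volume instead of crashing the whole node process.
		fmt.Println(err)
	}
}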
Tried with the OCP version below:

$ oc version
oc v3.9.55
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-lxia-master-etcd-1:8443
openshift v3.9.55
kubernetes v1.9.1+a0ce1bc657

The steps used to verify:
* Set up an OCP cluster with 2 nodes.
* Prepare some pods using gluster volumes and make sure they are running.
* Drain one of the nodes.
* Restart the node service on the drained node.
* Wait some time (20 seconds in my case), then mark the node schedulable again.
* Check the nodes/pods.

Went through the drain node / restart node service / mark node schedulable / check nodes and pods cycle 30+ times and did not see any broken nodes/pods. Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748