Bug 1449297 - Core dump/panic running OpenShift concurrent builds and reliability tests with docker version docker-1.12.6-19 and higher
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Daniel Walsh
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard: aos-scalability-36
Depends On:
Blocks:
 
Reported: 2017-05-09 14:38 UTC by Vikas Laad
Modified: 2020-08-13 09:10 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-30 15:01:01 UTC
Target Upstream Version:
Embargoed:


Attachments:

Description Vikas Laad 2017-05-09 14:38:01 UTC
Description of problem:
Running build tests on OCP ends up with the following error:
May  8 13:44:36 ip-172-31-41-8 dockerd-current: time="2017-05-08T13:44:36.526657651-04:00" level=error msg="containerd: deleting container" error="signal: aborted (core dumped): \"panic: runtime error: invalid memory address or nil pointer dereference [recovered]\\n\\tpanic: runtime error: invalid memory address or nil pointer dereference\\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x61224c]\\n\\ngoroutine 1 [running]:\\npanic(0x6ec1c0, 0xc420016050)\\n\\t/usr/lib/golang/src/runtime/panic.go:500 +0x1a1 fp=0xc42007ebf0 sp=0xc42007eb60\\ngithub.com/urfave/cli.HandleAction.func1(0xc42007f748)\\n\\t/builddir/build/BUILD/docker-92b10e401221d31096291aa3fb7b5c413eeb5539/runc-7f4b00e035c25f3a8d0dabc1658fe831a3bb6d13/Godeps/_workspace/src/github.com/urfave/cli/app.go:478 +0x247 fp=0xc42007ec90 sp=0xc42007ebf0\\nruntime.call32(0x0, 0x768d28, 0xc42000c130, 0x800000008)\\n\\t/usr/lib/golang/src/runtime/asm_amd64.s:479 +0x4c fp=0xc42007ecc0 sp=0xc42007ec90\\npanic(0x6ec1c0, 0xc420016050)\\n\\t/usr/lib/golang/src/runtime/panic.go:458 +0x243 fp=0xc42007ed50 sp=0xc42007ecc0\\nruntime.panicmem()\\n\\t/usr/lib/golang/src/runtime/panic.go:62 +0x6d fp=0xc42007ed80 sp=0xc42007ed50\\nruntime.sigpanic()\\n\\t/usr/lib/golang/src/runtime/sigpanic_unix.go:24 +0x214 fp=0xc42007edd8 sp=0xc42007ed80\\ngithub.com/coreos/go-systemd/dbus.(*Conn).startJob(0x0, 0x0, 0x749062, 0x29, 0xc4201305a0, 0x2, 0x2, 0x0, 0x0, 0x0)\\n\\t/builddir/build/BUILD/docker-92b10e401221d31096291aa3fb7b5c413eeb5539/runc-7f4b00e035c25f3a8d0dabc1658fe831a3bb6d13/Godeps/_workspace/src/github.com/coreos/go-systemd/dbus/methods.go:47 +0xcc fp=0xc42007ee60 sp=0xc42007edd8\\ngithub.com/coreos/go-systemd/dbus.(*Conn).StopUnit(0x0, 0xc420122dc0, 0x4d, 0x73c478, 0x7, 0x0, 0x73bc9b, 0x6, 0x3)\\n\\t/builddir/build/BUILD/docker-92b10e401221d31096291aa3fb7b5c413eeb5539/runc-7f4b00e035c25f3a8d0dabc1658fe831a3bb6d13/Godeps/_workspace/src/github.com/coreos/go-systemd/dbus/methods.go:99 +0x14b fp=0xc42007eee8 sp=0xc42007ee60\\ngithub.com/opencontainers/runc/libcontainer/cgroups/systemd.(*Manager).Destroy(0xc42013
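The frame "github.com/coreos/go-systemd/dbus.(*Conn).startJob(0x0, ...)" in the trace above shows the method being invoked with a nil receiver (the first value, 0x0, is the *Conn). As a minimal illustration of that failure shape only (hypothetical type and method names, not the actual go-systemd code), calling a method through a nil pointer panics as soon as the method touches its receiver:

package main

// conn is a hypothetical stand-in for a D-Bus connection type.
type conn struct {
	jobs map[string]chan string
}

// stopUnit touches its receiver's fields, so calling it through a nil
// pointer triggers the same "invalid memory address or nil pointer
// dereference" runtime panic seen in the trace.
func (c *conn) stopUnit(name string) {
	c.jobs[name] = make(chan string) // panics here when c is nil
}

func main() {
	var c *conn // never initialized, e.g. because connecting to systemd failed
	c.stopUnit("some-unit.scope")
}

Running this sketch prints the same "panic: runtime error: invalid memory address or nil pointer dereference" message; the "signal: aborted (core dumped)" wrapper and the core.* files suggest docker-runc runs with GOTRACEBACK=crash (or equivalent), which turns such panics into aborts that leave core dumps.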

Almost 3000 core.* files are present in /, and they filled up the disk space.
root@ip-172-31-42-232: / # file core.15979
core.15979: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/libexec/docker/docker-runc-current --systemd-cgroup=true delete ff4472c8f8', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/libexec/docker/docker-runc-current', platform: 'x86_64'

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.3 (Maipo)

root@ip-172-31-42-232: / # docker version                                                                                     
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-19.git92b10e4.el7.x86_64
 Go version:      go1.7.4
 Git commit:      92b10e4/1.12.6
 Built:           Tue May  2 15:06:29 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-19.git92b10e4.el7.x86_64
 Go version:      go1.7.4
 Git commit:      92b10e4/1.12.6
 Built:           Tue May  2 15:06:29 2017
 OS/Arch:         linux/amd64


root@ip-172-31-42-232: / # openshift version
openshift v3.6.65
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Steps to Reproduce:
1. Install OCP v3.6.65 with docker-1.12.6-19
2. Run a large number of concurrent builds

Actual results:
Builds get stuck in the Pending state because the nodes run out of disk space

Expected results:
Builds should run fine

Additional info:
Will attach /var/log/messages

Comment 3 Antonio Murdaca 2017-05-09 14:47:47 UTC
Seems related to the systemd cgroup manager in runc?
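That hypothesis fits the trace: the Manager.Destroy frame from runc's libcontainer/cgroups/systemd package calls (*Conn).StopUnit in the vendored github.com/coreos/go-systemd/dbus, and the receiver is nil. Below is a minimal sketch of that call shape, assuming a package-level connection variable that can be left nil; the names theConn, Manager and UnitName are stand-ins rather than runc's actual source, and only (*Conn).StopUnit is the real go-systemd API:

package systemd

import systemdDbus "github.com/coreos/go-systemd/dbus"

// theConn would normally be established once when systemd cgroup support
// is detected; the nil receiver in the trace suggests a path where it is
// never set (or is reset) before Destroy runs.
var theConn *systemdDbus.Conn

// Manager is a simplified stand-in for the systemd cgroup manager.
type Manager struct {
	UnitName string // the container's scope unit, e.g. "<id>.scope"
}

func (m *Manager) Destroy() error {
	// With theConn == nil this call panics inside the dbus package
	// instead of returning an error, killing docker-runc.
	_, err := theConn.StopUnit(m.UnitName, "replace", nil)
	return err
}

Any code path that reaches Destroy before the D-Bus connection is established would crash docker-runc this way, which matches "docker-runc-current --systemd-cgroup=true delete ..." being the command recorded in the core files above.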

Comment 4 Vikas Laad 2017-05-09 19:06:47 UTC
I ran the same test in a cluster where two worker nodes are running different docker versions, docker-1.12.6-17 and docker-1.12.6-18.

On the docker-1.12.6-18 node I can see core files:

root@ip-172-31-32-176: / # file core.38868
core.38868: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-r', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/dockerd-current', platform: 'x86_64'
root@ip-172-31-32-176: / # file core.97530
core.97530: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-r', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/dockerd-current', platform: 'x86_64'

docker-1.12.6-17 looks clean so far. I am also attaching logs (/var/log/messages) from the docker-1.12.6-18 node.

Comment 6 Mike Fiedler 2017-05-31 15:07:02 UTC
Seeing thousands of core dumps with this issue.  Marking this bz urgent.

Comment 7 Mike Fiedler 2017-05-31 15:07:43 UTC
OpenShift 3.6.79 and docker-1.12.6-25

Comment 8 Mrunal Patel 2017-05-31 15:10:20 UTC
Fixed here https://github.com/projectatomic/runc/pull/7
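The linked pull request is not reproduced here. Purely as an illustration of the kind of guard that avoids this class of crash, and continuing the hypothetical theConn/Manager/UnitName names from the sketch under comment 3, a nil check turns the panic into an ordinary error:

// Illustrative only -- not the content of projectatomic/runc#7.
func (m *Manager) Destroy() error {
	if theConn == nil {
		// Fail cleanly instead of dereferencing a nil connection
		// (requires importing "errors").
		return errors.New("systemd D-Bus connection is not available")
	}
	_, err := theConn.StopUnit(m.UnitName, "replace", nil)
	return err
}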

Comment 9 Daniel Walsh 2017-05-31 16:13:57 UTC
Lokesh, I guess we need a new docker-runc package based on this pull request?

Comment 10 Daniel Walsh 2017-06-30 15:01:01 UTC
I believe this is fixed in the latest release.

