Bug 1472121 - Running fluentd prestart hook 2 caused error: " signal: segmentation fault"
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.4
Assigned To: Lokesh Mandvekar
QA Contact: atomic-bugs@redhat.com
Keywords: Extras, Regression
Depends On:
Blocks: 1473328
Reported: 2017-07-18 02:31 EDT by Xia Zhao
Modified: 2017-08-01 20:13 EDT
CC List: 18 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1473328 1473333
Environment:
Last Closed: 2017-08-01 20:13:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
inventory file used for logging deployment (714 bytes, text/plain)
2017-07-19 06:29 EDT, Xia Zhao

Description Xia Zhao 2017-07-18 02:31:45 EDT
Description of problem:
Deployed the logging 3.5.0 stack to OCP 3.5; fluentd is in CrashLoopBackOff:
# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-bpgcj       1/1       Running            0          22m
logging-es-l3izmyfo-1-zcr8k   1/1       Running            0          22m
logging-fluentd-dsz2c         0/1       CrashLoopBackOff   9          22m
logging-fluentd-s69kj         0/1       CrashLoopBackOff   9          22m
logging-kibana-1-xj5mk        2/2       Running            0          22m

# oc logs logging-fluentd-dsz2c
container_linux.go:247: starting container process caused "process_linux.go:334: running prestart hook 2 caused \"error running hook: signal: segmentation fault (core dumped), stdout: , stderr: \""

Version-Release number of selected component (if applicable):
logging-fluentd         3.5.0-19            9f606b56b5d2        10 hours ago        232.8 MB

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.5.0 stack to OCP 3.5

Actual results:
Fluentd is in CrashLoopBackOff status

Expected results:
Fluentd should be in running status

Additional info:
Comment 1 Xia Zhao 2017-07-18 02:33:21 EDT
ansible version:
# rpm -qa | grep ansible
openshift-ansible-docs-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-callback-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
ansible-2.2.3.0-1.el7.noarch
openshift-ansible-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-lookup-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-roles-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-filter-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-playbooks-3.5.99-1.git.0.fb5babb.el7.noarch

# openshift version
openshift v3.5.5.31.3
kubernetes v1.5.2+43a9be4
etcd 3.1.0
Comment 4 Jeff Cantrill 2017-07-18 11:35:00 EDT
Do you have an environment we can poke around in?
Comment 9 Xia Zhao 2017-07-19 06:29 EDT
Created attachment 1300959 [details]
inventory file used for logging deployment
Comment 10 CAI Qian 2017-07-19 11:10:16 EDT
From what I can see, there are usually only two prestart hooks in your case:

"libnetwork-setkey" and "oci-register-machine".

You can disable the latter by setting "disabled : true" for /usr/libexec/oci/hooks.d/oci-register-machine
to see which one is the culprit.

Also, you can double-check whether there are more prestart hooks in your case by looking
at

# cat /run/docker/libcontainerd/<container ID>/config.json | python -mjson.tool
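
For example, to print just the prestart hook paths (only a sketch, assuming the generated config follows the usual OCI layout with a hooks.prestart array of entries carrying a "path"; substitute the real container ID):

# python -c 'import json,sys; print("\n".join(h["path"] for h in json.load(sys.stdin)["hooks"]["prestart"]))' < /run/docker/libcontainerd/<container ID>/config.json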
Comment 11 Rich Megginson 2017-07-19 11:19:56 EDT
(In reply to CAI Qian from comment #10)
> From what I can see, there are usually only two prestart hooks in your case:
> 
> "libnetwork-setkey" and "oci-register-machine".
> 
> You can disable the latter by setting "disabled : true" for
> /usr/libexec/oci/hooks.d/oci-register-machine
> to see which one is the culprit.

How?  I am not familiar with oci-register-machine, or any hooks for that matter.

> 
> Also, you can double-check whether there are more prestart hooks in your case by
> looking
> at
> 
> # cat /run/docker/libcontainerd/<container ID>/config.json | python
> -mjson.tool

There are 4: libnetwork-setkey, /usr/libexec/oci/hooks.d/oci-register-machine, /usr/libexec/oci/hooks.d/oci-systemd-hook, /usr/libexec/oci/hooks.d/oci-umount
Comment 12 CAI Qian 2017-07-19 11:32:00 EDT
Setting "disabled : true" there before running the container will do.

For oci-systemd-hook, you will need to add an env variable like this:

docker run --env oci-systemd-hook=disabled -it --rm  fedora /bin/bash

I am not sure if your oci-systemd-hook version has this capability yet. If not, you can just replace /usr/libexec/oci/hooks.d/oci-systemd-hook with a dummy binary that does nothing, or copy /usr/bin/echo over it, before running the container.

For oci-umount, just remove all entries in /etc/oci-umount.conf before running the container.
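
For example, one way to apply the oci-umount.conf part of this (just a sketch; keep a backup so the entries can be restored after the test):

# cp /etc/oci-umount.conf /etc/oci-umount.conf.bak
# > /etc/oci-umount.conf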
Comment 13 CAI Qian 2017-07-19 11:58:31 EDT
I very much suspect that oci-umount is the culprit since it is new. Your version of oci-umount might be missing the commit to bail out early with an empty oci-umount.conf. Your best bet would be to just run,

# cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount

before running the test.
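
Before overwriting the hook, it is worth keeping a copy so the real binary can be put back after the test (just a sketch; the backup path is arbitrary):

# cp /usr/libexec/oci/hooks.d/oci-umount /root/oci-umount.orig
# cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount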
Comment 14 Rich Megginson 2017-07-19 12:51:26 EDT
(In reply to CAI Qian from comment #13)
> I very much suspect that oci-umount is the culprit since it is new. Your
> version of oci-umount might be missing the commit to bail out early with an empty
> oci-umount.conf. Your best bet would be to just run,
> 
> # cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount
> 
> before running the test.

Thanks!  The problem is definitely oci-umount - without it everything works fine - when I add it back I get the crash.
Comment 15 CAI Qian 2017-07-19 13:55:45 EDT
Adding Vivek for the oci-umount issue then.
Comment 16 Vivek Goyal 2017-07-19 14:41:38 EDT
So what command should I run to reproduce the issue? I simply ran "docker run -ti fedora bash" and that seems to work just fine on this node.
Comment 17 Rich Megginson 2017-07-19 15:42:03 EDT
@vivek - note that this _only_ happens with the fluentd pod, which is doing this:

    umount /var/lib/docker/containers/*/shm || :

But afaict, the pod never gets to this point - that is, if I enable set -x in the fluentd run.sh script, I see nothing in oc logs.  I suppose it is possible that the umount is causing the segfault which then kills the buffered log messages.

I see you are already on the system

ssh -i libra.pem root@host-8-174-54.host.centralci.eng.rdu2.redhat.com
Comment 18 Vivek Goyal 2017-07-19 17:04:32 EDT
@Rich, yes I am playing on this system, trying to figure out what's going on. I am thinking that the segfault is happening in oci-umount itself. If that's the case, then run.sh inside the container would never get a chance to run at all?

Also, there seem to be two fluentd containers on the system. One of them sometimes starts successfully and the other does not. I have not been able to nail down yet why one of them works sometimes.

As of now:

oc logs logging-fluentd-9zzgq is failing

while

oc logs logging-fluentd-zc61t is successful.

BTW, I have also taken the upstream oci-umount, compiled it, and placed it on the system in an attempt to narrow things down a bit.
Comment 19 Vivek Goyal 2017-07-19 17:22:52 EDT
oc logs logging-fluentd-9zzgq fails even if I remove the oci-umount hook. So something is not right. That seems to suggest that there might be two errors.
Comment 20 Rich Megginson 2017-07-19 17:23:28 EDT
(In reply to Vivek Goyal from comment #19)
> oc logs logging-fluentd-9zzgq fails even if I remove oci-umount hook. So
> something is not right. That seems to suggest that there might be two errors.

logging-fluentd-9zzgq is running on another node
Comment 21 Vivek Goyal 2017-07-19 18:12:14 EDT
Ok, I think I found the bug in the oci-umount plugin. Basically, we can end up doing "strlen(NULL)", and that will crash the plugin.

And this can happen when the destination and source are the same as specified in /etc/oci-umount.conf. For example, try running a container as follows.

docker run -ti -v /var/lib/docker/devicemapper:/var/lib/docker/devicemapper fedora bash

And the crash is reproduced. I will fix it soon.
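
To confirm the overlap on an affected node, check whether the path used in the repro above also appears in the umount list (just a sketch; the exact contents of /etc/oci-umount.conf vary by package version):

# grep devicemapper /etc/oci-umount.conf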
Comment 22 Vivek Goyal 2017-07-19 18:20:26 EDT
Created a PR. I think this should fix it.

https://github.com/projectatomic/oci-umount/pull/13
Comment 23 Vivek Goyal 2017-07-19 18:23:19 EDT
I have compiled a new oci-umount (with the fix) on this system, so this node is now running oci-umount with the fix. Please test and make sure it is working.
Comment 28 Xia Zhao 2017-07-24 03:39:19 EDT
Test passed on docker-1.12.6-48; fluentd is running and the logging system works fine:

# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-lbnzk       1/1       Running   0          13m
logging-es-s5sp96at-1-v3lpb   1/1       Running   0          13m
logging-fluentd-c9dqq         1/1       Running   0          13m
logging-fluentd-l3bn0         1/1       Running   0          13m
logging-kibana-1-p63qs        2/2       Running   0          13m

# openshift version
openshift v3.5.5.31.3
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

# rpm -q openvswitch
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64
Comment 29 Luwen Su 2017-07-24 06:42:59 EDT
Per comment #28, moving to VERIFIED.
Comment 31 errata-xmlrpc 2017-08-01 20:13:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2344
