Bug 1472121 - Running fluentd prestart hook 2 caused error: " signal: segmentation fault"
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.4
Assigned To: Lokesh Mandvekar
QA Contact: atomic-bugs@redhat.com
Keywords: Extras, Regression
Depends On:
Blocks: 1473328
Reported: 2017-07-18 02:31 EDT by Xia Zhao
Modified: 2017-08-01 20:13 EDT
CC List: 18 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1473328 1473333
Environment:
Last Closed: 2017-08-01 20:13:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
inventory file used for logging deployment (714 bytes, text/plain)
2017-07-19 06:29 EDT, Xia Zhao

Description Xia Zhao 2017-07-18 02:31:45 EDT
Description of problem:
Deployed the logging 3.5.0 stack to OCP 3.5; fluentd is in CrashLoopBackOff:
# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-bpgcj       1/1       Running            0          22m
logging-es-l3izmyfo-1-zcr8k   1/1       Running            0          22m
logging-fluentd-dsz2c         0/1       CrashLoopBackOff   9          22m
logging-fluentd-s69kj         0/1       CrashLoopBackOff   9          22m
logging-kibana-1-xj5mk        2/2       Running            0          22m

# oc logs logging-fluentd-dsz2c
container_linux.go:247: starting container process caused "process_linux.go:334: running prestart hook 2 caused \"error running hook: signal: segmentation fault (core dumped), stdout: , stderr: \""

Version-Release number of selected component (if applicable):
logging-fluentd         3.5.0-19            9f606b56b5d2        10 hours ago        232.8 MB

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.5.0 stack to OCP 3.5

Actual results:
Fluentd is in CrashLoopBackOff status

Expected results:
Fluentd should be in running status

Additional info:
Comment 1 Xia Zhao 2017-07-18 02:33:21 EDT
ansible version:
# rpm -qa | grep ansible
openshift-ansible-docs-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-callback-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
ansible-2.2.3.0-1.el7.noarch
openshift-ansible-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-lookup-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-roles-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-filter-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-playbooks-3.5.99-1.git.0.fb5babb.el7.noarch

# openshift version
openshift v3.5.5.31.3
kubernetes v1.5.2+43a9be4
etcd 3.1.0
Comment 4 Jeff Cantrill 2017-07-18 11:35:00 EDT
Do you have an environment we can poke around in?
Comment 9 Xia Zhao 2017-07-19 06:29 EDT
Created attachment 1300959 [details]
inventory file used for logging deployment
Comment 10 CAI Qian 2017-07-19 11:10:16 EDT
From what I can see, there are usually only two prestart hooks in your case:

"libnetwork-setkey" and "oci-register-machine".

You can disable the latter by setting "disabled : true" for /usr/libexec/oci/hooks.d/oci-register-machine
to see which one is the culprit.

Also, you can double-check whether there are more prestart hooks in your case by looking
at

# cat /run/docker/libcontainerd/<container ID>/config.json | python -mjson.tool
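
For example, to print just the prestart hook paths (only a sketch, assuming the generated config follows the usual OCI layout with a hooks.prestart array of entries carrying a "path"; substitute the real container ID):

# python -c 'import json,sys; print("\n".join(h["path"] for h in json.load(sys.stdin)["hooks"]["prestart"]))' < /run/docker/libcontainerd/<container ID>/config.json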
Comment 11 Rich Megginson 2017-07-19 11:19:56 EDT
(In reply to CAI Qian from comment #10)
> From what I can see, there are usually only two prestart hooks in your case:
> 
> "libnetwork-setkey" and "oci-register-machine".
> 
> You can disable the latter by setting "disabled : true" for
> /usr/libexec/oci/hooks.d/oci-register-machine
> to see which one is the culprit.

How?  I am not familiar with oci-register-machine, or any hooks for that matter.

> 
> Also, you can double-check whether there are more prestart hooks in your case by
> looking
> at
> 
> # cat /run/docker/libcontainerd/<container ID>/config.json | python
> -mjson.tool

There are 4: libnetwork-setkey, /usr/libexec/oci/hooks.d/oci-register-machine, /usr/libexec/oci/hooks.d/oci-systemd-hook, /usr/libexec/oci/hooks.d/oci-umount
Comment 12 CAI Qian 2017-07-19 11:32:00 EDT
Setting "disabled : true" there before running the container will do.

For oci-systemd-hook, you will need to add an env variable like this:

docker run --env oci-systemd-hook=disabled -it --rm  fedora /bin/bash

I am not sure if your oci-systemd-hook version has this capability yet. If not, you can just replace /usr/libexec/oci/hooks.d/oci-systemd-hook with a dummy binary that does nothing, or copy /usr/bin/echo over it, before running the container.

For oci-umount, just remove all entries in /etc/oci-umount.conf before running the container.
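
For example, one way to apply the oci-umount.conf part of this (just a sketch; keep a backup so the entries can be restored after the test):

# cp /etc/oci-umount.conf /etc/oci-umount.conf.bak
# > /etc/oci-umount.conf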
Comment 13 CAI Qian 2017-07-19 11:58:31 EDT
I very much suspect that oci-umount is the culprit since it is new. Your version of oci-umount might be missing the commit to bail out early with an empty oci-umount.conf. Your best bet would be to just run,

# cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount

before running the test.
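
Before overwriting the hook, it is worth keeping a copy so the real binary can be put back after the test (just a sketch; the backup path is arbitrary):

# cp /usr/libexec/oci/hooks.d/oci-umount /root/oci-umount.orig
# cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount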
Comment 14 Rich Megginson 2017-07-19 12:51:26 EDT
(In reply to CAI Qian from comment #13)
> I very much suspect that oci-umount is the culprit since it is new. Your
> version of oci-umount might be missing the commit to bail out early with an empty
> oci-umount.conf. Your best bet would be to just run,
> 
> # cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount
> 
> before running the test.

Thanks!  The problem is definitely oci-umount - without it everything works fine - when I add it back I get the crash.
Comment 15 CAI Qian 2017-07-19 13:55:45 EDT
Adding Vivek for the oci-umount issue then.
Comment 16 Vivek Goyal 2017-07-19 14:41:38 EDT
So what command should I run to reproduce the issue? I simply ran "docker run -ti fedora bash" and that seems to work just fine on this node.
Comment 17 Rich Megginson 2017-07-19 15:42:03 EDT
@vivek - note that this _only_ happens with the fluentd pod, which is doing this:

    umount /var/lib/docker/containers/*/shm || :

But afaict, the pod never gets to this point - that is, if I enable set -x in the fluentd run.sh script, I see nothing in oc logs.  I suppose it is possible that the umount is causing the segfault which then kills the buffered log messages.

I see you are already on the system

ssh -i libra.pem root@host-8-174-54.host.centralci.eng.rdu2.redhat.com
Comment 18 Vivek Goyal 2017-07-19 17:04:32 EDT
@Rich, yes I am playing on this system, trying to figure out what's going on. I am thinking that the segfault is happening in oci-umount itself. If that's the case, then run.sh inside the container would never get a chance to run at all?

Also, there seem to be two fluentd containers on the system. One of them sometimes starts successfully and the other does not. I have not been able to nail down yet why one of them works sometimes.

As of now:

oc logs logging-fluentd-9zzgq is failing

while

oc logs logging-fluentd-zc61t is successful.

BTW, I have also taken the upstream oci-umount, compiled it, and placed it on the system in an attempt to narrow things down a bit.
Comment 19 Vivek Goyal 2017-07-19 17:22:52 EDT
oc logs logging-fluentd-9zzgq fails even if I remove the oci-umount hook. So something is not right. That seems to suggest that there might be two errors.
Comment 20 Rich Megginson 2017-07-19 17:23:28 EDT
(In reply to Vivek Goyal from comment #19)
> oc logs logging-fluentd-9zzgq fails even if I remove oci-umount hook. So
> something is not right. That seems to suggest that there might be two errors.

logging-fluentd-9zzgq is running on another node
Comment 21 Vivek Goyal 2017-07-19 18:12:14 EDT
Ok, I think I found the bug in the oci-umount plugin. Basically, we can end up doing "strlen(NULL)", and that will crash the plugin.

And this can happen when the destination and source are the same as specified in /etc/oci-umount.conf. For example, try running a container as follows.

docker run -ti -v /var/lib/docker/devicemapper:/var/lib/docker/devicemapper fedora bash

And the crash is reproduced. I will fix it soon.
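
To confirm the overlap on an affected node, check whether the path used in the repro above also appears in the umount list (just a sketch; the exact contents of /etc/oci-umount.conf vary by package version):

# grep devicemapper /etc/oci-umount.conf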
Comment 22 Vivek Goyal 2017-07-19 18:20:26 EDT
Created a PR. I think this should fix it.

https://github.com/projectatomic/oci-umount/pull/13
Comment 23 Vivek Goyal 2017-07-19 18:23:19 EDT
I have compiled a new oci-umount (with the fix) on this system, so this node is now running oci-umount with the fix. Please test and make sure it is working.
Comment 28 Xia Zhao 2017-07-24 03:39:19 EDT
Test passed on docker-1.12.6-48; fluentd is running and the logging system works fine:

# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-lbnzk       1/1       Running   0          13m
logging-es-s5sp96at-1-v3lpb   1/1       Running   0          13m
logging-fluentd-c9dqq         1/1       Running   0          13m
logging-fluentd-l3bn0         1/1       Running   0          13m
logging-kibana-1-p63qs        2/2       Running   0          13m

# openshift version
openshift v3.5.5.31.3
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

# rpm -q openvswitch
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64
Comment 29 Luwen Su 2017-07-24 06:42:59 EDT
Per comment #28, moving to VERIFIED.
Comment 31 errata-xmlrpc 2017-08-01 20:13:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2344
