Bug 1472121
| Field | Value |
|---|---|
| Summary | Running fluentd prestart hook 2 caused error: "signal: segmentation fault" |
| Product | Red Hat Enterprise Linux 7 |
| Component | docker |
| Version | 7.4 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Xia Zhao <xiazhao> |
| Assignee | Lokesh Mandvekar <lsm5> |
| QA Contact | atomic-bugs <atomic-bugs> |
| CC | amurdaca, aos-bugs, dwalsh, fcami, fkluknav, jokerman, lsm5, lsu, mmccomas, mpatel, nhosoi, pweil, qcai, rmeggins, santiago, vgoyal, wsun, xiazhao |
| Keywords | Extras, Regression |
| Target Milestone | rc |
| Target Release | 7.4 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Clones | 1473328, 1473333 (view as bug list) |
| Bug Blocks | 1473328 |
| Last Closed | 2017-08-02 00:13:50 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Attachments | inventory file used for logging deployment (attachment 1300959) |
Description
Xia Zhao, 2017-07-18 06:31:45 UTC

ansible version:

```
# rpm -qa | grep ansible
openshift-ansible-docs-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-callback-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
ansible-2.2.3.0-1.el7.noarch
openshift-ansible-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-lookup-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-roles-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-filter-plugins-3.5.99-1.git.0.fb5babb.el7.noarch
openshift-ansible-playbooks-3.5.99-1.git.0.fb5babb.el7.noarch

# openshift version
openshift v3.5.5.31.3
kubernetes v1.5.2+43a9be4
etcd 3.1.0
```

---

Do you have an environment we can poke around in?

---

Created attachment 1300959 [details]
inventory file used for logging deployment
---

From what I can see, there are usually only two prestart hooks in your case: "libnetwork-setkey" and "oci-register-machine". You can disable the latter in /usr/libexec/oci/hooks.d/oci-register-machine to see which one is the culprit.

Also, you can double-check whether there are more prestart hooks in your case by looking at:

```
# cat /run/docker/libcontainerd/<container ID>/config.json | python -mjson.tool
```

---

(In reply to CAI Qian from comment #10)
> From what I can see, there are usually only two prestart hooks in your case:
> "libnetwork-setkey" and "oci-register-machine". You can disable the latter in
> /usr/libexec/oci/hooks.d/oci-register-machine to see which one is the culprit.

How? I am not familiar with oci-register-machine, or any hooks for that matter.

> Also, you can double-check whether there are more prestart hooks in your case
> by looking at:
>
> # cat /run/docker/libcontainerd/<container ID>/config.json | python -mjson.tool

There are 4:

- libnetwork-setkey
- /usr/libexec/oci/hooks.d/oci-register-machine
- /usr/libexec/oci/hooks.d/oci-systemd-hook
- /usr/libexec/oci/hooks.d/oci-umount

---

Setting "disabled : true" there before running the container will do. For oci-systemd-hook, you will need to add an env like this:

```
docker run --env oci-systemd-hook=disabled -it --rm fedora /bin/bash
```

I am not sure if your oci-systemd-hook version has this capability yet. If not, you can just replace /usr/libexec/oci/hooks.d/oci-systemd-hook with a dummy binary that does nothing (or copy /usr/bin/echo over it) before running the container. For oci-umount, just remove all entries in /etc/oci-umount.conf before running the container.

I very much suspect that oci-umount is the culprit since it is new. Your version of oci-umount might be missing the commit that bails out early on an empty oci-umount.conf. Your best bet would be to just run

```
# cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount
```

before running the test.

---

(In reply to CAI Qian from comment #13)
> I very much suspect that oci-umount is the culprit since it is new. Your
> version of oci-umount might be missing the commit that bails out early on an
> empty oci-umount.conf. Your best bet would be to just run
>
> # cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount
>
> before running the test.

Thanks! The problem is definitely oci-umount - without it everything works fine; when I add it back I get the crash.

---

Adding Vivek for the oci-umount issue then.

---

So what command should I run to reproduce the issue? I simply ran "docker run -ti fedora bash" and that seems to work just fine on this node.

---

@vivek - note that this _only_ happens with the fluentd pod, which is doing this:

```
umount /var/lib/docker/containers/*/shm || :
```

But afaict, the pod never gets to this point; that is, if I enable set -x in the fluentd run.sh script, I see nothing in oc logs. I suppose it is possible that the umount is causing the segfault, which then kills the buffered log messages.
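A note on the dummy-binary trick from comment #13: any binary that exits 0 can stand in for a hook while bisecting. Below is a minimal, hypothetical C version of such a stand-in that also logs each invocation, which helps confirm whether a given hook even runs before the crash. It assumes the OCI hook convention of receiving the container state as JSON on stdin; the log path /var/log/noop-hook.log is invented for the example.

```c
/*
 * Hypothetical stand-in prestart hook, same idea as
 * "cp /usr/bin/echo /usr/libexec/oci/hooks.d/oci-umount" above,
 * but it also records that it ran.
 */
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    char buf[4096];

    /* OCI hooks receive the container state as JSON on stdin;
     * drain it so the runtime never blocks on a full pipe. */
    while (fread(buf, 1, sizeof(buf), stdin) > 0)
        ;

    /* Log path is illustrative only. */
    FILE *log = fopen("/var/log/noop-hook.log", "a");
    if (log) {
        time_t now = time(NULL);
        fprintf(log, "%s invoked at %s",
                argc > 0 ? argv[0] : "hook", ctime(&now));
        fclose(log);
    }

    return 0; /* Always succeed so container start is never blocked. */
}
```

Compiled with `gcc -o noop-hook noop-hook.c` and copied over the suspect hook binary, it behaves like the echo trick while leaving a trace of each invocation.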
---

I see you are already on the system:

```
ssh -i libra.pem root.centralci.eng.rdu2.redhat.com
```
---

@Rich, yes I am playing on this system, trying to figure out what's going on. I am thinking that the segfault is happening in oci-umount itself. If that's the case, then run.sh inside the container will never get a chance to run at all. Also, there seem to be two fluentd containers on the system. One of them sometimes starts successfully and the other does not; I am not yet able to nail down why one of them works only sometimes. As of now, oc logs logging-fluentd-9zzgq is failing while oc logs logging-fluentd-zc61t is successful. BTW, I have also taken upstream oci-umount, compiled it, and placed it on the system in an attempt to narrow things down a bit.

---

oc logs logging-fluentd-9zzgq fails even if I remove the oci-umount hook. So something is not right. That seems to suggest that there might be two errors.

---

(In reply to Vivek Goyal from comment #19)
> oc logs logging-fluentd-9zzgq fails even if I remove the oci-umount hook. So
> something is not right. That seems to suggest that there might be two errors.

logging-fluentd-9zzgq is running on another node.

---

Ok, I think I found the bug in the oci-umount plugin. Basically we can end up doing "strlen(NULL)", and that will crash the plugin. This can happen when destination and source are the same, as specified in /etc/oci-umount.conf. For example, try running a container as follows:

```
docker run -ti -v /var/lib/docker/devicemapper:/var/lib/docker/devicemapper fedora bash
```

And the crash is reproduced. I will fix it soon. (An illustrative sketch of this crash pattern appears at the end of this report.)

---

Created a PR. I think this should fix it.

https://github.com/projectatomic/oci-umount/pull/13

---

I have compiled a new oci-umount on this system (with the fix), and this node is now running oci-umount with the fix. Please test and make sure it is working.

---

Test passed on docker-1.12.6-48; fluentd is able to run and the logging system worked fine:

```
# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-lbnzk       1/1       Running   0          13m
logging-es-s5sp96at-1-v3lpb   1/1       Running   0          13m
logging-fluentd-c9dqq         1/1       Running   0          13m
logging-fluentd-l3bn0         1/1       Running   0          13m
logging-kibana-1-p63qs        2/2       Running   0          13m

# openshift version
openshift v3.5.5.31.3
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

# rpm -q openvswitch
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64
```

Per comment #28, moving to verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2344
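As an illustration of the crash diagnosed above: the sketch below is not the actual oci-umount source, just a minimal C reconstruction of how an unchecked lookup can feed NULL into strlen() when a path from /etc/oci-umount.conf is bind-mounted onto itself, plus the kind of NULL check that avoids it. The source_for() helper and the paths are invented for the example.

```c
/*
 * Illustrative only: NOT the actual oci-umount source. Shows how
 * strlen(NULL) can crash a hook when a configured umount path is
 * bind-mounted onto itself, and the NULL check that avoids it.
 */
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup: returns the source path backing `dest`,
 * or NULL when source and destination are the same mount. */
static const char *source_for(const char *dest)
{
    if (strcmp(dest, "/var/lib/docker/devicemapper") == 0)
        return NULL; /* -v /path:/path bind mount: no distinct source */
    return "/dev/mapper/example-backing-dev";
}

int main(void)
{
    const char *dest = "/var/lib/docker/devicemapper";
    const char *src = source_for(dest);

    /* Buggy pattern: strlen(NULL) is undefined behavior and
     * segfaults on glibc.
     *
     *     size_t n = strlen(src);
     */

    /* Fixed pattern: bail out before touching the string. */
    if (src == NULL) {
        fprintf(stderr, "skipping %s: no distinct source\n", dest);
        return 0;
    }

    printf("source %s has length %zu\n", src, strlen(src));
    return 0;
}
```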