Created attachment 1395413 [details]
strace of logging-fluentd startup until it crashes

Description of problem:

In the OpenShift scalability CI lab, logging-fluentd is crashing immediately after startup. The exact same build of logging-fluentd (3.9.0-0.41.0) runs fine on AWS. Both environments are cloud-based and have a huge number of differences (hardware, kernel version, virtualization method, etc.) which I will try to narrow down as much as possible. The crash is not load related; logging-fluentd simply never reaches Running when USE_JEMALLOC has its default value of true. Externally, the pod just restarts over and over with nothing in the pod log.

Running oc debug and then running "run.sh" manually to start fluentd shows that fluentd is seg faulting and core dumping immediately after "exec fluentd --no-supervisor":

+ rm -f /etc/fluent/configs.d/openshift/filter-pre-force-utf8.conf
+ touch /etc/fluent/configs.d/openshift/filter-pre-force-utf8.conf
+ type -p jemalloc-config
+ '[' true = true ']'
++ jemalloc-config --libdir
++ jemalloc-config --revision
+ export LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2
+ LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2
+ '[' true = true ']'
+ rm -f /etc/fluent/configs.d/openshift/filter-exclude-journal-debug.conf
+ touch /etc/fluent/configs.d/openshift/filter-exclude-journal-debug.conf
+ [[ -n '' ]]
+ exec fluentd --no-supervisor
Segmentation fault (core dumped)

The core dump has very little of value (to me anyway - I can provide it):

Core was generated by `/usr/bin/ruby /usr/bin/fluentd --no-supervisor'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7a96deafaf in _dl_runtime_profile_avx512 () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install glibc-2.17-221.el7.x86_64 nss-softokn-freebl-3.34.0-2.el7.x86_64
(gdb) bt
#0  0x00007f7a96deafaf in _dl_runtime_profile_avx512 () from /lib64/ld-linux-x86-64.so.2
#1  0x0000000000000000 in ?? ()
(gdb) info threads
  Id   Target Id   Frame
  2    LWP 292     0x00007f7a96682f06 in ?? ()
* 1    LWP 345     0x00007f7a96deafaf in _dl_runtime_profile_avx512 () from /lib64/ld-linux-x86-64.so.2

Gathered an strace of fluentd startup; the crash is a SEGV_ACCERR while trying to fork a child process (full strace attached):

641 mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5837516000
641 mprotect(0x7f5837516000, 4096, PROT_NONE) = 0
641 clone(child_stack=0x7f58375177f0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f58375199d0, tls=0x7f5837519700, child_tidptr=0x7f58375199d0) = 642
642 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f5837516d40} ---
642 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---

Where does libjemalloc come in? If I set the environment variable USE_JEMALLOC=false, the crash does not occur and logging-fluentd fully initializes and goes into Running state.

Version-Release number of selected component (if applicable):
Logging 3.9.0-0.41.0

How reproducible:
Always, in this specific OpenStack cluster

Steps to Reproduce:
This problem occurs immediately upon starting logging-fluentd in this cluster.

Actual results:
logging-fluentd will not start - there are no entries in the pod log. Running the pod with oc debug ds/logging-fluentd and then running bash -x ./run.sh to start fluentd will demonstrate the issue.

Additional info:
Will add a comparison of working and non-working environments in a follow-up comment.
Searching jemalloc issues did not yield a match (so far). Several issues mention problems with jemalloc and Transparent Huge Pages (which might be an inherent problem for usage in OpenShift). I turned THP off on one of the nodes where the crash was occurring, but the issue still reproduced. This crash does not seem related to THP.
> Where does libjemalloc come in?

Here:

+ export LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2
+ LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2

> If I set the environment variable USE_JEMALLOC=false then the crash does not
> occur and logging-fluentd fully initializes and goes into Running state.

So I guess the workaround for now is USE_JEMALLOC=false.

selinux? any AVCs?

I don't think it is specifically related to jemalloc, since fluentd works with jemalloc in many other environments.

@nhosoi - could this be related to the execmem issue you have been working on?
> Where does libjemalloc come in?

Sorry, rhetorical question :-)

> So I guess the workaround for now is USE_JEMALLOC=false

Agreed, but that also re-introduces known leaks/memory growth. But, I understand.

> selinux? any AVCs?

Nothing in /var/log/audit. Also, tested with setenforce 0 and the issue was still reproducible.

> I don't think it is specifically related to jemalloc, since fluentd works
> with jemalloc in many other environments.

Yes, it definitely works for me in other environments. Comparison of environments coming in a subsequent comment.
Full list of installed packages in working and non-working environments attached.

Non-working (scale CI cluster):

RHEL 7.4.4
Kernel 3.10.0-829.el7.x86_64
Spectre/Meltdown patch status:
  /sys/kernel/debug/x86/ibpb_enabled:0
  /sys/kernel/debug/x86/ibrs_enabled:0
  /sys/kernel/debug/x86/pti_enabled
OpenShift 3.9.0-0.24.0 on Docker 1.13
OpenStack 11 and KVM
Skylake processors on the hypervisors

Working (AWS):

RHEL 7.4.4
Kernel 3.10.0-843.el7.x86_64
Spectre/Meltdown patch status:
  /sys/kernel/debug/x86/ibpb_enabled:0
  /sys/kernel/debug/x86/ibrs_enabled:0
  /sys/kernel/debug/x86/pti_enabled:1
OpenShift 3.9.0-0.38.0 on Docker 1.13
AWS EC2 and Xen
Broadwell processors on the hypervisors (AWS m4 instances)
Created attachment 1395452 [details]
installed packages from environment where fluentd crashes
Created attachment 1395453 [details]
installed packages from environment where fluentd does not crash
Hi Mike,

Could you run "ldd /usr/bin/fluentd" on both envs? Are there any differences? You may need to install glibc-common to get ldd...
non-working/crash env:

sh-4.2# ldd /usr/bin/fluentd
        not a dynamic executable
sh-4.2# ldd /usr/bin/ruby
        linux-vdso.so.1 =>  (0x00007fff56f8d000)
        libruby.so.2.0 => /lib64/libruby.so.2.0 (0x00007f2589d9d000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2589b81000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f2589979000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f2589775000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f258953e000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f258923c000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2588e79000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f258a1fa000)
        libfreebl3.so => /lib64/libfreebl3.so (0x00007f2588c76000)
sh-4.2# yum list installed | grep glibc
glibc.x86_64         2.17-196.el7_4.2   @anaconda/7.4
glibc-common.x86_64  2.17-196.el7_4.2   @anaconda/7.4

working env:

sh-4.2# ldd /usr/bin/fluentd
        not a dynamic executable
sh-4.2# ldd /usr/bin/ruby
        linux-vdso.so.1 =>  (0x00007ffe55df8000)
        libruby.so.2.0 => /lib64/libruby.so.2.0 (0x00007f89371fc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f8936fe0000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f8936dd8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f8936bd4000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f893699d000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f893669b000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f89362d8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f8937659000)
        libfreebl3.so => /lib64/libfreebl3.so (0x00007f89360d5000)
Submitted too soon - glibc is the identical version in both envs.

working:

sh-4.2# yum list installed | grep glibc
glibc.x86_64         2.17-196.el7_4.2   @anaconda/7.4
glibc-common.x86_64  2.17-196.el7_4.2   @anaconda/7.4
Hmmm, could deny_execmem be set on the crashed host by any chance? Can we have the output from "semanage boolean --list"?

> Gathered an strace of fluentd startup and the crash is a SEGV_ACCERR while
> trying to fork a child process (full strace attached):
>
> 641 mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5837516000
> 641 mprotect(0x7f5837516000, 4096, PROT_NONE) = 0
> 641 clone(child_stack=0x7f58375177f0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f58375199d0, tls=0x7f5837519700, child_tidptr=0x7f58375199d0) = 642
> 642 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f5837516d40} ---
> 642 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
Created attachment 1395643 [details]
semanage boolean --list on the compute where fluentd pod is running

deny_execmem (off , off) Allow deny to execmem

Full listing attached.
Thanks, Mike. The list is identical to mine on the test env...
(In reply to Mike Fiedler from comment #4)
> Full list of installed packages in working and non-working environments
> attached.
>
> Non-working (scale CI cluster)
>
> RHEL 7.4.4
> Kernel 3.10.0-829.el7.x86_64
> OpenShift 3.9.0-0.24.0 on Docker 1.13
> OpenStack 11 and KVM
> Skylake processors on the hypervisors

Is this a CPU variant with AVX-512? (There are many different Skylake chips out there.)

Lazy binding needs a fair amount of stack space since glibc-2.17-196.el7_4.1. We changed the thread stack accounting in glibc-2.17-221.el7, essentially increasing the stack allocation by 4 KiB. This may help here. If lazy binding is indeed the culprit, LD_BIND_NOW=1 may be a workaround, too.

The other question is what is creating such small stacks. PTHREAD_STACK_MIN is *not* intended as a guideline for small stacks; it is the absolute minimum stack that can be used to launch a new thread. This does not mean the thread is able to do anything useful with that stack.
We need a full backtrace to debug this any further and to distinguish the exact root cause of the issue. Please install the debuginfo as directed by the debugger "Missing separate debuginfos, use: debuginfo-install glibc-2.17-221.el7.x86_64 nss-softokn-freebl-3.34.0-2.el7.x86_64", re-run the scenario, and get a full backtrace of the exact crash location.
re: comment 14: LD_BIND_NOW=1 worked as a workaround. No crash.

re: comment 15: Installed the debuginfo packages and here is what I get from bt full and info threads. Let me know if there are other things you need to see.

Core was generated by `/usr/bin/ruby /usr/bin/fluentd --no-supervisor'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fba05101faf in _dl_runtime_profile_avx512 () at ../sysdeps/x86_64/dl-trampoline.h:262
262     VMOVA %VEC(5), (LR_VECTOR_OFFSET + VECTOR_SIZE*5)(%rsp)
(gdb) bt
#0  0x00007fba05101faf in _dl_runtime_profile_avx512 () at ../sysdeps/x86_64/dl-trampoline.h:262
#1  0x0000000000000000 in ?? ()
(gdb) bt full
#0  0x00007fba05101faf in _dl_runtime_profile_avx512 () at ../sysdeps/x86_64/dl-trampoline.h:262
No locals.
#1  0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) info thread
  Id   Target Id   Frame
  2    LWP 7     0x00007fba0499a0ad in ?? ()
* 1    LWP 67    0x00007fba05101faf in _dl_runtime_profile_avx512 () at ../sysdeps/x86_64/dl-trampoline.h:262
(In reply to Mike Fiedler from comment #17)
> re: comment 14: LD_BIND_NOW=1 worked as a workaround. No crash.

Great news. So we do have a workaround that allows us to use jemalloc.
(In reply to Rich Megginson from comment #19)
> (In reply to Mike Fiedler from comment #17)
> > re: comment 14: LD_BIND_NOW=1 worked as a workaround. No crash.
>
> Great news. So we do have a workaround that allows us to use jemalloc.

This is an indicator that you are very close to the limit of the stack. Someone should analyze why you are close enough to the stack limit that this makes a difference. Additional stack usage could push you over again, and no additional tweaking will help. As Florian notes in comment #14, PTHREAD_STACK_MIN is the minimum amount of stack needed to start a thread; anything else requires more stack.
(In reply to Carlos O'Donell from comment #20)
> This is an indicator that you are very close to the limit of the stack.
> Someone should analyze why you are close enough to the stack limit that this
> makes a difference. Additional stack usage could push you over again, and no
> additional tweaking will help.

So, how do we proceed with trying to use glibc malloc, if/when we hit the wall with jemalloc and cannot use it anymore?
(In reply to Rich Megginson from comment #21)
> So, how do we proceed with trying to use glibc malloc, if/when we hit the
> wall with jemalloc and cannot use it anymore?

I'm spinning up a project to analyze RSS usage in glibc malloc, and I'd like to use fluentd and 389ds as examples/guinea pigs. We have a high-speed malloc tracer version of libc.so.6 that can trace all malloc API family calls across all threads. We work with you to capture traces, and then run them through a simulator so we can recreate the fragmentation and pause the simulator to look at heap evolution.

I want to be clear here that this is just the root-cause analysis of why there is increased RSS; after that we have to have a deeper think about how that might be fixed. It is a longer road, but if we start today we'll be able to offer you more choices in the future, hopefully one that allows you to choose glibc malloc and reduce your need to bundle other allocators.
I would like to find out what is actually going on. I reviewed the available evidence, and it does not conclusively show that lack of stack space is the immediate cause of the crash.

I tried to reproduce the issue with:

glibc-2.17-222.el7.x86_64
ruby-libs-2.0.0.648-30.el7.x86_64
rubygem-yajl-ruby-1.3.1-1.el7.x86_64
rubygem-msgpack-1.2.0-1.el7.x86_64
rubygem-cool.io-1.5.3-1.el7.x86_64
rubygem-http_parser.rb-0.6.0-4.el7.x86_64
rubygem-string-scrub-0.0.5-4.el7.x86_64
libyaml-0.1.4-11.el7_0.x86_64
rubygem-bigdecimal-1.2.0-30.el7.x86_64
ruby-irb-2.0.0.648-30.el7.noarch
ruby-2.0.0.648-30.el7.x86_64
rubygem-io-console-0.4.2-30.el7.x86_64
rubygem-rdoc-4.0.0-30.el7.noarch
rubygem-json-2.0.2-3.el7.x86_64
rubygems-2.0.14.1-30.el7.noarch
rubygem-psych-2.0.0-30.el7.x86_64
rubygem-thread_safe-0.3.6-1.el7.noarch
rubygem-tzinfo-1.2.4-1.el7.noarch
rubygem-tzinfo-data-1.2017.3-1.el7.noarch
rubygem-sigdump-0.2.4-1.el7.noarch
fluentd-0.12.42-1.el7.x86_64

However, fluentd starts up:

# LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2 /usr/bin/ruby /usr/bin/fluentd --no-supervisor
2018-02-19 06:08:47 -0500 [info]: reading config file path="/etc/fluent/fluent.conf"
2018-02-19 06:08:47 -0500 [info]: starting fluentd-0.12.42 without supervision
2018-02-19 06:08:47 -0500 [info]: gem 'fluentd' version '0.12.42'
2018-02-19 06:08:47 -0500 [info]: using configuration file: <ROOT>
</ROOT>

And this configuration uses _dl_runtime_resolve_xsavec, not _dl_runtime_profile_avx512. Since comment 17 shows that the _dl_runtime_profile_avx512 symbol is *not* spurious, I wonder why this dynamic linker trampoline is used on your system. This should not happen, and I have not seen any explanation for that yet.
@fweimer re: comment 28 - what information is required? I'm happy to gather it but it wasn't clear what you were looking for. Thanks.
(In reply to Mike Fiedler from comment #29)
> @fweimer re: comment 28 - what information is required? I'm happy to
> gather it but it wasn't clear what you were looking for. Thanks.

I need to replicate the issue, so that I can poke at the running binary with GDB. I tried to replicate your installation, but the issue does not reproduce, so either my configuration is different, or I'm using different software versions/builds.
Further investigation:

Running LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2 /usr/bin/ruby /usr/bin/fluentd --no-supervisor

with glibc 2.17-221.el7 - no crash, fluentd starts
with glibc 2.17-196.el7_4.2 - the crash described here occurs

Lucky? Or something fixed in the later version?

@fweimer I have a system exhibiting the behavior on 2.17-196. If you are still interested, please send/link me your public ssh key via e-mail and I will send you access instructions.
(In reply to Mike Fiedler from comment #34)
> Further investigation:
>
> Running LD_PRELOAD=/usr/libexec/fluentd/lib/libjemalloc.so.2 /usr/bin/ruby
> /usr/bin/fluentd --no-supervisor
>
> with glibc 2.17-221.el7 - no crash, fluentd starts
>
> with glibc 2.17-196.el7_4.2 - the crash described here occurs

I see a different crash on the system you supplied, not matching the one in the description:

(gdb) bt
#0  0x00007ffff7df0faf in _dl_runtime_resolve_xsavec () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff72fdd9b in start_thread () from /lib64/libpthread.so.0
#2  0x00007ffff66e634d in clone () from /lib64/libc.so.6

Note that the expected ld.so trampoline is used.

This crash happens because Ruby creates the timer thread with a stack size of PTHREAD_STACK_MIN, and glibc subtracts the size of the TLS data from that (which is a glibc bug: we never even get to the user-supplied thread function, so this is a clear conformance issue). jemalloc needs much, much more TLS data than glibc malloc (6104 bytes, I believe), which is why LD_PRELOADing it triggers the crash. And the crash happens in the expected place in the dynamic linker trampoline, right after the stack allocation for the large save area required by AVX-512.

So to summarize: this crash (in comment 34) is the known issue around PTHREAD_STACK_MIN. The workaround in glibc 2.17-221.el7 appears to help here, but with 6104 bytes of TLS data consumed by jemalloc, it is still going to be very tight.

However, there have been *different* backtraces posted in this bug (comment 0 and comment 17), allegedly even with glibc 2.17-221.el7, and those are not explained by the issue discussed in this comment. If you can reproduce the different crash on glibc-2.17-221.el7.x86_64, I would love to see that under GDB and poke around for a bit. I think we may have a second bug here.
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/6ed6a427c4e2c82f775ab31887866d5560b4f8d5
Bug 1544815 - logging-fluentd SEGV_ACCERR crash on init when libjemalloc is used in scalability CI lab cluster
https://bugzilla.redhat.com/show_bug.cgi?id=1544815
Use export LD_BIND_NOW=1 to work around the problem when using jemalloc.

https://github.com/openshift/origin-aggregated-logging/commit/3310075669a0253856cdb018257c037085052434
Merge pull request #977 from richm/bug-1544815
Automatic merge from submit-queue.
Bug 1544815 - logging-fluentd SEGV_ACCERR crash on init when libjemalloc is used in scalability CI lab cluster
Verified on 3.9.2. LD_BIND_NOW is set.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/1774d58f9488046c0ee4f835e986106078d4cb8c
Bug 1544815 - logging-fluentd SEGV_ACCERR crash on init when libjemalloc is used in scalability CI lab cluster
https://bugzilla.redhat.com/show_bug.cgi?id=1544815
Use export LD_BIND_NOW=1 to work around the problem when using jemalloc.

https://github.com/openshift/origin-aggregated-logging/commit/1f36bf5166fcb3900fb0e0647de839f095f0ae0e
Merge pull request #978 from openshift-cherrypick-robot/cherry-pick-977-to-es5.x
[es5.x] Bug 1544815 - logging-fluentd SEGV_ACCERR crash on init when libjemalloc is used in scalability CI lab cluster