This bug report is coming from a CentOS 8.3.2011 system. I hope it will be welcomed.

Description of problem:
nscd dies about every 19 seconds for me on this Dell R740xd server.

Feb 11 06:53:12 darkstar systemd[1]: nscd.service: Main process exited, code=killed, status=6/ABRT
Feb 11 06:53:12 darkstar systemd[1]: nscd.service: Failed with result 'signal'.
Feb 11 06:53:12 darkstar systemd[1]: nscd.service: Service RestartSec=100ms expired, scheduling restart.
Feb 11 06:53:12 darkstar systemd[1]: nscd.service: Scheduled restart job, restart counter is at 1828.

Version-Release number of selected component (if applicable):
nscd-2.28-127.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. systemctl start nscd
2. monitor logs, notice it exits.

Additional info:
NIS is used with a netgroup db of about 21,030 bytes (ypcat -k netgroup).

I have numerous other systems similarly configured on which nscd does not die, so the difference may be the number of cores on this box (2x Xeon 8168, HT enabled, 96 total) or the amount of RAM (1.5TB, MemTotal=1583374272 kB).

[root@darkstar ~]# gdb nscd
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8
[snip]
Reading symbols from nscd...Reading symbols from /usr/lib/debug/usr/sbin/nscd.debug...done.
done.
(gdb) run -dF
Starting program: /usr/sbin/nscd -dF
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
[20-ish repeats of previous line deleted]
[New Thread 0x7fffdb10c700 (LWP 26641)]
[New Thread 0x7fffdaf0b700 (LWP 26642)]
[New Thread 0x7fffdad0a700 (LWP 26643)]
[New Thread 0x7fffdab09700 (LWP 26644)]
[New Thread 0x7fffda908700 (LWP 26645)]
[New Thread 0x7fffda707700 (LWP 26646)]
[New Thread 0x7fffda506700 (LWP 26647)]
[New Thread 0x7fffda305700 (LWP 26648)]
[New Thread 0x7fffda104700 (LWP 26649)]

Thread 6 "nscd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffda908700 (LWP 26645)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	  return ret;
Missing separate debuginfos, use: yum debuginfo-install audit-libs-3.0-0.17.20191104git1c2f876.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 krb5-libs-1.18.2-5.el8.x86_64 libblkid-2.32.1-24.el8.x86_64 libcap-2.26-4.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libcom_err-1.45.6-1.el8.x86_64 libgcc-8.3.1-5.1.el8.x86_64 libmount-2.32.1-24.el8.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libselinux-2.9-4.el8_3.x86_64 libtirpc-1.1.4-4.el8.x86_64 libuuid-2.32.1-24.el8.x86_64 nss_nis-3.0-8.el8.x86_64 openssl-libs-1.1.1g-12.el8_3.x86_64 pcre2-10.32-2.el8.x86_64 systemd-libs-239-41.el8_3.1.x86_64 zlib-1.2.11-16.el8_2.x86_64

I have also seen:

free(): double free detected in tcache 2

Thread 6 "nscd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffda908700 (LWP 22399)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	  return ret;
Sorry for omitting the backtrace. Easily reproduced via:

echo -e "run -dF\nbt\nquit" | gdb /usr/sbin/nscd

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff71b8c35 in __GI_abort () at abort.c:79
#2  0x00007ffff7211987 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff731e11d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7218d8c in malloc_printerr (str=str@entry=0x7ffff731fd40 "free(): double free detected in tcache 2") at malloc.c:5374
#4  0x00007ffff721aafd in _int_free (av=0x7fffb8000020, p=0x7fffb8000b90, have_lock=<optimized out>) at malloc.c:4213
#5  0x0000555555571927 in addinnetgrX (db=0x555555779600 <dbs+1408>, fd=-1, key=<optimized out>, uid=4294967295, he=0x7fffdb10ea38, dh=0x7fffdb10e9f0, req=<optimized out>, req=<optimized out>) at netgroupcache.c:605
#6  0x0000555555571d57 in readdinnetgr (db=<optimized out>, he=<optimized out>, dh=<optimized out>) at netgroupcache.c:663
#7  0x0000555555567c6f in prune_cache (table=table@entry=0x555555779600 <dbs+1408>, now=<optimized out>, now@entry=1613146515, fd=fd@entry=-1) at cache.c:415
#8  0x000055555555c3f7 in nscd_run_prune (p=<optimized out>) at connections.c:1555
#9  0x00007ffff7bbc14a in start_thread (arg=<optimized out>) at pthread_create.c:479
#10 0x00007ffff7293f23 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
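For readers unfamiliar with this class of failure: frames #3 to #5 show glibc's allocator aborting because a chunk freed from addinnetgrX had already been freed. The trace does not show where the earlier free happened. The following is only a minimal sketch of one common way such a bug arises (an early error path and a common exit path both releasing the same buffer) and of a defensive pattern that makes the second release harmless. It is not the nscd source and not the patch attached to this bug; scratch_buf, scratch_init, scratch_release and lookup_sketch are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical scratch buffer; not a structure used by nscd. */
struct scratch_buf {
    char *data;   /* heap allocation, or NULL when nothing is owned */
    size_t len;   /* usable size of data in bytes */
};

/* Allocate len bytes. Returns 0 on success, -1 on allocation failure. */
int scratch_init(struct scratch_buf *b, size_t len)
{
    assert(b != NULL);
    b->data = malloc(len ? len : 1);
    b->len = b->data ? len : 0;
    return b->data ? 0 : -1;
}

/* Release the buffer and clear the pointer. Because free(NULL) is a
   no-op, calling this twice on the same buffer is safe. Without the
   NULL assignment, the second call would be a double free. */
void scratch_release(struct scratch_buf *b)
{
    assert(b != NULL);
    free(b->data);
    b->data = NULL;
    b->len = 0;
}

/* A lookup with an early error path and a common exit path that both
   release the buffer. Returns 0 for a non-empty key, -1 otherwise. */
int lookup_sketch(const char *key, struct scratch_buf *b)
{
    int rc = -1;

    if (scratch_init(b, 64) != 0)
        return -1;

    if (key == NULL || key[0] == '\0') {
        scratch_release(b);          /* early error path releases */
        goto out;
    }

    snprintf(b->data, b->len, "%s", key);
    rc = 0;

out:
    scratch_release(b);              /* common exit releases again */
    return rc;
}
```

Calling lookup_sketch("", &buf) takes both release paths, and the cleared pointer is what keeps the second one from reaching free() with a stale address.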
Thanks for submitting this issue. The backtrace is particularly helpful in identifying a possible cause.

I notice that you're using CentOS, which is a distinct product from Fedora, CentOS Stream, or RHEL.

* Can you reproduce the issue on CentOS Stream?
* Can you reproduce the issue on RHEL 8.3?

In order to prioritize this issue we would need you to confirm that you can reproduce it on a supported release. It would be even better if you could get a support case attached to this bug by working with Red Hat support.

We don't normally handle CentOS defects in this tracker. There is a distinct tracker for that here:
https://bugs.centos.org/main_page.php
Created attachment 1759135
proposed fix to double free in nscd

I am not comfortable moving this server to an alternate OS at this time (I thought CentOS Linux would be supported through 2021).

Siddhesh, thank you for the Sourceware link. I rebuilt glibc with the proposed untested patch, adding it as Patch999 to the spec. It did not apply directly, likely because other patches had already changed the context around it, so with prepared sources I made the two code changes by hand and generated a new diff (attached).

While it's early to tell for sure, nscd hasn't crashed for 1.5 hours, which is an improvement.
(In reply to schanzle from comment #6)
> Created attachment 1759135
> proposed fix to double free in nscd
>
> I am not comfortable moving this server to an alternate OS at this time (I
> thought CentOS Linux would be supported through 2021).

I understand, it's just that CentOS Linux bugs are recorded and prioritized separately through a different tracker, i.e. https://bugs.centos.org/main_page.php . Luckily Carlos was able to identify a possible cause through the backtrace and was able to easily confirm that it's a bug.

> Siddhesh, thank you for the Sourceware link. I rebuilt glibc with the
> proposed untested patch, adding it as Patch999 to the spec. It did not
> apply directly, likely because other patches had already changed the
> context around it, so with prepared sources I made the two code changes
> by hand and generated a new diff (attached).
>
> While it's early to tell for sure, nscd hasn't crashed for 1.5 hours, which
> is an improvement.

Thanks for testing the patch, hopefully it fixes your use case.
> While it's early to tell for sure, nscd hasn't crashed for 1.5 hours, which is an improvement.

Status update: Running for about a week and no crashes.

I really appreciate a working nscd. Nightly, this server scans several NFS servers to get metadata - basically 'find -ls'. Without nscd, the scan time of one server with 7.5 million objects increases from 45 minutes to 3 hours 45 minutes, about five times as long. I attribute this to slow NIS lookups of uid/gid data.

We are moving to sssd/AD, but until then, this works well enough.

Thanks for all the effort behind the scenes to make this fix available.
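To illustrate why a cache matters for a scan like this: every object listed needs its owner's uid turned into a name, and many objects share a small set of owners. The following is a rough in-process sketch of that idea only; it is not how nscd or find is implemented, and uid_to_name, cache_misses and the table size are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

#define CACHE_SLOTS 256   /* hypothetical table size */
#define NAME_LEN 64       /* longer names are truncated in this sketch */

struct uid_entry {
    int used;
    uid_t uid;
    char name[NAME_LEN];
};

static struct uid_entry cache[CACHE_SLOTS];
unsigned long cache_misses;   /* number of real getpwuid() calls made */

/* Return the user name for uid, consulting a small direct-mapped cache
   first. Only a miss reaches getpwuid(), which may go to NIS or another
   slow backend. Unknown uids are cached as their decimal number. The
   returned pointer is valid until the slot is reused. Not thread-safe. */
const char *uid_to_name(uid_t uid)
{
    struct uid_entry *e = &cache[uid % CACHE_SLOTS];

    if (e->used && e->uid == uid)
        return e->name;               /* hit: no name service lookup */

    cache_misses++;
    struct passwd *pw = getpwuid(uid);
    if (pw != NULL)
        snprintf(e->name, sizeof e->name, "%s", pw->pw_name);
    else
        snprintf(e->name, sizeof e->name, "%lu", (unsigned long)uid);

    e->uid = uid;
    e->used = 1;
    return e->name;
}
```

With a handful of distinct owners across millions of files, almost every call after the first few would be answered from the table, which is the effect a working nscd provides system-wide.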
When will this fix be published as an update? nscd-2.28-151.el8.x86_64 from glibc-2.28-151.el8.src.rpm still crashes.
(In reply to schanzle from comment #15)
> When will this fix be published as an update?
>
> nscd-2.28-151.el8.x86_64 from glibc-2.28-151.el8.src.rpm still crashes.

Sorry, I cannot comment publicly on future release dates. If you need urgent assistance or a hotfix, please contact Customer Support:

https://access.redhat.com/support/cases/

Thank you for your understanding.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: glibc security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4358