Bug 2217921

Summary: nscd aborts with failed assert in prune_cache
Product: Red Hat Enterprise Linux 8
Reporter: yanf
Component: glibc
Assignee: glibc team <glibc-bugzilla>
Status: CLOSED MIGRATED
QA Contact: qe-baseos-tools-bugs
Severity: medium
Priority: medium
Version: 8.8
CC: ashankar, casantos, codonell, cww, dj, fweimer, jwright, mijjapur, pfrankli, sipoyare
Target Milestone: rc
Keywords: Bugfix, MigratedToJIRA, Triaged
Target Release: ---
Flags: mijjapur: needinfo? (yanf)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-08-11 14:43:46 UTC
Type: Bug
Attachments (description: flags):
bt full of ABRT event: none
*actual* nscd backtrace: none
nscd backtrace with all symbols: none

Description yanf 2023-06-27 13:36:57 UTC
Created attachment 1972848
bt full of ABRT event

Description of problem:

nscd aborts (ABRT) when reading the `passwd` cache.

Version-Release number of selected component (if applicable):
glibc-2.28-189.1.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. Start nscd.
2. nscd aborts almost immediately.

Actual results:
strace shows:

[pid 219045] write(2</dev/null>, "nscd: cache.c:426: prune_cache: Assertion `dh->usable' failed.\n", 63) = 63
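
The line written to stderr has the usual assertion-failure layout: program name, source file, line number, function, and the failed expression. As a minimal sketch (the helper name `parse_assert_message` is hypothetical, and the pattern assumes exactly the layout seen in this strace line), the message can be split into its fields, and its length can be compared with the byte count that write() reported:

```python
import re

# Pattern for the message layout seen in the strace output above:
#   <program>: <file>:<line>: <function>: Assertion `<expr>' failed.
ASSERT_RE = re.compile(
    r"^(?P<program>[^:]+): (?P<file>[^:]+):(?P<line>\d+): "
    r"(?P<function>[^:]+): Assertion `(?P<expr>.*)' failed\.$"
)


def parse_assert_message(message: str) -> dict:
    """Split an assertion-failure line into its fields (hypothetical helper)."""
    match = ASSERT_RE.match(message.rstrip("\n"))
    if match is None:
        raise ValueError("not an assertion-failure line")
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


# The exact string from the strace line, including the trailing newline.
MESSAGE = "nscd: cache.c:426: prune_cache: Assertion `dh->usable' failed.\n"

if __name__ == "__main__":
    print(parse_assert_message(MESSAGE))
    # Byte length of the message, to compare with the write() return value.
    print(len(MESSAGE.encode()))
```

The byte length of the message, newline included, is 63, which matches both the length argument and the return value of the write() call, so the message reached fd 2 (here /dev/null) in full.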


Expected results:
nscd runs without error.

Additional info:

Debug output (actual IDs redacted for safety/confidentiality):

Mon 26 Jun 2023 06:55:18 PM EDT - 452466: Reloading "<redacted>" in user database cache!
Mon 26 Jun 2023 06:55:18 PM EDT - 452466: Reloading "<redacted>" in user database cache!
Mon 26 Jun 2023 06:55:18 PM EDT - 452466: Reloading "<redacted>" in user database cache!
Mon 26 Jun 2023 06:55:18 PM EDT - 452466: Reloading "<redacted>" in user database cache!
nscd: cache.c:426: prune_cache: Assertion `dh->usable' failed.

Backtrace:

#0  0x00007fd2d11336cc in __nscd_get_map_ref () from /lib64/libc.so.6
#1  0x00007fd2d112fa7a in nscd_getpw_r () from /lib64/libc.so.6
#2  0x00007fd2d112feac in __nscd_getpwuid_r () from /lib64/libc.so.6
#3  0x00007fd2d10c6dbf in getpwuid_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#4  0x00007fd2d17d0976 in pam_modutil_getpwuid () from /lib64/libpam.so.0
#5  0x00007fd2cdd8cb12 in pam_sm_authenticate () from /usr/lib64/security/pam_succeed_if.so
#6  0x00007fd2d17ca7b4 in _pam_dispatch () from /lib64/libpam.so.0
#7  0x00005630c43259a3 in cron_close_pam ()
#8  0x00005630c43251cf in do_command ()
#9  0x00005630c4324170 in job_runqueue ()
#10 0x00005630c432193c in main ()

bt full attached.

Comment 1 yanf 2023-06-27 13:43:54 UTC
Obviously, if I clear the cache, the problem goes away. I have the problematic passwd cache file, but can't post it here for obvious reasons. I might be able to send it directly under our mutual NDA.

Comment 2 yanf 2023-06-27 14:52:25 UTC
Created attachment 1972857
*actual* nscd backtrace

The previous bt was a related one from crond, but I was able to repro the issue under gdb, which gave this backtrace.

Comment 3 yanf 2023-06-27 15:08:16 UTC
Created attachment 1972871
nscd backtrace with all symbols

This backtrace was taken after adding the missing debug symbols package.

Comment 5 Carlos O'Donell 2023-06-30 13:31:59 UTC
If you are a Red Hat customer with an active subscription, please visit the Red Hat Customer Portal [1] for assistance with your issue.

[1] http://access.redhat.com/

Comment 6 Carlos Santos 2023-06-30 14:39:45 UTC
I'm providing the required link to the support ticket in the customer portal.

Comment 14 Florian Weimer 2023-07-25 12:21:08 UTC
I looked at this for some time and I'm still not sure what might be causing this. We need some sort of reproducer, or at least the corrupted mapping that triggers this.

This issue seems different from the known concurrency issues (which I think cannot happen on x86-64 due to its strong memory model). I wonder if it could be caused by inconsistent data coming back from LDAP, which triggers an expiration of cache entries that is not time-based and, in turn, the assert.

Comment 15 yanf 2023-07-28 19:36:59 UTC
@fweimer I uploaded the corrupt nscd passwd file, encrypted with aes-256-cbc, to RH case number 03548682.

You will need a password to decrypt it. Please reach out to me out of band.

Comment 16 Murali Prudhvi Ijjapureddi 2023-08-02 11:58:58 UTC
@yanf 

Hello Yan!

I have updated the support ticket and am posting the same information here for your reference:

Please update the support ticket with the requested information, and we will take this further.

Thanks! - Murali

====================================================
>>Hello Yan!

>>Thank you for updating the support ticket.

>>I see that you want to share the decryption password for the file that you have shared here on the support ticket as well as on the BugZilla ticket.

>>I understand that you want to share the password out of band over email. However, this is not a recommended process.

>>It is best to keep all the communication and information on the support portal for security reasons, and tracking purposes.

>>I had a word with Florian, the engineer working on the bug ticket, to get a better understanding of the progress we have made so far on the issue.

>>Let's work together on sharing the password over an alternate secure medium, so that the engineer can access the file and work on the issue.

>>I tried calling you on the number that we have on file for you, "2124780000", but it looks like a dummy placeholder number, and I wasn't able to reach you.

>>Could you provide your contact number along with country code to reach you and discuss this further?

>>Awaiting your response.

>>Thank you!

>>Regards,
>>Murali Prudhvi.
====================================================

Comment 19 RHEL Program Management 2023-08-11 14:43:46 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567
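
For this report, the query would use bug number 2217921. A minimal sketch of building such a search link follows; the helper name `bugzilla_search_url` is hypothetical, and appending the percent-encoded query directly to the URL above is an assumption about how the tracker accepts it:

```python
from urllib.parse import quote

# Base search URL quoted in the migration notice.
JIRA_SEARCH_URL = "https://issues.redhat.com/issues/?jql="


def bugzilla_search_url(bug_id: int) -> str:
    """Build a search link for the "Bugzilla Bug" field (hypothetical helper)."""
    jql = f'"Bugzilla Bug" = {bug_id}'
    # Percent-encode quotes, spaces, and "=" so the query survives in a URL.
    return JIRA_SEARCH_URL + quote(jql, safe="")


if __name__ == "__main__":
    print(bugzilla_search_url(2217921))
```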

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues.