Bug 1277672 - nscd is not caching ldap netgroup data properly, hangs on nscd -i netgroup
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: glibc
Version: 6.4
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: glibc team
QA Contact: qe-baseos-tools
Depends On:
Blocks: 1374441 1461138 1435615
Reported: 2015-11-03 14:36 EST by Edward Goodwin
Modified: 2018-01-10 23:38 EST

Clones: 1435615
Last Closed: 2017-11-08 04:20:22 EST
Type: Bug

Attachments
patch to fix nscd -i command hang (573 bytes, patch), 2017-03-31 16:53 EDT, Al Heisner
patch to fix nscd -i command hang (333 bytes, patch), 2017-04-05 12:05 EDT, Al Heisner


External Trackers
Tracker      ID      Priority  Status  Summary  Last Updated
Debian BTS   800523  None      None    None     Never
Sourceware   22161   None      None    None     2017-09-21 03:37 EDT

Description Edward Goodwin 2015-11-03 14:36:19 EST
Description of problem:

nscd is not properly caching netgroup data from LDAP: it does not appear to refresh the cached data from the LDAP server. Attempting to invalidate the netgroup cache with `nscd -i netgroup` hangs. As a result, sudoRole LDAP sudo records that reference netgroups with the "+" syntax are not honored. Stopping nscd resolves the issue.


Version-Release number of selected component (if applicable):

nscd-2.12-1.166.el6_7.3.x86_64
nss-3.19.1-3.el6_6.x86_64

How reproducible: always


Steps to Reproduce:

--- /etc/nscd.conf ---
        enable-cache            netgroup        yes
        positive-time-to-live   netgroup        28800
        negative-time-to-live   netgroup        20
        suggested-size          netgroup        211
        check-files             netgroup        yes
        persistent              netgroup        yes
        shared                  netgroup        yes
        max-db-size             netgroup        33554432


----------------------

--- /etc/nsswitch.conf ---
passwd:     compat
passwd_compat:    ldap
shadow:     compat
group:      files ldap
hosts:      files dns

bootparams: nisplus [NOTFOUND=return] files
ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files
netgroup:   ldap
publickey:  nisplus
automount:  files nisplus
aliases:    files nisplus
sudoers:  files ldap

1. run getent netgroup webusers

webusers: ( ,egoodwin, ) ( ,jsmith, ) ( ,aflanders, )

2. update the netgroup on the LDAP side; getent returns the same result

webusers: ( ,egoodwin, ) ( ,jsmith, ) ( ,aflanders, )

3. stop nscd with `service nscd stop`

4. getent returns the up-to-date values

webusers: ( ,egoodwin, ) ( ,jsmith, ) ( ,aflanders, ) ( ,jpate, )

5. restart nscd, it returns the cached values

webusers: ( ,egoodwin, ) ( ,jsmith, ) ( ,aflanders, )

6. attempt to invalidate the netgroup cache with `nscd -i netgroup`
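
The stale result in steps 2 and 5 can be made easier to spot by comparing the two getent outputs member by member. The following is a minimal sketch and not part of the original report: `netgroup_users` and `missing_from_cache` are hypothetical helper names, and the parsing assumes the `( host,user,domain )` triple layout shown in the steps above.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original report).

# Print the user field of every (host,user,domain) triple, one per line, sorted.
netgroup_users() {
    printf '%s\n' "$1" \
        | grep -o '([^)]*)' \
        | awk -F, '{ gsub(/[ ()]/, "", $2); print $2 }' \
        | sort
}

# Print users that appear in the fresh output ($2) but not in the cached one ($1).
missing_from_cache() {
    netgroup_users "$2" | grep -vxF -e "$(netgroup_users "$1")"
}

# Sample data copied from steps 2 and 4 of this report.
cached='webusers: ( ,egoodwin, ) ( ,jsmith, ) ( ,aflanders, )'
fresh='webusers: ( ,egoodwin, ) ( ,jsmith, ) ( ,aflanders, ) ( ,jpate, )'
missing_from_cache "$cached" "$fresh"
```

On an affected host, `cached` would presumably be captured with `getent netgroup webusers` while nscd is running, and `fresh` after `service nscd stop`.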



Actual results:

`nscd -i netgroup` hangs in step 6, and the nscd netgroup cache never updates properly.
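
Because the invalidation in step 6 never returns, it may be safer to run it under a time limit when scripting the reproduction. This is a minimal sketch, assuming the coreutils `timeout` utility is available; the function name `run_with_limit` and the 10-second limit are illustrative choices, not taken from the report.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command under a time limit and report a hang.
# Returns 0 if the command finished in time, 124 if it hit the limit
# (the exit status timeout uses for that case), otherwise the command's
# own exit status.
run_with_limit() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hung: '$*' did not finish within ${limit}s" >&2
    fi
    return "$status"
}

# Intended use on an affected host (needs root and a running nscd):
#   run_with_limit 10 nscd -i netgroup
```

A return status of 124 would then indicate the hang described here, rather than an ordinary failure of the command.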


Expected results:

`nscd -i netgroup` should invalidate the cache and repopulate it from the source.


Additional info:
Reproduced on several RHEL 6.4 servers.
Comment 2 Hubert Kario 2015-11-04 06:57:22 EST
The nss package is Mozilla Network Security Services.

Name Service Switch (both nscd and /etc/nsswitch.conf) is part of glibc.

You can see that either by asking the rpm db which package owns the file:
# rpm -qf /etc/nsswitch.conf 
glibc-2.12-1.166.el6.x86_64

or by checking which source package nscd comes from:
# rpm -qi nscd | grep Source
Group       : System Environment/Daemons    Source RPM: glibc-2.12-1.166.el6.src.rpm
Comment 3 Edward Goodwin 2015-11-04 10:54:29 EST
I can confirm that the following versions of glibc are installed:

glibc-2.12-1.166.el6_7.3.i686
glibc-2.12-1.166.el6_7.3.x86_64
Comment 6 Edward Goodwin 2016-02-29 06:16:56 EST
My workaround for this issue has been to disable the netgroup cache in /etc/nscd.conf. It is enabled by default.

        enable-cache            netgroup        no
        positive-time-to-live   netgroup        28800
        negative-time-to-live   netgroup        20
        suggested-size          netgroup        211
        check-files             netgroup        yes
        persistent              netgroup        yes
        shared                  netgroup        yes
        max-db-size             netgroup        33554432
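
One way to confirm the workaround is in place is to read the setting back from the configuration file. This is a minimal sketch and not from the original comment: `netgroup_cache_setting` is a hypothetical helper, and it assumes the whitespace-separated `enable-cache <service> <yes|no>` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the enable-cache value for the netgroup service
# from an nscd.conf-style file. Commented-out lines are skipped because their
# first field is not exactly "enable-cache".
netgroup_cache_setting() {
    awk '$1 == "enable-cache" && $2 == "netgroup" { print $3 }' "$1"
}

# Example against a temporary file holding a few illustrative lines.
conf=$(mktemp)
printf '%s\n' \
    '#       enable-cache            netgroup        yes' \
    '        enable-cache            netgroup        no' \
    '        positive-time-to-live   netgroup        28800' > "$conf"
netgroup_cache_setting "$conf"
rm -f "$conf"
```

On a live system the argument would be /etc/nscd.conf. nscd most likely needs a restart (for example `service nscd restart`) before a changed setting takes effect.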

Any help on this issue is appreciated; it's a major annoyance in our environment.
Comment 7 Hubert Kario 2016-02-29 08:00:39 EST
Please contact customer support and point them to this bug; this will increase its priority.
Comment 12 Al Heisner 2017-03-31 16:52:04 EDT
I also encountered problems with nscd -i while solving netgroup cache issues. I'll attach to this ticket the patch I used to fix the hang when running nscd -i.
Comment 13 Al Heisner 2017-03-31 16:53 EDT
Created attachment 1267982 [details]
patch to fix nscd -i command hang
Comment 14 Al Heisner 2017-03-31 18:03:17 EDT
Also, see https://bugzilla.redhat.com/show_bug.cgi?id=1436335 for the probable reason the netgroup cache is failing with sudo.
Comment 15 Al Heisner 2017-04-04 15:35:52 EDT
Sorry, after another look at this I found that my patch test was contaminated. It seems the patch I posted to ticket 1436335 was somehow preventing the condition that causes nscd to hang when trying to invalidate the cache. When I remove the other patch, nscd hangs even with this one applied.
Comment 16 Al Heisner 2017-04-05 12:05 EDT
Created attachment 1269025 [details]
patch to fix nscd -i command hang

This patch corrects orphaned readlocks on the netgroup db in nscd.
Comment 18 Florian Weimer 2017-11-08 04:20:15 EST
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

This issue does not qualify, but we are considering incorporating the upstream change into Red Hat Enterprise Linux 7; see bug 1435615.
Comment 19 Red Hat Bugzilla Rules Engine 2017-11-08 04:20:22 EST
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Comment 20 Edward Goodwin 2018-01-10 17:55:22 EST
I find it extremely discouraging that Red Hat lets the bug sit for two years until the release transitions to Prod Phase 3 and then declines to fix it; a very convenient way to get out of fixing downstream bugs.
Comment 21 Carlos O'Donell 2018-01-10 22:59:59 EST
(In reply to Edward Goodwin from comment #20)
> I find it extremely discouraging that Red Hat lets the bug sit for two years
> until the release transitions to Prod Phase 3 and then declines to fix it; a
> very convenient way to get out of fixing downstream bugs.

Edward,

Thank you for your feedback, we do appreciate it.

We only just managed to reproduce and fix this issue in September, and the backport risk exceeded what we could do in RHEL 6.10. Please keep in mind our customers expect very little change at this point in the RHEL 6 release.

We have scheduled a fix for this issue in the upcoming RHEL 7.5 release (Bug 1435615). So while progress was slow we had not forgotten about this issue and we strive to improve subsequent versions of RHEL.

Thank you again.
Comment 22 Al Heisner 2018-01-10 23:16:50 EST
Why would it take 5 months to reproduce and fix this issue when I posted the patch I've been using for 9 months now without issue, showing exactly where the orphaned readlocks were back in April of 2017?
Comment 23 Carlos O'Donell 2018-01-10 23:38:29 EST
(In reply to Al Heisner from comment #22)
> Why would it take 5 months to reproduce and fix this issue when I posted the
> patch I've been using for 9 months now without issue, showing exactly where
> the orphaned readlocks were back in April of 2017?

Al,

Thank you also for your feedback.

All changes going into Red Hat Enterprise Linux are verified and validated. This includes looking at broader issues with the locks in nscd and running our own tests on a number of different configurations. Despite the outcome being a small number of lines of code change, the underlying work that goes into accepting those changes is not small. In this case the work also ran against the deadline for transition from Phase 2 to Phase 3 and we appreciate the difficulty that brings for some customers.

As I mentioned in comment #21 the fix is scheduled for RHEL 7.5.
