Description of problem:
nscd appears to be caching an incorrect value for a user's shell.

The host runs RHEL 3 and is configured as follows:

/etc/nsswitch.conf:
    passwd: compat

/etc/passwd (relevant entries):
    +mqm::::::
    +::0:0:::/bin/false

(The mqm entry in NIS has a shell of /bin/csh.)

Attempts to su to the mqm user fail on a regular basis. While this is happening, "getent passwd mqm" returns /bin/false as the user's shell, but a ypcat of the passwd map in NIS shows the actual shell to be /bin/csh.

Further analysis shows that these episodes last almost exactly 10 minutes, which matches the positive-time-to-live value for passwd entry caching in nscd. Restarting nscd always resolves the issue, and turning off passwd caching in nscd avoids it. We therefore suspect that the issue resides in nscd.

Version-Release number of selected component (if applicable):
nscd-2.3.2-95.30

How reproducible:
We cannot reproduce this on demand. However, it occurs regularly on hosts that run frequent su and sudo commands as part of custom applications.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
In revisiting this bug I've discovered an error in how I described the problem. When this happens, the username in question is always referenced indirectly in the passwd file, via the +@ notation and the NIS netgroup database. So the passwd snippet above should really read:

    +@mq_acts::::::
    +::0:0:::/bin/false

where the netgroup "mq_acts" contains an entry for the "mqm" user. All other details of the problem remain the same. (This is something we continue to see on occasion on both AS 2.1 and RHEL 3.)
If this is one of the getXXbyYY{,_r} lookups rather than getXXent{,_r}, it might be because nss_compat uses the innetgr function in several places to check whether a particular user is in a netgroup. innetgr has no error reporting: it returns 1 if the netgroup contains the machine/user/domain triple and 0 otherwise. So 0 can be returned both when there really is no such triple and when some error occurred (such as a transient failure due to a busy NIS server). I'm not sure which is better: keeping the code as is, or using some other function instead of innetgr and failing the whole request whenever the netgroup lookup fails.
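
To make the suspected failure mode concrete, here is a minimal sketch in C of the two behaviours discussed above. It is not the nss_compat code: every name in it (ng_result, compat_rule, resolve_shell, lookup_ok, lookup_busy, sample_rules) is hypothetical, and the lookup callbacks are stand-ins for a real NIS query. It assumes that compat entries are tried in order and that a "+@netgroup" entry which does not match falls through to the catch-all "+" entry.

```c
#include <stddef.h>
#include <string.h>

/* Tri-state membership result (hypothetical). innetgr itself can only
 * express the last two values, so an error looks like NG_NO. */
enum ng_result { NG_ERROR = -1, NG_NO = 0, NG_YES = 1 };

enum resolve_status { RESOLVE_FAILED = -1, RESOLVE_OK = 0, RESOLVE_NO_MATCH = 1 };

/* Hypothetical stand-in for a netgroup backend. */
typedef enum ng_result (*ng_lookup_fn)(const char *netgroup, const char *user);

/* Simplified compat-style rule: "+@netgroup" or the catch-all "+". */
struct compat_rule {
    const char *netgroup;  /* NULL means the catch-all "+" entry */
    const char *shell;     /* override shell, or NULL to keep the NIS shell */
};

/* Mirrors the passwd snippet from the report. */
const struct compat_rule sample_rules[] = {
    { "mq_acts", NULL },      /* +@mq_acts::::::     */
    { NULL, "/bin/false" },   /* +::0:0:::/bin/false */
};

/* Healthy backend: mqm is a member of mq_acts. */
enum ng_result lookup_ok(const char *netgroup, const char *user)
{
    return (strcmp(netgroup, "mq_acts") == 0 && strcmp(user, "mqm") == 0)
               ? NG_YES : NG_NO;
}

/* Simulated transient failure, e.g. a busy NIS server. */
enum ng_result lookup_busy(const char *netgroup, const char *user)
{
    (void)netgroup;
    (void)user;
    return NG_ERROR;
}

/* Walk the rules in order and pick the shell for `user`.
 * strict == 0: a lookup error is treated as "not a member" (lossy).
 * strict != 0: a lookup error fails the whole request. */
enum resolve_status resolve_shell(const struct compat_rule *rules, size_t n,
                                  ng_lookup_fn lookup, const char *user,
                                  const char *nis_shell, int strict,
                                  const char **shell_out)
{
    for (size_t i = 0; i < n; i++) {
        enum ng_result r = NG_YES;  /* the catch-all matches every user */
        if (rules[i].netgroup != NULL)
            r = lookup(rules[i].netgroup, user);
        if (r == NG_ERROR && strict)
            return RESOLVE_FAILED;
        if (r == NG_YES) {
            *shell_out = rules[i].shell ? rules[i].shell : nis_shell;
            return RESOLVE_OK;
        }
        /* NG_NO, or NG_ERROR in lossy mode: try the next rule. */
    }
    return RESOLVE_NO_MATCH;
}
```

With lookup_ok, resolving "mqm" against sample_rules yields the NIS shell /bin/csh. With lookup_busy and strict set to 0, the same call falls through to the catch-all and yields /bin/false, which is the kind of answer nscd would then hold for the positive time-to-live. With strict set to 1 the call returns RESOLVE_FAILED instead, so there is no wrong answer to cache, at the cost of failing the request.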