Description of problem:

The application uses gethostbyname() for host resolution, and nscd is used for caching. The hosts db is served over NIS, and the RHEL 7 machine acts as its client.

Semiconductor fab controlling, long-running binaries (called "EI") need to be forced to switch a certain underlying service through a NIS change. Starting or stopping nscd has no influence on the name resolution of the 1800 running EIs on 30 servers, or of a simple test binary. Once the nscd process they were started with has been restarted, the running binaries' gethostbyname() calls no longer cause NIS lookups.

In RHEL5/6 after restarting the nscd with volatile cache the running test binary's next gethostbyname() causes a NIS query to be started right away. In tests, we observe from the tshark monitor that a NIS query is sent immediately after the nscd restart. The entries are also updated after the host's positive TTL expires. The nscd versions used were nscd-2.5-123.el5_11.3.x86_64 and glibc-2.12-1.149.el6.x86_64.

However, in RHEL7, restarting nscd permanently decouples the running binary from the NIS nameservice and no subsequent gethostbyname makes an NIS query leave the machine. After an nscd restart, the application never queries NIS and continues to use the same host IP until the application is manually restarted, together with an nscd cache cleanup and service restart.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 7.2
nscd-2.17-106.el7_2.8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. A NIS server needs to be set up with the hosts db enabled.

/var/yp/Makefile
all: passwd group hosts rpc services netid protocols mail \

# /usr/lib64/yp/ypinit -m

The AAA can be resolved its /etc/hosts
Reference : https://access.redhat.com/solutions/7247

2. On the RHEL 7 client:

/etc/nsswitch.conf
hosts: files nis dns

/etc/nscd.conf
enable-cache hosts yes
positive-time-to-live hosts 120
negative-time-to-live hosts 20
suggested-size hosts 211
check-files hosts yes
persistent hosts no
shared hosts yes
max-db-size hosts 33554432

Install wireshark.
Open 3 terminals.

On 1,
# tshark -i eth0 -tad -R "ypserv.key==testhost"

On 2,
# nscd -i hosts
# systemctl restart nscd

On 3, run the test application 'nss' (attached)
# gcc -o nss nss.c
# ./nss testhost 10

3. Positive TTL is 120 secs.

Scenario 1: After running for a few seconds (e.g. 40 sec), do a `# systemctl restart nscd`. The cache entry is now static. No requests go to NIS from now on.

Scenario 2: Change the 'testhost' hosts entry at the NIS server; rebuild the NIS db. Do a `# systemctl restart nscd` at the RHEL 7 client. No requests go to NIS from now on.

With RHEL 6/5, a new request is sent to the NIS server each time.

Actual results:

RHEL 7

# nscd -i hosts
# systemctl restart nscd
# ./nss testhost 10 &
[1] 14954
# 2016:09:01 22:23:32: testhost - 10.76.1.138
2016:09:01 22:23:42: testhost - 10.76.1.138
2016:09:01 22:23:52: testhost - 10.76.1.138
2016:09:01 22:24:02: testhost - 10.76.1.138
# systemctl restart nscd
# 2016:09:01 22:24:12: testhost - 10.76.1.138
2016:09:01 22:24:22: testhost - 10.76.1.138
2016:09:01 22:24:32: testhost - 10.76.1.138
2016:09:01 22:24:42: testhost - 10.76.1.138
# systemctl restart nscd
# 2016:09:01 22:24:52: testhost - 10.76.1.138
2016:09:01 22:25:02: testhost - 10.76.1.138
2016:09:01 22:25:12: testhost - 10.76.1.138
2016:09:01 22:25:22: testhost - 10.76.1.138
2016:09:01 22:25:32: testhost - 10.76.1.138
2016:09:01 22:25:42: testhost - 10.76.1.138
2016:09:01 22:25:52: testhost - 10.76.1.138
2016:09:01 22:26:02: testhost - 10.76.1.138
2016:09:01 22:26:12: testhost - 10.76.1.138
2016:09:01 22:26:22: testhost - 10.76.1.138
2016:09:01 22:26:32: testhost - 10.76.1.138
2016:09:01 22:26:42: testhost - 10.76.1.138
2016:09:01 22:26:52: testhost - 10.76.1.138
2016:09:01 22:27:02: testhost - 10.76.1.138
2016:09:01 22:27:12: testhost - 10.76.1.138
2016:09:01 22:27:22: testhost - 10.76.1.138
2016:09:01 22:27:32: testhost - 10.76.1.138
2016:09:01 22:27:42: testhost - 10.76.1.138
2016:09:01 22:27:52: testhost - 10.76.1.138
^C
# fg
./nss testhost 10
^C
# date
Thu Sep 1 22:28:05 IST 2016

# tshark -i eth0 -tad -R "ypserv.key==testhost"
tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
160 2016-09-01 22:23:32.084217172 10.65.5.157 -> 10.65.9.243 YPSERV 122 V2 MATCH Call testdk/hosts.byname/testhost

Expected results:

RHEL 6

# nscd -i hosts
# service nscd restart
Stopping nscd: [ OK ]
Starting nscd: [ OK ]
# ./nss testhost 10 &
[1] 3565
# 2016:09:01 12:53:10: testhost - 10.76.1.138
2016:09:01 12:53:20: testhost - 10.76.1.138
2016:09:01 12:53:30: testhost - 10.76.1.138
2016:09:01 12:53:40: testhost - 10.76.1.138
service nscd restart
Stopping nscd: [ OK ]
Starting nscd: [ OK ]
# 2016:09:01 12:53:50: testhost - 10.76.1.138
2016:09:01 12:54:00: testhost - 10.76.1.138
2016:09:01 12:54:10: testhost - 10.76.1.138
2016:09:01 12:54:20: testhost - 10.76.1.138
# service nscd restart
Stopping nscd: [ OK ]
Starting nscd: [ OK ]
# 2016:09:01 12:54:30: testhost - 10.65.9.237
2016:09:01 12:54:40: testhost - 10.65.9.237
2016:09:01 12:54:50: testhost - 10.65.9.237
2016:09:01 12:55:00: testhost - 10.65.9.237
2016:09:01 12:55:10: testhost - 10.65.9.237
2016:09:01 12:55:20: testhost - 10.65.9.237
2016:09:01 12:55:30: testhost - 10.65.9.237
2016:09:01 12:55:40: testhost - 10.65.9.237
2016:09:01 12:55:50: testhost - 10.65.9.237
2016:09:01 12:56:00: testhost - 10.65.9.237
2016:09:01 12:56:10: testhost - 10.65.9.237
2016:09:01 12:56:20: testhost - 10.65.9.237
2016:09:01 12:56:30: testhost - 10.65.9.237
2016:09:01 12:56:40: testhost - 10.65.9.237
2016:09:01 12:56:50: testhost - 192.168.1.199
2016:09:01 12:57:00: testhost - 192.168.1.199
2016:09:01 12:57:10: testhost - 192.168.1.199
2016:09:01 12:57:20: testhost - 192.168.1.199
2016:09:01 12:57:30: testhost - 192.168.1.199
^C
# fg
./nss testhost 10
^C
# date
Thu Sep 1 12:57:39 EDT 2016
#
# tshark -i eth0 -tad -R "ypserv.key==testhost"
Running as user "root" and group "root". This could be dangerous.
Capturing on eth0
2016-09-01 12:53:10.418441730 10.65.9.245 -> 10.65.9.243 YPSERV 122 V2 MATCH Call testdk/hosts.byname/testhost
2016-09-01 12:53:50.423415477 10.65.9.245 -> 10.65.9.243 YPSERV 122 V2 MATCH Call testdk/hosts.byname/testhost
2016-09-01 12:54:30.428302823 10.65.9.245 -> 10.65.9.243 YPSERV 122 V2 MATCH Call testdk/hosts.byname/testhost
2016-09-01 12:56:45.071571957 10.65.9.245 -> 10.65.9.243 YPSERV 122 V2 MATCH Call testdk/hosts.byname/testhost
2016-09-01 12:59:00.073666316 10.65.9.245 -> 10.65.9.243 YPSERV 122 V2 MATCH Call testdk/hosts.byname/testhost

Additional info:
Created attachment 1196899 [details] reproducer program
This issue is not being considered for release with RHEL 7.4 but we will continue to look at this problem in more detail and keep you updated as we make progress. This can include a rhel-7.4.z release with this fix, or an immediate hotfix, depending on the user requirements, timeline and impact.
RCA done, looks like we need upstream kernel patch 735f2770a770156100f534646158cb58cb8b2939 (see https://patchwork.kernel.org/patch/9254247/)

Reassigning to kernel.
(In reply to DJ Delorie from comment #7)
> RCA done, looks like we need upstream kernel patch
> 735f2770a770156100f534646158cb58cb8b2939 (see
> https://patchwork.kernel.org/patch/9254247/)

Yes, but why doesn't nscd hit this problem on rhel-6, which has the same PF_SIGNALED check?

OK, probably rhel6's version doesn't rely on CLONE_CHILD_CLEARTID, but it would be nice to verify this to ensure we actually understand what's going on.
(In reply to Deepu K S from comment #0)
>
> In RHEL5/6 after restarting the nscd with volatile cache the running test
> binary's next gethostbyname() causes a NIS query to be started right away.

And afaik there are no incompatible cleartid changes between rhel6 and rhel7. Do you know how they came to the conclusion that something is wrong with set_tid_address() (compared to rhel6)? There is nothing in the description about that...

I agree that the simple patch might help, but only because the changelog mentions the (hopefully) same nscd/CLONE_CHILD_CLEARTID problem.

I can build the rhel7 kernel with the trivial backport, can you test it?
(In reply to Oleg Nesterov from comment #9)
> (In reply to Deepu K S from comment #0)
> >
> > In RHEL5/6 after restarting the nscd with volatile cache the running test
> > binary's next gethostbyname() causes a NIS query to be started right away.
>
> And afaik there are no incompatible cleartid changes between rhel6 and rhel7.
> Do you know how they came to the conclusion that something is wrong with
> set_tid_address() (compared to rhel6)? There is nothing in the description
> about that...

The customer didn't point out an issue with set_tid_address(). They found the difference in behaviour between RHEL 6 and 7.

Under RHEL5/6 after restarting the nscd with volatile cache the running test binary's next gethostbyname() causes a NIS query to be started right away.
Under RHEL7.1 restarting nscd permanently decouples the running binary from the NIS nameservice and no subsequent gethostbyname makes an NIS query leave the machine. Thus no chance of getting any update exists.

I think it was found from our RCA (comment #7).

> I agree that the simple patch might help, but only because the changelog
> mentions the (hopefully) same nscd/CLONE_CHILD_CLEARTID problem.
>
> I can build the rhel7 kernel with the trivial backport, can you test it?

Yes. I can test that.
I shall also share it with our customer, who would be able to test it in actual environments.

Thanks.
(In reply to Deepu K S from comment #10)
> (In reply to Oleg Nesterov from comment #9)
> > (In reply to Deepu K S from comment #0)
> > >
> > > In RHEL5/6 after restarting the nscd with volatile cache the running test
> > > binary's next gethostbyname() causes a NIS query to be started right away.
> >
> > And afaik there are no incompatible cleartid changes between rhel6 and rhel7.
> > Do you know how they came to the conclusion that something is wrong with
> > set_tid_address() (compared to rhel6)? There is nothing in the description
> > about that...
>
> The customer didn't point out an issue with set_tid_address(). They found the
> difference in behaviour between RHEL 6 and 7.

but the subject clearly blames set_tid_address?

> Under RHEL5/6 after restarting the nscd with volatile cache the running test
> binary's next gethostbyname() causes a NIS query to be started right away.
> Under RHEL7.1 restarting nscd permanently decouples the running binary from the
> NIS nameservice and no subsequent gethostbyname makes an NIS query leave the
> machine. Thus no chance of getting any update exists.

Sorry, I can't understand this ;)

OK, nevermind. If you can test the patch and the problem goes away this all doesn't really matter.

> > I can build the rhel7 kernel with the trivial backport, can you test it?
>
> Yes. I can test that.
> I shall also share it with our customer, who would be able to test it in actual
> environments.

Great, thanks,

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13812322
(In reply to Oleg Nesterov from comment #11)
> > Yes. I can test that.
> > I shall also share it with our customer, who would be able to test it in actual
> > environments.
>
> Great, thanks,
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13812322

Hi Oleg,

Sorry, I missed downloading the packages from the above build. It seems the download link is no longer valid. Could you please initiate a new brew build?

Thanks.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13997847
We checked with latest 7.4 kernel and nscd packages and the nscd cache works correctly now. Customer has also acknowledged the issue to be resolved for them. Thanks!
(In reply to Deepu K S from comment #14)
>
> We checked with latest 7.4 kernel

hmm. afaics the 7.4 kernel doesn't have that patch, so it was something else. Nevermind...

> and nscd packages and the nscd cache works
> correctly now.
> Customer has also acknowledged the issue to be resolved for them.

OK, can we close this bug?

Or did you mean they checked the test kernel with that patch I built for you?