Bug 491351

Summary: automount segfault after lookup failure
Product: Red Hat Enterprise Linux 5
Reporter: Sachin Prabhu <sprabhu>
Component: autofs
Assignee: Ian Kent <ikent>
Status: CLOSED ERRATA
QA Contact: BaseOS QE <qe-baseos-auto>
Severity: high
Docs Contact:
Priority: low
Version: 5.3
CC: cward, ikent, tao, ykopkova
Target Milestone: rc
Keywords: OtherQA
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a name lookup failed while creating a TCP or UDP client, automount would destroy the client, but would not set the rpc client to NULL. Therefore, subsequent lookup attempts would attempt to use the invalid rpc client, which would lead to a segmentation fault. Now, when a name lookup fails, autofs sets the rpc client to NULL, and therefore avoids the segmentation fault on subsequent lookup attempts.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-02 11:59:53 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Attachments:
Description: Patch to clear rpc client on lookup fail
Flags: none

Description Sachin Prabhu 2009-03-20 15:13:09 UTC
A user is experiencing automount segfaults on all of their systems on autofs-5.0.1-0.rc2.102 (and previous versions).  

  Feb  4 06:00:44 rlph047 automount[3491]: create_tcp_client:299: hostname lookup failed: No such file or directory
  Feb  4 06:00:44 rlph047 kernel: automount[28201]: segfault at 00002aab00b9401c rip 00002aaaab63ea00 rsp 0000000040820078 error 6

The segfault is always preceded by the hostname lookup failure, and it seems to happen at random.  We have had them check their DNS setup and they know of no problems that would cause lookup failures. They are using 8-way redundant NFS servers in all of their maps.

We have a core file along with the following partial backtrace.

  #0  0x00002aaaab63ea02 in ?? ()
  #1  0x00002aaaac4a0e91 in rpc_destroy_tcp_client (info=0x40a20fc0) at rpc_subs.c:384
  #2  0x00002aaaac49ff6e in get_nfs_info (logopt=0, host=0x555561636e00, pm_info=0x40a20f70, rpc_info=0x40a20fc0, proto=0x40a20ea0 "h\001", version=16,
   options=0x40a210c0 "ro,hard,intr,vers=3,noquota,tcp,timeo=600,retrans=2", random_selection=0) at replicated.c:582
  #3  0x00002aaaac4a0699 in prune_host_list (logopt=0, list=0x40a21168, vers=51, options=0x40a210c0 "ro,hard,intr,vers=3,noquota,tcp,timeo=600,retrans=2",
   random_selection=0) at replicated.c:640
...


As far as I can tell, the segfault occurs in this macro:

  #define clnt_control(cl,rq,in) ((*(cl)->cl_ops->cl_control)(cl,rq,in))
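
To illustrate why this macro is a likely crash site, here is a minimal sketch using mock types. The names mock_client, mock_ops, mock_clnt_control, checked_control and always_ok are hypothetical stand-ins, not the real Sun RPC or autofs definitions. The macro dereferences the handle twice before calling through a function pointer, so a stale handle can fault at either step. A NULL check in front of it only helps if the handle was actually cleared when the client was destroyed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mock types; NOT the real CLIENT / clnt_ops definitions. */
struct mock_client;

struct mock_ops {
    int (*cl_control)(struct mock_client *cl, int rq, void *in);
};

struct mock_client {
    const struct mock_ops *cl_ops;
};

/* Same shape as the macro quoted above: two dereferences, then an
 * indirect call. Nothing here checks that cl is still valid. */
#define mock_clnt_control(cl, rq, in) \
    ((*(cl)->cl_ops->cl_control)(cl, rq, in))

/* Guarded wrapper: returns 0 (failure) instead of dereferencing NULL.
 * It cannot detect a dangling, non-NULL pointer. */
static int checked_control(struct mock_client *cl, int rq, void *in)
{
    if (cl == NULL || cl->cl_ops == NULL || cl->cl_ops->cl_control == NULL)
        return 0;
    return mock_clnt_control(cl, rq, in);
}

/* Trivial control handler used only for illustration. */
static int always_ok(struct mock_client *cl, int rq, void *in)
{
    (void)cl;
    (void)rq;
    (void)in;
    return 1;
}

/* Small self-check showing both the valid and the NULL case. */
int control_demo(void)
{
    static const struct mock_ops ops = { always_ok };
    struct mock_client cl = { &ops };

    assert(checked_control(&cl, 0, NULL) == 1);
    assert(checked_control(NULL, 0, NULL) == 0);
    return 0;
}
```

The guard is only as good as the pointer hygiene around it: if a destroyed handle is left non-NULL, the check passes and the indirect call still runs on freed memory.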

Since the problem seems to occur in the replicated code path, I had them see if they could reproduce it without the redundant servers and so far they have not been able to.

I attempted to reproduce the issue by setting up a redundant NFS share on my cluster and specifying one invalid hostname in the map:

  images -rw,hard,intr,bg,vers=3noquota,nosuid,tcp,timeo=600,retrans=2 jrummy5-1-clust.ruemker.pvt,jrummy5-2-clust.ruemker.pvt,test1.ruemker.pvt:/mnt/lv1

where the first 2 are valid hostnames and the 3rd is not.  However, it looks like I hit a different lookup failure than the customer did:

  /var/log/messages.3:Feb 11 12:43:34 jrummy5-64 automount[15245]: host test1.ruemker.pvt: lookup failure 1

So it appears the customer's setup succeeds at the first lookup in modules/replicated.c:add_host_addrs but is failing in lib/rpc_subs.c:create_{udp,tcp}_client.  On the assumption that the lookup failure was a direct cause of the segfault, I had them try autofs-5.0.1-0.rc2.102 built with this upstream patch:

  http://www.kernel.org/pub/linux/daemons/autofs/v5/autofs-5.0.3-remove-redundant-dns-name-lookups.patch

because it removes redundant lookups when the addr is already known.  This also appears to fix / work around the problem as they haven't seen any failures on the system since installing it.  
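
The idea of skipping a lookup when the address is already known can be sketched as follows. This is not the upstream patch. The names host_entry, resolver_fn, fake_resolver and ensure_address are hypothetical, and the resolver is a stub so that the example does not depend on DNS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host record; not an autofs structure. */
struct host_entry {
    const char *name;
    unsigned char addr[16];
    size_t addr_len;            /* 0 means the address is not yet known */
};

/* Hypothetical resolver signature: returns 0 on success, -1 on failure. */
typedef int (*resolver_fn)(const char *name, unsigned char *addr,
                           size_t *addr_len);

static int lookup_calls = 0;    /* counts how often the stub is invoked */

/* Stub resolver that always "resolves" to 192.0.2.1 (a documentation
 * address) and records that it was called. */
static int fake_resolver(const char *name, unsigned char *addr,
                         size_t *addr_len)
{
    (void)name;
    lookup_calls++;
    memset(addr, 0, 16);
    addr[0] = 192; addr[1] = 0; addr[2] = 2; addr[3] = 1;
    *addr_len = 4;
    return 0;
}

/* Resolve the name only when no address has been stored yet. */
static int ensure_address(struct host_entry *host, resolver_fn resolve)
{
    if (host->addr_len != 0)
        return 0;               /* address already known: no lookup */
    return resolve(host->name, host->addr, &host->addr_len);
}

/* Small self-check: the second call must not trigger another lookup. */
int lookup_demo(void)
{
    struct host_entry h = { "nfs1.example.com", {0}, 0 };

    assert(ensure_address(&h, fake_resolver) == 0);
    assert(ensure_address(&h, fake_resolver) == 0);
    return lookup_calls;
}
```

Fewer lookups means fewer chances to hit the failing path, which would explain why this acts as a workaround without addressing the stale client pointer itself.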

Customer has currently worked around the problem by disabling replicated servers.

STEPS TO REPRODUCE: 
-Configure autofs map with replicated NFS servers
-Cause a lookup failure that only triggers the create_tcp_client lookup failure and not the failure in add_host_addrs

EXPECTED RESULTS:  Autofs recovers from failed lookup and resumes operation

Comment 4 Ian Kent 2009-03-20 16:14:15 UTC
Created attachment 336067 [details]
Patch to clear rpc client on lookup fail

I suspect this patch will help.
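
For readers without access to the attachment, the general pattern named in the patch title (clear the rpc client when the lookup fails) can be sketched as below. This is an illustration only and not the attached patch. The names fake_client, conn_info, fake_lookup, create_client and destroy_client are hypothetical, and the lookup is a stub that fails when the host name is NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real autofs or Sun RPC types. */
struct fake_client {
    int unused;
};

struct conn_info {
    struct fake_client *client; /* cached client handle, or NULL */
};

static int live_clients = 0;    /* number of clients not yet destroyed */

static struct fake_client *fake_client_create(void)
{
    struct fake_client *c = malloc(sizeof(*c));
    if (c)
        live_clients++;
    return c;
}

static void fake_client_destroy(struct fake_client *c)
{
    free(c);
    live_clients--;
}

/* Stub lookup: fails (-1) when name is NULL, succeeds (0) otherwise. */
static int fake_lookup(const char *name)
{
    return name ? 0 : -1;
}

/* On lookup failure, destroy any cached client AND clear the pointer,
 * so later code cannot reuse the freed handle. */
static int create_client(struct conn_info *info, const char *host)
{
    if (fake_lookup(host) != 0) {
        if (info->client) {
            fake_client_destroy(info->client);
            info->client = NULL;    /* the step this sketch is about */
        }
        return -1;
    }
    if (!info->client)
        info->client = fake_client_create();
    return info->client ? 0 : -1;
}

/* Later cleanup is safe whether or not an earlier create failed. */
static void destroy_client(struct conn_info *info)
{
    if (!info->client)
        return;
    fake_client_destroy(info->client);
    info->client = NULL;
}

/* Small self-check: create, fail a lookup, then clean up again. */
int clear_on_fail_demo(void)
{
    struct conn_info info = { NULL };

    assert(create_client(&info, "server") == 0);
    assert(create_client(&info, NULL) == -1);
    assert(info.client == NULL);
    destroy_client(&info);      /* no double free: pointer was cleared */
    return live_clients;
}
```

Without the assignment to NULL in the failure branch, the later destroy_client call would operate on freed memory, which matches the shape of the backtrace in the description.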

Comment 5 Ian Kent 2009-03-20 16:17:56 UTC
I have added this patch to rev 102 and made a scratch build.
It can be found at
/mnt/redhat/brewroot/scratch/ikent/task_1733533
and is revision 0.rc2.102.bz491351.1.

Can we have this tested to see if I'm correct, please?

Comment 9 Ian Kent 2009-05-21 04:15:48 UTC
This issue has been fixed in the latest autofs package
autofs-5.0.1-0.rc2.125.

I was not able to reproduce this issue; however, the issue has
been verified by the customer.

Comment 11 Chris Ward 2009-07-03 18:27:39 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or directed to your customer or partner representative.

Comment 13 Ruediger Landmann 2009-08-31 11:39:15 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, if a name lookup failed while creating a TCP or UDP client, automount would destroy the client, but would not set the rpc client to NULL. Therefore, subsequent lookup attempts would attempt to use the invalid rpc client, which would lead to a segmentation fault. Now, when a name lookup fails, autofs sets the rpc client to NULL, and therefore avoids the segmentation fault on subsequent lookup attempts.

This leads to a subsequent SEGV when attempting to
use the invalid client.

Comment 14 Ruediger Landmann 2009-09-01 00:07:02 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-Previously, if a name lookup failed while creating a TCP or UDP client, automount would destroy the client, but would not set the rpc client to NULL. Therefore, subsequent lookup attempts would attempt to use the invalid rpc client, which would lead to a segmentation fault. Now, when a name lookup fails, autofs sets the rpc client to NULL, and therefore avoids the segmentation fault on subsequent lookup attempts.
+Previously, if a name lookup failed while creating a TCP or UDP client, automount would destroy the client, but would not set the rpc client to NULL. Therefore, subsequent lookup attempts would attempt to use the invalid rpc client, which would lead to a segmentation fault. Now, when a name lookup fails, autofs sets the rpc client to NULL, and therefore avoids the segmentation fault on subsequent lookup attempts.-
-This leads to a subsequent SEGV when attempting to
-use the invalid client.

Comment 15 errata-xmlrpc 2009-09-02 11:59:53 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1397.html