Bug 2139504 - segfault due to lookup_mod->context address being freed and reused while multiple threads were using it
Summary: segfault due to lookup_mod->context address being freed and reused while multiple threads were using it
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: autofs
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Ian Kent
QA Contact: Kun Wang
URL:
Whiteboard:
Depends On:
Blocks: 2144686 2147491
 
Reported: 2022-11-02 18:09 UTC by Frank Sorenson
Modified: 2023-05-16 11:06 UTC
CC List: 3 users

Fixed In Version: autofs-5.1.4-88.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2144686 2147491
Environment:
Last Closed: 2023-05-16 09:05:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
reproducer (377.02 KB, application/x-xz), 2022-11-16 23:52 UTC, Frank Sorenson
autofs amd-style map file for use with reproducer (4.92 MB, text/plain), 2022-12-05 02:31 UTC, Frank Sorenson
reproducer (24.63 KB, text/x-csrc), 2022-12-05 02:38 UTC, Frank Sorenson


Links
Red Hat Issue Tracker RHELPLAN-138137, last updated 2022-11-02 18:24:24 UTC
Red Hat Product Errata RHBA-2023:2970, last updated 2023-05-16 09:05:59 UTC

Description Frank Sorenson 2022-11-02 18:09:54 UTC
Description of problem:

The address of lookup_mod->context changed, with the memory freed and the address reused, while several threads were still holding references to it, resulting in a segfault when one of those threads tried to dereference the old context.

Another thread was using the new lookup_mod->context address, and the original address had been reused for another purpose.
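
To make the race concrete, here is a minimal standalone sketch of the pattern (illustration only, not the autofs source): one thread keeps a raw pointer to a shared context while another thread frees and replaces it, so the first thread can end up dereferencing freed, possibly reused, memory.

/* Illustration of the race described above; NOT the autofs code.
 * Build with: gcc -pthread race.c
 * Running under AddressSanitizer will flag the use-after-free. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct context { char map_name[64]; };

static struct context *shared_ctx;           /* no lock, no refcount: the bug */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        struct context *ctx = shared_ctx;    /* grab a raw pointer */
        /* If the reloader frees ctx right here, the next line is a
         * use-after-free and may read whatever reused the address. */
        printf("using map %s\n", ctx->map_name);
    }
    return NULL;
}

static void *reloader(void *arg)
{
    (void)arg;
    for (;;) {
        struct context *new_ctx = calloc(1, sizeof(*new_ctx));
        strcpy(new_ctx->map_name, "reloaded");
        struct context *old = shared_ctx;
        shared_ctx = new_ctx;                /* swap in the new context */
        free(old);                           /* old address may be reused */
    }
    return NULL;
}

int main(void)
{
    shared_ctx = calloc(1, sizeof(*shared_ctx));
    strcpy(shared_ctx->map_name, "initial");

    pthread_t w, r;
    pthread_create(&w, NULL, worker, NULL);
    pthread_create(&r, NULL, reloader, NULL);
    pthread_join(w, NULL);
    return 0;
}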


Version-Release number of selected component (if applicable):

	RHEL 8.6
	autofs-5.1.4-82.el8.x86_64 


How reproducible:

	unknown; customer has reported two segfaults


Steps to Reproduce:

	unknown


Actual results:

	segfault in lookup_mount()


Expected results:

	no segfault


Additional info:

	coredump analysis to follow

Comment 4 Ian Kent 2022-11-03 03:08:16 UTC
I wonder what the customer was doing that led to this.
A debug log from startup lasting a day or so (should be enough)
would be useful.

The code pattern I think could result in this has been in use for
quite a long time. So I really should check it's what I think it
is and work out what is being done to cause it before adding
reference counting to the context.

Ian
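
For reference, reference counting a shared context generally looks something like the sketch below. The names are illustrative and this is not the eventual autofs patch, just the general shape of the fix being considered:

/* Sketch only: reference counting so a map reload cannot free the
 * context while lookup threads still hold it. Names are illustrative;
 * this is NOT the actual autofs patch. */
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;

struct context {
    int refcount;                    /* guarded by ctx_lock */
    /* ... module state ... */
};

static struct context *active_ctx;   /* guarded by ctx_lock */

/* Lookup threads: atomically fetch the current context and take a
 * reference so it cannot be freed underneath them. */
static struct context *context_get(void)
{
    struct context *ctx;

    pthread_mutex_lock(&ctx_lock);
    ctx = active_ctx;
    if (ctx)
        ctx->refcount++;
    pthread_mutex_unlock(&ctx_lock);
    return ctx;
}

/* Drop a reference; the memory is freed only after the last user. */
static void context_put(struct context *ctx)
{
    int drop;

    pthread_mutex_lock(&ctx_lock);
    drop = (--ctx->refcount == 0);
    pthread_mutex_unlock(&ctx_lock);
    if (drop)
        free(ctx);
}

/* Map reload: publish the new context, then drop the "active"
 * reference on the old one; in-flight users keep it alive. */
static void context_replace(struct context *new_ctx)
{
    struct context *old;

    pthread_mutex_lock(&ctx_lock);
    old = active_ctx;
    new_ctx->refcount = 1;           /* the "active" reference */
    active_ctx = new_ctx;
    pthread_mutex_unlock(&ctx_lock);
    if (old)
        context_put(old);
}

With this shape, a SIGHUP reload swaps in the new context immediately, but the old one is only freed once the last in-flight lookup calls context_put().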

Comment 10 Frank Sorenson 2022-11-09 00:51:48 UTC
I've replicated this twice now.  Still working to make it more reliable.

I currently have a couple very large amd map files (120K and 160K entries), most entries looking like this:

mount1/subdir1                 type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user            type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user/u          type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user/u/us       type:=auto;fs:=${map};pref:=${key}/;opts:=nobrowsable
mount1/subdir1/user/u/us/user1 addopts:=sec=sys;rhost:=nfsserver1;rfs:=/server_dir1/server_dir2;sublink:=user1
mount1/subdir1/data            type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/data/...

I then kicked off 10 processes which stat() the leaf entries randomly
then 'killall -HUP automount' frequently

periodically I get the segfault


It's not an exact science yet, but I'm working on a couple ideas.
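
As a rough illustration of the stressor half of this (the reproducer attached later in this bug is the authoritative version, and the mount point and path pattern below are hypothetical), each of those processes can be as simple as:

/* Loop forever stat()ing randomly chosen leaf paths under the
 * automount point, forcing lookups. Illustration only; the mount
 * point and path layout here are made up. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct stat st;
    char path[256];

    srand((unsigned)time(NULL) ^ (unsigned)getpid());
    for (;;) {
        snprintf(path, sizeof(path),
                 "/mnt/amd/mount1/subdir%d/user/u/us/user%d",
                 rand() % 100 + 1, rand() % 1000 + 1);
        stat(path, &st);    /* each hit/miss drives an automount lookup */
    }
    return 0;
}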

Comment 11 Ian Kent 2022-11-09 06:38:01 UTC
(In reply to Frank Sorenson from comment #10)
> I've replicated this twice now.  Still working to make it more reliable.
> 
> I currently have a couple very large amd map files (120K and 160K entries),
> most entries looking like this:
> 
> mount1/subdir1                
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user           
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user/u         
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user/u/us      
> type:=auto;fs:=${map};pref:=${key}/;opts:=nobrowsable
> mount1/subdir1/user/u/us/user1
> addopts:=sec=sys;rhost:=nfsserver1;rfs:=/server_dir1/server_dir2;sublink:
> =user1
> mount1/subdir1/data           
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/data/...

I don't think the browse option makes a difference; the maps are
small, so creating a directory or two won't make much difference.

Still I might be wrong and all we need is for it to reproduce.

It's walking down that path matching the key, with the options
type:=auto;fs:=${map}, that causes the map reuse.

> 
> I then kicked off 10 processes which stat() the leaf entries randomly
> then 'killall -HUP automount' frequently
> 
> periodically I get the segfault
> 
> 
> It's not an exact science yet, but I'm working on a couple ideas.

This is hard to reproduce, so I'm not surprised you're finding it difficult.

We'll need this soon.
I have a patch (2 actually) that might fix it.

I'll post once I have a build.

Ian

Comment 12 Ian Kent 2022-11-09 07:10:10 UTC
Can we try this build please:
http://brew-task-repos.usersys.redhat.com/repos/scratch/ikent/autofs/5.1.4/85.el8/

I used a build target of rhel-8.6; if that's a problem, let me
know and I'll make a build for the target you need.

Comment 13 Frank Sorenson 2022-11-16 23:52:33 UTC
Created attachment 1924846 [details]
reproducer

this reproduces the bug fairly reliably (although it's still a little temperamental; it'll crash very quickly 20 times in a row, then may or may not crash at all for another 10 full runs.  Not sure the determining factor yet)

	
/etc/exports:
	/repro *(rw,no_root_squash,sec=sys,fsid=0)

# exportfs -a


in /etc/autofs.conf:
[ amd ]
autofs_use_lofs = no


in /etc/auto.master:
	/rhbz2139504	file,amd:/etc/repro-toplevel.map	dismount_interval=60,timeout=60,negative_timeout=1

put these mapfiles in /etc
	repro.map-gold
	repro-toplevel.map

run the reproducer script:
	# rhbz2139504-repro


the script will:
  * stop automount
  * unmount anything left over from previous runs
  * copy the 'gold' mapfile to /etc/repro.map
  * start 5 child processes which will repeatedly 'stat' a set of leaf paths randomly
  * the main process will then randomly sort the 'gold' file /etc/repro.map-gold to a temporary file, then rename the temporary file to /etc/repro.map
  * send SIGHUP to automount
  * sleep 15 seconds, checking to see whether automount has died or not
  * either exit (if automount stopped) or loop back to 'sort the 'gold' file'
  * if, after sending SIGHUP twice, automount is still running, loop all the way back to the beginning ('stop automount'); the bug will almost always hit on the first or second SIGHUP, and rarely after (just an observation...not sure why)
  * if, after performing the entire loop 20 times, the bug has not reproduced, exit the script


(I'm not saying it's perfect...)
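
For anyone reading without the attachment, the main loop amounts to something like the heavily simplified, untested C sketch below. The attached script is the real version; the pidof/systemctl/sort -R steps here are assumptions about the test host, and the stat() children are elided:

/* Simplified sketch of the reproducer driver loop described above.
 * NOT the attached script; service names and tools are assumptions. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t automount_pid(void)
{
    FILE *p = popen("pidof automount", "r");
    pid_t pid = -1;

    if (p) {
        if (fscanf(p, "%d", &pid) != 1)
            pid = -1;
        pclose(p);
    }
    return pid;
}

int main(void)
{
    system("systemctl restart autofs");      /* fresh automount */
    /* ... fork the stat() stressor children here ... */

    for (int i = 0; i < 20; i++) {
        pid_t pid = automount_pid();
        if (pid <= 0)
            return 1;                        /* automount died: bug hit */

        /* shuffle the gold map into place, then reload via SIGHUP */
        system("sort -R /etc/repro.map-gold > /etc/repro.map.tmp && "
               "mv /etc/repro.map.tmp /etc/repro.map");
        kill(pid, SIGHUP);

        for (int s = 0; s < 15; s++) {       /* watch for a crash */
            sleep(1);
            if (kill(pid, 0) != 0) {
                printf("automount died after SIGHUP round %d\n", i + 1);
                return 1;
            }
        }
    }
    return 0;                                /* did not reproduce */
}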

Comment 14 Ian Kent 2022-11-17 00:06:10 UTC
(In reply to Frank Sorenson from comment #13)
> Created attachment 1924846 [details]
> reproducer
> 
> this reproduces the bug fairly reliably (although it's still a little
> temperamental; it'll crash very quickly 20 times in a row, then may or may
> not crash at all for another 10 full runs.  Not sure the determining factor
> yet)

So the build in comment#12 does still crash?

Ian

Comment 15 Ian Kent 2022-11-17 00:40:36 UTC
(In reply to Frank Sorenson from comment #13)
> Created attachment 1924846 [details]
> reproducer
> 
> this reproduces the bug fairly reliably (although it's still a little
> temperamental; it'll crash very quickly 20 times in a row, then may or may
> not crash at all for another 10 full runs.  Not sure the determining factor
> yet)

Are you saying the reproducer will always eventually see the crash if
autofs is broken in this way?

Comment 16 Frank Sorenson 2022-11-17 03:07:34 UTC
(In reply to Ian Kent from comment #14)
> So the build in comment#12 does still crash?

No.  At least I'm not seeing a crash with the patched autofs.  Still testing.


(In reply to Ian Kent from comment #15)

> Are you saying the reproducer will always eventually see the crash if
> autofs is broken in this way?

So far the reproducer has always eventually crashed the *unpatched* autofs (5.1.4-84.el8).  Just not always very quickly.

Comment 17 Ian Kent 2022-11-18 00:47:27 UTC
(In reply to Frank Sorenson from comment #16)
> (In reply to Ian Kent from comment #14)
> > So the build in comment#12 does still crash?
> 
> No.  At least I'm not seeing a crash with the patched autofs.  Still testing.
> 
> 
> (In reply to Ian Kent from comment #15)
> 
> > Are you saying the reproducer will always eventually see the crash if
> > autofs is broken in this way?
> 
> So far the reproducer has always eventually crashed the *unpatched* autofs
> (5.1.4-84.el8).  Just not always very quickly.

So it sounds like I should go ahead with a merge request for this change.

The logging and mount table handling will need to be different bugs.

Comment 22 Ian Kent 2022-11-27 01:00:07 UTC
Found a couple of related problems; back to ASSIGNED while I update the package.

Comment 26 Frank Sorenson 2022-12-05 02:31:11 UTC
Created attachment 1929976 [details]
autofs amd-style map file for use with reproducer


place rhbz2139504.map in /etc

edit /etc/auto.master:

/rhbz2139504	file,amd:/etc/rhbz2139504.map	dismount_interval=600,timeout=600

Comment 27 Frank Sorenson 2022-12-05 02:38:39 UTC
Created attachment 1929977 [details]
reproducer

Comment 45 errata-xmlrpc 2023-05-16 09:05:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (autofs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2970

