Bug 2139504

Summary:

segfault due to lookup_mod->context address being freed and reused while multiple threads were using it

Product:

Red Hat Enterprise Linux 8

Reporter:

Frank Sorenson <fsorenso>

Component:

autofs

Assignee:

Ian Kent <ikent>

Status:

CLOSED ERRATA

QA Contact:

Kun Wang <kunwan>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

8.6

CC:

dwysocha, fhirtz, xzhou

Target Milestone:

Keywords:

CustomerScenariosInitiative, Triaged

Target Release:

---

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

autofs-5.1.4-88.el8

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

2144686 2147491

Environment:

Last Closed:

2023-05-16 09:05:44 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

2144686, 2147491

Attachments:

Description	Flags
reproducer	none
autofs amd-style map file for use with reproducer	none
reproducer	none

Description Frank Sorenson 2022-11-02 18:09:54 UTC

Description of problem:

The lookup_mod->context pointer was changed to a new address, and the old memory was freed and its address reused, while several threads still held references to the old context. One of those threads segfaulted when it tried to dereference the old context.

Another thread was already using the new lookup_mod->context address and had reused the original address for another purpose.


Version-Release number of selected component (if applicable):

	RHEL 8.6
	autofs-5.1.4-82.el8.x86_64 


How reproducible:

	unknown; customer has reported two segfaults


Steps to Reproduce:

	unknown


Actual results:

	segfault in lookup_mount()


Expected results:

	no segfault


Additional info:

	coredump analysis to follow

Comment 4 Ian Kent 2022-11-03 03:08:16 UTC

I wonder what the customer was doing that led to this.
A debug log from startup lasting a day or so (should be enough)
would be useful.

The code pattern I think could result in this has been in use for
quite a long time. So I really should check it's what I think it
is and work out what is being done to cause it before adding
reference counting to the context.

Ian
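
For reference, the kind of reference counting mentioned above could look roughly like the sketch below. This is not the autofs patch and not autofs code: struct ctx, struct mod, ctx_new(), ctx_put(), mod_get_ctx() and mod_swap_ctx() are hypothetical names standing in for the lookup module and its context. The idea is only that a thread takes a reference before using the context, and a map re-read swaps the pointer and drops the owner's reference, so the memory is freed by whichever holder lets go last.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lookup module's private context. */
struct ctx {
    pthread_mutex_t lock;   /* protects refs */
    unsigned int refs;      /* holders, including the owning module */
    int map_generation;     /* illustrative payload only */
};

/* Hypothetical stand-in for the module that owns the current context. */
struct mod {
    pthread_mutex_t lock;   /* protects the context pointer */
    struct ctx *context;
};

/* Allocate a context holding one reference (the owner's). */
struct ctx *ctx_new(int generation)
{
    struct ctx *c = calloc(1, sizeof(*c));
    if (!c)
        return NULL;
    pthread_mutex_init(&c->lock, NULL);
    c->refs = 1;
    c->map_generation = generation;
    return c;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
int ctx_put(struct ctx *c)
{
    unsigned int left;

    pthread_mutex_lock(&c->lock);
    assert(c->refs > 0);
    left = --c->refs;
    pthread_mutex_unlock(&c->lock);
    if (left)
        return 0;
    pthread_mutex_destroy(&c->lock);
    free(c);
    return 1;
}

/* Take a reference to the module's current context. Holding the
 * module lock means a concurrent swap cannot free it in between. */
struct ctx *mod_get_ctx(struct mod *m)
{
    struct ctx *c;

    pthread_mutex_lock(&m->lock);
    c = m->context;
    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    c->refs++;
    pthread_mutex_unlock(&c->lock);
    pthread_mutex_unlock(&m->lock);
    return c;
}

/* Install a fresh context (e.g. on a map re-read) and drop the
 * owner's reference to the old one. Returns 1 if the old context
 * was freed immediately, 0 if other threads still hold it. */
int mod_swap_ctx(struct mod *m, struct ctx *fresh)
{
    struct ctx *old;

    pthread_mutex_lock(&m->lock);
    old = m->context;
    m->context = fresh;
    pthread_mutex_unlock(&m->lock);
    return ctx_put(old);
}
```

With something like this, a lookup thread would call mod_get_ctx(), use the context, then call ctx_put(); a re-read would call mod_swap_ctx() with a freshly built context instead of freeing the old one directly.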

Comment 10 Frank Sorenson 2022-11-09 00:51:48 UTC

I've replicated this twice now.  Still working to make it more reliable.

I currently have a couple very large amd map files (120K and 160K entries), most entries looking like this:

mount1/subdir1                 type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user            type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user/u          type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user/u/us       type:=auto;fs:=${map};pref:=${key}/;opts:=nobrowsable
mount1/subdir1/user/u/us/user1 addopts:=sec=sys;rhost:=nfsserver1;rfs:=/server_dir1/server_dir2;sublink:=user1
mount1/subdir1/data            type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/data/...

I then kicked off 10 processes which stat() the leaf entries randomly
then 'killall -HUP automount' frequently

periodically I get the segfault


It's not an exact science yet, but I'm working on a couple ideas.
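
A rough C sketch of the stat() load described above, for anyone who wants to generate similar pressure. It is not the reproducer I am using; stat_random(), run_workers() and next_rand() are made-up names, and the path list is assumed to be whatever leaf entries exist under the automount point. Sending SIGHUP to automount is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* xorshift32 generator, so each worker has its own random state. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state ? *state : 2463534242u;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* stat() a randomly chosen path `iterations` times.
 * Returns the number of stat() calls that succeeded. */
long stat_random(const char *const *paths, size_t npaths,
                 long iterations, uint32_t seed)
{
    struct stat st;
    long ok = 0;

    assert(paths != NULL || npaths == 0);
    if (npaths == 0)
        return 0;
    for (long i = 0; i < iterations; i++) {
        const char *p = paths[next_rand(&seed) % npaths];

        /* A failure is fine here; the point is to trigger lookups. */
        if (stat(p, &st) == 0)
            ok++;
    }
    return ok;
}

/* Fork `nworkers` children that each run stat_random(), then wait
 * for them. Returns how many children exited cleanly. */
int run_workers(const char *const *paths, size_t npaths,
                int nworkers, long iterations)
{
    int started = 0, clean = 0, status;

    for (int w = 0; w < nworkers; w++) {
        pid_t pid = fork();

        if (pid < 0)
            break;              /* could not fork; wait for the rest */
        if (pid == 0) {
            stat_random(paths, npaths, iterations, (uint32_t)getpid());
            _exit(0);
        }
        started++;
    }
    for (int w = 0; w < started; w++)
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            clean++;
    return clean;
}
```

Something like run_workers(paths, npaths, 10, some_large_count) would correspond to the 10 processes mentioned above.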

Comment 11 Ian Kent 2022-11-09 06:38:01 UTC

(In reply to Frank Sorenson from comment #10)
> I've replicated this twice now.  Still working to make it more reliable.
> 
> I currently have a couple very large amd map files (120K and 160K entries),
> most entries looking like this:
> 
> mount1/subdir1                
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user           
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user/u         
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user/u/us      
> type:=auto;fs:=${map};pref:=${key}/;opts:=nobrowsable
> mount1/subdir1/user/u/us/user1
> addopts:=sec=sys;rhost:=nfsserver1;rfs:=/server_dir1/server_dir2;sublink:
> =user1
> mount1/subdir1/data           
> type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/data/...

I don't think the browse option makes a difference; the maps are
small, so creating a directory or two won't make much difference.

Still, I might be wrong, and all we need is for it to reproduce.

It's walking down that path matching the key, so it's the options
type:=auto;fs:=${map} that cause the map reuse.

> 
> I then kicked off 10 processes which stat() the leaf entries randomly
> then 'killall -HUP automount' frequently
> 
> periodically I get the segfault
> 
> 
> It's not an exact science yet, but I'm working on a couple ideas.

This is hard to reproduce; I'm not surprised you're finding it difficult.

We'll need this soon.
I have a patch (2 actually) that might fix it.

I'll post once I have a build.

Ian

Comment 12 Ian Kent 2022-11-09 07:10:10 UTC

Can we try this build please:
http://brew-task-repos.usersys.redhat.com/repos/scratch/ikent/autofs/5.1.4/85.el8/

I used a build target of rhel-8.6, if that's a problem let me
know and I'll make a build for the target you need.

Comment 13 Frank Sorenson 2022-11-16 23:52:33 UTC

Created attachment 1924846 [details]
reproducer

this reproduces the bug fairly reliably (although it's still a little temperamental; it'll crash very quickly 20 times in a row, then may or may not crash at all for another 10 full runs.  Not sure the determining factor yet)

	
/etc/exports:
	/repro *(rw,no_root_squash,sec=sys,fsid=0)

# exportfs -a


in /etc/autofs.conf:
[ amd ]
autofs_use_lofs = no


in /etc/auto.master:
	/rhbz2139504	file,amd:/etc/repro-toplevel.map	dismount_interval=60,timeout=60,negative_timeout=1

put these mapfiles in /etc
	repro.map-gold
	repro-toplevel.map

run the reproducer script:
	# rhbz2139504-repro


the script will:
  * stop automount
  * unmount anything left over from previous runs
  * copy the 'gold' mapfile to /etc/repro.map
  * start 5 child processes which will repeatedly 'stat' a set of leaf paths randomly
  * the main process will then randomly sort the 'gold' file /etc/repro.map-gold to a temporary file, then rename the temporary file to /etc/repro.map
  * send SIGHUP to automount
  * sleep 15 seconds, checking to see whether automount has died or not
  * either exit (if automount stopped) or loop back to 'sort the 'gold' file'
  * if, after sending SIGHUP twice, automount is still running, loop all the way back to the beginning ('stop automount'); the bug will almost always hit on the first or second SIGHUP, and rarely after (just an observation...not sure why)
  * if, after performing the entire loop 20 times, the bug has not reproduced, exit the script


(I'm not saying it's perfect...)
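
The 'randomly sort to a temporary file, then rename' step can be sketched in C as below. This is an illustration rather than the attached script; shuffle_and_replace() and next_rand() are hypothetical helpers. Writing to a temporary file and calling rename() means automount only ever sees the old map or the complete new one, assuming both files are on the same filesystem.

```c
#define _POSIX_C_SOURCE 200809L  /* for getline() and strdup() */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* xorshift32 generator; avoids depending on global rand() state. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state ? *state : 2463534242u;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Read all lines of `src`, shuffle them, write them to `tmp`, then
 * rename() `tmp` over `dst`. Returns the number of lines written,
 * or -1 on error (in which case `dst` is left untouched). */
long shuffle_and_replace(const char *src, const char *tmp,
                         const char *dst, uint32_t seed)
{
    char **lines = NULL, *line = NULL;
    size_t n = 0, cap = 0, len = 0, i;
    ssize_t got;
    long result = -1;
    FILE *in, *out = NULL;

    assert(src && tmp && dst);
    in = fopen(src, "r");
    if (!in)
        return -1;

    /* Read every line, stripping the trailing newline. */
    while ((got = getline(&line, &len, in)) != -1) {
        if (got > 0 && line[got - 1] == '\n')
            line[got - 1] = '\0';
        if (n == cap) {
            size_t ncap = cap ? cap * 2 : 64;
            char **grown = realloc(lines, ncap * sizeof(*lines));

            if (!grown)
                goto done;
            lines = grown;
            cap = ncap;
        }
        lines[n] = strdup(line);
        if (!lines[n])
            goto done;
        n++;
    }

    /* Fisher-Yates shuffle of the line pointers. */
    for (i = n; i > 1; i--) {
        size_t j = next_rand(&seed) % i;
        char *t = lines[i - 1];

        lines[i - 1] = lines[j];
        lines[j] = t;
    }

    out = fopen(tmp, "w");
    if (!out)
        goto done;
    for (i = 0; i < n; i++)
        if (fprintf(out, "%s\n", lines[i]) < 0)
            goto done;
    if (fclose(out) != 0) {
        out = NULL;
        goto done;
    }
    out = NULL;

    /* Swap the finished file into place in one step. */
    if (rename(tmp, dst) == 0)
        result = (long)n;
done:
    if (out)
        fclose(out);
    fclose(in);
    free(line);
    for (i = 0; i < n; i++)
        free(lines[i]);
    free(lines);
    return result;
}
```

In terms of the steps above, src would be /etc/repro.map-gold and dst would be /etc/repro.map, with tmp somewhere in /etc so the rename stays on one filesystem.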

Comment 14 Ian Kent 2022-11-17 00:06:10 UTC

(In reply to Frank Sorenson from comment #13)
> Created attachment 1924846 [details]
> reproducer
> 
> this reproduces the bug fairly reliably (although it's still a little
> temperamental; it'll crash very quickly 20 times in a row, then may or may
> not crash at all for another 10 full runs.  Not sure the determining factor
> yet)

So the build in comment#12 does still crash?

Ian

Comment 15 Ian Kent 2022-11-17 00:40:36 UTC

(In reply to Frank Sorenson from comment #13)
> Created attachment 1924846 [details]
> reproducer
> 
> this reproduces the bug fairly reliably (although it's still a little
> temperamental; it'll crash very quickly 20 times in a row, then may or may
> not crash at all for another 10 full runs.  Not sure the determining factor
> yet)

Are you saying the reproducer will always eventually see the crash if
autofs is broken in this way?

Comment 16 Frank Sorenson 2022-11-17 03:07:34 UTC

(In reply to Ian Kent from comment #14)
> So the build in comment#12 does still crash?

No.  At least I'm not seeing a crash with the patched autofs.  Still testing.


(In reply to Ian Kent from comment #15)

> Are you saying the reproducer will always eventually see the crash if
> autofs is broken in this way?

So far the reproducer has always eventually crashed the *unpatched* autofs (5.1.4-84.el8).  Just not always very quickly.

Comment 17 Ian Kent 2022-11-18 00:47:27 UTC

(In reply to Frank Sorenson from comment #16)
> (In reply to Ian Kent from comment #14)
> > So the build in comment#12 does still crash?
> 
> No.  At least I'm not seeing a crash with the patched autofs.  Still testing.
> 
> 
> (In reply to Ian Kent from comment #15)
> 
> > Are you saying the reproducer will always eventually see the crash if
> > autofs is broken in this way?
> 
> So far the reproducer has always eventually crashed the *unpatched* autofs
> (5.1.4-84.el8).  Just not always very quickly.

So it sounds like I should go ahead with a merge request for this change.

The logging and mount table handling will need to be different bugs.

Comment 22 Ian Kent 2022-11-27 01:00:07 UTC

Found a couple of related problems, back to assigned while I update the package.

Comment 26 Frank Sorenson 2022-12-05 02:31:11 UTC

Created attachment 1929976 [details]
autofs amd-style map file for use with reproducer

autofs amd-style map file for use with reproducer

place rhbz2139504.map in /etc

edit /etc/auto.master:

/rhbz2139504	file,amd:/etc/rhbz2139504.map	dismount_interval=600,timeout=600

Comment 27 Frank Sorenson 2022-12-05 02:38:39 UTC

Created attachment 1929977 [details]
reproducer

Comment 45 errata-xmlrpc 2023-05-16 09:05:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (autofs bug fix and enhancement update),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2970