Bug 2139504
| Summary: | segfault due to lookup_mod->context address being freed and reused while multiple threads were using it | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Frank Sorenson <fsorenso> |
| Component: | autofs | Assignee: | Ian Kent <ikent> |
| Status: | CLOSED ERRATA | QA Contact: | Kun Wang <kunwan> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.6 | CC: | dwysocha, fhirtz, xzhou |
| Target Milestone: | rc | Keywords: | CustomerScenariosInitiative, Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | autofs-5.1.4-88.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2144686 2147491 | Environment: | |
| Last Closed: | 2023-05-16 09:05:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2144686, 2147491 | | |
Description
Frank Sorenson
2022-11-02 18:09:54 UTC
I wonder what the customer was doing that led to this. A debug log from startup lasting a day or so (should be enough) would be useful.
The code pattern I think could result in this has been in use for quite a long time. So I really should check it's what I think it is and work out what is being done to cause it before adding reference counting to the context.
Ian
I've replicated this twice now. Still working to make it more reliable.
I currently have a couple very large amd map files (120K and 160K entries), most entries looking like this:
mount1/subdir1 type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user/u type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/user/u/us type:=auto;fs:=${map};pref:=${key}/;opts:=nobrowsable
mount1/subdir1/user/u/us/user1 addopts:=sec=sys;rhost:=nfsserver1;rfs:=/server_dir1/server_dir2;sublink:=user1
mount1/subdir1/data type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
mount1/subdir1/data/...
I then kicked off 10 processes which stat() the leaf entries randomly
then 'killall -HUP automount' frequently
periodically I get the segfault
It's not an exact science yet, but I'm working on a couple ideas.
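For readers following along: the crash in the summary is a use-after-free, where the lookup module's context is freed (the reproduction steps here suggest during a SIGHUP map re-read) while other threads still hold the pointer. The reference counting idea raised at the top of the thread can be sketched roughly as below. This is only an illustration and not the patch that went into autofs; struct map_context, context_new(), context_get() and context_put() are hypothetical names, and C11 atomics are an assumption made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lookup module's private context. */
struct map_context {
    atomic_uint refcount;   /* number of holders, including the owner */
    int generation;         /* placeholder payload for the sketch */
};

/* Allocate a context that starts with one reference, owned by the caller. */
struct map_context *context_new(int generation)
{
    struct map_context *ctxt = malloc(sizeof(*ctxt));

    if (!ctxt)
        return NULL;
    atomic_init(&ctxt->refcount, 1);
    ctxt->generation = generation;
    return ctxt;
}

/*
 * Take an extra reference. The caller must already hold a reference, or
 * hold whatever lock protects the pointer it read the context from.
 */
struct map_context *context_get(struct map_context *ctxt)
{
    unsigned int prev = atomic_fetch_add(&ctxt->refcount, 1);

    assert(prev > 0);   /* a zero count would mean the context is already gone */
    (void)prev;         /* silence unused warnings when NDEBUG is defined */
    return ctxt;
}

/* Drop a reference; the last holder frees. Returns 1 if freed, else 0. */
int context_put(struct map_context *ctxt)
{
    unsigned int prev = atomic_fetch_sub(&ctxt->refcount, 1);

    assert(prev > 0);
    if (prev != 1)
        return 0;
    free(ctxt);
    return 1;
}
```

With this shape, a lookup thread would call context_get() before using the context and context_put() when finished, and a map re-read would install a new context and context_put() the old one, so the old memory is released only after the last user is done with it.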
(In reply to Frank Sorenson from comment #10)
> I've replicated this twice now. Still working to make it more reliable.
>
> I currently have a couple very large amd map files (120K and 160K entries),
> most entries looking like this:
>
> mount1/subdir1 type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user/u type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/user/u/us type:=auto;fs:=${map};pref:=${key}/;opts:=nobrowsable
> mount1/subdir1/user/u/us/user1 addopts:=sec=sys;rhost:=nfsserver1;rfs:=/server_dir1/server_dir2;sublink:=user1
> mount1/subdir1/data type:=auto;fs:=${map};pref:=${key}/;opts:=browsable
> mount1/subdir1/data/...
I don't think the browse option makes a difference, the maps are small so creating a directory or two won't make much difference. Still I might be wrong and all we need is for it to reproduce. It's walking down that path matching the key, so it's the options type:=auto;fs:=${map} that cause the map reuse.
> I then kicked off 10 processes which stat() the leaf entries randomly
> then 'killall -HUP automount' frequently
>
> periodically I get the segfault
>
> It's not an exact science yet, but I'm working on a couple ideas.
This is hard to reproduce, I'm not surprised you're finding it difficult. We'll need this soon.
I have a patch (2 actually) that might fix it. I'll post once I have a build.
Ian
Can we try this build please: http://brew-task-repos.usersys.redhat.com/repos/scratch/ikent/autofs/5.1.4/85.el8/
I used a build target of rhel-8.6, if that's a problem let me know and I'll make a build for the target you need.
Created attachment 1924846 [details]
reproducer
this reproduces the bug fairly reliably (although it's still a little temperamental; it'll crash very quickly 20 times in a row, then may or may not crash at all for another 10 full runs. Not sure the determining factor yet)
/etc/exports:
/repro *(rw,no_root_squash,sec=sys,fsid=0)
# exportfs -a
in /etc/autofs.conf:
[ amd ]
autofs_use_lofs = no
in /etc/auto.master:
/rhbz2139504 file,amd:/etc/repro-toplevel.map dismount_interval=60,timeout=60,negative_timeout=1
put these mapfiles in /etc
repro.map-gold
repro-toplevel.map
run the reproducer script:
# rhbz2139504-repro
the script will:
* stop automount
* unmount anything left over from previous runs
* copy the 'gold' mapfile to /etc/repro.map
* start 5 child processes which will repeatedly 'stat' a set of leaf paths randomly
* the main process will then randomly sort the 'gold' file /etc/repro.map-gold to a temporary file, then rename the temporary file to /etc/repro.map
* send SIGHUP to automount
* sleep 15 seconds, checking to see whether automount has died or not
* either exit (if automount stopped) or loop back to 'sort the 'gold' file'
* if, after sending SIGHUP twice, automount is still running, loop all the way back to the beginning ('stop automount'); the bug will almost always hit on the first or second SIGHUP, and rarely after (just an observation...not sure why)
* if, after performing the entire loop 20 times, the bug has not reproduced, exit the script
(I'm not saying it's perfect...)
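The stat workers and the replace-map-then-SIGHUP step from the list above could look roughly like this in C. It is a sketch of the technique only, not the attached reproducer script; stat_random_paths() and replace_map_and_reload() are hypothetical helpers, and the caller is assumed to supply the leaf paths, the already-shuffled temporary file and the automount PID.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * stat() `iterations` randomly chosen entries of `paths`.
 * Returns how many of those stat() calls succeeded.
 */
unsigned long stat_random_paths(const char *const *paths, size_t npaths,
                                unsigned long iterations, unsigned int seed)
{
    struct stat st;
    unsigned long ok = 0;

    if (!paths || npaths == 0)
        return 0;
    srand(seed);
    for (unsigned long i = 0; i < iterations; i++) {
        const char *path = paths[(size_t)rand() % npaths];

        /* Each stat() of an autofs leaf path triggers a map lookup. */
        if (stat(path, &st) == 0)
            ok++;
    }
    return ok;
}

/*
 * Rename the shuffled temporary file over the live map, then ask
 * automount to re-read its maps. Returns 0 on success, -1 on failure.
 */
int replace_map_and_reload(const char *tmp_path, const char *map_path,
                           pid_t automount_pid)
{
    assert(tmp_path && map_path);

    /* Refuse pid <= 0: kill() would signal a whole process group. */
    if (automount_pid <= 0)
        return -1;
    if (rename(tmp_path, map_path) != 0)
        return -1;
    return kill(automount_pid, SIGHUP);
}
```

A driver in this style would fork five children that each call stat_random_paths() with a different seed, while the parent alternates replace_map_and_reload() with a 15 second wait and a check on whether automount is still alive.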
(In reply to Frank Sorenson from comment #13)
> Created attachment 1924846 [details]
> reproducer
>
> this reproduces the bug fairly reliably (although it's still a little
> temperamental; it'll crash very quickly 20 times in a row, then may or may
> not crash at all for another 10 full runs. Not sure the determining factor
> yet)
So the build in comment#12 does still crash?
Ian
(In reply to Frank Sorenson from comment #13)
> Created attachment 1924846 [details]
> reproducer
>
> this reproduces the bug fairly reliably (although it's still a little
> temperamental; it'll crash very quickly 20 times in a row, then may or may
> not crash at all for another 10 full runs. Not sure the determining factor
> yet)
Are you saying the reproducer will always eventually see the crash if autofs is broken in this way?
(In reply to Ian Kent from comment #14)
> So the build in comment#12 does still crash?
No. At least I'm not seeing a crash with the patched autofs. Still testing.
(In reply to Ian Kent from comment #15)
> Are you saying the reproducer will always eventually see the crash if
> autofs is broken in this way?
So far the reproducer has always eventually crashed the *unpatched* autofs (5.1.4-84.el8). Just not always very quickly.
(In reply to Frank Sorenson from comment #16)
> (In reply to Ian Kent from comment #14)
> > So the build in comment#12 does still crash?
>
> No. At least I'm not seeing a crash with the patched autofs. Still testing.
>
> (In reply to Ian Kent from comment #15)
> > Are you saying the reproducer will always eventually see the crash if
> > autofs is broken in this way?
>
> So far the reproducer has always eventually crashed the *unpatched* autofs
> (5.1.4-84.el8). Just not always very quickly.
So it sounds like I should go ahead with a merge request for this change. The logging and mount table handling will need to be different bugs.
Found a couple of related problems, back to assigned while I update the package.
Created attachment 1929976 [details]
autofs amd-style map file for use with reproducer
place rhbz2139504.map in /etc
edit /etc/auto.master:
/rhbz2139504 file,amd:/etc/rhbz2139504.map dismount_interval=600,timeout=600
Created attachment 1929977 [details]
reproducer
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (autofs bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:2970