Bug 433857 - rpc.mountd segfaults due to uninitialized value in e2fsprogs devname.c [PATCH]
Summary: rpc.mountd segfaults due to uninitialized value in e2fsprogs devname.c [PATCH]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: e2fsprogs
Version: 8
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-02-21 19:52 UTC by Philip Spencer
Modified: 2008-03-13 07:43 UTC (History)

Fixed In Version: 1.40.4-2.fc8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-03-13 07:43:17 UTC
Type: ---
Embargoed:


Attachments
Patch to fix the potentially uninitialized reference to info.exists. (399 bytes, patch)
2008-02-21 19:52 UTC, Philip Spencer
patch sent upstream (3.00 KB, patch)
2008-02-21 22:35 UTC, Eric Sandeen

Description Philip Spencer 2008-02-21 19:52:14 UTC
Description of problem:

Periodically, when the system is under times of high device manager activity
such as when many snapshot volumes are being created/removed for backup
purposes, rpc.mountd dies with a segmentation fault, rendering our NFS server
unusable until it is restarted. (This impacts all our users so severely that I
have had to start a script to watch for mountd segfaults and run a "service nfs
stop; sleep 1; service nfs start" when it happens).

I was finally able to capture a core dump and have traced the problem to the
e2fsprogs and device-mapper libraries to which rpc.mountd is linked.

Specifically, in the e2fsprogs source file lib/blkid/devname.c, in the function
dm_device_has_dep, we find the following code:

        struct dm_info info;
        int i;

        task = dm_task_create(DM_DEVICE_DEPS);
        if (!task)
                return 0;

        dm_task_set_name(task, name);
        dm_task_run(task);
        dm_task_get_info(task, &info);

        if (!info.exists) {
                dm_task_destroy(task);
                return 0;
        }

        deps = dm_task_get_deps(task);
 
This code does not properly check the return value of dm_task_get_info. In the
case of no info structure at all, this function returns 0 and does not set
any fields in the "info" structure. Therefore, info.exists will be uninitialized
and the "if (!info.exists)" test may well succeed even though there is no info.
The code will then go on to call dm_task_get_deps on a task with no dependents,
resulting in a segfault inside that function (which is probably something that
should be fixed there as well).

The attached simple patch resolves this problem.
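
For readers without libdevmapper at hand, the failure mode and the guarded pattern can be sketched with stand-ins. This is only an illustration, not the attached patch: mock_info, mock_get_info and device_usable are hypothetical names, and the stand-in merely mimics the behaviour described above (return 0 and leave the caller's structure untouched when there is no info).

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dm_info; only the field used here. */
struct mock_info {
    int exists;
};

/*
 * Hypothetical stand-in for dm_task_get_info(). When the task carries no
 * info (src == NULL) it returns 0 and does NOT write to *out; otherwise
 * it copies the info and returns 1.
 */
static int mock_get_info(const struct mock_info *src, struct mock_info *out)
{
    if (src == NULL)
        return 0;
    *out = *src;
    return 1;
}

/*
 * Guarded pattern: look at info.exists only after the query reported
 * success. Returns 1 if the device can be examined further, 0 otherwise.
 */
static int device_usable(const struct mock_info *src)
{
    struct mock_info info;  /* deliberately left uninitialized */

    if (!mock_get_info(src, &info))
        return 0;           /* no info at all: bail out before reading it */
    if (!info.exists)
        return 0;           /* info present, but the device does not exist */
    return 1;
}
```

Calling device_usable(NULL) models the crash scenario: the query fails, and the function returns before info.exists is ever read.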

Version-Release number of selected component (if applicable):
e2fsprogs-1.40.4-1.fc8 and earlier.

Comment 1 Philip Spencer 2008-02-21 19:52:14 UTC
Created attachment 295554
Patch to fix the potentially uninitialized reference to info.exists.

Comment 2 Philip Spencer 2008-02-21 19:57:42 UTC
Sorry -- I meant to say "fail" instead of "succeed" above. My comment should read:

  ... info.exists will be uninitialized and the "if (!info.exists)" test may
well FAIL even though there is no info. The code will then go on ...

Comment 3 Eric Sandeen 2008-02-21 20:17:07 UTC
Thanks for the report, and the patch.  I'll review & get it sent upstream and
into F8 ASAP.

-Eric

Comment 4 Alasdair Kergon 2008-02-21 21:13:15 UTC
The patch is incomplete.  What about the other non-void functions there?

And is there any more libdevmapper code in there with similar carelessness?

Comment 5 Eric Sandeen 2008-02-21 22:35:10 UTC
Created attachment 295566
patch sent upstream

agk, I agree.  This is the patch I sent upstream.
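
The general shape of "check every non-void call" can be sketched as below. This is a hedged illustration with stand-ins, not the patch sent upstream, whose contents are not reproduced in this report. mock_task, step_set_name, step_run, step_info and query_checked are hypothetical names; each flag simply says whether that step should succeed.

```c
#include <stdlib.h>

/* Hypothetical task object; each flag says whether that step succeeds. */
struct mock_task {
    int set_name_ok;
    int run_ok;
    int info_ok;
};

static int live_tasks = 0;  /* tasks created but not yet destroyed */

static struct mock_task *mock_create(int set_name_ok, int run_ok, int info_ok)
{
    struct mock_task *t = malloc(sizeof(*t));

    if (t == NULL)
        return NULL;
    t->set_name_ok = set_name_ok;
    t->run_ok = run_ok;
    t->info_ok = info_ok;
    live_tasks++;
    return t;
}

static void mock_destroy(struct mock_task *t)
{
    free(t);
    live_tasks--;
}

static int step_set_name(const struct mock_task *t) { return t->set_name_ok; }
static int step_run(const struct mock_task *t)      { return t->run_ok; }
static int step_info(const struct mock_task *t)     { return t->info_ok; }

/*
 * Returns 1 only if every step succeeded. The task is destroyed on every
 * path, so an early failure cannot leak it.
 */
static int query_checked(int set_name_ok, int run_ok, int info_ok)
{
    int ok = 0;
    struct mock_task *t = mock_create(set_name_ok, run_ok, info_ok);

    if (t == NULL)
        return 0;
    if (!step_set_name(t))
        goto out;
    if (!step_run(t))
        goto out;
    if (!step_info(t))
        goto out;
    ok = 1;
out:
    mock_destroy(t);
    return ok;
}
```

The single cleanup label keeps the destroy call in one place, which makes it harder to add a new check later and forget to release the task.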

Comment 6 Eric Sandeen 2008-02-22 15:03:27 UTC
Thus spake Ted:

> This looks good, but I assume that the bug was caused by some race
> condition where if you try to call dm_task_get_info() while some other
> process is creating or removing a snapshot, dm_task_get_info() is
> returning some kind of EAGAIN, or some other "Try again; we're busy"
> error, right?
> 
> If that is the case, can you try to find out what error is being
> returned?  It may be the right thing to do is to check to see if we
> are getting a "resource is locked; try again in a sec" error message,
> and retry the dm_task_get_info(), instead of just returning a failure.
> 
> Thanks!!
> 
> 						- Ted

agk, does this sound like the right approach to you?  On a very quick look at
device-mapper I don't see this sort of differentiation on errors... unless there
is some other context to interpret the 0/1 return values?

Thanks,
-Eric

Comment 7 Philip Spencer 2008-02-22 15:40:58 UTC
I (the original poster) know very little about either e2fsprogs or
device-mapper, and had originally just assumed it would be normal for the info
field to be null after a call to DM_DEVICE_DEPS if there were no dependents, but
now after a quick look at the sources I see that the info field "dmi" inside the
task structure is just what is returned by the ioctl, so it does appear to me
now that some sort of error occurred, and that otherwise it would have returned
a non-null dmi with a zero "exists" flag inside it.

Correct me if I'm wrong, but it seems that:

  -- No point in retrying dm_task_get_info(); it is just unpacking the "dmi"
structure returned by the previous dm_task_run call, which is null. It is in
dm_task_run that the error occurred.

  -- The code in dm_task_run seems to already take care of retrying EAGAIN
conditions.

  -- One obvious other type of race condition would be if the device were
removed in between the task creation and call to dm_task_run. In that case,
Eric's patch seems to do exactly the right thing -- no point in continuing if
the device is gone anyway.

  -- But, I don't think that's the race condition we're seeing. A gdb printout
of the task structure shows

 {type = 7, dev_name = 0x2aaaaace3e10 "vg1-snapweb-cow", head = 0x0,
  tail = 0x0, read_only = 0, event_nr = 0, major = -1, minor = -1, uid = 0,
  gid = 6, mode = 432, dmi = {v4 = 0x0, v1 = 0x0}, newname = 0x0,
  message = 0x0, geometry = 0x0, sector = 0, no_flush = 0, no_open_count = 0,
  skip_lockfs = 0, suppress_identical_reload = 0, uuid = 0x0}

This is associated to the snapshot volume "snapweb" which was being backed up at
the time. Timestamps on the backup logs indicate that my backup script moved on
to the next filesystem 30 seconds AFTER the segfault, so, unless something
really slowed down the system so that deallocation of the snapweb volume took a
full 30 seconds, it does not appear that the segfault occurred during the
unmounting and deallocating of snapweb.

I also don't understand why major/minor are -1 in the above structure; is that
normal?

Comment 8 Philip Spencer 2008-02-22 18:15:30 UTC
You know what -- I went back and double-checked all the logs, and somehow
or other I must have recorded a timestamp wrong as 3:19:21 instead of
3:19:51.

The segfault did in fact happen at exactly the same time as my backup script
moved on to the next filesystem. So, it occurred during the unmount and lvremove
of the snapshot volume. It is, then, entirely expected that the device-mapper
routines would return an error if the device no longer existed when the task was
run. This also explains why major/minor = -1/-1 if the device no longer exists.

My apologies for mixing up the timestamps! And no bug in device-mapper, just
the one in e2fsprogs which segfaulted in this circumstance instead of
dropping the device from its list. Having it fail outright, and not list the
device at all, is the correct behaviour for this situation -- just as if the
device had already been removed before the blkid routines were run.
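
The behaviour described above (drop a device whose query fails, keep probing the rest) might look roughly like this. It is a sketch under assumptions, not libblkid code: probe_fn, demo_probe and collect_probed are hypothetical names, and demo_probe only pretends that any name containing "snapweb" has just been removed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical probe callback: 1 on success, 0 if the device could not
 * be queried (for example, because it vanished in the meantime). */
typedef int (*probe_fn)(const char *name);

/* Demo probe (an assumption for illustration): treat any name that
 * contains "snapweb" as a device that was removed during the scan. */
static int demo_probe(const char *name)
{
    return strstr(name, "snapweb") == NULL;
}

/*
 * Copy into out[] the names whose probe succeeded and return how many
 * were kept. A failed probe is not fatal: the device is simply left out,
 * just as if it had been removed before the scan started.
 */
static size_t collect_probed(const char *const *names, size_t n,
                             probe_fn probe, const char **out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (!probe(names[i]))
            continue;       /* device gone: skip it, keep going */
        out[kept++] = names[i];
    }
    return kept;
}
```

With a list such as "vg1-root", "vg1-snapweb-cow", "vg1-home" (the first and last names are made up here), only the two non-snapshot entries end up in out[].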

Comment 9 Eric Sandeen 2008-02-29 16:44:25 UTC
On its way to rawhide via e2fsprogs-1.40.7-1

This will probably push to F8 eventually, though I'd like to get some soak time
first.

Changelog:

Fix bug which could cause libblkid to seg fault if a device mapper
volume disappears while it is being probed.  (Addresses RedHat
Bugzilla: #433857)

I'll leave the bug open 'til it gets to F8... perhaps I'll backport just this
patch, for now.

Comment 10 Fedora Update System 2008-02-29 17:30:03 UTC
e2fsprogs-1.40.4-2.fc8 has been submitted as an update for Fedora 8

Comment 11 Fedora Update System 2008-03-01 09:26:29 UTC
e2fsprogs-1.40.4-2.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
    su -c 'yum --enablerepo=updates-testing update e2fsprogs'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-2138

Comment 12 Fedora Update System 2008-03-13 07:43:14 UTC
e2fsprogs-1.40.4-2.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

