Bug 587336

Summary: Multipathd exits with -1 in main.c prepare_namespace()
Product: Red Hat Enterprise Linux 5 Reporter: Shane Bradley <sbradley>
Component: device-mapper-multipath Assignee: Ben Marzinski <bmarzins>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.3 CC: agk, bmarzins, bmr, christophe.varoqui, dwysocha, heinzm, junichi.nomura, kueda, lmb, mbroz, prajnoha, prockai, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-07-01 19:15:05 UTC Type: ---
Regression: --- Mount Type: ---
Attachments:
Description                                               Flags
systemtap script to figure out what is causing the error  none

Description Shane Bradley 2010-04-29 15:53:34 UTC
Description of problem:
Multipathd fails to start in daemon mode. Running it in the foreground,
without daemonizing, works with no issue.

An strace was collected and showed an EFAULT error:
15926 10:05:47.818185 stat("/etc/localtime",  <unfinished ...>
15925 10:05:47.818209 mount("/var/cache/multipathd", "/sbin", NULL, MS_BIND, NULL <unfinished ...>
15926 10:05:47.818242 <... stat resumed> {st_dev=makedev(253, 0), st_ino=98358, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=2945, st_atime=2010/04/12-10:05:47, st_mtime=2009/10/08-14:56:03, st_ctime=2009/10/08-14:56:03}) = 0
15925 10:05:47.818287 <... mount resumed> ) = -1 EFAULT (Bad address)
15926 10:05:47.818314 sendto(6, "<27>Apr 12 10:05:47 multipathd: "..., 71, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
15925 10:05:47.818341 rt_sigprocmask(SIG_BLOCK, [USR1], [], 8) = 0

  EFAULT One of the pointer arguments points outside the user address space.

Running through gdb showed that the failure is in prepare_namespace():
1349            vector_foreach_slot (conf->binvec, bin,i) {
1357            free_strvec(conf->binvec);
1358            conf->binvec = NULL;
1366            if (mount(CALLOUT_DIR, "/sbin", NULL, MS_BIND, NULL) < 0) {
1367                    condlog(0, "cannot bind ramfs on /sbin");
1368                    return -1;

The values in the strace look valid for the function call.  The customer
has run the same mount command by hand and replicated the procedure
manually. Everything works fine when run manually.
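
To isolate the failing call outside the daemon, the bind mount can be wrapped in a small helper that reports errno, which the condlog() message in the listing above does not include. This is only a sketch: try_bind_mount() is a hypothetical name and is not part of multipathd; it assumes a Linux system with glibc's mount(2) wrapper.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper, not part of multipathd: issue the same kind of
 * bind mount that prepare_namespace() performs and report errno when
 * it fails. Returns 0 on success or a negative errno value.
 */
int try_bind_mount(const char *source, const char *target)
{
	if (mount(source, target, NULL, MS_BIND, NULL) < 0) {
		int err = errno;	/* save errno before calling anything else */

		fprintf(stderr, "cannot bind %s on %s: %s (errno %d)\n",
			source, target, strerror(err), err);
		return -err;
	}
	return 0;
}
```

Calling try_bind_mount("/var/cache/multipathd", "/sbin") as root really does bind over /sbin, so it should only be tried on a disposable system or inside a private mount namespace. A return value of -EFAULT would match the strace above.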

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-23.el5-x86_64

How reproducible:
Every time on the customer's specific system.
Have not been able to recreate in lab.

Steps to Reproduce:
1. /etc/init.d/multipathd start
  
Actual results:
Multipathd start fails.
$ /etc/init.d/multipathd status
multipathd dead but pid file exists

Expected results:
Multipathd should start up correctly.

Additional info:

Kernel: 2.6.18-128.el5
$ grep device-mapper installed-rpms
device-mapper-1.02.28-2.el5-x86_64
device-mapper-event-1.02.28-2.el5-x86_64
device-mapper-multipath-0.4.7-23.el5-x86_64

$ grep udev installed-rpms
udev-095-14.19.el5-x86_64

$ grep util-linux installed-rpms
util-linux-2.13-0.50.el5-x86_64

Comment 2 Ben Marzinski 2010-04-29 18:51:15 UTC
Created attachment 410200 [details]
systemtap script to figure out what is causing the error

run

# stap it800403.stp

When it says
"starting"

run

# service multipathd start

If things work correctly, you should see something like

starting
multipathd entered sys_mount for <unknown> on /var/cache/multipathd
 entered copy_mount_options for ramfs
 exitted copy_mount_options with 0
 entered getname for /var/cache/multipathd
 exitted getname with -139636435095552
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered copy_mount_options for maxsize=3664746
 exitted copy_mount_options with 0
 entered do_mount for /var/cache/multipathd
 exitted do_mount with 0
multipathd exitted sys_mount with 0
multipathd entered sys_mount for /var/cache/multipathd on /sbin
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered getname for /sbin
 exitted getname with -139636435095552
 entered copy_mount_options for /var/cache/multipathd
 exitted copy_mount_options with 0
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered do_mount for /sbin
 exitted do_mount with 0
multipathd exitted sys_mount with 0
multipathd entered sys_mount for /var/cache/multipathd on /bin
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered getname for /bin
 exitted getname with -139636435095552
 entered copy_mount_options for /var/cache/multipathd
 exitted copy_mount_options with 0
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered do_mount for /bin
 exitted do_mount with 0
multipathd exitted sys_mount with 0
multipathd entered sys_mount for /var/cache/multipathd on /tmp
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered getname for /tmp
 exitted getname with -139636435095552
 entered copy_mount_options for /var/cache/multipathd
 exitted copy_mount_options with 0
 entered copy_mount_options for <unknown>
 exitted copy_mount_options with 0
 entered do_mount for /tmp
 exitted do_mount with 0
multipathd exitted sys_mount with 0

If it fails, you should see some "exitted" lines that return -14 (-EFAULT).
getname() returns a pointer, so its value will most likely differ from the one shown here. As long as it doesn't return -14, it should be fine.
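
The check described above can be sketched in C as follows. Both helper names are hypothetical and not part of the attached script. The 4095 bound is the one mainline kernels use to tell error values from pointers; it is an assumption that it applies to the kernel in this report, although -14 falls well inside any such bound.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ERRNO 4095	/* bound used by mainline kernels; assumed here */

/* Hypothetical helper: is this return value in the -errno range? */
int is_errno_value(long long ret)
{
	return ret < 0 && ret >= -MAX_ERRNO;
}

/*
 * Hypothetical helper: parse one line of the script's output.
 * Returns 1 and fills func/ret when the line is an "exitted" record,
 * 0 otherwise. func must have room for at least 64 bytes.
 */
int parse_exit_line(const char *line, char *func, long long *ret)
{
	const char *p = strstr(line, "exitted ");

	if (!p)
		return 0;
	return sscanf(p, "exitted %63s with %lld", func, ret) == 2;
}
```

With these helpers, the getname value shown above (-139636435095552) is classified as a pointer rather than an error, while a line ending in "with -14" is flagged.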

Please run this script on the machine that can reproduce the issue, and
copy the results into the bugzilla.

Comment 4 Ben Marzinski 2010-06-30 16:40:05 UTC
Have you gotten a chance to reproduce this with the systemtap script?