Description of problem:
On a RHEL 5.3 host with 50 LUNs mapped (50 LUNs x 4 paths = 200 devices) and dm-multipath configured, the multipathd daemon holds onto the underlying SCSI devices for an extended delay after successfully flushing the maps (using multipath -F). As per bugzilla 474627, this should take no more than 10 seconds, but the delay often (about 1 in 10 iterations) exceeds that limit, in some cases reaching 20 to 30 seconds. If a subsequent SCSI rescan is triggered during this interval to detect newly mapped LUNs, SCSI scan errors such as the following are seen on the host:
Jun 15 19:45:13 IBMx336-200-108 kernel: kobject_add failed for 3:0:0:0 with -EEXIST, don't try to register things with the same name in the same directory.
Jun 15 19:45:13 IBMx336-200-108 kernel:
Jun 15 19:45:13 IBMx336-200-108 kernel: Call Trace:
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff801497ae>] kobject_add+0x16e/0x199
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff801b5443>] device_add+0x85/0x372
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8807e1d4>] :scsi_mod:scsi_sysfs_add_sdev+0x35/0x218
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8807c5cd>] :scsi_mod:scsi_probe_and_add_lun+0x8b1/0x9c9
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8807ac2c>] :scsi_mod:scsi_execute_req+0x79/0xcf
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8807cfd0>] :scsi_mod:__scsi_scan_target+0x425/0x5e2
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8004c75d>] sscanf+0x49/0x51
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8807d436>] :scsi_mod:scsi_scan_target+0x6c/0x83
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8815affe>] :scsi_transport_fc:fc_user_scan+0x5b/0x8b
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8807dd84>] :scsi_mod:store_scan+0x9b/0xc5
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff80106afd>] sysfs_write_file+0xb9/0xe8
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8001659e>] vfs_write+0xce/0x174
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff80016e6b>] sys_write+0x45/0x6e
Jun 15 19:45:13 IBMx336-200-108 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 15 19:45:13 IBMx336-200-108 kernel:
Jun 15 19:45:13 IBMx336-200-108 kernel: error 1
Jun 15 19:45:13 IBMx336-200-108 kernel: scsi 3:0:0:0: Unexpected response from lun 0 while scanning, scan aborted
Jun 15 19:45:13 IBMx336-200-108 kernel: Vendor: NETAPP Model: LUN Rev: 7310
Jun 15 19:45:13 IBMx336-200-108 kernel: Type: Direct-Access ANSI SCSI revision: 04
From the above error, it looks like the device table references are not properly removed from the kernel when multipathd removes the devices. As a result, a subsequent SCSI rescan hits the kobject_add -EEXIST failure while trying to allocate a new device, because the same sysfs name is still in use by the stale entry.
Version-Release number of selected component (if applicable):
RHEL 5.3 GA (2.6.18-128.el5)
Multipath package: device-mapper-multipath-0.4.7-23.el5
How reproducible:
Intermittent (about 1 in 10 iterations).
Steps to Reproduce:
1. Map 50 LUNs with 4 paths each to a RHEL 5.3 host and configure dm-multipath on them.
2. Flush the dm-multipath devices by running the command "multipath -F".
3. Wait for some time, say 20 seconds, then check the output of "lsmod | grep <driver>" to see the driver's usage count. For example, if the host has QLogic HBAs, run "lsmod | grep qla".
4. Also, check the output of multipathd -k "list paths" during this extended delay. Some paths are listed as [active][ready], which implies that multipathd is still holding these paths.
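The steps above can be sketched as a small script. This is only an illustration of the reproduction sequence, not part of the original report; "qla" is the QLogic example driver name from step 3, so substitute your own HBA driver.

```shell
#!/bin/bash
# Sketch of steps 2-4 above. Requires root and a multipath setup;
# the driver name ("qla" here) is an example, not an assumption
# about your hardware.
check_after_flush() {
    local driver=${1:-qla}
    multipath -F                    # step 2: flush all multipath maps
    sleep 20                        # step 3: wait past the 10-second window
    lsmod | grep "$driver"          # driver use count should be back to 0
    multipathd -k"list paths"       # step 4: held paths show [active][ready]
}
```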
Actual results:
Multipathd holds onto the underlying devices for more than the prescribed 10-second interval after successfully flushing all the maps.
Expected results:
The multipathd daemon should free all the underlying devices within a definite period of time (such as the 10 seconds from the fix in bugzilla 474627), irrespective of the number of dm-multipath devices configured on the host.
Unless this is a quick fix in RHEL 5.4, I don't see this happening until RHEL 5.5. Will defer on the technical aspects of this to Ben and Tom.
Is anything printed to /var/log/messages when you run
# multipath -F
Can you watch the messages by running
# tail -f /var/log/messages
When you run
# multipath -F
you should see messages that look like
multipathd: dm-2: remove map (uevent)
Check to make sure these messages are all happening in a reasonable time. multipathd won't remove the device until it gets a uevent message to do so. If these messages are being delayed for some reason, multipathd can't do anything about it.
Created attachment 348691 [details]
tarball having messages and script to reproduce this issue
I am not seeing the "remove map (uevent)" messages in the syslog messages file. I am attaching a script (create_delete_multipath.sh) which helps reproduce this issue. Also attached are the screen logs of this script and the /var/log/messages file.
In the file "terminal_logs.log", multipath devices were flushed out at 19:38:48, but we do not see any remove map uevent in messages file after the command was run. We can see that the usage count of the driver was quite high (lsmod output) even after waiting for 15 seconds and multipathd -k"list paths" showed some paths as [active][ready]. In the messages file, we can see that the kobject_add error happened at 19:45:13, which is because of the SCSI scan issued on the host by the script to scan newly mapped luns.
At Jun 15 19:39:54, the earliest timestamp after 19:38:48, multipathd is still in the process of adding paths.
Can you try commenting out the following line in /etc/udev/rules.d/40-multipath.rules
KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"
This could well speed up multipath creation; it does in many cases. Also, could you try putting a sleep between when you create the multipath devices and when you remove them? Let me know if either of these two workarounds makes the problem go away.
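For reference, the workaround could be applied with something like the following sketch. The rule file path is from the comment above; the sed pattern and the backup step are my assumptions, so verify the file contents before and after running it.

```shell
#!/bin/bash
# Comment out (rather than delete) the multipath udev rule discussed above.
# Assumes GNU sed; the match pattern "multipath -v0" is taken from the
# RUN+= clause of the rule quoted in the comment.
comment_multipath_rule() {
    local rules=${1:-/etc/udev/rules.d/40-multipath.rules}
    cp "$rules" "$rules.bak"                         # keep a backup copy
    sed -i '/multipath -v0/ s/^[^#]/# &/' "$rules"   # prefix the rule with '# '
}
```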
This issue is not reproducible if we comment out the line in /etc/udev/rules.d/40-multipath.rules file as suggested by you.
Are there any side effects to commenting out this line? If none, why is this not the default, given that it considerably reduces multipath device creation time, especially with a large number of LUNs/paths mapped to the host?
There are no significant side effects. Multipath devices won't be automatically created if the multipathd daemon is not running, but this shouldn't be a big deal for users, since multipathd is essential for multipath devices to work correctly.
Changing this fixes a number of bugs. Unfortunately, the flood of bugs came in after the development freeze. If we can mark this bug as a blocker, I can easily reroll the package with this fixed.
Updating the severity of this bz as this issue is a blocker to our overlying applications.
The line is now commented out.
There are some additional queries/suggestions regarding this issue:
1) You had mentioned adding a sleep between multipath device creation and device removal. But how long should this sleep be? Won't that be dependent on the number of luns/paths mapped to the host? So how do we come up with a reliable sleep value that satisfies all these use cases?
2) Overlying applications depend on the successful completion of multipath device creation/removal before issuing subsequent commands on these devices. As per your fix for bug 474627, multipathd would still hold the multipathed devices for around 10 seconds even after successfully flushing them out with the 'multipath -F' command. How would applications know that this is complete? Is there any way multipathd can notify them that the task is done?
Or perhaps the 'multipath -F' command itself should block for the duration of this interval, so that applications know multipathd has released the underlying devices once the command returns?
This info is crucial for reliably configuring/unconfiguring the Linux storage stack, as required by these applications.
In response to comment #13
1. Yes, this will be dependent on how many LUNs there are, but removing that line can cut the creation time down dramatically in some setups.
2. As long as you don't start removing devices until they are completely created, there should be no 10-second wait. The ten-second wait is actually in the creation code: sometimes multipathd gets the uevent before the device is completely set up, in which case it needs to wait. If you tear down the map while it is doing this, multipathd can't tell whether it is still waiting for the information to appear or the information is already gone.
Right now you can find out how many paths multipathd is monitoring by running
# lsof -c multipathd | grep -c BLK
You can check this against the number of multipathed paths by running
# multipath -l | grep -c '^ \\_'
These should match. You should be able to run these commands in a loop with a short sleep until they return the same number. This isn't the prettiest solution, and for RHEL 5.5 or RHEL 6 we can probably add a multipath option that waits until multipathd's state is synced with the device-mapper's state (which is what multipath -ll checks) to do this for you.
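The loop described above could be sketched like this. The two counting commands are taken verbatim from the comment; the function name, the one-second sleep, and the timeout default are my additions.

```shell
#!/bin/bash
# Poll until the number of block devices multipathd holds open (via lsof)
# matches the number of paths device-mapper reports (via multipath -l).
# Returns 0 once they agree, 1 on timeout. The timeout value is arbitrary.
wait_for_multipathd_sync() {
    local timeout=${1:-60} elapsed=0 held paths
    while [ "$elapsed" -lt "$timeout" ]; do
        held=$(lsof -c multipathd 2>/dev/null | grep -c BLK)
        paths=$(multipath -l 2>/dev/null | grep -c '^ \\_')
        if [ "$held" -eq "$paths" ]; then
            return 0                # multipathd state is in sync
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1                        # timed out waiting for sync
}
```

An application could call this after "multipath -F" (or after device creation) before issuing further commands against the devices.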
Thanks for the info.
Yes, I think it is crucial to ascertain that dm-multipath has stabilized (by ensuring the device mapper's state is in sync with multipathd's state) during storage stack config/unconfig. A multipath option for doing that would be really helpful. Do you want me to file separate bugzillas for this (for RHEL 6.0, RHEL 5.5 & RHEL 4.9)?
This made Snapshot 2 - go ahead and pull that down and let us know your testing status ASAP. Thanks!
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.