Bug 681144
Summary: [NetApp 6.0.z Bug] DM-Multipath fails to update paths during IO with fabric faults
Product: Red Hat Enterprise Linux 6
Reporter: Rajashekhar M A <rajashekhar.a>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA
QA Contact: Storage QE <storage-qe>
Severity: high
Docs Contact:
Priority: high
Version: 6.0
CC: agk, bmarzins, coughlan, cward, dwysocha, fge, heinzm, marting, mbroz, mzywusko, plyons, prajnoha, prockai, xdl-redhat-bugzilla, zkabelac
Target Milestone: rc
Keywords: OtherQA, Regression, ZStream
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: device-mapper-multipath-0.4.9-39.el6
Doc Type: Bug Fix
Doc Text: When a path was removed, the multipathd daemon did not always remove the path sysfs device from its cache. The daemon kept searching the cache for the device and created sysfs devices without the vecs lock held. Because of this, paths could have pointed to invalid sysfs devices and caused multipathd to crash. The multipathd daemon now always removes the sysfs device from cache when deleting a path and accesses the cache only with the vecs lock held.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 14:13:01 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 684684
Attachments:
Description
Rajashekhar M A
2011-03-01 09:34:50 UTC
Created attachment 481576 [details]
Tarball with messages and multipath.conf file.
Attached is a tarball with the full /var/log/messages files (for both the tur and directio path_checker scenarios). The tarball also contains the multipath.conf file.
With directio, are you sure that you don't see this on RHEL-6.0 installs? Also, do you know if the path issue clears up in a couple of minutes? directio is an asynchronous checker. It won't actually fail a path until the IO returns failed, or it times out. Could you try adding something like

fast_io_fail_tmo 5

to the defaults section? This will make sure that directio gets back IO to a failed device after 5 seconds of waiting, which should make directio a lot more responsive.

The tur issue makes much more sense as a regression. I did add something that could make you more likely to see "failed to get parent" messages. Multipath was caching that sysfs information forever, even if the device was deleted. This was causing a memory leak. Now multipath frees up that information when a device is removed. However, that information should be available when you are adding a path.

The important thing to look for is that the proper sysfs directories exist. To get the sysfs device, multipathd checks

/sys/block/<devname>

so for your setup it would check

/sys/block/sdbk

This should be a symlink to a directory. When you reproduce this, can you please post where /sys/block/<devname> is a symlink to? This directory appears to exist; otherwise, you would have failed in common_sysfs_pathinfo(), before you ever had a chance to try to get the parent device.

For example, on my system I see:

[root@ask-07 ~]# ls -l /sys/block/sdb
lrwxrwxrwx. 1 root root 0 Jan 19 06:21 /sys/block/sdb -> ../devices/pci0000:00/0000:00:0a.0/0000:06:00.0/host8/rport-8:0-0/target8:0:0/8:0:0:0/block/sdb

and

/sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/host8/rport-8:0-0/target8:0:0/8:0:0:0/block/sdb

exists. As long as this directory starts with "/sys/devices", the parent should simply be the sysfs device with the last directory chopped off. So on my system, it is

/sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/host8/rport-8:0-0/target8:0:0/8:0:0:0/block

or, if that is a link (it's not for me), whatever that link points to. If the last element is "block", which you can see from above it is for me, multipath grabs the parent of this directory. For me this is

/sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/host8/rport-8:0-0/target8:0:0/8:0:0:0

So unless your scsi sysfs devices are set up a lot differently than mine, it appears that you are able to access a sysfs directory, but not its parent, unless, of course, the issue is that the directory has actually been removed while this is happening.

I will work on some test packages that print out all these paths when you fail to get a sysfs device. Please check if those sysfs directories exist when you see this issue.

There are debug packages available at:

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/x86_64/

and

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/i686/

These will print out a lot more information when you get the "failed to get parent" messages.

Created attachment 482629 [details]
syslog with test rpm messages
I could reproduce the bug with the test rpms (used tur as path_checker):
# multipath -ll /dev/sdbk
360a98000486e2f65686f6246516f4468 dm-9 NETAPP,LUN
size=5.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 2:0:1:20 sdca 68:224 active ready running
| `- 3:0:1:20 sdcf 69:48 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 2:0:0:20 sdbh 67:176 active ready running
#
Here are the messages which I see in syslog:
Mar 6 02:25:42 IBMx3250-200-178 multipathd: sdbk: add path (uevent)
Mar 6 02:25:42 IBMx3250-200-178 multipathd: sysfs dev has no parent
Mar 6 02:25:42 IBMx3250-200-178 multipathd: sdbk: failed to get parent
Mar 6 02:25:42 IBMx3250-200-178 multipathd: device: /devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/rport-3:0-2/target3:0:0/3:0:0:20/block/sdbk
Mar 6 02:25:42 IBMx3250-200-178 multipathd: device:
Mar 6 02:25:42 IBMx3250-200-178 multipathd: sdbk: failed to store path info
Mar 6 02:25:42 IBMx3250-200-178 multipathd: uevent trigger error
But I checked the directories, and they do exist:
# ls -l /sys/block/sdbk
lrwxrwxrwx. 1 root root 0 Mar 6 02:36 /sys/block/sdbk -> ../devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/rport-3:0-2/target3:0:0/3:0:0:20/block/sdbk
# cd /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/rport-3:0-2/target3:0:0/3:0:0:20/block/sdbk
# ls
alignment_offset capability device ext_range inflight queue removable size stat trace
bdi dev discard_alignment holders power range ro slaves subsystem uevent
# pwd
/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/rport-3:0-2/target3:0:0/3:0:0:20/block/sdbk
# cd ../
# ls
sdbk
# pwd
/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/rport-3:0-2/target3:0:0/3:0:0:20/block
#
The other map where we see three paths is 360a98000486e2f65686f6246516f4168:
# multipath -ll 360a98000486e2f65686f6246516f4168
360a98000486e2f65686f6246516f4168 dm-7 NETAPP,LUN
size=5.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 2:0:1:19 sdby 68:192 active ready running
| `- 3:0:1:19 sdce 69:32 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 2:0:0:19 sdbf 67:144 active ready running
#
Attached is the full syslog when the bug got reproduced.
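The parent lookup described earlier in this thread (take the resolved /sys/block/<devname> target, chop off the last directory, and go up one more level if what remains ends in "block") can be sketched in a few lines of Python. This is only an illustration of the path arithmetic as the comments describe it, not multipathd's C implementation: the function name sysfs_parent is hypothetical, following symlinks on intermediate directories is left out, and raising ValueError for paths outside /sys/devices is an assumption of the sketch.

```python
import posixpath


def sysfs_parent(devpath):
    """Return the parent sysfs directory for a resolved block device path.

    Hypothetical helper that mirrors the rule described in this thread;
    it is not multipathd's actual code.
    """
    # Assumption of this sketch: reject anything outside /sys/devices.
    if not devpath.startswith("/sys/devices/"):
        raise ValueError("not under /sys/devices: " + devpath)
    # Chop off the last directory (the device name, e.g. "sdbk").
    parent = posixpath.dirname(devpath.rstrip("/"))
    # If what remains ends in "block", go up one more level.
    if posixpath.basename(parent) == "block":
        parent = posixpath.dirname(parent)
    return parent


if __name__ == "__main__":
    path = ("/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/"
            "rport-3:0-2/target3:0:0/3:0:0:20/block/sdbk")
    print(sysfs_parent(path))
```

For the sdbk path from the listing above, the result is the directory ending in 3:0:0:20, so a pure path computation on an intact devpath cannot come up empty. That is consistent with the empty "device:" line in the syslog pointing at a damaged devpath string rather than at a missing directory.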
The issue is that "sysfs dev has no parent" should have a path after "dev". For some reason, the path has been cleared out. The same goes for the second device line in

Mar 6 02:25:42 IBMx3250-200-178 multipathd: device: /devices/pci0000:00/0000:00:03.0/0000:06:00.1/host3/rport-3:0-2/target3:0:0/3:0:0:20/block/sdbk
Mar 6 02:25:42 IBMx3250-200-178 multipathd: device:

Somehow, the path part of your sysfs device structure is being cleared out. This seems pretty odd. My best guess is that the sysfs device structure is getting deleted while it is still in the cache, and part of the memory is getting reused. But I can't find any place where that could happen.

There are two new sets of packages. They are at the same location as before.

bz681144.2 just adds some more printout messages, so I can see better how the sysfs device is getting set up.

bz681144.3 removes the part of the last zstream commit that had to do with sysfs devices. I can't see why it should cause this to fail, but if removing it fixes the problem, then that's where the problem must be.

It would be great if you could try both and see if bz681144.3 fixes the issue, and post the output from bz681144.2.

We will update the bz once we collect more data from the new test RPMs (bz681144.2). We have already tested with the GA version of device-mapper-multipath (0.4.9-31.el6) and also with the first errata that was released (0.4.9-31.el6_01). We did not hit the issue. This seems to be a problem only with the last commit, i.e., 0.4.9-31.el6_02.

Created attachment 482890 [details]
syslog and command logs with 681144.2 test rpms and directio
Attached is a zip file with full syslog messages and a few command outputs when the bug got reproduced (this time with directio) with the test rpms.
(In reply to comment #3)
> With directio, are you sure that you don't see this on RHEL-6.0 installs?

The [failed][ready][running] problem with directio is seen only with the latest multipath errata package, i.e. 0.4.9-31.el6_02, and not with previous packages like 0.4.9-31.el6 and 0.4.9-31.el6_01. So it does look like a regression affecting directio as well.

> Also, do you know if the path issue clears up in a couple of minutes?

No. The path remains in the same [failed][ready][running] status and never recovers, though it is actually online (i.e. TUR is successful on this path).

I'd still really like you to test with 681144.3, even though you already tested with the earlier zstream packages, since there were two memory leaks that got closed in the latest package, and most of the code was for one that didn't affect the sysfs parent devices. If you can still reproduce the tur checker error with 681144.3, then somehow the code to fix the other memory leak is messing with the parent sysfs device cache. Also, I assume that the directio bug has to do with the other memory leak, and if you can still reproduce it with 681144.3, then I'll know for sure.

Thanks for getting the 681144.2 test data back so quickly. I'm working on a test package that should hopefully fix the tur issue. I don't think it will fix the directio issue, but I don't understand exactly what's happening with that issue yet, so it's possible.

There are new packages available that should fix the tur issue. They may fix the directio issue as well. These are the 681144.4 packages, available at the same location as before.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Multipathd was not always removing a path's sysfs device from cache when the path was removed. Also, it was searching the cache and creating sysfs devices without the vecs lock held. Because of this, paths would occasionally have invalid sysfs devices, causing multipathd crashes and other errors. Multipathd now always removes the sysfs device from cache when deleting the path, and it only accesses the cache with the vecs lock held.

> I'd still really like you to test with 681144.3...

We have tested with the 681144.3 rpms, and we did not hit the issue.

> There are new packages available that should fix the tur issue. They may fix the directio issue as well. These are the 681144.4 packages, available at the same location as before.

We are currently testing the 681144.4 set of rpms. We will update the bugzilla with our results once we are done.

Our tests showed that the 681144.4 set of RPMs fixes the issue.

~~ Partners and Customers ~~

This bug was included in RHEL 6.1 Beta. Please confirm the status of this request as soon as possible. If you're having problems accessing 6.1 bits, are delayed in your test execution or find in testing that the request was not addressed adequately, please let us know. Thanks!

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Multipathd was not always removing a path's sysfs device from cache when the path was removed. Also, it was searching the cache and creating sysfs devices without the vecs lock held. Because of this, paths would occasionally have invalid sysfs devices, causing multipathd crashes and other errors. Multipathd now always removes the sysfs device from cache when deleting the path, and it only accesses the cache with the vecs lock held.
+When a path was removed, the multipathd daemon did not always remove the path sysfs device from its cache. The daemon kept searching the cache for the device and created sysfs devices without the vecs lock held. Because of this, paths could have pointed to invalid sysfs devices and caused multipathd to crash. The multipathd daemon now always removes the sysfs device from cache when deleting a path and accesses the cache only with the vecs lock held.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0725.html
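To make the fix described in the technical note more concrete, here is a toy Python model of the two rules it states: always drop the cached sysfs device when a path is deleted, and only touch the cache with the lock held. It is a sketch of the idea under stated assumptions, not multipathd's C code: the names SysfsCache, get_or_create, and remove are hypothetical, a threading.Lock stands in for the vecs lock, and the device path in the demo is a placeholder.

```python
import threading


class SysfsCache:
    """Toy cache of sysfs device records keyed by devpath (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the vecs lock
        self._devices = {}

    def get_or_create(self, devpath):
        # Search and insert under the same lock, so another thread cannot
        # delete the entry between the lookup and the insert.
        with self._lock:
            dev = self._devices.get(devpath)
            if dev is None:
                dev = {"devpath": devpath}
                self._devices[devpath] = dev
            return dev

    def remove(self, devpath):
        # Always drop the cached entry when the path is deleted, so a later
        # "add path" event builds a fresh record instead of reusing a stale one.
        with self._lock:
            return self._devices.pop(devpath, None) is not None


if __name__ == "__main__":
    cache = SysfsCache()
    placeholder = "/devices/example/block/sdbk"  # placeholder path for the demo
    first = cache.get_or_create(placeholder)
    cache.remove(placeholder)
    second = cache.get_or_create(placeholder)
    print(first is second)
```

The demo prints False: after the removal, the next lookup builds a new record rather than handing back the old one. In the C daemon the failure mode was worse than a stale dictionary, since a freed structure could be reused while a path still pointed at it, which this model cannot reproduce.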