Description of problem:

Host setup: RHEL 5.1 RC1 (2.6.18-52.el5) using Emulex LPe11002 HBA cards and the 8.1.10.9 driver. Along with the root lun, 2 additional data luns are mapped to the RHEL host. Each lun has 4 FCP paths - 2 primary and 2 secondary.

Now fault injections are performed where the FCP paths alternately go offline and then online as follows:
1) First the primary FCP paths are taken offline. After an interval of 10 minutes, the primary paths are brought back online.
2) After the next 10 minutes, the secondary paths are taken offline. These paths are brought back online after 10 minutes.

The above cycle is repeated in a loop, i.e. at any given point of time either the primary or the secondary paths are available for each lun on the RHEL host.

Within a couple of hours of these iterations, the host becomes unresponsive and freezes. The freeze is always reproducible for the above scenario. The only way out is to hard boot the machine. This happens for both IO and non-IO runs on the data luns.

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-12.el5

How reproducible:
Always.

Steps to Reproduce:
1. Along with the root lun, map 2 data luns to the RHEL 5.1 root-device-multipathed host. Each lun has 4 FCP paths - 2 primary and 2 secondary.
2. Perform fault injections on the FCP paths as described above.

Actual results:
The host freezes within a couple of hours of the path fault injections, necessitating a hard boot.

Expected results:
The host should survive the above fault injections.

Additional info:
1) A normal RHEL 5.1 host (non-SANbooted, root device not multipathed) survives these path fault injections even for longevity runs of 72 hours, i.e. the freeze is seen only on root-device-multipathed hosts during path faults.
2) We also tried tweaking the disk timeout values of the underlying SCSI paths as well as the lpfc_devloss_tmo values of the Emulex driver, but the freeze was still reproducible. The freeze is also seen irrespective of whether the default mpath_prio_netapp prio callout or the modified mpath_prio_ontap callout is used.
Created attachment 240921 [details] Config and sysrq dumps of the host

The attachment contains 4 files:
1) Config.txt - Configuration file of the RHEL 5.1 root device multipathed host.
2) RHEL5.1-WithoutIO-SANbootfreeze.txt - Sysrq dumps of the host during the freeze for a non-IO scenario.
3) RHEL5.1-WithIO-SANbootfreeze.txt - Sysrq dumps of the host during the freeze for an IO scenario on the data luns.
4) RHEL5.1-Debug-SANboot-1.txt - Sysrq dumps of the host during the freeze, again for a non-IO scenario. Here, the debug kernel was used (2.6.18-52.el5debug).
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
NetApp states this is blocking SAN boot support on RHEL 5.1.
Adding Emulex - just as FYI for them...
Emulex is reviewing this bugzilla now.
I tried a similar test in the following environment, but I can't reproduce the problem.

o HBA: QLA2340 (2Gbps * 1port) * 2
o Storage: NEC iStorage (2Gbps * 2port)
o FC-Connection: FC-AL (no FC switch between the HBAs and the storage)
o Configurations:
  - 3 LUNs from the storage (each LUN has 2 paths)
  - Failover configuration (1 path in each priority group)
  - Directio path checker
o Fault injection:
  - Disk offline/online using sysfs (like the attached script)
  - Remove/restore FC cables alternately manually

According to the console log attached in comment#1, path checking by multipathd looks stopped from a certain point of time. If that is true, the paths would not be reinstated and all paths would eventually become down. Are there any related logs in /var/log/messages?

Also, I think answers to the following questions would help isolate the problem.

o What was the path status from dm's view at each point of time?
  You can check it with "dmsetup status".
  If paths in one priority group are failed before you inject a fault to paths of the other priority group, all paths will be down.
o What fault injection method was used? Is it NetApp specific or generic?
  I attached a test script using sysfs to offline/online the paths. Could you try this to see if the problem can be reproduced?
o Does it happen if other path checkers like "tur" or "readsector0" are used?
o Does it happen if the 2 data LUNs aren't used? (Create only 1 multipath device for the root volume.)
o Does it happen if 2 paths (1 path in each priority group) are used?
o Does it happen if other HBAs like QLogic are used?
Created attachment 248471 [details] Path fail testing script using sysfs to offline/online paths

You need to modify some variables in the script for your environment.
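For readers without attachment access, a minimal sketch of the same sysfs-based approach follows (the H:C:T:L values and the interval are placeholders - adjust them to match the paths in one priority group on your host):

#!/bin/bash
# Alternately offline and online the listed SCSI paths via sysfs.
PATHS="3:0:1:0 4:0:0:0"   # H:C:T:L of the paths to cycle (placeholder values)
INTERVAL=600              # seconds between state changes

while true; do
    for p in $PATHS; do
        echo offline > /sys/bus/scsi/devices/$p/state
    done
    sleep $INTERVAL
    for p in $PATHS; do
        echo running > /sys/bus/scsi/devices/$p/state
    done
    sleep $INTERVAL
done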
The RHEL5.1-Debug-SANboot-1.txt file shows "possible recursive locking" in the device-mapper code:

[ INFO: possible recursive locking detected ]
2.6.18-52.el5debug #1
---------------------------------------------
lvm/457 is trying to acquire lock:
 (&md->io_lock){----}, at: [<f88b078f>] dm_request+0x15/0xe6 [dm_mod]
but task is already holding lock:
 (&md->io_lock){----}, at: [<f88b078f>] dm_request+0x15/0xe6 [dm_mod]

other info that might help us debug this:
1 lock held by lvm/457:
 #0:  (&md->io_lock){----}, at: [<f88b078f>] dm_request+0x15/0xe6 [dm_mod]

stack backtrace:
 [<c043be58>] __lock_acquire+0x70c/0x922
 [<c043c5bb>] lock_acquire+0x4b/0x68
 [<f88b078f>] dm_request+0x15/0xe6 [dm_mod]

Could this indicate a real issue in dm?

Also, is nmi_watchdog enabled? Can you invoke kdb and get a backtrace of any uninterruptible processes a few minutes apart to see if they stay in the same place? (Or, without kdb, invoke the sysrq a few minutes apart.)
Please find my replies below:

o What was the path status from dm's view at each point of time?
  You can check it with "dmsetup status".
  If paths in one priority group are failed before you inject a fault to paths of the other priority group, all paths will be down.

-- We did run "dmsetup status --target=multipath -v" in a loop. It properly displayed the corresponding path status during the faults. But during faults, SCSI error messages cluttered the console screen, making it difficult to check the dmsetup status. And once the freeze is hit, it was impossible to get the dmsetup status as the host was not responsive. Also, the path faults were run in such a manner that first the paths in one priority group were offlined and then onlined; only after this was it repeated for the paths in the next priority group, i.e. either the primary or the secondary paths (or both) were available for each lun at any given point of time.

o What fault injection method was used? Is it NetApp specific or generic?
  I attached a test script using sysfs to offline/online the paths. Could you try this to see if the problem can be reproduced?

-- This is NetApp specific in the sense that a dual clustered NetApp controller setup was used as the target. Each controller head has 2 target ports, totalling 4 target ports altogether, i.e. for each lun on any one controller head, 4 FCP paths were available on the host - 2 primary (through the local head) and 2 secondary (through the partner head). Now faults are run in such a manner that when one controller head goes down, the partner head takes over, and then vice versa. This is repeated in a loop. On the host, this corresponds to paths getting offlined and then onlined as described above (in the first point).

We did take a Finisar trace during the freeze. On analyzing it, we could see the whole sequence of RSCNs, GID_FTs, PLOGIs and PRLIs, which is normal during faults. But subsequently, the initiator ports never proceed beyond the REPORT LUNS commands for the respective luns during these fault injections. The target properly responds to the initiator commands, but it is the initiator ports that remain idle after receiving a GOOD STATUS reply to the corresponding REPORT LUNS commands. So the target can be ruled out in this case.

We also enabled HBA error logging, SCSI error logging and multipathd logging (by running multipathd -v4) and even used the debug kernel - but to no avail. No relevant messages were seen in /var/log/messages (after rebooting the host) or on the serial console during the freeze.

o Does it happen if other path checkers like "tur" or "readsector0" are used?

-- Yes.

o Does it happen if the 2 data LUNs aren't used? (Create only 1 multipath device for the root volume.)

-- Not tried this yet.

o Does it happen if 2 paths (1 path in each priority group) are used?

-- Not tried this yet.

o Does it happen if other HBAs like QLogic are used?

-- Not tried this yet.

Could this indicate a real issue in dm? Also, is nmi_watchdog enabled?

-- No, nmi_watchdog is not enabled.

Can you invoke kdb and get a backtrace of any uninterruptible processes a few minutes apart to see if they stay in the same place? (Or, without kdb, invoke the sysrq a few minutes apart.)

-- I think the "recursive locking" message was displayed during the host reboot and not during the freeze. Anyway, I can collect successive sysrq dumps (a few minutes apart) during the freeze if that's what you want.
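(To keep the status readable despite the console clutter, a simple capture loop along these lines could be left running in the background - the log path is arbitrary:)

while true; do
    echo "=== $(date) ===" >> /var/log/dmsetup-status.log
    dmsetup status --target=multipath -v >> /var/log/dmsetup-status.log 2>&1
    sleep 30
done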
Hi Emulex,

> We did take a Finisar trace during the freeze. On analyzing them,
> we could see the whole sequence of RSCNs, GID_FTs, PLOGIs, PRLIs
> which is normal during faults. But subsequently, the initiator ports
> never proceed beyond the REPORT LUNs commands for the respective luns
> during these fault injections. The target properly responds to
> the initiator commands, but it is the initiator ports that remain idle
> after receiving a GOOD STATUS reply to the corresponding REPORT LUNs
> commands. So the target can be ruled out in this case.

Do you think that this means the device driver is stalling and is the first suspect? Or do you think that it is a result of another component's fault or something else?
> o What was the path status from dm's view at each point of time?
>   You can check it with "dmsetup status".
>   If paths in one priority group are failed before you inject a fault
>   to paths of the other priority group, all paths will be down.
>
> -- We did run the "dmsetup status --target=multipath -v" in a
>    loop. It properly displayed the corresponding path status during
>    the faults. But during faults, SCSI error messages cluttered the
>    console screen making it difficult to check the dmsetup
>    status. And once the freeze is hit, it was impossible to detect
>    the dmsetup status as the host was not responsive. Also, the path
>    faults were run in such a manner that first paths in one priority
>    group were offlined and then onlined. Only after this was it
>    repeated for the paths in the next priority group i.e. either
>    primary/secondary paths (or both) were available for each lun at
>    any given point of time.

I meant: what was the path status BEFORE injecting a fault? In this testing loop, all paths should be active from dm's view just before injecting a fault. But if something went wrong in multipathd, the onlined paths might not have been activated yet when the paths in the next priority group were offlined.

Is it possible for you to check 'dmsetup status' on the testing host before every fault injection? Also, is it possible to stop injecting faults if any of the paths are marked 'F' in the output of dmsetup? Then you would see the crucial state where all paths should be available but dm sees some of them as failed.

Anyway, could you attach all console logs and /var/log/messages to this bugzilla if possible? Those might include helpful information to isolate whether the cause is in multipathd, the device driver or another kernel component.

Also, results of the trials in comment#7 (especially the result of the sysfs fault injection test attached in comment#8) would help the investigation very much.
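(For reference, a check along these lines could be run just before each fault injection - the grep is deliberately loose since the exact column layout of the multipath status line depends on the kernel version:)

#!/bin/bash
# Log the multipath status and abort the test run if any path is still
# marked failed ('F') from the previous cycle.
STATUS=$(dmsetup status --target=multipath)
echo "$(date) $STATUS" >> /var/log/prefault-status.log
if echo "$STATUS" | grep -q ' F '; then
    echo "Failed path(s) present before injection - stopping the test" >&2
    exit 1
fi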
Add Mukesh @Emulex to CC
What I meant was that the issue seemed to be purely a host-related one, going by the Finisar trace. But identifying which layer in the storage stack triggered the freeze is the tough part. As requested, I'll try checking the dmsetup status on the host before every fault. I will also attach the relevant console logs and /var/log/messages for this scenario.
>>Hi Emulex,
>>Do you think that it means the device driver is stalling and
>>the first suspect?
>>Or do you think that it is a result of other components' fault
>>or something?

We don't have enough data to suspect the device driver at this point. The device driver is completing Fibre Channel discovery without any error. There is no device driver thread in an uninterruptible state in any of the sysrq task dumps. Additional test data (backtraces of any uninterruptible processes taken a few minutes apart) on the DEBUG KERNEL with NMI WATCHDOG enabled would help.

>>-- I think the "recursive locking" message was displayed during the host
>>reboot and not during the freeze.

It's worth debugging why we see this "recursive locking" message during system boot.
> >>-- I think the "recursive locking" message was displayed during the host
> >>reboot and not during the freeze.
>
> It's worth debugging why we see this "recursive locking" message during system
> boot.

It's a known problem.
http://marc.info/?l=dm-devel&m=116322663022361&w=2
http://marc.info/?l=dm-devel&m=116379258818712&w=2
Created attachment 249531 [details] Sysrq dumps + /var/log/messages

The attachment contains the following:
1) Sysrq dumps - After the freeze, the process state was collected twice, 10 minutes apart (nmi_watchdog not yet enabled).
2) /var/log/messages - Collected after rebooting the host. One can see the last multipathd message at 21:48:25, due to the path faults, before the freeze. The subsequent messages correspond to the host reboot.
The system is in a hung state because the root device is losing all paths. The HBA driver is returning DID_ERROR on the active paths (3:0:1:0 and 4:0:0:0) while the other two paths remain dead. We need additional information to figure out why the lpfc driver is returning DID_ERROR.

Please run the test with NO IO and the lpfc driver log verbose set to 0x40. You can turn on the lpfc driver log level by adding the following line to /etc/modprobe.conf:

options lpfc lpfc_log_verbose=0x40
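(A rough sequence for setting this - whether the initrd rebuild is strictly required depends on whether lpfc is loaded from the initrd with its options baked in at mkinitrd time, which is an assumption here, so it is included only as a precaution:)

echo 'options lpfc lpfc_log_verbose=0x40' >> /etc/modprobe.conf
# assumption: on a SAN-booted host lpfc is loaded from the initrd, so rebuild it
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
reboot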
Created attachment 250041 [details] Sysrq dumps + messages with lpfc verbose logging

The attachment contains the following:
1) rhel5.1-lpfc-verbose-1.TXT - Sysrq dumps of the host (with the debug kernel) with no IO run. The process state was collected twice, 10 minutes apart, after the freeze. The lpfc log verbose was set to 0x40 and nmi_watchdog was enabled.
2) messages-new - The /var/log/messages file after rebooting the host following the freeze.
The console log and /var/log/messages files are missing some crucial information. I am not seeing any SCSI error messages on the console. The initial driver loading messages are missing from the /var/log/messages file. These messages are very important for debugging this issue. The previous log files did have all that information; it looks like the console log level got changed.

Can you please re-run the test with:
1. console log level set to 8
   echo 8 > /proc/sys/kernel/printk
2. lpfc driver verbose set to 0x40
3. NO I/O
4. the old /var/log/messages cleaned up
Created attachment 251291 [details] New dumps as requested

As requested, I have attached the new logs as follows:
1) rhel5.1-lpfc-verbose-2.TXT - Sysrq dumps (10 minutes apart) of the debug kernel after the freeze. This is for the non-IO scenario with lpfc log verbose = 0x40, console log level = 8 and nmi_watchdog enabled.
2) messages-debug-2 - /var/log/messages for the above scenario.
As Mukesh commented, all paths for the root filesystem went down, although that isn't expected to happen in this test scenario. As a result, one multipathd thread (PID=2812 in RHEL5.1-debug-1.TXT) is stalling in the exec() system call, waiting for the inode write-out of the updated access time. The stalling thread should be the path checker (checkerloop()), which is trying to execute the priority callout, mpath_prio_netapp. So onlined paths aren't activated any more, since only the path checker activates onlined paths in a system.

Not using priority callouts, or specifying the "noatime" mount option (if the callout-related files are in the page cache), would work around the stall.

# mount -o remount,nodiratime,noatime /
We are losing all paths to the root device. It looks like commands on the active path are timing out and the mid layer is aborting those commands. The Emulex driver returns the aborted commands with DID_ERROR. The console log indicates that device-mapper marks a path down when it receives a DID_ERROR.

Kiyoshi, what is DM's behavior when it gets a command back with DID_ERROR?

Martin, I would like you to run the test one more time with the lpfc driver log level changed to 0x43.
Re: Comment#23

dm-multipath marks the path down and doesn't use it until multipathd activates it again. (But multipathd gets stalled when all paths go down, so no paths are activated any more.)
A command which fails with DID_ERROR should be retried.

More observations:
- Commands on the active path get aborted.
- Commands on the same path return UNIT ATTENTION with ASC/ASCQ 29 00. That means a bus reset is happening.

A command can be aborted in the following cases:
1. The command is timing out.
2. A bus reset was issued.
3. A lun reset was issued.

Since there is no I/O going on the system, it is less likely that the command is timing out. Is any application issuing a bus reset?

Output from rhel5.1-lpfc-verbose-2.TXT (the first word is the line number) indicates path 3-0-1-0 sees a command aborted first, and then the next command fails with UNIT ATTENTION and ASC/ASCQ 29 00, which is bus reset / power-on reset.

68920 lpfc 0000:02:00.0: 0:0729 FCP cmd x28 failed <1/0> status: x3 result: x16 Data: x76 x966
68921 lpfc 0000:02:00.0: 0:0710 Iodone <1/0> cmd e7d3f3c0, error x70000 SNS x0 x0 Data: x0 x0

68924 lpfc 0000:02:00.0: 0:0749 SCSI Layer I/O Abort Request Status x2002 ID 1 LUN 0 snum 0xee97

3-0-1-0 getting abort

68925 lpfc 0000:02:00.0: 0:0729 FCP cmd x0 failed <1/0> status: x1 result: x0 Data: x6b x98a
68926 lpfc 0000:02:00.0: 0:0730 FCP command x0 failed: x2 SNS xf0000600 x29000000 Data: x2 x0 x16 x0 x0
68933 lpfc 0000:02:00.0: 0:0710 Iodone <1/0> cmd e7d3f3c0, error x2 SNS x600f0 x29 Data: x0 x0

BUS RESET ON lun 3-0-1-0
Created attachment 252931 [details] New dumps

The attachment contains the following:
1) rhel5.1-lpfc-0x43.TXT - Sysrq dumps (10 minutes apart) of the debug kernel after the freeze. This is for the non-IO scenario with lpfc log verbose = 0x43, console log level = 8 and nmi_watchdog enabled.
2) messages-debug-3 - /var/log/messages for the above scenario.
Re: Comment#25

> A command which fails with DID_ERROR should be retried.

The current kernel doesn't propagate the error code to the device-mapper layer.
https://bugzilla.redhat.com/show_bug.cgi?id=168536
So dm-multipath doesn't retry any errors using the same path.

Is the DID_ERROR retried by the SCSI mid layer if dm-multipath doesn't set FAILFAST?
(related bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=304521)

> More observation
> - Command on active path get aborted.
> - Commands on same path returns UNIT ATTENTION attention and ASC ASQ as
>   29 00. It means bus reset is happening.
> A command can be aborted in following case
>   1. command is timing out.
>   2. bus reset issued.
>   3. Lun reset issued.
>
> Since there is no I/O going on system its less likely command is timing
> out.

multipathd is submitting I/Os periodically to check paths. (In this test case, multipathd uses direct I/O for it.)
I have a few queries on the above:

1) Why is the freeze seen only on SANbooted, root-device dm-multipathed hosts during fault injections - and that too for a non-IO scenario? Apparently the same non-SANbooted host successfully survives these fault injections even for intensive IO longevity runs on the data luns (of 72 hours and more).

2) Curiously, the SANbooted host without data luns mapped (i.e. only with the root lun) survives these fault injections, i.e. the freeze is seen only on a SANbooted host with additional data luns mapped to it during fault injections. Why?

3) The fault injection scripts are automated and perform offlining/onlining of the FCP paths every 10 minutes. If the unavailability of paths to the root lun is the cause of this issue, why does the freeze always set in only after 2-3 hours of fault iterations (and not at the beginning)?

4) As described before, the FC traces (I can share them if requested) reveal that the host initiators remain idle after the successful response to the REPORT LUNS commands during the freeze. Why don't the initiators go ahead with the subsequent SCSI queries like READ CAPACITY etc. in the above scenario (which is what occurs in a normal non-SANbooted environment)? Apparently they remain idle despite the target properly responding to the corresponding initiator queries, triggering the freeze.
Here is what I found out:

- During testing, we are losing all paths to the lun occasionally.

/var/log/messages >>
Nov 9 17:33:37 rhel5-1-rc1 multipathd: mpath2: remaining active paths: 0

- The system hangs when it loses all paths to the root device.

- Active luns are failing for two reasons:

1. The command is aborted by the SCSI mid layer because it is timing out. We are seeing UNIT ATTENTION with 29 00, which indicates the target is restarting. The target may be recovering from some error or may have received a bus reset.

CONSOLE LOG >>
44445 lpfc 0000:02:00.0: 0:0729 FCP cmd x2a failed <1/0> status: x3 result: x16 Data: x208 x8eb
44446 lpfc 0000:02:00.0: 0:0710 Iodone <1/0> cmd f0b3dce0, error x70000 SNS x0 x0 Data: x0 x0
44447 lpfc 0000:02:00.0: 0:0749 SCSI Layer I/O Abort Request Status ID 1 LUN 0 snum 0x81a7

2. The command is failing with sense key 2 (NOT READY) and ASC/ASCQ 40 10.

CONSOLE LOG >>
41782 lpfc 0000:02:00.1: 1:0729 FCP cmd x28 failed <1/2> status: x1 result: x1000 Data: x8a x957
41783 lpfc 0000:02:00.1: 1:0730 FCP command x28 failed: x2 SNS xf0000200 x4010000 Data: xa x1000 x16 x0 x0
41785 lpfc 0000:02:00.1: 1:0710 Iodone <1/2> cmd f09b472c, error x2 SNS x200f0 x104 Data: x0 x1000

The second reason makes me think we are seeing 29 00 because the target is going through some error recovery.

- Even though the system seems to be hung, the Emulex driver is processing RSCNs for the failed target when it comes back. We do see the REPORT LUNS completing fine but no further activity. The mid layer scans, sending the REPORT LUNS. This implies that the REPORT LUNS showed nothing different from what the mid layer already sees as present, and implies the SCSI status of the devices is still good too - thus it doesn't send any further I/O, as it has already done that.

CONSOLE LOG >>
45066 lpfc 0000:02:00.0: 0:0212 DSM out state 6 on NPort x10e00 Data: x7

From the Emulex driver's point of view, we are handling everything correctly. I don't see any DM activity after that.
To answer Martin's questions:

Q 1 and 2

MUKESH>> As Kiyoshi suggested in his comment 22, DM can cause a system hang when there is no path to the root device. I think he will explain better why this happens only with the root device.

Q 3

3) The fault injection scripts are automated and perform offlining/onlining of FCP paths every 10 minutes. If the non-availability of paths to the root lun is the cause of this issue, why does the freeze always set in only after 2-3 hours of fault iterations (and not in the beginning)?

MUKESH>> It takes that long to fail the active path to the root device. It may be because the target is going into some kind of error recovery mode after that long. As I mentioned in my previous comment (comment 29), the active path fails because:
- it is timing out and is aborted by the mid layer
- the target is reporting NOT READY with ASC/ASCQ 40 10

I am curious to know why the target is reporting the following errors and what they mean.

- The command is failing with sense key 2 (NOT READY) and ASC/ASCQ 40 10.

CONSOLE LOG >>
41782 lpfc 0000:02:00.1: 1:0729 FCP cmd x28 failed <1/2> status: x1 result: x1000 Data: x8a x957
41783 lpfc 0000:02:00.1: 1:0730 FCP command x28 failed: x2 SNS xf0000200 x4010000 Data: xa x1000 x16 x0 x0
41785 lpfc 0000:02:00.1: 1:0710 Iodone <1/2> cmd f09b472c, error x2 SNS x200f0 x104 Data: x0 x1000

Q 4

4) As described before, the FC traces (I can share them if requested) reveal that the host initiators remain idle after the successful response to the REPORT LUNS commands during the freeze. Why don't the initiators go ahead with subsequent SCSI queries like READ CAPACITY etc. in the above scenario (which is what occurs in a normal non-SANbooted environment)? Apparently they remain idle despite the target properly responding to the corresponding initiator queries, triggering the freeze.

MUKESH>> Please see my comment (comment #29, last bullet item). DM sends the following 3 commands after REPORT LUNS on each lun, after the Emulex driver discovers the remote ports:
28h READ
c0h VENDOR SPECIFIC
12h INQUIRY
Since DM seems to be in a hung state, we don't see any command after the REPORT LUNS.
Re: Comment#28

> 1) Why is the freeze seen only on SANbooted root device
>    dm-multipathed hosts during fault injections - and that too for a
>    non IO scenario? Apparently the same non SANbooted host successfully
>    survives these fault injections even for intensive IO longevity runs
>    on the data luns (of 72 hours & more).

I assume "non SANbooted" means dm-multipath is not used for the root device. If that's incorrect, please let me know.

The mpath_prio_netapp binary which is used by multipathd is on the root device. So if the root device is on dm-multipath and all of its paths go down, multipathd stalls when trying to execute mpath_prio_netapp. However, if the root device isn't on dm-multipath, multipathd can execute it and continue to work. So even if all paths of the data luns go down temporarily, those paths are activated by multipathd when they are onlined again. I/O on the data luns is irrelevant to the freeze.

> 2) Curiously, the SANbooted host without data luns mapped (i.e. only
>    with the root lun) survives these fault injections i.e. the freeze
>    is seen only on a SANbooted host with additional data luns mapped to
>    it during fault injections. Why?

I'm not sure, but I guess the no-path situation (error returns for the active paths from the device driver) doesn't happen in that case. (But if that is true, why does the device driver detect errors only when maps for data luns exist?) Anyway, if you attach the console log and /var/log/messages of that test, I'll check.

> 3) The fault injection scripts are automated and perform
>    offlining/onlining of FCP paths every 10 minutes. If the non
>    availability of paths to the root lun is the cause of this issue,
>    why does the freeze always set in only after 2-3 hours of fault
>    iterations (and not in the beginning)?

It's a timing issue. The freeze doesn't always happen when all paths for the root device go down. The behavior of multipathd is:

for (all paths) {
    1. check the path
    2. execute the priority callout, if the path is up
       (don't execute it if the path is down)
}

So when all paths go down before the path checking and multipathd detects the paths as down, the freeze doesn't happen. But when all paths go down after the path checking and before executing the priority callout, the freeze happens. I'm not sure why it always happens within almost the same time window.
By "non SANbooted", I did mean the root device not mounted on a dm-multipath partition.

Actually, you are right about there being no paths available to the root lun during the fault injections. But this is only for a small window, and the target does respond to basic SCSI commands like INQUIRY, REPORT LUNS etc. during this period. So it's never an issue with non-SANbooted scenarios, but apparently that's not so for root-device dm-multipath SANboot scenarios, as per your explanation.

I saw your proposal regarding building all the priority callouts into multipathd as library functions, like the path checkers. That does make sense and would hopefully resolve this issue. But in the current state, I don't think we can support the root device dm-multipath feature on RHEL 5.1.
Re: Comment#32

I agree with you. Unavailability of all paths must be avoided on RHEL 5.1's root multipath, even if it is temporary and within a very small window, if the storage uses a priority callout. That is a very hard limitation and the information should be made available to customers. Ben, or somebody from Red Hat, please provide such information via a knowledge base article or something.

The noatime mount in Comment#22 doesn't work if the callout isn't in the page cache. A possibly better workaround might be copying all callouts to a ramdisk and modifying multipath.conf to use them.
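A rough sketch of that workaround, assuming /dev/shm (tmpfs) is acceptable as the RAM-backed location and that mpath_prio_netapp is the only callout in use; note that a dynamically linked callout still needs its shared libraries reachable (or already in the page cache) for exec() to complete, so this is only a partial mitigation:

mkdir -p /dev/shm/mpath-callouts
cp /sbin/mpath_prio_netapp /dev/shm/mpath-callouts/

Then point the callout at the copy in /etc/multipath.conf (the argument format shown follows the shipped default and should be checked against the installed configuration):

    prio_callout "/dev/shm/mpath-callouts/mpath_prio_netapp /dev/%n"

and restart multipathd (service multipathd restart) so the new path is picked up.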
Since I don't want to drastically change how the priority callouts work in an update release, I'm not going to pull the libprio work into RHEL 5. Instead, I'm going to add the ramdisk code back into multipath. It was pulled out because it didn't work well with the pthread code in RHEL 4. However, pthreads work fine with it in RHEL 5.
I've committed the fix for this. The way it works, there should be no need to change the configuration. multipathd now creates a ramfs cache for all of the getuid and prio callouts that it uses. Once multipathd starts up, it doesn't matter if you lose access to the callout binaries, because multipathd has its own copies.

There is only one minor restriction. multipathd only adds callout programs to its cache on startup, not on reconfigure. This means that if you edit /etc/multipath.conf and add a "getuid_callout" or "prio_callout" line that needs a callout program which was not previously needed by any configuration in either /etc/multipath.conf or the default configs, you will need to restart multipathd for the binary to be loaded into multipathd's cache. Simply running

# multipath -k"reconfigure"

will not work. In the very rare case where a customer needs a prio_callout not supplied by the device-mapper-multipath package, or a specialized getuid_callout, and they have already started multipathd before they edited /etc/multipath.conf to include this information, the customer simply needs to run

# service multipathd restart

after editing /etc/multipath.conf, and everything should be fine.
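For example (illustrative values only - the vendor/product strings and the callout argument should match the array's actual configuration), adding a device section like the following after multipathd is already running means the newly referenced binary is not yet in the cache, so a restart rather than a reconfigure is needed:

devices {
    device {
        vendor       "NETAPP"
        product      "LUN"
        # a callout binary not previously used by any configuration
        prio_callout "/sbin/mpath_prio_ontap /dev/%n"
    }
}

# service multipathd restart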
added to RHEL5.2 release notes under "Resolved Issues":

<quote>
Root devices on multipathed hosts no longer freeze during fibre-channel path faults. multipathd now creates a ramfs cache for all getuid and prio callouts used. multipathd uses these cached callouts, making them persistent across possible path faults.

Note that these callouts are only cached during startup, and not during reconfiguration. If you add a callout to /etc/multipath.conf after startup, this callout will not be cached even if you run multipath -k"reconfigure". To ensure that callout additions to /etc/multipath.conf are cached, restart multipathd using "service multipathd restart".
</quote>

please advise if any revisions are in order. thanks!
*** Bug 431119 has been marked as a duplicate of this bug. ***
Partner NetApp has tested the 5.1 erratum package and the issue is still not resolved. See bug 428338, Comment #7 and Comment #8. https://bugzilla.redhat.com/show_bug.cgi?id=428338#c7
See if using the priority callout /sbin/mpath_prio_netapp.static fixes the problem. This is just a statically compiled version of the regular binary.
I did as you suggested and restarted the multipathd daemon. Then I ran the test script to simulate the faults (not the actual FC path faults yet), and the results do look promising. But with the new setting, the host does not boot after recreating the initrd, perhaps because the current RHEL 5.1 mkinitrd includes the statically linked prio binaries and renames them without the .static extension.
I'm glad to have this finally nailed down. For the actual fix, I was planning on simply making the non-static callouts symbolic links to the static ones. This seems like the easiest way to solve the problem, without forcing people to mess with their configuration. Could you make /sbin/mpath_prio_netapp a symlink to /sbin/mpath_prio_netapp.static and verify that everything works for you? This fixes everything for me.
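In other words, something like the following (mpath_prio_netapp is the callout in use here; the initrd rebuild mirrors the earlier SAN-boot procedure and is listed as an assumption rather than a requirement):

ln -sf /sbin/mpath_prio_netapp.static /sbin/mpath_prio_netapp
service multipathd restart
# assumption: rebuild the initrd so the boot-time setup sees the same layout
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)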
Making /sbin/mpath_prio_netapp a symlink to /sbin/mpath_prio_netapp.static solves the boot issue. Will update the results of the FC path faults later.
All the non-static callouts now are just symlinks to the static ones.
OK - it looks like the follow-up fix to this is now in bug 431947.
Actually, bz 431947 is now the stream version of this issue. This bug is being used for the entire solution in both the errata and rpm spec file. Sorry for the confusion.
added to RHEL5.2 release notes under "Resolved Issues":

<quote>
The priority callouts of dm-multipath are now statically compiled. This fixes a problem that occurs when running dm-multipath on devices containing the root filesystem, which caused such devices to freeze during fibre-channel path faults.
</quote>

please advise if any further revisions are required. thanks!
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Hi, the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. a mockup of the RHEL5.2 release notes can be viewed at the following link: http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html please use the aforementioned link to verify if your bugzilla is already in the release notes (if it needs to be). each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don
This fix is broken because it causes a regression reported in bz 439030, for some reason that I don't quite understand. If I clone the multipathd process and it gets its own stack, selinux in either permissive or enforcing mode causes multipathd to crash when device-mapper tries to create the multipath device nodes.
*** Bug 439030 has been marked as a duplicate of this bug. ***
Fixed. I use fork() and unshare(CLONE_NEWNS) to get my own namespace, which avoids the clone() call and means I don't have to create my own stack space, which was causing the problem with selinux.
Note to NetApp: this regression was originally reported in bug 438150, so assuming we'll be using this bug (bug 355961) in RHEL 5.2, then bug 438150 may be used for 5.1.z.
Ben, I can test on IA64
Confirmed the following 2 problems in device-mapper-multipath-0.4.7-16.el5 are fixed in device-mapper-multipath-0.4.7-17.el5. o multipathd stalls on IA64 + SELinux (Reported in bug 439030) o multipathd gets segfault on x86_64 + SELinux (Reported in bug 438150)
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
hi guys, does the release note for this bug (quoted in Comment# 54) still stand? please advise before April 15 if any revisions are required. thanks!
Don - probably not. Another fix went in as of Comment #61, so I'd assume this isn't up to date. The release notes would need to be updated to include this - not sure if you can do this or if Ben needs to supply it to you.
Setting to VERIFIED based on NEC's gracious testing, which seems to include the most recent item. Correct me if I'm wrong here.
ach. in that case, can somebody post the necessary edits to the RHEL5.2 release note for this bug? at present, it still appears as quoted in Comment#54. note that the deadline for the RHEL5.2 release notes is on April 15, at which point no further revisions will be entertained.
Ben's fix in Comment#61 doesn't affect the release note, so no change is required. But I think we could make it better, like:

<quote>
The priority callouts of dm-multipath are now statically compiled and copied into the memory of the monitoring daemon, multipathd. So multipathd doesn't require access to the root filesystem to execute the priority callouts. This fixes a multipathd stall problem that occurs when running dm-multipath on devices containing the root filesystem and all paths of the devices fail, which caused such devices to remain unavailable even after the failed paths are restored.
</quote>

It's just my comment. If it doesn't suit Red Hat, ignoring it is no problem for me.
thanks Kiyoshi. revising as follows:

<quote>
The priority callouts of dm-multipath are now statically compiled and copied into the memory of multipathd. As such, multipathd no longer requires access to the root filesystem in order to execute priority callouts. This fixes a problem that occurs when running dm-multipath on devices containing the root file system, which caused such devices to freeze during fibre-channel path faults.
</quote>

please advise before April 15 if any further revisions are required. thanks!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0337.html