Bug 220668
Summary: | [IT 109546] RHEL 4 U4 + EMC PowerPath 4.5.1 doesn't immediately fail over on cable pull | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Issue Tracker <tao> | ||||||
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> | ||||||
Status: | CLOSED NOTABUG | QA Contact: | Brian Brock <bbrock> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 4.4 | CC: | coughlan, dmilburn, kearnan_keith, mmahut, tao | ||||||
Target Milestone: | --- | Keywords: | Regression | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2007-02-22 19:36:15 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Issue Tracker
2006-12-22 21:43:01 UTC
In some environments, EMC PowerPath 4.5.1 on RHEL 4 U4 will not immediately fail over on a cable pull. Specifically, I/O to mounted filesystems will proceed uninterrupted, but reads from raw devices will hang. This failure has not been observed with RHEL 4 U2, using the same PowerPath software, so it appears to be a regression. dm-multipath fails over correctly, so this could be a latent PowerPath bug exposed by changes in the kernel, or possibly even a bug in an application failing to properly handle a non-fatal error returned by a read from a raw device. The exact cause has not yet been diagnosed, but I am opening this bug for public discussion between the various interested parties. Customer can reproduce the problem on RHEL 4 U4, EMC cannot with apparently identical configuration. Setting scsi_logging_level=0xffff results in too much noise and not enough useful data. The definitions for this parameter are in drivers/scsi/scsi_logging.h: #define SCSI_LOG_ERROR_SHIFT 0 #define SCSI_LOG_TIMEOUT_SHIFT 3 #define SCSI_LOG_SCAN_SHIFT 6 #define SCSI_LOG_MLQUEUE_SHIFT 9 #define SCSI_LOG_MLCOMPLETE_SHIFT 12 #define SCSI_LOG_LLQUEUE_SHIFT 15 #define SCSI_LOG_LLCOMPLETE_SHIFT 18 #define SCSI_LOG_HLQUEUE_SHIFT 21 #define SCSI_LOG_HLCOMPLETE_SHIFT 24 #define SCSI_LOG_IOCTL_SHIFT 27 Each of the fields are 3 bits wide, so they go 0 (off) to 7 (everything) For example, scsi_logging_level=0x63 would set scan logging to 1, timeout logging to 4, and error logging to 3. To get a better idea of where we might need to instrument, I'd like to see sysrq-t data taken during the I/O test, once running normally, and once immediately after a cable pull. Hello, after further investigation, it was determined the most helpful information we need before providing debug code is: The sysrq-m data taken during the I/O test, once running normally, and once immediately after a cable pull. We would like this from both the customer site (the reproduction environment) and from the lab (non-production environment). To do this, enable sysrq: echo 1 > /proc/sys/kernel/sysrq Then immediately before the cable pull, issue: echo m > /proc/sysrq-trigger Then wait one minute and again, issue: echo m > /proc/sysrq-trigger Then post the /var/log/messages file for the system here. Thanks, J Created attachment 144409 [details] Messages file from EMC lab server for comment #3 This is the messages file in response to comment #3. This is from EMC's lab server. Please note that everything performs as expected on this server. Attaching outputs (messages.gz) from above testing. Start time: 14:43:33 End time: 14:44:08 Error during memory dump: Beginning test 1 '2k 20% Seq 40% Write': Ignore: 10 sec Duration: 4 hour I/Os per Second: 500 12% pattern 1 '2k Seq Read' 8% pattern 2 '2k Seq Write' 48% pattern 3 '2k Random Read' 32% pattern 4 '2k Random Write' Test 1 starting, file '/dev/raw/raw1', device 1 copy 1, at Thu Dec 28 2006 14:30:22 Test 1 starting, file '/dev/raw/raw2', device 2 copy 1, at Thu Dec 28 2006 14:30:22 Test 1 starting, file '/dev/raw/raw3', device 3 copy 1, at Thu Dec 28 2006 14:30:22 Test 1 starting, file '/dev/raw/raw4', device 4 copy 1, at Thu Dec 28 2006 14:30:22 Test 1 starting, file '/filesystem1/3gfile', device 5 copy 1, at Thu Dec 28 2006 14:30:22 Test 1 starting, file '/filesystem2/3gfile', device 6 copy 1, at Thu Dec 28 2006 14:30:22 /filesystem1/3gfile: Input/output error ./iorate: ERROR: Read of 2k bytes to file 5 '/filesystem1/3gfile' returned -1 ./iorate: ERROR: BAILING out of test due to errors /dev/raw/raw2: Input/output error ./iorate: ERROR: Read of 2k bytes to file 2 '/dev/raw/raw2' returned -1 ./iorate: ERROR: BAILING out of test due to errors ./iorate: ERROR: Test 1, file '/dev/raw/raw2', device 2 copy 1, died with status 60672 ./iorate: ERROR: Test 1, file '/filesystem1/3gfile', device 5 copy 1, died with status 60672 Created attachment 144489 [details]
Messages from customer site.
As discussed on call today, I'm not sure that it's fair to call the result of this test a "failure". From a POSIX perspective, EIO is a completely valid result of a read() call that an application is supposed to handle however it sees fit. Any sane application would attempt to retry the operation at least once before bailing out, particularly when operating on raw devices. Along this line, I'd like to see what happens if we modify iorate to retry and tell us how many failures it experiences. If failover occurs successfully after a single raw IO operation fails, then I think PowerPath is doing the right thing, and something lower down is to blame for the I/O errors that the customer didn't see in U2. If repeated dependent failures occur, then PowerPath isn't failing over properly. To this end, I'd like a pointer to the version of iorate that the customer is using for this test, so I can create a version that is identical except for the retry patch. This event sent from IssueTracker by csnook issue 109546 In the interest of determining the difference between the reproducing and non-reproducing cases, I've written a general-purpose kprobe to instrument any arbitrary kernel function that appears in /proc/kallsyms, which includes functions in the PowerPath modules. The present version can only instrument one function at a time, so you'll want to load it multiple times under different names (-o option to modprobe) to instrument multiple functions. I'll be improving it soon to remove this limitation. I'm posting it at http://people.redhat.com/csnook/callprobe/ and will put updated versions there as well. When using the above module, be sure to specify the target function in the code before compilation. Replace: char *target="invalid dummy name"; With the target function, such as: char *target="scsi_cmd_ioctl"; J Below are amended instructions on how to use this system tap module mentioned in this BZ. CORRECTION: You only need to compile the module once and do not need to alter the "target" variable at compile time as it can be overridden at module load time. The following instructions can be used this module. ************************************ Using callprobe.ko ************************************ - After installing the kernel-devel packages for the kernel you are running, compile the vanilla the callprobe.c file using the following command: make -C /lib/modules/`uname -r`/build M=`pwd` modules - Make a module updates directory in you module tree for the kernel: mkdir /lib/modules/`uname -r`/updates - Copy the module into the module tree for the kernel: cp callprobe.ko /lib/modules/`uname -r`/updates/ - Run depmod. depmod - For each function you are interested in monitoring, use the following modprobe command: modprobe -v callprobe -o callprobe_<uniqueID> target=<function> Replace <uniqueID> in the command with a unique identifier (such as the function name or a number). Replace <function> with the function you wish to instrument. This module will work with PP functions and will report to var/log/messages when the function is accessed. If the failing case is not hitting a function one expects it to hit, then this module will tell us that. Here are the functions we are interested in: lpfc_can_disctmo lpfc_set_failmask lpfc_disc_state_machine lpfc_target_block lpfc_els_abort lpfc_target_remove lpfc_target_unblock lpfc_set_disctmo lpfc_findnode_wwpn So issue the following commands: modprobe -v callprobe -o callprobe_lpfc_can_disctmo target=lpfc_can_disctmo modprobe -v callprobe -o callprobe_lpfc_set_failmask target=lpfc_set_failmask modprobe -v callprobe -o callprobe_lpfc_disc_state_machine target=lpfc_disc_state_machine modprobe -v callprobe -o callprobe_lpfc_target_block target=lpfc_target_block modprobe -v callprobe -o callprobe_lpfc_els_abort target=lpfc_els_abort modprobe -v callprobe -o callprobe_lpfc_target_remove target=lpfc_target_remove modprobe -v callprobe -o callprobe_lpfc_target_unblock target=lpfc_target_unblock modprobe -v callprobe -o callprobe_lpfc_set_disctmo target=lpfc_set_disctmo modprobe -v callprobe -o callprobe_lpfc_findnode_wwpn target=lpfc_findnode_wwpn This problem has been identified as a PowerPath bug. The root cause of the problem is mishandling of multiple concurrent failovers under certain rare scenarios involving heavy I/O. The problem was introduced after certain performance improvement changes in PowerPath, that inadvertently affected these scenarios. The resolution is changing the logic and concurrency scheme within involved code paths. The performance improvements will still work as before. A fix is current under test. This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP. Per comment 7, this is a PowerPath bug. No change in needed in RHEL. Closing. |