Bug 220668

Summary: [IT 109546] RHEL 4 U4 + EMC PowerPath 4.5.1 doesn't immediately fail over on cable pull
Product: Red Hat Enterprise Linux 4 Reporter: Issue Tracker <tao>
Component: kernelAssignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: high    
Version: 4.4CC: coughlan, dmilburn, kearnan_keith, mmahut, tao
Target Milestone: ---Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-02-22 19:36:15 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Messages file from EMC lab server for comment #3
none
Messages from customer site. none

Description Issue Tracker 2006-12-22 21:43:01 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Chris Snook 2006-12-22 21:52:39 UTC
In some environments, EMC PowerPath 4.5.1 on RHEL 4 U4 will not immediately fail
over on a cable pull.  Specifically, I/O to mounted filesystems will proceed
uninterrupted, but reads from raw devices will hang.  This failure has not been
observed with RHEL 4 U2, using the same PowerPath software, so it appears to be
a regression.  dm-multipath fails over correctly, so this could be a latent
PowerPath bug exposed by changes in the kernel, or possibly even a bug in an
application failing to properly handle a non-fatal error returned by a read from
a raw device.  The exact cause has not yet been diagnosed, but I am opening this
bug for public discussion between the various interested parties.

Comment 2 Chris Snook 2006-12-22 22:41:23 UTC
Customer can reproduce the problem on RHEL 4 U4, EMC cannot with apparently
identical configuration.  Setting scsi_logging_level=0xffff results in too much
noise and not enough useful data.  The definitions for this parameter are in
drivers/scsi/scsi_logging.h:

#define SCSI_LOG_ERROR_SHIFT              0
#define SCSI_LOG_TIMEOUT_SHIFT            3
#define SCSI_LOG_SCAN_SHIFT               6
#define SCSI_LOG_MLQUEUE_SHIFT            9
#define SCSI_LOG_MLCOMPLETE_SHIFT         12
#define SCSI_LOG_LLQUEUE_SHIFT            15
#define SCSI_LOG_LLCOMPLETE_SHIFT         18
#define SCSI_LOG_HLQUEUE_SHIFT            21
#define SCSI_LOG_HLCOMPLETE_SHIFT         24
#define SCSI_LOG_IOCTL_SHIFT              27

Each of the fields are 3 bits wide, so they go 0 (off) to 7 (everything)

For example, scsi_logging_level=0x63 would set scan logging to 1, timeout
logging to 4, and error logging to 3.

To get a better idea of where we might need to instrument, I'd like to see
sysrq-t data taken during the I/O test, once running normally, and once
immediately after a cable pull.

Comment 3 Johnray Fuller 2006-12-26 19:20:59 UTC
Hello,

after further investigation, it was determined the most helpful information we
need before providing debug code is:

The sysrq-m data taken during the I/O test, once running normally, and once
immediately after a cable pull.

We would like this from both the customer site (the reproduction environment)
and from the lab (non-production environment).

To do this, enable sysrq:

echo 1 > /proc/sys/kernel/sysrq

Then immediately before the cable pull, issue:

echo m > /proc/sysrq-trigger

Then wait one minute and again, issue:

echo m > /proc/sysrq-trigger

Then post the /var/log/messages file for the system here.

Thanks,
J



Comment 4 Keith Kearnan 2006-12-27 17:35:25 UTC
Created attachment 144409 [details]
Messages file from EMC lab server for comment #3

This is the messages file in response to comment #3.  This is from EMC's lab
server.  Please note that everything performs as expected on this server.

Comment 5 Jerry Levy 2006-12-28 20:25:40 UTC
Attaching outputs (messages.gz) from above testing. 

Start time:     14:43:33 
End time:       14:44:08 

Error during memory dump: 
        Beginning test 1 '2k  20% Seq  40% Write': 
                     Ignore: 10 sec 
                     Duration: 4 hour 
                     I/Os per Second: 500 
                 12% pattern 1 '2k Seq Read' 
                  8% pattern 2 '2k Seq Write' 
                 48% pattern 3 '2k Random Read' 
                 32% pattern 4 '2k Random Write' 
Test 1 starting, file '/dev/raw/raw1', device 1 copy 1, at Thu Dec 28 2006 
14:30:22 
Test 1 starting, file '/dev/raw/raw2', device 2 copy 1, at Thu Dec 28 2006 
14:30:22 
Test 1 starting, file '/dev/raw/raw3', device 3 copy 1, at Thu Dec 28 2006 
14:30:22 
Test 1 starting, file '/dev/raw/raw4', device 4 copy 1, at Thu Dec 28 2006 
14:30:22 
Test 1 starting, file '/filesystem1/3gfile', device 5 copy 1, at Thu Dec 28 
2006 14:30:22 
Test 1 starting, file '/filesystem2/3gfile', device 6 copy 1, at Thu Dec 28 
2006 14:30:22 
/filesystem1/3gfile: Input/output error 
./iorate: ERROR: Read of 2k bytes to file 5 '/filesystem1/3gfile' returned -1 
./iorate: ERROR: BAILING out of test due to errors 
/dev/raw/raw2: Input/output error 
./iorate: ERROR: Read of 2k bytes to file 2 '/dev/raw/raw2' returned -1 
./iorate: ERROR: BAILING out of test due to errors 
./iorate: ERROR: Test 1, file '/dev/raw/raw2', device 2 copy 1, died with 
status 60672 
./iorate: ERROR: Test 1, file '/filesystem1/3gfile', device 5 copy 1, died with 
status 60672 



Comment 6 Jerry Levy 2006-12-28 20:26:22 UTC
Created attachment 144489 [details]
Messages from customer site.

Comment 9 Issue Tracker 2006-12-30 08:17:58 UTC
As discussed on call today, I'm not sure that it's fair to call the
result of this test a "failure".  From a POSIX perspective, EIO is a
completely valid result of a read() call that an application is supposed to
handle however it sees fit.  Any sane application would attempt to retry
the operation at least once before bailing out, particularly when operating
on raw devices.  Along this line, I'd like to see what happens if we
modify iorate to retry and tell us how many failures it experiences.  If
failover occurs successfully after a single raw IO operation fails, then I
think PowerPath is doing the right thing, and something lower down is to
blame for the I/O errors that the customer didn't see in U2.  If repeated
dependent failures occur, then PowerPath isn't failing over properly.  To
this end, I'd like a pointer to the version of iorate that the customer is
using for this test, so I can create a version that is identical except for
the retry patch.


This event sent from IssueTracker by csnook 
 issue 109546

Comment 10 Chris Snook 2006-12-30 10:19:48 UTC
In the interest of determining the difference between the reproducing and
non-reproducing cases, I've written a general-purpose kprobe to instrument any
arbitrary kernel function that appears in /proc/kallsyms, which includes
functions in the PowerPath modules.  The present version can only instrument one
function at a time, so you'll want to load it multiple times under different
names (-o option to modprobe) to instrument multiple functions.  I'll be
improving it soon to remove this limitation.  I'm posting it at
http://people.redhat.com/csnook/callprobe/ and will put updated versions there
as well.

Comment 11 Johnray Fuller 2007-01-02 17:49:28 UTC
When using the above module, be sure to specify the target function in the code
before compilation.

Replace:

char *target="invalid dummy name";

With the target function, such as:

char *target="scsi_cmd_ioctl"; 

J

Comment 12 Johnray Fuller 2007-01-02 22:23:47 UTC
Below are amended instructions on how to use this system tap module mentioned in
this BZ.

CORRECTION: You only need to compile the module once and do not need to alter
the "target" variable at compile time as it can be overridden at module load time.

The following instructions can be used this module.

************************************
Using callprobe.ko
************************************

- After installing the kernel-devel packages for the kernel you are running,
compile the vanilla the callprobe.c file using the following command:

make -C /lib/modules/`uname -r`/build M=`pwd` modules

- Make a module updates directory in you module tree for the kernel:

mkdir /lib/modules/`uname -r`/updates

- Copy the module into the module tree for the kernel:

cp callprobe.ko /lib/modules/`uname -r`/updates/

- Run depmod.

depmod

- For each function you are interested in monitoring, use the following modprobe
command:

modprobe -v callprobe -o callprobe_<uniqueID> target=<function>

Replace <uniqueID> in the command with a unique identifier (such as the function
name or a number).

Replace <function> with the function you wish to instrument.

This module will work with PP functions and will report to var/log/messages when
the function is accessed. If the failing case is not hitting a function one
expects it to hit, then this module will tell us that.

Comment 14 Johnray Fuller 2007-01-03 21:42:53 UTC
Here are the functions we are interested in:

lpfc_can_disctmo
lpfc_set_failmask
lpfc_disc_state_machine
lpfc_target_block
lpfc_els_abort
lpfc_target_remove
lpfc_target_unblock
lpfc_set_disctmo
lpfc_findnode_wwpn

So issue the following commands:

modprobe -v callprobe -o callprobe_lpfc_can_disctmo target=lpfc_can_disctmo
modprobe -v callprobe -o callprobe_lpfc_set_failmask target=lpfc_set_failmask
modprobe -v callprobe -o callprobe_lpfc_disc_state_machine
target=lpfc_disc_state_machine
modprobe -v callprobe -o callprobe_lpfc_target_block target=lpfc_target_block
modprobe -v callprobe -o callprobe_lpfc_els_abort target=lpfc_els_abort
modprobe -v callprobe -o callprobe_lpfc_target_remove target=lpfc_target_remove
modprobe -v callprobe -o callprobe_lpfc_target_unblock target=lpfc_target_unblock
modprobe -v callprobe -o callprobe_lpfc_set_disctmo target=lpfc_set_disctmo
modprobe -v callprobe -o callprobe_lpfc_findnode_wwpn target=lpfc_findnode_wwpn

Comment 15 Keith Kearnan 2007-01-24 19:29:57 UTC
This problem has been identified as a PowerPath bug.  The root cause of the 
problem is mishandling of multiple concurrent failovers under certain rare 
scenarios involving heavy I/O.  The problem was introduced after certain 
performance improvement changes in PowerPath, that inadvertently affected these 
scenarios.  The resolution is changing the logic and concurrency scheme within 
involved code paths.  The performance improvements will still work as before.  
A fix is current under test.

Comment 16 RHEL Program Management 2007-02-20 20:44:50 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 17 Tom Coughlan 2007-02-22 19:36:15 UTC
Per comment 7, this is a PowerPath bug. No change in needed in RHEL. Closing.