Bug 821580 - [device-mapper] System hang/freeze when multipath over iSCSI got 1 iface down.
Status: CLOSED DUPLICATE of bug 800555
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.3
Hardware: All Linux
Priority: unspecified Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Mike Snitzer
QA Contact: Storage QE
Docs Contact:
Depends On:
Blocks: 840683
Reported: 2012-05-14 22:12 EDT by Gris Ge
Modified: 2012-09-21 15:36 EDT

Doc Type: Bug Fix
Last Closed: 2012-09-21 15:36:01 EDT
Type: Bug

Attachments:
console log when trigger this bug (4.06 MB, application/octet-stream)
2012-08-08 04:11 EDT, Gris Ge
block and scsi error throttling patch (4.00 KB, patch)
2012-09-04 09:33 EDT, Mike Snitzer

Description Gris Ge 2012-05-14 22:12:56 EDT
Description of problem:

With 50+ LUNs connected, each through 4 iSCSI targets, taking one interface down causes the system to hang/freeze.

The console is flooded with these errors:
====
end_request: I/O error, dev dm-9, sector 1445847
====

Bug #800555 applies a rate limit to the SCSI layer printk; we need to check whether device-mapper needs the same rate limit.

As each mpath device has 50+ partitions, udev might be the one issuing the I/O.


Version-Release number of selected component (if applicable):
kernel -268

How reproducible:
100%

Steps to Reproduce:
1. Use the attached tool (tgtd.sh) to create 50 LUNs, each exported via 4 iSCSI targets.
2. Use these commands to log in to the iSCSI targets:
====
iscsiadm -m discovery -t st -p localhost
iscsiadm -m node -l
====
3. Use these commands to enable multipath:
====
mpathconf --enable
service multipathd start
====
4. Use these commands to create 52 partitions on each mpath device (not sure whether this step is necessary):
===========
fdisk /dev/mapper/mpathe << EOF
n
e
1


w
EOF

for X in `seq 5 54`;do
fdisk /dev/mapper/mpathe << EOF
n
l

+10M
t
$X
8e
w
EOF
done

for X in `multipath -l |grep mpath \
| perl -ne 'print "$1 " if /(mpath[a-z]+)/'`;
do
    sfdisk /dev/mapper/mpathe -d -f \
      | sfdisk /dev/mapper/$X;
done
===========

5. Use these commands to create the mpath partitions (the kpartx rule is different from the udev rule, so we use the udev way):
======
multipath -F
multipath -r
======

6. Log out of the iSCSI sessions:
======
iscsiadm -m node -u
======

Actual results:
The console is flooded with "end_request: I/O error, dev dm-9, sector 1445847".
The OS freezes.

Expected results:
The OS does not freeze.

Additional info:

This bug only requests a rate limit on the error messages printed by the kernel. Requesting an exception.
Comment 4 Mike Snitzer 2012-08-07 16:45:59 EDT
So you're creating 50 mpath devices, each with 52 partitions, with the tgt target and the iSCSI client on the same machine.

Once the multipath devices (and partitions) are active, you're tearing down all the iSCSI sessions.

This causes _all_ paths to the multipath devices to fail simultaneously.

Odd test.  Unlikely we'll do anything to throttle the kernel's error messages.  The OS freezing needs to be understood though.

Do you happen to have console access and do you have any understanding what went wrong?  (do you have a console trace that shows some stack trace and/or crash?).

Just needs reproducing, preferably against RHEL6.3.. really doubtful all the partition creation has anything to do with this issue.
Comment 5 Gris Ge 2012-08-08 01:43:05 EDT
Mike,

It might be the console that slows the OS down when kernel error messages flood it.

It seems there is an error message rate-limit patch applied to the SCSI layer, which might fix this issue.

I will try to reproduce on RHEL 6.3 GA again and keep you posted.
Comment 6 Gris Ge 2012-08-08 04:11:25 EDT
Created attachment 602962 [details]
console log when trigger this bug

Mike,

I reproduced this problem on RHEL 6.3 GA.

The console was flooded by I/O errors on dm-XX (multipath devices), which froze the OS. I would like a rate limit applied to these error messages.

I have attached the console log.
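
To gauge how severe the flood in such a console log is, the error lines can be counted per device. This is only a rough sketch: the function name `summarize_io_errors` is made up for illustration, and it assumes the log lines have the same shape as the "end_request: I/O error, dev dm-9, sector 1445847" message quoted in the description.

```shell
# Count "end_request: I/O error" lines per device, busiest device first.
# Reads a console log on stdin. The function name is hypothetical.
summarize_io_errors() {
    # Keep only the "end_request: I/O error, dev <name>" part of each line,
    # then take the last word (the device name) and tally it.
    grep -o 'end_request: I/O error, dev [^,]*' \
        | awk '{ print $NF }' \
        | sort | uniq -c | sort -rn
}
```

It would be run as `summarize_io_errors < console.log`, where `console.log` stands for wherever the attached log was saved.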
Comment 7 Mike Snitzer 2012-08-08 09:24:51 EDT
(In reply to comment #6)
> Created attachment 602962 [details]
> console log when trigger this bug
> 
> Mike,
> 
> I reproduced this problem on RHEL 6.3 GA.
> 
> The console was flooded by the I/O error on dm-XX (multipath devices) which
> freeze OS. I would like to rate limit apply to these error messages.
> 
> I have attached the console log.

Seems there is something pathological about all iscsi sessions being dropped simultaneously. multipathd is attempting to reload all the multipath tables, but that is failing because all the iscsi devices no longer exist (hence: "multipath: error getting device" for each path).

It'd be useful to get the /var/log/messages from the same test cycle; this should give us more information about what multipathd is doing.

I'm not sure what the right response would be to this situation; but if a device no longer exists there clearly isn't any point trying to push down a multipath table that references the missing device(s).

Cc'ing Ben to get his insight.
Comment 8 Ben Marzinski 2012-08-08 17:26:59 EDT
The issue is that multipathd gets those remove uevents one at a time. So, when it gets the request to remove the first path, it doesn't know that the others have been removed. I suppose it would be possible to revalidate all of a multipath device's paths whenever one of them is removed.  I'm not sure that this would be the best idea for all cases.  Those uevents can pile up, and multipathd needs to deal with them quickly.  Also, this wouldn't change the number of IO error messages.
Comment 9 Mike Snitzer 2012-09-04 09:27:32 EDT
Upstream has started to accept an error throttling patch for block and SCSI (block chunk was accepted, SCSI hasn't been yet):
http://www.open-fcoe.org/patchwork/patch/2655/

But looking at the log from comment#6 it seems the block patch would help the most.

Though we might look to rate limit these DM messages too:
device-mapper: table: 253:8: multipath: error getting device                                           
device-mapper: ioctl: error adding target to table
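
To illustrate what throttling would do to a flood like the one in the log, here is a userspace sketch only. It is not the kernel patch linked above and does not reproduce its logic: it is count-based rather than time-based, it groups messages by the text before the first colon, and the function name `throttle_by_prefix` and the default burst of 10 are made up for illustration.

```shell
# Pass at most BURST lines per message prefix, then report how many were dropped.
# Usage: throttle_by_prefix [BURST] < log   (BURST defaults to 10, an arbitrary choice)
throttle_by_prefix() {
    awk -F: -v burst="${1:-10}" '
        { seen[$1]++ }                      # $1 is the text before the first ":"
        seen[$1] <= burst { print; next }   # still within the burst: let it through
        { dropped[$1]++ }                   # over the burst: count it, print nothing
        END {
            for (k in dropped)
                printf "%s: %d messages suppressed\n", k, dropped[k]
        }
    '
}
```

Run over the console log with a small burst, this keeps the first few "end_request" and "device-mapper" lines and collapses the rest into one summary line per prefix.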
Comment 10 Mike Snitzer 2012-09-04 09:33:42 EDT
Created attachment 609686 [details]
block and scsi error throttling patch

Proposed patch from http://www.open-fcoe.org/patchwork/patch/2655/
Comment 12 Mike Snitzer 2012-09-21 15:36:01 EDT

*** This bug has been marked as a duplicate of bug 800555 ***
