Bug 996900 - Kernel panic when running traffic over raid10 md device created over 4 dm devices
Status: CLOSED DUPLICATE of bug 982360
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Jes Sorensen
QA Contact: Red Hat Kernel QE team

Reported: 2013-08-14 05:09 EDT by Roi Dayan
Modified: 2013-10-08 08:20 EDT
CC List: 10 users

Doc Type: Bug Fix
Last Closed: 2013-10-08 08:20:51 EDT
Type: Bug

Attachments:
kernel panic (5.59 KB, text/plain) - 2013-08-14 05:09 EDT, Roi Dayan
multipath list output (4.33 KB, text/plain) - 2013-08-14 05:11 EDT, Roi Dayan
scsi id script (59 bytes, application/x-shellscript) - 2013-08-22 06:43 EDT, Roi Dayan
multipath config file (347 bytes, text/plain) - 2013-08-22 06:46 EDT, Roi Dayan
tgt configuration script (1.03 KB, application/x-shellscript) - 2013-08-22 06:50 EDT, Roi Dayan

Description Roi Dayan 2013-08-14 05:09:20 EDT
Created attachment 786465 [details]
kernel panic

Description of problem:

Kernel panic with a NULL pointer dereference when running traffic over a raid10 md device created over 4 dm devices that have 4 paths each.


Versions:

Linux vsa9 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

kernel-2.6.32-358.el6.x86_64
device-mapper-multipath-0.4.9-64.el6.x86_64


How reproducible:


Steps to Reproduce:
1. discover 4 LUNs over 4 paths.

2. rescan multipath (see the sketch after these steps).
   see attached multipath_output.txt file.

3. create raid10 array with mdadm.
   mdadm --create --verbose /dev/md0 --level=raid10 --raid-devices=4 /dev/dm*

4. run dd with flag direct.
   dd if=/dev/zero of=/dev/md0 bs=128k count=100 oflag=direct

Reproduced with iSCSI, SRP and iSER.
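
For steps 1 and 2, a minimal sketch assuming the open-iscsi initiator (the portal below is a placeholder; the actual portals used are listed in comment 8):

# iscsiadm -m discovery --type sendtargets --portal <target-ip>:3260 -l
# iscsiadm -m session --rescan
# multipath -r

The -l flag logs in to each discovered target, the session rescan picks up the 4 LUNs on every path, and multipath -r reloads the maps so each LUN appears as a single dm device.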



Actual results:
Kernel panic. See the attached panic.txt file.


Expected results:
dd successfully writes the requested blocks.
Comment 1 Roi Dayan 2013-08-14 05:11:05 EDT
Created attachment 786466 [details]
multipath list output
Comment 3 Ben Marzinski 2013-08-21 13:33:29 EDT
I can't recreate this with FC attached storage.  I'm going to try with iscsi. I don't currently have any hardware that will let me try srp or iser.  Have you tried taking multipath out of the picture and verifying that you can run this with md raid10 directly on top of the scsi devices?
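
A sketch of that control test, picking one scsi path per LUN (the /dev/sd names below are placeholders for whatever the 4 LUNs enumerate as):

# mdadm --stop /dev/md0
# mdadm --create --verbose /dev/md0 --level=raid10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# dd if=/dev/zero of=/dev/md0 bs=128k count=100 oflag=direct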
Comment 4 Ben Marzinski 2013-08-21 15:53:45 EDT
I'm also not able to recreate this using the in-kernel open-iscsi initiator and the Linux target framework (tgt) software target.
Comment 5 Roi Dayan 2013-08-22 06:43:08 EDT
Created attachment 789132 [details]
scsi id script
Comment 6 Roi Dayan 2013-08-22 06:46:17 EDT
Created attachment 789137 [details]
multipath config file
Comment 7 Roi Dayan 2013-08-22 06:50:09 EDT
Created attachment 789139 [details]
tgt configuration script
Comment 8 Roi Dayan 2013-08-22 06:52:43 EDT
(In reply to Ben Marzinski from comment #4)
> I'm also not able to recreate this using the in-kernel open-iscsi initiator,
> and the linux target framework software target.

Verified again. The crash reproduces with raid over multipath but does not reproduce with raid over the raw devices.

Attached is the script used to create the targets and LUNs on the target side.
Also attached are the multipath.conf and the script used with it to create the DM devices over 4 paths each on the initiator side.

The command used for the traffic is:
# dd if=/dev/md0 of=/dev/null iflag=direct count=1 bs=128K

discovery was done as follows:
# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.30.4.71:3261 -l
# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.30.4.71:3262 -l
Comment 9 Chandra Seetharaman 2013-09-20 12:37:08 EDT
Roi, Ben,

We have seen a similar NULL pointer dereference in __bio_add_page() when using multipath and raid10.

We have been able to create a consistent reproduction scenario with dm devices (no dm-multipath involved) and raid10.

The following script reproduces the very same problem quickly on our system. Note that the script assumes no other loop devices or dm devices are present (please adjust it accordingly if any exist on your system).

-----------
#!/bin/bash

T_DIR="/tmp/test_raid"

[[ ! -d ${T_DIR} ]] && mkdir ${T_DIR}
echo "using working directory: ${T_DIR}"

# Create 8 backing files, attach each to a loop device, and wrap each in a linear dm target.
for i in `seq 1 8`; do
    [[ ! -f ${T_DIR}/f_${i} ]] && dd if=/dev/zero of=${T_DIR}/f_${i} count=1 bs=1048576

    if [[ ! -e /dev/loop${i} ]]; then
       echo "Make loop device: /dev/loop${i}"
       mknod -m660 /dev/loop${i} b 7 ${i}
       chown root.disk /dev/loop${i}
       chmod 666 /dev/loop${i}
    fi
    losetup /dev/loop${i} ${T_DIR}/f_${i}

    echo "0 `blockdev --getsize /dev/loop${i}` linear /dev/loop${i} 0" | dmsetup create dm-01${i}
done

[[ ! -e  /proc/mdstat ]] && modprobe md

mdadm -v --create --level=raid10 --raid-devices=8 /dev/md0 /dev/dm-*

sleep 5

mkfs -t ext2 /dev/md0
[[ ! -d /mnt/raid10 ]] && mkdir /mnt/raid10
mount /dev/md0 /mnt/raid10

cp /boot/config* /mnt/raid10/.
-----------
It crashes immediately after the copy occurs, the very same way it happened with multipath and raid10.
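
Since the script assumes a clean system, a cleanup sketch for tearing everything down before a re-run (same assumptions as above: loop devices 1 through 8, dm names dm-011 through dm-018, mount point /mnt/raid10):

-----------
#!/bin/bash

# Undo the reproducer in reverse order.
umount /mnt/raid10
mdadm --stop /dev/md0

for i in `seq 1 8`; do
    dmsetup remove dm-01${i}
    losetup -d /dev/loop${i}
done

# Remove the backing files (T_DIR from the reproducer).
rm -rf /tmp/test_raid
-----------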

After researching upstream we found the following patch that fixed the crash.

   md/raid10: fix problem with on-stack allocation of r10bio structure.
   https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=e0ee778528bbaad28a5c69d2e219269a3a096607

Roi, you can try this patch and see if it fixes your panic.
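
A quick way to check whether an installed kernel package already carries the backport is to search its changelog for the upstream subject line (assuming the backport keeps that subject):

# rpm -q --changelog kernel | grep -i "on-stack allocation of r10bio"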
