Bug 996900

Summary: Kernel panic when running traffic over raid10 md device created over 4 dm devices
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.4
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Target Milestone: rc
Reporter: Roi Dayan <roid>
Assignee: Jes Sorensen <Jes.Sorensen>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: acathrow, agk, bmarzins, dwysocha, heinzm, msnitzer, prajnoha, prockai, sekharan, zkabelac
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-10-08 12:20:51 UTC
Attachments:
kernel panic
multipath list output
scsi id script
multipath config file
tgt configuration script

Description Roi Dayan 2013-08-14 09:09:20 UTC
Created attachment 786465 [details]
kernel panic

Description of problem:

Kernel panic with a NULL pointer dereference when running traffic over a raid10 md device created over 4 dm devices, each with 4 paths.


Versions:

Linux vsa9 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

kernel-2.6.32-358.el6.x86_64
device-mapper-multipath-0.4.9-64.el6.x86_64


How reproducible:


Steps to Reproduce:
1. Discover 4 LUNs over 4 paths (example commands for steps 1-2 follow this list).

2. Rescan multipath.
   See the attached multipath_output.txt file.

3. Create a raid10 array with mdadm.
   mdadm --create --verbose /dev/md0 --level=raid10 --raid-devices=4 /dev/dm*

4. Run dd with oflag=direct.
   dd if=/dev/zero of=/dev/md0 bs=128k count=100 oflag=direct

Reproduced with iSCSI, SRP, and iSER.
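
Example commands for steps 1-2, assuming an iSCSI setup; the portal address is a placeholder, and rescan-scsi-bus.sh (from sg3_utils) is only one of several ways to rescan:

# iscsiadm -m discovery --type sendtargets --portal <portal-ip>:3260 -l
# rescan-scsi-bus.sh
# multipath -r     # reload the multipath maps
# multipath -ll    # verify 4 dm devices with 4 paths each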



Actual results:
kernel panic. see attached panic.txt file.


Expected results:
dd successfully writes the requested blocks.

Comment 1 Roi Dayan 2013-08-14 09:11:05 UTC
Created attachment 786466 [details]
multipath list output

Comment 3 Ben Marzinski 2013-08-21 17:33:29 UTC
I can't recreate this with FC-attached storage.  I'm going to try with iSCSI. I don't currently have any hardware that will let me try SRP or iSER.  Have you tried taking multipath out of the picture and verifying that you can run this with md raid10 directly on top of the SCSI devices?
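
A minimal sketch of that check, assuming the underlying paths show up as /dev/sdb through /dev/sde (hypothetical device names):

# mdadm --create --verbose /dev/md0 --level=raid10 --raid-devices=4 /dev/sd[b-e]
# dd if=/dev/zero of=/dev/md0 bs=128k count=100 oflag=direct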

Comment 4 Ben Marzinski 2013-08-21 19:53:45 UTC
I'm also not able to recreate this using the in-kernel open-iscsi initiator and the Linux target framework (tgt) software target.

Comment 5 Roi Dayan 2013-08-22 10:43:08 UTC
Created attachment 789132 [details]
scsi id script

Comment 6 Roi Dayan 2013-08-22 10:46:17 UTC
Created attachment 789137 [details]
multipath config file

Comment 7 Roi Dayan 2013-08-22 10:50:09 UTC
Created attachment 789139 [details]
tgt configuration script

Comment 8 Roi Dayan 2013-08-22 10:52:43 UTC
(In reply to Ben Marzinski from comment #4)
> I'm also not able to recreate this using the in-kernel open-iscsi initiator,
> and the linux target framework software target.

Verified again. The crash reproduces with raid over multipath but does not reproduce with raid over the raw devices.

Attached the script used to create the targets and LUNs on the target side.
Attached multipath.conf along with the script used to create the DM devices (4 paths each) on the initiator side.

The command used for the traffic is:
# dd if=/dev/md0 of=/dev/null iflag=direct count=1 bs=128K

Discovery was done as follows:
# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.30.4.71:3261 -l
# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.30.4.71:3262 -l

Comment 9 Chandra Seetharaman 2013-09-20 16:37:08 UTC
Roi, Ben,

We have seen a similar NULL pointer dereference in __bio_add_page() when using multipath and raid10.

We have been able to create a consistent reproduction scenario with dm devices (no dm-multipath involved) and raid10.

The following script reproduces the very same problem quickly on our system. Note that the script assumes no other loop devices or dm devices are present (please adjust the script accordingly if you have any in your system).

-----------
#!/bin/bash

T_DIR="/tmp/test_raid"

[[ ! -d ${T_DIR} ]] && mkdir ${T_DIR}
echo "using working directory: ${T_DIR}"

# Create the 8 files.
for i in `seq 1 8`; do
    [[ ! -f ${T_DIR}/f_${i} ]] && dd if=/dev/zero of=${T_DIR}/f_${i} count=1 bs=1048576

    if [[ ! -e /dev/loop${i} ]]; then
       echo "Make loop device: /dev/loop${i}"
       mknod -m660 /dev/loop${i} b 7 ${i}
       chown root.disk /dev/loop${i}
       chmod 666 /dev/loop${i}
    fi
    losetup /dev/loop${i} ${T_DIR}/f_${i}

    echo "0 `blockdev --getsize /dev/loop${i}` linear /dev/loop${i} 0" | dmsetup create dm-01${i}
done

[[ ! -e  /proc/mdstat ]] && modprobe md

# Assemble a raid10 array over the 8 dm devices.
mdadm -v --create --level=raid10 --raid-devices=8 /dev/md0 /dev/dm-*

sleep 5

mkfs -t ext2 /dev/md0
[[ ! -d /mnt/raid10 ]] && mkdir /mnt/raid10
mount /dev/md0 /mnt/raid10

cp /boot/config* /mnt/raid10/.
-----------
It crashes immediately after the copy occurs, in the very same way it happened with multipath and raid10.
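
For repeated runs, a teardown along these lines should reset the state (a sketch matching the device names the script creates, not part of the original reproduction):
-----------
#!/bin/bash

# Tear down in reverse order of creation.
umount /mnt/raid10
mdadm --stop /dev/md0

for i in `seq 1 8`; do
    dmsetup remove dm-01${i}
    losetup -d /dev/loop${i}
done
-----------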

After researching upstream, we found the following patch, which fixes the crash:

   md/raid10: fix problem with on-stack allocation of r10bio structure.
   https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=e0ee778528bbaad28a5c69d2e219269a3a096607

Roi, you can try this patch and see if it fixes your panic.
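
One way to check whether an installed kernel build already carries that commit is to search the RPM changelog for the patch subject (assuming a backport would keep the upstream subject line):

# rpm -q --changelog kernel | grep -i "on-stack allocation of r10bio"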