Bug 996900 - Kernel panic when running traffic over raid10 md device created over 4 dm devices
Status: CLOSED DUPLICATE of bug 982360
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Jes Sorensen
QA Contact: Red Hat Kernel QE team

Reported: 2013-08-14 05:09 EDT by Roi Dayan
Modified: 2013-10-08 08:20 EDT
CC List: 10 users

Doc Type: Bug Fix
Last Closed: 2013-10-08 08:20:51 EDT
Type: Bug

Attachments:
kernel panic (5.59 KB, text/plain) - 2013-08-14 05:09 EDT, Roi Dayan
multipath list output (4.33 KB, text/plain) - 2013-08-14 05:11 EDT, Roi Dayan
scsi id script (59 bytes, application/x-shellscript) - 2013-08-22 06:43 EDT, Roi Dayan
multipath config file (347 bytes, text/plain) - 2013-08-22 06:46 EDT, Roi Dayan
tgt configuration script (1.03 KB, application/x-shellscript) - 2013-08-22 06:50 EDT, Roi Dayan

Description Roi Dayan 2013-08-14 05:09:20 EDT
Created attachment 786465 [details]
kernel panic

Description of problem:

Kernel panic with a NULL pointer dereference when running traffic over a raid10 md device created over 4 dm devices that have 4 paths each.


Versions:

Linux vsa9 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

kernel-2.6.32-358.el6.x86_64
device-mapper-multipath-0.4.9-64.el6.x86_64


How reproducible:


Steps to Reproduce:
1. discover 4 LUNs over 4 paths.

2. rescan multipath (see the sketch after these steps).
   see attached multipath_output.txt file.

3. create raid10 array with mdadm.
   mdadm --create --verbose /dev/md0 --level=raid10 --raid-devices=4 /dev/dm*

4. run dd with flag direct.
   dd if=/dev/zero of=/dev/md0 bs=128k count=100 oflag=direct

Reproduced with iSCSI, SRP and iSER.
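
For steps 1 and 2, a minimal sketch assuming the open-iscsi initiator (the portal below is a placeholder; the actual portals used are listed in comment 8):

# iscsiadm -m discovery --type sendtargets --portal <target-ip>:3260 -l
# iscsiadm -m session --rescan
# multipath -r

The -l flag logs in to each discovered target, the session rescan picks up the 4 LUNs on every path, and multipath -r reloads the maps so each LUN appears as a single dm device.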



Actual results:
Kernel panic. See the attached panic.txt file.


Expected results:
dd successfully writes the requested blocks.
Comment 1 Roi Dayan 2013-08-14 05:11:05 EDT
Created attachment 786466 [details]
multipath list output
Comment 3 Ben Marzinski 2013-08-21 13:33:29 EDT
I can't recreate this with FC attached storage.  I'm going to try with iscsi. I don't currently have any hardware that will let me try srp or iser.  Have you tried taking multipath out of the picture and verifying that you can run this with md raid10 directly on top of the scsi devices?
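
A sketch of that control test, picking one scsi path per LUN (the /dev/sd names below are placeholders for whatever the 4 LUNs enumerate as):

# mdadm --stop /dev/md0
# mdadm --create --verbose /dev/md0 --level=raid10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# dd if=/dev/zero of=/dev/md0 bs=128k count=100 oflag=direct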
Comment 4 Ben Marzinski 2013-08-21 15:53:45 EDT
I'm also not able to recreate this using the in-kernel open-iscsi initiator and the Linux target framework (tgt) software target.
Comment 5 Roi Dayan 2013-08-22 06:43:08 EDT
Created attachment 789132 [details]
scsi id script
Comment 6 Roi Dayan 2013-08-22 06:46:17 EDT
Created attachment 789137 [details]
multipath config file
Comment 7 Roi Dayan 2013-08-22 06:50:09 EDT
Created attachment 789139 [details]
tgt configuration script
Comment 8 Roi Dayan 2013-08-22 06:52:43 EDT
(In reply to Ben Marzinski from comment #4)
> I'm also not able to recreate this using the in-kernel open-iscsi initiator,
> and the linux target framework software target.

Verified again. The crash reproduces with raid over multipath but does not reproduce with raid over the raw devices.

Attached is the script used to create the targets and LUNs on the target side.
Also attached are the multipath.conf and the script used with it to create the DM devices over 4 paths each on the initiator side.

The command used for the traffic is:
# dd if=/dev/md0 of=/dev/null iflag=direct count=1 bs=128K

discovery was done as follows:
# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.30.4.71:3261 -l
# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.30.4.71:3262 -l
Comment 9 Chandra Seetharaman 2013-09-20 12:37:08 EDT
Roi, Ben,

We have seen a similar NULL pointer dereference in __bio_add_page() when using multipath and raid10.

We have been able to create a consistent reproduction scenario with dm devices (no dm-multipath involved) and raid10.

The following script reproduces the very same problem quickly on our system. Note that the script assumes no other loop devices or dm devices are present (please adjust it accordingly if any exist on your system).

-----------
#!/bin/bash

T_DIR="/tmp/test_raid"

[[ ! -d ${T_DIR} ]] && mkdir ${T_DIR}
echo "using working directory: ${T_DIR}"

# Create 8 backing files, attach each to a loop device, and wrap each in a linear dm target.
for i in `seq 1 8`; do
    [[ ! -f ${T_DIR}/f_${i} ]] && dd if=/dev/zero of=${T_DIR}/f_${i} count=1 bs=1048576

    if [[ ! -e /dev/loop${i} ]]; then
       echo "Make loop device: /dev/loop${i}"
       mknod -m660 /dev/loop${i} b 7 ${i}
       chown root.disk /dev/loop${i}
       chmod 666 /dev/loop${i}
    fi
    losetup /dev/loop${i} ${T_DIR}/f_${i}

    echo "0 `blockdev --getsize /dev/loop${i}` linear /dev/loop${i} 0" | dmsetup create dm-01${i}
done

[[ ! -e  /proc/mdstat ]] && modprobe md

mdadm -v --create --level=raid10 --raid-devices=8 /dev/md0 /dev/dm-*

sleep 5

mkfs -t ext2 /dev/md0
[[ ! -d /mnt/raid10 ]] && mkdir /mnt/raid10
mount /dev/md0 /mnt/raid10

cp /boot/config* /mnt/raid10/.
-----------
It crashes immediately after the copy occurs, the very same way it happened with multipath and raid10.
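
Since the script assumes a clean system, a cleanup sketch for tearing everything down before a re-run (same assumptions as above: loop devices 1 through 8, dm names dm-011 through dm-018, mount point /mnt/raid10):

-----------
#!/bin/bash

# Undo the reproducer in reverse order.
umount /mnt/raid10
mdadm --stop /dev/md0

for i in `seq 1 8`; do
    dmsetup remove dm-01${i}
    losetup -d /dev/loop${i}
done

# Remove the backing files (T_DIR from the reproducer).
rm -rf /tmp/test_raid
-----------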

After researching upstream we found the following patch that fixed the crash.

   md/raid10: fix problem with on-stack allocation of r10bio structure.
   https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=e0ee778528bbaad28a5c69d2e219269a3a096607

Roi, you can try this patch and see if it fixes your panic.
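
A quick way to check whether an installed kernel package already carries the backport is to search its changelog for the upstream subject line (assuming the backport keeps that subject):

# rpm -q --changelog kernel | grep -i "on-stack allocation of r10bio"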
