Bug 565494

Summary: "dmraid -ay" panics kernel
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Severity: high
Priority: low
Status: CLOSED ERRATA
Reporter: michal novacek <mnovacek>
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: agk, dwysocha, heinzm, mbroz, prockai, yugzhang
Target Milestone: rc
Doc Type: Bug Fix
Cloned As: 567605
Bug Blocks: 567605
Last Closed: 2010-03-30 07:44:25 UTC
Attachments:
  script crashing the kernel
  different kernel panic messages caught during crashes
  data for creation of dmraid device-mapper devices

Description michal novacek 2010-02-15 13:35:14 UTC
Created attachment 394313 [details]
script crashing the kernel

Description of problem: Running the attached crash.sh script causes the kernel to panic. I was able to reproduce this on two different x86_64 machines.

Version-Release number of selected component (if applicable):
RHEL5.5-Server-20100215.nightly_nfs-x86_64 
kernel-2.6.18-187.el5
dmraid-1.0.0.rc13-60.el5

How reproducible: always

Steps to Reproduce:
1. get data.tar.bz2 (13M, link in attachments)
2. run crash.sh (attached as well)

Actual results: kernel panic
 
Expected results: no kernel panic

Additional info: happens on an updated RHEL 5.4 as well. Doesn't seem to happen on RHEL6-Alpha-3.

Comment 1 michal novacek 2010-02-15 13:36:46 UTC
Created attachment 394314 [details]
different kernel panic messages caught during crashes

Comment 2 michal novacek 2010-02-15 13:39:08 UTC
Created attachment 394315 [details]
data for creation of dmraid device-mapper devices

Comment 3 Heinz Mauelshagen 2010-02-16 17:35:16 UTC
Michal,

I tested the metadata sample in question on a vanilla kernel without any OOPS. Access to the RAID set was fine.

I assume this may be a loop device issue, because I based the VG on a real disk.

Can you avoid using loop devices and see if you're able to reproduce on the el5 kernel?

Comment 4 michal novacek 2010-02-17 16:05:38 UTC
I picked a machine at random (hp-dl360-04.rhts.eng.brq.redhat.com),
created a new physical partition, and used it instead of /dev/loop1, with the same result (follows):

rhel 5.4, kernel 2.6.18-185.el5

BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000
 printing eip:
f88cbfc3
*pde = 72d47067
Oops: 0000 [#1]
SMP 
last sysfs file: /block/hda/removable
Modules linked in: dm_zero dm_snapshot autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport scb2_flash mtdcore chipreg ide_cd floppy cdrom i2c_piix4 i2c_core pcspkr hpilo serio_raw tg3 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<f88cbfc3>]    Not tainted VLI
EFLAGS: 00010282   (2.6.18-185.el5 #1) 
EIP is at dm_io_client_destroy+0x3/0x1a [dm_mod]
eax: 00000000   ebx: 00000000   ecx: f7fff060   edx: c15bfe40
esi: f776c128   edi: edff2680   ebp: f776c000   esp: f5825bc0
ds: 007b   es: 007b   ss: 0068
Process dmraid (pid: 2719, ti=f5825000 task=f76ab000 task.ti=f5825000)
Stack: f776c0a4 f88f3a47 f776c0a4 00000001 f776c000 f88f3d02 f776c128 f88f6a3b 
       00000002 0439bfff f34464c0 00000013 f8b6e080 f34464c8 00000000 00000495 
       f88fb278 00000001 f776c128 f47f0a40 00008000 00000000 00000050 0439bfff 
Call Trace:
 [<f88f3a47>] stripe_recover_free+0x4f/0x5a [dm_raid45]
 [<f88f3d02>] sc_exit+0x14/0x69 [dm_raid45]
 [<f88f6a3b>] raid_ctr+0xc4b/0x11c0 [dm_raid45]
 [<c04169d7>] smp_send_reschedule+0x51/0x53
 [<f88c9209>] dm_table_add_target+0x14e/0x27d [dm_mod]
 [<f88cad29>] table_load+0xcd/0x186 [dm_mod]
 [<f88cb79f>] ctl_ioctl+0x1f3/0x238 [dm_mod]
 [<f88cac5c>] table_load+0x0/0x186 [dm_mod]
 [<c0485e60>] do_ioctl+0x47/0x5d
 [<c04863c9>] vfs_ioctl+0x47b/0x4d3
 [<c0486469>] sys_ioctl+0x48/0x5f
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 24 24 89 e0 89 e3 e8 ee f9 ff ff 89 f9 89 ea 31 c0 ff 74 24 2c ff 74 24 2c 53 56 e8 57 fe ff ff 83 c4 20 5b 5e 5f 5d c3 53 89 c3 <8b> 00 e8 5d f6 b8 c7 8b 43 04 e8 04 e7 ba c7 89 d8 5b e9 94 5a 
EIP: [<f88cbfc3>] dm_io_client_destroy+0x3/0x1a [dm_mod] SS:ESP 0068:f5825bc0
 <0>Kernel panic - not syncing: Fatal exception

Comment 5 Heinz Mauelshagen 2010-02-22 12:19:43 UTC
Analysis shows an unhandled error path in the dm-raid45 target when a resource allocation fails during construction of the RAID mapping.

This is unlikely to show up in the field, because this metadata format is not in use on i386 systems anyway. The open question is why the allocation fails in the first place, since only the default number of stripes is being allocated and the test system has enough RAM (2 GB).

An error-path fix will still let the mapping creation fail, but the OOPS will go away.
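
The call trace in comment 4 is consistent with teardown running against a partially constructed stripe cache: raid_ctr() fails part-way through, its cleanup calls sc_exit() -> stripe_recover_free(), and dm_io_client_destroy() is handed a NULL client. Below is a minimal sketch of the defensive pattern such an error-path fix typically takes; the struct layout and field names are illustrative assumptions, not the actual RHEL5 dm-raid45 sources.

#include "dm.h"    /* in-tree dm targets use the private device-mapper headers */
#include "dm-io.h" /* dm_io_client_create()/dm_io_client_destroy() */

/* Illustrative only: the real stripe cache carries much more state. */
struct stripe_cache {
	struct dm_io_client *dm_io_client; /* NULL until successfully created */
	/* ... stripe hash, recovery stripes, other resources ... */
};

static void stripe_recover_free(struct stripe_cache *sc)
{
	/*
	 * The constructor can fail before the io client exists, so the
	 * teardown must not destroy it unconditionally; doing so is the
	 * NULL dereference seen at dm_io_client_destroy+0x3 above.
	 */
	if (sc->dm_io_client) {
		dm_io_client_destroy(sc->dm_io_client);
		sc->dm_io_client = NULL;
	}
}

static void sc_exit(struct stripe_cache *sc)
{
	stripe_recover_free(sc);
	/* ... free the remaining resources, each NULL-checked the same way ... */
}

With a guard like this the constructor still returns an error when the allocation fails (the table load from "dmraid -ay" is rejected), but the kernel no longer panics.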

Comment 7 Heinz Mauelshagen 2010-02-23 11:36:56 UTC
Fix sent to rhkernel-list with subject "[RHEL5.5 PATCH] dm: raid45 target: constructor error path oops fix" -> POST.

Comment 11 Jarod Wilson 2010-03-03 15:45:12 UTC
The fix is included in kernel-2.6.18-191.el5.
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 18 errata-xmlrpc 2010-03-30 07:44:25 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html