Bug 555171 - dm-raid1: kernel panic when bio on recovery failed region is released
Summary: dm-raid1: kernel panic when bio on recovery failed region is released
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority / Severity: high / high
Target Milestone: rc
Assignee: Takahiro Yasui
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 557934
 
Reported: 2010-01-13 21:07 UTC by Takahiro Yasui
Modified: 2014-07-25 05:08 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 557934
Environment:
Last Closed: 2010-03-30 07:28:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch for 2.6.18-182.el5 kernel (2.78 KB, text/plain)
2010-01-13 21:07 UTC, Takahiro Yasui


Links
Red Hat Product Errata RHSA-2010:0178 (normal, SHIPPED_LIVE): Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update (last updated 2010-03-29 12:18:21 UTC)

Description Takahiro Yasui 2010-01-13 21:07:37 UTC
Created attachment 383559 [details]
Patch for 2.6.18-182.el5 kernel

Description of problem:
  dm-raid1: kernel panic when bio on recovery failed region is released

Version-Release number of selected component (if applicable):
  2.6.18-182.el5

How reproducible:
  In the following steps, dmeventd is killed and suspend/resume is repeated
  to make the issue easy to reproduce; however, this issue can also occur
  without killing dmeventd.

Steps to Reproduce:
  1. create a two-way mirror
    # dmsetup ls
    vg00-lv00_mimage_1	(253, 2)
    vg00-lv00_mimage_0	(253, 1)
    vg00-lv00_mlog	(253, 0)
    vg00-lv00	(253, 3)

  2. disable one of the legs
    # echo offline > /sys/block/<dev>/device/state

  3. kill dmeventd
    # ps -ef | grep dmeventd
    root      3378     1  0 11:49 ?        00:00:00 [dmeventd]
    # kill -9 3378

  4. Write I/O to region #0
    # dd if=/dev/zero of=/dev/vg00/lv00 bs=4096 count=1

  5. repeat suspend/resume
    # dmsetup suspend --noflush vg00-lv00
    # dmsetup resume vg00-lv00
    ...repeat suspend/resume...

  6. kernel panic happens

Actual results:
  Kernel panics with the following oops message.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000020
...
 printing eip:
f8d8c207
*pde = cc3c7067
Oops: 0002 [#1]
SMP
...
CPU:    1
EIP:    0060:[<f8d8c207>]    Tainted: GF     VLI
EFLAGS: 00010046   (2.6.18-182.el5.dm #1)
EIP is at mirror_end_io+0x55/0x1fe [dm_mirror]
eax: 00000216   ebx: 00000000   ecx: 00000000   edx: 00000216
esi: f7b81230   edi: f7b8120c   ebp: f7b81200   esp: eb0c2d0c
ds: 007b   es: 007b   ss: 0068
Process dmsetup (pid: 13525, ti=eb0c2000 task=f5a6baa0 task.ti=eb0c2000)
...
Call Trace:
 [<f88c4564>] clone_endio+0x63/0xc4 [dm_mod]
 [<f8d8c1b2>] mirror_end_io+0x0/0x1fe [dm_mirror]
 [<f88c4501>] clone_endio+0x0/0xc4 [dm_mod]
 [<c047ab89>] bio_endio+0x50/0x55
 [<f8d8a4bd>] mirror_presuspend+0xe8/0xf4 [dm_mirror]
 [<c045d211>] __alloc_pages+0x69/0x2cf
 [<f88c59e4>] suspend_targets+0x2a/0x37 [dm_mod]
 [<f88c54cd>] dm_suspend+0x70/0x263 [dm_mod]
 [<c041f775>] default_wake_function+0x0/0xc
 [<f88c7e9a>] dev_suspend+0x50/0x152 [dm_mod]
 [<f88c879f>] ctl_ioctl+0x1f3/0x238 [dm_mod]
 [<f88c7e4a>] dev_suspend+0x0/0x152 [dm_mod]
 [<c0485edc>] do_ioctl+0x47/0x5d
 [<c0486445>] vfs_ioctl+0x47b/0x4d3
 [<c0466c75>] unmap_region+0xe1/0xf0
 [<c061e8fe>] do_page_fault+0x23a/0x52d
 [<c061e968>] do_page_fault+0x2a4/0x52d
 [<c04864e5>] sys_ioctl+0x48/0x5f
 [<c0404f17>] syscall_call+0x7/0xb

Expected results:
  No kernel panic happens.

Additional info:
  This issue is reported on dm-devel.
  https://www.redhat.com/archives/dm-devel/2010-January/msg00047.html

Comment 1 Takahiro Yasui 2010-01-13 22:44:54 UTC
The kernel panic happened at 0xf8d8c207.

crash> dis mirror_end_io
...
0xf8d8c1ee <mirror_end_io+0x3c>:        call   0xf8d8a000 <__rh_lookup>
0xf8d8c1f3 <mirror_end_io+0x41>:        mov    %eax,%ebx
0xf8d8c1f5 <mirror_end_io+0x43>:        lock incl 0x1c(%ebp)
0xf8d8c1f9 <mirror_end_io+0x47>:        lea    0x30(%ebp),%esi
0xf8d8c1fc <mirror_end_io+0x4a>:        mov    %esi,%eax
0xf8d8c1fe <mirror_end_io+0x4c>:        call   0xc061d7e8 <_spin_lock_irqsave>
0xf8d8c203 <mirror_end_io+0x51>:        mov    %eax,0x8(%esp)
0xf8d8c207 <mirror_end_io+0x55>:        lock decl 0x20(%ebx)	*** PANIC ***
0xf8d8c20b <mirror_end_io+0x59>:        sete   %al
0xf8d8c20e <mirror_end_io+0x5c>:        test   %al,%al

This means that the kernel panic happened at the following line.

static void rh_dec(struct region_hash *rh, region_t region)
{
        ...
        read_lock(&rh->hash_lock);
        reg = __rh_lookup(rh, region);
        read_unlock(&rh->hash_lock);

        spin_lock_irqsave(&rh->region_lock, flags);
        if (atomic_dec_and_test(&reg->pending)) {	*** PANIC ***

By printk debug, I confirmed that __rh_lookup() returned NULL.
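
Connecting the faulting instruction to the source (my annotation, based on the
upstream 2.6.18 layout of struct region sketched below, which I assume matches
the 2.6.18-182.el5 build): %ebx holds the value returned by __rh_lookup()
(moved there at mirror_end_io+0x41), so "lock decl 0x20(%ebx)" is the
atomic_dec_and_test(&reg->pending), with 0x20 being offsetof(struct region,
pending) in this build. With reg == NULL that gives exactly the reported fault
address 0x00000020.

/* struct region as in upstream 2.6.18 dm-raid1.c (assumed to match the
 * RHEL5 build); the faulting operand 0x20(%ebx) is the pending counter
 * of this structure, dereferenced through a NULL reg pointer. */
struct region {
        struct region_hash *rh;         /* making looking up easy */
        region_t key;
        int state;

        struct list_head hash_list;
        struct list_head list;

        atomic_t pending;               /* offset 0x20 in this build */
        struct bio_list delayed_bios;
};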

Comment 3 RHEL Program Management 2010-01-14 17:21:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Mikuláš Patočka 2010-01-14 17:41:59 UTC
I think the reason is this:

When recovery fails, we mark the region as RH_NOSYNC and add it to the
failed_recovered_regions list (dm-raid1.c:rh_recovery_end).

RH_NOSYNC allows further writes to be processed, and they increment the
region's pending count (see do_writes ... case RH_NOSYNC: this_list = &nosync;
... rh_inc_pending(&ms->rh, &nosync);).

Regions on the failed_recovered_regions list are unconditionally freed,
regardless of a possibly non-zero pending count. See dm-raid1.c:rh_update_states:

list_splice(&rh->failed_recovered_regions, &failed_recovered);
...
list_for_each_entry_safe (reg, next, &failed_recovered, list) {
        complete_resync_work(reg, 0);
        mempool_free(reg, rh->region_pool);
}

Here the region is freed without checking whether there are pending I/Os on it.

Note that rh_update_states also frees "clean" and "recovered" regions
unconditionally, but there should be no I/Os on them. On "clean" you can't have
I/O by definition ("clean" regions are regions without I/O; I/O turns a "clean"
region into "dirty"). On "recovered" you can't have I/O because such a region is
in the RH_RECOVERING state and do_writes doesn't pass I/Os to it.
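
To make the hole concrete, here is a minimal sketch of one possible guard (my
illustration only, not the attached patch, which should be treated as the
authoritative change): when rh_update_states() grabs the
failed_recovered_regions list, it could leave any region that still has a
non-zero pending count on the list and in the hash, so rh_dec() can still find
it, and only collect it on a later pass.

/*
 * Illustrative sketch only, not the shipped fix: skip failed-recovered
 * regions that in-flight writes still reference, so they are neither
 * unhashed nor returned to the mempool until their pending count drops
 * to zero.  Assumes the surrounding rh_update_states() context, where
 * hash_lock is held for write and region_lock is held.
 */
if (!list_empty(&rh->failed_recovered_regions)) {
        list_for_each_entry_safe(reg, next,
                                 &rh->failed_recovered_regions, list) {
                if (atomic_read(&reg->pending))
                        continue;       /* collect on a later pass */

                list_move(&reg->list, &failed_recovered);
                list_del(&reg->hash_list);
        }
}

A complete fix also has to make sure rh_dec() leaves such a region alone when
its last write completes, so the sketch above is only meant to show where the
lifetime check belongs.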

Comment 5 Takahiro Yasui 2010-01-14 17:52:13 UTC
(In reply to comment #4)
> I think the reason is this:

Yes, it is the same as the reason I described in the patch header.

---
When the recovery process of a region fails, the dm_rh_recovery_end() function
changes the state of the region from DM_RH_RECOVERING to DM_RH_NOSYNC.
When recovery_complete() is executed between dm_rh_update_states() and
dm_writes() in do_mirror(), bios are processed with the region state
DM_RH_NOSYNC. However, the region data is freed without checking its
pending count when dm_rh_update_states() is called next time.

When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec()
returns NULL even though a valid return value is expected.
---

Comment 6 Takahiro Yasui 2010-01-14 17:56:21 UTC
(In reply to comment #5)
Sorry, there are some typos. In RHEL5, function names are different.

<correction>
  dm_writes() -> do_writes()
  dm_rh_update_states() -> rh_update_states()
  dm_rh_dec() -> rh_dec()
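
Using the RHEL5 names above, the window described in the patch header can be
placed in the mirror daemon loop like this (abridged sketch based on the
upstream 2.6.18-era dm-raid1.c structure; the -182.el5 source may differ in
detail):

/*
 * If recovery_complete() fails a region after rh_update_states() has run
 * but before do_writes(), the writes are issued against an RH_NOSYNC
 * region, and the *next* rh_update_states() call (e.g. triggered by the
 * suspend/resume in the reproduction steps) frees that region while those
 * writes are still pending.
 */
static void do_mirror(void *data)
{
        struct mirror_set *ms = (struct mirror_set *) data;
        struct bio_list reads, writes;

        /* ... grab ms->reads / ms->writes under ms->lock ... */

        rh_update_states(&ms->rh);      /* frees failed-recovered regions */
        do_recovery(ms);
        do_reads(ms, &reads);
        do_writes(ms, &writes);         /* adds pending I/O to such a region */
}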

Comment 10 Takahiro Yasui 2010-02-26 00:47:27 UTC
I verified this fix on 2.6.18-187.el5.

Comment 14 errata-xmlrpc 2010-03-30 07:28:53 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

