Created attachment 383559 [details]
Patch for 2.6.18-182.el5 kernel

Description of problem:
dm-raid1: kernel panic when a bio on a recovery-failed region is released

Version-Release number of selected component (if applicable):
2.6.18-182.el5

How reproducible:
In the following steps, dmeventd is killed and suspend/resume are repeated
to reproduce this issue easily; however, the issue can also happen without
killing dmeventd.

Steps to Reproduce:
1. Create a two-way mirror
   # dmsetup ls
   vg00-lv00_mimage_1      (253, 2)
   vg00-lv00_mimage_0      (253, 1)
   vg00-lv00_mlog          (253, 0)
   vg00-lv00               (253, 3)
2. Disable one of the legs
   # echo offline > /sys/block/<dev>/device/state
3. Kill dmeventd
   # ps -ef | grep dmeventd
   root      3378     1  0 11:49 ?        00:00:00 [dmeventd]
   # kill -9 3378
4. Write I/O to region #0
   # dd if=/dev/zero of=/dev/vg00/lv00 bs=4096 count=1
5. Repeat suspend/resume
   # dmsetup suspend --noflush vg00-lv00
   # dmsetup resume vg00-lv00
   ...repeat suspend/resume...
6. Kernel panic happens

Actual results:
The kernel panics with the following oops message.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000020
...
 printing eip:
f8d8c207
*pde = cc3c7067
Oops: 0002 [#1]
SMP
...
CPU:    1
EIP:    0060:[<f8d8c207>]    Tainted: GF     VLI
EFLAGS: 00010046   (2.6.18-182.el5.dm #1)
EIP is at mirror_end_io+0x55/0x1fe [dm_mirror]
eax: 00000216   ebx: 00000000   ecx: 00000000   edx: 00000216
esi: f7b81230   edi: f7b8120c   ebp: f7b81200   esp: eb0c2d0c
ds: 007b   es: 007b   ss: 0068
Process dmsetup (pid: 13525, ti=eb0c2000 task=f5a6baa0 task.ti=eb0c2000)
...
Call Trace:
 [<f88c4564>] clone_endio+0x63/0xc4 [dm_mod]
 [<f8d8c1b2>] mirror_end_io+0x0/0x1fe [dm_mirror]
 [<f88c4501>] clone_endio+0x0/0xc4 [dm_mod]
 [<c047ab89>] bio_endio+0x50/0x55
 [<f8d8a4bd>] mirror_presuspend+0xe8/0xf4 [dm_mirror]
 [<c045d211>] __alloc_pages+0x69/0x2cf
 [<f88c59e4>] suspend_targets+0x2a/0x37 [dm_mod]
 [<f88c54cd>] dm_suspend+0x70/0x263 [dm_mod]
 [<c041f775>] default_wake_function+0x0/0xc
 [<f88c7e9a>] dev_suspend+0x50/0x152 [dm_mod]
 [<f88c879f>] ctl_ioctl+0x1f3/0x238 [dm_mod]
 [<f88c7e4a>] dev_suspend+0x0/0x152 [dm_mod]
 [<c0485edc>] do_ioctl+0x47/0x5d
 [<c0486445>] vfs_ioctl+0x47b/0x4d3
 [<c0466c75>] unmap_region+0xe1/0xf0
 [<c061e8fe>] do_page_fault+0x23a/0x52d
 [<c061e968>] do_page_fault+0x2a4/0x52d
 [<c04864e5>] sys_ioctl+0x48/0x5f
 [<c0404f17>] syscall_call+0x7/0xb

Expected results:
No kernel panic happens.

Additional info:
This issue was reported on dm-devel:
https://www.redhat.com/archives/dm-devel/2010-January/msg00047.html
The kernel panic happened at 0xf8d8c207.

crash> dis mirror_end_io
...
0xf8d8c1ee <mirror_end_io+0x3c>:        call   0xf8d8a000 <__rh_lookup>
0xf8d8c1f3 <mirror_end_io+0x41>:        mov    %eax,%ebx
0xf8d8c1f5 <mirror_end_io+0x43>:        lock incl 0x1c(%ebp)
0xf8d8c1f9 <mirror_end_io+0x47>:        lea    0x30(%ebp),%esi
0xf8d8c1fc <mirror_end_io+0x4a>:        mov    %esi,%eax
0xf8d8c1fe <mirror_end_io+0x4c>:        call   0xc061d7e8 <_spin_lock_irqsave>
0xf8d8c203 <mirror_end_io+0x51>:        mov    %eax,0x8(%esp)
0xf8d8c207 <mirror_end_io+0x55>:        lock decl 0x20(%ebx)    *** PANIC ***
0xf8d8c20b <mirror_end_io+0x59>:        sete   %al
0xf8d8c20e <mirror_end_io+0x5c>:        test   %al,%al

This means that the kernel panic happened at the following line.

static void rh_dec(struct region_hash *rh, region_t region)
{
...
	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	spin_lock_irqsave(&rh->region_lock, flags);
	if (atomic_dec_and_test(&reg->pending)) {    *** PANIC ***

By printk debugging, I confirmed that __rh_lookup() returned NULL.
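In the oops above, %ebx holds the value just returned by __rh_lookup() and it
is 0, so the faulting address 00000020 is simply the offset of the pending
counter inside the (already freed) struct region. Purely for illustration --
this is not the attached patch, and it would only hide the real lifetime
problem analysed in the later comments -- a defensive check in rh_dec()
would look roughly like this:

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	/*
	 * Illustration only, not the attached patch: reg is NULL here
	 * because the region struct was freed while this bio was still
	 * pending on it.  Bailing out avoids the oops but leaves the
	 * underlying lifetime bug unfixed.
	 */
	if (unlikely(!reg)) {
		DMERR("rh_dec: region %llu not found",
		      (unsigned long long) region);
		return;
	}

	spin_lock_irqsave(&rh->region_lock, flags);
	if (atomic_dec_and_test(&reg->pending)) {
		...
	}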
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I think the reason is this:

When recovery fails, we mark the region as RH_NOSYNC and add it to the
failed_recovered_regions list (dm-raid1.c:rh_recovery_end).

RH_NOSYNC allows further writes to be processed, and they increment the
region->pending count (see do_writes:
	...
	case RH_NOSYNC:
		this_list = &nosync;
	...
	rh_inc_pending(&ms->rh, &nosync);
).

Regions on the failed_recovered_regions list are then unconditionally freed,
regardless of any pending count. See dm-raid1.c:rh_update_states:

	list_splice(&rh->failed_recovered_regions, &failed_recovered);
	...
	list_for_each_entry_safe (reg, next, &failed_recovered, list) {
		complete_resync_work(reg, 0);
		mempool_free(reg, rh->region_pool);
	}

--- here the region is freed without checking whether there are pending
I/Os on it.

Note that rh_update_states also frees "clean" and "recovered" regions
unconditionally, but there should be no I/Os on them. On "clean" regions you
can't have I/O by definition ("clean" regions are regions without I/O; I/O
turns a "clean" region into "dirty"). On "recovered" regions you can't have
I/O because they are in the RH_RECOVERING state and do_writes doesn't pass
I/Os to them.
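As an illustration of the constraint being violated (a sketch only, not the
attached patch, and with the locking around failed_recovered_regions
simplified), rh_update_states() would have to leave regions with a non-zero
pending count alone instead of freeing them:

	/*
	 * Sketch only, not the attached patch: a region that failed
	 * recovery may still have writes in flight (reg->pending > 0,
	 * raised by rh_inc_pending() in the do_writes path), so it must
	 * not go back to the mempool yet.
	 */
	list_for_each_entry_safe (reg, next, &failed_recovered, list) {
		if (atomic_read(&reg->pending)) {
			/* Still referenced by in-flight bios: keep it for a later pass. */
			list_move(&reg->list, &rh->failed_recovered_regions);
			continue;
		}
		complete_resync_work(reg, 0);
		mempool_free(reg, rh->region_pool);
	}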
(In reply to comment #4)
> I think the reason is this:

Yes, it is the same as the reason I described in the patch header:

---
When the recovery process of a region fails, dm_rh_recovery_end() changes
the state of the region from DM_RH_RECOVERING to DM_RH_NOSYNC. When
recovery_complete() is executed between dm_rh_update_states() and
dm_writes() in do_mirror(), bios are processed while the region is in the
DM_RH_NOSYNC state. However, the region data is freed without checking its
pending count the next time dm_rh_update_states() is called. When the bios
are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec() returns NULL
even though a valid return value is expected.
---
(In reply to comment #5)

Sorry, there are some typos: in RHEL5 the function names are different.

<correction>
dm_writes()           -> do_writes()
dm_rh_update_states() -> rh_update_states()
dm_rh_dec()           -> rh_dec()
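To make the window described in comments #4 and #5 concrete, the sequence
(using the RHEL5 names above) is roughly the following; this is a schematic
timeline, not literal source:

/*
 * Schematic timeline of the race (sketch, not literal source):
 *
 *   kmirrord worker (do_mirror)          recovery completion
 *   ---------------------------          -------------------
 *   rh_update_states()
 *     (failed_recovered_regions empty)
 *                                        recovery_complete()
 *                                          -> rh_recovery_end()
 *                                             region: RH_RECOVERING -> RH_NOSYNC,
 *                                             queued on failed_recovered_regions
 *   do_writes()
 *     case RH_NOSYNC: write allowed,
 *     rh_inc_pending() raises reg->pending
 *
 *   --- next worker run ---
 *   rh_update_states()
 *     frees the region from failed_recovered_regions
 *     although reg->pending > 0
 *
 *   later, the write completes:
 *   mirror_end_io() -> rh_dec()
 *     __rh_lookup() returns NULL -> NULL pointer dereference
 */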
I verified this fix on 2.6.18-187.el5.
An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of
ERRATA. For more information on the solution and/or where to find the
updated files, please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html