Bug 465825 - panic in kcopyd during snapshot I/O
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Mikuláš Patočka
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 472094
Depends On:
Blocks: 476461
 
Reported: 2008-10-06 15:48 UTC by Corey Marthaler
Modified: 2009-01-20 20:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:17:28 UTC
Target Upstream Version:
Embargoed:


Attachments
A patch that could fix this (3.46 KB, patch)
2008-10-14 17:29 UTC, Mikuláš Patočka


Links
System: Red Hat Product Errata
ID: RHSA-2009:0225
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update
Last Updated: 2009-01-20 16:06:24 UTC

Description Corey Marthaler 2008-10-06 15:48:19 UTC
Description of problem:
I was running the 'block_io_to_snaps' testcase when I saw this issue. 

This appears to be the same bz as listed in:
http://bugzilla.kernel.org/show_bug.cgi?id=11636

Test output:
 Making snapshot block_snap128 of origin volume
 Running block level I/O to the origin and verifying on snapshot 128
 b_iogen starting up with the following:

 Iterations:      500
 Seed:            25496
 Offset-mode:     random
 Single Pass:     off
 Overlap Flag:    on
 Mintrans:        512000
 Maxtrans:        5120000
 Syscalls:        write  writev
 Flags:          direct

 Test Devices:

 Path                                                      Size
                                                         (bytes)
 ---------------------------------------------------------------
 /dev/snapper/origin                                        4294967296
      Snap Devices:
              /dev/snapper/block_snap128
 Didn't receive heartbeat for 120 seconds
 block level IO failed with snapshot 128
<fail name="" pid="14906" time="Fri Oct  3 23:40:42 2008" type="cmd" duration="3485

There appear to be I/O errors, though those shouldn't cause a panic.

Buffer I/O error on device dm-3, logical block 7156                           
lost page write due to I/O error on dm-3                                      
Buffer I/O error on device dm-3, logical block 7157                           
lost page write due to I/O error on dm-3                                      
Buffer I/O error on device dm-3, logical block 7158                           
lost page write due to I/O error on dm-3                                      
Buffer I/O error on device dm-3, logical block 7159                           
lost page write due to I/O error on dm-3                                      
Buffer I/O error on device dm-3, logical block 7160                           
lost page write due to I/O error on dm-3                                      
Buffer I/O error on device dm-3, logical block 7161                           
lost page write due to I/O error on dm-3                                      
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended         
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended         
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended         
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended         
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended         
Unable to handle kernel paging request at 0000000000200200 RIP:               
 [<ffffffff8014d090>] list_del+0x8/0x71                                       
PGD 204a10067 PUD 204766067 PMD 0                                             
Oops: 0000 [1] SMP                                                            
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq                         
CPU 1                                                                         
Modules linked in: gfs(U) dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrd
Pid: 25313, comm: kcopyd Tainted: G      2.6.18-116.el5 #1                                 
RIP: 0010:[<ffffffff8014d090>]  [<ffffffff8014d090>] list_del+0x8/0x71                     
RSP: 0018:ffff8101e81ddd20  EFLAGS: 00010246                                               
RAX: 0000000000200200 RBX: ffff810204bf3978 RCX: 0000000000000001                          
RDX: ffff8101fe30bad0 RSI: ffff8101ff7052f0 RDI: ffff810204bf3978                          
RBP: 00000000000081bf R08: 00000000000081be R09: ffff8101ff7052b0                          
R10: 00000000000081bf R11: ffff8101ff705290 R12: ffff810204bf3978                          
R13: ffff81021cd3aa00 R14: ffff8101f2f73d68 R15: 0000000000000000                          
FS:  0000000000000000(0000) GS:ffff8101fff107c0(0000) knlGS:0000000000000000               
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b                                          
CR2: 0000000000200200 CR3: 000000020492c000 CR4: 00000000000006e0                          
Process kcopyd (pid: 25313, threadinfo ffff8101e81dc000, task ffff81021aeee080)
Stack:  ffff8101ff705290 ffffffff881118e8 0000000000000040 0000000000000000
 ffff81021b3caf40 0000000000000007 ffffffff881119dc ffff8101f2f73d68
 0000000000000000 ffffffff88112af9 ffff810202e21768 ffff810215fe2980
Call Trace:
 [<ffffffff881118e8>] :dm_snapshot:pending_complete+0x114/0x1d1
 [<ffffffff881119dc>] :dm_snapshot:commit_callback+0x0/0x5
 [<ffffffff88112af9>] :dm_snapshot:persistent_commit+0xc1/0xdc
 [<ffffffff881119a5>] :dm_snapshot:copy_callback+0x0/0x37
 [<ffffffff880d8a52>] :dm_mod:run_complete_job+0x51/0x80
 [<ffffffff880d8764>] :dm_mod:process_jobs+0x2a/0xed
 [<ffffffff880d8a01>] :dm_mod:run_complete_job+0x0/0x80
 [<ffffffff880d8827>] :dm_mod:do_work+0x0/0x47
 [<ffffffff880d8841>] :dm_mod:do_work+0x1a/0x47
 [<ffffffff8004d264>] run_workqueue+0x94/0xe4
 [<ffffffff80049b1d>] worker_thread+0x0/0x122
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049c0d>] worker_thread+0xf0/0x122
 [<ffffffff8008b274>] default_wake_function+0x0/0xe
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800324f0>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8002f1dd>] generic_delete_inode+0x0/0x143
 [<ffffffff800323f2>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 2a b6 2a 80
RIP  [<ffffffff8014d090>] list_del+0x8/0x71
 RSP <ffff8101e81ddd20>
CR2: 0000000000200200
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
2.6.18-116.el5

lvm2-2.02.40-3.el5    BUILT: Thu Sep 25 14:59:07 CDT 2008
lvm2-cluster-2.02.40-3.el5    BUILT: Thu Sep 25 15:00:54 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008

Comment 2 Mikuláš Patočka 2008-10-14 17:29:18 UTC
Created attachment 320331 [details]
A patch that could fix this

A patch that fixes memory corruption in snapshots. It may fix this bug, although that cannot be verified because the bug is not reproducible.

Comment 3 Mikuláš Patočka 2008-10-14 17:46:21 UTC
While reading the source code, I found a possible race condition that could result in this crash. This cannot be verified because the crash is not reproducible.

The patch and an explanation of the race are in the above attachment.
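
A note on reading the oops above: the faulting address 0x200200 (CR2 and RAX) matches LIST_POISON2, the value list_del() stores in entry->prev after unlinking an entry (LIST_POISON1, 0x100100, goes into entry->next). That is consistent with list_del() being called a second time on an entry that was already removed, which is what a race of this kind could produce. The userspace sketch below only illustrates that mechanism. It is not kernel code and not the attached patch, and list_del_checked() and demo_double_delete() are hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Poison values from include/linux/poison.h in 2.6.18-era kernels. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert item right after head. */
static void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/*
 * Hypothetical userspace variant of list_del(): unlinks the entry and
 * poisons its pointers as the kernel does, but returns -1 instead of
 * dereferencing a poisoned pointer when the entry was already removed.
 */
static int list_del_checked(struct list_head *entry)
{
    if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
        return -1;
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return 0;
}

/* Deletes one entry twice; returns the result of the second attempt. */
static int demo_double_delete(void)
{
    struct list_head head, pe;
    int first, second;

    list_init(&head);
    list_add(&pe, &head);

    first = list_del_checked(&pe);
    assert(first == 0);

    second = list_del_checked(&pe);
    if (second != 0)
        printf("second delete refused: prev is %#lx\n",
               (unsigned long)(uintptr_t)pe.prev);
    return second;
}
```

Calling demo_double_delete() from a main() returns -1 and reports the poison value left in prev. The kernel's list_del() has no such guard, so a second call on the same entry reads through the poisoned prev pointer and faults, as in the trace.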

The bug is much more serious than it looks: it may cause random crashes if the user writes simultaneously to both the origin volume and the snapshot volume. The crashes may happen even under normal operation, without any disk errors. (Crashes under normal, non-error conditions are reported in the kernel.org Bugzilla entry linked above; we are waiting for feedback there.)

Because this is a random crash condition, I suggest marking it as an exception and getting the fix into 5.3. The bug has been present since snapshots were introduced; it is in both RHEL-4 and RHEL-5.

Comment 4 Corey Marthaler 2008-10-14 18:27:41 UTC
exception raised, removing needinfo.

Comment 5 Don Zickus 2008-10-29 16:18:38 UTC
in kernel-2.6.18-121.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 7 Corey Marthaler 2008-11-18 19:43:07 UTC
*** Bug 472094 has been marked as a duplicate of this bug. ***

Comment 8 Corey Marthaler 2008-11-19 16:33:54 UTC
Fix verified in 2.6.18-123.el5.

Comment 10 errata-xmlrpc 2009-01-20 20:17:28 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

