Description of problem:
I was running the 'block_io_to_snaps' test case when I saw this issue. This appears to be the same bug as the one reported in:
http://bugzilla.kernel.org/show_bug.cgi?id=11636

Test output:
Making snapshot block_snap128 of origin volume
Running block level I/O to the origin and verifying on snapshot 128
b_iogen starting up with the following:

Iterations: 500
Seed: 25496
Offset-mode: random
Single Pass: off
Overlap Flag: on
Mintrans: 512000
Maxtrans: 5120000
Syscalls: write writev
Flags: direct

Test Devices:

Path Size (bytes)
---------------------------------------------------------------
/dev/snapper/origin 4294967296

Snap Devices:
/dev/snapper/block_snap128

Didn't receive heartbeat for 120 seconds
block level IO failed with snapshot 128
<fail name="" pid="14906" time="Fri Oct 3 23:40:42 2008" type="cmd" duration="3485

There appear to be I/O errors, though those shouldn't cause a panic.

Buffer I/O error on device dm-3, logical block 7156
lost page write due to I/O error on dm-3
Buffer I/O error on device dm-3, logical block 7157
lost page write due to I/O error on dm-3
Buffer I/O error on device dm-3, logical block 7158
lost page write due to I/O error on dm-3
Buffer I/O error on device dm-3, logical block 7159
lost page write due to I/O error on dm-3
Buffer I/O error on device dm-3, logical block 7160
lost page write due to I/O error on dm-3
Buffer I/O error on device dm-3, logical block 7161
lost page write due to I/O error on dm-3
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
Unable to handle kernel paging request at 0000000000200200 RIP:
[<ffffffff8014d090>] list_del+0x8/0x71
PGD 204a10067 PUD 204766067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 1
Modules linked in: gfs(U) dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrd
Pid: 25313, comm: kcopyd Tainted: G 2.6.18-116.el5 #1
RIP: 0010:[<ffffffff8014d090>] [<ffffffff8014d090>] list_del+0x8/0x71
RSP: 0018:ffff8101e81ddd20 EFLAGS: 00010246
RAX: 0000000000200200 RBX: ffff810204bf3978 RCX: 0000000000000001
RDX: ffff8101fe30bad0 RSI: ffff8101ff7052f0 RDI: ffff810204bf3978
RBP: 00000000000081bf R08: 00000000000081be R09: ffff8101ff7052b0
R10: 00000000000081bf R11: ffff8101ff705290 R12: ffff810204bf3978
R13: ffff81021cd3aa00 R14: ffff8101f2f73d68 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8101fff107c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000200200 CR3: 000000020492c000 CR4: 00000000000006e0
Process kcopyd (pid: 25313, threadinfo ffff8101e81dc000, task ffff81021aeee080)
Stack: ffff8101ff705290 ffffffff881118e8 0000000000000040 0000000000000000
 ffff81021b3caf40 0000000000000007 ffffffff881119dc ffff8101f2f73d68
 0000000000000000 ffffffff88112af9 ffff810202e21768 ffff810215fe2980
Call Trace:
 [<ffffffff881118e8>] :dm_snapshot:pending_complete+0x114/0x1d1
 [<ffffffff881119dc>] :dm_snapshot:commit_callback+0x0/0x5
 [<ffffffff88112af9>] :dm_snapshot:persistent_commit+0xc1/0xdc
 [<ffffffff881119a5>] :dm_snapshot:copy_callback+0x0/0x37
 [<ffffffff880d8a52>] :dm_mod:run_complete_job+0x51/0x80
 [<ffffffff880d8764>] :dm_mod:process_jobs+0x2a/0xed
 [<ffffffff880d8a01>] :dm_mod:run_complete_job+0x0/0x80
 [<ffffffff880d8827>] :dm_mod:do_work+0x0/0x47
 [<ffffffff880d8841>] :dm_mod:do_work+0x1a/0x47
 [<ffffffff8004d264>] run_workqueue+0x94/0xe4
 [<ffffffff80049b1d>] worker_thread+0x0/0x122
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049c0d>] worker_thread+0xf0/0x122
 [<ffffffff8008b274>] default_wake_function+0x0/0xe
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800324f0>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009e7ca>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8002f1dd>] generic_delete_inode+0x0/0x143
 [<ffffffff800323f2>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 2a b6 2a 80
RIP [<ffffffff8014d090>] list_del+0x8/0x71
RSP <ffff8101e81ddd20>
CR2: 0000000000200200
<0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
2.6.18-116.el5

lvm2-2.02.40-3.el5 BUILT: Thu Sep 25 14:59:07 CDT 2008
lvm2-cluster-2.02.40-3.el5 BUILT: Thu Sep 25 15:00:54 CDT 2008
device-mapper-1.02.28-2.el5 BUILT: Fri Sep 19 02:50:32 CDT 2008
Created attachment 320331
A patch that could fix this

A patch that fixes memory corruption in snapshots. It may fix this bug, but that cannot be verified because the bug is not reproducible.
When reading the source code, I found a possible race condition that could result in this crash. This cannot be verified because the crash is not reproducible. The patch and an explanation of the race are in the attachment above.

The bug is much more serious than it looks: it may cause random crashes if the user writes simultaneously to both the origin volume and the snapshot volume. The crashes may happen even under normal operation, without any disk errors (crashes under normal, non-error conditions are reported in the kernel.org Bugzilla entry linked above; we are waiting for feedback).

Because this is a random crash condition, I suggest marking it as an exception and getting the fix into 5.3. This bug has been present since snapshots were introduced; it affects both RHEL-4 and RHEL-5.
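One detail in the oops in the description is consistent with a race on a list entry: the fault address (CR2: 0000000000200200) equals the value that list_del() writes into the prev pointer of an entry it has just removed in kernels of this era. That suggests, but does not prove, that the entry was deleted twice. The helper below is only a sketch for triaging such addresses; the function name is hypothetical, and the poison constants are assumed to be those of include/linux/poison.h in 2.6-era kernels (newer kernels may use different values).

```python
# Poison values that list_del() stores in a removed entry, as defined in
# include/linux/poison.h for 2.6-era kernels (assumed to apply to
# 2.6.18-116.el5; newer kernels may use different values).
LIST_POISON1 = 0x00100100  # written to entry->next
LIST_POISON2 = 0x00200200  # written to entry->prev


def classify_fault_address(cr2_hex):
    """Return a short label for a CR2 value taken from an oops report.

    Hypothetical triage helper: it only compares the address against the
    two list poison constants above.
    """
    addr = int(cr2_hex, 16)
    if addr == LIST_POISON1:
        return "LIST_POISON1"
    if addr == LIST_POISON2:
        return "LIST_POISON2"
    return "other"


if __name__ == "__main__":
    # CR2 value copied from the oops in this report.
    print(classify_fault_address("0000000000200200"))
```

A LIST_POISON2 result means the faulting code dereferenced the prev pointer of an entry that had already been through list_del(), which fits a second deletion of the same entry.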
exception raised, removing needinfo.
in kernel-2.6.18-121.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 472094 has been marked as a duplicate of this bug. ***
Fix verified in 2.6.18-123.el5.
An advisory has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html