Bug 180285

Summary: panic during lvremove of snapshot of a mirror volume
Product: Red Hat Enterprise Linux 4 Reporter: Nate Straz <nstraz>
Component: lvm2Assignee: Alasdair Kergon <agk>
Status: CLOSED CURRENTRELEASE QA Contact: GFS Bugs <gfs-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0CC: agk, dwysocha, jbrassow, mbroz, rkenna
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: kernel 2.6.9-30.1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-01-10 13:46:36 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Test case none

Description Nate Straz 2006-02-06 21:41:46 UTC
Description of problem:

I was removing a snapshot of a mirror volume that I was using while testing
system-config-lvm, when the system panicked.  I was using a local GFS file
system, no cluster infrastructure was loaded.

Console messages:

GFS: fsid=dm-5.0: fatal: I/O error
GFS: fsid=dm-5.0:   block = 2400436224
GFS: fsid=dm-5.0:   function = gfs_dreread
GFS: fsid=dm-5.0:   file =
/usr/src/build/685212-i686/BUILD/gfs-kernel-2.6.9-47/up/src/gfs/dio.c, line = 576
GFS: fsid=dm-5.0:   time = 1139260172
GFS: fsid=dm-5.0: about to withdraw from the cluster
GFS: fsid=dm-5.0: waiting for outstanding I/O       
GFS: fsid=dm-5.0: telling LM to withdraw     
GFS: fsid=dm-5.0: withdrawn             
GFS: fsid=dm-5.0: head_off = 28090872822202384, head_wrap = 1
GFS: fsid=dm-5.0: dump_off = 6540224, dump_wrap = 0          
GFS: fsid=dm-5.0: assertion "FALSE" failed         
GFS: fsid=dm-5.0:   function = check_seg_usage
GFS: fsid=dm-5.0:   file =
/usr/src/build/685212-i686/BUILD/gfs-kernel-2.6.9-47/up/src/gfs/log.c, line = 590
GFS: fsid=dm-5.0:   time = 1139260172
------------[ cut here ]------------ 
kernel BUG at
/usr/src/build/685212-i686/BUILD/gfs-kernel-2.6.9-47/up/src/gfs/util.c:211!
invalid operand: 0000 [#1]
Modules linked in: lock_nolock(U) gfs(U) lock_harness(U) vfat fat parport_pc lp
parport autofs4 i2c_dev i2c_core md5 ipv6 sunrpc button battery ac uhci_hcd
e7xxx_edac edac_mc hw_random shpchp e1000 floppy dm_snapshot dm_zero dm_mirror
ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0                                                   
EIP:    0060:[<f8f4daea>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-29.EL)            
EIP is at gfs_assert_i+0x38/0x69 [gfs]
eax: 000000ed   ebx: f8eb0000   ecx: f8f55b36   edx: f3763bf0
esi: f8f4e6ce   edi: f8f51d78   ebp: 0000024e   esp: f3763bec
ds: 007b   es: 007b   ss: 0068                               
Process lvremove (pid: 2686, threadinfo=f3763000 task=c3333770)
Stack: f8f55b36 f8ec4890 f8f51d78 f8ec4890 f8f4e6ce f8ec4890 f8f51af2 0000024e 
       f8ec4890 43e7bb0c 0063cc7f 0063cbb0 00000000 f8eb0000 f8f31b24 f8f51af2 
       0000024e 00000000 00000000 0063cbc0 00000000 00000000 0063cc80 00000001 
Call Trace:                                                                    
 [<f8f31b24>] check_seg_usage+0x197/0x19f [gfs]
 [<f8f31ca6>] sync_trans+0x143/0x1b1 [gfs]     
 [<f8f31f4d>] disk_commit+0xec/0x264 [gfs]
 [<f8f32166>] log_refund+0x61/0x256 [gfs] 
 [<f8f32673>] log_flush_internal+0xeb/0x19b [gfs]
 [<f8f19438>] gfs_sync_meta+0x8/0x3c [gfs]       
 [<f8f27b67>] trans_go_drop_th+0x1b/0x2b [gfs]
 [<f8f22357>] rq_demote+0x15c/0x1da [gfs]     
 [<f8f22514>] run_queue+0x5a/0xc1 [gfs]  
 [<f8f2475d>] gfs_glock_dq+0x259/0x2b6 [gfs]
 [<f8f24d4b>] gfs_glock_dq_uninit+0x8/0x10 [gfs]
 [<f8f4b916>] gfs_lock_fs_check_clean+0xda/0xff [gfs]
 [<c0152754>] pagevec_lookup_tag+0x1b/0x23           
 [<c01914ed>] mpage_writepages+0x2b8/0x30f
 [<c01711fb>] blkdev_writepage+0x0/0xc    
 [<c01e4986>] radix_tree_gang_lookup_tag+0x3d/0x53
 [<f8f4b98e>] gfs_freeze_fs+0x53/0x83 [gfs]       
 [<f8f41192>] gfs_write_super_lockfs+0x14/0x71 [gfs]
 [<c016a42a>] freeze_bdev+0xf0/0xfd                 
 [<c016a42a>] freeze_bdev+0xf0/0xfd
 [<f8861133>] lock_fs+0x38/0x5a [dm_mod]
 [<f8861242>] dm_suspend+0xc1/0x209 [dm_mod]
 [<c011d0ce>] default_wake_function+0x0/0xc 
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<f886100f>] dm_get_mdptr+0x68/0x87 [dm_mod]
 [<f8863cd0>] do_suspend+0x128/0x14b [dm_mod]
 [<f8864de5>] ctl_ioctl+0xd9/0x13c [dm_mod]  
 [<f8863e80>] dev_suspend+0x0/0x10 [dm_mod]
 [<c017cd15>] sys_ioctl+0x297/0x336        
 [<c03114ab>] syscall_call+0x7/0xb 
Code: c3 83 b8 30 02 00 00 00 74 2c ff 35 d0 b7 40 c0 8d 80 90 48 01 00 50 55 ff
74 24 20 50 51 50 52 50 68 36 5b f5 f8 e8 93 3e 1d c7 <0f> 0b d3 00 b9 5b f5 f8
83 c4 28 e8 52 8a 1b c7 ff 35 d0 b7 40 
 <0>Fatal exception: panic in 5 seconds 
Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
lvm2-2.02.01-1.3.RHEL4
GFS-6.1.4-0

How reproducible:
unknown
  
Actual results:
After rebooting the snapshot volume was gone.

Expected results:
Don't panic.

Additional info:
I mounted the file system after I took the snapshot.  I wrote a 100MB file
with dd from /dev/zero, then did the lvremove (from system-config-lvm).

Comment 1 Jonathan Earl Brassow 2006-02-08 19:25:46 UTC
Does this still happen with the latest rpms?
What kind of mirroring are you using?  (DM Mirroring support is not available
until RHEL4 U4.)

Do you have a script that can reproduce?


Comment 2 Jonathan Earl Brassow 2006-02-08 19:28:49 UTC
It'd be nice to know the kernel version in this case as well.  (Might as well
get the device mapper version from dmsetup too.)

Comment 3 Alasdair Kergon 2006-02-08 19:52:06 UTC
Crashed in a GFS assert so one for GFS people to look at first and determine the
detailed cause of the failure.  (Then it might have to be transferred back.)

And please retest with a kernel package > 2.6.9-30.1 (checked into kernel cvs
yesterday) as there've been some snapshot fixes recently.


Comment 4 Nate Straz 2006-02-08 20:38:17 UTC
I'm loading the RHEL4-U3-re0127 images now (that includes the 30 kernel).  I'll
try reproducing it with that.

I was using whatever mirroring that system-config-lvm sets up.  I'd wager that's
LVM2 mirroring.  Whether that implies DM mirroring, I don't know.

I haven't written a script to reproduce it, nor did I reproduce it on my first try.

I was running the 29 kernel with device-mapper-1.02.02-3.0.RHEL4.

Comment 5 Nate Straz 2006-02-08 20:48:11 UTC
There isn't a GFS build against the -30 kernel so I tried to reproduce this
with ext3.  I was able to panic the system.

Running:
kernel: 2.6.9-30.ELsmp
lvm2-2.02.01-1.3.RHEL4
device-mapper-1.02.02-3.0.RHEL4

------------[ cut here ]------------
Unable to handle kernel paging request at virtual address f8c51004
 printing eip:                                                    
f882dc47      
*pde = 00000000
kernel BUG at drivers/md/kcopyd.c:145!
invalid operand: 0000 [#1]            
SMP                       
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core
sunrpc button battery ac uhci_hcd e7xxx_edac edac_mc hw_random shpchp e1000
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    1    
EIP:    0060:[<f8863c9f>]    Not tainted VLI
EFLAGS: 00010283   (2.6.9-30.ELsmp)         
EIP is at client_free_pages+0xb/0x32 [dm_mod]
eax: 00000100   ebx: f63969a0   ecx: f72c66c0   edx: 00000000
esi: f8bd1080   edi: 00000000   ebp: 00000000   esp: f5fdaf30
ds: 007b   es: 007b   ss: 0068                               
Process lvremove (pid: 6882, threadinfo=f5fda000 task=f7aaa030)
Stack: f63969a0 f886464f f7e26980 f882c7ca f8bd1080 f7e26780 f885fa9d f7e26780 
       cfb93120 f5fda000 08ff4ac8 f8861442 f88f7000 f886afe0 f8861b66 00000004 
       00000000 f8862e40 f88f7000 f8861ac0 f88f7000 f886b080 08ff4ac8 c134fd04 
Call Trace:                                                                    
 [<f886464f>] kcopyd_client_destroy+0x12/0x26 [dm_mod]
 [<f882c7ca>] snapshot_dtr+0x4f/0x58 [dm_snapshot]    
 [<f885fa9d>] table_destroy+0x39/0x83 [dm_mod]    
 [<f8861442>] __hash_remove+0x63/0x78 [dm_mod]
 [<f8861b66>] dev_remove+0xa6/0xc3 [dm_mod]   
 [<f8862e40>] ctl_ioctl+0xd9/0x13c [dm_mod]
 [<f8861ac0>] dev_remove+0x0/0xc3 [dm_mod] 
 [<c016a316>] sys_ioctl+0x227/0x269       
 [<c02d2703>] syscall_call+0x7/0xb 
 [<c02d007b>] schedule+0x283/0x8d3
Code: ff ff eb 17 89 38 45 89 c7 39 dd eb da 89 f0 89 fa e8 7d ff ff ff 01 5e 14
31 c0 5b 5e 5f 5d c3 53 89 c3 8b 40 14 39 43 18 74 08 <0f> 0b 91 00 c9 52 86 f8
8b 43 10 e8 8c ff ff ff c7 43 10 00 00 
 <1>Oops: 0000 [#2]                     
Fatal exception: panic in 5 seconds
SMP                                
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core
sunrpc button battery ac uhci_hcd e7xxx_edac edac_mc hw_random shpchp e1000
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    2    
EIP:    0060:[<f882dc47>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-30.ELsmp)         
EIP is at persistent_commit+0x90/0xd5 [dm_snapshot]
eax: 00000000   ebx: 00000000   ecx: f8c51000   edx: 00000000
esi: 00000000   edi: f72c4bc0   ebp: 00000000   esp: f69dcf08
ds: 007b   es: 007b   ss: 0068                               
Process kcopyd (pid: 6018, threadinfo=f69dc000 task=f71ea630)
Stack: 00004732 00000000 0000183e 00000000 e7c84654 f7e26980 e7c84654 f882ca6a 
       f882ca9a e7c84654 c93ae144 00000000 f8863dc8 00000000 c93ae144 f886b178 
       00000246 f8863d89 f8863f64 00000001 f886c8c0 f886c8c4 00000206 d85dc000 
Call Trace:                                                                    
 [<f882ca6a>] copy_callback+0x0/0x34 [dm_snapshot]
 [<f882ca9a>] copy_callback+0x30/0x34 [dm_snapshot]
 [<f8863dc8>] run_complete_job+0x3f/0x47 [dm_mod]  
 [<f8863d89>] run_complete_job+0x0/0x47 [dm_mod] 
 [<f8863f64>] process_jobs+0x16/0xcb [dm_mod]   
 [<f8864028>] do_work+0xf/0x2d [dm_mod]      
 [<c01309f3>] worker_thread+0x168/0x1d5
 [<f8864019>] do_work+0x0/0x2d [dm_mod]
 [<c011e7d7>] default_wake_function+0x0/0xc
 [<c011e7d7>] default_wake_function+0x0/0xc
 [<c013088b>] worker_thread+0x0/0x1d5      
 [<c0133f7d>] kthread+0x73/0x9b      
 [<c0133f0a>] kthread+0x0/0x9b 
 [<c01041f5>] kernel_thread_helper+0x5/0xb
Code: b9 01 00 00 00 89 f8 e8 73 fb ff ff 85 c0 89 c6 74 07 c7 47 08 00 00 00 00
31 db 3b 5f 28 73 1c 31 ed 8b 4f 2c 31 d2 01 e9 85 f6 <8b> 41 04 0f 94 c2 43 ff
11 83 c5 08 3b 5f 28 72 e6 c7 47 28 00 
 <0>Fatal exception: panic in 5 seconds 
Kernel panic - not syncing: Fatal exception


Comment 6 Jonathan Earl Brassow 2006-02-08 20:59:14 UTC
I believe you have to use kernel >= 30.1, it contains a critical snapshot fix.

Comment 7 Alasdair Kergon 2006-02-08 22:10:21 UTC
Those panics are quite different and should be fixed in yesterday's kernel.

Comment 8 Nate Straz 2006-02-08 22:15:19 UTC
Created attachment 124416 [details]
Test case

This test case seems to reproduce the second set of panics easily.  The sleep 5
seems to be critical to get it to panic.  Please try it with -30.1 since I
don't
see any builds of it on porkchop.

Comment 9 Nate Straz 2006-02-09 20:01:35 UTC
I tried the attached test case on -30.1 and instead of panicking, the lvremove
command hung.  Crash shows this stack trace:

crash> bt 3656
PID: 3656   TASK: f7aba5b0  CPU: 2   COMMAND: "lvremove"
 #0 [e0ec3e88] schedule at c02d0569
 #1 [e0ec3eec] kcopyd_client_destroy at f886471d
 #2 [e0ec3f3c] mirror_dtr at f8835626
 #3 [e0ec3f48] table_destroy at f885fa9b
 #4 [e0ec3f5c] __hash_remove at f886143d
 #5 [e0ec3f68] dev_remove at f8861b61
 #6 [e0ec3f74] ctl_ioctl at f8862e3c
 #7 [e0ec3f94] sys_ioctl at c016a243
 #8 [e0ec3fc0] system_call at c02d2630
    EAX: 00000036  EBX: 00000008  ECX: c134fd04  EDX: 092bb3f0 
    DS:  007b      ESI: 092bb3f0  ES:  007b      EDI: 002fa440 
    SS:  007b      ESP: bfe3ece4  EBP: bfe3feb8 
    CS:  0073      EIP: 001a27a2  ERR: 00000036  EFLAGS: 00000282 


Comment 10 Alasdair Kergon 2006-02-09 20:13:03 UTC
Can you get traces for *all* processes on the system?

That's a mirror being removed by lvremove.

What exactly is the PV, LV, filesystem configuration?

And see bug 179786 for other diagnostic commands (dmsetup) to run.

Comment 11 Alasdair Kergon 2006-02-09 20:18:48 UTC
Ah - just seen the attachment answering the configuration question.

I've never tried a snapshot of an lvm2 mirror so the code shouldn't really have
let you run it...

But the important thing is that you cannot reproduce the GFS panic any more -
then we'll switch this back to an lvm2 bug and assume the panic was fixed by the
kcopyd patch.


#!/bin/sh
#
# Script to reproduce Red Hat Bugzilla 180285
#

VG="errata"
LVSIZE="25G"
SNAPSIZE="10G"

lvcreate -m 1 -n mirror -L $LVSIZE $VG
mkfs -t ext3 /dev/mapper/$VG-mirror
mkdir -p /mnt/mirror
mount /dev/mapper/$VG-mirror /mnt/mirror
lvcreate -s -L $SNAPSIZE -n mirror-snap $VG/mirror
dd if=/dev/zero of=/mnt/mirror/zerofile bs=1024k count=100 &
sleep 5
lvremove -f $VG/mirror-snap
umount /mnt/mirror
lvremove -f $VG/mirror

Comment 12 David Lawrence 2006-04-12 18:39:06 UTC
The status NEEDINFO_QA has been deprecated in favor of NEEDINFO. Moving bugs to
NEEDINFO.

Comment 13 Jonathan Earl Brassow 2006-09-08 19:51:28 UTC
switching to lvm2 bug as mentioned in comment #11


Comment 14 Nate Straz 2006-09-08 21:29:45 UTC
What information are you looking for?

Comment 16 Jonathan Earl Brassow 2006-11-29 19:15:21 UTC
To close this bug, simply verify that you can't reproduce it anymore (given that
snapshots of mirrors are disallowed).

You could then open a RFE bug to add the ability to take snapshots of mirrors if
you like, but this bug would close.


Comment 17 Milan Broz 2007-01-10 10:59:54 UTC
Current RHEL4 lvm2 tools should not allow snapshots of mirrors, 
lvcreate -s will fail with "Snapshots and mirrors may not yet be mixed."

(Support for snapshot of mirrors is RFE bug 220502).