Bug 153925

Summary:	Kernel panic when attempting to backup snapshot volume
Product:	Red Hat Enterprise Linux 4	Reporter:	Stephen N. Stremmel <stremmel>
Component:	kernel	Assignee:	LVM and device-mapper development team <lvm-team>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.0	CC:	agk, davej, dwysocha, ksorensen, mbroz, rh-admins
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-05-03 09:13:09 UTC	Type:	---

Description Stephen N. Stremmel 2005-04-05 20:45:20 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

Description of problem:
When using the snapshot feature, the kernel panics when any backup of the mounted snapshot is attempted.

Version-Release number of selected component (if applicable):
lvm2-2.00.31-1.0.RHEL4 & kernel-smp-2.6.9-5.0.3.EL

How reproducible:
Always

Steps to Reproduce:
1. [root@flamingo ~]# lvcreate --size 500m --snapshot --name bksnap /dev/VolGroup00/LogVol02
    Rounding up size to full physical extent 512.00 MB
    Logical volume "bksnap" created

(/var/log/messages entry:
Apr  5 11:41:51 flamingo kernel: kjournald starting.  Commit interval 5 seconds
Apr  5 11:41:51 flamingo kernel: EXT3 FS on dm-5, internal journal
Apr  5 11:41:51 flamingo kernel: EXT3-fs: mounted filesystem with ordered data mode.
Apr  5 11:41:51 flamingo kernel: SELinux: initialized (dev dm-5, type ext3), uses xattr
)

2. [root@flamingo ~]# lvscan
  ACTIVE            '/dev/VolGroup00/LogVol00' [14.00 GB] inherit
  ACTIVE   Original '/dev/VolGroup00/LogVol02' [9.75 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [1.94 GB] inherit
  ACTIVE   Snapshot '/dev/VolGroup00/bksnap' [512.00 MB] inherit

3. [root@flamingo ~]# mount /dev/VolGroup00/bksnap /mnt

4. [root@flamingo ~]# mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/mapper/VolGroup00-LogVol02 on /var type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/mapper/VolGroup00-bksnap on /mnt type ext3 (rw)

5. [root@flamingo ~]# rsync -av --stats /mnt /rsync-depot/test
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0146691
*pde = 203cc001
Oops: 0000 [#1]
SMP
Modules linked in: nfs lockd md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables button battery ac uhci_hcd e1000 e100 mii dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod 3w_xxxx sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c0146691>]    Not tainted VLI
EFLAGS: 00010282   (2.6.9-5.0.3.ELsmp)
EIP is at page_address+0x6/0x6e
eax: 00000000   ebx: 00000000   ecx: dfdf0700   edx: dfdf0680
esi: dfdf0700   edi: 00000000   ebp: 00000000   esp: c03b8ef8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03b8000 task=c0312a60)
Stack: f7fb3c80 dfdf0700 00000000 00000000 c0146315 dfdf0680 dfdf0680 f7c09b00
       c0146411 00000000 c0146406 dfdf0680 00002000 c0146427 c015a43e 00002000
       dfdf0680 00000000 c03b8f68 c0219298 f7d6d5ec 00000000 00000000 00001000
Call Trace:
 [<c0146315>] copy_to_high_bio_irq+0x2b/0x4c
 [<c0146411>] bounce_end_io_read+0x0/0x1b
 [<c0146406>] __bounce_end_io_read+0x19/0x24
 [<c0146427>] bounce_end_io_read+0x16/0x1b
 [<c015a43e>] bio_endio+0x50/0x55
 [<c0219298>] __end_that_request_first+0xea/0x1ab
 [<f8843620>] scsi_end_request+0x1b/0xa0 [scsi_mod]
 [<f88439e3>] scsi_io_completion+0x20b/0x417 [scsi_mod]
 [<f883fad6>] scsi_finish_command+0xad/0xb1 [scsi_mod]
 [<f883f9fb>] scsi_softirq+0xb6/0xbe [scsi_mod]
 [<c0124b2c>] __do_softirq+0x4c/0xb1
 [<c0107f39>] do_softirq+0x4f/0x56
 =======================
 [<c010784f>] do_IRQ+0x125/0x130
 [<c02c6a68>] common_interrupt+0x18/0x20
 [<c0104018>] default_idle+0x0/0x2c
 [<c0104041>] default_idle+0x29/0x2c
 [<c010409d>] cpu_idle+0x26/0x3b
 [<c0382784>] start_kernel+0x194/0x198
Code: 08 0f 0b da 01 55 70 2d c0 89 d8 5b e9 d0 fd ff ff 5b c3 69 c0 01 00 37 9e c1 e8 19 c1 e0 07 05 00 65 42 c0 c3 55 57 56 53 89 c3 <8b> 00 f6 c4 01 75 19 2b 1d 10 c5 42 c0 c1 fb 05 c1 e3 0c 8d 83
 <0>Kernel panic - not syncing: Fatal exception in interrupt
  

Actual Results:  Kernel panic

Additional info:

This is a consistent problem. Everything works well until you try to access the snapshot to back it up. I tried 'tar -cvf /rsync-depot/test.tar /mnt' (/rsync-depot is an NFS mount) as well as 'tar -cvf /tmp/test.tar /mnt' (a local FS). I booted into the installation kernel 2.6.9-5 and tried it with the same results. I also tried adding '--permission r' to the 'lvcreate' command line and using the '-r' flag when mounting the snapshot volume, with the same results.
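
For anyone retrying this, the read-only variant described above can be sketched as a small script. This is only an illustration under stated assumptions: the `run` helper and the `DRY_RUN` switch are hypothetical additions that do not appear in the original report, and the volume names, mount point, and tar target are the ones used in the steps above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the read-only reproduction variant from this report.
# Assumptions (not from the report): run() and DRY_RUN are illustrative helpers.
# WARNING: with DRY_RUN=0 on an affected kernel, the last step may panic the machine.

VG=${VG:-VolGroup00}
ORIGIN=${ORIGIN:-LogVol02}
SNAP=${SNAP:-bksnap}
MNT=${MNT:-/mnt}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = actually run them

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reproduce() {
    # Create a read-only 500 MB snapshot of the origin volume.
    run lvcreate --size 500m --snapshot --permission r \
        --name "$SNAP" "/dev/$VG/$ORIGIN" || return 1
    # Mount the snapshot read-only.
    run mount -r "/dev/$VG/$SNAP" "$MNT" || return 1
    # Read every file on the snapshot; the report saw the panic at this point.
    run tar -cvf /tmp/test.tar "$MNT"
}

reproduce
```

Running it as is prints the three commands; setting `DRY_RUN=0` as root executes them, which should only be done on a test machine.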

We are trying to use the snapshot feature for our node farm backups. I saw references to this bug regarding the 2.4.X kernel, but it was supposed to be resolved in RHEL4, or so it seemed.

Comment 2 OSU Physics Department Linux Admins 2006-02-03 16:59:50 UTC

I can confirm this issue on the latest RHELv4 AS kernel:

Linux version 2.6.9-22.0.2.ELsmp (bhcompile.redhat.com) (gcc
version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP Thu Jan 5 17:13:01 EST 2006

Using cpio to back up LVM2 snapshots causes an immediate kernel panic.  It is
always reproducible.  I have a complete netdump of a crash for analysis if it's
needed.

Comment 3 Doug Ledford 2008-10-02 13:42:41 UTC

This really looks like an lvm issue, not a SCSI issue.   Reassigning.

Comment 4 Kristian Sørensen 2009-10-03 11:08:50 UTC

Any chance that this will be fixed soon? I have a server that crashes once a week because of this issue. Or is this just another case of "The money you pay for your RedHat subscriptions does not imply that anyone at RedHat will lift a finger in order to fix any issue."? Seriously, this bug has been open for four and a half years now.

Comment 5 Milan Broz 2009-10-03 13:36:29 UTC

If you have such a serious problem, please file a ticket with Red Hat support at http://www.redhat.com/support and escalate the problem through the official support channel.

If you can crash the kernel from the RHEL 4.8 update, please post the kernel panic backtrace here, taken from a recent kernel (there have been too many fixes since then, so the old post is no longer usable). I think these problems were already fixed in updates, though: for example, some problems with bouncing pages were fixed in 2007 in http://rhn.redhat.com/errata/RHBA-2007-0791.html (bug 156385, but for some reason that bug is private).

Comment 6 Kristian Sørensen 2009-10-04 11:11:52 UTC

The crashes happen with SMP kernel 2.6.9-78.0.22. Switching to the UP kernel does not seem to help. I've just changed to the latest SMP kernel in the hope that this will stop the crashes.
I can't provide you with a full stack trace for now, as I do not have a serial console on the server in question (at least not yet).
The strange thing about this issue is that it seems to be appearing more frequently. At first it seemed to appear once every 2 or 3 months, but now it is roughly once a week. The only explanation I can think of for this behaviour is that the snapshot device contains more files now than it did when the crashes were less frequent.

Comment 7 Kristian Sørensen 2009-10-26 14:21:22 UTC

I can't get a capture of the crash on 2.6.9-89.0.11 since there seems to be a bug in the e1000 driver which causes a different kernel panic up to twice a day, so that kernel is not an option for my production server. Wasn't RHEL supposed to be at least somewhat stable?
I have now managed to get a serial console connection up and running, so I should be able to provide you with a crash dump relating to this issue within a week or so.

Comment 8 Kristian Sørensen 2009-10-28 06:45:54 UTC

OK. Here comes the backtrace:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
00000000
*pde = 2f2be001
Oops: 0000 [#1]
SMP
Modules linked in: md5 ipv6 w83627hf eeprom i2c_sensor i2c_isa i2c_i801 i2c_dev i2c_core nfs lockd nfs_acl sunrpc cpufreq_powersave button battery ac uhci_hcd hw_random e100 mii e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod ata_piix libata sd_mod scsi_mod
CPU:    1
EIP:    0060:[<00000000>]    Not tainted VLI
EFLAGS: 00010082   (2.6.9-78.0.22.ELsmp)
EIP is at 0x0
eax: 00000001   ebx: db482f0c   ecx: c3136de0   edx: 00000000
esi: d7e3dee4   edi: d1a6ca80   ebp: c0120572   esp: d7e3def0
ds: 007b   es: 007b   ss: 0068
Process gzip (pid: 18200, threadinfo=d7e3d000 task=f4542bb0)
Stack: db482f0c 00000001 c011e845 00000000 00000000 d1a6ca88 00000001 00000001
       d1a6ca80 00000001 d7e3df3c c011e8ea 00000001 00000000 00000202 00000001
       d1a6ca80 d7e3df80 080a25c0 00001000 c016757a 00000000 00000000 ecf2a000
Call Trace:
 [<c011e845>] __wake_up_common+0x36/0x51
 [<c011e8ea>] __wake_up_sync+0x3b/0x56
 [<c016757a>] pipe_readv+0x200/0x29e
 [<c0167634>] pipe_read+0x1c/0x20
 [<c015c942>] vfs_read+0xb6/0xe2
 [<c015cb57>] sys_read+0x3c/0x62
 [<c02e0a2f>] syscall_call+0x7/0xb
 [<c02e007b>] __lock_text_end+0x820/0x1071
Code:  Bad EIP value.
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

Comment 9 Kristian Sørensen 2009-11-13 06:53:01 UTC

And today I had another crash:

Red Hat Enterprise Linux ES release 4 (Nahant Update 5)
Kernel 2.6.9-78.0.22.ELsmp on an i686

indus.nordija.com login: Unable to handle kernel paging request at virtual address fffff010
 printing eip:
c014a018
*pde = 00200074
Oops: 0000 [#1]
SMP
Modules linked in: md5 ipv6 w83627hf eeprom i2c_sensor i2c_isa i2c_i801 i2c_dev i2c_core nfs lockd nfs_acl sunrpc cpufreq_powersave button battery ac uhci_hcd hw_random e100 mii e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod ata_piix libata sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c014a018>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9-78.0.22.ELsmp)
EIP is at lru_add_drain+0xd/0x77
eax: e1f08080   ebx: fffff000   ecx: eb65d3e4   edx: c03d5b80
esi: e1f08080   edi: da688b74   ebp: b7f12000   esp: f2dedf68
ds: 007b   es: 007b   ss: 0068
Process tar (pid: 31926, threadinfo=f2ded000 task=ebb58eb0)
Stack: c03d3260 c0152147 eb65d3e4 e1f08080 00000000 e1f080c4 da688b74 b7f12000
       e1f08080 c015243a b7f12000 b7f13000 b7f13000 b7f13000 eb65d3e4 e1f08080
       e1f080b0 00000000 f2ded000 c01524aa b7f12000 09562220 c02e0a2f b7f12000
Call Trace:
 [<c0152147>] unmap_region+0x24/0xef
 [<c015243a>] do_munmap+0xf8/0x116
 [<c01524aa>] sys_munmap+0x52/0x6a
 [<c02e0a2f>] syscall_call+0x7/0xb
 [<c02e007b>] __lock_text_end+0x820/0x1071
Code: 53 0c f0 ff 42 04 8b 01 89 5c 81 08 40 83 f8 0e 89 01 75 08 5b 89 c8 e9 9c 03 00 00 5b c3 53 bb 00 f0 ff ff ba 80 5b 3d c0 21 e3 <8b> 43 10 03 14 85 20 f1 3d c0 83 3a 00 74 07 89 d0 e8 c3 02 00
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

This always happens at night when the snapshot LV is mounted.

Comment 10 Milan Broz 2011-05-03 09:13:09 UTC

Hm, it seems this bug, reported in 2005, got lost in the queue for a long time. Sorry for that.

The crash in comment #9 is probably unrelated.

Anyway, there have been several DM snapshot fixes in the RHEL4 kernel, so I think it should be fixed now.

If you still see the problem, please report a new bug or open a support ticket. Thanks.