Description of problem:
I was running my "hell" GFS2 tests on the latest GFS2 version and got
a kernel panic with "Kernel BUG at mm/filemap.c:553" on "hell6".
This problem was not caused by a recent code change: it exists in
the -57 through -68 kernels, but the -56 kernel works properly.
Version-Release number of selected component (if applicable):
RHEL 5.2 beta
Steps to Reproduce:
The "hell6" test is as follows:
service cman start
service clvmd start
mkfs.gfs2 -O -t bobs_roth:test_gfs -p lock_dlm -j 3 /dev/roth_vg/roth_lv
mount -tgfs2 /dev/roth_vg/roth_lv /mnt/gfs2
cp -a * /mnt/gfs2/
The test goes on to rm the copied files after the file system is full
on a working system. However, it is the cp that causes the kernel to
panic on broken versions.
Actual results:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/filemap.c:553
invalid opcode: 0000  SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap
bluetooth sunrpc ipv6 dm_multipath video sbs backlight i2c_ec button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport joydev ide_cd shpchp i2c_i801
sg i2c_core cdrom tg3 serio_raw pcspkr dm_snapshot dm_zero dm_mirror dm_mod
qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd
Pid: 2946, comm: cp Not tainted 2.6.18-57.el5 #1
RIP: 0010:[<ffffffff800179a4>] [<ffffffff800179a4>] unlock_page+0xf/0x2f
RSP: 0018:ffff8100525d9bd8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810000d854d8 RCX: ffff81001f8949e4
RDX: ffff81001e935500 RSI: 00000000ffffffe4 RDI: ffff810000d854d8
RBP: ffff810000d854d8 R08: 00000000ffffffe4 R09: 0000000000020000
R10: ffff8100015e98f8 R11: 00000000fffffffa R12: 000000000035b000
R13: 0000000000001000 R14: 00007fffec4aa000 R15: 0000000000000000
FS: 00002aaaaaabaf20(0000) GS:ffff8100026e57c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000e29f1f8 CR3: 0000000056abe000 CR4: 00000000000006e0
Process cp (pid: 2946, threadinfo ffff8100525d8000, task ffff8100644d1860)
Stack: ffffffffffffffe4 ffffffff8000f9eb 0000000000001000 ffff8100525d9f50
000000000035b000 0000000000000001 ffff8100525d9dd8 0000000000000001
0000100000000000 ffff81007be5c580 ffff81001e935610 ffffffff884c7fa0
Code: 0f 0b 68 9f d2 28 80 c2 29 02 48 89 df e8 0e 2a 00 00 48 89
Expected results:
No kernel panic. When the device gets full, it should just start
spewing out messages like these:
cp: writing `/mnt/gfs2/Metal/Rhapsody/SF6GZ0~2/10 - Guardiani del destino.mp3':
No space left on device
The problem was introduced between -56 and -57, so between 13 Nov and
21 Nov 2007. I'll compare the sources to see if I can find the problem.
Created attachment 292044 [details]
Diff between -56 and -57 kernel
This is a diff between the -56 version that works and the -57 version
that fails hell6. It's 1152 lines long, so lots of changes.
The problem happens with data=writeback as well as with the default
data=ordered mode.
Created attachment 292094 [details]
Patch to fix the problem
Solved. Function gfs2_write_lock_start was unlocking the page
prematurely. When the code figured out there was no space left on
the device, it returned -ENOSPC. However, the VFS will try to unlock
the page itself if the return code is not AOP_TRUNCATED_PAGE.
The problem was that we had already unlocked it, so the second
unlock_page hit the BUG in mm/filemap.c.
The solution in this patch is to have gfs2_write_lock_start unlock
the page only when it is going to return AOP_TRUNCATED_PAGE.
BTW, this problem no longer exists upstream because the upstream code
has advanced beyond the need for returning AOP_TRUNCATED_PAGE.
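The double unlock described above can be sketched in user space. This is only a toy model, not the GFS2 or VFS source: toy_page, toy_lock_page, toy_unlock_page, toy_write_start_buggy and toy_vfs_write are all hypothetical names, the BUG() in unlock_page is replaced by a counter, and the value given to AOP_TRUNCATED_PAGE is used purely as a marker.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Marker value for this sketch; only its distinctness from -ENOSPC matters. */
#define AOP_TRUNCATED_PAGE 0x80001

/* Toy page: one lock bit plus a counter standing in for BUG(). */
struct toy_page {
    bool locked;
    int bug_count;
};

static void toy_lock_page(struct toy_page *p)
{
    assert(!p->locked);        /* single-threaded model, so no waiting */
    p->locked = true;
}

/* Like unlock_page(): unlocking a page that is not locked is a bug. */
static void toy_unlock_page(struct toy_page *p)
{
    if (!p->locked) {
        p->bug_count++;        /* the real kernel would hit BUG() here */
        return;
    }
    p->locked = false;
}

/* Hypothetical model of the broken write-start path. */
static int toy_write_start_buggy(struct toy_page *p, bool device_full)
{
    toy_unlock_page(p);        /* page lock dropped early */
    if (device_full)
        return -ENOSPC;        /* returns with the page still unlocked */
    toy_lock_page(p);          /* success: page is locked again */
    return 0;
}

/* Model of the caller: it unlocks the page unless told to retry. */
static int toy_vfs_write(struct toy_page *p, bool device_full)
{
    int status;

    toy_lock_page(p);
    status = toy_write_start_buggy(p, device_full);
    if (status != AOP_TRUNCATED_PAGE)
        toy_unlock_page(p);    /* second unlock when status is -ENOSPC */
    return status;
}
```

Calling toy_vfs_write with device_full set to true leaves bug_count at 1, which corresponds to the unlock_page frame at the top of the trace; with device_full set to false the counter stays at 0.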
Reassigning to myself.
Also, I need some flags set please. It's important to get this into
RHEL5.2. We don't want rudimentary errors like out-of-space to cause
a kernel panic.
The patch in comment #3 is wrong. We can't remove the unlock before taking the
glock, as that's the whole point of gfs2_write_lock_start. Instead we'll have to
fix it by checking the error path and getting the page lock again if and only if
we are going to return an error.
Created attachment 292152 [details]
Does this look better? This one keeps the unlock_page in place,
but relocks the page on error.
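As a rough illustration of the relock-on-error approach, here is a user-space toy model. It is not the attached patch: every toy_* name is hypothetical, the must_retry parameter is an assumed stand-in for whatever condition makes the real function return AOP_TRUNCATED_PAGE, and BUG() is replaced by a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Marker value for this sketch; only its distinctness from -ENOSPC matters. */
#define AOP_TRUNCATED_PAGE 0x80001

/* Toy page: one lock bit plus a counter standing in for BUG(). */
struct toy_page {
    bool locked;
    int bug_count;
};

static void toy_lock_page(struct toy_page *p)
{
    assert(!p->locked);            /* single-threaded model, so no waiting */
    p->locked = true;
}

static void toy_unlock_page(struct toy_page *p)
{
    if (!p->locked) {
        p->bug_count++;            /* the real kernel would hit BUG() here */
        return;
    }
    p->locked = false;
}

/* Hypothetical model of the revised write-start path. */
static int toy_write_start_fixed(struct toy_page *p, bool must_retry,
                                 bool device_full)
{
    int error = 0;

    toy_unlock_page(p);            /* the early unlock stays in place */
    if (must_retry)
        return AOP_TRUNCATED_PAGE; /* caller retries; page stays unlocked */
    if (device_full)
        error = -ENOSPC;
    toy_lock_page(p);              /* relock so the caller's unlock is valid */
    return error;
}

/* Model of the caller: it unlocks the page unless told to retry. */
static int toy_vfs_write_fixed(struct toy_page *p, bool must_retry,
                               bool device_full)
{
    int status;

    toy_lock_page(p);
    status = toy_write_start_fixed(p, must_retry, device_full);
    if (status != AOP_TRUNCATED_PAGE)
        toy_unlock_page(p);        /* balanced on both success and error */
    return status;
}
```

The invariant this sketch tries to capture is that the page is locked on every return except AOP_TRUNCATED_PAGE, so the caller's single unlock is always balanced and bug_count stays at 0 on the -ENOSPC path.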
Created attachment 292384 [details]
Same patch, but can be applied before 253990 patch
This is the same patch, but the line numbers are different. The previous
version was meant to be applied on top of the 253990 (performance) fix.
This one goes directly on top of the preceding i_alloc fix that was
previously posted to rhkernel-list. I'm planning to post this one.
The fix was posted to rhkernel-list, so I'm changing status to POST
and rerouting it to Don Zickus.
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.