Bug 208514

Summary:

singlenode gfs2 blows up when running fsx

Product:

Red Hat Enterprise Linux 5

Reporter:

Dave Jones <davej>

Component:

kernel

Assignee:

Ben Marzinski <bmarzins>

Status:

CLOSED DUPLICATE

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

5.0

CC:

cluster-maint, nobody+wcheng, pfrields, wtogami

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2007-06-29 17:26:44 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

231239

Bug Blocks:

204760

Attachments:

Description	Flags
simple recreater	none
patch to flush log when there is no space in any resource groups	none

Description Dave Jones 2006-09-28 23:27:19 UTC

dd if=/dev/zero of=data.img
mkfs.gfs2 -t mycluster:mygfs2 -p lock_nolock -j2 data.img
mount -t gfs2 data.img /mnt/test/ -o loop
cd /mnt/test
fsx foo

Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP: 
 [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac
PGD 4dca1067 PUD 4a681067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /block/loop7/dev
CPU 1 
Modules linked in: lock_nolock gfs2 loop ipt_ULOG x_tables af_key irda crc_ccitt
atm decnet pppoe pppox ppp_generic slhc ipx p8023 appletalk i915 drm ipv6 hidp
l2cap nfs lockd fscache nfs_acl sunrpc cpufreq_ondemand video sbs i2c_ec button
battery asus_acpi ac parport_pc lp parport intel_rng snd_hda_intel snd_hda_codec
sr_mod cdrom snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
sg snd_pcm_oss snd_mixer_oss snd_pcm ohci1394 ieee1394 pcspkr i2c_i801 i2c_core
hci_usb snd_timer e1000 bluetooth snd soundcore snd_page_alloc shpchp serio_raw
dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 9014, comm: fsx Not tainted 2.6.18-1.2689.fc6 #1
RIP: 0010:[<ffffffff8872a2d6>]  [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac
RSP: 0000:ffff810049b27af8  EFLAGS: 00010246
RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 0000000000000035
RDX: ffff8100494e6000 RSI: ffff810049b27b68 RDI: ffff810049b27b68
RBP: ffff810049b27b18 R08: ffffffff887338a8 R09: ffff810049a96100
R10: ffff81000296e140 R11: 0000000000000246 R12: ffff810049b27b68
R13: 0000000000000001 R14: ffff810074660d08 R15: ffff810074660d08
FS:  0000000000000000(0000) GS:ffff81007e5a1cc0(0063) knlGS:00000000f7e936c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000000000b8 CR3: 000000004a837000 CR4: 00000000000006e0
Process fsx (pid: 9014, threadinfo ffff810049b26000, task ffff810049a96100)
Stack:  0000000000000001 ffff810049b27b68 0000000000000001 ffff810074660d08
 ffff810049b27b48 ffffffff8872a397 ffff810049a96878 ffff810049b27b68
 0000000000000000 ffff81000296e140 ffff810049b27c28 ffffffff88733acd
Call Trace:
 [<ffffffff8872a397>] :gfs2:gfs2_glock_dq_m+0x2b/0x39
 [<ffffffff88733acd>] :gfs2:gfs2_readpage+0xee/0x118
 [<ffffffff8021404e>] filemap_nopage+0x264/0x357
 [<ffffffff887398bb>] :gfs2:gfs2_sharewrite_nopage+0xfc/0x317
 [<ffffffff80208d35>] __handle_mm_fault+0x493/0xc98
 [<ffffffff80269fcc>] do_page_fault+0x487/0x842
 [<ffffffff80261591>] error_exit+0x0/0x96
DWARF2 unwinder stuck at error_exit+0x0/0x96
Leftover inexact backtrace:


Code: 4c 8b b3 b8 00 00 00 74 0a 31 f6 48 89 df e8 da f9 ff ff 4c 
RIP  [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac
 RSP <ffff810049b27af8>
CR2: 00000000000000b8
 
=====================================
[ BUG: lock held at task exit time! ]
-------------------------------------
fsx/9014 is exiting with locks still held!
1 lock held by fsx/9014:
 #0:  (&mm->mmap_sem){----}, at: [<ffffffff80269f16>] do_page_fault+0x3d1/0x842

stack backtrace:

Call Trace:
 [<ffffffff8026ebcd>] show_trace+0xae/0x336
 [<ffffffff8026ee6a>] dump_stack+0x15/0x17
 [<ffffffff802a743e>] debug_check_no_locks_held+0x87/0x8b
 [<ffffffff80216575>] do_exit+0x8c2/0x911
 [<ffffffff8026a2ff>] do_page_fault+0x7ba/0x842
 [<ffffffff80261591>] error_exit+0x0/0x96
DWARF2 unwinder stuck at error_exit+0x0/0x96
Leftover inexact backtrace:
 [<ffffffff887338a8>] :gfs2:gfs2_get_block+0x0/0x77
 [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac
 [<ffffffff802a8238>] mark_held_locks+0x53/0x79
 [<ffffffff8872a397>] :gfs2:gfs2_glock_dq_m+0x2b/0x39
 [<ffffffff88733acd>] :gfs2:gfs2_readpage+0xee/0x118
 [<ffffffff80267e63>] _read_unlock_irq+0x2b/0x31
 [<ffffffff802a8238>] mark_held_locks+0x53/0x79
 [<ffffffff80267e63>] _read_unlock_irq+0x2b/0x31
 [<ffffffff802a8410>] trace_hardirqs_on+0x119/0x13d
 [<ffffffff80267e63>] _read_unlock_irq+0x2b/0x31
 [<ffffffff8021404e>] filemap_nopage+0x264/0x357
 [<ffffffff887398bb>] :gfs2:gfs2_sharewrite_nopage+0xfc/0x317
 [<ffffffff8873982b>] :gfs2:gfs2_sharewrite_nopage+0x6c/0x317
 [<ffffffff80208d35>] __handle_mm_fault+0x493/0xc98
 [<ffffffff80269f16>] do_page_fault+0x3d1/0x842
 [<ffffffff80269fcc>] do_page_fault+0x487/0x842
 [<ffffffff802327f8>] __up_write+0xf0/0xff
 [<ffffffff80267612>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80261591>] error_exit+0x0/0x96

Comment 1 Dave Jones 2006-09-28 23:39:30 UTC

I rebooted into the 2699 kernel which has a slightly newer gfs2..

it got a bit further..
fsx foo
truncating to largest ever: 0x13e76
truncating to largest ever: 0x2e52c
truncating to largest ever: 0x3c2c2
truncating to largest ever: 0x3f15f
truncating to largest ever: 0x3fcb9
truncating to largest ever: 0x3fe96
truncating to largest ever: 0x3ff9d
Bus error (core dumped)

dmesg has..

SELinux: initialized (dev loop0, type gfs2), uses xattr
attempt to access beyond end of device
loop0: rw=1, want=4090720, limit=409600
Buffer I/O error on device loop0, logical block 51133
lost page write due to I/O error on loop0
attempt to access beyond end of device
loop0: rw=1, want=3675240, limit=409600
Buffer I/O error on device loop0, logical block 51044
lost page write due to I/O error on loop0
attempt to access beyond end of device
loop0: rw=1, want=815376, limit=409600
Buffer I/O error on device loop0, logical block 50960
lost page write due to I/O error on loop0


I unmounted, and started over with the dd/mkfs/mount/fsx.  this time fsx spewed
pages and pages of errors, with another..
attempt to access beyond end of device
loop1: rw=1, want=814672, limit=409600
Buffer I/O error on device loop1, logical block 50916
lost page write due to I/O error on loop1

in dmesg.   I'll attach the fsx log if it's requested, but right now, I've not
seen it last more than a few seconds, so this should be easily reproducable.

Comment 2 Dave Jones 2006-09-28 23:40:50 UTC

The fsx failure ended with ..

14718(126 mod 256): READ        0xd2d4 thru 0x15384     (0x80b1 bytes)
14719(127 mod 256): MAPREAD     0x5773 thru 0xe481      (0x8d0f bytes)
14720(128 mod 256): WRITE       0x2d330 thru 0x39df8    (0xcac9 bytes) HOLE
save_buffer write: No space left on device

df showed ..
/dev/loop1             210M    70M   141M  34% /mnt/test

An error in space accounting with sparse files maybe ?

Comment 3 Steve Whitehouse 2006-09-29 08:11:12 UTC

See also bz #205307 which I think I'll close as a dup of this one. I suspect
though that the original problem described in 205307 is now fixed but there is
obviously still a problem here and this bug has the most uptodate information in it.

Comment 4 Steve Whitehouse 2006-09-29 08:20:35 UTC

*** Bug 205307 has been marked as a duplicate of this bug. ***

Comment 5 Steve Whitehouse 2007-02-02 16:26:50 UTC

Ben, I believe that we fixed the bug reported here some time ago, but if you
could confirm that 100% that would be good. Also Dave has suggested there might
be a problem relating to accounting for blocks, so if you could run a few tests
creating and removing files and check that we get back all the space that was
allocated when things are deleted, then we can either mark this as closed or fix
the problem as appropriate.

Comment 6 Ben Marzinski 2007-02-07 15:42:59 UTC

This appears to still be a problem. At least, using the code from
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes.git, I still
see the error from Comment #2, and it does not seem to be a problem in the fsx
test. I'm looking into what exactly is going on.

Comment 7 Ben Marzinski 2007-02-08 22:09:03 UTC

I'm still not sure what's going on in the code, but the I've got a simple
reproducer. If you just keep aternating between truncating a file down to 0
bytes, and writing to it, GFS2 eventually reports no space left on the device.
However, the file's size is 0. Running df seems to clear things up.

Comment 8 Ben Marzinski 2007-02-08 22:14:46 UTC

Created attachment 147705 [details]
simple recreater

This program just repeatedly truncates and then writes to a file. I should run
forever, and does on ext3. On gfs2, the write eventually fails with
errno=ENOSPC

Comment 9 Ben Marzinski 2007-03-26 17:11:15 UTC

Created attachment 150919 [details]
patch to flush log when there is no space in any resource groups

When you deallocate blocks in gf2, you are not able to reuse the space until
the associated resource groups are flushed to the ondisk log. This occasionally
caused gfs2 to act like there is no available space, when there actually is.
This patch causes gfs2 to flush the incore log if it is unable to find any
resource groups with available space.

Comment 10 Ben Marzinski 2007-03-29 19:07:34 UTC

Patch applied

Comment 11 Kiersten (Kerri) Anderson 2007-04-26 17:59:41 UTC

Did this get posted to rhkernel-list?

Comment 12 Ben Marzinski 2007-06-29 17:26:44 UTC

This bug was posted to rhkernel-list was part of bz #239777

*** This bug has been marked as a duplicate of 239777 ***