Bug 208514
| Summary: | singlenode gfs2 blows up when running fsx | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Dave Jones <davej> | ||||||
| Component: | kernel | Assignee: | Ben Marzinski <bmarzins> | ||||||
| Status: | CLOSED DUPLICATE | QA Contact: | Brian Brock <bbrock> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | 5.0 | CC: | cluster-maint, nobody+wcheng, pfrields, wtogami | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2007-06-29 17:26:44 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | 231239 | ||||||||
| Bug Blocks: | 204760 | ||||||||
| Attachments: |
|
||||||||
I rebooted into the 2699 kernel which has a slightly newer gfs2.. it got a bit further.. fsx foo truncating to largest ever: 0x13e76 truncating to largest ever: 0x2e52c truncating to largest ever: 0x3c2c2 truncating to largest ever: 0x3f15f truncating to largest ever: 0x3fcb9 truncating to largest ever: 0x3fe96 truncating to largest ever: 0x3ff9d Bus error (core dumped) dmesg has.. SELinux: initialized (dev loop0, type gfs2), uses xattr attempt to access beyond end of device loop0: rw=1, want=4090720, limit=409600 Buffer I/O error on device loop0, logical block 51133 lost page write due to I/O error on loop0 attempt to access beyond end of device loop0: rw=1, want=3675240, limit=409600 Buffer I/O error on device loop0, logical block 51044 lost page write due to I/O error on loop0 attempt to access beyond end of device loop0: rw=1, want=815376, limit=409600 Buffer I/O error on device loop0, logical block 50960 lost page write due to I/O error on loop0 I unmounted, and started over with the dd/mkfs/mount/fsx. this time fsx spewed pages and pages of errors, with another.. attempt to access beyond end of device loop1: rw=1, want=814672, limit=409600 Buffer I/O error on device loop1, logical block 50916 lost page write due to I/O error on loop1 in dmesg. I'll attach the fsx log if it's requested, but right now, I've not seen it last more than a few seconds, so this should be easily reproducable. The fsx failure ended with .. 14718(126 mod 256): READ 0xd2d4 thru 0x15384 (0x80b1 bytes) 14719(127 mod 256): MAPREAD 0x5773 thru 0xe481 (0x8d0f bytes) 14720(128 mod 256): WRITE 0x2d330 thru 0x39df8 (0xcac9 bytes) HOLE save_buffer write: No space left on device df showed .. /dev/loop1 210M 70M 141M 34% /mnt/test An error in space accounting with sparse files maybe ? See also bz #205307 which I think I'll close as a dup of this one. I suspect though that the original problem described in 205307 is now fixed but there is obviously still a problem here and this bug has the most uptodate information in it. *** Bug 205307 has been marked as a duplicate of this bug. *** Ben, I believe that we fixed the bug reported here some time ago, but if you could confirm that 100% that would be good. Also Dave has suggested there might be a problem relating to accounting for blocks, so if you could run a few tests creating and removing files and check that we get back all the space that was allocated when things are deleted, then we can either mark this as closed or fix the problem as appropriate. This appears to still be a problem. At least, using the code from git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes.git, I still see the error from Comment #2, and it does not seem to be a problem in the fsx test. I'm looking into what exactly is going on. I'm still not sure what's going on in the code, but the I've got a simple reproducer. If you just keep aternating between truncating a file down to 0 bytes, and writing to it, GFS2 eventually reports no space left on the device. However, the file's size is 0. Running df seems to clear things up. Created attachment 147705 [details]
simple recreater
This program just repeatedly truncates and then writes to a file. I should run
forever, and does on ext3. On gfs2, the write eventually fails with
errno=ENOSPC
Created attachment 150919 [details]
patch to flush log when there is no space in any resource groups
When you deallocate blocks in gf2, you are not able to reuse the space until
the associated resource groups are flushed to the ondisk log. This occasionally
caused gfs2 to act like there is no available space, when there actually is.
This patch causes gfs2 to flush the incore log if it is unable to find any
resource groups with available space.
Patch applied Did this get posted to rhkernel-list? This bug was posted to rhkernel-list was part of bz #239777 *** This bug has been marked as a duplicate of 239777 *** |
dd if=/dev/zero of=data.img mkfs.gfs2 -t mycluster:mygfs2 -p lock_nolock -j2 data.img mount -t gfs2 data.img /mnt/test/ -o loop cd /mnt/test fsx foo Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP: [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac PGD 4dca1067 PUD 4a681067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /block/loop7/dev CPU 1 Modules linked in: lock_nolock gfs2 loop ipt_ULOG x_tables af_key irda crc_ccitt atm decnet pppoe pppox ppp_generic slhc ipx p8023 appletalk i915 drm ipv6 hidp l2cap nfs lockd fscache nfs_acl sunrpc cpufreq_ondemand video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport intel_rng snd_hda_intel snd_hda_codec sr_mod cdrom snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device sg snd_pcm_oss snd_mixer_oss snd_pcm ohci1394 ieee1394 pcspkr i2c_i801 i2c_core hci_usb snd_timer e1000 bluetooth snd soundcore snd_page_alloc shpchp serio_raw dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 9014, comm: fsx Not tainted 2.6.18-1.2689.fc6 #1 RIP: 0010:[<ffffffff8872a2d6>] [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac RSP: 0000:ffff810049b27af8 EFLAGS: 00010246 RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 0000000000000035 RDX: ffff8100494e6000 RSI: ffff810049b27b68 RDI: ffff810049b27b68 RBP: ffff810049b27b18 R08: ffffffff887338a8 R09: ffff810049a96100 R10: ffff81000296e140 R11: 0000000000000246 R12: ffff810049b27b68 R13: 0000000000000001 R14: ffff810074660d08 R15: ffff810074660d08 FS: 0000000000000000(0000) GS:ffff81007e5a1cc0(0063) knlGS:00000000f7e936c0 CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b CR2: 00000000000000b8 CR3: 000000004a837000 CR4: 00000000000006e0 Process fsx (pid: 9014, threadinfo ffff810049b26000, task ffff810049a96100) Stack: 0000000000000001 ffff810049b27b68 0000000000000001 ffff810074660d08 ffff810049b27b48 ffffffff8872a397 ffff810049a96878 ffff810049b27b68 0000000000000000 ffff81000296e140 ffff810049b27c28 ffffffff88733acd Call Trace: [<ffffffff8872a397>] :gfs2:gfs2_glock_dq_m+0x2b/0x39 [<ffffffff88733acd>] :gfs2:gfs2_readpage+0xee/0x118 [<ffffffff8021404e>] filemap_nopage+0x264/0x357 [<ffffffff887398bb>] :gfs2:gfs2_sharewrite_nopage+0xfc/0x317 [<ffffffff80208d35>] __handle_mm_fault+0x493/0xc98 [<ffffffff80269fcc>] do_page_fault+0x487/0x842 [<ffffffff80261591>] error_exit+0x0/0x96 DWARF2 unwinder stuck at error_exit+0x0/0x96 Leftover inexact backtrace: Code: 4c 8b b3 b8 00 00 00 74 0a 31 f6 48 89 df e8 da f9 ff ff 4c RIP [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac RSP <ffff810049b27af8> CR2: 00000000000000b8 ===================================== [ BUG: lock held at task exit time! ] ------------------------------------- fsx/9014 is exiting with locks still held! 1 lock held by fsx/9014: #0: (&mm->mmap_sem){----}, at: [<ffffffff80269f16>] do_page_fault+0x3d1/0x842 stack backtrace: Call Trace: [<ffffffff8026ebcd>] show_trace+0xae/0x336 [<ffffffff8026ee6a>] dump_stack+0x15/0x17 [<ffffffff802a743e>] debug_check_no_locks_held+0x87/0x8b [<ffffffff80216575>] do_exit+0x8c2/0x911 [<ffffffff8026a2ff>] do_page_fault+0x7ba/0x842 [<ffffffff80261591>] error_exit+0x0/0x96 DWARF2 unwinder stuck at error_exit+0x0/0x96 Leftover inexact backtrace: [<ffffffff887338a8>] :gfs2:gfs2_get_block+0x0/0x77 [<ffffffff8872a2d6>] :gfs2:gfs2_glock_dq+0x16/0xac [<ffffffff802a8238>] mark_held_locks+0x53/0x79 [<ffffffff8872a397>] :gfs2:gfs2_glock_dq_m+0x2b/0x39 [<ffffffff88733acd>] :gfs2:gfs2_readpage+0xee/0x118 [<ffffffff80267e63>] _read_unlock_irq+0x2b/0x31 [<ffffffff802a8238>] mark_held_locks+0x53/0x79 [<ffffffff80267e63>] _read_unlock_irq+0x2b/0x31 [<ffffffff802a8410>] trace_hardirqs_on+0x119/0x13d [<ffffffff80267e63>] _read_unlock_irq+0x2b/0x31 [<ffffffff8021404e>] filemap_nopage+0x264/0x357 [<ffffffff887398bb>] :gfs2:gfs2_sharewrite_nopage+0xfc/0x317 [<ffffffff8873982b>] :gfs2:gfs2_sharewrite_nopage+0x6c/0x317 [<ffffffff80208d35>] __handle_mm_fault+0x493/0xc98 [<ffffffff80269f16>] do_page_fault+0x3d1/0x842 [<ffffffff80269fcc>] do_page_fault+0x487/0x842 [<ffffffff802327f8>] __up_write+0xf0/0xff [<ffffffff80267612>] trace_hardirqs_on_thunk+0x35/0x37 [<ffffffff80261591>] error_exit+0x0/0x96