Bug 146672 - Node panic doing a writev()
Summary: Node panic doing a writev()
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ben Marzinski
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 144795
 
Reported: 2005-01-31 17:37 UTC by Dean Jansa
Modified: 2013-09-23 15:31 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 15:31:47 UTC
Embargoed:



Description Dean Jansa 2005-01-31 17:37:22 UTC
GFS <CVS> (built Jan 28 2005 13:48:31)

Node panic doing a writev().

5 node cluster, one node running:

  iogen -f sync -i 30s -m random -s read,write,readv,writev -t 1b -T 400000b 400000b:rwransynclarge | doio -av

in a filesystem mounted on all nodes. The above I/O was running in a
directory separate from those used by the other nodes.

Node panics with:

Jan 28 21:00:52 tank-03 kernel: GFS: fsid=tank-cluster:vedder.2: head_off = 170934368, head_wrap = 79
Jan 28 21:00:52 tank-03 kernel: GFS: fsid=tank-cluster:vedder.2: dump_off = 170933520, dump_wrap = 78
Jan 28 21:00:52 tank-03 kernel:  [<f8b44ea2>] gfs_assert_i+0x48/0x69 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b2dc0d>] check_seg_usage+0x197/0x19f [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b2dd90>] sync_trans+0x143/0x1b1 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b307e3>] quota_trans_size+0x20/0x36 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b2df1d>] disk_commit+0xe8/0x260 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b2e136>] log_refund+0x61/0x181 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b2e470>] log_flush_internal+0xc2/0x172 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b2e8a9>] gfs_log_commit+0x1ae/0x1d1 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b43e0d>] gfs_trans_end+0xa0/0xce [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b375b3>] do_do_write_buf+0x309/0x3b4 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b3776e>] do_write_buf+0x110/0x152 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b36978>] walk_vm+0xd3/0xf7 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<c0119851>] do_page_fault+0x1ae/0x5b6
Jan 28 21:00:52 tank-03 kernel:  [<f8b3784d>] gfs_write+0x9d/0xb6 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b3765e>] do_write_buf+0x0/0x152 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<f8b377b0>] gfs_write+0x0/0xb6 [gfs]
Jan 28 21:00:52 tank-03 kernel:  [<c0155a80>] do_readv_writev+0x1c5/0x21d
Jan 28 21:00:52 tank-03 kernel:  [<c011c73c>] load_balance+0x27/0x134
Jan 28 21:00:52 tank-03 kernel:  [<c011c832>] load_balance+0x11d/0x134
Jan 28 21:00:52 tank-03 kernel:  [<c0155b56>] vfs_writev+0x3e/0x43
Jan 28 21:00:52 tank-03 kernel:  [<c0155bf9>] sys_writev+0x3c/0x62
Jan 28 21:00:52 tank-03 kernel:  [<c02c62a3>] syscall_call+0x7/0xb



Version-Release number of selected component (if applicable):


How reproducible:

Have not tried yet.

Steps to Reproduce:
1. See above.

Comment 1 Dean Jansa 2005-01-31 22:59:50 UTC
FWIW --

Before the I/O was run, data journaling was turned on via:

gfs_tool setflag inherit_jdata /mnt/vedder

Comment 2 Dean Jansa 2005-02-01 22:24:18 UTC
Able to hit this again with the load below, started on each node;
tank-01 trips over the assert within moments:

iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge0 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge1 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge2 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge3 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge4 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge5 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge6 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge7 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge8 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge9 | doio -av

Comment 3 Ben Marzinski 2005-02-01 22:35:50 UTC
I have been able to reproduce this fairly reliably by doing the following:

- make a gfs file system with the smallest journal size possible
- mount it on one node
- turn on data journaling
- run the revolver load set to HEAVY
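
A minimal shell sketch of that setup. The cluster name, filesystem name, journal count, block device, and mount point are placeholders, the -J journal-size option is assumed to take megabytes with 32 as the smallest allowed value, and the revolver harness invocation is not shown:

  # make a gfs filesystem with the smallest journal size possible (assumed: -J 32, in MB)
  gfs_mkfs -p lock_dlm -t <cluster>:<fsname> -j <journals> -J 32 <blockdev>
  # mount it on one node
  mount -t gfs <blockdev> <mountpoint>
  # turn on data journaling (files created in the fs inherit the jdata flag)
  gfs_tool setflag inherit_jdata <mountpoint>
  # run the revolver load set to HEAVY (harness invocation omitted)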

Now I'm going to putz with tuning things and see if I can make it go
away... or at least make it very hard to reproduce. In the long term,
Ken is working on changes that should avoid this problem; in the short
term, making it hard to hit might be the best option. Apparently the
potential for this failure has always been in the journaling code, and
fixing it properly may take some work.

Comment 4 Ben Marzinski 2005-02-04 20:47:02 UTC
It seems that when you hit this bug with head_off and dump_off equal,
things end up much worse. When I tried to remount the filesystem after
panicking on the assert with head_off and dump_off both equal to
13558176, I hit the following assert:

GFS: Trying to join cluster "lock_dlm", "cypher:gfs1"
GFS: fsid=cypher:gfs1.0: Joined cluster. Now mounting FS...
GFS: fsid=cypher:gfs1.0: jid=0: Trying to acquire journal lock...
GFS: fsid=cypher:gfs1.0: jid=0: Looking at journal...
GFS: fsid=cypher:gfs1.0: jid=0: Acquiring the transaction lock...
GFS: fsid=cypher:gfs1.0: jid=0: Replaying journal...
GFS: fsid=cypher:gfs1.0: jid=0: Replayed 1368 of 1390 blocks
GFS: fsid=cypher:gfs1.0: jid=0: replays = 1368, skips = 0, sames = 22
GFS: fsid=cypher:gfs1.0: jid=0: Journal replayed in 1s
GFS: fsid=cypher:gfs1.0: jid=0: Done
GFS: fsid=cypher:gfs1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=cypher:gfs1.0: jid=1: Looking at journal...
GFS: fsid=cypher:gfs1.0: jid=1: Done
GFS: fsid=cypher:gfs1.0: jid=2: Trying to acquire journal lock...
GFS: fsid=cypher:gfs1.0: jid=2: Looking at journal...
GFS: fsid=cypher:gfs1.0: jid=2: Done
GFS: fsid=cypher:gfs1.0: Scanning for log elements...
 [<e02cee56>] gfs_assert_i+0x48/0x69 [gfs]
 [<e02ba464>] unlinked_after_scan+0x0/0x4b [gfs]
 [<e02ba492>] unlinked_after_scan+0x2e/0x4b [gfs]
 [<e02c9c3d>] gfs_recover_dump+0x104/0x156 [gfs]
 [<e02cd447>] gfs_make_fs_rw+0xc3/0x11b [gfs]
 [<e02c2e33>] fill_super+0x9c7/0xf4f [gfs]
 [<c013eaea>] pagevec_lookup+0x17/0x1d
 [<c01565d6>] set_blocksize+0x77/0x7c
 [<e02c34ea>] gfs_get_sb+0x114/0x159 [gfs]
 [<c0167c74>] alloc_vfsmnt+0x78/0x9f
 [<c0156263>] do_kern_mount+0x8a/0x13d
 [<c0168c2b>] do_new_mount+0x61/0x90
 [<c0169278>] do_mount+0x178/0x190
 [<c0139ca0>] __alloc_pages+0xb4/0x298
 [<c01695cf>] sys_mount+0x91/0x108
 [<c029d6c7>] syscall_call+0x7/0xb
Kernel panic - not syncing: GFS: fsid=cypher:gfs1.0: assertion "test_bit(SDF_FOUND_UL_DUMP, &sdp->sd_flags)" failed
GFS: fsid=cypher:gfs1.0:   function = unlinked_after_scan
GFS: fsid=cypher:gfs1.0:   file = /home/devel/cluster/gfs-kernel/src/gfs/lops.c, line = 1168
GFS: fsid=cypher:gfs1.0:   time = 1107361448

It looks to me like, even though there is an assert to make sure the
last dump entry doesn't get overwritten, that entry still gets lost.
When I try to remount, the entry that used to be the last dump entry
has a different sequence number than it did when the assert fired.

Comment 5 Ben Marzinski 2005-02-04 21:35:10 UTC
It seems that the sequence number of the last dump entry hasn't
changed after I reboot the machine. However, once I try to remount
and replay the journal, the sequence number has changed. So it seems
like when head_off and dump_off are the same, the dump entry gets
messed up on journal replay... or something.

Comment 7 Ben Marzinski 2005-03-03 18:11:55 UTC
OK. It seems that this problem is largely caused by semaphore unfairness.
gfs_log_dump() tries to lock the semaphore, but another process currently has
it locked. The other process then unlocks and relocks the semaphore almost 400
times while writing to the log, so the log fills up before gfs_log_dump() ever
gets the lock. If I change the sd_log_lock semaphore to an rw_semaphore, which
is strictly fair, and just make everyone acquire write locks, the problem goes away.

So that's what I did.

Comment 8 Dean Jansa 2005-03-08 16:06:57 UTC
I hit this again with the GFS-kernel-smp-2.6.9-25.1 RPM.

You can grab a test from sistina-test/gfs/src/bz147682

./bz147682 -u <user> -m <mountpoint> -n <node1,node2,...,nodeN> [-w remote_pan_run_width] [-l <STS_LROOT>] [-r <STS_RROOT>]

Comment 9 Dean Jansa 2005-03-08 16:08:32 UTC
I should note that you have to have data journaling turned on in the fs
before you run the test case in comment #8.

Comment 10 Dean Jansa 2005-03-08 16:13:22 UTC
Oops, sorry Ben, ignore comment #8; wrong bug number...

This is the case I was looking for:

On the node, cd into the fs (with data journaling turned on) and run:

for n in 0 1 2 3 4 5 6 7 8 9
do
    iogen -i 300s -f sync -m random -s write,writev -t 400000b -T 400000b 400000b:rwransynclarge${n} | doio -av &
done



Comment 11 Ben Marzinski 2005-03-21 21:03:53 UTC
OK, I can see this too, but I can make it go away simply by lowering the gfs
parameter max_atomic_write with gfs_tool settune. Lowering this parameter from
4 MB to 1 MB made me unable to reproduce the problem. If someone sees this error
in the field, it will simply cause one machine to panic, and we can tell them to
lower the value of this parameter until they stop seeing the problem.
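
For reference, a sketch of that workaround, assuming max_atomic_write is given to gfs_tool settune in bytes (4194304 for the 4 MB default, 1048576 for 1 MB) and using <mountpoint> as a placeholder:

  # list the current tunables, including max_atomic_write
  gfs_tool gettune <mountpoint>
  # lower max_atomic_write from the assumed 4 MB default to 1 MB
  gfs_tool settune <mountpoint> max_atomic_write 1048576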

Since Ken is going to rewrite the log code anyway, this seems to me like a
perfectly acceptable way to deal with the problem.

This bug should probably get resolved WONTFIX, or DEFERRED, or something else
useful, but I'm not quite sure what.

Comment 13 Lon Hohberger 2013-09-23 15:31:47 UTC
The Red Hat Cluster Suite product is past end-of-life; closing.

