Bug 146672
Summary: | Node panic doing a writev() | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Dean Jansa <djansa>
Component: | gfs | Assignee: | Ben Marzinski <bmarzins>
Status: | CLOSED DEFERRED | QA Contact: |
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | bmarzins, bstevens, kpreslan
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-09-23 15:31:47 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 144795 | |
Description (Dean Jansa, 2005-01-31 17:37:22 UTC)
FWIW -- before the IO was run, data journaling was turned on via:

```
gfs_tool setflag inherit_jdata /mnt/vedder
```

Able to hit it again with the load below, started on each node; tank-01 trips over the assert within moments:

```
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge0 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge1 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge2 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge3 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge4 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge5 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge6 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge7 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge8 | doio -av
iogen -f sync -m random -s write,writev -t 100000b -T 400000b 400000b:rwransynclarge9 | doio -av
```

I have been able to reproduce this fairly reliably by doing the following:

- make a gfs file system with the smallest journal size possible
- mount it on one node
- turn on data journaling
- run the revolver load set to HEAVY

Now, I'm going to putz with tuning things and see if I can make it go away... or at least be unreproducible. I think that in the long term, Ken is working on stuff that will avoid this problem. In the short term, making it hard to hit might be the best option.

Apparently, the ability to do this has always been in the journaling code, and fixing it may take some work.

It seems that when you hit this bug with head_off and dump_off being equal, it ends up much worse. When I tried to remount the filesystem after panicking on the assert with head_off and dump_off both equalling 13558176, I hit the following assert:

```
GFS: Trying to join cluster "lock_dlm", "cypher:gfs1"
GFS: fsid=cypher:gfs1.0: Joined cluster. Now mounting FS...
GFS: fsid=cypher:gfs1.0: jid=0: Trying to acquire journal lock...
GFS: fsid=cypher:gfs1.0: jid=0: Looking at journal...
GFS: fsid=cypher:gfs1.0: jid=0: Acquiring the transaction lock...
GFS: fsid=cypher:gfs1.0: jid=0: Replaying journal...
GFS: fsid=cypher:gfs1.0: jid=0: Replayed 1368 of 1390 blocks
GFS: fsid=cypher:gfs1.0: jid=0: replays = 1368, skips = 0, sames = 22
GFS: fsid=cypher:gfs1.0: jid=0: Journal replayed in 1s
GFS: fsid=cypher:gfs1.0: jid=0: Done
GFS: fsid=cypher:gfs1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=cypher:gfs1.0: jid=1: Looking at journal...
GFS: fsid=cypher:gfs1.0: jid=1: Done
GFS: fsid=cypher:gfs1.0: jid=2: Trying to acquire journal lock...
GFS: fsid=cypher:gfs1.0: jid=2: Looking at journal...
GFS: fsid=cypher:gfs1.0: jid=2: Done
GFS: fsid=cypher:gfs1.0: Scanning for log elements...
[<e02cee56>] gfs_assert_i+0x48/0x69 [gfs]
[<e02ba464>] unlinked_after_scan+0x0/0x4b [gfs]
[<e02ba492>] unlinked_after_scan+0x2e/0x4b [gfs]
[<e02c9c3d>] gfs_recover_dump+0x104/0x156 [gfs]
[<e02cd447>] gfs_make_fs_rw+0xc3/0x11b [gfs]
[<e02c2e33>] fill_super+0x9c7/0xf4f [gfs]
[<c013eaea>] pagevec_lookup+0x17/0x1d
[<c01565d6>] set_blocksize+0x77/0x7c
[<e02c34ea>] gfs_get_sb+0x114/0x159 [gfs]
[<c0167c74>] alloc_vfsmnt+0x78/0x9f
[<c0156263>] do_kern_mount+0x8a/0x13d
[<c0168c2b>] do_new_mount+0x61/0x90
[<c0169278>] do_mount+0x178/0x190
[<c0139ca0>] __alloc_pages+0xb4/0x298
[<c01695cf>] sys_mount+0x91/0x108
[<c029d6c7>] syscall_call+0x7/0xb
Kernel panic - not syncing: GFS: fsid=cypher:gfs1.0: assertion "test_bit(SDF_FOUND_UL_DUMP, &sdp->sd_flags)" failed
GFS: fsid=cypher:gfs1.0: function = unlinked_after_scan
GFS: fsid=cypher:gfs1.0: file = /home/devel/cluster/gfs-kernel/src/gfs/lops.c, line = 1168
GFS: fsid=cypher:gfs1.0: time = 1107361448
```

This looks to me like, even though you assert to make sure that you don't overwrite the last dump entry, that entry appears to get lost. When I try to remount, the entry that used to be the last dump entry has a different sequence number than it did when I asserted. The sequence number of the last dump entry hasn't changed after I reboot the machine; however, once I try to remount and replay the journal, the sequence number is changed. So it seems like when head_off and dump_off are the same, the dump entry gets messed up on journal replay... or something.

O.k. It seems that this problem is largely caused by semaphore unfairness. gfs_log_dump() tries to lock the semaphore, but another process currently has it locked. Then the other process unlocks and relocks the semaphore almost 400 times, writing stuff to the log. The log gets full before gfs_log_dump() ever gets the lock. If I change the sd_log_lock semaphore to a rw_semaphore, which is totally fair, and just make everyone acquire write locks, the problem goes away. So that's what I did.
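As a rough illustration of that change, here is a minimal sketch, not the actual GFS patch: only sd_log_lock, gfs_log_dump(), and the rw_semaphore API come from the comment above; the struct and helper names are invented here for illustration.

```c
/*
 * Sketch of the fairness fix described above (illustrative only,
 * not the real GFS code).  The log lock becomes a rw_semaphore and
 * every user takes it as a writer, so waiters queue in order and a
 * pending log dump can no longer be starved by other log writers
 * that repeatedly drop and re-take the lock.
 */
#include <linux/rwsem.h>

struct log_state {			/* stand-in for the GFS sb private data */
	struct rw_semaphore sd_log_lock;	/* was: struct semaphore */
};

static void log_state_init(struct log_state *ls)
{
	init_rwsem(&ls->sd_log_lock);
}

static void log_append(struct log_state *ls)	/* stand-in for a log writer */
{
	down_write(&ls->sd_log_lock);
	/* ... write blocks into the on-disk log ... */
	up_write(&ls->sd_log_lock);
}

static void log_dump(struct log_state *ls)	/* stand-in for gfs_log_dump() */
{
	down_write(&ls->sd_log_lock);
	/* ... write the dump entry before the log fills ... */
	up_write(&ls->sd_log_lock);
}
```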
I hit this again with the GFS-kernel-smp-2.6.9-25.1 rpm. You can grab a test from sistina-test/gfs/src/bz147682:

```
./bz147682 -u <user> -m <mountpoint> -n <node1,node2,...,nodeN> [-w remote_pan_run_width] [-l <STS_LROOT>] [-r <STS_RROOT>]
```

I should note, you have to have data journaling on in the fs before you run the test case in comment #8.

Oops, sorry Ben, ignore comment #8, wrong bug number... This is the case I was looking for. On the node, cd into the fs, with data journaling turned on, and run (a rough C equivalent of this load is sketched at the end of this report):

```
for n in 0 1 2 3 4 5 6 7 8 9
do
	iogen -i 300s -f sync -m random -s write,writev -t 400000b -T 400000b 400000b:rwransynclarge${n} | doio -av &
done
```

O.k. I can see this too, but I can make it go away simply by lowering the gfs parameter max_atomic_write with gfs_tool settune. Lowering this parameter from 4MB to 1MB made me unable to reproduce the problem. If someone sees this error in the field, it will simply cause one machine to panic, and we can tell them to lower the value of this parameter until they stop seeing the problem. Since Ken is going to rewrite the log code anyway, this seems to me like a perfectly acceptable way to deal with the problem. This bug should probably get resolved WONTFIX, or DEFERRED, or something useful, but I'm not quite sure what.

The Red Hat Cluster Suite product is past end-of-life; closing.
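For readers without the sistina-test iogen/doio tools, the following is a rough, self-contained sketch of the kind of synchronous writev() load described in this report. The path, sizes, and iovec layout are arbitrary choices for illustration; the real iogen/doio pipeline additionally randomizes offsets and transfer sizes, mixes write() with writev(), and verifies the data it wrote.

```c
/*
 * Rough illustration of a synchronous writev() load, in the spirit of
 * the iogen | doio pipelines above.  The file path, offsets, and
 * buffer sizes are arbitrary; this is not a replacement for doio.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical target on the GFS mount with inherit_jdata set */
	int fd = open("/mnt/vedder/rwransynclarge0",
		      O_RDWR | O_CREAT | O_SYNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	static char a[50000], b[50000];
	memset(a, 'A', sizeof(a));
	memset(b, 'B', sizeof(b));

	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = sizeof(a) },
		{ .iov_base = b, .iov_len = sizeof(b) },
	};

	for (int i = 0; i < 1000; i++) {
		/* pick a pseudo-random offset within a ~400000-byte file */
		off_t off = (off_t)(rand() % 8) * 50000;
		if (lseek(fd, off, SEEK_SET) < 0 || writev(fd, iov, 2) < 0) {
			perror("writev");
			break;
		}
	}

	close(fd);
	return 0;
}
```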