Description of problem:

There were reports of long "df" execution time with larger fs sizes:

http://post-office.corp.redhat.com/archives/cluster-list/2006-July/msg00023.html

With several sysrq-t attempts, the "df" thread backtrace was never caught,
and it started to look like the command itself was hogging the CPUs. We
switched to sysrq-w and found the following:

 [<02207d2b>] showacpu+0x27/0x33
 [<02116c67>] smp_call_function_interrupt+0x3a/0x77
 [<f8d2007b>] user_eo_set+0x1c/0x52 [gfs]
 [<f8d461cc>] stat_gfs_async+0xdf/0x187 [gfs]
 [<f8d4633b>] gfs_stat_gfs+0x27/0x4e [gfs]
 [<f8d3e44a>] gfs_statfs+0x26/0xc7 [gfs]
 [<02157c69>] vfs_statfs+0x41/0x59
 [<02157d5f>] vfs_statfs64+0xe/0x28
 [<02165fe5>] __user_walk+0x4a/0x51
 [<02157e32>] sys_statfs64+0x49/0x7f
 [<0211e8b0>] __wake_up+0x29/0x3c
 [<021f7dd1>] tty_ldisc_deref+0x50/0x5f
 [<021f8b70>] tty_write+0x258/0x262
 [<021fd134>] write_chan+0x0/0x1c5
 [<02159af3>] vfs_write+0xda/0xe2
 [<02159b99>] sys_write+0x3c/0x62

Logging this bugzilla. Whenever we have a chance, we should look into
improving the gfs_statfs implementation - it has been a known GFS
performance issue.

Version-Release number of selected component (if applicable):
RHEL 4 U3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Looks very similar to bz #198740
The performance bottleneck lies in the lock-acquiring pass over the resource
group list, implemented in the stat_gfs_async() routine. It requires one lock
per rg-list entry. If the fs size is large, the entry count can be huge, and
the walk-through takes a while to complete. To fix this issue:

1) Workaround

There is a gfs_tool "statfs_slots" tunable that can be set. The current
default is 64 but can be altered (for example, to increase the slot count to
128):

shell> gfs_tool settune /mnt/gfs_mount_point statfs_slots 128

This number can be increased to get better performance for larger fs sizes.
It is used as a "batch" count for gfs_statfs to acquire rg locks; by default
64 locks are obtained (asynchronously) at a time (a rough sketch of this
batching appears at the end of this comment). Experiments done in comment #6
showed that doubling this number (to 128) successfully cut the "df" wall
time in half. The tradeoff is increased lock traffic - setting it too high
could hurt performance instead. A short document about the general "settune"
tunables can be found at:

http://people.redhat.com/wcheng/Patches/GFS/gfs_tune.txt

2) Short-term (code) fix

Will do a quick experimental fix that automatically sets statfs_slots based
on the length of the resource group list (but still keeps it tunable). If it
proves helpful, we'll put it into RHEL 4.5. Plan to get the test RPMs done
by the end of this week.

3) Real fix

There are several ways to address this issue. One that comes to mind is
hierarchical locking. Another is approximation: have each node keep its
local statistics around and sync them cluster-wide from time to time. GFS2
has chosen the second approach. We could backport the GFS2 implementation to
GFS1 if there is strong demand.
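For reference, the batching works roughly as below. This is NOT the actual
stat_gfs_async() code - the rg-list walk and the helper names
(acquire_rg_lock_async, wait_rg_lock, etc.) are hypothetical stand-ins for
the real glock calls, and statfs_sums is a made-up accumulator type:

static int stat_gfs_sketch(struct gfs_sbd *sdp, struct statfs_sums *total)
{
        unsigned int slots = get_statfs_slots(sdp);     /* default 64 */
        struct gfs_rgrpd *rgd = first_rg(sdp);
        struct gfs_holder *gha;
        unsigned int n, x;

        gha = kmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);
        if (!gha)
                return -ENOMEM;

        while (rgd) {
                /* Fire off up to "slots" lock requests without waiting. */
                for (n = 0; rgd && n < slots; n++, rgd = next_rg(rgd))
                        acquire_rg_lock_async(rgd, &gha[n]);

                /* Wait for the batch, then fold in each rg's statistics. */
                for (x = 0; x < n; x++) {
                        wait_rg_lock(&gha[x]);
                        accumulate_rg_stats(&gha[x], total);
                        release_rg_lock(&gha[x]);
                }
        }

        kfree(gha);
        return 0;
}

A larger "slots" value means more lock requests in flight per batch (fewer
round trips over the whole rg list), at the cost of more lock traffic.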
Richard, I would like to see their /etc/cluster/cluster.conf file.
Richard said the problem is solved. The following is taken from the
forwarded mail - saved here for future technical reference:

-- start quote --

What we will be doing in production, is we will have approximately 13 SUN
email servers sending data to a central repository via scp. These will
essentially be gzipped tarballs upwards of a couple of GB a piece. The idea
behind using a shared GFS volume, is that these copies can be distributed
across the nodes in the cluster (I believe we were thinking of starting with
a 4-node cluster). Obviously, part of what I am currently testing, now that
we have our locking performance issues resolved, is how well this scales,
being that we will be writing to a shared filesystem.

Finally, we did end up coming to a resolution for our performance issue. It
seems that for some reason, our ethernet interfaces on these servers were
auto-negotiating as half-duplex, and this was contributing to the delay. I
have disabled auto-negotiation and set the interfaces to full duplex. This,
coupled with the dlm drop_count parameter setting and the statfs_slots
setting for gfs has eliminated our noticeable delay. I'm not sure why this
simple solution was missed earlier.

-- end quote --

I've changed gfs_statfs() to use vmalloc() instead of kmalloc(). This should
allow admins to specify the statfs_slots number without memory constraints
(a sketch of the change follows). Further gfs_statfs performance
improvements will be done later, when we get some free development cycles.
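The shape of the change is roughly the following - an assumed sketch only;
the real allocation site inside gfs_stat_gfs() may differ in detail, and
alloc_statfs_holders() is a hypothetical helper name:

#include <linux/string.h>       /* memset() */
#include <linux/vmalloc.h>      /* vmalloc()/vfree() */

static struct gfs_holder *alloc_statfs_holders(unsigned int slots)
{
        struct gfs_holder *gha;

        /* vmalloc() only needs virtually contiguous pages, so a large
         * statfs_slots value no longer has to fit into one physically
         * contiguous kmalloc() chunk. */
        gha = vmalloc(slots * sizeof(struct gfs_holder));
        if (gha)
                memset(gha, 0, slots * sizeof(struct gfs_holder));
        return gha;     /* caller must release with vfree(), not kfree() */
}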
I think the change in gfs for this (vmalloc/vfree) may be a problem. When I
mount a 1.5 TB file system and do df, I get the following. When I go back to
kmalloc/kfree the problem goes away.

Unable to handle kernel paging request at 0000000000133000
RIP: <ffffffffa02b08e6>{:gfs:gfs_stat_gfs+171}
PML4 13e7ec067 PGD 7e555067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: gfs(U) lock_nolock(U) lock_dlm(U) lock_harness(U) dlm(U)
cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds
yenta_socket pcmcia_core button battery ac ohci_hcd ehci_hcd tg3 dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata qla2400 qla2xxx
scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod
scsi_mod
Pid: 6443, comm: df Tainted: GF 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02b08e6>] <ffffffffa02b08e6>{:gfs:gfs_stat_gfs+171}
RSP: 0018:0000010136975c28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000001600 RCX: 0000000000001600
RDX: 000001000000e701 RSI: 0000010037e1d000 RDI: 0000000000133000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000246
R10: 0000000000000246 R11: 000001007b4876b0 R12: 0000000000000000
R13: 0000000000133000 R14: 000001013de55400 R15: ffffff000024a000
FS: 0000002a9557db00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000133000 CR3: 0000000000101000 CR4: 00000000000006e0
Process df (pid: 6443, threadinfo 0000010136974000, task 000001013cb43030)
Stack: 0000010136975cf8 00000101385e8d68 0000000100000040 0000010136975d48
       0000010136975cc8 000501013f4c7001 000000030024ed15 0000000000000000
       0000012e0000014a 0000010136975d08
Call Trace:
 <ffffffffa02a8aae>{:gfs:gfs_statfs+39}
 <ffffffff80175f26>{vfs_statfs+82}
 <ffffffff80175f55>{vfs_statfs_native+20}
 <ffffffff80175fef>{sys_statfs+70}
 <ffffffff80229b05>{write_chan+0}
 <ffffffff80177c89>{vfs_write+248}
 <ffffffff80177d48>{sys_write+69}
 <ffffffff801101c6>{system_call+126}
Code: f3 aa c7 44 24 0c 00 00 00 00 8b 44 24 10 39 44 24 0c c7 44
RIP <ffffffffa02b08e6>{:gfs:gfs_stat_gfs+171} RSP <0000010136975c28>
CR2: 0000000000133000
<0>Kernel panic - not syncing: Oops
The gfs_holder array is accessed one entry at a time, so there is really no
need for physically contiguous memory for this huge array. However, I
suspect the memory within the gfs_holder structure itself needs to be
contiguous, so this statement would be the culprit:

gha = vmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);

Could I log into your machine (to get the exact line of the panic RIP)? I
don't have a 64-bit machine handy to do this.
Sorry, I wasn't on the cc list so I didn't get the last message. This still happens every time I do df on my machine, let me know when you'd like to use it and I'll stop my other testing on it.
I thought I had recreated dct's crash - but I'm not really sure. What
happened was that after the install-reinstall mess with my new x86_64 node,
I got lazy and hand-crafted the change, and accidentally replaced kmalloc
with vmalloc as follows:

from
    gha = kmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);
to
    gha = vmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);

The faulty code compiled and ran fine on an i686 machine but crashed on
x86_64. The code in CVS is correct, though; dct will double-check later this
week. However, the interesting thing is why i686 let me get away with this
while x86_64 consistently generates stack memory corruption.
Apparently memset on x86_64 expects contiguous memory - not sure whether
this is a feature or a bug though. In any case, I'll do some (gfs) code
adjustments to work around this problem.
Turns out to be my bug - I hadn't included <linux/vmalloc.h>. Somehow the
compiler let me get away with it, and the gfs module could still be loaded.
At run time, the return value of vmalloc() is treated as an "int", so the
higher 4 bytes of the address are chopped off. It ends up as a 32-bit
address instead of a full 64-bit address. This explains why i686 doesn't see
this issue. Some other changes I checked in in the past have this problem
too; all of these need to get cleaned up.

Dave Jones said building with -Werror-implicit-function-declaration would
have caught this. The RHEL5 kernel has this in its default CFLAGS.
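For the record, this failure mode is easy to reproduce in userspace. A
hypothetical two-file illustration (not the GFS code):

/* alloc.c -- compiled separately, so no prototype is visible in demo.c */
#include <stdlib.h>

char *get_buffer(unsigned long size)
{
        return malloc(size);
}

/* demo.c -- note the missing declaration of get_buffer() */
int main(void)
{
        /* With no prototype in scope, C89 rules make the compiler
         * assume get_buffer() returns int, so on x86_64 the upper 32
         * bits of the returned pointer are silently discarded - the
         * same thing the missing <linux/vmalloc.h> did to vmalloc(). */
        char *p = (char *)get_buffer(4096);
        p[0] = 0;   /* faults whenever the real address needs >32 bits */
        return 0;
}

Building with "cc -std=gnu89 demo.c alloc.c" gets through with at most a
warning; adding -Werror-implicit-function-declaration (spelled
-Werror=implicit-function-declaration in newer gcc) turns it into a hard
error, which is what would have caught the GFS bug at compile time.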
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0142.html
making this bug block tracker bug (BZ#239985)