Bug 199984

Summary: gfs_statfs performance issue
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Status: CLOSED ERRATA
Reporter: Wendy Cheng <nobody+wcheng>
Assignee: Wendy Cheng <nobody+wcheng>
QA Contact: GFS Bugs <gfs-bugs>
CC: cwyse, ddomingo, lenny, richardl, teigland
Fixed In Version: RHBA-2007-0142
Doc Type: Bug Fix
Last Closed: 2007-05-10 21:14:01 UTC
Bug Blocks: 207487, 235840, 239985

Description Wendy Cheng 2006-07-24 19:22:25 UTC
Description of problem:
There were reports of long "df" execution time with larger fs size:

http://post-office.corp.redhat.com/archives/cluster-list/2006-July/msg00023.html

With several sysrq-t attempts, the "df" thread backtrace was never caught and
it started to look like the command itself was hogging the CPUs. We switched 
to sysrq-w and found the following:

<|> [<02207d2b>] showacpu+0x27/0x33 
<|> [<02116c67>] smp_call_function_interrupt+0x3a/0x77
<|> [<f8d2007b>] user_eo_set+0x1c/0x52 [gfs] 
<|> [<f8d461cc>] stat_gfs_async+0xdf/0x187 [gfs]
<|> [<f8d4633b>] gfs_stat_gfs+0x27/0x4e [gfs]
<|> [<f8d3e44a>] gfs_statfs+0x26/0xc7 [gfs]
<|> [<02157c69>] vfs_statfs+0x41/0x59
<|> [<02157d5f>] vfs_statfs64+0xe/0x28
<|> [<02165fe5>] __user_walk+0x4a/0x51
<|> [<02157e32>] sys_statfs64+0x49/0x7f
<|> [<0211e8b0>] __wake_up+0x29/0x3c
<|> [<021f7dd1>] tty_ldisc_deref+0x50/0x5f
<|> [<021f8b70>] tty_write+0x258/0x262
<|> [<021fd134>] write_chan+0x0/0x1c5
<|> [<02159af3>] vfs_write+0xda/0xe2
<|> [<02159b99>] sys_write+0x3c/0x62

Logging this bugzilla. Whenever we have a chance, we should look into
improving the gfs_statfs implementation - it has been a known GFS performance
issue.

Version-Release number of selected component (if applicable):
RHEL 4 U3


Comment 1 Lenny Maiorani 2006-07-24 19:56:15 UTC
Looks very similar to bz #198740

Comment 9 Wendy Cheng 2006-07-25 18:45:54 UTC
The performance bottleneck lies in the lock-acquiring pass over the resource 
group (rg) list, implemented in the "stat_gfs_async()" routine. It requires 
one lock per rg list entry. If the fs is large, the entry count can be huge 
and the walk-through takes a while to complete. To fix this issue:
                                                                                
1) WorkAround
                                                                                
There is a gfs_tool "statfs_slots" tunable that can be set. The current 
default is 64, but it can be altered; for example, to increase the slot count 
to 128:
                                                                                
shell> gfs_tool settune /mnt/gfs_mount_point statfs_slots 128
                                                                                
This number can be increased to get better performance on larger fs sizes. 
It is used as a "batch" count for gfs_statfs to acquire rg locks; the default 
is to obtain 64 locks (asynchronously) at a time. Experiments done in comment 
#6 showed that doubling this number (to 128) cut the "df" wall time in half 
(see the sketch below). The tradeoff is increased lock traffic - setting it 
too high could hurt performance instead. A short document about the general 
"settune" tunables can be found at:
                                                                                
http://people.redhat.com/wcheng/Patches/GFS/gfs_tune.txt
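
For illustration, here is a minimal user-space sketch of the batching idea
(the rg count and the lock/wait routines below are made-up stand-ins, not GFS
source):

#include <stdio.h>

#define NUM_RGS 16384                 /* resource groups in a large fs (made up) */

static void async_lock_rg(int rg)  { (void)rg; /* would queue one rg lock request */ }
static void wait_for_batch(void)   { /* would block until the whole batch completes */ }

static int stat_all_rgs(int slots)
{
    int rg = 0, rounds = 0;

    while (rg < NUM_RGS) {
        int end = rg + slots < NUM_RGS ? rg + slots : NUM_RGS;

        for (; rg < end; rg++)
            async_lock_rg(rg);        /* up to "slots" requests in flight at once */
        wait_for_batch();
        rounds++;
    }
    return rounds;
}

int main(void)
{
    /* doubling statfs_slots halves the number of wait rounds, which is roughly
       why the "df" wall time dropped by half in the comment #6 experiments */
    printf("slots=64  -> %d rounds\n", stat_all_rgs(64));
    printf("slots=128 -> %d rounds\n", stat_all_rgs(128));
    return 0;
}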
                                                                                
2) Short term (code) fix
                                                                                
Will do a quick experimental fix that automatically sets statfs_slots based 
on the length of the resource group list (but still keeps it tunable). If it 
proves helpful, we'll put it into RHEL 4.5. Plan to get the test RPMs done by 
the end of this week.
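
A hypothetical sketch of what "auto-set from the rg list length" could look
like (the scaling factor and cap are made up; the actual heuristic in the
RHEL 4.5 fix may differ):

static unsigned int auto_statfs_slots(unsigned int num_rgs)
{
	unsigned int slots = num_rgs / 32;   /* made-up scaling factor */

	if (slots < 64)
		slots = 64;                  /* keep the old default as a floor */
	if (slots > 1024)
		slots = 1024;                /* cap it to bound lock traffic */
	return slots;                        /* still overridable via gfs_tool settune */
}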
                                                                                
3) Real fix
                                                                                
There are several ways to address this issue. One that comes to mind is 
hierarchical locking. Another is approximation: have each node keep its own 
local statistics and sync them cluster-wide from time to time. GFS2 has chosen 
the second approach. We could backport the GFS2 implementation to GFS1 if 
there is strong demand.
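
For reference, a minimal user-space sketch of the approximation idea (the
struct layout and the numbers are made up; this is not the GFS2 code):

#include <stdio.h>

struct statfs_change {            /* what each node accumulates locally */
    long long total;              /* delta of total blocks (usually 0)  */
    long long free;               /* delta of free blocks               */
    long long dinodes;            /* delta of allocated dinodes         */
};

struct statfs_master {            /* cluster-wide copy, updated on sync */
    long long total, free, dinodes;
};

/* Periodic sync: fold the local delta into the master copy and reset it. */
static void statfs_sync(struct statfs_master *m, struct statfs_change *lc)
{
    m->total   += lc->total;
    m->free    += lc->free;
    m->dinodes += lc->dinodes;
    lc->total = lc->free = lc->dinodes = 0;
}

int main(void)
{
    struct statfs_master master = { 1000000, 800000, 5000 };
    struct statfs_change local  = { 0, -1200, 35 };  /* this node's recent activity */

    statfs_sync(&master, &local);
    /* df would read the (slightly stale) master copy instead of locking every rg */
    printf("free blocks ~ %lld, dinodes ~ %lld\n", master.free, master.dinodes);
    return 0;
}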

Comment 12 Wendy Cheng 2006-08-01 17:03:43 UTC
Richard, I would like to see their /etc/cluster/cluster.conf file. 

Comment 24 Wendy Cheng 2006-08-17 16:17:14 UTC
Richard said the problem is solved. The following is taken from the forwarded
mail - saved here for future technical reference:

-- start quote --

    What we will be doing in production, is we will have approximately 
13 SUN email servers sending data to a central repository via scp.  
These will essentially be gzipped tarballs upwards of a couple of GB a 
piece.  The idea behind using a shared GFS volume, is that these copies 
can be distributed across the nodes in the cluster (I believe we were 
thinking of starting with a 4-node cluster).  Obviously, part of what I 
am currently testing, now that we have our locking performance issues 
resolved, is how well this scales, being that we will be writing to a 
shared filesystem.

    Finally, we did end up coming to a resolution for our performance 
issue.  It seems that for some reason, our ethernet interfaces on these 
servers were auto-negotiating as half-duplex, and this was contributing 
to the delay.  I have disabled auto-negotiation and set the interfaces 
to full duplex.  This, coupled with the dlm drop_count parameter setting 
and the statfs_slots setting for gfs has eliminated our noticeable 
delay.  I'm not sure why this simple solution was missed earlier.

-- end quote

I've changed gfs_statfs() to use vmalloc() instead of kmalloc(). This should
allow admins to specify the statfs_slots number without memory constraints.
Further gfs_statfs performance improvements will be done later when we get
some free development cycles.
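
For reference, a sketch of what the intended change looks like (not the
actual CVS diff; the surrounding gfs_statfs code and struct gfs_holder are
assumed from the GFS headers):

#include <linux/vmalloc.h>   /* needed for vmalloc()/vfree() */

/* before: physically contiguous allocation; large statfs_slots values
   can fail or stress the page allocator */
gha = kmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);
kfree(gha);

/* after: virtually contiguous is enough since the holders are walked
   one entry at a time; note vmalloc() takes only a size, no GFP flags */
gha = vmalloc(slots * sizeof(struct gfs_holder));
vfree(gha);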

Comment 25 David Teigland 2006-08-23 15:53:19 UTC
I think the change in gfs for this (vmalloc/vfree) may be a problem.
When I mount a 1.5 TB file system and do df, I get the following.
When I go back to kmalloc/kfree the problem goes away.

Unable to handle kernel paging request at 0000000000133000 RIP:
<ffffffffa02b08e6>{:gfs:gfs_stat_gfs+171}
PML4 13e7ec067 PGD 7e555067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: gfs(U) lock_nolock(U) lock_dlm(U) lock_harness(U)
dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core
sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd ehci_hcd tg3
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata qla2400
qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase
sd_mod scsi_mod
Pid: 6443, comm: df Tainted: GF     2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02b08e6>] <ffffffffa02b08e6>{:gfs:gfs_stat_gfs+171}
RSP: 0018:0000010136975c28  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000001600 RCX: 0000000000001600
RDX: 000001000000e701 RSI: 0000010037e1d000 RDI: 0000000000133000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000246
R10: 0000000000000246 R11: 000001007b4876b0 R12: 0000000000000000
R13: 0000000000133000 R14: 000001013de55400 R15: ffffff000024a000
FS:  0000002a9557db00(0000) GS:ffffffff804d7b00(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000133000 CR3: 0000000000101000 CR4: 00000000000006e0
Process df (pid: 6443, threadinfo 0000010136974000, task 000001013cb43030)
Stack: 0000010136975cf8 00000101385e8d68 0000000100000040 0000010136975d48
       0000010136975cc8 000501013f4c7001 000000030024ed15 0000000000000000
       0000012e0000014a 0000010136975d08
Call Trace:<ffffffffa02a8aae>{:gfs:gfs_statfs+39}
<ffffffff80175f26>{vfs_statfs+82}
       <ffffffff80175f55>{vfs_statfs_native+20}
<ffffffff80175fef>{sys_statfs+70}
       <ffffffff80229b05>{write_chan+0} <ffffffff80177c89>{vfs_write+248}
       <ffffffff80177d48>{sys_write+69}
<ffffffff801101c6>{system_call+126}


Code: f3 aa c7 44 24 0c 00 00 00 00 8b 44 24 10 39 44 24 0c c7 44
RIP <ffffffffa02b08e6>{:gfs:gfs_stat_gfs+171} RSP <0000010136975c28>
CR2: 0000000000133000
 <0>Kernel panic - not syncing: Oops


Comment 26 Wendy Cheng 2006-08-23 16:16:15 UTC
The gfs_holder array is accessed one entry at a time, so there is really no 
need for contiguous memory for this huge array. However, I suspect the memory 
within the gfs_holder structure itself needs to be contiguous, so this 
statement would be the culprit:

gha = vmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);

Could I log into your machine (to get the exact line of the panic RIP)?
I don't have a 64-bit machine handy to do this.

Comment 27 David Teigland 2006-09-14 20:30:04 UTC
Sorry, I wasn't on the cc list so I didn't get the last message.
This still happens every time I do df on my machine, let me know
when you'd like to use it and I'll stop my other testing on it.


Comment 31 Wendy Cheng 2006-09-15 21:11:33 UTC
I thought I recreated dct's crash - but I'm not really sure.

What happened was that after an install-reinstall mess with my new x86_64
node, I got lazy and hand-crafted the change, and accidentally replaced
kmalloc with vmalloc as follows:

from gha = kmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);
to   gha = vmalloc(slots * sizeof(struct gfs_holder), GFP_KERNEL);

The faulty code compiled and ran fine on an i686 machine but crashed on
x86_64. The code in CVS is correct though.

dct will double check later this week.

However, the interesting thing is why i686 let me get away with this while
x86_64 consistently generates stack memory corruption.


Comment 34 Wendy Cheng 2006-09-20 16:01:26 UTC
Apparently memset on x86_64 expects contiguous memory - not sure whether this
is a feature or a bug though.

In any case, I'll do some (gfs) code adjustments to work around this problem. 

Comment 35 Wendy Cheng 2006-09-20 21:19:11 UTC
Turns out to be my bug - I hadn't included <linux/vmalloc.h>. Somehow the 
compiler let me get away with it and the gfs module could still be loaded. 
At run time, the return value of vmalloc() is treated as an "int" and the 
higher 4 bytes of the address are chopped off, leaving a 32-bit address 
instead of a full 64-bit address. This explains why i686 doesn't see this 
issue. Some other changes I checked in in the past have the same problem; 
all of these need to be cleaned up.

Dave Jones said building with -Werror-implicit-function-declaration would
have caught this. The RHEL5 kernel has this in its default CFLAGS.
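
For future reference, a small user-space illustration of the same failure
mode (hypothetical file names, not kernel code):

/* alloc.c */
#include <stdlib.h>
void *get_buffer(void)
{
    return malloc(64);
}

/* main.c - no prototype for get_buffer(), like the missing <linux/vmalloc.h> */
#include <stdio.h>
int main(void)
{
    void *p = get_buffer();   /* implicit declaration: assumed to return int,
                                 so only the low 32 bits of the pointer survive */
    printf("pointer as seen by the caller: %p\n", p);
    return 0;
}

/* gcc -std=gnu89 alloc.c main.c   (compiles with warnings on x86_64 and prints
   a truncated address; -Werror-implicit-function-declaration turns the warning
   into a hard error at build time instead) */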



Comment 38 Red Hat Bugzilla 2007-05-10 21:14:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0142.html


Comment 39 Don Domingo 2007-05-13 23:57:29 UTC
making this bug block tracker bug (BZ#239985)