From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050921 Red Hat/1.0.7-1.4.1 Firefox/1.0.7

Description of problem:
SPECsfs fails to run correctly against a GFS filesystem, whereas it works fine on the same system with EXT3. The configuration uses TCP and NFS V3 over copper. A single filesystem is being accessed. Jumbo frames are not turned on.

Errors returned from the benchmark include:

...
sfs30: too many failed RPC calls - 661 good 29 bad
sfs30: too many failed RPC calls - 1262 good 59 bad
Child 0 - 50 failed ops
sfs30: too many failed RPC calls - 1955 good 102 bad
Child 0 - 100 failed ops
sfs30: too many failed RPC calls - 3152 good 192 bad
...

This result was generated with a single process on a single client at a very light workload (200 Ops/sec). As the workload increases, the percentage of failures increases to the point where the benchmark invalidates the test. The benchmark was able to generate 3600 Ops/sec without a single reported RPC issue.

Turning on some benchmark debug information yields the following output at an 800 Ops/sec request:

sfs30: READ failed
sfs30: OP 6 failed
sfs30: getattr call NFS error Stale NFS file handle on file 14708
sfs30: WRITE failed
sfs30: OP 8 failed
sfs30: getattr call NFS error Stale NFS file handle on file 30892
sfs30: getattr call NFS error Stale NFS file handle on file 5764
sfs30: READ failed
sfs30: OP 6 failed
sfs30: READ failed
sfs30: OP 6 failed
sfs30: READ failed
sfs30: OP 6 failed
sfs30: READ failed
sfs30: OP 6 failed
sfs30: getattr call NFS error Stale NFS file handle on file 35188
sfs30: setattr call NFS error Stale NFS file handle on file 26990
sfs30: getattr call NFS error Stale NFS file handle on file 9705
sfs30: setattr call NFS error Stale NFS file handle on file 23402
sfs30: getattr call NFS error Stale NFS file handle on file 32444
sfs30: getattr call NFS error Stale NFS file handle on file 31947
sfs30: WRITE failed
sfs30: OP 8 failed
sfs30: getattr call NFS error Stale NFS file handle on file 19242
sfs30: getattr call NFS error Stale NFS file handle on file 35397
sfs30: too many failed RPC calls - 419 good 31 bad
sfs30: WRITE failed
... tons more

Note the failed read/write ops in the results below.

Aggregate Test Parameters:
    Number of processes = 1
    Requested Load (NFS V3 operations/second) = 800
    Maximum number of outstanding biod writes = 2
    Maximum number of outstanding biod reads = 2
    Warm-up time (seconds) = 300
    Run time (seconds) = 300
    File Set = 312001 Files created for I/O operations
               31200 Files accessed for I/O operations
               6241 Files for non-I/O operations
               21 Symlinks
               10400 Directories
               Additional non-I/O files created as necessary

SFS Aggregate Results for 1 Client(s), Wed Jan 18 14:48:26 2006
NFS Protocol Version 3

------------------------------------------------------------------------------
NFS          Target  Actual  NFS      NFS    Mean      Std Dev   Std Error    Pcnt
Op           NFS     NFS     Op       Op     Response  Response  of Mean,95%  of
Type         Mix     Mix     Success  Error  Time      Time      Confidence   Total
             Pcnt    Pcnt    Count    Count  Msec/Op   Msec/Op   +- Msec/Op   Time
------------------------------------------------------------------------------
getattr      11%     11.2%   27846    0      0.35      1.25      0.01         4.9%
setattr      1%      1.0%    2576     0      1.36      2.33      0.06         1.8%
lookup       27%     27.3%   68124    0      0.29      1.08      0.01         10.2%
readlink     7%      7.1%    17664    0      0.19      0.05      0.00         1.7%
read         18%     17.8%   44440    850    0.93      2.57      0.01         21.0%
write        9%      8.8%    22078    280    1.47      11.83     0.05         16.5%
create       1%      1.0%    2452     0      2.05      10.98     0.13         2.5%
remove       1%      1.0%    2534     0      0.86      0.08      0.01         1.1%
readdir      2%      2.0%    5073     0      1.11      0.10      0.01         2.9%
fsstat       1%      1.0%    2541     0      0.93      0.34      0.02         1.2%
access       7%      7.1%    17655    0      0.19      0.31      0.01         1.7%
commit       5%      4.5%    11144    0      4.10      0.45      0.01         23.2%
fsinfo       1%      1.0%    2579     0      0.17      0.04      0.01         0.2%
readdirplus  9%      9.1%    22798    0      0.95      0.11      0.00         11.0%
------------------------------------------------------------------------------

--------------------------------------------------------
| SPEC SFS VERSION 3.0 AGGREGATE RESULTS SUMMARY       |
--------------------------------------------------------
NFS V3 THROUGHPUT: 834 Ops/Sec    AVG. RESPONSE TIME: 0.8 Msec/Op
TCP PROTOCOL
NFS MIXFILE: [ SFS default ]
AGGREGATE REQUESTED LOAD: 800 Ops/Sec
TOTAL NFS OPERATIONS: 249504    TEST TIME: 299 Sec
NUMBER OF SFS CLIENTS: 1
TOTAL FILE SET SIZE CREATED: 7921.9 MB
TOTAL FILE SET SIZE ACCESSED: 792.2 - 874.0 MB (100% to 110% of Base)

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. See Wendy Chang for setup. Her SPECsfs runs show the same errors.

Actual Results:
See Description

Expected Results:
No RPC errors or warnings

Additional info:
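To put a number on the "too many failed RPC calls" lines quoted in the description, the good/bad counts can be turned into a failure percentage. The following is only a rough sketch: the log lines are copied from this report, while the regular expression and the helper name failure_rates are illustrative and not part of SPECsfs.

```python
import re

# Lines copied from the benchmark output quoted in this report.
LOG = """\
sfs30: too many failed RPC calls - 661 good 29 bad
sfs30: too many failed RPC calls - 1262 good 59 bad
sfs30: too many failed RPC calls - 1955 good 102 bad
sfs30: too many failed RPC calls - 3152 good 192 bad
"""

# Illustrative pattern for the message format seen above.
PATTERN = re.compile(r"too many failed RPC calls - (\d+) good (\d+) bad")


def failure_rates(text):
    """Return a list of (good, bad, percent_bad) for each matching line."""
    rates = []
    for good, bad in PATTERN.findall(text):
        good, bad = int(good), int(bad)
        rates.append((good, bad, 100.0 * bad / (good + bad)))
    return rates


if __name__ == "__main__":
    for good, bad, pct in failure_rates(LOG):
        print(f"{good:5d} good {bad:4d} bad -> {pct:.1f}% failed")
```

For the four lines above, the failed share rises from one message to the next, from roughly 4% to nearly 6% of the calls counted.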
This was done on RHEL4 U3 Beta.
I was adding some trap code within the SPECsfs code itself (on the NFS client machine) over the weekend and noticed that the NFS/GFS server had frozen. I didn't expect this, so there was no crash server set up. I went to check the machines in the lab today (Monday); part of the panic log had scrolled off the screen, so I couldn't catch the exact place. This shouldn't happen, since SPECsfs is a stand-alone application that has no kernel piece, nor does it use any kernel NFS code. It interacts directly with the NFS/GFS server via network packets.

Part of the crash trace:

nfsd_getattr
write_inode
wh_kupdate
__sync_single_inode
write_inode_now_err
kthread
nfsd_setattr
nfsd_create_v3
nfsd_proc_create
nfsd_dispatch
svc_process
nfsd

Assertion "FALSE" failed
function = check_seg_usage
... gfs-kernel-2.6.9-47/smp/src/log.c line 590
time=1137891017
I ran through one round of SPECsfs today (first time ever) with "NFS Op Error Count" all zero, using the following combination:

1. Comment out (disable) the base kernel NFSD's readache code.
2. Byte swap the 6th word of the GFS file handle during decoding (gfs_decode_fh).
3. (NFS) Export the filesystem using the "fsid" option (bypassing the GFS diaper device).
4. Re-make the GFS filesystem (gfs_mkfs).
5. Fix a debug buffer overflow issue.

Need to see whether it is repeatable and further isolate the culprits.
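For readers unfamiliar with what "byte swap the 6th word" means mechanically, here is a small sketch. It is not the gfs_decode_fh change itself: the real GFS file handle layout is not given in this report, so the 32-bit word size, the 1-based reading of "6th" (index 5), the helper name swap_word, and the sample handle values are all assumptions made for illustration.

```python
import struct


def swap_word(handle: bytes, word_index: int = 5) -> bytes:
    """Return a copy of `handle` with one 32-bit word byte-reversed.

    Assumptions (not from the report): the handle is a sequence of
    32-bit words, and "6th word" is counted from 1, i.e. index 5.
    """
    if len(handle) % 4:
        raise ValueError("handle length must be a multiple of 4 bytes")
    offset = word_index * 4
    if offset + 4 > len(handle):
        raise ValueError("handle is too short for that word index")
    word = handle[offset:offset + 4]
    # Reversing the 4 bytes converts the word between big- and little-endian.
    return handle[:offset] + word[::-1] + handle[offset + 4:]


# Hypothetical 8-word handle; the values are made up for illustration.
handle = struct.pack(">8I", 1, 2, 3, 4, 5, 0x11223344, 7, 8)
swapped = swap_word(handle)
words = struct.unpack(">8I", swapped)
```

Only the selected word changes; applying the swap twice gives back the original bytes, which makes it easy to check whether a handle has been swapped once or not at all.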
Third round works without error. So let's ship it!
Thanks Wendy - Good news - the SPECsfs workloads in x86 mode run to completion with no transaction errors reported. Please log in to bigbaddell2.lab, which is updated to 2.6.9-32.ELsmp. Barry will baseline the hugemem against 2cpu, 4cpu and 8cpus on the SPECsfs testbed.
The fix has been built by Chris Feist and should be in the RHEL4 U3 GFS release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0234.html