I think some more information will be required. First, have they actually measured any sort of performance degradation or are they just basing their opinions on "nfsstat -c" counts? Is this a multi-user environment? This would mean multiple users with different uids. These read-only files, do they change much? How large are they? What mount options are used on the NFS clients? Let's start with this information.
Re: "are they just basing their opinions on "nfsstat -c" counts?" Yes, and therefore the conclusion that ACCESS calls are responsible isn't 100% correct, since raw counts don't reflect the relative time/resource usage of the various calls. I understand ACCESS is very fast and lightweight (please correct me if that's not true). Re: "Is this a multi-user environment?" Unknown, though by the sound of it, it's a single UID or very few unique UIDs. Re: "do they change much?" Unknown. The customer states that they use stat() (or something similar) to monitor the files for changes. Re: "How large are they?" The customer states several terabytes across all files; the exact number is unknown. Re: "What mount options are used on the NFS clients?" ro,tcp,nfsvers=3,bg,hard,intr,rsize=8192,wsize=8192,acregmin=3600,acregmax=7200,nocto

Additional data requested from the customer:
1) A 'sosreport -k rpm.rpmva=off' from the test system just prior to testing
2) A binary network packet capture (128-byte snaplen) of traffic between the test system and the filer
3) A matching run of the workload under strace -ff -q -r -T -tt -v -o /tmp/filename [-p pid or command]
4) A run time that provides representative data (e.g. 5-15 minutes)
5) A final 'sosreport -k rpm.rpmva=off' from the test system just after testing
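To illustrate the point about raw "nfsstat -c" counts: an operation can dominate the call count while contributing little to total time. A minimal sketch below; the call counts and per-call latencies are hypothetical examples I made up for illustration, not measurements from this case.

```python
# Hedged sketch: why raw nfsstat-style call counts can mislead.
# Counts and per-call latencies below are HYPOTHETICAL, for illustration only.
counts = {"access": 3_175_300, "read": 1_200_000, "getattr": 900_000}
latency_s = {"access": 0.0002, "read": 0.004, "getattr": 0.0003}  # assumed

total_calls = sum(counts.values())
total_time = sum(counts[op] * latency_s[op] for op in counts)

for op in counts:
    call_pct = 100 * counts[op] / total_calls
    time_pct = 100 * counts[op] * latency_s[op] / total_time
    print(f"{op}: {call_pct:.1f}% of calls, {time_pct:.1f}% of time")
```

With these assumed numbers, ACCESS is roughly 60% of the calls but only around 11% of the time, which is why weighting counts by per-call cost matters before blaming any one operation.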
Customer's responses to your questions:

All data updates are written by a single user on the master filer. The updates are then sent out to the child filers via NetApp SnapMirror; no data is ever updated on the child filers except through SnapMirror. All mounts on the child filers are read-only mounts, read by a single userid. On the child filers it is 100% impossible to change the data. The SnapMirrors are never broken, and unbroken SnapMirrors are read-only replicas of the WAFL filesystem.

We receive data updates ~10 times a day; the data is updated via SnapMirror from the master filer. Data files are not changed, but new data files are added to the data set. Then we change the high-water mark, and the system "notices" the newly added files. We work from the newest files to the oldest.

All nodes in the compute cluster use the same files. I don't have a good count right now, but it is in excess of 500k files. Our total data set size is about 600GB. On the NFS server we get about a 95% cache-hit ratio out of 32GB of read cache.
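As an aside, the stat()-based change monitoring the customer describes might look roughly like the sketch below (the function names are my own, not the customer's code). Note that with acregmin=3600,acregmax=7200 on the mount, the client's attribute cache can delay when such a check actually observes a change.

```python
# Hedged sketch of stat()-based change detection as the customer describes it.
# On an NFS client mounted with acregmin=3600,acregmax=7200, cached attributes
# may keep this looking stale for up to the attribute-cache timeout.
import os

def snapshot(path):
    """Record the attributes we compare against: mtime and size."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)

def changed(path, prev):
    """Return (changed?, new_snapshot) comparing current mtime/size to prev."""
    cur = snapshot(path)
    return cur != prev, cur
```

A caller would keep the last snapshot per file and re-check periodically; anything that bumps mtime or size registers as a change.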
Other data is being uploaded to Dropbox. I'll attach it privately once it's here.

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by cevich, issue 299688
Thanks for the information. Some more questions generated by the response in Comment #4 -- 500,000+ files: in how many directories? On the NFS clients, how often is each file accessed, whether via stat, open, read, or whatever? What is the hardware configuration of the NFS clients?
(In reply to comment #6) Customer can't get exact breakdown numbers, but it sounds like the files are fairly spread out: "there are a lot of directories for data organization. When we can get some perf time I'll try to get a find fired off to make you a data file index." I've asked the customer for a rough estimate, i.e. tens, hundreds, thousands, or more files per directory. Re: "On the NFS clients, how often is each file accessed?" "How often a file is accessed depends on a number of factors. Data that is most commonly used is always kept in memory and accessed via mmap(). Other data is read into mmaps and accessed from there. Eventually we reach a triggered condition where we munmap() the files and start over. We call this fill&flush; fill&flush is not used for all data. Of the data eligible for fill/flush, how often it is accessed will depend on how often we are using that data." Re: "What is the hardware configuration of the NFS clients?" "We are platformed on IBM X3650 servers with 16GB RAM and 8 cores total (2 sockets). The network fabric is Cisco 6500 with fabric-enabled blades and Sup720s. Each server has two gigE interfaces configured with the bonding driver in an active/passive configuration. Each child filer is an IBM re-badged NetApp 6080c with 32GB of read cache and 28 14k FCAL drives. 95% of disk operations complete from read cache. Each filer head is connected to the network fabric with a 6-member gigE port channel."
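For reference, the fill & flush pattern described above might be sketched as follows. This is a minimal illustration under my own assumptions: the class name is invented, and I've used "number of mapped files" as the flush trigger, whereas the customer's actual trigger condition is unknown.

```python
# Hedged sketch of a "fill & flush" mmap cache as described in the comment:
# map files read-only, serve reads from the maps, then unmap everything
# when a trigger fires and start over. Names and trigger are illustrative.
import mmap

class FillFlushCache:
    def __init__(self, max_mapped):
        self.max_mapped = max_mapped   # assumed trigger: count of live maps
        self.maps = {}                 # path -> (file object, mmap object)

    def read(self, path, offset, length):
        if path not in self.maps:
            if len(self.maps) >= self.max_mapped:
                self.flush()           # trigger condition reached: flush all
            f = open(path, "rb")
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            self.maps[path] = (f, m)
        return self.maps[path][1][offset:offset + length]

    def flush(self):
        """munmap and close everything, then start filling again."""
        for f, m in self.maps.values():
            m.close()
            f.close()
        self.maps.clear()
```

Under the mount options in use (nocto, long acregmin/acregmax), each fresh open after a flush is also a point where the client may revalidate attributes, which is one plausible contributor to the per-file call traffic being discussed.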
Created attachment 346583 [details] log showing NFS operations every 10 seconds for a production server, illustrating the I/O rates
Why are they using 8k as the rsize and wsize? What is the Data ONTAP version?
Event posted on 06-09-2009 03:32pm EDT by cevich

Answers: Most I/Os are less than 4k; the 8k figure comes from the NetApp best-practices guide. The Data ONTAP version is 7.2.5.1, but they are planning to migrate to 7.3.1.
You can't just concatenate those capture files together like that; each has a header at the beginning which identifies it.

It might be easier to just count directories and then assume that everything else is a regular file. Then the "df -i" statistics could be useful. I might suggest stopping the find and trying to get the information some other way.

The more interesting information is the working-set size of the files being accessed or scanned. Are all 89 million looked at? Some subset? At what rate, i.e. how many files per second? Perhaps the strace output would be useful in this regard.

I will look at the capture files, but I am not terribly hopeful of getting much useful information out of them. The best capture file for me would be that 6G file as one continuous capture taken from the NFS client's perspective; that way, I could see which files were being looked at. By starting and stopping the network capture, packets are missed and the information becomes less and less reliable. It is also not possible to determine the duration over which the captures were taken, so I can't infer how many files per second were being accessed.

So, additionally, what is the hardware configuration of a typical NFS client in this deployment?
My current analysis:

In the big capture file, I see somewhere around 7,395,500 NFS RPC calls. Of these, somewhere around 3,175,300 are NFS ACCESS calls. These ACCESS calls were made for 750,950 unique file handles, of which somewhere around 455,200 were used more than once. The peak number of ACCESS calls for a single file handle was 170, and that handle was a directory, probably a (the?) mount point. I also noticed a lot of FSSTAT calls. Is df(1) being run frequently?

A question: is the equivalent of "noatime" and "nodiratime" being used on the NFS server file system?

Another question: the NFS client described by the sosreport in the IT appears to have 16G of memory and 8 3.16GHz CPUs. Have any tunings been applied?

Some back-of-the-envelope scratching: 3271 seconds of capture time, at roughly 2/10,000 of a second per ACCESS call, gives about 650 seconds spent doing ACCESS calls. Dividing that by the number of processors indicates that each processor spent about 80 seconds out of 3271 doing ACCESS calls. That's somewhere around 2.5% of the time. Of course, all of this is fuzzy math.

Looking at just absolute numbers, the number of ACCESS calls appears very high, almost 43% of all calls. However, with some context, this doesn't seem so high any more. The number of files is 750,000+, which reduces to an average of about 4 ACCESS calls per file over a period of more than 50 minutes. Given that each ACCESS call takes on the order of 0.2 milliseconds, about 1 millisecond per file was spent checking ACCESS over those 50+ minutes.

I think I would recommend looking elsewhere for increased performance, if a real need exists. Perhaps with this sheer number of files, a different architecture, utilizing multiple clients each working on a subset of the files, might help.
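For anyone who wants to check the fuzzy math, the figures above reduce to a few lines of arithmetic (the 0.2 ms per-ACCESS-call cost is the assumed value from the analysis, not a measured one):

```python
# Back-of-the-envelope arithmetic from the capture analysis above.
total_calls  = 7_395_500   # NFS RPC calls seen in the big capture
access_calls = 3_175_300   # of which, NFS ACCESS calls
unique_fhs   = 750_950     # unique file handles among the ACCESS calls
capture_secs = 3271        # capture duration
per_access_s = 0.0002      # ~0.2 ms per ACCESS call (assumed)
cpus         = 8

access_pct   = 100 * access_calls / total_calls       # ~43% of all calls
access_time  = access_calls * per_access_s            # ~650 s total
per_cpu_time = access_time / cpus                     # ~80 s per CPU
busy_pct     = 100 * per_cpu_time / capture_secs      # ~2.5% of wall time
per_file     = access_calls / unique_fhs              # ~4 ACCESS calls/file
```

The point of the exercise: a call mix that looks alarming by count (43%) translates into only a few percent of CPU-seconds once per-call cost and parallelism are factored in.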
I am not sure any of our NetApp best-practices guides suggest using 8k in a TCP environment; that used to be the case when using UDP. If one does, it needs to be corrected. We now strongly recommend 64k for rsize and wsize, and we are in the process of updating the documents to say so.
64k is good for TCP, but definitely not for UDP. For UDP, 32k would be a better value than 8k.
We highly recommend using TCP in NFS environments; UDP is no longer a best practice. When TCP is used, we also recommend 64k in newer Red Hat kernels such as RHEL 5 and later.
Any comments or feedback on Comment #62?
(I've made your analysis public since it may provide helpful info. to others facing a similar question) Thanks for the detailed analysis. I agree, we probably need to look elsewhere for bigger knobs to tweak. We've provided all this (and your questions) back to the customer and are waiting on feedback.
The customer states that client-side NFS ACL processing is driving a significant load on their NetApp filers. The customer is engaged with Red Hat Consulting to provide a client-side "fix". I'm not sure why server-side ACL processing cannot simply be disabled (in RHEL, for example, one would specify the no_acl export option). Either way, there isn't a client-side bug here.