Bug 431092
| Field | Value |
|---|---|
| Summary | Degraded NFSv3 client performance with kernel upgrade |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.1 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | low |
| Priority | low |
| Reporter | Johnny Hughes <johnny> |
| Assignee | Peter Staubach <staubach> |
| QA Contact | Martin Jenner <mjenner> |
| CC | admin, alessandro.tinivelli, amyagi, cmc, herrold, hiroto.shibuya, jlayton, joshua.bakerlepain, j.s.peatfield, mishu, pasteur, ralph+rh-bugzilla, rh, sm, steved |
| Target Milestone | rc |
| Target Release | --- |
| URL | http://bugs.centos.org/view.php?id=2635 |
| Doc Type | Bug Fix |
| Last Closed | 2008-02-06 15:34:16 UTC |
Description
Johnny Hughes, 2008-01-31 18:34:44 UTC
Haven't had time to verify it, but it looks like the silly-rename fixes that went in may have caused this.

The most helpful thing to get this resolved would be a way to reproduce this. Can you offer one?

I thought he did so :) It seems to be reproducible just by pushing around a rather large amount of files (see also http://bugs.centos.org/view.php?id=2635 and the referenced mailing list thread). NFS lookups in particular seem to go up in a large way. The guy from the mailing list is running against some EMC storage and seems to have the largest performance hit yet.

Jeff, why do you think the silly-rename fixes are causing this problem? There do not seem to be any renames or removes happening...

The largest performance issue we have heard of was with an EMC NAS device as the NFS server, where the difference was more than 2000x instead of 15x or 3.4x (it seems they were writing to MySQL databases that were on the NFS mount from more than one machine). But in controlled tests we can duplicate the effects by doing anything (copying, deleting, moving, untarring, updating databases, etc.) to files on the NFS share with the new kernel on the client.

Those were the only NFS-related patches that I saw between -53.1.4 and -53.1.6, so I figured they were the likely culprit. I could certainly be wrong, though...

The person with the problem said that backing out the "NFS related" packages solved his problem.

I've confirmed the observation reported by the original poster in the CentOS mailing list. I removed the 5 NFS patches added to 53.1.6 and rebuilt the kernel. It now behaves like 53.1.4 (no performance problem).

One more piece of information in relation to the 5 NFS patches added to the 53.1.6 kernel: I recompiled 53.1.6, removing only one patch at a time. It turned out that the performance problem went away when ANY ONE of them was omitted.

Sorry, please disregard comment #10.
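For anyone repeating the patch-removal experiment described above, the spec-file edit can be sketched as follows. Note this is a synthetic demo, not the real CentOS kernel.spec: the file content and path below are invented stand-ins, but the principle holds for the real SRPM, where both the `PatchNNNNN:` declaration and the matching `%patch` apply line must be commented out or rpmbuild fails on the dangling reference.

```shell
# Demo: comment out the "silly rename" patch entries 21849-21852 in a spec
# file. The fragment below is a synthetic stand-in for kernel-2.6.spec.
spec=$(mktemp)
cat > "$spec" <<'EOF'
Patch21848: linux-2.6-nfs-some-other-fix.patch
Patch21849: linux-2.6-nfs-infrastructure-changes-for-silly-renames.patch
%patch21848 -p1
%patch21849 -p1
EOF

# Prefix a '#' on every declaration and apply line for patches 21849-21852,
# leaving 21848 (reported harmless in the tests above) untouched.
sed -i -e 's/^Patch21849/# &/' -e 's/^Patch2185[0-2]/# &/' \
       -e 's/^%patch21849/# &/' -e 's/^%patch2185[0-2]/# &/' "$spec"

cat "$spec"
# After editing the real spec, rebuild with something like:
#   rpmbuild -bb --target="$(uname -m)" kernel-2.6.spec
```

The `&` in the sed replacement stands for the whole matched text, so each targeted line is preserved verbatim behind the comment marker, which makes the edit easy to revert.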
Because the result did not make sense, I rechecked the spec file and noticed I was not properly commenting out the patch lines. This needs to be retested by taking the correct line(s) out.

Hope I did it correctly this time. I don't have a complete set of tests yet, but so far I got the following:

- Removal of 21852 -> no effect
- Removal of 21851 and 21852 -> no effect
- Removal of 21849 to 21852 -> performance reverted to the 53.1.4 kernel level

This indicates at least 21848 is OK. ... and, as expected (?):

- Removal of 21850 to 21852 -> no effect

The whole "silly" patch series had to be omitted from 53.1.6 to solve the issue.

Looking at linux-2.6-nfs-infrastructure-changes-for-silly-renames.patch (Patch21849), I see that commit 3062c532ad410fe0e8320566fe2879a396be6701 is listed, but the patch matching that commit doesn't get completely applied - at least not in any obvious way. http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg18560.html and, for example, http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3062c532ad410fe0e8320566fe2879a396be6701;hp=be879c4e249a8875d7129f3b0c1bb62584dafbd8 show two changes: one to make nfs_set_verifier() stuff the data into dentry->d_time, and another in nfs_check_verifier() to pull the data out. In this patch things are different, because obviously more has changed elsewhere; e.g. nfs_set_verifier() in the git patch is shown in fs/nfs/dir.c, but we have it as an inline function in nfs_fs.h. Anyway, I can't see anything which changed nfs_check_verifier() to extract dentry->d_time, so it isn't doing what the original patch intended... Of course I could be totally off base. -- Jon

You might check bz321111 for a patch which seems to help significantly, and without backing out the sillyrename patches.

I would love to check... but only if that patch is made accessible to us.

Does anyone feel brave enough to see if the 'test' kernel(s) pointed at from #429109 (ie from http://people.redhat.com/dzickus/el5/) address this problem?
From the release-numbering I'm assuming that these are test versions of what is planned to go into 'update 2', so it might be worth checking if the same problems occur there as well (these have quite different sets of NFS patches, from what I see by diffing the specfiles etc.). -- Jon

Created attachment 293901 [details] Patch from bz321111

No magic about this patch... :-) It appears to significantly reduce the number of over-the-wire RPCs which are made, for some operation mixes. I suspect that it might help here.

The patch from bz321111 provided in #18 failed when applied to -53.1.6:

1 out of 30 hunks FAILED -- saving rejects to file fs/nfs/dir.c.rej

cat dir.c.rej shows:

```
***************
*** 827,833 ****
  */
 static void nfs_dentry_iput(struct dentry *dentry, struct inode *inode)
 {
-	nfs_inode_return_delegation(inode);
 	if (S_ISDIR(inode->i_mode))
 		/* drop any readdir cache as it could easily be old */
 		NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
--- 831,836 ----
  */
 static void nfs_dentry_iput(struct dentry *dentry, struct inode *inode)
 {
 	if (S_ISDIR(inode->i_mode))
 		/* drop any readdir cache as it could easily be old */
 		NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
```

That patch has been developed on much more recent builds of RHEL-5, namely the low to mid 70's. Some porting may be required... That said, the rejected chunk is the one that removes the call to nfs_inode_return_delegation() in nfs_dentry_iput().

We are experiencing the same issue with a NetApp file server mounted by a web server. Performance is not merely degraded: after a few seconds the web server stops responding and the file server is flooded by "crazy" lookup requests, becoming unavailable for everyone. This bug should have been considered URGENT.

*** Bug 431619 has been marked as a duplicate of this bug. ***

(In reply to comment #17)
> Does anyone feel brave enough to see if the 'test' kernel(s) pointed at from
> #429109 (ie from http://people.redhat.com/dzickus/el5/) address this problem?
>
> From the release-numbering I'm assuming that these are test versions of what is
> planned to go into 'update 2' so it might be worth checking if the same problems
> occur there as well (these have quite different sets of nfs patches from what I
> see by diff'ing the specfiles etc).
>
> -- Jon

I tested 2.6.18-77.el5, which was the latest in that directory. It has exactly the same problem.

(In reply to comment #20)
> That patch has been developed on much more recent builds of RHEL-5, namely
> the low to mid 70's. Some porting may be required...

Now that I have installed 2.6.18-77.el5 (see #23), I tried to apply the patch. It went through just fine this time. After loading the patched 2.6.18-77, I performed the same test as before. The NFS problem was apparently gone. Note that 2.6.18-77 without the patch still had the issue (#23). Akemi

Good. Thank you for trying this and posting the results.

(In reply to comment #25)
> Good. Thank you for trying this and posting the results.

Would this patch be included in 5.2? Or in the z series?

I can't make a commitment, but we are trying for 5.2. Please wait a bit and I'll have a better idea of whether the changes will make 5.2 or will have to wait until 5.3. In the meantime, I am going to close this bugzilla as a duplicate of 321111.

*** This bug has been marked as a duplicate of 321111 ***

321111 is not open to the public. Could you keep this bugzilla open and update the status here so that we can all see how things are developing?

Any changes to bz321111 won't be automatically reflected here, and I know that I won't remember to do so often enough. I will check to see how to make bz321111 readable.

------- Additional Comments From dzickus 2008-02-06 15:55 EST -------

in 2.6.18-78.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

(From an update to bz321111...)

Just tested 2.6.18-78.el5 (x86_64) and it worked fine. I hope others can confirm this.
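As a general technique for the porting problem discussed above, `patch --dry-run` reports failing hunks before anything is modified, and a real run saves them to `*.rej` files for manual inspection. A minimal self-contained sketch (the file and patch below are synthetic stand-ins, not the actual bz321111 patch):

```shell
# Demonstrate dry-running a patch and then applying it; failed hunks would
# be saved to *.rej files (e.g. fs/nfs/dir.c.rej in the report above).
workdir=$(mktemp -d)
cd "$workdir"
printf 'line one\nline two\n' > dir.c
cat > fix.patch <<'EOF'
--- a/dir.c
+++ b/dir.c
@@ -1,2 +1,2 @@
 line one
-line two
+line two, patched
EOF

patch -p1 --dry-run < fix.patch   # reports what would happen, touches nothing
patch -p1 < fix.patch             # actually applies; rejects would go to dir.c.rej
find . -name '*.rej'              # empty here, since the patch applies cleanly
```

Running the dry-run against the target kernel tree first is a cheap way to find out whether a patch written for a different build (here, the low-to-mid 70's builds) needs porting before committing to a full rebuild.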
Testing 2.6.18-78.el5 (x86_64) on a web server with a NetApp appliance: working fine at the moment.

In general, it would be a good policy (IMHO) not to close openly readable bugs as duplicates of bugs that are private. I know you want input from the several-million-strong CentOS userbase too :D I understand the need to have private bugs, but if CentOS is going to help Red Hat fix things, we will need bugs that are not private (except of course for current security issues where there might be exploits being discussed). In this particular case, I know I can access the private bug and I will follow it and post the fixes / public comments here that do not need to be private. That is also an option for some of these private cases. Just trying to help make EL5 better for everyone :D

Thanks for the advice, although it was completely unnecessary. When I closed this bugzilla as a duplicate of 321111, I didn't realize that 321111 was restricted. When I found this out, I opened it as much as I could. I can't completely open it by myself, so I am working on getting it opened the rest of the way. Please give us a chance to do the work...

Bz321111 should now be open. If it is not, then please let me know. Thank you.

I can see it now.

I also experienced a significant drop in NFS performance on going from kernel 53.1.4 to 53.1.6 on NFS clients. (This is on Scientific Linux 5.1, as mentioned in comment #0.) The 2.6.18-78.el5 kernel also fixes it for me. I'm surprised to see this bug marked as low priority, with the possibility of no fix until even RHEL 5.3 (if I read comment #27 correctly). I consider the NFS performance of 53.1.6 to be so bad as to be almost unusable, and I've reverted all my clients to 53.1.4. It may be that the changes that fix bug #321111 also happen to fix this issue, but that doesn't mean this issue is a duplicate of that one. This issue is a very specific, noticeable problem that occurred in the change from 53.1.4 to 53.1.6.
If the NFS changes there were indeed relatively small, then I would have hoped for a relatively easy fix. Please consider giving this bug a higher priority, and releasing a fix for this specific issue on a shorter time-scale. (Apologies if I have misunderstood comment #27 or bug #321111.)

Yes, the fact that this bug had a "low priority" is an incredible underestimation of it by Red Hat. The word "degradation" is wrong: in my environment the machine with the new, buggy kernel FLOODED THE FILE SERVER, MAKING IT ALMOST COMPLETELY UNAVAILABLE, even for unaffected machines. To give a sense of scale: with 6 servers working against my filer, the NetApp CPU usage was about 30%. Moving just ONE of them to the buggy kernel, the CPU was between 80% and 100%. I wonder what would have happened if I had upgraded two more machines before seeing this! And, at the moment, if someone installs a new machine today and issues "yum update", he gets the buggy kernel!!!

> However, keep in mind that it is a TEST kernel. The .78 kernel I tested and
> confirmed about the nfs fix is UNstable and some people are experiencing system
> instability / crashes.

amyagi, have you tested .79 to see if it has the same instability issues that you mentioned .78 had in the comments from bug 432251?

I have not tested .79 yet. The instability with .78 was also seen by the Scientific Linux people. In my case, I set up a virtual machine just to do a test and therefore did not run it long enough to say anything about the stability.

Could we stick with one issue per bugzilla, please? Please do not read too much into the priority setting for this bugzilla. The patch from 321111 was integrated into b78, so, barring unforeseen issues with that patch, RHEL-5.2 should contain the changes.

It's nice that NFS will (probably) be better in 5.2.
But in the meantime, given that an important kernel security update for 5.1 exists today, those of us affected by this NFS problem are left with the choice between poor NFS performance, security risks, or trying to back out the .4 to .6 NFS-related changes on our own.

(In reply to comment #42)
> But in the meantime, given that an important kernel security update for 5.1
> exists today, those of us affected by this NFS problem are left with the choice
> between poor NFS performance, security risks, or trying to back out the .4 to .6
> NFS-related changes on our own.

In fact, I have just done that - rebuilding the -53.1.13 kernel without the four "silly" NFS patches. Its NFS behavior was back to normal, as expected (of course, nothing surprising). Akemi

gmorris: if you want, you can use the test .79 kernel from http://people.redhat.com/dzickus/el5. It fixes the NFS problem and, from our testing, the exploit (bug 432251) as well. It seems stable enough for daily use. Also, it looks like he has a .80 kernel now as well, which could be tested.

Is there anything happening with this bug in RHEL 5.1? We have about 200 workstations in need of this patch and about 200 more waiting to be reinstalled with RHEL 5. The joy of being a large site with one common system, letting about 10k users log on anywhere and get the same desktop and files, is not quite there after the introduction of this kernel...

I can only advise you that I have been using the test kernels numbered .80 and .83 and they seem stable for my use (web server). However, it is very hard to understand how Red Hat can still distribute (as the official one!) a kernel which performs DoS attacks on the NFS servers it mounts. But, honestly, I have to say that the DoS attacks are executed perfectly.
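For anyone trying to confirm whether a client is generating this kind of LOOKUP flood, a rough sketch that samples the client-side NFSv3 RPC counters from /proc/net/rpc/nfs (the file exists only when the NFS client module is loaded; on a machine without NFS the script simply reports zero):

```shell
# Sample the NFSv3 LOOKUP counter twice and report the rate. In
# /proc/net/rpc/nfs the "proc3" line lists per-procedure call counts in
# RPC procedure order (null, getattr, setattr, lookup, ...), so the
# lookup count is the 4th number, i.e. awk field 6 on that line.
lookups() {
    awk '$1 == "proc3" { print $6; found = 1 }
         END { if (!found) print 0 }' /proc/net/rpc/nfs 2>/dev/null || echo 0
}

interval=2
a=$(lookups)
sleep "$interval"
b=$(lookups)
delta=$((b - a))
echo "NFSv3 LOOKUP calls in the last ${interval}s: $delta"
```

On an affected client, repeating a metadata-heavy operation (untarring, copying, or deleting a large file tree on the mount) while this runs should show the lookup rate climbing far beyond what the same workload produces on a 53.1.4 kernel.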