Red Hat Bugzilla – Full Text Bug Listing
Summary: NFS server crashes in readdir_fstat_cbk due to extra fd unref
Product: [Community] GlusterFS
Component: nfs
Version: 3.1.0
Status: CLOSED CURRENTRELEASE
Reporter: Shehjar Tikoo <shehjart>
Assignee: Shehjar Tikoo <shehjart>
CC: aavati, admin, d.a.bretherton, gluster-bugs, rabhat, vijaykumar
Doc Type: Bug Fix
Story Points: ---
oVirt Team: ---
Description Shehjar Tikoo 2010-11-07 23:21:28 EST
From email on gluster-users:

I upgraded to GlusterFS 3.1 a couple of weeks ago and overall I am very impressed; I think it is a big step forward. Unfortunately there is one "feature" that is causing me a big problem - the NFS process crashes every few hours when under load. I have pasted the relevant error messages from nfs.log at the end of this message. The rest of the log file is swamped with these messages, incidentally:

[2010-11-06 23:07:04.977055] E [rpcsvc.c:1249:nfs_rpcsvc_program_actor] nfsrpc: RPC program not available

There are no apparent problems while these errors are being produced, so this issue probably isn't relevant to the crashes.

To give an indication of what I mean by "under load": we have a small HPC cluster that is used for running ocean models. A typical model run involves 20 processors, all needing to read simultaneously from the same input data files at regular intervals during the run. There are roughly 20 files, each ~1GB in size. At the same time, several people are typically processing output from previous runs from this and other (much bigger) clusters, chugging through hundreds of GB and tens of thousands of files every few hours. I don't think the Gluster-NFS crashes are purely load dependent, because they seem to occur at different load levels, which is what leads me to suspect something subtle related to the cluster's 20-processor model runs.

I would prefer to use the GlusterFS client on the cluster's compute nodes, but unfortunately the pre-FUSE Linux kernel has been customised in a way that has thwarted all my attempts to build a FUSE module that the kernel will accept (see http://gluster.org/pipermail/gluster-users/2010-April/004538.html). The servers that are exporting NFS are all running CentOS 5.5 with GlusterFS installed from RPMs, and the GlusterFS volumes are distributed (not replicated). Two of the servers with GlusterFS bricks are actually running SuSE Enterprise 10; I don't know if this is relevant.
I used previous GlusterFS versions with SLES10 without any problems, but as RPMs are not provided for SuSE I presume it is not an officially supported distro. For that reason I am only using the CentOS machines as NFS servers for the GlusterFS volumes. I would be very grateful for any suggested solutions or workarounds that might help to prevent these NFS crashes. -Dan.

nfs.log extract
------------------
[2010-11-06 23:07:10.380744] E [fd.c:506:fd_unref_unbind] (-->/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e) [0x2aaaab30813e] (-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41) [0x2aaaab9a6da1] (-->/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d) [0x2aaaab9b0bdd]))) : Assertion failed: fd->refcount
pending frames:
patchset: v3.1.0
signal received: 11
time of crash: 2010-11-06 23:07:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0
/lib64/libc.so.6[0x35746302d0]
/lib64/libpthread.so.0(pthread_spin_lock+0x2)[0x357520b722]
/usr/lib64/libglusterfs.so.0(fd_unref_unbind+0x3d)[0x38f223511d]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs3svc_readdir_fstat_cbk+0x22d)[0x2aaaab9b0bdd]
/usr/lib64/glusterfs/3.1.0/xlator/nfs/server.so(nfs_fop_fstat_cbk+0x41)[0x2aaaab9a6da1]
/usr/lib64/glusterfs/3.1.0/xlator/debug/io-stats.so(io_stats_fstat_cbk+0x8e)[0x2aaaab30813e]
/usr/lib64/libglusterfs.so.0(default_fstat_cbk+0x79)[0x38f2222a69]
/usr/lib64/glusterfs/3.1.0/xlator/performance/read-ahead.so(ra_attr_cbk+0x79)[0x2aaaaaeec459]
/usr/lib64/glusterfs/3.1.0/xlator/performance/write-behind.so(wb_fstat_cbk+0x9f)[0x2aaaaace402f]
/usr/lib64/glusterfs/3.1.0/xlator/cluster/distribute.so(dht_attr_cbk+0xf4)[0x2aaaab521d24]
/usr/lib64/glusterfs/3.1.0/xlator/protocol/client.so(client3_1_fstat_cbk+0x287)[0x2aaaaaacd2b7]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x38f1a0f2e2]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x38f1a0f4dd]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x38f1a0a77c]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaac3eb435f]
/usr/lib64/glusterfs/3.1.0/rpc-transport/socket.so(socket_event_handler+0x168)[0x2aaac3eb44e8]
/usr/lib64/libglusterfs.so.0[0x38f2236ee7]
/usr/sbin/glusterfs(main+0x37d)[0x4046ad]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x357461d994]
/usr/sbin/glusterfs[0x402dc9]
Comment 1 Dan Bretherton 2010-11-09 05:10:52 EST
Hello Gluster developers,

Thanks for looking at this NFS problem. I have been running with the NFS server processes in TRACE mode for a day or so but they are stubbornly refusing to crash. There are a couple of possible reasons that I can think of (based on not very much evidence and a limited understanding of how GlusterFS works):

1) It is one of those programs that doesn't go wrong when you are debugging it (which happened to me quite often when I used to do C programming).

2) The fault occurs only for newly created volumes where there is pre-existing data on the bricks (as when upgrading from 3.0.5 to 3.1). The reason I think that might be the case is some odd behaviour I noticed after I first upgraded. I think I encountered this bug: http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2057, before another user posted their similar findings on the Gluster users list. I didn't think it was a bug because I've seen similar behaviour in the past after creating distributed volumes with pre-existing data.

To solve the problem of not finding any subdirectories, I saved the output of "find . -print" run on each of the bricks, then ran a script to "ls" each line in each of these output files on a GlusterFS client. After that I could see all the files in all the subdirectories (after a non-exhaustive search), but still experienced long delays when browsing some directories for the first time. It was after that when the NFS process started crashing, when people came to work on the Monday morning and started processing the data and running jobs on the HPC cluster. I thought perhaps that GlusterFS was still settling down in some sense, optimising its hashing algorithm perhaps. I am guessing here, but thought it might be worth mentioning my observations.

I should add that the NFS load is distributed much better now than it was at first. Initially there was only one NFS server machine, but there are now several machines acting as NFS servers for the different GlusterFS volumes. Having said that, one of the NFS processes did crash almost as soon as the first model run started on the cluster, and that is what led me to believe that the problem was not load dependent. All the cluster's 64 processors are in use now (not just 20 as before) but NFS is still refusing to crash.

I am not sure how best to proceed. I will probably wait another day or so to see if anything happens while still running in TRACE mode. I will be away from the office Thursday-Sunday so won't be able to do anything else until next week.

Regards, Dan Bretherton
Comment 2 Dan Bretherton 2010-11-10 08:37:37 EST
Created attachment 374
Comment 3 Dan Bretherton 2010-11-10 08:40:38 EST
Comment on attachment 374

There were 5 crashes in a row, and I made a separate copy of the nfs.log file each time to avoid my logrotate cron job cleaning them up. Therefore some of the files may contain the same information. The first log file in the sequence (nfs_trace.log.CRASH) goes back 5 minutes and is 1.3GB uncompressed.
Comment 4 Shehjar Tikoo 2010-11-10 20:56:28 EST
Thanks! I'll try to look at this soon.
Comment 5 Raghavendra Bhat 2010-11-10 21:00:12 EST
*** Bug 2079 has been marked as a duplicate of this bug. ***
Comment 6 Shehjar Tikoo 2010-11-15 01:26:13 EST
Dan, please break the tarball into separate log files. Then I may not have to download the full 1.3GB file to investigate the bug. Smaller log files in this tarball may be just as useful. Thanks.
Comment 7 Dan Bretherton 2010-11-15 02:30:17 EST
(In reply to comment #6)
> Dan, please break the tarball into separate log files. I may not have to
> download the full 1.3g file to investigate the bug. Smaller log files in this
> tarball may be just as useful. Thanks.

It is only 97 MB compressed. I can still split it up if you like. The big one is called nfs_trace.log.CRASH, so you can save some time by not extracting it. The small ones are nfs_trace.log.CRASH2, ..., nfs_trace.log.CRASH5. -Dan.
Comment 8 Anand Avati 2010-11-15 04:03:53 EST
PATCH: http://patches.gluster.com/patch/5699 in master (nfs: opendir/closedir for every readdir)
Comment 9 Udo Waechter 2010-11-17 07:42:41 EST
Hi, just to give our experience. We have also experienced crashes of NFS with gluster 3.1 (distributed volume, 3 servers, Lucid Lynx 64-bit with kernel 2.6.35-22-server). Our crashes happened when a lot of clients (~100) connected to the nfs-volume. It was not so much the bandwidth or the number of requests. Maybe the nfs-server runs out of file descriptors? Thanks -- udo.
Comment 10 Shehjar Tikoo 2010-11-21 22:25:27 EST
(In reply to comment #9)
> Hi,
> just to give our experience.
> We have also experienced crashes of NFS with gluster 3.1 (Distributed volume, 3
> Servers, Lucid Lynx 64bit with kernel 2.6.35-22-server)
>
> Our crashes happened when a lot of clients connected (~100) to the nfs-volume.
> It was not so much the bandwith or requests.
>
> Maybe the nfs-server runs out of file-descriptors?
>
> thanks -- udo.

Nah. It was just an oversight on my part. The directory reading code didn't behave well with hundreds of clients accessing the same directory. The patch above will fix it.
Comment 11 Vijaykumar 2011-09-07 04:08:47 EDT
As discussed with the developer, this is a very narrow corner case that is difficult to hit, so moving it to the verified state.