Description of problem:
I have a 5-node GFS cluster running ES 3u4 with GFS 6.0.2-24, based on HP DL360s attached to an MSA1000 with 3 volumes of 20, 40, and 500 GB. On the largest volume I have thousands of small files in a few directories. The first three nodes are also lock managers. Any time a process accesses all of the files, that machine becomes unresponsive and the filesystem won't allow access, which requires a hard reboot or fence_node to correct.

Version-Release number of selected component (if applicable):
ES3u4, kernel 2.4.21-27.0.1, GFS 6.0.2-24

How reproducible:
Always

Steps to Reproduce:
1. Create a large volume and populate it with thousands of files
2. Run "find /gfs"

Actual results:
Node becomes unresponsive after a while and the filesystem cannot be accessed

Expected results:
Node should continue to operate normally

Additional info:
This smells like an embedded gulm issue.
By embedded gulm issue, are you talking about my having three of the five nodes acting as lock managers? If so, is there anything that can be done to make this better, aside from adding machines to do lock management? Also, in the cluster mailing list there was a message that alluded to several undocumented lock_gulm tunable parameters, like high water mark settings etc. Are they documented yet? If not, when will they be?
Set up five nodes with embedded lock servers on three of them. Mounted a 147G gfs file system with 10 directories and 200000 files (20000 in each dir). Ran find on a client-only node, then on the master lock server node, then on all five nodes at once. No problems. (Every couple of minutes or so, ran ssh <node> uptime to make sure they were still responsive; a sketch of that check follows.) Machines are all dual Pentium III, 256M RAM, 100Mb ethernet, qla2200 fibre channel to a EuroLogic array. Something else on your nodes that is using resources isn't playing nice with the gfs tools. Tunables are documented in the 6.1 release. I'll get that done in the 6.0 version too.
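A minimal sketch of that kind of periodic responsiveness check (node names are placeholders):

# check that each node still answers every couple of minutes
while true; do
    for n in node1 node2 node3 node4 node5; do
        ssh $n uptime || echo "$n did not respond"
    done
    sleep 120
done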
Thanks, I was beginning to think this had gotten kicked to the curb. I've been away on travel for a few weeks, so I haven't had a chance to read this until now. Seeing as your machines only have 256MB of RAM, I don't think it's a memory issue anymore, and we are using gigabit ethernet as well with the tg3 driver. Our hardware is as follows:
nodes: HP DL360 G3, dual Pentium
fc cards: QLogic FC2340, single port
san: HP MSA1000
nic: built-in Broadcom-based 1Gb NICs, tg3.o driver
I'll try it with the newer updates and see how things go. I'll keep you posted...
More info... By thousands of files I mean thousands of ~50k files, which add up to about 400GB. So things are still a bit different with respect to filesystem size and number of files, but I'll go do my testing and get back with results...
Tweaked so that I had 10 directories and 200000 files (20000 in each dir), where each file was 50k. Ran the finds again, same results. Not sure what to say. What else besides the base OS and gfs-related processes is running on the nodes when you run the find?
Well, I think we can rule out simple finds, as I ran several of them, as have you. The culprit, it seems, is writes that were going on while the find was running. I've seen, just today in fact, another user with a similar problem on the cluster mailing list. I have a test cluster with only three nodes, and I can kill it almost at will by rsyncing the 400+GB of data to any node in the cluster. As for other software, it's only the base OS and GFS, with data getting rsync'd from another source. Any other ideas/help will be greatly appreciated. Also, if there are any specific tests you'd like me to run on this cluster, I'll be happy to. Thanks
Ok, this starts to make more sense. Sounds like network flooding (or rather, pushing the network card to its limit). Can you see if it is just large writes, or large copies over the network? (I'm trying a bunch of large writes, and things seem ok: dd if=/dev/zero of=/gfs/Joga/BAfile bs=25M count=400.) I'm betting it's the large copies over the network. The thing with rsync (and, well, most similar tools) is that they try to push as much data as the pipe will allow. This doesn't leave much room for lock or heartbeat traffic. To date, the best solution we have found for this is to install multiple ethernet cards and give the lock and heartbeat traffic its own network. Also, for just rsync, you can try adding the --bwlimit option (an example follows at the end of this comment). The trick is that there needs to be some space left in the pipe out of the network card for lock and heartbeat traffic. Otherwise gfs cannot get locks, and will get stuck. For the finds I did above, I measured about 1MB/s to 1.5MB/s while the find was running. Most of that was lock traffic. (Some of it was ganglia recording the network stats.)
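For example, something along these lines keeps rsync well under a 100Mb link (the path, host name, and exact limit here are only illustrative):

# --bwlimit is in KBytes/sec; 10000 KB/s is roughly 80 Mbit/s, which
# leaves some of a 100Mb link free for lock and heartbeat traffic
rsync -a --bwlimit=10000 /source/data/ targetnode:/gfs/data/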
The traffic shaping stuff might also work for this. I'm testing some things to see if it does, since I don't yet know.
This is starting to sound more promising, as someone else on the cluster mailing list is having this very same problem with large writes across the network. I'll try setting up a second nic after doing some large local writes and see how that goes. Thanks!
Ok, ran some tests overnight, and it looks as though you may have hit the nail on the head. If I use a --bwlimit of 10000, it works using 100Mb connections. Without the restriction, or with a higher number, it locks up the target machine (which makes sense, since 10000KB/s (~81Mbit) is around peak for a 100Mb card). I'll plug the machines into GB ports soon and retest with much higher limits and see how that goes.
I played with the traffic shaping tool available in RHEL3, and for my contrived tests, it seems to be another solution to this. Since the rules need to be tweaked for a given setup, I'm not pasting the ones I used here, but a generic sketch of the idea follows. Have you tried the higher limits yet?
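A very rough sketch of the approach, not the rules I actually ran (the interface, rates, and the port to match are placeholders that would have to be tweaked per setup; this one caps bulk rsync-over-ssh traffic so lock and heartbeat traffic keeps some headroom):

# cap traffic destined for ssh (port 22) at 80mbit, leave the rest alone
tc qdisc add dev eth0 root handle 1: htb default 30
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 80mbit ceil 80mbit
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 20mbit ceil 100mbit
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
    match ip dport 22 0xffff flowid 1:10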
I tried the higher limits and the target node locked up, which was expected since the higher limits were more than a 100Mb nic could handle. I am now trying the latest version I could find (6.0.2.20), which I believe includes the alternate nic usage capability, but I am not finished yet. I also need to try setting the lt_high_mark settings higher, I think.
Tried the alternate nic yet? Does it work/help?
Sorry for the delay again. Last I tried, the second nic "seemed" to be working, although GFS was reporting the primary nic being used on some machines, even though that would be impossible since the nics are on separate VLANs (nic2 private). I am restarting the testing again, so I'll post more in the next day or two.
Any updates on your results? Are you still having issues?
Ok, I'm back at this, and here is where we sit. The hardware is this:
NODES: HP DL380 G3
FIBRE: QLogic 2340
SAN: HP HSG80 (I've also had HP MSA1000s hooked up)
We have a test cluster now with 3 nodes, connected to fibre storage (HSG80), running RHEL3u5 (kernel 2.4.21-32.ELsmp, GFS-6.0.2.20-1). All three nodes are lock managers and all three nodes access the file system. When I rsync data (several hundred gigs) to any of the cluster nodes, after a time (in the latest case, 379GB worth) the target node ran out of memory. This requires the machine to be rebooted, since nothing will fork. Once that is done, it all comes back and things are fine again. What we have observed is that the free locks go to zero and the system starts creating more locks to catch up. We have set lt_high_locks = 2097151 in the cluster.ccs file in order to try to avoid the lock flushing scenario, and are running the test again (a rough sketch of where that setting sits follows at the end of this comment).

We have not used the alternate nic method in this round of tests, as the last time I tried it, gulm_tool nodestats was reporting conflicting configuration information. That is to say, even though GFS was configured to use eth1 as its interface, eth1 was on a private LAN, and GFS started and allowed access to files etc., gulm_tool would report that one of the nodes was actually using eth0, which, given the way things were configured, would have been impossible.

Also: the interaction between nodes with respect to locks is not absolutely clear to me. What I believe is this:
1. Node A wants a lock and asks its lock_gulmd.
2. Node A's lock_gulmd then asks the lock master whether the lock is held.
3. The lock master asks the other nodes via gulm whether anyone holds the lock.
4. If any of the nodes respond yes, then the request from Node A is denied by sending a response to Node A's lock_gulmd.
5. Node A's lock_gulmd passes the yes/no down the chain.
Is this correct? Can you explain this if not? The effort to get this fixed is a high priority again around here, as people are getting a little tired of rebooting nodes. We will be working this problem until it is fixed. Your help is appreciated, thanks
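A rough sketch of where the lt_high_locks setting sits in cluster.ccs (the cluster and node names here are placeholders, not our real config, and I'm assuming the tunable goes inside the lock_gulm section along with the servers list):

cluster {
    name = "testcluster"
    lock_gulm {
        servers = ["nodea", "nodeb", "nodec"]
        lt_high_locks = 2097151
    }
}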
I believe this problem is caused by locks not being released by GFS. Can you provide the output of the following commands?
On the gulm master server:
gulm_tool getstats localhost:lt000
On the clients:
cat /proc/gulm/lockspace
Can you also run top on the gulm servers and clients and watch the memory utilization of lock_gulm? (Type 'M' to sort by memory.)
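If it's easier to capture over time than to watch top interactively, something like this run on each node would also show the trend (the sampling interval is arbitrary):

# sample lock_gulmd memory usage (RSS/VSZ in KB) once a minute
while true; do
    date
    ps -C lock_gulmd -o pid,rss,vsz,args
    sleep 60
done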
Will do. We did another test in order to observe the high water behavior. We set the limit to 10k (the lower limit), then ran the xfer. Once the locks hit the 10k mark, we witnessed the periodic flush, but the locks still crept up to 30k+. We then stopped the xfer expecting the locks to decrease, but they did not. Restarting the xfer simply increased the locks again at a steady rate. Thanks again
Ok, here are the files. This is while data is being sent to nodea.

gulm_tool getstats nodeb:lt000
I_am = Master
run time = 6777
pid = 3123
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 0
locks = 101602
unlocked = 6388
exculsive = 40
shared = 95174
deferred = 0
lvbs = 6391
expired = 0
lock ops = 2920678
conflicts = 8
incomming_queue = 0
conflict_queue = 0
reply_queue = 0
free_locks = 72630
free_lkrqs = 60
used_lkrqs = 0
free_holders = 72630
used_holders = 103459
highwater = 10000

on nodea: cat /proc/gulm/lockspace
lock counts:
total: 279839
unl: 141919
exl: 9
shd: 137911
dfr: 0
pending: 0
lvbs: 4024
lops: 418502

on nodec:
lock counts:
total: 45
unl: 20
exl: 3
shd: 21
dfr: 0
pending: 0
lvbs: 0
lops: 11689

I had to type this in, but it should be accurate...
More info... The above output was taken when the xfer was started again, so it might not give you what you want. We observed the memory for lock_gulmd creeping up steadily at about 400k/min once the available locks went to zero, and the locks rose at about 15 per second during that time.
More info. We ran the xfer over the weekend and were able to make it fail again. The machine at some point runs out of memory and cannot fork anymore. This seems to allow the cluster to stay "alive" but in a bad state, since lock_gulmd on the affected machine can still heartbeat etc. The other nodes don't see a "problem" since they are getting HB info, but the affected machine is not able to answer any requests for locking status etc. Therefore that particular filesystem is in a hung state. So it seems there is a leak somewhere. The locks reached 300K+ as reported by gulm_tool getstats. We had the highwater mark set to 10000 to force lock cleanup, and that is noted in the logs. The system did free locks when it was supposed to, or at least tried; there are fluctuations which indicate it was doing something. Either it's not doing enough, or the leak (if it exists) is elsewhere.
The high water mark only helps free locks that are cached in gulm, not locks held by GFS caching inodes, etc. We are working on a patch which will force GFS to release inodes in its cache which will in turn allow gulm to release those locks. Do you know approx. how much memory gulm is using before your cluster dies?
The nodes have 3GB of RAM, and gulm is using approx 85-90M when things go down the tubes.
Are you still using GFS 6.0.2-24? Also, which kernel are you using?
I received a patched kernel and version of GFS to test. The short description from #17 above is the current config. The hardware is this:
NODES: HP DL380 G3
FIBRE: QLogic 2340
SAN: HP HSG80 (I've also had HP MSA1000s hooked up)
We have a test cluster now with 3 nodes, connected to fibre storage (HSG80), running RHEL3u5 (kernel 2.4.21-32.ELsmp, GFS-6.0.2.20-1). All three nodes are lock managers and all three nodes access the file system. I've not had a chance to test the new patches (pre update 7 stuff) because the qlogic drivers expect the san to be "SPIFFI" compliant, which an HSG80 is not. I need to get an older copy of the drivers, then I'll retest the system with the tuning suggested.
The patched kernel and version of GFS I received seem to be doing fine with inoded_purge set to 30 as recommended. I have 800GB+ transferred successfully so far with no signs of failure. I have been transferring it in chunks of approx 300GB. Once I get all 1.1TB transferred, I'll go for the whole thing (without a reboot) and see how it goes. Is there a way to get the same effect (purging locks) on an RHEL3u4 system? If I can do it without upgrading, that would be ideal. A command I can run manually if I have to?
Unfortunately, there isn't a command to purge the locks on RHEL3U4; a kernel change was necessary, which is why we haven't been able to get it in until the pre-RHEL3U7 code. Can you post the output of the following commands, just so I know exactly what you have?
rpm -q GFS
rpm -q GFS-modules
rpm -q kernel
uname -a
Thanks!
I got the GFS and Kernel rpms from http://people.redhat.com/wcheng/Patches/GFS/
Sorry, forgot to add that I am using the i686 smp rpms...
Interesting to note: I reformatted my GFS partition and remounted it. In doing so I forgot to issue a 'gfs_tool settune /data inoded_purge 30' against the mount point, and started the xfer all over again, this time intending to xfer the whole 1.2TB at one time. This morning I checked the xfer, and although it was still going, the locks were at 600K+, so I issued the above command and the locks went down to 300K+ but do not seem to be getting any lower. During the previous test the locks stayed below 12K. I expected the same after issuing the settune, since my lt_high_locks is set to 10000. I am going to let the xfer go to completion and hope that at that point the locks will trim down to something more sane, as it may be that lt_high_locks simply can't keep up. I expected the 30% purge every 16 secs to free up a lot more locks than it did, though.
Update on comment 31 above: I did not run the settune on all cluster members, only on a single node. Once I ran it on the other nodes, the locks fell below the 10000 mark as expected.
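In case it saves someone else the same mistake, here is roughly how I am applying it everywhere now (node names and the mount point are ours; adjust as needed). It also has to be reissued after a remount, as comment 31 shows:

# inoded_purge is per-node and per-mount, so set it on every node
for n in nodea nodeb nodec; do
    ssh $n "gfs_tool settune /data inoded_purge 30"
done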
So far the patch set used (pre update 7 stuff) is working very well. I've not had to kill any of the nodes and all my data is getting sync'd up fine. Any word on when update 7 will be official?
Great! Unfortunately I don't have an ETA for RHEL3U7, but it should happen in the next few months.
Marking this bug modified.
Will the fixes at least be in an errata kernel?
The fixes should be scheduled to be in the RHEL3U7 errata kernel, I don't believe they'll go in any sooner than that.
Cleanup. This bugzilla should be resolved with the current packages from RHN for Red Hat Global File System for Red Hat Enterprise Linux 3. GFS-6.0.2.36-9