Description of problem:
I have a 5-node GFS cluster running ES 3u4 with GFS 6.0.2-24, based on HP DL360s attached to an MSA1000 with 3 volumes of 20, 40, and 500 GB. On the largest volume I have thousands of small files in a few directories. The first three nodes are also lock managers. Any time a process accesses all of the files, that machine becomes unresponsive and the filesystem won't allow access, which requires a hard reboot or fence_node to correct.

Version-Release number of selected component (if applicable):
ES3u4, kernel 2.4.21-27.0.1, GFS 6.0.2-24

How reproducible:
Always

Steps to Reproduce:
1. Create a large volume and populate it with thousands of files
2. Run "find /gfs"

Actual results:
Node becomes unresponsive after a while and the filesystem cannot be accessed

Expected results:
Node should continue to operate normally

Additional info:
This smells like an embedded gulm issue.
By embedded gulm issue, are you talking about my having three of the five nodes acting as lock managers? If so, is there anything that can be done to make this better, aside from adding machines to do lock management? Also, in the cluster mailing list there was a message that alluded to several undocumented lock_gulm tunable parameters, like high water mark settings etc. Are they documented yet? If not, when will they be?
Set up five nodes with embedded lock servers on three of them. Mounted a 147G gfs file system with 10 directories and 200000 files (20000 in each dir). Ran find on a client-only node, then on the master lock server node, then on all five nodes at once. No problems. (Every couple of minutes or so, ran ssh <node> uptime to make sure they were still responsive; a sketch of that check follows.) Machines are all dual Pentium III, 256M RAM, 100Mb ethernet, qla2200 fibre channel to a EuroLogic array. Something else on your nodes that is using resources isn't playing nice with the gfs tools. Tunables are documented in the 6.1 release. I'll get that done in the 6.0 version too.
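A minimal sketch of that kind of periodic responsiveness check (node names are placeholders):

# check that each node still answers every couple of minutes
while true; do
    for n in node1 node2 node3 node4 node5; do
        ssh $n uptime || echo "$n did not respond"
    done
    sleep 120
done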
Thanks, I was beginning to think this had gotten kicked to the curb. I've been away on travel for a few weeks, so I haven't had a chance to read this until now. Seeing as your machines only have 256MB of RAM, I don't think it's a memory issue anymore, and we are using gigabit ethernet as well with the tg3 driver. Our hardware is as follows:
nodes: HP DL360 G3, dual Pentium
fc cards: QLogic FC2340, single port
san: HP MSA1000
nic: built-in Broadcom-based 1Gb NICs, tg3.o driver
I'll try it with the newer updates and see how things go. I'll keep you posted...
More info... By thousands of files I mean thousands of ~50k files, which add up to about 400GB. So things are still a bit different with respect to filesystem size and number of files, but I'll go do my testing and get back with results...
Tweaked so that I had 10 directories and 200000 files (20000 in each dir), where each file was 50k. Ran the finds again, same results. Not sure what to say. What else besides the base OS and gfs-related processes is running on the nodes when you run the find?
Well, I think we can rule out simple finds, as I ran several of them, as have you. The culprit, it seems, is writes that were going on while the find was running. I've seen, just today in fact, another user with a similar problem on the cluster mailing list. I have a test cluster with only three nodes, and I can kill it almost at will by rsyncing the 400+GB of data to any node in the cluster. As for other software, it's only the base OS and GFS, with data getting rsync'd from another source. Any other ideas/help will be greatly appreciated. Also, if there are any specific tests you'd like me to run on this cluster, I'll be happy to. Thanks
Ok, this starts to make more sense. Sounds like network flooding (or rather, pushing the network card to its limit). Can you see if it is just large writes, or large copies over the network? (I'm trying a bunch of large writes, and things seem ok: dd if=/dev/zero of=/gfs/Joga/BAfile bs=25M count=400.) I'm betting it's the large copies over the network. The thing with rsync (and, well, most similar tools) is that they try to push as much data as the pipe will allow. This doesn't leave much room for lock or heartbeat traffic. To date, the best solution we have found for this is to install multiple ethernet cards and give the lock and heartbeat traffic its own network. Also, for just rsync, you can try adding the --bwlimit option (an example follows at the end of this comment). The trick is that there needs to be some space left in the pipe out of the network card for lock and heartbeat traffic. Otherwise gfs cannot get locks, and will get stuck. For the finds I did above, I measured about 1MB/s to 1.5MB/s while the find was running. Most of that was lock traffic. (Some of it was ganglia recording the network stats.)
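For example, something along these lines keeps rsync well under a 100Mb link (the path, host name, and exact limit here are only illustrative):

# --bwlimit is in KBytes/sec; 10000 KB/s is roughly 80 Mbit/s, which
# leaves some of a 100Mb link free for lock and heartbeat traffic
rsync -a --bwlimit=10000 /source/data/ targetnode:/gfs/data/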
The traffic shaping stuff might also work for this. I'm testing some things to see if it does, since I don't yet know.
This is starting to sound more promising, as someone else on the cluster mailing list is having this very same problem with large writes across the network. I'll try setting up a second nic after doing some large local writes and see how that goes. Thanks!
Ok, ran some tests overnight, and it looks as though you may have hit the nail on the head. If I use a --bwlimit of 10000, it works using 100Mb connections. Without the restriction, or with a higher number, it locks up the target machine (which makes sense, since 10000KB/s (~81Mbit) is around peak for a 100Mb card). I'll plug the machines into GB ports soon and retest with much higher limits and see how that goes.
I played with the traffic shaping tool available in RHEL3, and for my contrived tests, it seems to be another solution to this. Since the rules need to be tweaked for a given setup, I'm not pasting the ones I used here, but a generic sketch of the idea follows. Have you tried the higher limits yet?
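A very rough sketch of the approach, not the rules I actually ran (the interface, rates, and the port to match are placeholders that would have to be tweaked per setup; this one caps bulk rsync-over-ssh traffic so lock and heartbeat traffic keeps some headroom):

# cap traffic destined for ssh (port 22) at 80mbit, leave the rest alone
tc qdisc add dev eth0 root handle 1: htb default 30
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 80mbit ceil 80mbit
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 20mbit ceil 100mbit
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 \
    match ip dport 22 0xffff flowid 1:10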
I tried the higher limits and the target node locked up, which was expected since the higher limits were more than a 100Mb nic could handle. I am now trying the latest version I could find (6.0.2.20), which I believe includes the alternate nic usage capability, but I am not finished yet. I also need to try setting the lt_high_mark settings higher, I think.
Tried the alternate nic yet? Does it work/help?
Sorry for the delay again. Last I tried, the second nic "seemed" to be working, although GFS was reporting the primary nic being used on some machines, even though that would be impossible since the nics are on separate VLANs (nic2 private). I am restarting the testing again, so I'll post more in the next day or two.
Any updates on your results? Are you still having issues?
Ok, I'm back at this, and here is where we sit. The hardware is this:
NODES: HP DL380 G3
FIBRE: QLogic 2340
SAN: HP HSG80 (I've also had HP MSA1000s hooked up)
We have a test cluster now with 3 nodes, connected to fibre storage (HSG80), running RHEL3u5 (kernel 2.4.21-32.ELsmp, GFS-6.0.2.20-1). All three nodes are lock managers and all three nodes access the file system. When I rsync data (several hundred gigs) to any of the cluster nodes, after a time (in the latest case, 379GB worth) the target node ran out of memory. This requires the machine to be rebooted, since nothing will fork. Once that is done, it all comes back and things are fine again. What we have observed is that the free locks go to zero and the system starts creating more locks to catch up. We have set lt_high_locks = 2097151 in the cluster.ccs file in order to try to avoid the lock flushing scenario, and are running the test again (a rough sketch of where that setting sits follows at the end of this comment).

We have not used the alternate nic method in this round of tests, as the last time I tried it, gulm_tool nodestats was reporting conflicting configuration information. That is to say, even though GFS was configured to use eth1 as its interface, eth1 was on a private LAN, and GFS started and allowed access to files etc., gulm_tool would report that one of the nodes was actually using eth0, which, given the way things were configured, would have been impossible.

Also: the interaction between nodes with respect to locks is not absolutely clear to me. What I believe is this:
1. Node A wants a lock and asks its lock_gulmd.
2. Node A's lock_gulmd then asks the lock master whether the lock is held.
3. The lock master asks the other nodes via gulm whether anyone holds the lock.
4. If any of the nodes respond yes, then the request from Node A is denied by sending a response to Node A's lock_gulmd.
5. Node A's lock_gulmd passes the yes/no down the chain.
Is this correct? Can you explain this if not? The effort to get this fixed is a high priority again around here, as people are getting a little tired of rebooting nodes. We will be working this problem until it is fixed. Your help is appreciated, thanks
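A rough sketch of where the lt_high_locks setting sits in cluster.ccs (the cluster and node names here are placeholders, not our real config, and I'm assuming the tunable goes inside the lock_gulm section along with the servers list):

cluster {
    name = "testcluster"
    lock_gulm {
        servers = ["nodea", "nodeb", "nodec"]
        lt_high_locks = 2097151
    }
}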
I believe this problem is caused by locks not being released by GFS. Can you provide the output of the following commands?
On the gulm master server:
gulm_tool getstats localhost:lt000
On the clients:
cat /proc/gulm/lockspace
Can you also run top on the gulm servers and clients and watch the memory utilization of lock_gulm? (Type 'M' to sort by memory.)
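If it's easier to capture over time than to watch top interactively, something like this run on each node would also show the trend (the sampling interval is arbitrary):

# sample lock_gulmd memory usage (RSS/VSZ in KB) once a minute
while true; do
    date
    ps -C lock_gulmd -o pid,rss,vsz,args
    sleep 60
done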
Will do. We did another test in order to observe the high water behavior. We set the limit to 10k (the lower limit), then ran the xfer. Once the locks hit the 10k mark, we witnessed the periodic flush, but the locks still crept up to 30k+. We then stopped the xfer expecting the locks to decrease, but they did not. Restarting the xfer simply increased the locks again at a steady rate. Thanks again
Ok, here are the files. This is while data is being sent to nodea.

gulm_tool getstats nodeb:lt000
I_am = Master
run time = 6777
pid = 3123
verbosity = Default
id = 0
partitions = 1
out_queue = 0
drpb_queue = 0
locks = 101602
unlocked = 6388
exculsive = 40
shared = 95174
deferred = 0
lvbs = 6391
expired = 0
lock ops = 2920678
conflicts = 8
incomming_queue = 0
conflict_queue = 0
reply_queue = 0
free_locks = 72630
free_lkrqs = 60
used_lkrqs = 0
free_holders = 72630
used_holders = 103459
highwater = 10000

on nodea: cat /proc/gulm/lockspace
lock counts:
total: 279839
unl: 141919
exl: 9
shd: 137911
dfr: 0
pending: 0
lvbs: 4024
lops: 418502

on nodec:
lock counts:
total: 45
unl: 20
exl: 3
shd: 21
dfr: 0
pending: 0
lvbs: 0
lops: 11689

I had to type this in, but it should be accurate...
More info... The above output was taken when the xfer was started again, so it might not give you what you want. We observed the memory for lock_gulmd creeping up steadily at about 400k/min once the available locks went to zero, and the locks rose at about 15 per second during that time.
More info. We ran the xfer over the weekend and were able to make it fail again. The machine at some point runs out of memory and cannot fork anymore. This seems to allow the cluster to stay "alive" but in a bad state, since lock_gulmd on the affected machine can still heartbeat etc. The other nodes don't see a "problem" since they are getting HB info, but the affected machine is not able to answer any requests for locking status etc. Therefore that particular filesystem is in a hung state. So it seems there is a leak somewhere. The locks reached 300K+ as reported by gulm_tool getstats. We had the highwater mark set to 10000 to force lock cleanup, and that is noted in the logs. The system did free locks when it was supposed to, or at least tried; there are fluctuations which indicate it was doing something. Either it's not doing enough, or the leak (if it exists) is elsewhere.
The high water mark only helps free locks that are cached in gulm, not locks held by GFS caching inodes, etc. We are working on a patch which will force GFS to release inodes in its cache which will in turn allow gulm to release those locks. Do you know approx. how much memory gulm is using before your cluster dies?
The nodes have 3GB of RAM, and gulm is using approx 85-90M when things go down the tubes.
Are you still using GFS 6.0.2-24? Also, which kernel are you using?
I received a patched kernel and version of GFS to test. The short description from #17 above is the current config. The hardware is this:
NODES: HP DL380 G3
FIBRE: QLogic 2340
SAN: HP HSG80 (I've also had HP MSA1000s hooked up)
We have a test cluster now with 3 nodes, connected to fibre storage (HSG80), running RHEL3u5 (kernel 2.4.21-32.ELsmp, GFS-6.0.2.20-1). All three nodes are lock managers and all three nodes access the file system. I've not had a chance to test the new patches (pre update 7 stuff) because the qlogic drivers expect the san to be "SPIFFI" compliant, which an HSG80 is not. I need to get an older copy of the drivers, then I'll retest the system with the tuning suggested.
The patched kernel and version of GFS I received seem to be doing fine with inoded_purge set to 30 as recommended. I have 800GB+ transferred successfully so far with no signs of failure. I have been transferring it in chunks of approx 300GB. Once I get all 1.1TB transferred, I'll go for the whole thing (without a reboot) and see how it goes. Is there a way to get the same effect (purging locks) on an RHEL3u4 system? If I can do it without upgrading, that would be ideal. A command I can run manually if I have to?
Unfortunately, there isn't a command to purge the locks on RHEL3U4; a kernel change was necessary, which is why we haven't been able to get it in until the pre-RHEL3U7 code. Can you post the output of the following commands, just so I know exactly what you have?
rpm -q GFS
rpm -q GFS-modules
rpm -q kernel
uname -a
Thanks!
I got the GFS and Kernel rpms from http://people.redhat.com/wcheng/Patches/GFS/
Sorry, forgot to add that I am using the i686 smp rpms...
Interesting to note: I reformatted my GFS partition and remounted it. In doing so I forgot to issue a 'gfs_tool settune /data inoded_purge 30' against the mount point, and started the xfer all over again, this time intending to xfer the whole 1.2TB at one time. This morning I checked the xfer, and although it was still going, the locks were at 600K+, so I issued the above command and the locks went down to 300K+ but do not seem to be getting any lower. During the previous test the locks stayed below 12K. I expected the same after issuing the settune, since my lt_high_locks is set to 10000. I am going to let the xfer go to completion and hope that at that point the locks will trim down to something more sane, as it may be that lt_high_locks simply can't keep up. I expected the 30% purge every 16 secs to free up a lot more locks than it did, though.
Update on comment 31 above: I did not run the settune on all cluster members, only on a single node. Once I ran it on the other nodes, the locks fell below the 10000 mark as expected.
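In case it saves someone else the same mistake, here is roughly how I am applying it everywhere now (node names and the mount point are ours; adjust as needed). It also has to be reissued after a remount, as comment 31 shows:

# inoded_purge is per-node and per-mount, so set it on every node
for n in nodea nodeb nodec; do
    ssh $n "gfs_tool settune /data inoded_purge 30"
done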
So far the patch set used (pre update 7 stuff) is working very well. I've not had to kill any of the nodes and all my data is getting sync'd up fine. Any word on when update 7 will be official?
Great! Unfortunately I don't have an ETA for RHEL3U7, but it should happen in the next few months.
Marking this bug modified.
Will the fixes at least be in an errata kernel?
The fixes should be scheduled to be in the RHEL3U7 errata kernel, I don't believe they'll go in any sooner than that.
Cleanup. This bugzilla should be resolved with the current packages from RHN for Red Hat Global File System for Red Hat Enterprise Linux 3. GFS-6.0.2.36-9