466645 – reproduceable gfs (dlm) hanger with simple stresstest

Bug 466645 - reproduceable gfs (dlm) hanger with simple stresstest

Summary: reproduceable gfs (dlm) hanger with simple stresstest

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	GFS-kernel
Sub Component:
Version:	4
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Assignee:	Abhijith Das
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	471059 479430
TreeView+	depends on / blocked

Reported:	2008-10-12 00:33 UTC by Rene Zbinden
Modified:	2018-10-20 02:54 UTC (History)
CC List:	15 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-05-18 21:09:32 UTC
Embargoed:

Attachments	(Terms of Use)
glocks graph (1.96 KB, image/png) 2008-10-12 00:33 UTC, Rene Zbinden	no flags	Details
locks graph (1.83 KB, image/png) 2008-10-12 00:34 UTC, Rene Zbinden	no flags	Details
locks held graph (1.88 KB, image/png) 2008-10-12 00:35 UTC, Rene Zbinden	no flags	Details
procs in D state on node 3 (2.37 KB, image/png) 2008-10-12 00:36 UTC, Rene Zbinden	no flags	Details
glocks graph with statfs_fast = 0 option (1.72 KB, image/png) 2008-10-13 07:49 UTC, Rene Zbinden	no flags	Details
locks graph with statfs_fast = 0 option (1.43 KB, image/png) 2008-10-13 07:50 UTC, Rene Zbinden	no flags	Details
locks held graph with statfs_fast = 0 option (1.51 KB, image/png) 2008-10-13 07:51 UTC, Rene Zbinden	no flags	Details
procs in D state on node 3 graph with statfs_fast = 0 option (2.25 KB, image/png) 2008-10-13 07:52 UTC, Rene Zbinden	no flags	Details
ps output on node 1 with statfs_fast = 0 option (50.43 KB, text/plain) 2008-10-13 07:53 UTC, Rene Zbinden	no flags	Details
ps output on node 2 with statfs_fast = 0 option (52.40 KB, text/plain) 2008-10-13 07:55 UTC, Rene Zbinden	no flags	Details
ps output on node 3 with statfs_fast = 0 option (52.23 KB, text/plain) 2008-10-13 07:56 UTC, Rene Zbinden	no flags	Details
sysrq-t output node1 (946.56 KB, text/plain) 2008-10-14 07:09 UTC, Rene Zbinden	no flags	Details
sysrq-t output node2 (1.67 MB, text/plain) 2008-10-14 07:09 UTC, Rene Zbinden	no flags	Details
sysrq-t output node3 (2.18 MB, text/plain) 2008-10-14 07:10 UTC, Rene Zbinden	no flags	Details
node1: ps ax -o pid,stat,cmd,wchan (16.06 KB, text/plain) 2008-10-15 07:29 UTC, Rene Zbinden	no flags	Details
node2: ps ax -o pid,stat,cmd,wchan (15.61 KB, text/plain) 2008-10-15 07:30 UTC, Rene Zbinden	no flags	Details
node3: ps ax -o pid,stat,cmd,wchan (15.58 KB, text/plain) 2008-10-15 07:30 UTC, Rene Zbinden	no flags	Details
node1: ps ax -o pid,stat,cmd,wchan (6.10 KB, text/plain) 2008-10-16 09:27 UTC, Rene Zbinden	no flags	Details
node2: ps ax -o pid,stat,cmd,wchan (6.10 KB, text/plain) 2008-10-16 09:28 UTC, Rene Zbinden	no flags	Details
node3: ps ax -o pid,stat,cmd,wchan (5.83 KB, text/plain) 2008-10-16 09:29 UTC, Rene Zbinden	no flags	Details
sysrq-t and sysrq-w of node2 (1.67 MB, text/plain) 2008-10-16 09:30 UTC, Rene Zbinden	no flags	Details
ps and sysrq of node2 (729.12 KB, text/plain) 2008-10-16 16:18 UTC, Rene Zbinden	no flags	Details
RHEL5 patch - updated max stat-rate to 10 stats/sec (3.49 KB, patch) 2008-11-06 18:26 UTC, Abhijith Das	no flags	Details \| Diff
RHEL4 patch (3.48 KB, patch) 2008-11-06 20:04 UTC, Abhijith Das	no flags	Details \| Diff
Show Obsolete (9) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:1045	0	normal	SHIPPED_LIVE	GFS-kernel bug-fix update	2009-05-18 21:09:29 UTC

Description Rene Zbinden 2008-10-12 00:33:55 UTC

Created attachment 320129 [details]
glocks graph

Description of problem:
With a simple gfs stresstest I am able to put the gfs in a state that it hangs. Any operation on the fs will hang and the used process will get into D state
forever (i.e. it is not possible to abort an ls,cd,touch etc... with ctrl+c).

The reason for this test was, that we had some gfs hangers on our production systems after running serveral weeks under high load.
I needed a test to achieve the same state of gfs that we found on our production clusters.

Version-Release number of selected component (if applicable):
on the corresponding cluster suite for rhel4u6 and rhel4u7

How reproducible:
For the thest I have the following setup.
* 3 Node Cluster with ~30 gfs mounted
* On Node 1 I run postmark (http://wiki.linux-nfs.org/wiki/index.php/Testing_tools#PostMark) with the following paramaters:
 - random file size 5 - 10k
 - number of files 70000
 - number of transactions 600000
 - number of subdirectories 2
 - no buffering
* On Node 2
 - there runs a python script which tries to create on the same filesystem 20 directies with 100000 files in each. Each filename and file size is random
* On all Nodes I have a find cronjob: */10 * * * * root /usr/bin/find / -name core -type f -exec ls -al {} \; > /tmp/find.log 2>&1
* I have defined 10 dummy services distributed on the three nodes. Each start-stop-check script just runs /bin/true
* we have the following gfs paramesters defined on our systems:
 - glock_purge=50
 - statfs_fast=1
  
Actual results:
On every run I get the situation where gfs hangs. Maybe it is also a dlm problem, I am not sure. If the
filesystem hangs, I can sometimes cd into the fs and sometimes not. but any fs operation (ls, touch etc...) hangs in D state. If the cluster is in that state, clustat gives me:
Timed out waiting for a response from Resource Group Manager
and dlm_recvd is at very high (100%) CPU utilization.
The graphs:
On the glock.png graph you can see, that the # of glocks increases until a certain value, then it stagnates, here GFS hangs. The same for locks und locks held. When you watch the nummer of 
processes in D state on the third node you can see a peak when gfs hangs and then it increases steadily after.

Expected results:
GFS can get slow, but should never hang.

Additional info:
I have to state here, if you use statfs_fast=0 it takes much longer until gfs hangs (several days). But since this value was recommended to us (we had slow df issues, yes we know the parameter statfs_slots, but
didn't test it until now) and all our production systems run with this parameter, I used for the test this paramater too.

Our Hardware for the test:
Hardware =  HP  ProLiant DL585 G1
Release = Red Hat Enterprise Linux AS release 4 (Nahant Update 7)
OS = 2.6.9-78.0.1.ELsmp GNU/Linux
CPU(s) = 8
Memory = 32031 MB

Thanks for your help
Rene Zbinden

Comment 1 Rene Zbinden 2008-10-12 00:34:52 UTC

Created attachment 320130 [details]
locks graph

Comment 2 Rene Zbinden 2008-10-12 00:35:30 UTC

Created attachment 320131 [details]
locks held graph

Comment 3 Rene Zbinden 2008-10-12 00:36:49 UTC

Created attachment 320132 [details]
procs in D state on node 3

Comment 4 Rene Zbinden 2008-10-13 07:47:23 UTC

I have now repeated the test for a system with the option statfs_fast=0 and I got similar results (see new graphs). I also uploaded a process list of the hanging systems.

Comment 5 Rene Zbinden 2008-10-13 07:49:48 UTC

Created attachment 320173 [details]
glocks graph with statfs_fast = 0 option

Comment 6 Rene Zbinden 2008-10-13 07:50:31 UTC

Created attachment 320174 [details]
locks graph with statfs_fast = 0 option

Comment 7 Rene Zbinden 2008-10-13 07:51:00 UTC

Created attachment 320175 [details]
locks held graph with statfs_fast = 0 option

Comment 8 Rene Zbinden 2008-10-13 07:52:16 UTC

Created attachment 320176 [details]
procs in D state on node 3 graph with statfs_fast = 0 option

Comment 9 Rene Zbinden 2008-10-13 07:53:44 UTC

Created attachment 320177 [details]
ps output on node 1 with statfs_fast = 0 option

Comment 10 Rene Zbinden 2008-10-13 07:55:52 UTC

Created attachment 320178 [details]
ps output on node 2 with statfs_fast = 0 option

Comment 11 Rene Zbinden 2008-10-13 07:56:37 UTC

Created attachment 320179 [details]
ps output on node 3 with statfs_fast = 0 option

Comment 12 Mark Hlawatschek 2008-10-13 10:57:43 UTC

Hi Rene,

could you please provide the output of SysRq-T after you reproduced the problem ? Could you please do the SysRq-T 3 times in a row (please wait 5 secs after every iteration)

-Mark

Comment 13 Marc Grimme 2008-10-13 11:12:25 UTC

Hi Rene,
how many filesystems are concurrently mounted?
-Marc.

Comment 15 David Teigland 2008-10-13 15:45:37 UTC

What we need to do is isolate any legitimate gfs/dlm problems from the effects
of simply overwhelming them.  We also want to eliminate gfs/dlm settings that
could be having unknown effects on the system.

For the first test, please do the following (on *all* nodes):
- disable anything that runs df (e.g. hp software)
- disable statfs_fast
- disable glock_purge
- disable drop_locks; echo "0" >> /proc/cluster/lock_dlm/drop_count
  before mounting any gfs file systems
- change the way "find" is run so that one find command is not
  run until after the previous one has completed

Comment 16 Marc Grimme 2008-10-13 19:56:41 UTC

I started a test with clean settings as supposed. I'll keep you up2date.

Comment 17 Rene Zbinden 2008-10-14 07:09:13 UTC

Created attachment 320260 [details]
sysrq-t output node1

Comment 18 Rene Zbinden 2008-10-14 07:09:48 UTC

Created attachment 320261 [details]
sysrq-t output node2

Comment 19 Rene Zbinden 2008-10-14 07:10:22 UTC

Created attachment 320262 [details]
sysrq-t output node3

Comment 20 Rene Zbinden 2008-10-14 07:13:52 UTC

(In reply to comment #12)
> Hi Rene,
> 
> could you please provide the output of SysRq-T after you reproduced the problem
> ? Could you please do the SysRq-T 3 times in a row (please wait 5 secs after
> every iteration)
> 
> -Mark

Hi Mark

I uploaded the requested files.

Regards Rene

Comment 21 Rene Zbinden 2008-10-14 07:46:08 UTC

(In reply to comment #13)
> Hi Rene,
> how many filesystems are concurrently mounted?
> -Marc.

Hi Marc

There are 30 FS mounted.

Regards Rene

Comment 22 Rene Zbinden 2008-10-14 08:04:53 UTC

(In reply to comment #15)
> What we need to do is isolate any legitimate gfs/dlm problems from the effects
> of simply overwhelming them.  We also want to eliminate gfs/dlm settings that
> could be having unknown effects on the system.
> 
> For the first test, please do the following (on *all* nodes):
> - disable anything that runs df (e.g. hp software)
> - disable statfs_fast
> - disable glock_purge
> - disable drop_locks; echo "0" >> /proc/cluster/lock_dlm/drop_count
>   before mounting any gfs file systems
> - change the way "find" is run so that one find command is not
>   run until after the previous one has completed

I will run the test as requested. I keep u informed.

Regards Rene

Comment 23 David Teigland 2008-10-14 17:03:14 UTC

Is the test run on just one of the 30 mounted filesystems?
Are the other 29 empty?

Comment 24 Rene Zbinden 2008-10-14 20:47:42 UTC

(In reply to comment #23)
> Is the test run on just one of the 30 mounted filesystems?
> Are the other 29 empty?

Hi Dave

There are 30 FS mounted but the test is only run on just one filesystem. The other FS are not empty. I copied the kernel sources on each of it.

By the way: I could reproduce the hanger with your above suggestions in Comment 15.

Comment 25 David Teigland 2008-10-14 21:11:14 UTC

OK, let's continue to narrow things down one step at a time.
Please run another iteration of the same test, but this time
mount only the one fs being used for the test.

A couple questions bout this previous test that hung (we're trying
to copy what you're doing here), approximately how long was the test
running before it hung?  And when it hangs, do you observe a sudden
change in behavior, where the test is making perceptible progress,
and then suddenly stops making any progress?  Or, do things progressively
slow down, and eventually reach a point where you judge that it's hung?

If the systems are still in the original hung state (i.e. you haven't
done anything to change the state of the systems since they hung),
could you collect the following from each of the three nodes:
cman_tool nodes
cman_tool services
ps ax -o pid,stat,cmd,wchan
gfs_tool counters <mountpoint>
anything related to cman/dlm/gfs in /var/log/messages

Comment 26 Abhijith Das 2008-10-15 04:21:34 UTC

I'm running the following setup on a RHEL5 cluster with GFS1 and I haven't hit any hangs yet. The system is slow but still responsive. It's been running for about 10 hrs now.

* Only 1 gfs filesystem mounted on all cluster nodes
* statfs_fast, glock_purge, drop_locks are all 0
* Node 1 runs postmark on the gfs filesystem with the configuration mentioned in comment #0
Current configuration is:
The base number of files is 70000
Transactions: 600000
Files range between 4.88 kilobytes and 9.77 kilobytes in size
Working directory:
        /mnt/gfs (weight=1)
2 subdirectories will be used
Block sizes are: read=512 bytes, write=512 bytes
Biases are: read/append=5, create/delete=5
Not using Unix buffered file I/O
Random number generator seed is 42
Report format is verbose.

* Node 2 runs a program I wrote (similar to the python script mentioned in comment #0) that creates 20 directories with 100000 files in each with filesize upto 1M.

* Finally, all nodes also run the exact find command mentioned in comment #0 in a loop (instead of the cronjob). i.e for i in `seq 1 100`; do /usr/bin/find / -name core -type f -exec ls -al {} \; > /tmp/find.log 2>&1; done;

I've also started the same setup on a RHEL4 cluster just now. Hopefully we'll have a hang in the morning.

Comment 27 Rene Zbinden 2008-10-15 06:50:27 UTC

(In reply to comment #25)
> OK, let's continue to narrow things down one step at a time.
> Please run another iteration of the same test, but this time
> mount only the one fs being used for the test.
> 
> A couple questions bout this previous test that hung (we're trying
> to copy what you're doing here), approximately how long was the test
> running before it hung?  And when it hangs, do you observe a sudden
> change in behavior, where the test is making perceptible progress,
> and then suddenly stops making any progress?  Or, do things progressively
> slow down, and eventually reach a point where you judge that it's hung?
> 
> If the systems are still in the original hung state (i.e. you haven't
> done anything to change the state of the systems since they hung),
> could you collect the following from each of the three nodes:
> cman_tool nodes
> cman_tool services
> ps ax -o pid,stat,cmd,wchan
> gfs_tool counters <mountpoint>
> anything related to cman/dlm/gfs in /var/log/messages

It takes between 1h - 2h until it gfs "hangs". With the above test it took me 30 Minutes. And actually I am not sure if we have a deadlock or if gfs gets "indefinetly" slow. When GFS is in that state, a strace of an ls or cd ends with:
...
read(3, ",noatime,nodiratime 0 0\n/dev/vg_"..., 1024) = 189
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0x2a9556c000, 4096)              = 0
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=48516848, ...}) = 0
mmap(NULL, 48516848, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2a95df1000
close(3)                                = 0
open("/proc/filesystems", O_RDONLY)     = 3
read(3, "nodev\tsysfs\nnodev\trootfs\nnodev\tb"..., 4095) = 297
close(3)                                = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=64, ws_col=177, ws_xpixel=0, ws_ypixel=0}) = 0
stat("dir0",

and stays there in D state. As mentionned above I cannot abort the command with ctrl+c. It is the state which I experienced on some of our production systems. 

I have a lot of the following messages in /var/log/messages:
...
Oct 15 07:03:43 smlxt301 kernel: dlm: reply
Oct 15 07:03:43 smlxt301 kernel: dlm: Magma: reply from 3 no lock
Oct 15 07:03:43 smlxt301 kernel: dlm: reply
Oct 15 07:03:43 smlxt301 kernel: dlm: Magma: reply from 3 no lock
Oct 15 07:03:43 smlxt301 kernel: dlm: reply
Oct 15 07:44:45 smlxt301 kernel: dlm: Magma: reply from 3 no lock
Oct 15 07:44:45 smlxt301 kernel: dlm: reply
Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
Oct 15 07:44:47 smlxt301 kernel: dlm: reply
Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
Oct 15 07:44:47 smlxt301 kernel: dlm: reply
Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
Oct 15 07:44:47 smlxt301 kernel: dlm: reply
Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
...

# cman_tool nodes
Node  Votes Exp Sts  Name
   0    4    0   M   /dev/mapper/quorumdisk
   1    1    3   M   smlxt301
   2    1    3   M   smlxt302
   3    1    3   M   smlxt303

# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       -
[2 3 1]

DLM Lock Space:  "clvmd"                             5   3 run       -
[2 3 1]

DLM Lock Space:  "fs_1_1"                            7   4 run       -
[2 3 1]

DLM Lock Space:  "fs_2_1"                           11   6 run       -
[2 3 1]

DLM Lock Space:  "fs_3_1"                           15   8 run       -
[2 3 1]

DLM Lock Space:  "fs_4_1"                           19  10 run       -
[2 3 1]

DLM Lock Space:  "fs_5_1"                           23  12 run       -
[2 3 1]

DLM Lock Space:  "fs_6_1"                           27  14 run       -
[2 3 1]

DLM Lock Space:  "fs_7_1"                           31  16 run       -
[2 3 1]

DLM Lock Space:  "fs_8_1"                           35  18 run       -
[2 3 1]

DLM Lock Space:  "fs_9_1"                           39  20 run       -
[2 3 1]

DLM Lock Space:  "fs_10_1"                          43  22 run       -
[2 3 1]

DLM Lock Space:  "fs_11_1"                          47  24 run       -
[2 3 1]

DLM Lock Space:  "fs_12_1"                          51  26 run       -
[2 3 1]

DLM Lock Space:  "fs_13_1"                          55  28 run       -
[2 3 1]

DLM Lock Space:  "fs_14_1"                          59  30 run       -
[2 3 1]

DLM Lock Space:  "fs_15_1"                          63  32 run       -
[2 3 1]

DLM Lock Space:  "fs_16_1"                          67  34 run       -
[2 3 1]

DLM Lock Space:  "fs_17_1"                          71  36 run       -
[2 3 1]

DLM Lock Space:  "fs_19_1"                          76  38 run       -
[2 3 1]

DLM Lock Space:  "fs_20_1"                          82  40 run       -
[2 3 1]

DLM Lock Space:  "fs_22_1"                          88  42 run       -
[2 3 1]

DLM Lock Space:  "fs_23_1"                          94  44 run       -
[2 3 1]

DLM Lock Space:  "fs_24_1"                         100  46 run       -
[2 3 1]

DLM Lock Space:  "fs_25_1"                         106  48 run       -
[2 3 1]

DLM Lock Space:  "fs_26_1"                         112  50 run       -
[2 3 1]

DLM Lock Space:  "fs_27_1"                         118  52 run       -
[2 3 1]

DLM Lock Space:  "fs_28_1"                         124  54 run       -
[2 3 1]

DLM Lock Space:  "fs_29_1"                         130  56 run       -
[2 3 1]

DLM Lock Space:  "fs_30_1"                         136  58 run       -
[2 3 1]

DLM Lock Space:  "Magma"                           145  61 run       -
[3 1 2]

GFS Mount Group: "fs_1_1"                            9   5 run       -
[2 3 1]

GFS Mount Group: "fs_2_1"                           13   7 run       -
[2 3 1]

GFS Mount Group: "fs_3_1"                           17   9 run       -
[2 3 1]

GFS Mount Group: "fs_4_1"                           21  11 run       -
[2 3 1]

GFS Mount Group: "fs_5_1"                           25  13 run       -
[2 3 1]

GFS Mount Group: "fs_6_1"                           29  15 run       -
[2 3 1]

GFS Mount Group: "fs_7_1"                           33  17 run       -
[2 3 1]

GFS Mount Group: "fs_8_1"                           37  19 run       -
[2 3 1]

GFS Mount Group: "fs_9_1"                           41  21 run       -
[2 3 1]

GFS Mount Group: "fs_10_1"                          45  23 run       -
[2 3 1]

GFS Mount Group: "fs_11_1"                          49  25 run       -
[2 3 1]

GFS Mount Group: "fs_12_1"                          53  27 run       -
[2 3 1]

GFS Mount Group: "fs_13_1"                          57  29 run       -
[2 3 1]

GFS Mount Group: "fs_14_1"                          61  31 run       -
[2 3 1]

GFS Mount Group: "fs_15_1"                          65  33 run       -
[2 3 1]

GFS Mount Group: "fs_16_1"                          69  35 run       -
[2 3 1]

GFS Mount Group: "fs_17_1"                          73  37 run       -
[2 3 1]

GFS Mount Group: "fs_19_1"                          79  39 run       -
[2 3 1]

GFS Mount Group: "fs_20_1"                          85  41 run       -
[2 3 1]

GFS Mount Group: "fs_22_1"                          91  43 run       -
[2 3 1]

GFS Mount Group: "fs_23_1"                          97  45 run       -
[2 3 1]

GFS Mount Group: "fs_24_1"                         103  47 run       -
[2 3 1]

GFS Mount Group: "fs_25_1"                         109  49 run       -
[2 3 1]

GFS Mount Group: "fs_26_1"                         115  51 run       -
[2 3 1]

GFS Mount Group: "fs_27_1"                         121  53 run       -
[2 3 1]

GFS Mount Group: "fs_28_1"                         127  55 run       -
[2 3 1]

GFS Mount Group: "fs_29_1"                         133  57 run       -
[2 3 1]

GFS Mount Group: "fs_30_1"                         139  59 run       -
[2 3 1]

User:            "usrm::manager"                   142  60 run       -
[2 3 1]

# gfs_tool counters /fs/1/1

                                  locks 141445
                             locks held 70749
                           freeze count 0
                          incore inodes 70638
                       metadata buffers 47
                        unlinked inodes 1
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.05%
              meta header cache entries 7855
                     glock dependencies 19
                 glocks on reclaim list 0
                              log wraps 35
                   outstanding LM calls 9
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 10226468
                         glock nq calls 14587201
                         glock dq calls 14515409
                   glock prefetch calls 4989623
                          lm_lock calls 5511896
                        lm_unlock calls 5184959
                           lm callbacks 10702559
                     address operations 3870033
                      dentry operations 753854
                      export operations 0
                        file operations 5147274
                       inode operations 2688443
                       super operations 13375048
                          vm operations 0
                        block I/O reads 2720819
                       block I/O writes 2060819

Comment 28 Rene Zbinden 2008-10-15 07:29:45 UTC

Created attachment 320398 [details]
node1: ps ax -o pid,stat,cmd,wchan

Comment 29 Rene Zbinden 2008-10-15 07:30:10 UTC

Created attachment 320399 [details]
node2: ps ax -o pid,stat,cmd,wchan

Comment 30 Rene Zbinden 2008-10-15 07:30:47 UTC

Created attachment 320400 [details]
node3: ps ax -o pid,stat,cmd,wchan

Comment 31 Mark Hlawatschek 2008-10-15 07:38:37 UTC

Rene, could you please provide the gfs_tool counters from all nodes ?

Comment 32 Rene Zbinden 2008-10-15 07:57:47 UTC

Yes:
node1:

                                  locks 141445
                             locks held 70749
                           freeze count 0
                          incore inodes 70638
                       metadata buffers 47
                        unlinked inodes 1
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.05%
              meta header cache entries 7855
                     glock dependencies 19
                 glocks on reclaim list 0
                              log wraps 35
                   outstanding LM calls 9
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 10226470
                         glock nq calls 14587492
                         glock dq calls 14515699
                   glock prefetch calls 4989627
                          lm_lock calls 5511900
                        lm_unlock calls 5184961
                           lm callbacks 10702565
                     address operations 3870033
                      dentry operations 753855
                      export operations 0
                        file operations 5148102
                       inode operations 2688661
                       super operations 13375126
                          vm operations 0
                        block I/O reads 2720821
                       block I/O writes 2060819

node2:

                                  locks 922873
                             locks held 455963
                           freeze count 0
                          incore inodes 455738
                       metadata buffers 1188
                        unlinked inodes 1
                              quota IDs 2
                     incore log buffers 0
                         log space used 0.05%
              meta header cache entries 107
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 41
                   outstanding LM calls 110407
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 11441806
                         glock nq calls 15121627
                         glock dq calls 14665567
                   glock prefetch calls 8693843
                          lm_lock calls 10001385
                        lm_unlock calls 9193992
                           lm callbacks 19094890
                     address operations 0
                      dentry operations 1907787
                      export operations 0
                        file operations 2015619
                       inode operations 7537515
                       super operations 17571647
                          vm operations 0
                        block I/O reads 7712420
                       block I/O writes 2407714

node3:

                                  locks 385685
                             locks held 104950
                           freeze count 0
                          incore inodes 100012
                       metadata buffers 0
                        unlinked inodes 0
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.10%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 4
                   outstanding LM calls 218740
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 10267822
                         glock nq calls 1842881
                         glock dq calls 1742865
                   glock prefetch calls 5778466
                          lm_lock calls 5898557
                        lm_unlock calls 5355866
                           lm callbacks 11041587
                     address operations 0
                      dentry operations 382740
                      export operations 0
                        file operations 205793
                       inode operations 1052899
                       super operations 6052575
                          vm operations 0
                        block I/O reads 2990554
                       block I/O writes 130725

Comment 33 Mark Hlawatschek 2008-10-15 08:33:27 UTC

In my opinion, the "outstanding LM calls" shows the problem. I believe, that the ls command tries to acquire a lock from the lockmanager which is not processed. 

It also looks like the order, the queued lock requests are handled is not fair. Otherwise, the lock for the ls command should be handled at some time.

In our tests the ls command "hung" for some hours. It returned after I killed all find processes.

@Rene: what happens if you kill the find processes ? (Watch the "outstanding LM calls")

Comment 34 Marc Grimme 2008-10-15 11:02:46 UTC

BTW: we can reproduce the same problem on our two node cluster with only one gfs filesystem mounted.

Comment 35 David Teigland 2008-10-15 14:55:29 UTC

- The "outstanding LM calls" count is interesting, will investigate what
that might mean.

- What is the "bpbkar" process?  It's blocked on gfs on nodes 2 and 3.

- "dlm: Magma: reply from 3 no lock" is not good.  It can be fixed easily
by echo "300" >> /proc/cluster/config/dlm/lock_timeout
also see http://sources.redhat.com/cluster/wiki/FAQ/LockManager#dlm_timeout

- When you next run the test, and the filesystems hang, try to kill some of
the processes (find, python, postmark) on the nodes and wait to see if the
filesystems become responsive again.

Comment 36 Mark Hlawatschek 2008-10-15 15:02:12 UTC

During the tests (RHEL4u7) I observed the problem, that in my opinion is
related to bz# 247766.

I tried to trigger a sysrq-t. Unfortunately, one node paniced during the
operation. 

The filesystem was very unresponsive and rgmanager got some lock timeouts:
<err> #48: Unable to obtain cluster lock: Connection 
timed out

The kernel reports the following:
Oct 15 12:06:00 lilr205a dlm: Magma: reply from 2 no lock
Oct 15 12:06:00 lilr205a dlm: reply
Oct 15 12:06:00 lilr205a rh_cmd 5
Oct 15 12:06:00 lilr205a rh_lkid ac03e9
Oct 15 12:06:00 lilr205a lockstate 0
Oct 15 12:06:00 lilr205a nodeid 1
Oct 15 12:06:00 lilr205a status 0
Oct 15 12:06:00 lilr205a lkid 0
...

And ps ax -o stat,wchan,command,pid shows:

D<   18446744071566487680 [dlm_recvd]                               6525

Comment 37 David Teigland 2008-10-15 15:20:22 UTC

Abhi, one thing I think we missed in our test setup were the
10 dummy rgmanager services.

Rene, in the next iteration of the test:
- mount only the 1 fs being used for the test (already mentioned this)
- eliminate the 10 dummy rgmanager services (don't even run rgmanager)
- remove the unknown bpbkar process

Comment 38 Rene Zbinden 2008-10-15 20:59:44 UTC

(In reply to comment #37)
> Abhi, one thing I think we missed in our test setup were the
> 10 dummy rgmanager services.
> 
> Rene, in the next iteration of the test:
> - mount only the 1 fs being used for the test (already mentioned this)
> - eliminate the 10 dummy rgmanager services (don't even run rgmanager)
> - remove the unknown bpbkar process
I did the run with one fs, no rgmanager and no bpbkar process (backup). Same result whithin 1h. Here the counters:
node1:
                                  locks 144779
                             locks held 74146
                           freeze count 0
                          incore inodes 70664
                       metadata buffers 281
                        unlinked inodes 55
                              quota IDs 2
                     incore log buffers 0
                         log space used 28.66%
              meta header cache entries 9999
                     glock dependencies 1582
                 glocks on reclaim list 0
                              log wraps 15
                   outstanding LM calls 7856
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 1081533
                         glock nq calls 8317643
                         glock dq calls 8243424
                   glock prefetch calls 553179
                          lm_lock calls 824675
                        lm_unlock calls 470422
                           lm callbacks 1293488
                     address operations 2382672
                      dentry operations 357131
                      export operations 0
                        file operations 2837593
                       inode operations 1493792
                       super operations 5399593
                          vm operations 0
                        block I/O reads 91502
                       block I/O writes 1066519

node2:
no output, gfs_tool counters is not responding....

node3:
                                locks 1436
                             locks held 15
                           freeze count 0
                          incore inodes 8
                       metadata buffers 0
                        unlinked inodes 0
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.00%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 0
                   outstanding LM calls 1416
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 2010718
                         glock nq calls 7058
                         glock dq calls 7047
                   glock prefetch calls 1288530
                          lm_lock calls 1291482
                        lm_unlock calls 727693
                           lm callbacks 2022042
                     address operations 0
                      dentry operations 58
                      export operations 0
                        file operations 6661
                       inode operations 317
                       super operations 1288676
                          vm operations 0
                        block I/O reads 105691
                       block I/O writes 4

I tried to kill several times (with the 9 Signal) the find, postmark and python process. No success. Maybe it takes some time. I'll wait until tomorrow.

Regards Rene.

Comment 39 David Teigland 2008-10-15 21:42:50 UTC

It will be interesting to leave things in this state for some time.
We'd like to look for any change at all between now and tomorrow when
you check it:
. did any of the killed processes eventually exit?
. have any of the processes (find, postmark, python) made any further
  progress at all?
. run gfs_tool counters again looking for any changes
. run ps ax -o pid,stat,cmd,wchan

In our similar tests here, we see very very slow progress/response of the fs,
some things taking up to 5 minutes to respond.  We're looking for concrete evidence that your tests are not hung, but just very slow.  The steps we take to further diagnose the problems will depend significantly upon whether dlm/gfs are truely hung or whether they are still operating but making no noticable progress.

For the next test iteration, we're going to have you remove the "find" processes from all nodes.

Comment 40 David Teigland 2008-10-15 22:13:25 UTC

Update to comment 39:

Things to check and collect after the current hung state has been left for some time:

. did any of the killed processes eventually exit?
. have any of the processes (find, postmark, python) made any further
  progress at all?
. cman_tool nodes and services
. ps ax -o pid,stat,cmd,wchan
. anything related to cman/dlm/gfs in /var/log/messages
. gfs_tool counters again looking for any changes
. if gfs_tool counters has not responded on any node (like node 2 in comment 38),
  run sysrq-t and sysrq-w on those nodes

Comment 41 Rene Zbinden 2008-10-16 09:12:41 UTC

(In reply to comment #40)
> Update to comment 39:
> 
> Things to check and collect after the current hung state has been left for some
> time:
> 
> . did any of the killed processes eventually exit?
> . have any of the processes (find, postmark, python) made any further
>   progress at all?
> . cman_tool nodes and services
> . ps ax -o pid,stat,cmd,wchan
> . anything related to cman/dlm/gfs in /var/log/messages
> . gfs_tool counters again looking for any changes
> . if gfs_tool counters has not responded on any node (like node 2 in comment
> 38),
>   run sysrq-t and sysrq-w on those nodes
 * The Processes never quit. I tried several times.

 * cman_tool nodes:
 Node  Votes Exp Sts  Name
   0    4    0   M   /dev/mapper/quorumdisk
   1    1    3   M   smlxt301
   2    1    3   M   smlxt302
   3    1    3   M   smlxt303

 * cman_tool services
  Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 2 3]

DLM Lock Space:  "fs_1_1"                            8   8 run       -
[1 2 3]

GFS Mount Group: "fs_1_1"                            9   9 run       -
[1 2 3]

 * This time nothing special in /var/log/messages

 * gfs_tool counters:
 node 1:
 
                                  locks 144779
                             locks held 74146
                           freeze count 0
                          incore inodes 70664
                       metadata buffers 281
                        unlinked inodes 55
                              quota IDs 2
                     incore log buffers 0
                         log space used 28.66%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 15
                   outstanding LM calls 7857
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 1081596
                         glock nq calls 8320680
                         glock dq calls 8246459
                   glock prefetch calls 553179
                          lm_lock calls 824746
                        lm_unlock calls 470485
                           lm callbacks 1293628
                     address operations 2382672
                      dentry operations 357131
                      export operations 0
                        file operations 2846493
                       inode operations 1496017
                       super operations 5399836
                          vm operations 0
                        block I/O reads 91502
                       block I/O writes 1066519

node 2:
no output, gfs_tool counters is not responding....

node 3:
                                 locks 1596
                             locks held 15
                           freeze count 0
                          incore inodes 8
                       metadata buffers 0
                        unlinked inodes 0
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.00%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 0
                   outstanding LM calls 1417
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 2010782
                         glock nq calls 7865
                         glock dq calls 7852
                   glock prefetch calls 1288530
                          lm_lock calls 1291547
                        lm_unlock calls 727757
                           lm callbacks 2022170
                     address operations 0
                      dentry operations 58
                      export operations 0
                        file operations 6665
                       inode operations 318
                       super operations 1288918
                          vm operations 0
                        block I/O reads 105696
                       block I/O writes 4

Comment 42 Rene Zbinden 2008-10-16 09:27:57 UTC

Created attachment 320531 [details]
node1: ps ax -o pid,stat,cmd,wchan

Comment 43 Rene Zbinden 2008-10-16 09:28:31 UTC

Created attachment 320532 [details]
node2: ps ax -o pid,stat,cmd,wchan

Comment 44 Rene Zbinden 2008-10-16 09:29:03 UTC

Created attachment 320533 [details]
node3: ps ax -o pid,stat,cmd,wchan

Comment 45 Rene Zbinden 2008-10-16 09:30:30 UTC

Created attachment 320534 [details]
sysrq-t and sysrq-w of node2

Comment 46 David Teigland 2008-10-16 15:28:47 UTC

The node2 sysrq doesn't appear to match the node2 ps.

Comment 47 David Teigland 2008-10-16 16:08:45 UTC

As mentioned in comment 39, for the next test iteration, please remove
the find processes from all nodes.

Comment 48 Rene Zbinden 2008-10-16 16:18:22 UTC

Created attachment 320571 [details]
ps and sysrq of node2

Comment 49 Rene Zbinden 2008-10-16 16:44:50 UTC

(In reply to comment #47)
> As mentioned in comment 39, for the next test iteration, please remove
> the find processes from all nodes.
Test runs without find... I'll keep u updated.

Comment 50 Rene Zbinden 2008-10-17 10:57:47 UTC

Test runs for more than 12h without getting into a hanger.

Comment 51 Rene Zbinden 2008-10-20 06:20:32 UTC

Test runs for four days. No hanging. Can we conclude that the find is the problem and that we should not use find in combination with gfs?

Comment 52 Marc Grimme 2008-10-20 07:18:59 UTC

We are running with nonconcurrent finds since 5 days without problems. 

Should we rise the count of find processes again?

Comment 53 Rene Zbinden 2008-10-20 11:02:42 UTC

(In reply to comment #52)
> We are running with nonconcurrent finds since 5 days without problems. 
> 
> Should we rise the count of find processes again?
Without finds we experience no hangers.

We had RH Engineers at our site, and they told us gfs (1) is not designed for concurrent finds. We should wait for gfs2 and until then avoid finds on the system.

Comment 54 David Teigland 2008-10-20 16:46:01 UTC

Rene, it sounds like removing the finds solves the problems in your test cluster;
I don't know how directly this translates to your production systems, where
the processes and load will be different.

Marc, it sounds like finds do not create the same problems on your cluster.
It's the combination of "find" with the other load on the filesystem that
creates problems.

In summary, the problems on Rene's cluster do not appear to be "bugs" per se
that can be fixed in gfs/dlm.  The problem was applications (find+postmark+manycreates) stressing gfs/dlm in ways that it cannot handle.
The result of this stress is the system becomes so loaded that nothing makes
any forward progress.

Comment 55 Marc Grimme 2008-10-20 17:32:28 UTC

Dave, "my" cluster also does not hang or anything the like when only one find runs (that's the good news). I don't know if finds only really cause the problem but any process that somehow loads the filesystem with "read" (not sure about that) I/Os and more then one or two processes must be running of that type at the same time.

Two things on that topic:
1. On a normal cluster you can never be sure that you can avoid such a behaviour can you?
2. I would agree to not having a bug (whereas this is really hard to tell) but what we saw is concurrent processes accessing the fs that were leading to a "deadlock" where calls would only resume when another dependent (find-)process was killed. Independent on how much time you're waiting. I still think this is an issue/bug we are seeing from time to time that should be analysed. What do you think?

Comment 56 David Teigland 2008-10-20 19:59:04 UTC

Marc, you are correct, we can't be sure with problems like this.  The best
we can do is identify a particular usage/load where things run ok a large
percentage of the time.  In Rene's cluster, find+postmark+createfiles would
only run for a few hours, but postmark+createfiles ran for 5 days. (To be
more confident in these results, we could add back some of the other things
that we eliminated in the process of narrowing down the problem, to verify
that the final result doesn't change.)

We would need to run a similar sequence of tests in your case to identify
the "culprit", i.e. identify a specific set of processes, each with a specific load, which create a regular failure; then start eliminating processes or
changing configuration settings one at a time until the combination no longer creates a failure.

Comment 57 Mark Hlawatschek 2008-10-20 21:05:50 UTC

Please notice, that according to comment #38 and comment #41 Rene was not able to kill the find processes within one day. 
During my tests, a "hang" (ls hung for several hours) that was caused by the gfs/dlm stressing could be resolved by simply killing all find processes. 

In my opinion, a "hang" that cannot be resolved is very likely to be caused by a bug. I.e. a "hang" that is caused by overstressing the gfs/dlm should be resolvable by killing all processes causing the overstressing.

Can you confirm this ?

Comment 58 David Teigland 2008-10-20 22:13:18 UTC

I believe that whether or not stalled processes can be killed depends
on how severely the system is overloaded.  It probably also depends on
the specifics of what the different processes are doing.  So, I would
expect on overstressed cluster could sometimes have killable processes,
and sometimes non-killable processes.

Comment 59 Abhijith Das 2008-10-24 03:33:23 UTC

I ran some numbers on the smoke cluster and here are the results:
I have one gfs1 and one gfs2 filesystem (both size 100 G) filled to capacity with files upto 1 MB in size.

gfs1: 3 directories, 202126 files in all
gfs2: 3 directories, 202336 files in all

The filesystems are mounted on all the nodes and I ran 'strace -c find' on all the nodes. Once on the gfs1 filesystem and once on the gfs2 filesystem.

Something like:
strace -c /usr/bin/find /mnt/gfs -name core -type f -exec ls -al {} \; > /tmp/find.log 2> /tmp/find-gfs-stats

Below are the stats for the find processes on each node. The first set of numbers is for gfs1. second set for gfs2.

camel:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   30.116114       15134      1990           getdents64
  0.00    0.000304          34         9           munmap
  0.00    0.000157         157         1           execve
  0.00    0.000018           2        12           brk
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           open
  0.00    0.000000           0        23           close
  0.00    0.000000           0         1           time
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0        14           mremap
  0.00    0.000000           0        23           mmap2
  0.00    0.000000           0         4           stat64
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0        13           fstat64
  0.00    0.000000           0         4           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00   30.116593                  2150         1 total

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.66    0.108964          55      1992           getdents64
  0.17    0.000182           8        23           close
  0.16    0.000174          12        14           mremap
  0.02    0.000021           1        23           mmap2
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           open
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0        12           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         9           munmap
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0         4           stat64
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0        13           fstat64
  0.00    0.000000           0         4           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.109341                  2152         1 total

merit:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   31.580428       15870      1990           getdents64
  0.00    0.000034           4         8           read
  0.00    0.000022           2        12           brk
  0.00    0.000021           1        23           mmap2
  0.00    0.000000           0        23           open
  0.00    0.000000           0        23           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         9           munmap
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0        14           mremap
  0.00    0.000000           0         4           stat64
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0        13           fstat64
  0.00    0.000000           0         4           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00   31.580505                  2150         1 total

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.70    0.195846          98      1992           getdents64
  0.24    0.000464          52         9           munmap
  0.05    0.000107           8        14           mremap
  0.01    0.000023           1        23           open
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0        12           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0        23           mmap2
  0.00    0.000000           0         4           stat64
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0        13           fstat64
  0.00    0.000000           0         4           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.196440                  2152         1 total

winston:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   31.435381       15797      1990           getdents64
  0.00    0.000417          46         9           munmap
  0.00    0.000113           8        14           mremap
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           open
  0.00    0.000000           0        23           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0        12           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0        23           mmap2
  0.00    0.000000           0         4           stat64
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0        13           fstat64
  0.00    0.000000           0         4           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00   31.435911                  2150         1 total

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.86    0.233801         117      1992           getdents64
  0.12    0.000289          32         9           munmap
  0.02    0.000038           2        23           mmap2
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           open
  0.00    0.000000           0        23           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0        12           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0        14           mremap
  0.00    0.000000           0         4           stat64
  0.00    0.000000           0         5           lstat64
  0.00    0.000000           0        13           fstat64
  0.00    0.000000           0         4           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.234128                  2152         1 total

salem:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   45.390793       22809      1990           getdents
  0.00    0.000999         200         5           lstat
  0.00    0.000999          56        18           brk
  0.00    0.000016           1        23           mmap
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           open
  0.00    0.000000           0        23           close
  0.00    0.000000           0         4           stat
  0.00    0.000000           0        13           fstat
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         6           munmap
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0         7           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           uname
  0.00    0.000000           0         4           fcntl
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           time
------ ----------- ----------- --------- --------- ----------------
100.00   45.392807                  2151         1 total

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.63    0.272699         137      1992           getdents
  0.37    0.000999          56        18           brk
  0.00    0.000000           0         8           read
  0.00    0.000000           0        23           open
  0.00    0.000000           0        23           close
  0.00    0.000000           0         4           stat
  0.00    0.000000           0        13           fstat
  0.00    0.000000           0         5           lstat
  0.00    0.000000           0        23           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         6           munmap
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0         7           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           uname
  0.00    0.000000           0         4           fcntl
  0.00    0.000000           0        12           fchdir
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           time
------ ----------- ----------- --------- --------- ----------------
100.00    0.273698                  2153         1 total

Comment 60 Steve Whitehouse 2008-10-24 09:31:27 UTC

Now looking at those figures reminds me of something... there is a difference between GFS1 and GFS2 in that GFS2 doesn't do the readahead of inodes that GFS1 does. I made that change right back in the early days of GFS2 because ls seemed to  be very slow.

I wouldn't be at all surprised if thats what you've come across here.

Comment 69 Abhijith Das 2008-11-06 18:26:29 UTC

Created attachment 322767 [details]
RHEL5 patch - updated max stat-rate to 10 stats/sec

Comment 70 Abhijith Das 2008-11-06 20:04:22 UTC

Created attachment 322779 [details]
RHEL4 patch

Comment 71 Abhijith Das 2008-11-06 21:01:43 UTC

Checked in the above patch to RHEL4, RHEL5, STABLE2 and master

Comment 72 Rene Zbinden 2008-11-10 17:33:51 UTC

Do those patches solve the issues we had with the stress test? And if so, can we count on having this patches included in rh4u8 release?

Comment 73 Abhijith Das 2008-11-10 17:49:11 UTC

(In reply to comment #72)
> Do those patches solve the issues we had with the stress test? And if so, can
> we count on having this patches included in rh4u8 release?

Rene,
When you got rid of the find processes in your stress-test, the machines recovered from the hang. We tried this in our test-cluster as well and found that finds (without stat) were very slow because gfs was trying to prefetch locks in the hope that stats would be performed on the files returned by find. This assumption was incorrect. The patch above dramatically improves performance of finds without stats by regulating prefetching.

The patch has been checked into the RHEL4 git branch and should make the 4.8 release.

Comment 74 Mark Hlawatschek 2008-11-14 08:24:32 UTC

Hi Abhijith,

It would be nice to have a tunable for the prefetch behaviour. E.g. a max_stat_rate tunable parameter. 
If the max_stat_rate parameter is "optimized" for find processes in the code, this might lead to performance issues in other load scenarios. 

-Mark

Comment 75 Rene Zbinden 2008-12-10 07:45:32 UTC

Hi

I repeated the test with this proposed patch. This is really an improvement!!!! The test runs 24h now and no hanging. Without this patch, GFS hung after 30-60 minutes and I could only recover the system with a reboot.

I keep the test running and keep you updated.

Cheers Rene

Comment 77 Rene Zbinden 2008-12-29 09:27:18 UTC

Hi

The test runs now for 10 days without a hanger. Stopping now.

Regards Rene.

Comment 81 Chris Feist 2009-01-22 22:43:24 UTC

Changed component to GFS-kernel

Comment 82 Chris Feist 2009-01-23 00:13:47 UTC

Removing 4.7.z flag as this bug has already been cloned to 4.7.z.

Comment 88 errata-xmlrpc 2009-05-18 21:09:32 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1045.html

Note You need to log in before you can comment on or make changes to this bug.