Bug 466645
Description
Rene Zbinden
2008-10-12 00:33:55 UTC
Created attachment 320130 [details]
locks graph
Created attachment 320131 [details]
locks held graph
Created attachment 320132 [details]
procs in D state on node 3
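For context on how graphs like the ones above can be produced: the lock counts come from gfs_tool counters and the D-state counts from ps. A minimal sampler sketch (the mount point, interval, and output path are assumptions, not taken from this report):

  #!/bin/sh
  # Sample GFS lock counts and the number of D-state processes once a
  # minute into a CSV for later graphing.
  MOUNTPOINT=/fs/1/1    # assumption: the filesystem under test
  while sleep 60; do
      counters=$(gfs_tool counters "$MOUNTPOINT")
      locks=$(echo "$counters" | awk '$1 == "locks" && $2 != "held" {print $2; exit}')
      held=$(echo "$counters" | awk '$1 == "locks" && $2 == "held" {print $3; exit}')
      dstate=$(ps ax -o stat= | grep -c '^D')
      echo "$(date +%s),$locks,$held,$dstate"
  done >> /tmp/gfs-lock-samples.csv

Note that gfs_tool counters can itself block once the filesystem hangs (as happens later in this report), so the sampler may stall exactly when things get interesting.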
I have now repeated the test for a system with the option statfs_fast=0 and I got similar results (see new graphs). I also uploaded a process list of the hanging systems.

Created attachment 320173 [details]
glocks graph with statfs_fast = 0 option
Created attachment 320174 [details]
locks graph with statfs_fast = 0 option
Created attachment 320175 [details]
locks held graph with statfs_fast = 0 option
Created attachment 320176 [details]
procs in D state on node 3 graph with statfs_fast = 0 option
Created attachment 320177 [details]
ps output on node 1 with statfs_fast = 0 option
Created attachment 320178 [details]
ps output on node 2 with statfs_fast = 0 option
Created attachment 320179 [details]
ps output on node 3 with statfs_fast = 0 option
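For reference, statfs_fast in the runs above is a per-mount GFS tunable; a sketch of toggling it, assuming the /fs/1/1 mount point that appears later in this report:

  # disable the fast statfs path on a mounted gfs filesystem
  gfs_tool settune /fs/1/1 statfs_fast 0
  # check the current tunable values
  gfs_tool gettune /fs/1/1 | grep statfs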
Hi Rene, could you please provide the output of SysRq-T after you reproduced the problem? Could you please do the SysRq-T 3 times in a row (please wait 5 secs after every iteration)?
-Mark

Hi Rene, how many filesystems are concurrently mounted?
-Marc.

What we need to do is isolate any legitimate gfs/dlm problems from the effects of simply overwhelming them. We also want to eliminate gfs/dlm settings that could be having unknown effects on the system.

For the first test, please do the following (on *all* nodes; see the sketch after the attachments below):
- disable anything that runs df (e.g. hp software)
- disable statfs_fast
- disable glock_purge
- disable drop_locks: echo "0" >> /proc/cluster/lock_dlm/drop_count before mounting any gfs file systems
- change the way "find" is run so that one find command is not run until after the previous one has completed

I started a test with the clean settings as suggested. I'll keep you up to date.

Created attachment 320260 [details]
sysrq-t output node1
Created attachment 320261 [details]
sysrq-t output node2
Created attachment 320262 [details]
sysrq-t output node3
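Taken together, the "clean settings" preparation from comment 15 amounts to something like the following sketch (device path and mount point are assumptions; statfs_fast and glock_purge are standard GFS1 tunables):

  #!/bin/sh
  # Run on *all* nodes. drop_locks must be disabled *before* mounting
  # any gfs filesystem.
  echo "0" >> /proc/cluster/lock_dlm/drop_count
  mount -t gfs /dev/vg_test/lv_fs1 /fs/1/1    # assumption: test fs
  # Disable statfs_fast and glock_purge on the mounted filesystem.
  gfs_tool settune /fs/1/1 statfs_fast 0
  gfs_tool settune /fs/1/1 glock_purge 0
  # Serialize the finds: start the next one only after the previous
  # one has completed.
  for i in `seq 1 100`; do
      /usr/bin/find / -name core -type f -exec ls -al {} \; > /tmp/find.log 2>&1
  done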
(In reply to comment #12)
> Hi Rene, could you please provide the output of SysRq-T after you
> reproduced the problem? Could you please do the SysRq-T 3 times in a row
> (please wait 5 secs after every iteration)?
> -Mark

Hi Mark, I uploaded the requested files. Regards, Rene

(In reply to comment #13)
> Hi Rene, how many filesystems are concurrently mounted?
> -Marc.

Hi Marc, there are 30 FS mounted. Regards, Rene

(In reply to comment #15)
> What we need to do is isolate any legitimate gfs/dlm problems from the
> effects of simply overwhelming them. We also want to eliminate gfs/dlm
> settings that could be having unknown effects on the system.
>
> For the first test, please do the following (on *all* nodes):
> - disable anything that runs df (e.g. hp software)
> - disable statfs_fast
> - disable glock_purge
> - disable drop_locks: echo "0" >> /proc/cluster/lock_dlm/drop_count
>   before mounting any gfs file systems
> - change the way "find" is run so that one find command is not
>   run until after the previous one has completed

I will run the test as requested. I'll keep you informed. Regards, Rene

Is the test run on just one of the 30 mounted filesystems? Are the other 29 empty?

(In reply to comment #23)
> Is the test run on just one of the 30 mounted filesystems?
> Are the other 29 empty?

Hi Dave, there are 30 FS mounted, but the test is run on just one filesystem. The other FS are not empty; I copied the kernel sources onto each of them. By the way: I could reproduce the hang even with your suggestions from comment 15.

OK, let's continue to narrow things down one step at a time. Please run another iteration of the same test, but this time mount only the one fs being used for the test.

A couple of questions about this previous test that hung (we're trying to copy what you're doing here): approximately how long was the test running before it hung? And when it hangs, do you observe a sudden change in behavior, where the test is making perceptible progress and then suddenly stops making any progress? Or do things progressively slow down, and eventually reach a point where you judge that it's hung?

If the systems are still in the original hung state (i.e. you haven't done anything to change the state of the systems since they hung), could you collect the following from each of the three nodes:
  cman_tool nodes
  cman_tool services
  ps ax -o pid,stat,cmd,wchan
  gfs_tool counters <mountpoint>
  anything related to cman/dlm/gfs in /var/log/messages

I'm running the following setup on a RHEL5 cluster with GFS1 and I haven't hit any hangs yet. The system is slow but still responsive. It's been running for about 10 hrs now.

* Only 1 gfs filesystem mounted on all cluster nodes.
* statfs_fast, glock_purge, drop_locks are all 0.
* Node 1 runs postmark on the gfs filesystem with the configuration mentioned in comment #0. Current configuration is:
    The base number of files is 70000
    Transactions: 600000
    Files range between 4.88 kilobytes and 9.77 kilobytes in size
    Working directory: /mnt/gfs (weight=1)
    2 subdirectories will be used
    Block sizes are: read=512 bytes, write=512 bytes
    Biases are: read/append=5, create/delete=5
    Not using Unix buffered file I/O
    Random number generator seed is 42
    Report format is verbose.
* Node 2 runs a program I wrote (similar to the python script mentioned in comment #0) that creates 20 directories with 100000 files in each, with file sizes up to 1 MB.
* Finally, all nodes also run the exact find command mentioned in comment #0 in a loop (instead of the cronjob),
i.e.

  for i in `seq 1 100`; do /usr/bin/find / -name core -type f -exec ls -al {} \; > /tmp/find.log 2>&1; done

I've also started the same setup on a RHEL4 cluster just now. Hopefully we'll have a hang in the morning.

(In reply to comment #25)
> OK, let's continue to narrow things down one step at a time. Please run
> another iteration of the same test, but this time mount only the one fs
> being used for the test.
>
> A couple of questions about this previous test that hung (we're trying to
> copy what you're doing here): approximately how long was the test running
> before it hung? And when it hangs, do you observe a sudden change in
> behavior, where the test is making perceptible progress and then suddenly
> stops making any progress? Or do things progressively slow down, and
> eventually reach a point where you judge that it's hung?
>
> If the systems are still in the original hung state (i.e. you haven't done
> anything to change the state of the systems since they hung), could you
> collect the following from each of the three nodes:
>   cman_tool nodes
>   cman_tool services
>   ps ax -o pid,stat,cmd,wchan
>   gfs_tool counters <mountpoint>
>   anything related to cman/dlm/gfs in /var/log/messages

It takes between 1h - 2h until gfs "hangs". With the above test it took 30 minutes. And actually I am not sure whether we have a deadlock or whether gfs gets "indefinitely" slow. When GFS is in that state, a strace of an ls or cd ends with:

  ...
  read(3, ",noatime,nodiratime 0 0\n/dev/vg_"..., 1024) = 189
  read(3, "", 1024)                       = 0
  close(3)                                = 0
  munmap(0x2a9556c000, 4096)              = 0
  open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
  fstat(3, {st_mode=S_IFREG|0644, st_size=48516848, ...}) = 0
  mmap(NULL, 48516848, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2a95df1000
  close(3)                                = 0
  open("/proc/filesystems", O_RDONLY)     = 3
  read(3, "nodev\tsysfs\nnodev\trootfs\nnodev\tb"..., 4095) = 297
  close(3)                                = 0
  ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
  ioctl(1, TIOCGWINSZ, {ws_row=64, ws_col=177, ws_xpixel=0, ws_ypixel=0}) = 0
  stat("dir0",

and stays there in D state. As mentioned above, I cannot abort the command with ctrl+c. It is the state which I experienced on some of our production systems. I have a lot of the following messages in /var/log/messages:

  ...
  Oct 15 07:03:43 smlxt301 kernel: dlm: reply
  Oct 15 07:03:43 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  Oct 15 07:03:43 smlxt301 kernel: dlm: reply
  Oct 15 07:03:43 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  Oct 15 07:03:43 smlxt301 kernel: dlm: reply
  Oct 15 07:44:45 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  Oct 15 07:44:45 smlxt301 kernel: dlm: reply
  Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  Oct 15 07:44:47 smlxt301 kernel: dlm: reply
  Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  Oct 15 07:44:47 smlxt301 kernel: dlm: reply
  Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  Oct 15 07:44:47 smlxt301 kernel: dlm: reply
  Oct 15 07:44:47 smlxt301 kernel: dlm: Magma: reply from 3 no lock
  ...
# cman_tool nodes
Node  Votes Exp Sts  Name
   0    4    0   M   /dev/mapper/quorumdisk
   1    1    3   M   smlxt301
   2    1    3   M   smlxt302
   3    1    3   M   smlxt303

# cman_tool services
Service          Name               GID  LID  State  Code
Fence Domain:    "default"            3    2  run    -     [2 3 1]
DLM Lock Space:  "clvmd"              5    3  run    -     [2 3 1]
DLM Lock Space:  "fs_1_1"             7    4  run    -     [2 3 1]
DLM Lock Space:  "fs_2_1"            11    6  run    -     [2 3 1]
DLM Lock Space:  "fs_3_1"            15    8  run    -     [2 3 1]
DLM Lock Space:  "fs_4_1"            19   10  run    -     [2 3 1]
DLM Lock Space:  "fs_5_1"            23   12  run    -     [2 3 1]
DLM Lock Space:  "fs_6_1"            27   14  run    -     [2 3 1]
DLM Lock Space:  "fs_7_1"            31   16  run    -     [2 3 1]
DLM Lock Space:  "fs_8_1"            35   18  run    -     [2 3 1]
DLM Lock Space:  "fs_9_1"            39   20  run    -     [2 3 1]
DLM Lock Space:  "fs_10_1"           43   22  run    -     [2 3 1]
DLM Lock Space:  "fs_11_1"           47   24  run    -     [2 3 1]
DLM Lock Space:  "fs_12_1"           51   26  run    -     [2 3 1]
DLM Lock Space:  "fs_13_1"           55   28  run    -     [2 3 1]
DLM Lock Space:  "fs_14_1"           59   30  run    -     [2 3 1]
DLM Lock Space:  "fs_15_1"           63   32  run    -     [2 3 1]
DLM Lock Space:  "fs_16_1"           67   34  run    -     [2 3 1]
DLM Lock Space:  "fs_17_1"           71   36  run    -     [2 3 1]
DLM Lock Space:  "fs_19_1"           76   38  run    -     [2 3 1]
DLM Lock Space:  "fs_20_1"           82   40  run    -     [2 3 1]
DLM Lock Space:  "fs_22_1"           88   42  run    -     [2 3 1]
DLM Lock Space:  "fs_23_1"           94   44  run    -     [2 3 1]
DLM Lock Space:  "fs_24_1"          100   46  run    -     [2 3 1]
DLM Lock Space:  "fs_25_1"          106   48  run    -     [2 3 1]
DLM Lock Space:  "fs_26_1"          112   50  run    -     [2 3 1]
DLM Lock Space:  "fs_27_1"          118   52  run    -     [2 3 1]
DLM Lock Space:  "fs_28_1"          124   54  run    -     [2 3 1]
DLM Lock Space:  "fs_29_1"          130   56  run    -     [2 3 1]
DLM Lock Space:  "fs_30_1"          136   58  run    -     [2 3 1]
DLM Lock Space:  "Magma"            145   61  run    -     [3 1 2]
GFS Mount Group: "fs_1_1"             9    5  run    -     [2 3 1]
GFS Mount Group: "fs_2_1"            13    7  run    -     [2 3 1]
GFS Mount Group: "fs_3_1"            17    9  run    -     [2 3 1]
GFS Mount Group: "fs_4_1"            21   11  run    -     [2 3 1]
GFS Mount Group: "fs_5_1"            25   13  run    -     [2 3 1]
GFS Mount Group: "fs_6_1"            29   15  run    -     [2 3 1]
GFS Mount Group: "fs_7_1"            33   17  run    -     [2 3 1]
GFS Mount Group: "fs_8_1"            37   19  run    -     [2 3 1]
GFS Mount Group: "fs_9_1"            41   21  run    -     [2 3 1]
GFS Mount Group: "fs_10_1"           45   23  run    -     [2 3 1]
GFS Mount Group: "fs_11_1"           49   25  run    -     [2 3 1]
GFS Mount Group: "fs_12_1"           53   27  run    -     [2 3 1]
GFS Mount Group: "fs_13_1"           57   29  run    -     [2 3 1]
GFS Mount Group: "fs_14_1"           61   31  run    -     [2 3 1]
GFS Mount Group: "fs_15_1"           65   33  run    -     [2 3 1]
GFS Mount Group: "fs_16_1"           69   35  run    -     [2 3 1]
GFS Mount Group: "fs_17_1"           73   37  run    -     [2 3 1]
GFS Mount Group: "fs_19_1"           79   39  run    -     [2 3 1]
GFS Mount Group: "fs_20_1"           85   41  run    -     [2 3 1]
GFS Mount Group: "fs_22_1"           91   43  run    -     [2 3 1]
GFS Mount Group: "fs_23_1"           97   45  run    -     [2 3 1]
GFS Mount Group: "fs_24_1"          103   47  run    -     [2 3 1]
GFS Mount Group: "fs_25_1"          109   49  run    -     [2 3 1]
GFS Mount Group: "fs_26_1"          115   51  run    -     [2 3 1]
GFS Mount Group: "fs_27_1"          121   53  run    -     [2 3 1]
GFS Mount Group: "fs_28_1"          127   55  run    -     [2 3 1]
GFS Mount Group: "fs_29_1"          133   57  run    -     [2 3 1]
GFS Mount Group: "fs_30_1"          139   59  run    -     [2 3 1]
User:            "usrm::manager"    142   60  run    -     [2 3 1]

# gfs_tool counters /fs/1/1
  locks 141445
  locks held 70749
  freeze count 0
  incore inodes 70638
  metadata buffers 47
  unlinked inodes 1
  quota IDs 0
  incore log buffers 0
  log space used 0.05%
  meta header cache entries 7855
  glock dependencies 19
  glocks on reclaim list 0
  log wraps 35
  outstanding LM calls 9
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 10226468
  glock nq calls 14587201
  glock dq calls 14515409
  glock prefetch calls 4989623
  lm_lock calls 5511896
  lm_unlock calls 5184959
  lm callbacks 10702559
  address operations 3870033
  dentry operations 753854
  export operations 0
  file operations 5147274
  inode operations 2688443
  super operations 13375048
  vm operations 0
  block I/O reads 2720819
  block I/O writes 2060819

Created attachment 320398 [details]
node1: ps ax -o pid,stat,cmd,wchan
Created attachment 320399 [details]
node2: ps ax -o pid,stat,cmd,wchan
Created attachment 320400 [details]
node3: ps ax -o pid,stat,cmd,wchan
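The data above follows the checklist from comment 25. A small script makes it easy to collect the same bundle from every node (a sketch; the mount point is an assumption, and gfs_tool counters may itself block if the filesystem is hung):

  #!/bin/sh
  # Collect the per-node diagnostics requested in comment 25.
  MOUNTPOINT=/fs/1/1    # assumption
  OUT=/tmp/gfs-diag.$(hostname).$(date +%Y%m%d%H%M%S)
  {
      echo "== cman_tool nodes ==";    cman_tool nodes
      echo "== cman_tool services =="; cman_tool services
      echo "== ps ==";                 ps ax -o pid,stat,cmd,wchan
      echo "== gfs_tool counters =="; gfs_tool counters "$MOUNTPOINT"
      echo "== messages ==";           egrep 'cman|dlm|gfs' /var/log/messages | tail -200
  } > "$OUT" 2>&1
  echo "wrote $OUT"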
Rene, could you please provide the gfs_tool counters from all nodes?

Yes:

node1:
  locks 141445
  locks held 70749
  freeze count 0
  incore inodes 70638
  metadata buffers 47
  unlinked inodes 1
  quota IDs 0
  incore log buffers 0
  log space used 0.05%
  meta header cache entries 7855
  glock dependencies 19
  glocks on reclaim list 0
  log wraps 35
  outstanding LM calls 9
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 10226470
  glock nq calls 14587492
  glock dq calls 14515699
  glock prefetch calls 4989627
  lm_lock calls 5511900
  lm_unlock calls 5184961
  lm callbacks 10702565
  address operations 3870033
  dentry operations 753855
  export operations 0
  file operations 5148102
  inode operations 2688661
  super operations 13375126
  vm operations 0
  block I/O reads 2720821
  block I/O writes 2060819

node2:
  locks 922873
  locks held 455963
  freeze count 0
  incore inodes 455738
  metadata buffers 1188
  unlinked inodes 1
  quota IDs 2
  incore log buffers 0
  log space used 0.05%
  meta header cache entries 107
  glock dependencies 0
  glocks on reclaim list 0
  log wraps 41
  outstanding LM calls 110407
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 11441806
  glock nq calls 15121627
  glock dq calls 14665567
  glock prefetch calls 8693843
  lm_lock calls 10001385
  lm_unlock calls 9193992
  lm callbacks 19094890
  address operations 0
  dentry operations 1907787
  export operations 0
  file operations 2015619
  inode operations 7537515
  super operations 17571647
  vm operations 0
  block I/O reads 7712420
  block I/O writes 2407714

node3:
  locks 385685
  locks held 104950
  freeze count 0
  incore inodes 100012
  metadata buffers 0
  unlinked inodes 0
  quota IDs 0
  incore log buffers 0
  log space used 0.10%
  meta header cache entries 0
  glock dependencies 0
  glocks on reclaim list 0
  log wraps 4
  outstanding LM calls 218740
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 10267822
  glock nq calls 1842881
  glock dq calls 1742865
  glock prefetch calls 5778466
  lm_lock calls 5898557
  lm_unlock calls 5355866
  lm callbacks 11041587
  address operations 0
  dentry operations 382740
  export operations 0
  file operations 205793
  inode operations 1052899
  super operations 6052575
  vm operations 0
  block I/O reads 2990554
  block I/O writes 130725

In my opinion, the "outstanding LM calls" value shows the problem. I believe that the ls command tries to acquire a lock from the lock manager which is never processed. It also looks like the order in which queued lock requests are handled is not fair; otherwise, the lock for the ls command would be granted at some point. In our tests the ls command "hung" for some hours. It returned after I killed all find processes.

@Rene: what happens if you kill the find processes? (Watch the "outstanding LM calls".)

BTW: we can reproduce the same problem on our two-node cluster with only one gfs filesystem mounted.

- The "outstanding LM calls" count is interesting; we will investigate what that might mean.
- What is the "bpbkar" process? It's blocked on gfs on nodes 2 and 3.
- "dlm: Magma: reply from 3 no lock" is not good. It can be fixed easily by
    echo "300" >> /proc/cluster/config/dlm/lock_timeout
  also see http://sources.redhat.com/cluster/wiki/FAQ/LockManager#dlm_timeout
- When you next run the test, and the filesystems hang, try to kill some of the processes (find, python, postmark) on the nodes and wait to see if the filesystems become responsive again.

During the tests (RHEL4u7) I observed a problem that in my opinion is related to bz# 247766. I tried to trigger a sysrq-t. Unfortunately, one node panicked during the operation.
The filesystem was very unresponsive and rgmanager got some lock timeouts:

  <err> #48: Unable to obtain cluster lock: Connection timed out

The kernel reports the following:

  Oct 15 12:06:00 lilr205a dlm: Magma: reply from 2 no lock
  Oct 15 12:06:00 lilr205a dlm: reply
  Oct 15 12:06:00 lilr205a rh_cmd 5
  Oct 15 12:06:00 lilr205a rh_lkid ac03e9
  Oct 15 12:06:00 lilr205a lockstate 0
  Oct 15 12:06:00 lilr205a nodeid 1
  Oct 15 12:06:00 lilr205a status 0
  Oct 15 12:06:00 lilr205a lkid 0
  ...

And ps ax -o stat,wchan,command,pid shows:

  D<   18446744071566487680   [dlm_recvd]   6525

Abhi, one thing I think we missed in our test setup were the 10 dummy rgmanager services.

Rene, in the next iteration of the test:
- mount only the 1 fs being used for the test (already mentioned this)
- eliminate the 10 dummy rgmanager services (don't even run rgmanager)
- remove the unknown bpbkar process

(In reply to comment #37)
> Rene, in the next iteration of the test:
> - mount only the 1 fs being used for the test (already mentioned this)
> - eliminate the 10 dummy rgmanager services (don't even run rgmanager)
> - remove the unknown bpbkar process

I did the run with one fs, no rgmanager and no bpbkar process (backup). Same result within 1h. Here are the counters:

node1:
  locks 144779
  locks held 74146
  freeze count 0
  incore inodes 70664
  metadata buffers 281
  unlinked inodes 55
  quota IDs 2
  incore log buffers 0
  log space used 28.66%
  meta header cache entries 9999
  glock dependencies 1582
  glocks on reclaim list 0
  log wraps 15
  outstanding LM calls 7856
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 1081533
  glock nq calls 8317643
  glock dq calls 8243424
  glock prefetch calls 553179
  lm_lock calls 824675
  lm_unlock calls 470422
  lm callbacks 1293488
  address operations 2382672
  dentry operations 357131
  export operations 0
  file operations 2837593
  inode operations 1493792
  super operations 5399593
  vm operations 0
  block I/O reads 91502
  block I/O writes 1066519

node2: no output, gfs_tool counters is not responding...

node3:
  locks 1436
  locks held 15
  freeze count 0
  incore inodes 8
  metadata buffers 0
  unlinked inodes 0
  quota IDs 0
  incore log buffers 0
  log space used 0.00%
  meta header cache entries 0
  glock dependencies 0
  glocks on reclaim list 0
  log wraps 0
  outstanding LM calls 1416
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 2010718
  glock nq calls 7058
  glock dq calls 7047
  glock prefetch calls 1288530
  lm_lock calls 1291482
  lm_unlock calls 727693
  lm callbacks 2022042
  address operations 0
  dentry operations 58
  export operations 0
  file operations 6661
  inode operations 317
  super operations 1288676
  vm operations 0
  block I/O reads 105691
  block I/O writes 4

I tried several times to kill the find, postmark and python processes (with signal 9). No success. Maybe it takes some time; I'll wait until tomorrow. Regards, Rene.

It will be interesting to leave things in this state for some time. We'd like to look for any change at all between now and tomorrow when you check it:
. did any of the killed processes eventually exit?
. have any of the processes (find, postmark, python) made any further progress at all?
. run gfs_tool counters again, looking for any changes
. run ps ax -o pid,stat,cmd,wchan

In our similar tests here, we see very very slow progress/response of the fs, some things taking up to 5 minutes to respond. We're looking for concrete evidence that your tests are not hung, but just very slow.
The steps we take to further diagnose the problems will depend significantly upon whether dlm/gfs are truly hung or whether they are still operating but making no noticeable progress. For the next test iteration, we're going to have you remove the "find" processes from all nodes.

Update to comment 39. Things to check and collect after the current hung state has been left for some time:
. did any of the killed processes eventually exit?
. have any of the processes (find, postmark, python) made any further progress at all?
. cman_tool nodes and services
. ps ax -o pid,stat,cmd,wchan
. anything related to cman/dlm/gfs in /var/log/messages
. gfs_tool counters again, looking for any changes
. if gfs_tool counters has not responded on any node (like node 2 in comment 38), run sysrq-t and sysrq-w on those nodes

(In reply to comment #40)
> Things to check and collect after the current hung state has been left for
> some time:
> . did any of the killed processes eventually exit?
> . have any of the processes (find, postmark, python) made any further
>   progress at all?
> . cman_tool nodes and services
> . ps ax -o pid,stat,cmd,wchan
> . anything related to cman/dlm/gfs in /var/log/messages
> . gfs_tool counters again, looking for any changes
> . if gfs_tool counters has not responded on any node (like node 2 in
>   comment 38), run sysrq-t and sysrq-w on those nodes

* The processes never quit. I tried several times.

* cman_tool nodes:

Node  Votes Exp Sts  Name
   0    4    0   M   /dev/mapper/quorumdisk
   1    1    3   M   smlxt301
   2    1    3   M   smlxt302
   3    1    3   M   smlxt303

* cman_tool services:

Service          Name               GID  LID  State  Code
Fence Domain:    "default"            2    2  run    -     [1 2 3]
DLM Lock Space:  "clvmd"              3    3  run    -     [1 2 3]
DLM Lock Space:  "fs_1_1"             8    8  run    -     [1 2 3]
GFS Mount Group: "fs_1_1"             9    9  run    -     [1 2 3]

* This time nothing special in /var/log/messages.

* gfs_tool counters:

node 1:
  locks 144779
  locks held 74146
  freeze count 0
  incore inodes 70664
  metadata buffers 281
  unlinked inodes 55
  quota IDs 2
  incore log buffers 0
  log space used 28.66%
  meta header cache entries 0
  glock dependencies 0
  glocks on reclaim list 0
  log wraps 15
  outstanding LM calls 7857
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 1081596
  glock nq calls 8320680
  glock dq calls 8246459
  glock prefetch calls 553179
  lm_lock calls 824746
  lm_unlock calls 470485
  lm callbacks 1293628
  address operations 2382672
  dentry operations 357131
  export operations 0
  file operations 2846493
  inode operations 1496017
  super operations 5399836
  vm operations 0
  block I/O reads 91502
  block I/O writes 1066519

node 2: no output, gfs_tool counters is not responding...

node 3:
  locks 1596
  locks held 15
  freeze count 0
  incore inodes 8
  metadata buffers 0
  unlinked inodes 0
  quota IDs 0
  incore log buffers 0
  log space used 0.00%
  meta header cache entries 0
  glock dependencies 0
  glocks on reclaim list 0
  log wraps 0
  outstanding LM calls 1417
  outstanding BIO calls 0
  fh2dentry misses 0
  glocks reclaimed 2010782
  glock nq calls 7865
  glock dq calls 7852
  glock prefetch calls 1288530
  lm_lock calls 1291547
  lm_unlock calls 727757
  lm callbacks 2022170
  address operations 0
  dentry operations 58
  export operations 0
  file operations 6665
  inode operations 318
  super operations 1288918
  vm operations 0
  block I/O reads 105696
  block I/O writes 4

Created attachment 320531 [details]
node1: ps ax -o pid,stat,cmd,wchan
Created attachment 320532 [details]
node2: ps ax -o pid,stat,cmd,wchan
Created attachment 320533 [details]
node3: ps ax -o pid,stat,cmd,wchan
Created attachment 320534 [details]
sysrq-t and sysrq-w of node2
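The kill-and-watch experiment suggested above can be scripted roughly as follows (a sketch; signal 9 matches the attempts described in the comments, and the mount point is the one from earlier in this report):

  # Kill the load generators on each node...
  pkill -9 find; pkill -9 postmark; pkill -9 python
  # ...then watch whether the outstanding LM calls drain over time.
  while sleep 30; do
      date
      gfs_tool counters /fs/1/1 | grep "outstanding LM calls"
  done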
The node2 sysrq doesn't appear to match the node2 ps.

As mentioned in comment 39, for the next test iteration, please remove the find processes from all nodes.

Created attachment 320571 [details]
ps and sysrq of node2
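For completeness, the SysRq dumps attached above can be captured with the magic SysRq interface, assuming it is enabled (the output lands in the kernel log / /var/log/messages):

  # Enable magic SysRq, then take three task dumps 5 seconds apart,
  # as requested in comment 12.
  echo 1 > /proc/sys/kernel/sysrq
  for i in 1 2 3; do
      echo t > /proc/sysrq-trigger
      sleep 5
  done
  # sysrq-w, as also requested in comment 40.
  echo w > /proc/sysrq-trigger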
(In reply to comment #47)
> As mentioned in comment 39, for the next test iteration, please remove
> the find processes from all nodes.

The test now runs without find... I'll keep you updated.

The test has run for more than 12h without getting into a hang.

The test has run for four days. No hanging. Can we conclude that find is the problem and that we should not use find in combination with gfs?

We have been running with non-concurrent finds for 5 days without problems. Should we raise the number of find processes again?

(In reply to comment #52)
> We have been running with non-concurrent finds for 5 days without problems.
> Should we raise the number of find processes again?

Without finds we experience no hangs. We had RH engineers at our site, and they told us gfs (1) is not designed for concurrent finds. We should wait for gfs2 and until then avoid finds on the system.

Rene, it sounds like removing the finds solves the problems in your test cluster; I don't know how directly this translates to your production systems, where the processes and load will be different. Marc, it sounds like finds do not create the same problems on your cluster. It's the combination of "find" with the other load on the filesystem that creates problems.

In summary, the problems on Rene's cluster do not appear to be "bugs" per se that can be fixed in gfs/dlm. The problem was applications (find+postmark+manycreates) stressing gfs/dlm in ways that it cannot handle. The result of this stress is that the system becomes so loaded that nothing makes any forward progress.

Dave, "my" cluster also does not hang or anything like that when only one find runs (that's the good news). I don't know whether finds are really the only cause; it may be any process that loads the filesystem with "read" (not sure about that) I/Os, when more than one or two processes of that type run at the same time. Two things on that topic:
1. On a normal cluster you can never be sure that you can avoid such behaviour, can you?
2. I would agree that this is not a bug (although this is really hard to tell), but what we saw is concurrent processes accessing the fs leading to a "deadlock" where calls would only resume when another dependent (find-)process was killed, independent of how long we waited. I still think this is an issue/bug we are seeing from time to time that should be analysed.
What do you think?

Marc, you are correct, we can't be sure with problems like this. The best we can do is identify a particular usage/load where things run ok a large percentage of the time. In Rene's cluster, find+postmark+createfiles would only run for a few hours, but postmark+createfiles ran for 5 days. (To be more confident in these results, we could add back some of the other things that we eliminated in the process of narrowing down the problem, to verify that the final result doesn't change.)

We would need to run a similar sequence of tests in your case to identify the "culprit", i.e. identify a specific set of processes, each with a specific load, which create a regular failure; then start eliminating processes or changing configuration settings one at a time until the combination no longer creates a failure.

Please notice that, according to comment #38 and comment #41, Rene was not able to kill the find processes within one day. During my tests, a "hang" (ls hung for several hours) that was caused by the gfs/dlm stressing could be resolved by simply killing all find processes. In my opinion, a "hang" that cannot be resolved is very likely to be caused by a bug.
I.e. a "hang" that is caused by overstressing the gfs/dlm should be resolvable by killing all processes causing the overstressing. Can you confirm this ? I believe that whether or not stalled processes can be killed depends on how severely the system is overloaded. It probably also depends on the specifics of what the different processes are doing. So, I would expect on overstressed cluster could sometimes have killable processes, and sometimes non-killable processes. I ran some numbers on the smoke cluster and here are the results: I have one gfs1 and one gfs2 filesystem (both size 100 G) filled to capacity with files upto 1 MB in size. gfs1: 3 directories, 202126 files in all gfs2: 3 directories, 202336 files in all The filesystems are mounted on all the nodes and I ran 'strace -c find' on all the nodes. Once on the gfs1 filesystem and once on the gfs2 filesystem. Something like: strace -c /usr/bin/find /mnt/gfs -name core -type f -exec ls -al {} \; > /tmp/find.log 2> /tmp/find-gfs-stats Below are the stats for the find processes on each node. The first set of numbers is for gfs1. second set for gfs2. camel: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 30.116114 15134 1990 getdents64 0.00 0.000304 34 9 munmap 0.00 0.000157 157 1 execve 0.00 0.000018 2 12 brk 0.00 0.000000 0 8 read 0.00 0.000000 0 23 open 0.00 0.000000 0 23 close 0.00 0.000000 0 1 time 0.00 0.000000 0 2 1 access 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 1 uname 0.00 0.000000 0 3 mprotect 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 14 mremap 0.00 0.000000 0 23 mmap2 0.00 0.000000 0 4 stat64 0.00 0.000000 0 5 lstat64 0.00 0.000000 0 13 fstat64 0.00 0.000000 0 4 fcntl64 0.00 0.000000 0 1 set_thread_area ------ ----------- ----------- --------- --------- ---------------- 100.00 30.116593 2150 1 total % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.66 0.108964 55 1992 getdents64 0.17 0.000182 8 23 close 0.16 0.000174 12 14 mremap 0.02 0.000021 1 23 mmap2 0.00 0.000000 0 8 read 0.00 0.000000 0 23 open 0.00 0.000000 0 1 execve 0.00 0.000000 0 1 time 0.00 0.000000 0 2 1 access 0.00 0.000000 0 12 brk 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 9 munmap 0.00 0.000000 0 1 uname 0.00 0.000000 0 3 mprotect 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 4 stat64 0.00 0.000000 0 5 lstat64 0.00 0.000000 0 13 fstat64 0.00 0.000000 0 4 fcntl64 0.00 0.000000 0 1 set_thread_area ------ ----------- ----------- --------- --------- ---------------- 100.00 0.109341 2152 1 total merit: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 31.580428 15870 1990 getdents64 0.00 0.000034 4 8 read 0.00 0.000022 2 12 brk 0.00 0.000021 1 23 mmap2 0.00 0.000000 0 23 open 0.00 0.000000 0 23 close 0.00 0.000000 0 1 execve 0.00 0.000000 0 1 time 0.00 0.000000 0 2 1 access 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 9 munmap 0.00 0.000000 0 1 uname 0.00 0.000000 0 3 mprotect 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 14 mremap 0.00 0.000000 0 4 stat64 0.00 0.000000 0 5 lstat64 0.00 0.000000 0 13 fstat64 0.00 0.000000 0 4 fcntl64 0.00 0.000000 0 1 set_thread_area ------ ----------- ----------- --------- --------- ---------------- 100.00 31.580505 2150 1 total % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.70 0.195846 98 1992 getdents64 0.24 0.000464 52 9 munmap 0.05 0.000107 8 14 mremap 0.01 
0.000023 1 23 open 0.00 0.000000 0 8 read 0.00 0.000000 0 23 close 0.00 0.000000 0 1 execve 0.00 0.000000 0 1 time 0.00 0.000000 0 2 1 access 0.00 0.000000 0 12 brk 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 1 uname 0.00 0.000000 0 3 mprotect 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 23 mmap2 0.00 0.000000 0 4 stat64 0.00 0.000000 0 5 lstat64 0.00 0.000000 0 13 fstat64 0.00 0.000000 0 4 fcntl64 0.00 0.000000 0 1 set_thread_area ------ ----------- ----------- --------- --------- ---------------- 100.00 0.196440 2152 1 total winston: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 31.435381 15797 1990 getdents64 0.00 0.000417 46 9 munmap 0.00 0.000113 8 14 mremap 0.00 0.000000 0 8 read 0.00 0.000000 0 23 open 0.00 0.000000 0 23 close 0.00 0.000000 0 1 execve 0.00 0.000000 0 1 time 0.00 0.000000 0 2 1 access 0.00 0.000000 0 12 brk 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 1 uname 0.00 0.000000 0 3 mprotect 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 23 mmap2 0.00 0.000000 0 4 stat64 0.00 0.000000 0 5 lstat64 0.00 0.000000 0 13 fstat64 0.00 0.000000 0 4 fcntl64 0.00 0.000000 0 1 set_thread_area ------ ----------- ----------- --------- --------- ---------------- 100.00 31.435911 2150 1 total % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.86 0.233801 117 1992 getdents64 0.12 0.000289 32 9 munmap 0.02 0.000038 2 23 mmap2 0.00 0.000000 0 8 read 0.00 0.000000 0 23 open 0.00 0.000000 0 23 close 0.00 0.000000 0 1 execve 0.00 0.000000 0 1 time 0.00 0.000000 0 2 1 access 0.00 0.000000 0 12 brk 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 1 uname 0.00 0.000000 0 3 mprotect 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 14 mremap 0.00 0.000000 0 4 stat64 0.00 0.000000 0 5 lstat64 0.00 0.000000 0 13 fstat64 0.00 0.000000 0 4 fcntl64 0.00 0.000000 0 1 set_thread_area ------ ----------- ----------- --------- --------- ---------------- 100.00 0.234128 2152 1 total salem: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 45.390793 22809 1990 getdents 0.00 0.000999 200 5 lstat 0.00 0.000999 56 18 brk 0.00 0.000016 1 23 mmap 0.00 0.000000 0 8 read 0.00 0.000000 0 23 open 0.00 0.000000 0 23 close 0.00 0.000000 0 4 stat 0.00 0.000000 0 13 fstat 0.00 0.000000 0 7 mprotect 0.00 0.000000 0 6 munmap 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 2 1 access 0.00 0.000000 0 7 mremap 0.00 0.000000 0 1 execve 0.00 0.000000 0 2 uname 0.00 0.000000 0 4 fcntl 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 1 arch_prctl 0.00 0.000000 0 1 time ------ ----------- ----------- --------- --------- ---------------- 100.00 45.392807 2151 1 total % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.63 0.272699 137 1992 getdents 0.37 0.000999 56 18 brk 0.00 0.000000 0 8 read 0.00 0.000000 0 23 open 0.00 0.000000 0 23 close 0.00 0.000000 0 4 stat 0.00 0.000000 0 13 fstat 0.00 0.000000 0 5 lstat 0.00 0.000000 0 23 mmap 0.00 0.000000 0 7 mprotect 0.00 0.000000 0 6 munmap 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 2 1 access 0.00 0.000000 0 7 mremap 0.00 0.000000 0 1 execve 0.00 0.000000 0 2 uname 0.00 0.000000 0 4 fcntl 0.00 0.000000 0 12 fchdir 0.00 0.000000 0 1 arch_prctl 0.00 0.000000 0 1 time ------ ----------- ----------- --------- --------- ---------------- 100.00 0.273698 2153 1 total Now looking at those figures reminds me of something... 
There is a difference between GFS1 and GFS2 in that GFS2 doesn't do the readahead of inodes that GFS1 does. I made that change right back in the early days of GFS2 because ls seemed to be very slow. I wouldn't be at all surprised if that's what you've come across here.

Created attachment 322767 [details]
RHEL5 patch - updated max stat-rate to 10 stats/sec
Created attachment 322779 [details]
RHEL4 patch
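A quick way to gauge the effect of these patches (a sketch; the mount point is an assumption) is to time a stat-less find against a stat-heavy one before and after patching, since the stat-less case is the one hurt by speculative lock prefetching:

  # stat-less find: the case previously slowed by lock prefetching
  time /usr/bin/find /mnt/gfs -name core -type f > /dev/null
  # stat-heavy find: the case the prefetch heuristic anticipates
  time /usr/bin/find /mnt/gfs -name core -type f -exec ls -al {} \; > /dev/null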
Checked in the above patch to RHEL4, RHEL5, STABLE2 and master.

Do those patches solve the issues we had with the stress test? And if so, can we count on having these patches included in the rh4u8 release?

(In reply to comment #72)
> Do those patches solve the issues we had with the stress test? And if so,
> can we count on having these patches included in the rh4u8 release?

Rene, when you got rid of the find processes in your stress test, the machines recovered from the hang. We tried this in our test cluster as well and found that finds (without stat) were very slow because gfs was trying to prefetch locks in the hope that stats would be performed on the files returned by find. This assumption was incorrect. The patch above dramatically improves performance of finds without stats by regulating prefetching. The patch has been checked into the RHEL4 git branch and should make the 4.8 release.

Hi Abhijith, it would be nice to have a tunable for the prefetch behaviour, e.g. a max_stat_rate tunable parameter. If the max_stat_rate parameter is "optimized" for find processes in the code, this might lead to performance issues in other load scenarios. -Mark

Hi, I repeated the test with the proposed patch. This is really an improvement! The test has been running for 24h now with no hanging. Without this patch, GFS hung after 30-60 minutes and I could only recover the system with a reboot. I'll keep the test running and keep you updated. Cheers, Rene

Hi, the test has now run for 10 days without a hang. Stopping now. Regards, Rene.

Changed component to GFS-kernel.

Removing 4.7.z flag as this bug has already been cloned to 4.7.z.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1045.html