Created attachment 132203 [details] /var/log/messages output from one machine showing the error.
Description of problem:
My logs seem to be full of an assertion error. It is seen under high workloads: if I have 4 machines doing high I/O to a SAN device, I see these errors.

Version-Release number of selected component (if applicable):
RHEL4 kernel 2.6.9-34.0.1.ELsmp using rpms from up2date
GFS-kernel-2.6.9-49.1
GFS-6.1.5-0
Running 32-bit Linux on a 4-way HP DL585, with a QLogic HBA connected to a CX700 SAN system.

How reproducible:
Intermittently, during periods of high I/O.

Steps to Reproduce:
1. Have 4 machines simultaneously run, in the same GFS directory:
   time dd if=/dev/zero of=/mount/path/gfs/usr/file.test.machine-name-1 bs=4096 count=8388608
2. Any other sustained high I/O will also do it.

Actual results:
The write test took 26 mins and my logs were filled with:
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: warning: assertion "!error" failed
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: function = do_flock
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: file = /usr/src/build/751518-i686/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/ops_file.c, line = 1667

Expected results:
The write test takes 26 mins without any assertion warnings in the logs.

Additional info:
The messages are not in the logs of all the machines, and they are only seen during periods of high I/O.
I'm unable to reproduce this problem. I doubt dd is doing any flocks at all; there could be other process(es) triggering this. It'd be great if you could provide more info:
a) the mkfs command line used to create the GFS filesystem (it is unclear which locking module, gulm or dlm, you're running)
b) the output of 'gfs_tool sb <device> all'
c) the output of sysrq for running processes, memory info, etc.
Created attachment 132627 [details] Sysreport Output of one of my machines.
I attached the output of sysreport.

a) I don't know what mkfs.gfs options I used to create the FS. The lock manager in use is dlm.

b) The output of gfs_tool sb is:
##################################################################################
[root@pa-dev101 abhattacharya]# gfs_tool sb /dev/dbc1/dbc1 all
mh_magic = 0x01161970
mh_type = 1
mh_generation = 0
mh_format = 100
mh_incarn = 0
sb_fs_format = 1309
sb_multihost_format = 1401
sb_flags = 0
sb_bsize = 4096
sb_bsize_shift = 12
sb_seg_size = 16
no_formal_ino = 22
no_addr = 22
no_formal_ino = 23
no_addr = 23
no_formal_ino = 26
no_addr = 26
sb_lockproto = lock_dlm
sb_locktable = alpha_cluster:dbc1
no_formal_ino = 24
no_addr = 24
no_formal_ino = 25
no_addr = 25
sb_reserved =
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
################################################################################

c) I am not quite sure what you mean by the sysrq?
I have not been able to recreate this problem. I've tried combinations of heavy IO and flocks but haven't seen this. It'd be very helpful if you could give me a step-by-step of how to recreate it. The messages you are seeing are triggered by the flock code, but I don't see any flocking in your test case. A list of processes running at the time of the bug would help too.

Here are simple instructions for using the "magic sysrq" in case you're unfamiliar:

1. Turn it on by doing: echo "1" > /proc/sys/kernel/sysrq
2. Recreate your problem
3. If you're at the system console with a keyboard, do:
     alt-sysrq t (task list)
   If you have a telnet console instead, do:
     ctrl-] to get the telnet> prompt
     telnet> send brk (send a break char)
     t (task list)
   If you don't have a keyboard or telnet, but do have a shell:
     echo "t" > /proc/sysrq-trigger
   If you're doing it from a minicom, use <ctrl-a>f followed by t.
   (For other types of serial consoles, you have to get it to send a break, then the letter t.)
4. The task info will be dumped to the console, so hopefully you have a way to save that off.
I have set up the sysrq and will be waiting for the problem to happen. It seems to be a problem that we hit intermittently.
We have actually noticed this happening quite frequently on our GFS-to-NFS servers (the GFS mount is exported through NFS) when the GFS filesystem runs out of space. This is not noticed through df, but gfs_tool df reports the data space at 100% usage. Our NFS threads die during file creation. This tends to creep up during high I/O.
-Anand
Created attachment 140729 [details] Debug patch for flock issue

Anand, please try out this debug patch. It is against gfs-kernel/src/gfs/ in the RHEL4 CVS branch. It prints the error code for the flock error that is tripping the assert, which should give us more info. --Thanks
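For readers without access to the attachment, the patch boils down to something like the following (a sketch only, not the patch itself; the local variable names are illustrative). The idea is simply to log the return value of flock_lock_file_wait() before the existing "!error" assertion in do_flock() fires:

    /* Hypothetical illustration, not the attached patch.  'file', 'fl'
     * and 'error' stand in for the real locals in do_flock(). */
    error = flock_lock_file_wait(file, fl);
    if (error)
            printk("GFS: error %d from flock_lock_file_wait()\n", error);
    /* ...the existing assertion on !error follows here... */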
I've compiled GFS-kernel with this patch and got the following output:

GFS: error -11 from flock_lock_file_wait()
GFS: fsid=cluster1:var.5: warning: assertion "!error" failed
GFS: fsid=cluster1:var.5: function = do_flock
GFS: fsid=cluster1:var.5: file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/hugemem/src/gfs/ops_file.c, line = 1690
GFS: fsid=cluster1:var.5: time = 1184954580

So flock_lock_file_wait() is returning -EAGAIN?
From what I can make out from the code, GFS is not expecting flock_lock_file_wait() to return EAGAIN at that point: GFS assumes that once it holds a glock on the file, the VFS-level lock shouldn't fail. Are you running NFS on GFS? Can you reproduce this problem reliably? If yes, can you upload some test programs? Also, when you hit it again, can you collect the output of 'cat /proc/locks' and 'gfs_tool lockdump'?
Created attachment 161310 [details] gfs_tool lockdump of my GFS filesystem
Created attachment 161311 [details] Output of /proc/locks

Here is the output of /proc/locks.
-Anand
Created attachment 161339 [details] Bug reproduction script

I've played around a little and found a way to reproduce the error. I'm attaching a perl script that does the trick, but all it does is (a rough C equivalent is sketched below):
1. Open a file.
2. Get an exclusive flock on the filehandle.
3. Open the file again with a new filehandle.
4. Attempt an exclusive, non-blocking flock on the new filehandle.
This sounds a bit perverse, but I imagine a real-world scenario would involve multi-threading.
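For reference, the steps above translate to roughly this C program (a sketch only; the attached script is the Perl version, and the mount path here is made up):

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/file.h>

    int main(void)
    {
        const char *path = "/mnt/gfs/flock-test";  /* path is made up */
        int fd1, fd2;

        /* 1. Open a file and take an exclusive flock on it. */
        fd1 = open(path, O_CREAT | O_RDWR, 0644);
        if (fd1 < 0 || flock(fd1, LOCK_EX) < 0) {
            perror("first open/flock");
            return 1;
        }

        /* 2. Open the same file again with a new filehandle. */
        fd2 = open(path, O_RDWR);
        if (fd2 < 0) {
            perror("second open");
            return 1;
        }

        /* 3. Attempt an exclusive, non-blocking flock on the new
         * filehandle.  This fails with EWOULDBLOCK (same value as
         * EAGAIN, i.e. the -11 seen in the kernel log), and on the
         * affected GFS it also trips the do_flock assertion. */
        if (flock(fd2, LOCK_EX | LOCK_NB) < 0)
            fprintf(stderr, "second flock failed: %s\n", strerror(errno));

        close(fd2);
        close(fd1);
        return 0;
    }

Running it against a file on the GFS mount should be enough to make the do_flock warning show up in that node's log.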
*** This bug has been marked as a duplicate of 272301 ***