Bug 198302 - Error : warning: assertion "!error" failed
Status: CLOSED DUPLICATE of bug 272301
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Platform: All Linux
Priority: medium   Severity: medium
Assigned To: Abhijith Das (GFS Bugs)
Depends On: 272301
Reported: 2006-07-10 18:13 EDT by Anand
Modified: 2010-01-11 22:11 EST
Doc Type: Bug Fix
Last Closed: 2007-09-25 09:40:27 EDT

Attachments
- /var/log/messages output from one machine showing the error. (4.62 KB, application/octet-stream) 2006-07-10 18:13 EDT, Anand
- Sysreport Output of one of my machines. (887.49 KB, application/x-bzip) 2006-07-18 19:26 EDT, Anand
- Debug patch for flock issue (623 bytes, patch) 2006-11-08 18:57 EST, Abhijith Das
- gfs_tool lockdump of my GFS filesystem (1.98 MB, application/octet-stream) 2007-08-14 17:20 EDT, Anand
- Output of /proc/locks (843 bytes, application/octet-stream) 2007-08-14 17:21 EDT, Anand
- Bug reproduction script (190 bytes, text/plain) 2007-08-15 06:11 EDT, Robert Clark

Description Anand 2006-07-10 18:13:46 EDT
Created attachment 132203 [details]
/var/log/messages output from one machine showing the error.
Comment 1 Anand 2006-07-10 18:13:46 EDT
Description of problem:

My logs are filling up with an assertion error. It appears under high
workloads: if I have 4 machines doing heavy I/O to a SAN device, I see
these errors.

Version-Release number of selected component (if applicable):

RHEL4   Kernel 2.6.9-34.0.1.ELsmp
using rpms from up2date 
GFS-kernel-2.6.9-49.1
GFS-6.1.5-0

Running 32-bit Linux on a 4-way HP DL-585, with a QLogic HBA connected to a
CX700 SAN system.

How reproducible:


Steps to Reproduce:
1. Have 4 machines simultaneously run, in the same directory:

   time dd if=/dev/zero of=/mount/path/gfs/usr/file.test.machine-name-1 bs=4096 count=8388608

2. Any high I/O also triggers it.
  
Actual results:

The write test took 26 mins, and my logs were filled with:

Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: warning: assertion "!error" failed
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0:   function = do_flock
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0:   file = /usr/src/build/751518-i686/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/ops_file.c, line = 1667

The errors are not in the logs of all the machines, and they are seen only
during periods of high I/O.

Expected results:

The write test completes (it took 26 mins) without assertion errors filling
the logs.

Additional info:
Comment 2 Abhijith Das 2006-07-18 15:47:36 EDT
I'm unable to reproduce this problem. I doubt dd is doing any flocks at all;
some other process(es) could be triggering this.
It'd be great if you could provide more info:
a) the mkfs command line used to create the gfs fs. It's unclear which
locking module (gulm/dlm) you're running.
b) the output of 'gfs_tool sb <device> all'
c) the output of sysrq for running processes, memory info, etc.
Comment 3 Anand 2006-07-18 19:26:36 EDT
Created attachment 132627 [details]
Sysreport Output of one of my machines.
Comment 4 Anand 2006-07-18 19:46:42 EDT
I attached the output of sysreport. 

a)  I don't know what mkfs.gfs options I used to create the FS. The lock manager
in use is dlm.
b) The output of gfs_tool sb is
##################################################################################
[root@pa-dev101 abhattacharya]# gfs_tool sb /dev/dbc1/dbc1  all
  mh_magic = 0x01161970
  mh_type = 1
  mh_generation = 0
  mh_format = 100
  mh_incarn = 0
  sb_fs_format = 1309
  sb_multihost_format = 1401
  sb_flags = 0
  sb_bsize = 4096
  sb_bsize_shift = 12
  sb_seg_size = 16
  no_formal_ino = 22
  no_addr = 22
  no_formal_ino = 23
  no_addr = 23
  no_formal_ino = 26
  no_addr = 26
  sb_lockproto = lock_dlm
  sb_locktable = alpha_cluster:dbc1
  no_formal_ino = 24
  no_addr = 24
  no_formal_ino = 25
  no_addr = 25
  sb_reserved =
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
################################################################################

c) I am not quite sure what you mean by the sysrq?
Comment 5 Abhijith Das 2006-08-28 17:08:41 EDT
I have not been able to recreate this problem. I've tried combinations of
heavy I/O and flocks but haven't seen it. It'd be very helpful if you could
give me a step-by-step of how to recreate this. The messages you are seeing
are triggered by the flock code, but I don't see any flocking in your test case.
A list of processes running at the time of the bug would help too.

Here are simple instructions for using the "magic sysrq" in case you're unfamiliar:

1. Turn it on by doing:
   echo "1" >  /proc/sys/kernel/sysrq
2. Recreate your problem
3. If you're at the system console with a keyboard, do alt-sysrq t (task list)
  If you have a telnet console instead, do ctrl-] to get telnet> prompt
  telnet> send brk  (send a break char)
  t (task list)
  If you don't have a keyboard or telnet, but do have a shell:
  echo "t" > /proc/sysrq-trigger
  If you're doing it from a minicom, use: <ctrl-a>f followed by t
(For other types of serial consoles, you have to get it to send a break, then
letter t)
4. The task info will be dumped to the console, so hopefully you have
   a way to save that off. 
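The steps above, using the /proc/sysrq-trigger variant, can be condensed into a short shell session. This is only a sketch: it needs root, and the paths are the standard Linux ones.

```shell
# Step 1: enable the magic sysrq key (needs root).
echo 1 > /proc/sys/kernel/sysrq

# Step 2: recreate the problem, then...

# Step 3: dump the task list via the trigger file.
echo t > /proc/sysrq-trigger

# Step 4: the dump lands in the kernel log; save it off.
dmesg > /tmp/sysrq-tasklist.txt
```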
Comment 6 Anand 2006-09-05 14:47:14 EDT
I have set up the sysrq and will be waiting for the problem to happen. It
seems to be a problem that we hit intermittently.
Comment 8 Anand 2006-10-20 13:37:41 EDT
We have actually noticed this happening quite frequently on our GFS->NFS
servers; the GFS mount is exported through NFS.

It occurs when the GFS filesystem runs out of space. This is not noticed
through df, but through gfs_tool df, where data space reports 100% usage.
Our NFS threads die during file creation.

This would creep up during high I/O.
-Anand
Comment 9 Abhijith Das 2006-11-08 18:57:12 EST
Created attachment 140729 [details]
Debug patch for flock issue

Anand, please try out this debug patch. It is against gfs-kernel/src/gfs/ in
the RHEL4 CVS branch. It prints the error code for the flock error that was
tripping the assert, which should give us more info. --Thanks
Comment 12 Robert Clark 2007-07-23 05:12:47 EDT
I've compiled GFS-kernel with this patch and got the following output:

GFS: error -11 from flock_lock_file_wait()
GFS: fsid=cluster1:var.5: warning: assertion "!error" failed
GFS: fsid=cluster1:var.5:   function = do_flock
GFS: fsid=cluster1:var.5:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-72/hugemem/src/gfs/ops_file.c, line = 1690
GFS: fsid=cluster1:var.5:   time = 1184954580

So flock_lock_file_wait is returning -EAGAIN?
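For reference, error code 11 is EAGAIN on Linux (the same value as EWOULDBLOCK), which matches the "error -11" above. A quick check from Python:

```python
import errno
import os

# On Linux, errno 11 is EAGAIN (aliased to EWOULDBLOCK), so the
# "error -11" printed by the debug patch is -EAGAIN.
print(errno.errorcode[11])           # EAGAIN
print(os.strerror(errno.EAGAIN))
```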
Comment 13 Abhijith Das 2007-08-13 12:28:01 EDT
From what I can make out of the code, GFS is not expecting
flock_lock_file_wait() to return EAGAIN at that point. GFS assumes that once
it gets a glock on the file, a VFS-level lock shouldn't fail.
Are you running NFS on GFS? Can you reproduce this problem reliably? If so,
can you upload some test programs? Also, when you hit it again, can you
collect the output of 'cat /proc/locks' and 'gfs_tool lockdump'?
Comment 14 Anand 2007-08-14 17:20:10 EDT
Created attachment 161310 [details]
gfs_tool lockdump of my GFS filesystem
Comment 15 Anand 2007-08-14 17:21:40 EDT
Created attachment 161311 [details]
Output of /proc/locks

Here is the output of /proc/locks. 
-Anand
Comment 16 Robert Clark 2007-08-15 06:11:36 EDT
Created attachment 161339 [details]
Bug reproduction script

I've played around a little and found a way to reproduce the error. I'm
attaching a Perl script that does the trick, but all it does is:
Open a file.
Get an exclusive flock on the filehandle.
Open the file again with a new filehandle.
Attempt an exclusive, non-blocking flock on the new filehandle.

This sounds a bit perverse, but I imagine a real-world scenario would involve
multi-threading.
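The attached script is Perl, but the same four steps can be sketched in Python. This is an illustrative stand-in, not the attached script, and it relies on Linux flock(2) semantics, where each open() gets an independent lock holder:

```python
import fcntl
import tempfile

# Open a file (a throwaway temp file here; the bug was hit on GFS).
path = tempfile.NamedTemporaryFile(delete=False).name
f1 = open(path, "w")

# Get an exclusive flock on the filehandle.
fcntl.flock(f1, fcntl.LOCK_EX)

# Open the file again with a new filehandle.
f2 = open(path, "w")

# Attempt an exclusive, non-blocking flock on the new filehandle.
# The two descriptors are independent flock holders, so this is
# denied with EAGAIN/EWOULDBLOCK even within a single process.
try:
    fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("second flock unexpectedly succeeded")
except OSError as e:
    print("second flock denied, errno =", e.errno)  # 11 (EAGAIN)
```

On a local filesystem the EWOULDBLOCK is simply returned to the caller; the report here is that GFS's do_flock tripped its assertion on that same return.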
Comment 17 Abhijith Das 2007-09-25 09:40:27 EDT

*** This bug has been marked as a duplicate of 272301 ***
