Created attachment 132203 [details] /var/log/messages output from one machine showing the error.
Description of problem:
My logs seem to be full of an assertion error. It is seen under high workloads: if I have 4 machines doing high I/O to a SAN device, I see these errors.

Version-Release number of selected component (if applicable):
RHEL4 kernel 2.6.9-34.0.1.ELsmp using rpms from up2date
GFS-kernel-2.6.9-49.1
GFS-6.1.5-0
Running 32-bit Linux on a 4-way HP DL585, with a QLogic HBA connected to a CX700 SAN system.

How reproducible:
Intermittently, during periods of high I/O.

Steps to Reproduce:
1. Have 4 machines simultaneously run, in the same GFS directory:
   time dd if=/dev/zero of=/mount/path/gfs/usr/file.test.machine-name-1 bs=4096 count=8388608
2. Any other sustained high I/O will also do it.

Actual results:
The write test took 26 mins and my logs were filled with:
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: warning: assertion "!error" failed
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: function = do_flock
Jul 10 14:39:02 pa-dev101 kernel: GFS: fsid=alpha_cluster:dbc1.0: file = /usr/src/build/751518-i686/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/ops_file.c, line = 1667

Expected results:
The write test takes 26 mins without any assertion warnings in the logs.

Additional info:
The messages are not in the logs of all the machines, and they are only seen during periods of high I/O.
I'm unable to reproduce this problem. I doubt dd is doing any flocks at all; there could be other process(es) triggering this. It'd be great if you could provide more info:
a) the mkfs command line used to create the GFS filesystem (it is unclear which locking module, gulm or dlm, you're running)
b) the output of 'gfs_tool sb <device> all'
c) the output of sysrq for running processes, memory info, etc.
Created attachment 132627 [details] Sysreport Output of one of my machines.
I attached the output of sysreport.

a) I don't know what mkfs.gfs options I used to create the FS. The lock manager in use is dlm.

b) The output of gfs_tool sb is:
##################################################################################
[root@pa-dev101 abhattacharya]# gfs_tool sb /dev/dbc1/dbc1 all
mh_magic = 0x01161970
mh_type = 1
mh_generation = 0
mh_format = 100
mh_incarn = 0
sb_fs_format = 1309
sb_multihost_format = 1401
sb_flags = 0
sb_bsize = 4096
sb_bsize_shift = 12
sb_seg_size = 16
no_formal_ino = 22
no_addr = 22
no_formal_ino = 23
no_addr = 23
no_formal_ino = 26
no_addr = 26
sb_lockproto = lock_dlm
sb_locktable = alpha_cluster:dbc1
no_formal_ino = 24
no_addr = 24
no_formal_ino = 25
no_addr = 25
sb_reserved =
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
################################################################################

c) I am not quite sure what you mean by the sysrq?
I have not been able to recreate this problem. I've tried combinations of heavy IO and flocks but haven't seen this. It'd be very helpful if you could give me a step-by-step of how to recreate it. The messages you are seeing are triggered by the flock code, but I don't see any flocking in your test case. A list of processes running at the time of the bug would help too.

Here are simple instructions for using the "magic sysrq" in case you're unfamiliar:

1. Turn it on by doing: echo "1" > /proc/sys/kernel/sysrq
2. Recreate your problem
3. If you're at the system console with a keyboard, do:
     alt-sysrq t (task list)
   If you have a telnet console instead, do:
     ctrl-] to get the telnet> prompt
     telnet> send brk (send a break char)
     t (task list)
   If you don't have a keyboard or telnet, but do have a shell:
     echo "t" > /proc/sysrq-trigger
   If you're doing it from a minicom, use <ctrl-a>f followed by t.
   (For other types of serial consoles, you have to get it to send a break, then the letter t.)
4. The task info will be dumped to the console, so hopefully you have a way to save that off.
I have set up the sysrq and will be waiting for the problem to happen. It seems to be a problem that we hit intermittently.
We have actually noticed this happening quite frequently on our GFS-to-NFS servers (the GFS mount is exported through NFS) when the GFS filesystem runs out of space. This is not noticed through df, but gfs_tool df reports the data space at 100% usage. Our NFS threads die during file creation. This tends to creep up during high I/O.
-Anand
Created attachment 140729 [details] Debug patch for flock issue

Anand, please try out this debug patch. It is against gfs-kernel/src/gfs/ in the RHEL4 CVS branch. It prints the error code for the flock error that is tripping the assert, which should give us more info. --Thanks
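For readers without access to the attachment, the patch boils down to something like the following (a sketch only, not the patch itself; the local variable names are illustrative). The idea is simply to log the return value of flock_lock_file_wait() before the existing "!error" assertion in do_flock() fires:

    /* Hypothetical illustration, not the attached patch.  'file', 'fl'
     * and 'error' stand in for the real locals in do_flock(). */
    error = flock_lock_file_wait(file, fl);
    if (error)
            printk("GFS: error %d from flock_lock_file_wait()\n", error);
    /* ...the existing assertion on !error follows here... */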
I've compiled GFS-kernel with this patch and got the following output:

GFS: error -11 from flock_lock_file_wait()
GFS: fsid=cluster1:var.5: warning: assertion "!error" failed
GFS: fsid=cluster1:var.5: function = do_flock
GFS: fsid=cluster1:var.5: file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/hugemem/src/gfs/ops_file.c, line = 1690
GFS: fsid=cluster1:var.5: time = 1184954580

So flock_lock_file_wait() is returning -EAGAIN?
From what I can make out from the code, GFS is not expecting flock_lock_file_wait() to return EAGAIN at that point: GFS assumes that once it holds a glock on the file, the VFS-level lock shouldn't fail. Are you running NFS on GFS? Can you reproduce this problem reliably? If yes, can you upload some test programs? Also, when you hit it again, can you collect the output of 'cat /proc/locks' and 'gfs_tool lockdump'?
Created attachment 161310 [details] gfs_tool lockdump of my GFS filesystem
Created attachment 161311 [details] Output of /proc/locks

Here is the output of /proc/locks.
-Anand
Created attachment 161339 [details] Bug reproduction script

I've played around a little and found a way to reproduce the error. I'm attaching a perl script that does the trick, but all it does is (a rough C equivalent is sketched below):
1. Open a file.
2. Get an exclusive flock on the filehandle.
3. Open the file again with a new filehandle.
4. Attempt an exclusive, non-blocking flock on the new filehandle.
This sounds a bit perverse, but I imagine a real-world scenario would involve multi-threading.
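For reference, the steps above translate to roughly this C program (a sketch only; the attached script is the Perl version, and the mount path here is made up):

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/file.h>

    int main(void)
    {
        const char *path = "/mnt/gfs/flock-test";  /* path is made up */
        int fd1, fd2;

        /* 1. Open a file and take an exclusive flock on it. */
        fd1 = open(path, O_CREAT | O_RDWR, 0644);
        if (fd1 < 0 || flock(fd1, LOCK_EX) < 0) {
            perror("first open/flock");
            return 1;
        }

        /* 2. Open the same file again with a new filehandle. */
        fd2 = open(path, O_RDWR);
        if (fd2 < 0) {
            perror("second open");
            return 1;
        }

        /* 3. Attempt an exclusive, non-blocking flock on the new
         * filehandle.  This fails with EWOULDBLOCK (same value as
         * EAGAIN, i.e. the -11 seen in the kernel log), and on the
         * affected GFS it also trips the do_flock assertion. */
        if (flock(fd2, LOCK_EX | LOCK_NB) < 0)
            fprintf(stderr, "second flock failed: %s\n", strerror(errno));

        close(fd2);
        close(fd1);
        return 0;
    }

Running it against a file on the GFS mount should be enough to make the do_flock warning show up in that node's log.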
*** This bug has been marked as a duplicate of 272301 ***