221729 – Deadlock still in copy process with gfs2 volume

Summary: Deadlock still in copy process with gfs2 volume

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	6
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Steve Whitehouse
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-01-06 23:40 UTC by Gary Lindstrom
Modified:	2007-11-30 22:11 UTC
CC List:	2 users
Fixed In Version:	2.6.20-1.2943
Clone Of:
Environment:
Last Closed:	2007-04-11 20:23:28 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
backtrace of copy process dead, but should not be as it has not completed... (120.49 KB, application/octet-stream) 2007-01-06 23:40 UTC, Gary Lindstrom
Another backtrace (120.49 KB, application/octet-stream) 2007-01-08 23:05 UTC, Gary Lindstrom
Backtrace with development kernel (117.42 KB, application/octet-stream) 2007-01-10 05:53 UTC, Gary Lindstrom
patch to fix lockdep annotations (641 bytes, patch) 2007-01-15 14:14 UTC, Christine Caulfield
New backtrace showing different things (120.64 KB, application/octet-stream) 2007-01-20 23:03 UTC, Gary Lindstrom
New messages file with backtrace.... sorry (299.86 KB, application/octet-stream) 2007-03-16 06:08 UTC, Gary Lindstrom

Description Gary Lindstrom 2007-01-06 23:40:55 UTC

Description of problem:

Copy process from an ext3 volume to gfs2 volume hangs (deadlock I suppose).

Version-Release number of selected component (if applicable):

kernel-2.6.18-1.2869.fc6 and all latest fc6 updates

I had also been having problems with the fc6 versions of the following packages,
so I grabbed these RPMs from EL5 beta 2 and tried them, with no change:

cman-2.0.35-2.el5.i386.rpm
device-mapper-1.02.12-2.el5.i386.rpm
gfs2-utils-0.1.14-1.el5.i386.rpm
gfs-utils-0.1.7-1.el5.i386.rpm
gnbd-1.1.4-2.el5.i386.rpm
lvm2-2.02.12-7.el5.i386.rpm
lvm2-cluster-2.02.12-7.el5.i386.rpm
openais-0.80.1-15.el5.i386.rpm
rgmanager-2.0.16-1.i386.rpm

How reproducible:

Intermittent...

Steps to Reproduce:
1.  Copy lots of data from ext3 volume to gfs2 volume...

  
Actual results:
Hang (deadlock I suppose)

Expected results:
Copy should finish

Additional info:

Attaching a backtrace to this BZ for you to look at.  Steve, maybe the kernel I
have doesn't have the patches in it yet, and if so, just let me know.  If it is
supposed to have everything in it, we still have some problem somewhere.  Also, I
still think there may be a problem with ACL support on GFS2, but I'll track that
one later...  until this problem goes away, I can't say for sure...

Note that this is on a 3 node cluster, 3TB storage, only the one box had the
gfs2 volume mounted although all were part of the cluster.  All running FC6 with
same config.

Comment 1 Gary Lindstrom 2007-01-06 23:40:55 UTC

Created attachment 144989
backtrace of copy process dead, but should not be as it has not completed...

Comment 2 Steve Whitehouse 2007-01-08 15:59:09 UTC

The FC6 kernel is rather behind still, I'm afraid. The upstream -git tree and the
RHEL5 (beta) kernels are the most up to date with regard to bug fixes. Russell is
still looking into the deadlock and I'll try and have another look at it too as
soon as I can.

Comment 4 Gary Lindstrom 2007-01-08 19:54:45 UTC

Thanks Steve...  I had a couple of FC6 kernel updates come through and thought that
maybe they were supposed to be in there.  I also wasn't sure if these were
in-progress/fixed/on-hold because I had (wrongly) combined a couple problems in
the same BZ....  Probably the ACL problem is in the same category...  As you can
see I took some of the EL5 Beta2 updates and used on a fresh fc6 install, so
maybe I'll try the kernel too... Thanks Again!

Comment 5 Russell Cattelan 2007-01-08 20:56:49 UTC

It is possible this is the same problem as the following bug; hard to say at this time.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217356

Since running anything beyond a simple dd will result
in a gfs2 deadlock, it might all be related, or it might not
be.

I haven't made any progress on 217356.

Comment 6 Gary Lindstrom 2007-01-08 23:05:34 UTC

Created attachment 145119
Another backtrace

Same problem with the EL5 Beta2 kernel...  Attaching another backtrace in case
it is useful...

Comment 7 Gary Lindstrom 2007-01-10 05:53:54 UTC

Created attachment 145223
Backtrace with development kernel

Got this idea to try the latest development kernel... same problem, but the
backtrace shows the locks being held...  maybe it will help lead to the
problem... so basically I have a clean fc6, with the previously referenced el5
updates for gfs and clustering, and the development kernel...  The following is
the sequence used to reproduce the problem and I am attaching the corresponding
output from a backtrace:


[root@spool7 ~]# time mkfs.gfs2 -r 2048 -j 16 -p lock_dlm -t fpcl01:vg00lv00 /dev/mapper/fpcl01vg00-fpcl01vg00lv00
This will destroy any data on /dev/mapper/fpcl01vg00-fpcl01vg00lv00.
  It appears to contain a gfs2 filesystem.

Are you sure you want to proceed? [y/n] y

Device: 		   /dev/mapper/fpcl01vg00-fpcl01vg00lv00
Blocksize:		   4096
Device Size		   3019.94 GB (791658496 blocks)
Filesystem Size:	   3019.94 GB (791658496 blocks)
Journals:		   16
Resource Groups:	   1510
Locking Protocol:	   "lock_dlm"
Lock Table:		   "fpcl01:vg00lv00"


real	2m1.198s
user	1m23.715s
sys	0m4.904s
[root@spool7 ~]#
[root@spool7 ~]# mount -t ext3 -r -o defaults /dev/mapper/fpcl01vg01-fpcl01vg01lv00 /mnt/fpcl01vg01lv00
[root@spool7 ~]# mount -t gfs2 -o defaults /dev/mapper/fpcl01vg00-fpcl01vg00lv00 /mnt/fpcl01vg00lv00
[root@spool7 ~]# df
Filesystem	     1K-blocks	    Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
		      15109112	 3271208  11058028  23% /
/dev/cciss/c0d0p1	101086	   25993     69874  28% /boot
tmpfs		       1031512	       0   1031512   0% /dev/shm
/dev/mapper/fpcl01vg01-fpcl01vg01lv00
		     420080288	61923572 336817788  16% /mnt/fpcl01vg01lv00
/dev/mapper/fpcl01vg00-fpcl01vg00lv00
		     3166434592    542192 3165892400   1% /mnt/fpcl01vg00lv00
[root@spool7 ~]# cd //mnt/fpcl01vg00lv00
[root@spool7 fpcl01vg00lv00]# cp -ax /mnt/fpcl01vg01lv00 .

Comment 8 Steve Whitehouse 2007-01-10 09:28:53 UTC

I'm reassigning this to Patrick on the basis of the latest dmesg in comment #7
which appears to implicate the DLM.

Patrick, if you are able to rule out the DLM, then please reassign back to one
of us.

Comment 9 Christine Caulfield 2007-01-12 13:14:06 UTC

I'm not sure whether the messages relating to the DLM are accurate or not. The
sock_sem that is locked in the middle of accept_from_sock will always be a
different one from the one that is locked at the start of that function - though
the message implies it's the same.

I have emailed Ingo to see if lockdep could simply be getting confused and leading
us up a blind alley.
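
As background to the point above: lockdep validates locking by lock class rather than by individual lock instance, and locks initialised at the same place share a class. Two different sock_sem instances taken one inside the other can therefore look like the same lock being taken twice, unless the inner acquisition is annotated with a distinct subclass. The following is a toy Python model of that behaviour only, not kernel code and not a description of what the attached patch does; the ToyLockdep class and its methods are hypothetical names made up for illustration.

```python
# Toy model of class-based lock tracking. Hypothetical names throughout;
# this is not the kernel's lockdep implementation.

class ToyLockdep:
    def __init__(self):
        # (lock_class, subclass) pairs currently held by this task
        self.held = []

    def acquire(self, lock_class, subclass=0):
        """Record an acquisition; return True if it looks recursive,
        i.e. the same class and subclass is already held."""
        key = (lock_class, subclass)
        looks_recursive = key in self.held
        self.held.append(key)
        return looks_recursive

    def release(self, lock_class, subclass=0):
        self.held.remove((lock_class, subclass))


# Without annotation: two different instances of the same class are
# indistinguishable, so the second acquisition is flagged.
plain = ToyLockdep()
plain.acquire("sock_sem")                    # listening connection
plain_warning = plain.acquire("sock_sem")    # newly accepted connection

# With a distinct subclass on the inner lock, the nesting is accepted.
annotated = ToyLockdep()
annotated.acquire("sock_sem")
annotated_warning = annotated.acquire("sock_sem", subclass=1)
```

In this model plain_warning is True and annotated_warning is False, which matches the sense in which a warning can be spurious: the tracker cannot tell the two instances apart until it is told that they nest.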

Comment 10 Christine Caulfield 2007-01-15 14:14:52 UTC

Created attachment 145576
patch to fix lockdep annotations

I think the lockdep warnings are spurious. To confirm this, could you please
apply the attached patch? If this gets rid of the warnings
then it seems likely that the DLM lowcomms is not the culprit.

Comment 11 Gary Lindstrom 2007-01-20 23:03:33 UTC

Created attachment 146066
New backtrace showing different things

Upgraded to kernel-2.6.19-1.2895.fc6 - same results, copy process deadlocks -
maybe some additional info/clues in this dmesg with backtrace?  There is a
fatal assertion right before the backtrace...

Comment 12 Russell Cattelan 2007-01-22 19:20:06 UTC

GFS2 has several lock ordering problems right now that result in ABBA-type
deadlocks.

The trace you have attached does not look familiar so it is hard to say
if it is the same problem as the one I am chasing.

It's unclear how long it is going to take to find all the lock ordering
issues, but it could potentially be quite a while.
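
For readers unfamiliar with the term used in this comment: an ABBA deadlock arises when one task takes lock A and then B while another takes B and then A, so each can end up waiting for the lock the other holds. The following is a minimal Python sketch of a lock-order checker that flags such an inversion. It is not GFS2 code; the LockOrderChecker class and the lock names "A" and "B" are hypothetical, and the sketch only catches direct two-lock inversions, not longer cycles.

```python
# Illustrative only: a toy lock-order checker, not GFS2 or kernel code.
from collections import defaultdict


class LockOrderChecker:
    """Records 'held -> newly taken' orderings and reports a direct inversion."""

    def __init__(self):
        # lock name -> set of locks that were taken while it was held
        self.edges = defaultdict(set)

    def acquire(self, held, new):
        """Record that `new` is taken while the locks in `held` are held.

        Returns the pair (new, h) if the opposite order (new held, then h
        taken) was recorded earlier, otherwise None.
        """
        for h in held:
            if h in self.edges[new]:
                return (new, h)
            self.edges[h].add(new)
        return None


checker = LockOrderChecker()
first = checker.acquire(held=["A"], new="B")    # task 1: A, then B
second = checker.acquire(held=["B"], new="A")   # task 2: B, then A
```

Here first is None, because no ordering was known yet, and second is ("A", "B"), because the order A then B had already been recorded. Fixing this class of bug usually means making every code path take the locks in one agreed order.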

Comment 13 Steve Whitehouse 2007-03-15 10:40:48 UTC

This should be fixed in the latest upstream kernel and also the latest FC-6
kernel. Please let me know if this is still a problem.

Comment 14 Gary Lindstrom 2007-03-16 06:08:44 UTC

Created attachment 150197
New messages file with backtrace.... sorry

You guys are gonna hate me... :(  I can still make it deadlock...  sorry.  All
three machines in cluster are fc6 with all latest updates, specifically
kernel-2.6.20-1.2925.fc6.  Attached is the messages file from the latest boot
with a backtrace that I generated after the deadlock.  Let me know if you need
more info.

Comment 15 Steve Whitehouse 2007-03-19 09:35:15 UTC

That's fixed upstream, but it hasn't made it to FC yet. I really didn't expect you
to run into that as well :( I'll let you know when that has made it into FC-6 too.

Comment 16 Steve Whitehouse 2007-04-05 09:50:20 UTC

It should be in FC-6 by now: kernel-2.6.20-1.2943 or later.
