Red Hat Bugzilla – Bug 221729
Deadlock still in copy process with gfs2 volume
Last modified: 2007-11-30 17:11:52 EST
Description of problem:
Copy process from an ext3 volume to gfs2 volume hangs (deadlock I suppose).
Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2869.fc6 and all latest fc6 updates
Also, I have been having problems with the fc6 versions of the following, so I
grabbed the following RPMs from EL5 beta 2 and tried them, with no change:
Steps to Reproduce:
1. Copy lots of data from ext3 volume to gfs2 volume...
Actual results:
Hang (deadlock I suppose)
Expected results:
Copy should finish
Attaching a backtrace to this BZ for you to look at. Steve, maybe the kernel I
have doesn't have the patches in it yet, and if so, just let me know. If it is
supposed to have everything in it, we still have some problem somewhere. Also, I
still think there may be a problem with ACL support on GFS2, but I'll track that
one later... until this problem goes away, I can't say for sure...
Note that this is on a 3 node cluster, 3TB storage, only the one box had the
gfs2 volume mounted although all were part of the cluster. All running FC6 with
Created attachment 144989 [details]
backtrace of copy process dead, but should not be as it has not completed...
The FC6 kernel is rather behind still I'm afraid. The upstream -git tree and the
RHEL5 (beta) kernels are the most up to date with regard to bug fixes. Russell is
still looking into the deadlock and I'll try and have another look at it too as
soon as I can.
Thanks Steve... I had a couple FC6 kernel updates come through and thought that
maybe they were supposed to be in there. I also wasn't sure if these were
in-progress/fixed/on-hold because I had (wrongly) combined a couple problems in
the same BZ.... Probably the ACL problem is in the same category... As you can
see I took some of the EL5 Beta2 updates and used on a fresh fc6 install, so
maybe I'll try the kernel too... Thanks Again!
It is possible this is the same problem; hard to say at this time.
Since running anything beyond a simple dd will result
in a gfs2 deadlock, it might all be related? or I might not
I haven't made any progress on 217356.
Created attachment 145119 [details]
Same problem with the EL5 Beta2 kernel... Attaching another backtrace in case
it is useful...
Created attachment 145223 [details]
Backtrace with development kernel
Got this idea to try the latest development kernel... same problem, but the
backtrace shows the locks being held... maybe it will help lead to the
problem... so basically I have a clean fc6, with the previously referenced el5
updates for gfs and clustering, and the development kernel... The following is
the sequence used to create problem and I am attaching the corresponding output
from a backtrace:
[root@spool7 ~]# time mkfs.gfs2 -r 2048 -j 16 -p lock_dlm -t fpcl01:vg00lv00
This will destroy any data on /dev/mapper/fpcl01vg00-fpcl01vg00lv00.
It appears to contain a gfs2 filesystem.
Are you sure you want to proceed? [y/n] y
Device Size 3019.94 GB (791658496 blocks)
Filesystem Size: 3019.94 GB (791658496 blocks)
Resource Groups: 1510
Locking Protocol: "lock_dlm"
Lock Table: "fpcl01:vg00lv00"
[root@spool7 ~]# mount -t ext3 -r -o defaults
[root@spool7 ~]# mount -t gfs2 -o defaults
[root@spool7 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
15109112 3271208 11058028 23% /
/dev/cciss/c0d0p1 101086 25993 69874 28% /boot
tmpfs 1031512 0 1031512 0% /dev/shm
420080288 61923572 336817788 16% /mnt/fpcl01vg01lv00
3166434592 542192 3165892400 1% /mnt/fpcl01vg00lv00
[root@spool7 ~]# cd //mnt/fpcl01vg00lv00
[root@spool7 fpcl01vg00lv00]# cp -ax /mnt/fpcl01vg01lv00 .
I'm reassigning this to Patrick on the basis of the latest dmesg in comment #7
which appears to implicate the DLM.
Patrick, if you are able to rule out the DLM, then please reassign back to one
I'm not sure whether the messages relating to the DLM are accurate or not. The
sock_sem that is locked in the middle of accept_from_sock will always be a
different one from the one that is locked at the start of that function - though
the message implies it's the same.
Have emailed Ingo to see if lockdep could be simply getting confused and leading
us up a blind alley.
Created attachment 145576 [details]
patch to fix lockdep annotations
I think the lockdep warnings are spurious. To confirm this, could you please
apply the attached patch? If this gets rid of the warnings, then it seems
likely that the DLM lowcomms is not the culprit.
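For readers unfamiliar with why such a warning can be spurious: a validator that tracks lock classes rather than individual lock instances will flag two nested locks of the same class even when they belong to different objects. The sketch below is a toy model in Python, not kernel code and not the attached patch (whose contents are not shown in this report). ToyLockdep and simulate_accept are hypothetical names; only sock_sem and accept_from_sock come from the comments above. It assumes the fix takes the form of marking the inner lock with a distinct subclass.

```python
class ToyLockdep:
    """Toy validator that tracks lock classes, not lock instances (illustrative only)."""

    def __init__(self):
        self.held = []       # (lock_class, subclass) keys currently held
        self.warnings = []   # messages recorded by the toy checker

    def acquire(self, lock_class, subclass=0):
        key = (lock_class, subclass)
        if key in self.held:
            # Same class and subclass already held: looks like recursion,
            # even if the two locks live in different objects.
            self.warnings.append("same-class nesting on " + lock_class)
        self.held.append(key)

    def release(self, lock_class, subclass=0):
        self.held.remove((lock_class, subclass))


def simulate_accept(annotate_nested):
    """Model the nesting described for accept_from_sock (hypothetical helper)."""
    checker = ToyLockdep()
    # sock_sem of the listening connection, taken at the start.
    checker.acquire("sock_sem")
    # sock_sem of the newly accepted connection: a different instance,
    # but the same class as far as the checker can tell.
    inner_subclass = 1 if annotate_nested else 0
    checker.acquire("sock_sem", subclass=inner_subclass)
    checker.release("sock_sem", subclass=inner_subclass)
    checker.release("sock_sem")
    return checker.warnings


if __name__ == "__main__":
    print(simulate_accept(annotate_nested=False))
    print(simulate_accept(annotate_nested=True))
```

In this model the unannotated run records one warning and the annotated run records none, which is the outcome the patch test above is looking for.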
Created attachment 146066 [details]
New backtrace showing different things
Upgraded to kernel-2.6.19-1.2895.fc6 - same results, copy process deadlocks -
maybe some additional info/clues in this dmesg with backtrace? There is a
fatal assertion right before the backtrace...
GFS2 has several lock ordering problems right now that result in ABBA type of
The trace you have attached does not look familiar so it is hard to say
if it is the same problem as the one I am chasing.
It's unclear how long it is going to take to find all the lock ordering
issues, but it could potentially be quite a while.
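As a minimal illustration of an ABBA lock ordering problem (a user-space Python sketch, not GFS2 code): two threads each take one lock and then want the lock the other holds. The names run_pair, t1, t2, A and B are hypothetical. A timeout is used so the example terminates; in the kernel case there is no timeout and both tasks would simply hang.

```python
import threading


def run_pair(order_t1, order_t2, timeout=0.3):
    """Run two threads that take locks "A" and "B" in the given orders.

    Returns the sorted names of the threads that could not get their second lock.
    """
    locks = {"A": threading.Lock(), "B": threading.Lock()}
    # When the threads start with different locks, a barrier forces the bad
    # interleaving so the outcome is deterministic for the demonstration.
    force_race = order_t1[0] != order_t2[0]
    barrier = threading.Barrier(2) if force_race else None
    stuck = []

    def worker(name, order):
        first, second = locks[order[0]], locks[order[1]]
        with first:
            if barrier:
                barrier.wait()  # both threads now hold their first lock
            if second.acquire(timeout=timeout):
                second.release()
            else:
                stuck.append(name)  # would block forever without the timeout
            if barrier:
                barrier.wait()  # keep the first lock until both attempts finish

    threads = [
        threading.Thread(target=worker, args=("t1", order_t1)),
        threading.Thread(target=worker, args=("t2", order_t2)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print(run_pair("AB", "BA"))  # opposite orders: both threads get stuck
    print(run_pair("AB", "AB"))  # one consistent order: nobody gets stuck
```

The second call shows the usual remedy: if every path takes the locks in the same order, the second thread just waits for the first lock and the cycle cannot form.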
This should be fixed in the latest upstream kernel and also the latest FC-6
kernel. Please let me know if this is still a problem.
Created attachment 150197 [details]
New messages file with backtrace.... sorry
You guys are gonna hate me... :( I can still make it deadlock... sorry. All
three machines in cluster are fc6 with all latest updates, specifically
kernel-2.6.20-1.2925.fc6. Attached is the messages file from the latest boot
with a backtrace that I generated after the deadlock. Let me know if you need
That's fixed upstream, but it's not made it to FC yet. I really didn't expect you
to run into that as well :( I'll let you know when that has made it into FC-6 too.
It should be in FC-6 by now: kernel-2.6.20-1.2943 or later.