Description of problem:
Copy process from an ext3 volume to a gfs2 volume hangs (deadlock, I suppose).

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2869.fc6 and all latest fc6 updates.
I have also been having problems with the fc6 versions of the following packages, so I grabbed these RPMs from EL5 beta 2 and tried them, with no change:
cman-2.0.35-2.el5.i386.rpm
device-mapper-1.02.12-2.el5.i386.rpm
gfs2-utils-0.1.14-1.el5.i386.rpm
gfs-utils-0.1.7-1.el5.i386.rpm
gnbd-1.1.4-2.el5.i386.rpm
lvm2-2.02.12-7.el5.i386.rpm
lvm2-cluster-2.02.12-7.el5.i386.rpm
openais-0.80.1-15.el5.i386.rpm
rgmanager-2.0.16-1.i386.rpm

How reproducible:
Intermittent...

Steps to Reproduce:
1. Copy lots of data from ext3 volume to gfs2 volume...

Actual results:
Hang (deadlock, I suppose)

Expected results:
Copy should finish

Additional info:
Attaching a backtrace to this BZ for you to look at. Steve, maybe the kernel I have doesn't have the patches in it yet, and if so, just let me know. If it is supposed to have everything in it, we still have some problem somewhere. Also, I still think there may be a problem with ACL support on GFS2, but I'll track that one later... until this problem goes away, I can't say for sure...
Note that this is on a 3-node cluster with 3TB of storage. Only the one box had the gfs2 volume mounted, although all were part of the cluster. All are running FC6 with the same config.
Created attachment 144989 [details] Backtrace of the copy process, which has stalled but should not have, as the copy has not completed...
The FC6 kernel is still rather behind, I'm afraid. The upstream -git tree and the RHEL5 (beta) kernels are the most up to date with regard to bug fixes. Russell is still looking into the deadlock and I'll try to have another look at it too as soon as I can.
Thanks Steve... I had a couple of FC6 kernel updates come through and thought that maybe the fixes were supposed to be in there. I also wasn't sure if these were in-progress/fixed/on-hold because I had (wrongly) combined a couple of problems in the same BZ.... The ACL problem is probably in the same category... As you can see, I took some of the EL5 Beta2 updates and used them on a fresh fc6 install, so maybe I'll try the kernel too... Thanks Again!
It is possible this is the same problem as the one below; hard to say at this time. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217356 Since running anything beyond a simple dd will result in a gfs2 deadlock, it might all be related, or it might not be. I haven't made any progress on 217356.
Created attachment 145119 [details] Another backtrace Same problem with the EL5 Beta2 kernel... Attaching another backtrace in case it is useful...
Created attachment 145223 [details] Backtrace with development kernel
Got the idea to try the latest development kernel... same problem, but the backtrace shows the locks being held... maybe it will help lead to the problem... So basically I have a clean fc6, with the previously referenced el5 updates for gfs and clustering, and the development kernel... The following is the sequence used to create the problem, and I am attaching the corresponding output from a backtrace:

[root@spool7 ~]# time mkfs.gfs2 -r 2048 -j 16 -p lock_dlm -t fpcl01:vg00lv00 /dev/mapper/fpcl01vg00-fpcl01vg00lv00
This will destroy any data on /dev/mapper/fpcl01vg00-fpcl01vg00lv00.
  It appears to contain a gfs2 filesystem.
Are you sure you want to proceed? [y/n] y
Device:            /dev/mapper/fpcl01vg00-fpcl01vg00lv00
Blocksize:         4096
Device Size        3019.94 GB (791658496 blocks)
Filesystem Size:   3019.94 GB (791658496 blocks)
Journals:          16
Resource Groups:   1510
Locking Protocol:  "lock_dlm"
Lock Table:        "fpcl01:vg00lv00"
real 2m1.198s
user 1m23.715s
sys 0m4.904s
[root@spool7 ~]#
[root@spool7 ~]# mount -t ext3 -r -o defaults /dev/mapper/fpcl01vg01-fpcl01vg01lv00 /mnt/fpcl01vg01lv00
[root@spool7 ~]# mount -t gfs2 -o defaults /dev/mapper/fpcl01vg00-fpcl01vg00lv00 /mnt/fpcl01vg00lv00
[root@spool7 ~]# df
Filesystem                             1K-blocks      Used  Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00         15109112   3271208   11058028  23% /
/dev/cciss/c0d0p1                         101086     25993      69874  28% /boot
tmpfs                                    1031512         0    1031512   0% /dev/shm
/dev/mapper/fpcl01vg01-fpcl01vg01lv00  420080288  61923572  336817788  16% /mnt/fpcl01vg01lv00
/dev/mapper/fpcl01vg00-fpcl01vg00lv00 3166434592    542192 3165892400   1% /mnt/fpcl01vg00lv00
[root@spool7 ~]# cd //mnt/fpcl01vg00lv00
[root@spool7 fpcl01vg00lv00]# cp -ax /mnt/fpcl01vg01lv00 .
I'm reassigning this to Patrick on the basis of the latest dmesg in comment #7 which appears to implicate the DLM. Patrick, if you are able to rule out the DLM, then please reassign back to one of us.
I'm not sure whether the messages relating to the DLM are accurate or not. The sock_sem that is locked in the middle of accept_from_sock will always be a different one from the one that is locked at the start of that function, though the message implies it's the same. I have emailed Ingo to see whether lockdep could simply be getting confused and leading us up a blind alley.
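For readers following along, here is a toy illustration of why a validator might report this. Lockdep checks locking by lock class rather than by individual lock, so taking a second lock of the same class while one is already held can look like recursive locking, even when the two are different instances. The sketch below is a Python model of that idea only, not kernel code. `ClassKeyedValidator` and its `subclass` parameter are hypothetical names made up for this example and are not taken from any patch in this report.

```python
class ClassKeyedValidator:
    """Toy model: tracks held locks by (class, subclass), never by instance."""

    def __init__(self):
        self.held = []  # keys of the locks currently held

    def acquire(self, lock_class, subclass=0):
        """Return True if a 'possible recursive locking' warning would fire."""
        key = (lock_class, subclass)
        warned = key in self.held  # same key already held -> looks recursive
        self.held.append(key)
        return warned

    def release(self, lock_class, subclass=0):
        self.held.remove((lock_class, subclass))


v = ClassKeyedValidator()
v.acquire("sock_sem")                 # lock on the listening connection
print(v.acquire("sock_sem"))          # lock on a *different* connection -> True
v.release("sock_sem")
v.release("sock_sem")

v.acquire("sock_sem")
print(v.acquire("sock_sem", subclass=1))  # nested acquisition annotated -> False
```

In this model the first nested acquisition is flagged even though two separate connections are involved, while tagging the inner acquisition with a distinct subclass silences the warning without changing what is actually locked.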
Created attachment 145576 [details] patch to fix lockdep annotations I think the lockdep warnings are spurious. To confirm this, could you please apply the attached patch? If this gets rid of the warnings, then it seems likely that the DLM lowcomms is not the culprit.
Created attachment 146066 [details] New backtrace showing different things Upgraded to kernel-2.6.19-1.2895.fc6 - same results, copy process deadlocks - maybe some additional info/clues in this dmesg with backtrace? There is a fatal assertion right before the backtrace...
GFS2 has several lock ordering problems right now that result in ABBA-type deadlocks, where one code path takes lock A and then lock B while another takes B and then A. The trace you have attached does not look familiar, so it is hard to say if it is the same problem as the one I am chasing. It's unclear how long it is going to take to find all the lock ordering issues, but it could potentially be quite a while.
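As a rough illustration of what an ABBA ordering problem looks like and how it can be caught without waiting for an actual hang, here is a minimal Python sketch of a lock-order checker. It is a toy model, not GFS2 or lockdep code: `LockOrderChecker` and the lock names "A" and "B" are hypothetical, and the two code paths are simulated one after the other on a single checker rather than on real threads.

```python
from collections import defaultdict


class LockOrderChecker:
    """Toy validator: records 'held -> acquired' edges and reports when a
    new acquisition contradicts an order seen earlier (A->B, then B->A)."""

    def __init__(self):
        self.edges = defaultdict(set)  # lock -> locks taken while it was held
        self.held = []                 # locks currently held, oldest first

    def _reachable(self, src, dst):
        # Depth-first search over the recorded ordering edges.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, name):
        """Return a list of ordering conflicts (empty if the order is consistent)."""
        problems = []
        for h in self.held:
            # Taking `name` while holding `h` implies the order h -> name.
            # If name -> ... -> h was already recorded, the orders conflict.
            if self._reachable(name, h):
                problems.append(f"{h} -> {name} conflicts with earlier {name} -> {h}")
            self.edges[h].add(name)
        self.held.append(name)
        return problems

    def release(self, name):
        self.held.remove(name)


checker = LockOrderChecker()

# Code path 1 takes A, then B.
checker.acquire("A")
checker.acquire("B")
checker.release("B")
checker.release("A")

# Code path 2 takes B, then A: the inverse order.
checker.acquire("B")
print(checker.acquire("A"))  # ['B -> A conflicts with earlier A -> B']
```

The inversion is reported even though neither path blocked here, which is why this kind of checking can find ordering bugs that only deadlock intermittently under load. The usual fix is to make every path take the locks in one agreed order.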
This should be fixed in the latest upstream kernel and also the latest FC-6 kernel. Please let me know if this is still a problem.
Created attachment 150197 [details] New messages file with backtrace.... sorry You guys are gonna hate me... :( I can still make it deadlock... sorry. All three machines in cluster are fc6 with all latest updates, specifically kernel-2.6.20-1.2925.fc6. Attached is the messages file from the latest boot with a backtrace that I generated after the deadlock. Let me know if you need more info.
That's fixed upstream, but it's not made it to FC yet. I really didn't expect you to run into that as well :( I'll let you know when that has made it into FC-6 too.
It should be in FC-6 by now: kernel-2.6.20-1.2943 or later.