Bug 495514 - gfs deadlocks during 4.7.z testing
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware/OS: All Linux
Priority: urgent   Severity: urgent
Target Milestone: rc
Assigned To: Robert Peterson
QA Contact: Cluster QE
Keywords: Regression
Reported: 2009-04-13 11:53 EDT by Corey Marthaler
Modified: 2010-01-11 22:17 EST
CC List: 3 users
Doc Type: Bug Fix
Last Closed: 2009-04-17 12:16:57 EDT

Attachments
kern dump from z1 (131.10 KB, text/plain) - 2009-04-13 12:00 EDT, Corey Marthaler
kern dump from z2 (110.12 KB, text/plain) - 2009-04-13 12:01 EDT, Corey Marthaler
kern dump from z3 (136.59 KB, text/plain) - 2009-04-13 12:01 EDT, Corey Marthaler
gfs_tool lockdump from z1 (196.87 KB, text/plain) - 2009-04-13 18:17 EDT, Nate Straz
gfs_tool lockdump from z2 (129.13 KB, text/plain) - 2009-04-13 18:18 EDT, Nate Straz
logs from hung tankmorph cluster (186.65 KB, application/x-gzip) - 2009-04-15 10:39 EDT, Nate Straz
Proposed patch (1.13 KB, patch) - 2009-04-15 12:05 EDT, Robert Peterson

Description Corey Marthaler 2009-04-13 11:53:27 EDT
Description of problem:
Twice now I've restarted the 4.7.z gfs tests, and twice they've ended up deadlocked.

[root@z1 ~]# ps -elf | grep gfs
4 S root      4852  4816  0  76   0 -  1245 -      09:22 pts/2    00:00:00 grep gfs
1 S root     20203     1  0  75   0 -     0 -      Apr09 ?        00:00:56 [gfs_scand]
1 D root     20204     1  0  75   0 -     0 -      Apr09 ?        00:00:00 [gfs_glockd]
1 S root     20205     1  0  75   0 -     0 -      Apr09 ?        00:00:00 [gfs_recoverd]
1 S root     20206     1  0  75   0 -     0 -      Apr09 ?        00:00:00 [gfs_logd]
1 D root     20207     1  0  76   0 -     0 gfs_lo Apr09 ?        00:00:00 [gfs_quotad]
1 D root     20208     1  0  75   0 -     0 gfs_lo Apr09 ?        00:00:00 [gfs_inoded]
1 D 500      20327 20326  0  75   0 -   725 gfs_lo Apr09 ?        00:00:00 accordion -S 28685 -i 10s -p 5 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5
1 D 500      20330 20326  0  76   0 -   725 gfs_lo Apr09 ?        00:00:00 accordion -S 28685 -i 10s -p 5 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5


[root@z2 ~]# ps -elf | grep gfs
4 S root      4853  4818  0  77   0 - 12775 pipe_w 09:23 pts/1    00:00:00 grep gfs
1 S root     20348     1  0  75   0 -     0 110434 Apr09 ?        00:00:10 [gfs_scand]
1 S root     20349     1  0  75   0 -     0 gfs_gl Apr09 ?        00:00:00 [gfs_glockd]
1 S root     20350     1  0  75   0 -     0 110434 Apr09 ?        00:00:00 [gfs_recoverd]
1 S root     20351     1  0  75   0 -     0 110852 Apr09 ?        00:00:00 [gfs_logd]
1 D root     20352     1  0  75   0 -     0 glock_ Apr09 ?        00:00:00 [gfs_quotad]
1 S root     20353     1  0  75   0 -     0 110434 Apr09 ?        00:00:00 [gfs_inoded]


[root@z3 ~]# ps -elf | grep gfs
4 S root      4040  4005  0  78   0 -  3351 pipe_w 09:23 pts/1    00:00:00 grep gfs
1 S root     19517     1  0  75   0 -     0 gfs_sc Apr09 ?        00:01:04 [gfs_scand]
1 S root     19518     1  0  75   0 -     0 gfs_gl Apr09 ?        00:00:00 [gfs_glockd]
1 S root     19519     1  0  75   0 -     0 gfs_re Apr09 ?        00:00:00 [gfs_recoverd]
1 S root     19520     1  0  75   0 -     0 gfs_lo Apr09 ?        00:00:00 [gfs_logd]
1 D root     19521     1  0  75   0 -     0 glock_ Apr09 ?        00:00:00 [gfs_quotad]
1 S root     19522     1  0  75   0 -     0 gfs_in Apr09 ?        00:00:00 [gfs_inoded]


I'll attach the kernel dump from each node.


Version-Release number of selected component (if applicable):
2.6.9-78.0.17

[root@z1 tmp]# rpm -qa | grep GFS
GFS-kernel-xenU-2.6.9-80.9.el4_7.15
GFS-kernheaders-2.6.9-80.9.el4_7.15
GFS-kernel-2.6.9-80.9.el4_7.15
GFS-kernel-smp-2.6.9-80.9.el4_7.15
GFS-kernel-debuginfo-2.6.9-80.9.el4_7.15
GFS-6.1.18-1
GFS-kernel-hugemem-2.6.9-80.9.el4_7.15
GFS-debuginfo-6.1.18-1
Comment 1 Corey Marthaler 2009-04-13 12:00:24 EDT
Created attachment 339334 [details]
kern dump from z1
Comment 2 Corey Marthaler 2009-04-13 12:01:15 EDT
Created attachment 339335 [details]
kern dump from z2
Comment 3 Corey Marthaler 2009-04-13 12:01:58 EDT
Created attachment 339336 [details]
kern dump from z3
Comment 4 Robert Peterson 2009-04-13 18:09:59 EDT
Corey, did this test pass on GFS-kernel-2.6.9-80.9.el4_7.13?
Also, can I get a gfs lock dump from the hang?
Comment 5 Nate Straz 2009-04-13 18:17:40 EDT
Created attachment 339391 [details]
gfs_tool lockdump from z1

Bob, I grabbed gfs_tool lockdump output for z1 and z2.
Comment 6 Nate Straz 2009-04-13 18:18:21 EDT
Created attachment 339392 [details]
gfs_tool lockdump from z2
Comment 7 Steve Whitehouse 2009-04-15 08:52:09 EDT
On z1:

Glock (3, 3656288)
  gl_flags = 1 
  gl_count = 4
  gl_state = 0
  req_gh = yes
  req_bh = yes
  lvb_count = 1
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = 31219
    gh_state = 3
    gh_flags = 5 6 8 
    error = 0
    gh_iflags = 1 
  Waiter3
    owner = 31219
    gh_state = 3
    gh_flags = 5 6 8 
    error = 0
    gh_iflags = 1 

This seems to be waiting for a shared lock, but on z2 nothing seems to be holding this lock. Are there lock dumps from z3? DLM lock dumps would also be useful, as they might have more info about what's happening on other nodes if you can find the master for that lock.

This reminds me just why the gfs2 lock dump format is so much better. I have no idea what the two lock requests are actually doing since we don't have the address of the function which initialised them, unlike with gfs2.
Comment 8 Robert Peterson 2009-04-15 10:38:11 EDT
The only difference between this 4.7.z and the previous is the 
fix for bug #455696 and its z-stream counterpart.  This will likely
be closed as a DUP and that one set to FAILS_QA.  I need to adjust
that fix.
Comment 9 Nate Straz 2009-04-15 10:39:58 EDT
Created attachment 339692 [details]
logs from hung tankmorph cluster

I hit this during 4.8 testing too, attached are logs from all nodes with SysRq-T, gfs_tool lockdump, and DLM lock dumps.
Comment 10 Robert Peterson 2009-04-15 10:42:56 EDT
AFAICT, the problem seems to be with the new sd_log_flush_lock I
introduced with the patch for bug #455696.  There's a circular lock
problem that involves function gfs_log_dump.
Comment 12 Robert Peterson 2009-04-15 11:11:32 EDT
If I'm not mistaken, the hanging calling sequence looks something
like this:

gfs_log_flush_glock
   down(&sdp->sd_log_flush_lock)
   -->gfs_log_flush_internal
      -->gfs_log_dump
         -->gfs_log_reserve
            -->gfs_log_flush
               down(&sdp->sd_log_flush_lock);

That brings up the question of how the code did this properly before
my patch.  I need to investigate that, and hopefully come up with a
solution.  Since I can recreate the hang now, I should at least be
able to tell if my solution will work.
Comment 13 Robert Peterson 2009-04-15 12:05:29 EDT
Created attachment 339708 [details]
Proposed patch

With this patch, I'm no longer able to recreate the failure on my
system.  I'd like to get Corey and possibly Nate to test this in
patch form before I push it to the git repositories.
Comment 14 Robert Peterson 2009-04-15 18:06:59 EDT
Waiting to hear the outcome of the patch.  Setting NEEDINFO.
Comment 15 Corey Marthaler 2009-04-16 18:13:38 EDT
The scratch build with the patch does pass our GFS regression tests.
Comment 16 Robert Peterson 2009-04-17 09:42:23 EDT
I pushed the patch to the RHEL4, RHEL47 and RHEL48 branches of the
cluster git tree.  Now it's a matter of rebuilding those branches.
I'm not sure of the best way to handle the bugzilla paperwork for
this problem.  I've already got Modified bugzillas for all three
releases that will need to be respun.  I'm tempted to just close
this as a duplicate of one of those, otherwise I'll have to go through
the pain of requesting bugzillas and ack flags for all three releases
again.  I guess I need to talk to Mr. Feist about it.
Comment 17 Robert Peterson 2009-04-17 12:16:57 EDT
This bug was a problem with the patch for bug #455696 and its z-stream
counterparts.  So I'm closing this as NOTABUG and I'll let Mr. Feist
rebuild the appropriate z-stream rpms using those bugs.
