Bug 617306 - dlm_controld: fix plock owner syncing
dlm_controld: fix plock owner syncing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster (Show other bugs)
6.0
All Linux
low Severity medium
: rc
: ---
Assigned To: David Teigland
Cluster QE
: RHELNAK
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-07-22 14:33 EDT by David Teigland
Modified: 2016-04-26 10:27 EDT (History)
8 users (show)

See Also:
Fixed In Version: cluster-3.0.12-25.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 09:03:54 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
full set of dlm_tool dumps taken from all nodes just after reproduction (324.29 KB, application/x-gzip)
2010-07-22 17:29 EDT, Nate Straz
no flags Details
working patch with fixes and debugging (5.55 KB, text/plain)
2010-07-29 18:23 EDT, David Teigland
no flags Details
patch (8.97 KB, text/plain)
2010-08-06 16:50 EDT, David Teigland
no flags Details
patch (8.58 KB, text/plain)
2010-08-12 17:07 EDT, David Teigland
no flags Details

  None (edit)
Description David Teigland 2010-07-22 14:33:55 EDT
Description of problem:

Nate has hit this same problem twice, I'll let him record the exact test case that is being run.

nodes tank-01, tank-03, tank-04

All three are busy with plock operations, tank-03 is killed, restarts, and goes back to running the tests.

During the restart on tank-03, one or more plock resource owners gets out of sync on tank-03:

dlm_tool plocks shows the resource owner is node 1, and this plock operation is waiting for the resource owner to become 0 (unowned) before it can procede:
1133921 WR 0-9223372036854775807 nodeid 3 pid 2091 owner fffffffff59b3c80 rown 1 PENDING

On tank-01 and tank-04, dlm_tool plocks shows the resource owner is already 0:
1133921 rown 0 unused_ms 1695303

tank-03 seems to have synced the correct plock state (from tank-01) while restarting.

tank-01:
1279812582 brawl0 store_plocks r_count 7630 p_count 0 total_size 0 max_section_size 0
1279812582 brawl0 store_plocks open ckpt handle 4ad084e900000000
1279812643 brawl0 store_plocks first 115007 last 1264717 r_count 7630 p_count 0 sig 1290bc
1279812643 brawl0 send_plocks_stored cg 9 flags a data2 1290bc counts 8 3 1 0 0
1279812643 brawl0 receive_plocks_stored 1:9 flags a sig 1290bc need_plocks 0

tank-03:
1279812652 brawl0 receive_plocks_stored 1:9 flags a sig 1290bc need_plocks 1
1279812652 brawl0 match_change 1:9 matches cg 1
1279812652 brawl0 retrieve_plocks
1279812736 brawl0 retrieve_plocks first 1264717 last 115007 r_count 7630 p_count 0 sig 1290bc


Ideally, dlm_tool log_plock would should us the owner value at the time of sync, and the message that changed it from 1 to 0.  Unfortunately, that info has been pushed out of the circular buffer because the test continued running, with a lot of plock activity, for a long time after the problem occured.

Eventually, the kernel will show the blocked processes that are waiting for the plock with the bad owner value:

Jul 22 11:13:03 tank-03 kernel: INFO: task genesis:2091 blocked for more than 120 seconds.
Jul 22 11:13:03 tank-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 22 11:13:03 tank-03 kernel: genesis       D f1d63e64     0  2091   2090 0x00000080
Jul 22 11:13:03 tank-03 kernel: f32aca90 00000082 00000002 f1d63e64 c2804b34 00000000 f1d63e5c 00008fe7
Jul 22 11:13:03 tank-03 kernel: 00000000 f5ada980 000002ad c0ad6080 f1cc066f 000002ad c0ad6080 f32acd34
Jul 22 11:13:03 tank-03 kernel: c0ad6080 c0ad1b34 c0ad6080 f32acd34 c2804b34 00000400 00286445 f32aca90
Jul 22 11:13:03 tank-03 kernel: Call Trace:
Jul 22 11:13:03 tank-03 kernel: [<f8cffb55>] ? dlm_posix_lock+0x195/0x2c0 [dlm]
Jul 22 11:13:03 tank-03 kernel: [<c0470830>] ? autoremove_wake_function+0x0/0x40
Jul 22 11:13:03 tank-03 kernel: [<f8e876b4>] ? gfs2_lock+0xa4/0x140 [gfs2]
Jul 22 11:13:03 tank-03 kernel: [<f8e87610>] ? gfs2_lock+0x0/0x140 [gfs2]
Jul 22 11:13:03 tank-03 kernel: [<c05588e2>] ? vfs_lock_file+0x22/0x50
Jul 22 11:13:03 tank-03 kernel: [<c05589fb>] ? do_lock_file_wait+0x4b/0xd0
Jul 22 11:13:03 tank-03 kernel: [<c0558d5f>] ? fcntl_setlk+0xff/0x1e0
Jul 22 11:13:03 tank-03 kernel: [<c052b522>] ? do_fcntl+0x92/0x440
Jul 22 11:13:03 tank-03 kernel: [<c052b92a>] ? sys_fcntl64+0x5a/0xd0
Jul 22 11:13:03 tank-03 kernel: [<c04099bc>] ? syscall_call+0x7/0xb



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 2 RHEL Product and Program Management 2010-07-22 14:58:37 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 3 Nate Straz 2010-07-22 14:58:56 EDT
I'm probably lucky in the state I'm in.  Whenever I restart genesis on tank-03 I hit this bug right away.  Here's the output from log_plock which I think is relevant:

[root@tank-03 tmp]# dlm_tool log_plock brawl0 | grep 114d61
1279824407 brawl0 read plock 114d61 LK WR 0-7fffffffffffffff 3/2110/fffffffff39abe80 w 0
1279824407 brawl0 receive_own 114d61 from 3 owner 3
Comment 5 Nate Straz 2010-07-22 17:29:29 EDT
Created attachment 433814 [details]
full set of dlm_tool dumps taken from all nodes just after reproduction
Comment 6 Nate Straz 2010-07-22 17:54:00 EDT
How reproducible:
Easily

Steps to Reproduce:
1. mount gfs2 on multiple nodes
2. run genesis load w/ plocks on all nodes in separate directories
   /usr/tests/sts-rhel6.0/bin/genesis -S 15259 -i 30s -n 100 -d 10 -p 5  -L fcntl  -w /mnt/brawl/$hostname
3. reboot one node
4. restart cman, clvmd, and remount file system
5. run genesis load again on rebooted node

Actual results:
genesis hangs on the first file it tries to lock

Expected results:
genesis load should run without hanging or failing.
Comment 7 David Teigland 2010-07-22 18:15:33 EDT
I can't explain what the logs are showing.

When tank-03 joins, tank-01 puts this into the plock ckpt:

1279830719 brawl0 store_plocks section size 0 id 11 "r1133921.1"

The ".1" means the r->owner is nodeid 1 (itself).

When tank-01 gets to the end of writing the checkpoint, it prints
1279830727 brawl0 store_plocks first 115007 last 1264717 r_count 7630 p_count 0 sig 1290bc

and the very next lines are:

1279830898 brawl0 read plock 114d61 LK WR 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0
1279830898 brawl0 receive plock 114d61 LK WR 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0
1279830898 brawl0 receive_own 114d61 from 3 owner 3
1279830898 brawl0 receive plock 114d61 LK WR 0-7fffffffffffffff 4/1585/fffffffff3cc0080 w 0
1279830899 brawl0 read plock 114d61 UN - 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0
1279830899 brawl0 receive plock 114d61 UN - 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0

The "receive plock" implies tank-01 just did a send_plock(), which only happens if r->owner is 0, which contradicts the owner of 1 that it just wrote into the checkpoint.  tank-01 and tank-04 also show owner of 0 from their plock dumps:

1133921 rown 0 unused_ms 2079
1133921 rown 0 unused_ms 2464

tank-03 sees owner 1 from the checkpoint and correctly responds by sending owner 3 and waiting for the owner to become 0, which is why things are stuck on tank-03.

I can't come up with any explanations yet for why tank-01 would put owner of 1 into the checkpoint when r->owner actually seems to be 0.  And I can't come up with any explanations how the owner value could have changed between writing the checkpoint and the operation immediately following -- anything that would change that should appear in the logs.
Comment 8 David Teigland 2010-07-27 18:31:43 EDT
I can reproduce this with:

bull-01: mount; lock_load
bull-02: mount; lock_load
bull-04: mount; lock_load
bull-05: mount; lock_load
bull-01: kill lock_load; unmount
bull-02: kill lock_load; unmount
bull-04: kill lock_load; unmount
bull-01: mount; lock_load <-- hang on first operation

(plock_ownership="1" of course)

bull-01
-------

single line of output from lock_load
000000 file0019 ino 2056f R 05-09 pid 24509 err 00 sec 0.000328

ps ax
24509 pts/0    D+     0:00 /root/lock_load

dlm_tool plocks
132603 RD 0-4 nodeid 1 pid 24509 owner ffff88013df4e680 rown 5 PENDING
132463 RD 5-9 nodeid 1 pid 24509 owner ffff88013df4e680 rown 0

dlm_tool log_plock
1280267313 g retrieve_plocks section size 40 id 10 "r132603.5"
1280267313 g retrieve_plocks ckpt read 40 bytes

1280267314 g read plock 205fb LK RD 0-4 1/24509/ffff88013df4e680 w 0
1280267314 g receive_own 205fb from 1 owner 1


bull-05
-------

dlm_tool plocks
132603 rown 0 unused_ms 12610
132463 RD 5-9 nodeid 1 pid 24509 owner ffff88013df4e680 rown 0

dlm_tool log_plock
1280267313 g store_plocks ro 0 gu 0 ss 40 id 10 "r132603.5"

1280267314 g receive_own 205fb from 1 owner 1


code to determine what owner value is written in checkpoint:
                else if (!r->owner && !got_unown(r))
                        owner = our_nodeid;

extra store_plocks debugging:
log_plock(ls, "store_plocks ro %d gu %d ss %u id %u \"%s\"",
          r->owner, got_unown(r), section_len, section_id.idLen, buf);

So bull-05 writes owner 5 (itself) into checkpoint because r->owner is 0 and got_unown(r) is 0.  !got_unown is supposed to mean that we have sent a message to change the owner from us to 0/unowned, but have not received that message yet.  When we receive the unown message we sent, we'll set GOT_UNOWN.

The logic behind that, I think, is that we set the owner to zero right when we send the unown message that sets the owner to zero.  The idea is that the new node that is going to read the checkpoint will receive our unown message at the same time we do, and change the owner from us to 0.  The owner is not really 0 until our unown message is received.

log_plock from bull-05 seems to indicate that the owner has been 0 for a long time prior to this happening; at least 50 seconds, which is as far back as the log shows.  I suspect that the code is missing setting the GOT_UNOWN flag on the resource on bull-05.  e.g. if the resource was already unowned when bull-05 joined, and the resource was synced to bull-05 in the first place, is the GOT_UNOWNED flag ever set in bull-05's resource struct?  My initial suspicion is that it's not -- that the resource was synced to bull-05 with owner 0 but GOT_UNOWN is never set.  That wouldn't be a visible problem until/unless bull-05 is called on to sync plocks to another node.
Comment 9 David Teigland 2010-07-27 18:39:57 EDT
The way this works is the same in RHEL5, so we can also try the test there to verify the failure is the same.
Comment 10 David Teigland 2010-07-29 18:23:39 EDT
Created attachment 435421 [details]
working patch with fixes and debugging

Have found and fixed a couple problems with syncing ownership state, but not all.  Latest debugging notes...


bull-01

134377 RD 0-9 nodeid 1 pid 2514 owner ffff88013cdfde40 rown 5 PENDING

bull-05

134377 WR 0-9 nodeid 5 pid 2379 owner ffff88007a1caec0 rown 0

--------------------------------------
dec2hex 132720
20670


bull-01
132720 RD 0-4 nodeid 1 pid 2641 owner ffff880007fd0580 rown 5 PENDING

396742116 g retrieve_plocks section size 0 id 10 "r132720.5"
396742116 g retrieve_plocks ckpt read 0 bytes


bull-05
132720 RD 0-9 nodeid 5 pid 2449 owner ffff88013a6bfc80 rown 0

1280354916 g store_plocks ro 0 rf 0 len 0 "r132720.5"

logs on bull-05 don't show an outstanding ownership change from 5->0
as we'd expect when r->owner is 0, and no GOT_UNOWN flag.  perhaps
this is another situation where GOT_UNOWN is not being set on 5
when it should be?

The same test that was working with the new GOT_UNOWN in unpack is
now failing every time?

01 mount; lock_load -n 1000
02 mount; lock_load -n 1000
04 mount; lock_load -n 1000
05 mount; lock_load -n 1000
01 kill lock_load; unmount
02 kill lock_load; unmount
04 kill lock_load; unmount
01 mount; lock_load -n 1000

The test that was working was -n 100, this is -n 1000, is that
making the difference?

------------------------------------------------

dec2hex 133683
20a33

01

133683 RD 0-4 nodeid 1 pid 21904 owner ffff880007a62940 rown 5 PENDING

1280440859 g retrieve_plocks section size 0 id 10 "r133683.5"
1280440859 g retrieve_plocks ckpt read 0 bytes


05

133683 WR 0-9 nodeid 5 pid 3190 owner ffff88013df56680 rown 0

1280440859 g store_plocks ro 0 rf 0 len 0 "r133683.5"

so r->owner is 0, and we put our_nodeid (5) as owner in ckpt...

                else if (!r->owner && !got_unown(r))
                        owner = our_nodeid;

no receive_own message for 20a33, before or after.
rf 0 = no R_GOT_UNOWN, no R_SEND_UNOWN, no R_SEND_OWN,

how does r->owner become zero without 5 getting owner 0 message
or sending owner 0 message?
Comment 11 David Teigland 2010-08-06 16:50:41 EDT
Created attachment 437261 [details]
patch

This patch finally appears to work consistently for the the test:
node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000

and variations of this.  There are multiple related fixes in the patch, as well as some extra debugging that I may leave in place to help debug next time.
Comment 12 David Teigland 2010-08-12 17:07:19 EDT
Created attachment 438527 [details]
patch

The patch for this bug.  The other part of the earlier patch goes in bug 623810.
Comment 14 David Teigland 2010-08-13 16:56:56 EDT
the corresponding patch for gfs_controld pushed to RHEL6 branch
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=aa4bde63a2307636f5f64c03082aab80eb6dd27e
Comment 17 errata-xmlrpc 2011-05-19 09:03:54 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html

Note You need to log in before you can comment on or make changes to this bug.