617306 – dlm_controld: fix plock owner syncing

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 617306 - dlm_controld: fix plock owner syncing

Summary: dlm_controld: fix plock owner syncing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	cluster
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2010-07-22 18:33 UTC by David Teigland
Modified:	2016-04-26 14:27 UTC (History)
CC List:	8 users (show)
Fixed In Version:	cluster-3.0.12-25.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-05-19 13:03:54 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
full set of dlm_tool dumps taken from all nodes just after reproduction (324.29 KB, application/x-gzip) 2010-07-22 21:29 UTC, Nate Straz	no flags	Details
working patch with fixes and debugging (5.55 KB, text/plain) 2010-07-29 22:23 UTC, David Teigland	no flags	Details
patch (8.97 KB, text/plain) 2010-08-06 20:50 UTC, David Teigland	no flags	Details
patch (8.58 KB, text/plain) 2010-08-12 21:07 UTC, David Teigland	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0537	0	normal	SHIPPED_LIVE	cluster and gfs2-utils bug fix update	2011-05-18 17:57:40 UTC

Description David Teigland 2010-07-22 18:33:55 UTC

Description of problem:

Nate has hit this same problem twice, I'll let him record the exact test case that is being run.

nodes tank-01, tank-03, tank-04

All three are busy with plock operations, tank-03 is killed, restarts, and goes back to running the tests.

During the restart on tank-03, one or more plock resource owners gets out of sync on tank-03:

dlm_tool plocks shows the resource owner is node 1, and this plock operation is waiting for the resource owner to become 0 (unowned) before it can procede:
1133921 WR 0-9223372036854775807 nodeid 3 pid 2091 owner fffffffff59b3c80 rown 1 PENDING

On tank-01 and tank-04, dlm_tool plocks shows the resource owner is already 0:
1133921 rown 0 unused_ms 1695303

tank-03 seems to have synced the correct plock state (from tank-01) while restarting.

tank-01:
1279812582 brawl0 store_plocks r_count 7630 p_count 0 total_size 0 max_section_size 0
1279812582 brawl0 store_plocks open ckpt handle 4ad084e900000000
1279812643 brawl0 store_plocks first 115007 last 1264717 r_count 7630 p_count 0 sig 1290bc
1279812643 brawl0 send_plocks_stored cg 9 flags a data2 1290bc counts 8 3 1 0 0
1279812643 brawl0 receive_plocks_stored 1:9 flags a sig 1290bc need_plocks 0

tank-03:
1279812652 brawl0 receive_plocks_stored 1:9 flags a sig 1290bc need_plocks 1
1279812652 brawl0 match_change 1:9 matches cg 1
1279812652 brawl0 retrieve_plocks
1279812736 brawl0 retrieve_plocks first 1264717 last 115007 r_count 7630 p_count 0 sig 1290bc


Ideally, dlm_tool log_plock would should us the owner value at the time of sync, and the message that changed it from 1 to 0.  Unfortunately, that info has been pushed out of the circular buffer because the test continued running, with a lot of plock activity, for a long time after the problem occured.

Eventually, the kernel will show the blocked processes that are waiting for the plock with the bad owner value:

Jul 22 11:13:03 tank-03 kernel: INFO: task genesis:2091 blocked for more than 120 seconds.
Jul 22 11:13:03 tank-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 22 11:13:03 tank-03 kernel: genesis       D f1d63e64     0  2091   2090 0x00000080
Jul 22 11:13:03 tank-03 kernel: f32aca90 00000082 00000002 f1d63e64 c2804b34 00000000 f1d63e5c 00008fe7
Jul 22 11:13:03 tank-03 kernel: 00000000 f5ada980 000002ad c0ad6080 f1cc066f 000002ad c0ad6080 f32acd34
Jul 22 11:13:03 tank-03 kernel: c0ad6080 c0ad1b34 c0ad6080 f32acd34 c2804b34 00000400 00286445 f32aca90
Jul 22 11:13:03 tank-03 kernel: Call Trace:
Jul 22 11:13:03 tank-03 kernel: [<f8cffb55>] ? dlm_posix_lock+0x195/0x2c0 [dlm]
Jul 22 11:13:03 tank-03 kernel: [<c0470830>] ? autoremove_wake_function+0x0/0x40
Jul 22 11:13:03 tank-03 kernel: [<f8e876b4>] ? gfs2_lock+0xa4/0x140 [gfs2]
Jul 22 11:13:03 tank-03 kernel: [<f8e87610>] ? gfs2_lock+0x0/0x140 [gfs2]
Jul 22 11:13:03 tank-03 kernel: [<c05588e2>] ? vfs_lock_file+0x22/0x50
Jul 22 11:13:03 tank-03 kernel: [<c05589fb>] ? do_lock_file_wait+0x4b/0xd0
Jul 22 11:13:03 tank-03 kernel: [<c0558d5f>] ? fcntl_setlk+0xff/0x1e0
Jul 22 11:13:03 tank-03 kernel: [<c052b522>] ? do_fcntl+0x92/0x440
Jul 22 11:13:03 tank-03 kernel: [<c052b92a>] ? sys_fcntl64+0x5a/0xd0
Jul 22 11:13:03 tank-03 kernel: [<c04099bc>] ? syscall_call+0x7/0xb



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 RHEL Program Management 2010-07-22 18:58:37 UTC

This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Nate Straz 2010-07-22 18:58:56 UTC

I'm probably lucky in the state I'm in.  Whenever I restart genesis on tank-03 I hit this bug right away.  Here's the output from log_plock which I think is relevant:

[root@tank-03 tmp]# dlm_tool log_plock brawl0 | grep 114d61
1279824407 brawl0 read plock 114d61 LK WR 0-7fffffffffffffff 3/2110/fffffffff39abe80 w 0
1279824407 brawl0 receive_own 114d61 from 3 owner 3

Comment 5 Nate Straz 2010-07-22 21:29:29 UTC

Created attachment 433814 [details]
full set of dlm_tool dumps taken from all nodes just after reproduction

Comment 6 Nate Straz 2010-07-22 21:54:00 UTC

How reproducible:
Easily

Steps to Reproduce:
1. mount gfs2 on multiple nodes
2. run genesis load w/ plocks on all nodes in separate directories
   /usr/tests/sts-rhel6.0/bin/genesis -S 15259 -i 30s -n 100 -d 10 -p 5  -L fcntl  -w /mnt/brawl/$hostname
3. reboot one node
4. restart cman, clvmd, and remount file system
5. run genesis load again on rebooted node

Actual results:
genesis hangs on the first file it tries to lock

Expected results:
genesis load should run without hanging or failing.

Comment 7 David Teigland 2010-07-22 22:15:33 UTC

I can't explain what the logs are showing.

When tank-03 joins, tank-01 puts this into the plock ckpt:

1279830719 brawl0 store_plocks section size 0 id 11 "r1133921.1"

The ".1" means the r->owner is nodeid 1 (itself).

When tank-01 gets to the end of writing the checkpoint, it prints
1279830727 brawl0 store_plocks first 115007 last 1264717 r_count 7630 p_count 0 sig 1290bc

and the very next lines are:

1279830898 brawl0 read plock 114d61 LK WR 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0
1279830898 brawl0 receive plock 114d61 LK WR 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0
1279830898 brawl0 receive_own 114d61 from 3 owner 3
1279830898 brawl0 receive plock 114d61 LK WR 0-7fffffffffffffff 4/1585/fffffffff3cc0080 w 0
1279830899 brawl0 read plock 114d61 UN - 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0
1279830899 brawl0 receive plock 114d61 UN - 0-7fffffffffffffff 1/3133/fffffffff29bb080 w 0

The "receive plock" implies tank-01 just did a send_plock(), which only happens if r->owner is 0, which contradicts the owner of 1 that it just wrote into the checkpoint.  tank-01 and tank-04 also show owner of 0 from their plock dumps:

1133921 rown 0 unused_ms 2079
1133921 rown 0 unused_ms 2464

tank-03 sees owner 1 from the checkpoint and correctly responds by sending owner 3 and waiting for the owner to become 0, which is why things are stuck on tank-03.

I can't come up with any explanations yet for why tank-01 would put owner of 1 into the checkpoint when r->owner actually seems to be 0.  And I can't come up with any explanations how the owner value could have changed between writing the checkpoint and the operation immediately following -- anything that would change that should appear in the logs.

Comment 8 David Teigland 2010-07-27 22:31:43 UTC

I can reproduce this with:

bull-01: mount; lock_load
bull-02: mount; lock_load
bull-04: mount; lock_load
bull-05: mount; lock_load
bull-01: kill lock_load; unmount
bull-02: kill lock_load; unmount
bull-04: kill lock_load; unmount
bull-01: mount; lock_load <-- hang on first operation

(plock_ownership="1" of course)

bull-01
-------

single line of output from lock_load
000000 file0019 ino 2056f R 05-09 pid 24509 err 00 sec 0.000328

ps ax
24509 pts/0    D+     0:00 /root/lock_load

dlm_tool plocks
132603 RD 0-4 nodeid 1 pid 24509 owner ffff88013df4e680 rown 5 PENDING
132463 RD 5-9 nodeid 1 pid 24509 owner ffff88013df4e680 rown 0

dlm_tool log_plock
1280267313 g retrieve_plocks section size 40 id 10 "r132603.5"
1280267313 g retrieve_plocks ckpt read 40 bytes

1280267314 g read plock 205fb LK RD 0-4 1/24509/ffff88013df4e680 w 0
1280267314 g receive_own 205fb from 1 owner 1


bull-05
-------

dlm_tool plocks
132603 rown 0 unused_ms 12610
132463 RD 5-9 nodeid 1 pid 24509 owner ffff88013df4e680 rown 0

dlm_tool log_plock
1280267313 g store_plocks ro 0 gu 0 ss 40 id 10 "r132603.5"

1280267314 g receive_own 205fb from 1 owner 1


code to determine what owner value is written in checkpoint:
                else if (!r->owner && !got_unown(r))
                        owner = our_nodeid;

extra store_plocks debugging:
log_plock(ls, "store_plocks ro %d gu %d ss %u id %u \"%s\"",
          r->owner, got_unown(r), section_len, section_id.idLen, buf);

So bull-05 writes owner 5 (itself) into checkpoint because r->owner is 0 and got_unown(r) is 0.  !got_unown is supposed to mean that we have sent a message to change the owner from us to 0/unowned, but have not received that message yet.  When we receive the unown message we sent, we'll set GOT_UNOWN.

The logic behind that, I think, is that we set the owner to zero right when we send the unown message that sets the owner to zero.  The idea is that the new node that is going to read the checkpoint will receive our unown message at the same time we do, and change the owner from us to 0.  The owner is not really 0 until our unown message is received.

log_plock from bull-05 seems to indicate that the owner has been 0 for a long time prior to this happening; at least 50 seconds, which is as far back as the log shows.  I suspect that the code is missing setting the GOT_UNOWN flag on the resource on bull-05.  e.g. if the resource was already unowned when bull-05 joined, and the resource was synced to bull-05 in the first place, is the GOT_UNOWNED flag ever set in bull-05's resource struct?  My initial suspicion is that it's not -- that the resource was synced to bull-05 with owner 0 but GOT_UNOWN is never set.  That wouldn't be a visible problem until/unless bull-05 is called on to sync plocks to another node.

Comment 9 David Teigland 2010-07-27 22:39:57 UTC

The way this works is the same in RHEL5, so we can also try the test there to verify the failure is the same.

Comment 10 David Teigland 2010-07-29 22:23:39 UTC

Created attachment 435421 [details]
working patch with fixes and debugging

Have found and fixed a couple problems with syncing ownership state, but not all.  Latest debugging notes...


bull-01

134377 RD 0-9 nodeid 1 pid 2514 owner ffff88013cdfde40 rown 5 PENDING

bull-05

134377 WR 0-9 nodeid 5 pid 2379 owner ffff88007a1caec0 rown 0

--------------------------------------
dec2hex 132720
20670


bull-01
132720 RD 0-4 nodeid 1 pid 2641 owner ffff880007fd0580 rown 5 PENDING

396742116 g retrieve_plocks section size 0 id 10 "r132720.5"
396742116 g retrieve_plocks ckpt read 0 bytes


bull-05
132720 RD 0-9 nodeid 5 pid 2449 owner ffff88013a6bfc80 rown 0

1280354916 g store_plocks ro 0 rf 0 len 0 "r132720.5"

logs on bull-05 don't show an outstanding ownership change from 5->0
as we'd expect when r->owner is 0, and no GOT_UNOWN flag.  perhaps
this is another situation where GOT_UNOWN is not being set on 5
when it should be?

The same test that was working with the new GOT_UNOWN in unpack is
now failing every time?

01 mount; lock_load -n 1000
02 mount; lock_load -n 1000
04 mount; lock_load -n 1000
05 mount; lock_load -n 1000
01 kill lock_load; unmount
02 kill lock_load; unmount
04 kill lock_load; unmount
01 mount; lock_load -n 1000

The test that was working was -n 100, this is -n 1000, is that
making the difference?

------------------------------------------------

dec2hex 133683
20a33

01

133683 RD 0-4 nodeid 1 pid 21904 owner ffff880007a62940 rown 5 PENDING

1280440859 g retrieve_plocks section size 0 id 10 "r133683.5"
1280440859 g retrieve_plocks ckpt read 0 bytes


05

133683 WR 0-9 nodeid 5 pid 3190 owner ffff88013df56680 rown 0

1280440859 g store_plocks ro 0 rf 0 len 0 "r133683.5"

so r->owner is 0, and we put our_nodeid (5) as owner in ckpt...

                else if (!r->owner && !got_unown(r))
                        owner = our_nodeid;

no receive_own message for 20a33, before or after.
rf 0 = no R_GOT_UNOWN, no R_SEND_UNOWN, no R_SEND_OWN,

how does r->owner become zero without 5 getting owner 0 message
or sending owner 0 message?

Comment 11 David Teigland 2010-08-06 20:50:41 UTC

Created attachment 437261 [details]
patch

This patch finally appears to work consistently for the the test:
node1: mount /gfs; cd /gfs; lock_load -n 1000
node2: mount /gfs; cd /gfs; lock_load -n 1000
node3: mount /gfs; cd /gfs; lock_load -n 1000
node4: mount /gfs; cd /gfs; lock_load -n 1000
node1: kill lock_load; cd; umount /gfs
node2: kill lock_load; cd; umount /gfs
node3: kill lock_load; cd; umount /gfs
node1: mount /gfs; cd /gfs; lock_load -n 1000

and variations of this.  There are multiple related fixes in the patch, as well as some extra debugging that I may leave in place to help debug next time.

Comment 12 David Teigland 2010-08-12 21:07:19 UTC

Created attachment 438527 [details]
patch

The patch for this bug.  The other part of the earlier patch goes in bug 623810.

Comment 13 David Teigland 2010-08-12 21:34:19 UTC

pushed to RHEL6 branch
http://git.fedorahosted.org/git?p=cluster.git;a=commitdiff;h=597111c711eeded2a9c6579064ce25de6ae0ab1b

Comment 14 David Teigland 2010-08-13 20:56:56 UTC

the corresponding patch for gfs_controld pushed to RHEL6 branch
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=aa4bde63a2307636f5f64c03082aab80eb6dd27e

Comment 17 errata-xmlrpc 2011-05-19 13:03:54 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html

Note You need to log in before you can comment on or make changes to this bug.