Description of problem:

I was playing around with mounting and unmounting 30 GFS filesystems and was seeing no problems, so I then created 30 more and attempted to mount them. Shortly after, the mount commands hung.

These are the stuck services (on taft-04):

DLM Lock Space:  "31"                             305 306 update    U-4,1,2
[1 2]

DLM Lock Space:  "33"                             305 308 join      S-8,40,3
[1 2 4]

4 D root     14137 11551  0  78   0 - 12492 kcl_jo 09:46 pts/0    00:00:00 mount -t gfs /dev/vg/gfs33 /mnt/gfs33

Sep 12 09:56:37 taft-04 kernel: mount         D 0000010217ba2ca0     0 14137  14546                     (NOTLB)
Sep 12 09:56:37 taft-04 kernel: 00000102120a18f8 0000000000000006 ffffffffa025cf24 0000010212ba5400
Sep 12 09:56:37 taft-04 kernel: 0000010200000000 dead4ead00000001 00000102120a18c8 0000000180132155
Sep 12 09:56:38 taft-04 kernel: 000001020d5d37f0 0000000000005f7a
Sep 12 09:56:38 taft-04 kernel: Call Trace:<ffffffffa025cf24>{:dlm:dlm_recoverd+0} <ffffffff8030a405>{wait_for_completion+167}
Sep 12 09:56:38 taft-04 kernel: <ffffffff80133da9>{default_wake_function+0} <ffffffff80133da9>{default_wake_function+0}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa023a175>{:cman:kcl_join_service+381} <ffffffffa02549ca>{:dlm:dlm_new_lockspace+1401}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa02ce94f>{:lock_dlm:lm_dlm_mount+1322} <ffffffffa0278310>{:lock_harness:lm_mount+358}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa02936e4>{:gfs:lm_cb+0} <ffffffffa02938bd>{:gfs:gfs_lm_mount+453}
Sep 12 09:56:38 taft-04 kernel: <ffffffffa029f768>{:gfs:gfs_get_sb+1552} <ffffffff80185bd2>{do_lookup+44}
Sep 12 09:56:38 taft-04 kernel: <ffffffff8018f019>{dput+56} <ffffffff80186c39>{__link_path_walk+3427}
Sep 12 09:56:38 taft-04 kernel: <ffffffff8018f019>{dput+56} <ffffffff80161936>{cache_alloc_refill+390}
Sep 12 09:56:38 taft-04 kernel: <ffffffff8017fb41>{do_kern_mount+161} <ffffffff801951a0>{do_mount+1690}
Sep 12 09:56:38 taft-04 kernel: <ffffffff801ea821>{__up_read+16} <ffffffff80123ed3>{do_page_fault+575}
Sep 12 09:56:38 taft-04 kernel: <ffffffff80169ca3>{handle_mm_fault+373} <ffffffff8018728a>{__user_walk+94}
Sep 12 09:56:38 taft-04 kernel: <ffffffff80195543>{sys_mount+186} <ffffffff8011026a>{system_call+126}

[root@taft-04 ~]# cat /proc/cluster/dlm_debug
waiting requests
28 marked 0 requests
28 recover event 1575 done
28 move flags 0,0,1 ids 1555,1575,1575
28 process held requests
28 processed 0 requests
28 resend marked requests
28 resent 0 requests
28 recover event 1575 finished
31 move flags 0,1,0 ids 0,1577,0
31 move use event 1577
31 recover event 1577 (first)
31 add nodes
31 total nodes 1
31 rebuild resource directory
31 rebuilt 0 resources
31 recover event 1577 done
31 move flags 0,0,1 ids 0,1577,1577
31 process held requests
31 processed 0 requests
31 recover event 1577 finished
31 move flags 1,0,0 ids 1577,1577,1577
31 move flags 0,1,0 ids 1577,1580,1577
31 move use event 1580
31 recover event 1580
31 add node 2
31 total nodes 2
31 rebuild resource directory
31 rebuilt 7 resources
31 purge requests
31 purged 0 requests
33 move flags 0,1,0 ids 0,1579,0
33 move use event 1579
33 recover event 1579 (first)
33 add nodes
33 total nodes 3
33 rebuild resource directory
33 rebuilt 0 resources
33 unexpected rcom msgid a/8 nodeid=2
33 recover event 1579 done

[root@taft-04 ~]# cat /proc/cluster/dlm_locks

[root@taft-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:     293925
Unlock operations:   288971
Convert operations:  249200
Completion ASTs:     832029
Blocking ASTs:          497

Lockqueue        num  waittime   ave
WAIT_RSB      217955    205340     0
WAIT_CONV        154       472     3
WAIT_GRANT      5998      8541     1
WAIT_UNLOCK     4465    139725    31
Total         228572    354078     1

Version-Release number of selected component (if applicable):

[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

[root@taft-04 ~]# rpm -q dlm
dlm-1.0.1-1
I don't know whether they are related, but I can reproduce both this and bz#206212 with the following script:

NUM=60
for i in `seq 1 $NUM`; do
  ./lstest -l SPACE-$i -d0 "$@" &
done

It uses the lstest program in cluster/dlm/tests/usertest.
Marking as a beta2 blocker for RHEL5. Also cc'ing Steve Dake since it might be an openais issue.
Sorry, this is a RHEL4 bug that was improperly marked for cluster-5.0; re-marked for cluster-4.5.
Trying to reproduce in my own cluster. If you happen to hit this again, please collect all of the following from each node:

foreach: mount >> log.txt
foreach: cman_tool nodes >> log.txt
foreach: cman_tool services >> log.txt
foreach: ps ax -o pid,cmd,wchan >> log.txt
foreach: cat /proc/cluster/sm_debug >> log.txt
foreach: cat /proc/cluster/dlm_debug >> log.txt
foreach: cat /proc/cluster/lock_dlm/debug >> log.txt
foreach: sysrq
foreach: collect /var/log/messages
I think the key thing here is:

33 unexpected rcom msgid a/8 nodeid=2

When it happens to me, there is always such a message in the dlm_debug log. The "received" msg numbers often seem to be junk too. My latest occurrence was:

gfs-2 unexpected rcom msgid d/25b nodeid=6

So I'm going to look for message corruption.
I left a mount/unmount test on 20 fs's running over the weekend and found the same thing -- one node reported the rcom msgid error before everything hung. Interestingly, just before it came the error message I added last week when fixing bug 206212, where process_startdone_barrier_new() gets an sg with no sevent for it. We now print this new error (below) instead of oopsing on the null pointer.

Sep 15 21:29:26 bull-02 kernel: SM: 010020dc process_startdone_barrier_new: no sev 0
Sep 15 21:29:29 bull-02 kernel: GFS: Trying to join cluster "lock_dlm", "bull:x10"
Sep 15 21:29:31 bull-02 kernel: dlm: x9: unexpected rcom msgid 5/27e59 nodeid=4
bull-01 (nodeid 1):
DLM Lock Space:  "x9"                            8412 28080 run       -
[1 2]

bull-02 (nodeid 2):
DLM Lock Space:  "x9"                            8412 27804 join      S-8,40,2
[1 2]

20dc == 8412 == lockspace x9

So, bull-02 is joining lockspace x9, which already contains bull-01. bull-02 is waiting for the final barrier that says everyone is done and will complete the join. bull-01 has apparently gotten the barrier and moved on to the normal running state.

When the barrier callback is delivered on bull-02 (with a success status of 0), process_startdone_barrier_new() doesn't find an sevent on the sg structure for x9, so it ignores the barrier callback and does the printk in comment 6. This explains why x9 is still in sevent state SEST_BARRIER_WAIT. What doesn't make sense is why the barrier callback doesn't find the sevent structure -- clearly one exists, as it's shown by /proc/cluster/services.

This doesn't seem to explain the subsequent dlm msgid printk in any way; nodeid 4 (sender of the rcom message) isn't in lockspace x9 and doesn't appear to be joining it either.
Hmm, I think the DLM message might just be related to the SM problem. Look at this:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[7 5 6 8]

DLM Lock Space:  "clvmd"                             1   3 run       -
[5 6 7 8]

DLM Lock Space:  "SPACE-51"                         53  54 run       -
[5 6 7 8]

DLM Lock Space:  "gfs-1"                            54  55 update    U-4,1,6
[7 6]

DLM Lock Space:  "gfs-2"                            54  57 join      S-8,40,2
[5 7]

GFS Mount Group: "gfs-1"                            56  56 run       -
[7]

See, lockspaces gfs-1 & gfs-2 have the same global ID!
Good catch, that's exactly the bug here -- two groups are being created with the same global id. SM gets a barrier callback specifying the gid, does a gid-to-sg lookup, gets the wrong sg back, finds that sg doesn't have an sevent and prints "process_startdone_barrier_new: no sev 0". The correct sg is left without the barrier callback it needed. Not sure yet how to fix the creation of global id's.
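To illustrate the failure mode, here is a minimal standalone sketch -- not the actual SM code; the struct and function names are simplified stand-ins for the sg/sevent structures mentioned above -- of how a first-match gid-to-sg lookup delivers the barrier to the wrong group when two groups share a gid:

/* Simplified illustration: when two groups carry the same global id, the
 * gid-to-sg lookup stops at the first match, which may be the group with
 * no pending sevent, so the real joiner never sees its barrier callback. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct sg {
	uint32_t    global_id;
	const char *name;
	int         has_sevent;   /* join in progress, waiting for barrier */
};

/* Hypothetical lookup: first sg whose global_id matches wins. */
static struct sg *gid_to_sg(struct sg *groups, size_t n, uint32_t gid)
{
	for (size_t i = 0; i < n; i++)
		if (groups[i].global_id == gid)
			return &groups[i];
	return NULL;
}

int main(void)
{
	/* Both gfs-1 and gfs-2 ended up with gid 54 (see the cman_tool
	 * services output above). */
	struct sg groups[] = {
		{ 54, "gfs-1", 0 },   /* already past its join, no sevent   */
		{ 54, "gfs-2", 1 },   /* joining, stuck in SEST_BARRIER_WAIT */
	};

	struct sg *sg = gid_to_sg(groups, 2, 54);
	if (sg && !sg->has_sevent)
		printf("process_startdone_barrier_new: no sev 0 "
		       "(lookup returned \"%s\"; barrier meant for \"gfs-2\" is lost)\n",
		       sg->name);
	return 0;
}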
If someone had seen this bug they would have gotten kernel oopses (prior to a recent change) or the "process_startdone_barrier_new" message (after the recent change). The "recent change" was the fix for bug 206212, which turns the oops into a kernel message plus a hang when this bug (duplicate global ids) is hit. So, fixing this bug will make the fix for bug 206212 pretty irrelevant. Note the comment in bug 206212 that states "the deeper question is why" -- this bug is the deeper why.
I think we should try to get a fix for this into the errata release; it turns out to be pretty simple to hit. Working on a possible solution.
Before choosing a new global id for a new group, a node discovers what the current max group id seen by any node in the cluster is and picks new=max+1. The problem is when two nodes do this in parallel for two different new groups (that are at the same level) and end up assigning the same global id to the two different groups. To fix this, the idea is to have a node take this max group id and add its own nodeid to it, so new=max+our_nodeid. This is similar to designating some of the id bits to be the nodeid of the node creating the group, but that method would break backward compatibility. What I'm not certain of is whether there are cases where the two nodes might see different values of max so that their different nodeids would make the two new values the same again.
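Not the actual SM code, but a tiny standalone sketch of the two assignment schemes described above, using the numbers from the cman_tool output earlier in this bug (highest existing gid 53 for SPACE-51, so both joiners picked 54 under the old scheme):

/* Old scheme vs. proposed scheme for assigning a new group's global id.
 * With new = max + 1, two nodes creating groups in parallel pick the same
 * id; with new = max + our_nodeid, a shared view of max yields distinct
 * ids as long as the creating nodeids differ. */
#include <stdio.h>
#include <stdint.h>

static uint32_t new_gid_old(uint32_t cluster_max)
{
	return cluster_max + 1;
}

static uint32_t new_gid_new(uint32_t cluster_max, uint32_t our_nodeid)
{
	return cluster_max + our_nodeid;
}

int main(void)
{
	uint32_t max = 53;   /* highest gid seen by both nodes */

	printf("old: node1=%u node2=%u\n",
	       new_gid_old(max), new_gid_old(max));        /* 54 and 54: collision */
	printf("new: node1=%u node2=%u\n",
	       new_gid_new(max, 1), new_gid_new(max, 2));  /* 54 and 55: distinct  */
	return 0;
}

As the last paragraph above notes, this only holds if the two nodes agree on max; the open question is whether there are windows where they see different values.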
The test results of the idea in comment 15 look good, although I'm doing more testing; I may also add a check somewhere to give us an explicit warning if duplicates exist. I'm pretty satisfied that adding the nodeid will solve the problem: when two nodes are creating a new group in parallel, they'll have a common max value AFAICS, so no max(4)+nodeid(1)=gid(5) vs max(3)+nodeid(2)=gid(5) collision. There are a number of reasons we can't use uuidgen: we're buried in the kernel, our id is 32 bits, and we need to maintain backward compat. In groupd (RHEL5), gid's are still created (by an even simpler local nodeid | local counter), but not used, only passed to dlm_controld and gfs_controld. gfs_controld can just use a local counter it keeps itself (i.e. it's not global), and I hope to do something simpler for dlm lockspace id's within the dlm. Then the gid's can be removed completely.
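For reference, a hypothetical sketch of the "local nodeid | local counter" style of id mentioned above for groupd. The actual bit split groupd uses may differ; this just shows why two nodes cannot collide when the creating nodeid is part of the value:

/* Hypothetical 32-bit id: high half carries the creating node's id, low
 * half a per-node counter, so ids from different nodes never collide. */
#include <stdio.h>
#include <stdint.h>

static uint32_t make_id(uint16_t nodeid, uint16_t counter)
{
	return ((uint32_t)nodeid << 16) | counter;
}

int main(void)
{
	printf("node 1, counter 7 -> 0x%08x\n", (unsigned)make_id(1, 7));
	printf("node 2, counter 7 -> 0x%08x\n", (unsigned)make_id(2, 7));
	return 0;
}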
Checked the fix into the RHEL4 branch. http://sources.redhat.com/ml/cluster-cvs/2006-q3/msg00661.html
This "fix" for this from comment 19 is bad and makes the problem much worse. Testing what I expect is the correct fix right now.
Just made/mounted 65 gfs filesystems on a 4-node cluster. Marking verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0135.html
Test loops mounting four fs's in parallel on four nodes, then unmounts all in parallel. After some time it hit this bug again; note that lockspaces lv0 and lv1 both have GID 5531.

[root@marathon-01 src]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2    2 run       -
[1 2 3 4 5]

DLM Lock Space:  "clvmd"                             3    3 run       -
[1 2 3 4 5]

DLM Lock Space:  "lv2"                            5527 2584 run       -
[1 3]

DLM Lock Space:  "lv1"                            5531 2586 update    U-4,1,2
[1 2]

DLM Lock Space:  "lv0"                            5531 2588 join      S-8,40,2
[1 2]

GFS Mount Group: "lv2"                            5530 2585 run       -
[1 3]

GFS Mount Group: "lv1"                            5532 2587 run       -
[1]

[root@marathon-02 src]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2    2 run       -
[1 2 3 4 5]

DLM Lock Space:  "clvmd"                             3    3 run       -
[1 2 3 4 5]

DLM Lock Space:  "lv3"                            5526 2582 run       -
[2 4 3]

DLM Lock Space:  "lv0"                            5531 2584 update    U-4,1,1
[2 1]

DLM Lock Space:  "lv1"                            5531 2586 join      S-8,40,2
[1 2]

GFS Mount Group: "lv3"                            5529 2583 run       -
[2 4 3]

GFS Mount Group: "lv0"                            5533 2585 run       -
[2]

marathon-01: dlm: lv0: unexpected rcom msgid 5/4 nodeid=2
marathon-02: dlm: lv1: unexpected rcom msgid 4/1 nodeid=1
Changing acks to target 4.8 (instead of 4.5).
It's not possible to fix this without major work that would break compatibility. No users or customers have ever reported seeing this, and hitting it in practice would be highly unlikely.