Bug 257881 - Flush/recovery collision leads to deadlock after leg and node failure
Summary: Flush/recovery collision leads to deadlock after leg and node failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror-kernel
Version: 4
Hardware: All
OS: All
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-08-27 20:05 UTC by Corey Marthaler
Modified: 2010-01-12 02:11 UTC
CC List: 0 users

Fixed In Version: RHBA-2007-0991
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-21 21:15:21 UTC
Embargoed:


Attachments: None


Links:
Red Hat Product Errata RHBA-2007:0991 (normal, SHIPPED_LIVE): cmirror-kernel bug fix update - Last Updated: 2007-11-29 14:39:06 UTC

Description Corey Marthaler 2007-08-27 20:05:25 UTC
Description of problem:
This seems to be yet another deadlock scenario caused by a mirror leg and node
failure. This is the first time, however, that I've seen these "Flush/recovery
collision" messages.

In this case, the setup is a 3-node cluster with 2 mirrors and GFS on top.
[root@link-08 ~]# cman_tool nodes
 Node  Votes Exp Sts  Name
    1    1    3   M   link-07
    2    1    3   M   link-08
    3    1    3   M   link-02

All three nodes were doing GFS I/O (xdoio/xiogen) to each mirror. I then failed
link-02.


Aug 27 14:08:00 link-08 kernel: dm-cmirror: LOG INFO:
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   uuid:
LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC09FtKCMUuKu1eqm5b6hiBbGU8SlRa3tWs
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Aug 27 14:08:00 link-08 kernel: dm-cmirror:  ?region_count: 10240
Aug 27 14:08:00 link-08 kernel: dm-cmirror:  ?sync_count  : 10240
Aug 27 14:08:00 link-08 kernel: dm-cmirror:  ?sync_search : 5218
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   in_sync     : YES
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   suspended   : NO
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   server_id   : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   server_valid: YES
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Creating SlRa3tWs (2)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: server_id=dead, server_valid=0, SlRa3tWs
Aug 27 14:08:00 link-08 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_ELECTION(9): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   starter     : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   node_count  : 0
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_ELECTION(9): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   starter     : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   node_count  : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_SELECTION(10): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   starter     : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   node_count  : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_MASTER_ASSIGN(11): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   starter     : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   node_count  : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: I'm the cluster mirror log server for SlRa3tWs
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Not reading disk log because I'm suspended (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Resync work completed: 58
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Assigning recovery work to 2/y41ot1K6: 59
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Client received resync work: 59/y41ot1K6
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_ELECTION(9): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror:   starter     : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   co-ordinator: 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   node_count  : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: LRT_SELECTION(10): (SlRa3tWs)
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   starter     : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   co-ordinator: 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   node_count  : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: LRT_MASTER_ASSIGN(11): (SlRa3tWs)
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   starter     : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   co-ordinator: 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror:   node_count  : 1


[...]

I'm seeing the following on the recovering node (link-08) over and over:
Aug 27 14:13:21 link-08 kernel: dm-cmirror: Delaying flush due to recovery
Aug 27 14:13:21 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:22 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:22 link-08 kernel: dm-cmirror: Delaying flush due to recovery
Aug 27 14:13:22 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:23 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:23 link-08 kernel: dm-cmirror: Delaying flush due to recovery
Aug 27 14:13:23 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2


[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,3
[1 2]

DLM Lock Space:  "clvmd"                            16   7 run       -
[1 2]

DLM Lock Space:  "clustered_log"                    25  12 run       -
[1 2]

DLM Lock Space:  "A"                                27  13 run       S-10,200,0
[1 2]

DLM Lock Space:  "B"                                31  15 run       -
[1 2]

GFS Mount Group: "A"                                29  14 recover 2 -
[1 2]

GFS Mount Group: "B"                                33  16 recover 2 -
[1 2]



[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,3
[2 1]

DLM Lock Space:  "clvmd"                            16   7 run       -
[1 2]

DLM Lock Space:  "clustered_log"                    25  12 run       -
[1 2]

DLM Lock Space:  "A"                                27  13 run       -
[2 1]

DLM Lock Space:  "B"                                31  15 run       -
[2 1]

GFS Mount Group: "A"                                29  14 recover 4 -
[2 1]

GFS Mount Group: "B"                                33  16 recover 4 -
[2 1]


[root@link-07 ~]# dmsetup info
Name:              helter_skelter-cmirrorB_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC09FtKCMUuKu1eqm5b6hiBbGU8SlRa3tWs

Name:              helter_skelter-cmirrorB_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0eQ1Q3huZrNeTaPJh7NYe0aZlrmzHHjZC

Name:              helter_skelter-cmirrorB_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0yluKUhYNDfOayUiJdQ7DB1KP0Gn3W1Cu

Name:              helter_skelter-cmirrorA_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0QR6IuD8m3W4ji7FbuC6bHoakgPztDJUH

Name:              helter_skelter-cmirrorB_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0P63r93CxVt71v2grnlIjKnGzkFWTonPX

Name:              helter_skelter-cmirrorA_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0S2F1xXjB3nsqaJfAW45NbquCyiv0EBAc

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-STM37j70AgnyxGIok8z6Do2lnDqib62LNwih5UmH5f6NXox5no7dn9QvpYIwLxCo

Name:              helter_skelter-cmirrorB
State:             ACTIVE
Tables present:    LIVE & INACTIVE
Open count:        1
Event number:      39
Major, minor:      253, 11
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC03IvxbAPAm2MtIud7UV2zCjBGO6wXi2ub

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-STM37j70AgnyxGIok8z6Do2lnDqib62L8NgJLsAPdn2xis4KU1whY1E1FJqScoMV

Name:              helter_skelter-cmirrorA
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      46
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0tU8KTVt23nMJEURar3vzQWOivOR9ro5l

Name:              helter_skelter-cmirrorA_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0tthIfQT3Jk3nSlnVmRwA3HUoy41ot1K6



[root@link-08 ~]# dmsetup status
helter_skelter-cmirrorB_mlog: 0 8192 linear
helter_skelter-cmirrorB_mimage_2: 0 10485760 linear
helter_skelter-cmirrorB_mimage_1: 0 10485760 linear
helter_skelter-cmirrorA_mimage_2: 0 10485760 linear
helter_skelter-cmirrorB_mimage_0: 0 10485760 linear
helter_skelter-cmirrorA_mimage_1: 0 10485760 linear
VolGroup00-LogVol01: 0 4063232 linear
helter_skelter-cmirrorB: 0 10485760 mirror 3 253:8 253:9 253:10 10240/10240 1 DAA 3 clustered_disk 253:7 A
VolGroup00-LogVol00: 0 151781376 linear
helter_skelter-cmirrorA: 0 10485760 mirror 2 253:5 253:4 10240/10240 1 AA 3 clustered_disk 253:2 A
helter_skelter-cmirrorA_mlog: 0 8192 linear


Version-Release number of selected component (if applicable):
2.6.9-56.ELsmp
cmirror-kernel-2.6.9-33.2
lvm2-cluster-2.02.27-1.el4

Comment 1 Jonathan Earl Brassow 2007-08-31 18:43:24 UTC
Can you reproduce this with cmirror-kernel > 2.6.9-33.2?

Comment 2 Jonathan Earl Brassow 2007-08-31 18:48:49 UTC
Judging by the LRT_ELECTION messages, it seems that link-08 thinks it's the
server, but link-07 tried an election and got _itself_ as the master. This
would mean that the nodes do not have a consistent view of who the server is.

If you hit this again, do a 'dmsetup table' on each node.  That will print out
the 'LOG INFO' information.  From that, we will be able to tell who everyone
thinks the server is.  Obviously, the server should be the same on all nodes.


Comment 3 Jonathan Earl Brassow 2007-09-12 15:41:03 UTC
I can trigger this without node failures.  I do:

1) create 4 LVs in the same VG
2) create a GFS file system on each of them
3) mount all 4 LVs on 3 machines and do I/O
4) repeatedly convert one of the LVs to/from a disk log

Within 15 minutes, I hit an unresolvable 'Flush/recovery collision'.

I believe this is because a flush from any machine can block if the new log
bitmap contains a 'mark' that conflicts with the region being recovered - even
if the machine requesting the flush did not perform the conflicting 'mark'.

So, you can get into a situation where:
node1) gets recovery work
node2) marks the region that is being recovered
node1) marks some different region and tries to flush, but is delayed
node1) can't finish its recovery work because it can't get past the flush
**) indefinite stand-off
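
To make the stand-off concrete, here is a minimal, self-contained sketch in C. It only models the behaviour described above; the structure and function names (region_mark, flush_must_wait_buggy) are hypothetical and are not taken from the dm-cmirror sources. The point is that the delay decision ignores which node issued the flush, so node1's unrelated flush is held up by node2's mark on the recovering region:

#include <stdbool.h>
#include <stdio.h>

#define MAX_MARKS 8

/* One outstanding 'mark': a region some node is currently writing to. */
struct region_mark {
    unsigned long region;   /* region number that was marked */
    int marker_node;        /* node id that issued the mark */
};

static struct region_mark marks[MAX_MARKS];
static int nr_marks;

/* Region currently handed out as recovery work. */
static unsigned long recovering_region;
static bool recovery_active;

/* Buggy rule: delay the flush if *any* outstanding mark touches the
 * recovering region, no matter which node asked for the flush. */
static bool flush_must_wait_buggy(int flushing_node)
{
    (void)flushing_node;    /* the requesting node is ignored - the bug */

    for (int i = 0; i < nr_marks; i++)
        if (recovery_active && marks[i].region == recovering_region)
            return true;
    return false;
}

int main(void)
{
    /* node1 holds recovery work for region 5217 (the region number seen
     * in the collision messages above). */
    recovery_active = true;
    recovering_region = 5217;

    /* node2 marks the recovering region; node1 marks an unrelated one. */
    marks[nr_marks++] = (struct region_mark){ .region = 5217, .marker_node = 2 };
    marks[nr_marks++] = (struct region_mark){ .region = 42, .marker_node = 1 };

    /* node1's flush is delayed by node2's conflicting mark, so node1 can
     * never finish recovering 5217 - the indefinite stand-off. */
    printf("node1 flush delayed: %s\n",
           flush_must_wait_buggy(1) ? "yes (deadlock)" : "no");
    return 0;
}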


Comment 4 Jonathan Earl Brassow 2007-09-13 15:25:49 UTC
checkin log:
The procedure for coordinating nominal I/O and recovery I/O was to
either:
1) delay a flush which contained a mark to a region being recovered
2) skip over regions that are currently marked when assigning recovery

This bug has to do with the way #1 was implemented.

The following scenario would trigger it:
1) node1 is assigned recovery on region X
2) node1 also does a mark (write) on region Y
3) node2 attempts to mark region X
**) any flush issued here will be delayed, waiting for recovery to complete on X
4) node1 needs to perform the flush before it can get on with completing
recovery - but it can't flush, so everyone is delayed *forever*.

The fix was to allow flushes from nodes that are not attempting to mark
regions that are being recovered.  In the example above, node1 should be
allowed to complete the flush because it is not trying to write to the
same region that is being recovered.  node2 would be correctly delayed.
Since node1 can complete the flush, it can also complete the recovery -
thus allowing things to proceed.

This bug only affects mirrors that are not in-sync and are doing I/O.
This bug can occur whether there are device/machine failures or not.
This bug is most easily reproduced with a number of mirrors, but would
be possible with just one.

I've also fixed up some debugging output so it is more consistent and
easier to follow the flow of events.
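
A companion sketch to the one in comment 3, reusing the same hypothetical structures; it only models the fix described in the checkin log above, not the shipped dm-cmirror code. The flush is now delayed only when the flushing node itself has a mark on the region being recovered, so node1 can flush and finish recovery while node2 waits:

/* Fixed rule (sketch): only delay the flush if the requesting node has
 * an outstanding mark on the region currently being recovered. */
static bool flush_must_wait_fixed(int flushing_node)
{
    for (int i = 0; i < nr_marks; i++)
        if (recovery_active &&
            marks[i].region == recovering_region &&
            marks[i].marker_node == flushing_node)
            return true;    /* this node wrote to the recovering region */
    return false;           /* unrelated flushes may proceed */
}

/* With the same state as the earlier sketch:
 *   flush_must_wait_fixed(1) == false -> node1 flushes and recovery of 5217 completes
 *   flush_must_wait_fixed(2) == true  -> node2 is correctly delayed until then */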


Comment 7 errata-xmlrpc 2007-11-21 21:15:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0991.html


