Description of problem:
This seems to be yet another deadlock scenario due to a mirror leg and node failure. However, this is the first time I've seen these "Flush/recovery collision" messages. In this case, 3 node cluster with 2 mirrors and gfs on top.

[root@link-08 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-07
   2    1    3   M   link-08
   3    1    3   M   link-02

All three nodes were doing gfs i/o (xdoio/xiogen) to each mirror. I then failed link-02.

Aug 27 14:08:00 link-08 kernel: dm-cmirror: LOG INFO:
Aug 27 14:08:00 link-08 kernel: dm-cmirror: uuid: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC09FtKCMUuKu1eqm5b6hiBbGU8SlRa3tWs
Aug 27 14:08:00 link-08 kernel: dm-cmirror: uuid_ref : 1
Aug 27 14:08:00 link-08 kernel: dm-cmirror: region_count: 10240
Aug 27 14:08:00 link-08 kernel: dm-cmirror: sync_count : 10240
Aug 27 14:08:00 link-08 kernel: dm-cmirror: sync_search : 5218
Aug 27 14:08:00 link-08 kernel: dm-cmirror: in_sync : YES
Aug 27 14:08:00 link-08 kernel: dm-cmirror: suspended : NO
Aug 27 14:08:00 link-08 kernel: dm-cmirror: server_id : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: server_valid: YES
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Creating SlRa3tWs (2)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: server_id=dead, server_valid=0, SlRa3tWs
Aug 27 14:08:00 link-08 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_ELECTION(9): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: starter : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: node_count : 0
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_ELECTION(9): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: starter : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: node_count : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_SELECTION(10): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: starter : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: node_count : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_MASTER_ASSIGN(11): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: starter : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: co-ordinator: 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: node_count : 2
Aug 27 14:08:00 link-08 kernel: dm-cmirror: I'm the cluster mirror log server for SlRa3tWs
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Not reading disk log because I'm suspended (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Resync work completed: 58
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Assigning recovery work to 2/y41ot1K6: 59
Aug 27 14:08:00 link-08 kernel: dm-cmirror: Client received resync work: 59/y41ot1K6
Aug 27 14:08:00 link-08 kernel: dm-cmirror: LRT_ELECTION(9): (SlRa3tWs)
Aug 27 14:08:00 link-08 kernel: dm-cmirror: starter : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: co-ordinator: 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: node_count : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: LRT_SELECTION(10): (SlRa3tWs)
Aug 27 14:08:01 link-08 kernel: dm-cmirror: starter : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: co-ordinator: 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: node_count : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: LRT_MASTER_ASSIGN(11): (SlRa3tWs)
Aug 27 14:08:01 link-08 kernel: dm-cmirror: starter : 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: co-ordinator: 1
Aug 27 14:08:01 link-08 kernel: dm-cmirror: node_count : 1
[...]
I'm seeing this on the recovering node (link-08) over and over.

Aug 27 14:13:21 link-08 kernel: dm-cmirror: Delaying flush due to recovery
Aug 27 14:13:21 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:22 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:22 link-08 kernel: dm-cmirror: Delaying flush due to recovery
Aug 27 14:13:22 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:23 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2
Aug 27 14:13:23 link-08 kernel: dm-cmirror: Delaying flush due to recovery
Aug 27 14:13:23 link-08 kernel: dm-cmirror: Flush/recovery collision (5217/SlRa3tWs): count = 2, marker = 1, recoverer = 2

[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,3
[1 2]
DLM Lock Space:  "clvmd"                            16   7 run       -
[1 2]
DLM Lock Space:  "clustered_log"                    25  12 run       -
[1 2]
DLM Lock Space:  "A"                                27  13 run       S-10,200,0
[1 2]
DLM Lock Space:  "B"                                31  15 run       -
[1 2]
GFS Mount Group: "A"                                29  14 recover 2 -
[1 2]
GFS Mount Group: "B"                                33  16 recover 2 -
[1 2]

[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 run       U-1,10,3
[2 1]
DLM Lock Space:  "clvmd"                            16   7 run       -
[1 2]
DLM Lock Space:  "clustered_log"                    25  12 run       -
[1 2]
DLM Lock Space:  "A"                                27  13 run       -
[2 1]
DLM Lock Space:  "B"                                31  15 run       -
[2 1]
GFS Mount Group: "A"                                29  14 recover 4 -
[2 1]
GFS Mount Group: "B"                                33  16 recover 4 -
[2 1]

[root@link-07 ~]# dmsetup info
Name:              helter_skelter-cmirrorB_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC09FtKCMUuKu1eqm5b6hiBbGU8SlRa3tWs

Name:              helter_skelter-cmirrorB_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0eQ1Q3huZrNeTaPJh7NYe0aZlrmzHHjZC

Name:              helter_skelter-cmirrorB_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0yluKUhYNDfOayUiJdQ7DB1KP0Gn3W1Cu

Name:              helter_skelter-cmirrorA_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0QR6IuD8m3W4ji7FbuC6bHoakgPztDJUH

Name:              helter_skelter-cmirrorB_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0P63r93CxVt71v2grnlIjKnGzkFWTonPX

Name:              helter_skelter-cmirrorA_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0S2F1xXjB3nsqaJfAW45NbquCyiv0EBAc

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-STM37j70AgnyxGIok8z6Do2lnDqib62LNwih5UmH5f6NXox5no7dn9QvpYIwLxCo

Name:              helter_skelter-cmirrorB
State:             ACTIVE
Tables present:    LIVE & INACTIVE
Open count:        1
Event number:      39
Major, minor:      253, 11
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC03IvxbAPAm2MtIud7UV2zCjBGO6wXi2ub

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-STM37j70AgnyxGIok8z6Do2lnDqib62L8NgJLsAPdn2xis4KU1whY1E1FJqScoMV

Name:              helter_skelter-cmirrorA
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      46
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0tU8KTVt23nMJEURar3vzQWOivOR9ro5l

Name:              helter_skelter-cmirrorA_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-k7vgzVJA1j4NkvxgpXItDkpkVmlU1iC0tthIfQT3Jk3nSlnVmRwA3HUoy41ot1K6

[root@link-08 ~]# dmsetup status
helter_skelter-cmirrorB_mlog: 0 8192 linear
helter_skelter-cmirrorB_mimage_2: 0 10485760 linear
helter_skelter-cmirrorB_mimage_1: 0 10485760 linear
helter_skelter-cmirrorA_mimage_2: 0 10485760 linear
helter_skelter-cmirrorB_mimage_0: 0 10485760 linear
helter_skelter-cmirrorA_mimage_1: 0 10485760 linear
VolGroup00-LogVol01: 0 4063232 linear
helter_skelter-cmirrorB: 0 10485760 mirror 3 253:8 253:9 253:10 10240/10240 1 DAA 3 clustered_disk 253:7 A
VolGroup00-LogVol00: 0 151781376 linear
helter_skelter-cmirrorA: 0 10485760 mirror 2 253:5 253:4 10240/10240 1 AA 3 clustered_disk 253:2 A
helter_skelter-cmirrorA_mlog: 0 8192 linear

Version-Release number of selected component (if applicable):
2.6.9-56.ELsmp
cmirror-kernel-2.6.9-33.2
lvm2-cluster-2.02.27-1.el4
Can you reproduce this with cmirror-kernel > 2.6.9-33.2?
Judging by the LRT_ELECTION messages, it seems that link-08 thinks it's the server, but link-07 ran an election and got _itself_ as the master. This would mean that the nodes do not have a consistent view of who the server is. If you hit this again, run 'dmsetup table' on each node; that will print out the 'LOG INFO' information. From that, we will be able to tell who each node thinks the server is. Obviously, the server should be the same on all nodes.
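For reference, the commands to collect would look something like the following (the LV names are just the ones from this report; run the same commands on every node and compare the output):

  dmsetup table helter_skelter-cmirrorA
  dmsetup table helter_skelter-cmirrorB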
I can trigger this without node failures. I do:
1) create 4 LVs on the same VG
2) create GFS file systems on them all
3) mount the 4 LVs on 3 machines and do I/O
4) loop converting to/from disk log on one of the LVs

Within 15 minutes, I hit an unresolvable 'Flush/recovery collision'. I believe this is due to the fact that a flush from any machine can block if the new log bitmap contains a 'mark' that will conflict with the recovering region - even if the machine requesting the flush did not perform the conflicting 'mark'. So, you can get into a situation where:

node1) gets recovery work
node2) marks the region that is being recovered
node1) marks some different region and tries to flush, but is delayed
node1) can't finish recovery work because it can't get past the flush
**) indefinite stand-off
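To make the stand-off concrete, here is a small, self-contained sketch (plain C) of the original server-side decision as I understand it. This is not the dm-cmirror source; the struct, the function name, and the region/node numbers are made up purely for illustration.

#include <stdio.h>

struct region_state {
    int region;  /* region number */
    int marker;  /* node holding the mark (pending write) */
};

/*
 * Original rule (simplified): delay *every* flush while a marked region is
 * being recovered, regardless of which node owns the conflicting mark.
 */
static int delay_flush(const struct region_state *marks, int nmarks,
                       int recovering_region)
{
    for (int i = 0; i < nmarks; i++)
        if (marks[i].region == recovering_region)
            return 1;
    return 0;
}

int main(void)
{
    /* node1 is recovering region 5217; node2 has marked 5217;
       node1 has marked an unrelated region, 100 */
    struct region_state marks[] = { { 5217, 2 }, { 100, 1 } };

    /* node1's flush for its unrelated mark is delayed too, so it can never
       finish recovering 5217 and the conflicting mark never clears */
    printf("node1 flush delayed: %d\n", delay_flush(marks, 2, 5217));
    return 0;
}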
checkin log:

The procedure for coordinating nominal I/O and recovery I/O was to either:
1) delay a flush which contained a mark to a region being recovered
2) skip over regions that are currently marked when assigning recovery

This bug has to do with the way #1 was implemented. The following scenario would trigger it:
1) node1 is assigned recovery on region X
2) node1 also does a mark (write) on region Y
3) node2 attempts to mark region X
**) any flush issued here will be delayed waiting for recovery to complete on X
4) node1 needs to perform the flush before it can get on with completing recovery - but it can't flush, so everyone is delayed *forever*.

The fix was to allow flushes from nodes that are not attempting to mark regions that are being recovered. In the example above, node1 should be allowed to complete the flush because it is not trying to write to the same region that is being recovered; node2 would be correctly delayed. Since node1 can complete the flush, it can also complete the recovery - thus allowing things to proceed.

This bug only affects mirrors that are not in-sync and are doing I/O. It can occur whether there are device/machine failures or not. It is most easily reproduced with a number of mirrors, but would be possible with just one.

I've also fixed up some debugging output so it is more consistent and easier to follow the flow of events.
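The changed rule, sketched the same way (again only an illustration using the same made-up region_state layout, not the code that was actually checked in): a flush is delayed only if the flushing node itself has marked the region that is being recovered.

#include <stdio.h>

struct region_state {
    int region;  /* region number */
    int marker;  /* node holding the mark (pending write) */
};

/* Fixed rule (simplified): delay the flush only when the flushing node
   itself holds a mark on the region currently being recovered. */
static int delay_flush(const struct region_state *marks, int nmarks,
                       int recovering_region, int flushing_node)
{
    for (int i = 0; i < nmarks; i++)
        if (marks[i].region == recovering_region &&
            marks[i].marker == flushing_node)
            return 1;  /* this node wrote to the recovering region */
    return 0;          /* safe to flush; recovery can finish */
}

int main(void)
{
    /* scenario from the checkin log: node1 recovers X (5217), node2 has
       marked X, node1 has marked an unrelated region Y (100) */
    struct region_state marks[] = { { 5217, 2 }, { 100, 1 } };

    printf("node1 delayed: %d\n", delay_flush(marks, 2, 5217, 1)); /* prints 0 */
    printf("node2 delayed: %d\n", delay_flush(marks, 2, 5217, 2)); /* prints 1 */
    return 0;
}

With node1's flush no longer held up, it can finish the recovery of X, after which node2's delayed flush can proceed as well.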
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0991.html