Description of problem:
While running I/O on one node on a cluster mirror, my I/O load hangs. Upon closer inspection I found that all nodes complained. Here is the tail of each node's dmesg. I will attach the complete output from basic.

[nstraz@try 3]$ tail -n 12 *.dmesg
==> basic.dmesg <==
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9
dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9

==> doral.dmesg <==
cdrom: open failed.
cdrom: open failed.
dm-cmirror: Creating K8rNJHz9 (1)
dm-cmirror: start_server called
dm-cmirror: cluster_log_serverd ready for work
dm-cmirror: Node joining
dm-cmirror: server_id=dead, server_valid=0, K8rNJHz9
dm-cmirror: trigger = LRT_GET_RESYNC_WORK
dm-cmirror: LRT_ELECTION(10): (K8rNJHz9)
dm-cmirror:  starter     : 4
dm-cmirror:  co-ordinator: 4
dm-cmirror:  node_count  : 0

==> kent.dmesg <==
dm-cmirror: Creating K8rNJHz9 (1)
dm-cmirror: start_server called
dm-cmirror: cluster_log_serverd ready for work
dm-cmirror: Node joining
dm-cmirror: server_id=dead, server_valid=0, K8rNJHz9
dm-cmirror: trigger = LRT_GET_SYNC_COUNT
dm-cmirror: LRT_ELECTION(10): (K8rNJHz9)
dm-cmirror:  starter     : 2
dm-cmirror:  co-ordinator: 2
dm-cmirror:  node_count  : 0
dm-cmirror: Node joining
dm-cmirror: Node joining

==> newport.dmesg <==
cdrom: open failed.
dm-cmirror: Creating K8rNJHz9 (1)
dm-cmirror: start_server called
dm-cmirror: cluster_log_serverd ready for work
dm-cmirror: Node joining
dm-cmirror: server_id=dead, server_valid=0, K8rNJHz9
dm-cmirror: trigger = LRT_GET_RESYNC_WORK
dm-cmirror: LRT_ELECTION(10): (K8rNJHz9)
dm-cmirror:  starter     : 3
dm-cmirror:  co-ordinator: 3
dm-cmirror:  node_count  : 0
dm-cmirror: Node joining

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.21-7.el4.ppc64
lvm2-2.02.21-5.el4.ppc
cmirror-1.0.1-1.ppc64
device-mapper-1.02.17-3.el4.ppc
device-mapper-1.02.17-3.el4.ppc64

How reproducible:
I've hit it twice already. It should be easy to hit again on the same hardware.

Steps to Reproduce:
lvm_try, mirror_2 volume config

Actual results:
See above

Expected results:

Additional info:
Created attachment 153896 [details] complete dmesg output from node basic.
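For anyone scanning logs for this symptom: the recurring dmesg line has a fixed shape, two counters followed by the mirror log's UUID (K8rNJHz9, the same UUID that shows up in the "Creating" lines on the other nodes). The following is only a small illustrative C sketch for pulling those fields out of such a line, e.g. to count how often each mirror log is affected; the names request_region and recovery_mark are guesses, since this report does not document what the two numbers actually represent.

#include <stdio.h>

int main(void)
{
    /* Example line taken verbatim from basic.dmesg above. */
    const char *line =
        "dm-cmirror: Remote recovery conflict: (3337311 >= 24639)/K8rNJHz9";
    unsigned long long request_region, recovery_mark;   /* hypothetical names */
    char log_uuid[64];

    if (sscanf(line,
               "dm-cmirror: Remote recovery conflict: (%llu >= %llu)/%63s",
               &request_region, &recovery_mark, log_uuid) == 3)
        printf("log %s: %llu >= %llu\n",
               log_uuid, request_region, recovery_mark);
    return 0;
}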
Writes are disallowed to regions that have not yet been recovered. This makes I/O suck, I know. Kernel changes are required to fix this problem without delaying I/O. This will be done in 4.6. For now, writes simply get delayed until the mirror has synced past the region being attempted.
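As a rough illustration of the behaviour described above, and not the actual dm-cmirror code, the sketch below shows the gating decision: a write aimed at a region that resync has not yet passed is held back rather than submitted. All names here (sync_high_water, dispatch_write, etc.) are made up for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t region_t;

/* Hypothetical stand-in for the mirror's resync progress: every region
 * below this number has already been recovered. */
static region_t sync_high_water = 24639;

static bool region_in_sync(region_t region)
{
    return region < sync_high_water;
}

/* Decide whether a write may proceed now or must wait for resync. */
static void dispatch_write(region_t region)
{
    if (!region_in_sync(region)) {
        /* Region not yet recovered: hold the write back rather than
         * risk writing to an out-of-sync mirror leg.  The real code
         * would queue it and retry once resync passes the region. */
        printf("region %llu not yet recovered: delaying write\n",
               (unsigned long long) region);
        return;
    }
    printf("region %llu in sync: write proceeds\n",
           (unsigned long long) region);
}

int main(void)
{
    dispatch_write(100);        /* already recovered: proceeds */
    dispatch_write(3337311);    /* not yet recovered: delayed  */
    return 0;
}

The point of the sketch is only the ordering constraint the comment describes: until resync has passed a region, writes to it wait instead of completing.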
Keep this bug open for a 4.6 errata.
The 2.6.9-55.16.ELsmp kernel has the necessary patches. The cmirror-kernel code was updated on July 11, 2007.
*** Bug 243773 has been marked as a duplicate of this bug. ***
post -> modified.
*** Bug 252007 has been marked as a duplicate of this bug. ***
This bug may not be completely fixed yet and may be related to currently open bz 290821.
Closing this out since it missed the errata process.