Bug 233034
| Summary: | cmirror server failure/migration during GFS I/O causes metadata error and then GFS withdraw | ||
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED WORKSFORME | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4 | CC: | agk, cfeist, dwysocha, jbrassow, mbroz, prockai |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2008-07-01 20:46:44 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Corey Marthaler
2007-03-19 23:18:30 UTC
assigned -> post Code change is a cluster product not a rhel product - switching to cluster4.5 flag instead of rhel4.5 flag. post -> modified I appear to have hit this bug attempting the same scenario in the original report, that is... start GFS i/o to all 3 cmirrors, fail a leg on all three mirrors, wait for the down convert to work, and then fail a node in the cluster. In this case I failed link-07, and the 'invalid metadata' happened on link-04. May 24 15:37:28 link-04 kernel: dm-cmirror: I'm the cluster mirror log server for PIKAqmNT May 24 15:37:28 link-04 kernel: dm-cmirror: Not reading disk log because I'm suspended (PIKAqmNT) May 24 15:37:28 link-04 kernel: dm-cmirror: Performing flush to work around bug 235040 May 24 15:37:28 link-04 kernel: dm-cmirror: Log flush complete May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_MASTER_LEAVING(13): (PIKAqmNT) May 24 15:37:48 link-04 kernel: dm-cmirror: starter : 1 May 24 15:37:48 link-04 kernel: dm-cmirror: co-ordinator: 0 May 24 15:37:48 link-04 kernel: dm-cmirror: node_count : 2 May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (PIKAqmNT) May 24 15:37:48 link-04 kernel: dm-cmirror: starter : 1 May 24 15:37:48 link-04 kernel: dm-cmirror: co-ordinator: 57005 May 24 15:37:48 link-04 kernel: dm-cmirror: node_count : 2 May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_SELECTION(11): (PIKAqmNT) May 24 15:37:48 link-04 kernel: dm-cmirror: starter : 1 May 24 15:37:48 link-04 kernel: dm-cmirror: co-ordinator: 57005 May 24 15:37:48 link-04 kernel: dm-cmirror: node_count : 2 May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_MASTER_ASSIGN(12): (PIKAqmNT) May 24 15:37:48 link-04 kernel: dm-cmirror: starter : 1 May 24 15:37:48 link-04 kernel: dm-cmirror: co-ordinator: 57005 May 24 15:37:48 link-04 kernel: dm-cmirror: node_count : 1 May 24 15:37:48 link-04 kernel: dm-cmirror: Removing PIKAqmNT (1) May 24 15:37:48 link-04 kernel: dm-cmirror: 0 region user structures freed May 24 15:37:48 link-04 lvm[14010]: Monitoring mirror device vg3-mirror3 for events May 24 15:37:48 link-04 lvm[14010]: vg3-mirror3 is now in-sync May 24 15:44:32 link-04 kernel: CMAN: removing node link-07 from the cluster : Missed too many hea rtbeats May 24 15:44:33 link-04 fenced[13193]: fencing deferred to link-02 May 24 15:45:13 link-04 kernel: dm-cmirror: A cluster mirror log member has failed. May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume:: PIKAqmNT (active) May 24 15:45:13 link-04 kernel: dm-cmirror: Live nodes :: 3 May 24 15:45:13 link-04 kernel: dm-cmirror: In-Use Regions :: 1 May 24 15:45:13 link-04 kernel: dm-cmirror: Good IUR's :: 1 May 24 15:45:13 link-04 kernel: dm-cmirror: Bad IUR's :: 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Sync count :: 10240 May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Region count :: 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Region count :: 10240 May 24 15:45:13 link-04 kernel: dm-cmirror: NOTE: Mapping has changed. May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions:: May 24 15:45:13 link-04 kernel: dm-cmirror: Total = 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions:: May 24 15:45:13 link-04 kernel: dm-cmirror: Total = 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume:: mloZLxpj (active) May 24 15:45:13 link-04 kernel: dm-cmirror: Live nodes :: 3 May 24 15:45:13 link-04 kernel: dm-cmirror: In-Use Regions :: 1 May 24 15:45:13 link-04 kernel: dm-cmirror: Good IUR's :: 1 May 24 15:45:13 link-04 kernel: dm-cmirror: Bad IUR's :: 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Sync count :: 10240 May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Region count :: 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Region count :: 10240 May 24 15:45:13 link-04 kernel: dm-cmirror: NOTE: Mapping has changed. May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions:: May 24 15:45:13 link-04 kernel: dm-cmirror: Total = 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions:: May 24 15:45:13 link-04 kernel: dm-cmirror: Total = 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume:: vbQ8ivnj (active) May 24 15:45:13 link-04 kernel: dm-cmirror: Live nodes :: 3 May 24 15:45:13 link-04 kernel: dm-cmirror: In-Use Regions :: 1 May 24 15:45:13 link-04 kernel: dm-cmirror: Good IUR's :: 1 May 24 15:45:13 link-04 kernel: dm-cmirror: Bad IUR's :: 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Sync count :: 10240 May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Region count :: 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Region count :: 10240 May 24 15:45:13 link-04 kernel: dm-cmirror: NOTE: Mapping has changed. May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions:: May 24 15:45:13 link-04 kernel: dm-cmirror: Total = 0 May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions:: May 24 15:45:13 link-04 kernel: dm-cmirror: Total = 0 May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Trying to acquire journal lock... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Trying to acquire journal lock... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Trying to acquire journal lock... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Looking at journal... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Looking at journal... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Looking at journal... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Acquiring the transaction lock... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Acquiring the transaction lock... May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Acquiring the transaction lock... May 24 15:45:14 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Replaying journal... May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Replayed 7 of 15 blocks May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: replays = 7, skips = 7, sames = 1 May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Replaying journal... May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Replaying journal... May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Journal replayed in 2s May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Done May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: fatal: invalid metadata block May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: bh = 637377 (magic) May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: function = foreach_descriptor May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/recovery.c, line = 400 May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: time = 1180039516 May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: about to withdraw from the cluster May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: waiting for outstanding I/O May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: telling LM to withdraw May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: fatal: invalid metadata block May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: bh = 635057 (magic) May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: function = foreach_descriptor May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/recovery.c, line = 400 May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: time = 1180039516 May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: about to withdraw from the cluster May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: waiting for outstanding I/O May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: telling LM to withdraw May 24 16:00:33 link-04 sshd(pam_unix)[17118]: session opened for user root by root(uid=0) [root@link-04 ~]# cman_tool nodes Node Votes Exp Sts Name 1 1 4 X link-07 2 1 4 M link-02 3 1 4 M link-04 4 1 4 M link-08 [root@link-04 ~]# cman_tool services Service Name GID LID State Code Fence Domain: "default" 5 2 run - [2 3 4] DLM Lock Space: "clvmd" 10 5 run - [3 2 4] DLM Lock Space: "clustered_log" 11 6 run - [2 3 4] DLM Lock Space: "1" 13 7 run S-10,200,0 [2 3 4] DLM Lock Space: "2" 17 9 run - [2 3 4] DLM Lock Space: "3" 21 11 run - [2 3 4] GFS Mount Group: "1" 15 8 recover 2 - [2 3 4] GFS Mount Group: "2" 19 10 recover 2 - [2 3 4] GFS Mount Group: "3" 23 12 run - [2 3 4] [root@link-04 ~]# dmsetup ls vg1-mirror1 (253, 5) vg3-mirror3_mimage_2 (253, 13) vg3-mirror3_mimage_1 (253, 12) vg3-mirror3_mlog (253, 10) vg2-mirror2 (253, 9) vg1-mirror1_mimage_2 (253, 4) vg1-mirror1_mimage_1 (253, 3) vg3-mirror3 (253, 14) VolGroup00-LogVol01 (253, 1) vg2-mirror2_mimage_2 (253, 8) VolGroup00-LogVol00 (253, 0) vg2-mirror2_mimage_1 (253, 7) [root@link-04 ~]# dmsetup info Name: vg1-mirror1 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 1 Major, minor: 253, 5 Number of targets: 1 UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8obumXSC6B3CduV0RBrFdhrGJvbQ8ivnj Name: vg3-mirror3_mimage_2 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZpaLs9n0BnDV4VslNvPVcRL3d45zqssuA Name: vg3-mirror3_mimage_1 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 12 Number of targets: 1 UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZMRSO8BF06jpxnWlWYB0WGOy56PYjrS3d Name: vg3-mirror3_mlog State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 10 Number of targets: 1 UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZcc12hKC015ulgOhPIdSBVSxqPIKAqmNT Name: vg2-mirror2 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 7 Major, minor: 253, 9 Number of targets: 1 UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRo1FEv8TbkEAtSiJbPXhpunPRmloZLxpj Name: vg1-mirror1_mimage_2 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 4 Number of targets: 1 UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8Nb6ifzgNNMjv03zJH9BX1b2hgnqbvP1n Name: vg1-mirror1_mimage_1 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 3 Number of targets: 1 UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8zrRxcvrXawT7rn885SId6bvqma8h4Rb4 Name: vg3-mirror3 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 1 Major, minor: 253, 14 Number of targets: 1 UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZOy5EfnriDVYBEO5dps4g4ySR9UEHUDfN Name: VolGroup00-LogVol01 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 1 Number of targets: 1 UUID: LVM-RsBOCkblQpLicArLlocNswlochRxjsFjyYdc3o77yCGdTIILMdW0LlXkWPGcQWen Name: vg2-mirror2_mimage_2 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 8 Number of targets: 1 UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRzz7RntrbEeSmb96NeIWeMxRNWXu3gxlB Name: VolGroup00-LogVol00 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 0 Number of targets: 1 UUID: LVM-RsBOCkblQpLicArLlocNswlochRxjsFjRMSyxR0nBvsyde1ijAWS9nf4aX3gDTYI Name: vg2-mirror2_mimage_1 State: ACTIVE Tables present: LIVE Open count: 1 Event number: 0 Major, minor: 253, 7 Number of targets: 1 UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRThc04y0AcxA1ucoQDOV5PQNDonEmiKPQ This should be fixed with the latest cmirror code (>= 9/28/07). assigned -> post This has not been seen in over a year, marking fixed. |