Bug 233034 - cmirror server failure/migration during GFS I/O causes metadata error and then GFS withdraw
cmirror server failure/migration during GFS I/O causes metadata error and the...
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror (Show other bugs)
4
All Linux
medium Severity medium
: ---
: ---
Assigned To: Jonathan Earl Brassow
Cluster QE
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2007-03-19 19:18 EDT by Corey Marthaler
Modified: 2010-01-11 21:02 EST (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-07-01 16:46:44 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)

  None (edit)
Description Corey Marthaler 2007-03-19 19:18:30 EDT
Description of problem:
Three (3-way) cmirrors, gfs on top of all, gfs I/O ruuning to all
(xdoio/xiogen). I then failed the primary legs, and they appeared to down conver
t to 2-way cmirrors properly. I then kill one of the nodes (link-07) in the
cluster (was also the cmirror server to at least one of the cmirrors) and that
caused the following problems with GFS.


GFS: fsid=LINK_128:gfs2.1: fatal: invalid metadata block
GFS: fsid=LINK_128:gfs2.1:   bh = 1086097 (magic)
GFS: fsid=LINK_128:gfs2.1:   function = foreach_descriptor
GFS: fsid=LINK_128:gfs2.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
GFS: fsid=LINK_128:gfs2.1:   time = 1174323986
GFS: fsid=LINK_128:gfs2.1: about to withdraw from the cluster
GFS: fsid=LINK_128:gfs2.1: waiting for outstanding I/O
GFS: fsid=LINK_128:gfs2.1: telling LM to withdraw
GFS: fsid=LINK_128:gfs1.1: fatal: invalid metadata block
GFS: fsid=LINK_128:gfs1.1:   bh = 1099633 (magic)
GFS: fsid=LINK_128:gfs1.1:   function = foreach_descriptor
GFS: fsid=LINK_128:gfs1.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
GFS: fsid=LINK_128:gfs1.1:   time = 1174323986
GFS: fsid=LINK_128:gfs1.1: about to withdraw from the cluster
GFS: fsid=LINK_128:gfs1.1: waiting for outstanding I/O
GFS: fsid=LINK_128:gfs1.1: telling LM to withdraw
GFS: fsid=LINK_128:gfs3.1: jid=3: Journal replayed in 4s
GFS: fsid=LINK_128:gfs3.1: jid=3: Done
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: fatal: invalid
metadata block
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   bh = 1086097 (magic)
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   function =
foreach_descriptor
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   time = 1174323986
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: about to withdraw
from the cluster
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: waiting for
outstanding I/O
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: telling LM to withdraw
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: fatal: invalid
metadata block
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   bh = 1099633 (magic)
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   function =
foreach_descriptor
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   time = 1174323986
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: about to withdraw
from the cluster
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: waiting for
outstanding I/O
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: telling LM to withdraw
Mar 19 12:06:27 link-08 kernel: GFS: fsid=LINK_128:gfs3.1: jid=3: Journal
replayed in 4s
Mar 19 12:06:27 link-08 kernel: GFS: fsid=LINK_128:gfs3.1: jid=3: Done


Version-Release number of selected component (if applicable):
2.6.9-50.ELsmp
cmirror-kernel-2.6.9-25.0
Comment 1 Jonathan Earl Brassow 2007-03-22 18:35:13 EDT
assigned -> post
Comment 5 Kiersten (Kerri) Anderson 2007-03-27 15:51:55 EDT
Code change is a cluster product not a rhel product - switching to cluster4.5
flag instead of rhel4.5 flag.
Comment 6 Jonathan Earl Brassow 2007-04-03 16:12:19 EDT
post -> modified
Comment 7 Corey Marthaler 2007-05-24 17:08:03 EDT
I appear to have hit this bug attempting the same scenario in the original
report, that is... start GFS i/o to all 3 cmirrors, fail a leg on all three
mirrors, wait for the down convert to work, and then fail a node in the cluster.
In this case I failed link-07, and the 'invalid metadata' happened on link-04.


May 24 15:37:28 link-04 kernel: dm-cmirror: I'm the cluster mirror log server
for PIKAqmNT
May 24 15:37:28 link-04 kernel: dm-cmirror: Not reading disk log because I'm
suspended (PIKAqmNT)
May 24 15:37:28 link-04 kernel: dm-cmirror: Performing flush to work around bug
235040
May 24 15:37:28 link-04 kernel: dm-cmirror: Log flush complete
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_MASTER_LEAVING(13): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 0
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 2
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 57005
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 2
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_SELECTION(11): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 57005
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 2
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_MASTER_ASSIGN(12): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 57005
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 1
May 24 15:37:48 link-04 kernel: dm-cmirror: Removing PIKAqmNT (1)
May 24 15:37:48 link-04 kernel: dm-cmirror: 0 region user structures freed
May 24 15:37:48 link-04 lvm[14010]: Monitoring mirror device vg3-mirror3 for events
May 24 15:37:48 link-04 lvm[14010]: vg3-mirror3 is now in-sync
May 24 15:44:32 link-04 kernel: CMAN: removing node link-07 from the cluster :
Missed too many hea
rtbeats
May 24 15:44:33 link-04 fenced[13193]: fencing deferred to link-02
May 24 15:45:13 link-04 kernel: dm-cmirror: A cluster mirror log member has failed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume::  PIKAqmNT (active)
May 24 15:45:13 link-04 kernel: dm-cmirror:   Live nodes        :: 3
May 24 15:45:13 link-04 kernel: dm-cmirror:   In-Use Regions    :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Good IUR's        :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Sync count        :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   Disk Region count :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Region count      :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   NOTE:  Mapping has changed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume::  mloZLxpj (active)
May 24 15:45:13 link-04 kernel: dm-cmirror:   Live nodes        :: 3
May 24 15:45:13 link-04 kernel: dm-cmirror:   In-Use Regions    :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Good IUR's        :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Sync count        :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   Disk Region count :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Region count      :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   NOTE:  Mapping has changed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume::  vbQ8ivnj (active)
May 24 15:45:13 link-04 kernel: dm-cmirror:   Live nodes        :: 3
May 24 15:45:13 link-04 kernel: dm-cmirror:   In-Use Regions    :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Good IUR's        :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Sync count        :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   Disk Region count :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Region count      :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   NOTE:  Mapping has changed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Trying to acquire
journal lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Trying to acquire
journal lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Trying to acquire
journal lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Looking at journal...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Looking at journal...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Looking at journal...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Acquiring the
transaction lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Acquiring the
transaction lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Acquiring the
transaction lock...
May 24 15:45:14 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Replaying journal...
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Replayed 7 of 15
blocks
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: replays = 7,
skips = 7, sames = 1
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Replaying journal...
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Replaying journal...
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Journal replayed
in 2s
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Done
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: fatal: invalid metadata
block
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   bh = 637377 (magic)
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   function =
foreach_descriptor
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/recovery.c, line = 400
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   time = 1180039516
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: about to withdraw from
the cluster
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: waiting for outstanding I/O
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: telling LM to withdraw
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: fatal: invalid metadata
block
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   bh = 635057 (magic)
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   function =
foreach_descriptor
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/recovery.c, line = 400
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   time = 1180039516
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: about to withdraw from
the cluster
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: waiting for outstanding I/O
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: telling LM to withdraw
May 24 16:00:33 link-04 sshd(pam_unix)[17118]: session opened for user root by
root(uid=0)


[root@link-04 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   X   link-07
   2    1    4   M   link-02
   3    1    4   M   link-04
   4    1    4   M   link-08
[root@link-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 3 4]

DLM Lock Space:  "clvmd"                            10   5 run       -
[3 2 4]

DLM Lock Space:  "clustered_log"                    11   6 run       -
[2 3 4]

DLM Lock Space:  "1"                                13   7 run       S-10,200,0
[2 3 4]

DLM Lock Space:  "2"                                17   9 run       -
[2 3 4]

DLM Lock Space:  "3"                                21  11 run       -
[2 3 4]

GFS Mount Group: "1"                                15   8 recover 2 -
[2 3 4]

GFS Mount Group: "2"                                19  10 recover 2 -
[2 3 4]

GFS Mount Group: "3"                                23  12 run       -
[2 3 4]

[root@link-04 ~]# dmsetup ls
vg1-mirror1     (253, 5)
vg3-mirror3_mimage_2    (253, 13)
vg3-mirror3_mimage_1    (253, 12)
vg3-mirror3_mlog        (253, 10)
vg2-mirror2     (253, 9)
vg1-mirror1_mimage_2    (253, 4)
vg1-mirror1_mimage_1    (253, 3)
vg3-mirror3     (253, 14)
VolGroup00-LogVol01     (253, 1)
vg2-mirror2_mimage_2    (253, 8)
VolGroup00-LogVol00     (253, 0)
vg2-mirror2_mimage_1    (253, 7)
[root@link-04 ~]# dmsetup info
Name:              vg1-mirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8obumXSC6B3CduV0RBrFdhrGJvbQ8ivnj

Name:              vg3-mirror3_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZpaLs9n0BnDV4VslNvPVcRL3d45zqssuA

Name:              vg3-mirror3_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 12
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZMRSO8BF06jpxnWlWYB0WGOy56PYjrS3d

Name:              vg3-mirror3_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZcc12hKC015ulgOhPIdSBVSxqPIKAqmNT

Name:              vg2-mirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      7
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRo1FEv8TbkEAtSiJbPXhpunPRmloZLxpj

Name:              vg1-mirror1_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8Nb6ifzgNNMjv03zJH9BX1b2hgnqbvP1n

Name:              vg1-mirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8zrRxcvrXawT7rn885SId6bvqma8h4Rb4

Name:              vg3-mirror3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 14
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZOy5EfnriDVYBEO5dps4g4ySR9UEHUDfN

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-RsBOCkblQpLicArLlocNswlochRxjsFjyYdc3o77yCGdTIILMdW0LlXkWPGcQWen

Name:              vg2-mirror2_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRzz7RntrbEeSmb96NeIWeMxRNWXu3gxlB

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-RsBOCkblQpLicArLlocNswlochRxjsFjRMSyxR0nBvsyde1ijAWS9nf4aX3gDTYI

Name:              vg2-mirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRThc04y0AcxA1ucoQDOV5PQNDonEmiKPQ
                                   



Comment 8 Jonathan Earl Brassow 2007-10-01 11:53:28 EDT
This should be fixed with the latest cmirror code (>= 9/28/07).

assigned -> post
Comment 9 Corey Marthaler 2008-07-01 16:46:44 EDT
This has not been seen in over a year, marking fixed.

Note You need to log in before you can comment on or make changes to this bug.