Bug 169693 - GFS corruption
Status: CLOSED DUPLICATE of bug 164331
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Ben Marzinski
QA Contact: GFS Bugs
Reported: 2005-10-01 07:58 EDT by Axel Thimm
Modified: 2010-01-11 22:07 EST
CC: 2 users

Doc Type: Bug Fix
Last Closed: 2006-01-04 14:05:25 EST

Attachments: None

Description Axel Thimm 2005-10-01 07:58:38 EDT
Description of problem:
We made an error in having the cluster intracommunication network on the
clients' LAN side. Some heavy-bandwidth applications increased heartbeat
latencies and packet drops, so the cluster members decided to tell each other
to leave the cluster (it did get as far as fencing the other nodes).

The above is a user error, and we learned that we should really have a
dedicated cluster network for cman and dlm to operate over. That's not what
this bug is about, though; this situation only triggered some GFS corruption.
For the sake of this bug, let us just assume that this dedicated intracluster
network existed, but was dropping packets or increasing latencies due to some
switch issue.
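
For reproducing this on a test cluster, such a degraded intracluster link can
be emulated with tc/netem instead of waiting for a flaky switch. A minimal
sketch (the interface name eth1 and the exact delay/loss figures are
assumptions, not values from our setup):

  # Inject latency and packet loss on the interface carrying cman/dlm traffic
  tc qdisc add dev eth1 root netem delay 500ms loss 10%

  # Remove the impairment again
  tc qdisc del dev eth1 root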

The above happened twice. The first time, cluster nodes were told to leave the
cluster until quorum was lost and "activity blocked". The second time, 18h
later, a cluster node even panicked, but unfortunately nobody on duty at the
time wrote the panic message down.

After rebooting the cluster, everything seemed to be running normally again
(according to the logs), but we found that two cluster members would see a
different view of the filesystem than the third: files modified on nodes 1
and 2 would not show up on node 3, and vice versa.
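
This kind of desync is easy to check for by hand. A trivial sketch, using the
node names from the logs and a hypothetical mount point /mnt/gfs:

  # On zs01: drop a marker file onto the shared GFS mount
  touch /mnt/gfs/marker-from-zs01

  # On zs03: with a coherent cluster view this should list the file immediately
  ssh zs03 ls -l /mnt/gfs/marker-from-zs01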

I also found out that "quorum lost, blocking activity" doesn't actually block
the GFS filesystem. The nodes can still access it and write to it. That's
probably the real bug.
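
That claim can be verified directly by attempting a write on a node right
after it logs "quorum lost, blocking activity". A minimal sketch, again
assuming the hypothetical mount point /mnt/gfs:

  # Run on the inquorate node; if this succeeds, GFS I/O is clearly not blocked
  date > /mnt/gfs/quorum-test.$(hostname) && echo "write went through despite lost quorum"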

Again, this setup was bad and is not the recommended and supported setup for
running cman and friends. But the situation can be redefined as having a
proper intracluster network which turns bad and delays/drops packets. In that
case the failover and recovery mechanisms of the cluster infrastructure and
GFS shouldn't fail in that way.

First cluster crash:
XSep 29 12:06:42 zs01 kernel: CMAN: removing node zs03 from the cluster : Missed
too many heartbeats
 Sep 29 12:06:43 zs03 kernel: CMAN: Being told to leave the cluster by node 1
(P:kernel)
 Sep 29 12:06:43 zs02 kernel: CMAN: node zs03 has been removed from the cluster
: Missed too many heartbeats (P:kernel)
 Sep 29 12:06:43 zs03 kernel: CMAN: we are leaving the cluster.  (P:kernel)
XSep 29 12:06:48 zs01 kernel: CMAN: removing node zs02 from the cluster : No
response to messages
 Sep 29 12:06:48 zs02 kernel: CMAN: bad generation number 27 in HELLO message
from 1, expected 26 (P:kernel)
 Sep 29 12:06:54 zs02 kernel: CMAN: removing node zs01 from the cluster : No
response to messages (P:kernel)
XSep 29 12:06:54 zs01 kernel: CMAN: quorum lost, blocking activity
 Sep 29 12:07:00 zs02 kernel: CMAN: quorum lost, blocking activity (P:kernel)
 Sep 29 12:07:00 zs02 kernel: CMAN: node zs02 has been removed from the cluster
: No response to messages (P:kernel)
 Sep 29 12:07:00 zs02 kernel: CMAN: killed by NODEDOWN message (P:kernel)
 Sep 29 12:07:01 zs02 kernel: CMAN: we are leaving the cluster. No response to
messages (P:kernel)

Second cluster crash:
 Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
 Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
 Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
 Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
 Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
 Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed
too many heartbeats (P:kernel)
 Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No
response to messages (P:kernel)
 Sep 30 05:08:45 zs03 kernel: CMAN: quorum lost, blocking activity (P:kernel)


Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-11.EL
GFS-6.1.0-0
GFS-kernel-2.6.9-35.5
cman-1.0.0-0
cman-kernel-2.6.9-36.0

How reproducible:
Still to be determined.

Steps to Reproduce:
1. Set up GFS on N cluster members
2. Flood the cluster network with packets (see the sketch below)
3.
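
One rough sketch of such a flood (the address 10.0.0.2 and the use of iperf
are assumptions; any bulk traffic on the heartbeat network should do):

  # Saturate the link that the heartbeat shares with client traffic
  # (assumes 10.0.0.2 is another cluster member; 'iperf -s' must run there)
  ping -f 10.0.0.2 &
  iperf -c 10.0.0.2 -t 600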
  
Actual results:
Cluster nodes tell each other to leave the cluster (no fencing!), quorum is
lost, and activity is blocked. But panics also happen, and GFS gets desynced
between the members.

Expected results:
Cluster nodes may tell each other to leave, fence each other, lose quorum,
etc., but no panics and no filesystem corruption should happen.

Also, later, when the cluster was rebooted for the second time, there was no
indication in the logs that the cluster nodes were desyncing their views of
the GFS filesystem.

Additional info:
After seeing the filesystem desync, a gfs_fsck run revealed some issues with
the filesystem. The first run had 75000 lines of output. gfs_fsck could fix
almost all of them, but there is a stable set of bitmap diffs that gfs_fsck
claims to be able to fix, but doesn't. See bug #169687.
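
The stability of that set can be demonstrated by running gfs_fsck twice in a
row and diffing the output. A minimal sketch (the device path is hypothetical,
and the filesystem must be unmounted on all nodes first):

  # Two consecutive full repair passes; anything still reported in the
  # second pass is a problem gfs_fsck claims to fix but doesn't
  gfs_fsck -y /dev/vg0/gfs-data > pass1.log 2>&1
  gfs_fsck -y /dev/vg0/gfs-data > pass2.log 2>&1
  diff pass1.log pass2.log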
Comment 1 Axel Thimm 2005-10-02 03:00:53 EDT
I lied about not having logs of the filesystem desync. I even had yet another
such case, which doesn't have the history of broken cluster intracommunication:

Sep 30 10:28:18 zs02 fenced: fence_tool leave succeeded (P:fenced)
Sep 30 10:28:21 zs02 cman: failed to stop cman failed (P:cman)
Sep 30 10:28:21 zs02 ccsd[2430]: Stopping ccsd, SIGTERM received.  (P:ccsd)
Sep 30 10:28:23 zs02 ccsd:  succeeded (P:ccsd)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2: fatal: filesystem
consistency error (P:kernel)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2:   function =
trans_go_xmote_bh (P:kernel)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2:   file =
/usr/src/build/574065-x86_64/BUILD/smp/src/gfs/glops.c, line = 542 (P:kernel)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2:   time = 1128068903 (P:kernel)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2: about to withdraw from the
cluster (P:kernel)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2: waiting for outstanding
I/O (P:kernel)
Sep 30 10:28:24 zs01 kernel: GFS: fsid=physik:data.2: telling LM to withdraw
(P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: Trying to acquire
journal lock... (P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: Looking at
journal... (P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: Acquiring the
transaction lock... (P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: Replaying
journal... (P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: Replayed 0 of 0
blocks (P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: replays = 0, skips
= 0, sames = 0 (P:kernel)
Sep 30 10:28:28 zs03 kernel: GFS: fsid=physik:data.0: jid=2: Journal replayed in
1s (P:kernel)

Note that cman would not stop during shutdown. And the new incident is:

Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1: fatal: filesystem
consistency error (P:kernel)
Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1:   function =
trans_go_xmote_bh (P:kernel)
Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1:   file =
/usr/src/build/574065-x86_64/BUILD/smp/src/gfs/glops.c, line = 542 (P:kernel)
Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1:   time = 1128234542 (P:kernel)
Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1: about to withdraw from the
cluster (P:kernel)
Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1: waiting for outstanding
I/O (P:kernel)
Oct  2 08:29:02 zs03 kernel: GFS: fsid=physik:data.1: telling LM to withdraw
(P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Trying to acquire
journal lock... (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Looking at
journal... (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Acquiring the
transaction lock... (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Replaying
journal... (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Replayed 0 of 0
blocks (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: replays = 0, skips
= 0, sames = 0 (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Journal replayed in
1s (P:kernel)
Oct  2 08:29:35 zs03 kernel: lock_dlm: withdraw abandoned memory (P:kernel)
Oct  2 08:29:35 zs11 kernel: GFS: fsid=physik:data.0: jid=1: Done (P:kernel)
Oct  2 08:29:35 zs03 kernel: GFS: fsid=physik:data.1: withdrawn (P:kernel)
Oct  2 08:33:39 zs11 rgmanager: [31622]: <notice> Shutting down Cluster Service
Manager...  (P:rgmanager)
Oct  2 08:33:42 zs11 clurgmgrd[15476]: <notice> Shutting down  (P:clurgmgrd)
Oct  2 08:33:43 zs11 clurgmgrd[15476]: <notice> Shutdown complete, exiting 
(P:clurgmgrd)
Oct  2 08:33:44 zs11 rgmanager: [31622]: <notice> Cluster Service Manager is
stopped.  (P:rgmanager)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0: fatal: filesystem
consistency error (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0:   function =
trans_go_xmote_bh (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0:   file =
/usr/src/build/574065-x86_64/BUILD/smp/src/gfs/glops.c, line = 542 (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0:   time = 1128234827 (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0: about to withdraw from the
cluster (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0: waiting for outstanding
I/O (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0: telling LM to withdraw
(P:kernel)
Oct  2 08:33:48 zs11 kernel: lock_dlm: withdraw abandoned memory (P:kernel)
Oct  2 08:33:48 zs11 kernel: GFS: fsid=physik:data.0: withdrawn (P:kernel)
Oct  2 08:33:50 zs02 rgmanager: [31078]: <notice> Cluster Service Manager is
stopped.  (P:rgmanager)
Oct  2 08:33:50 zs02 clvmd: clvmd shutdown failed (P:clvmd)
Oct  2 08:33:50 zs02 fenced: fence_tool leave failed (P:fenced)
Oct  2 08:33:50 zs02 cman:  succeeded (P:cman)
Oct  2 08:33:50 zs02 ccsd:  succeeded (P:ccsd)
Oct  2 08:39:35 zs03 kernel: GFS: fsid=physik:data.1: Unmount seems to be
stalled. Dumping lock state... (P:kernel)
Oct  2 08:39:35 zs03 kernel: Glock (5, 36426917) (P:kernel)
Oct  2 08:39:35 zs03 kernel:   gl_flags = 1  (P:kernel)
Oct  2 08:39:35 zs03 kernel:   gl_count = 3 (P:kernel)
Oct  2 08:39:35 zs03 kernel:   gl_state = 3 (P:kernel)
Oct  2 08:39:35 zs03 kernel:   req_gh = yes (P:kernel)
Oct  2 08:39:35 zs03 kernel:   req_bh = yes (P:kernel)
Oct  2 08:39:35 zs03 kernel:   lvb_count = 0 (P:kernel)
Oct  2 08:39:35 zs03 kernel:   object = no (P:kernel)
Oct  2 08:39:35 zs03 kernel:   new_le = no (P:kernel)
Oct  2 08:39:35 zs03 kernel:   incore_le = no (P:kernel)
Oct  2 08:39:35 zs03 kernel:   reclaim = no (P:kernel)
Oct  2 08:39:35 zs03 kernel:   aspace = no (P:kernel)
Oct  2 08:39:35 zs03 kernel:   ail_bufs = no (P:kernel)
Oct  2 08:39:35 zs03 kernel:   Request (P:kernel)
Oct  2 08:39:35 zs03 kernel:     owner = -1 (P:kernel)
Oct  2 08:39:35 zs03 kernel:     gh_state = 0 (P:kernel)
Comment 2 Axel Thimm 2005-10-03 07:37:35 EDT
(In reply to comment #0)
> (it did get as far as fencing the other nodes).

Sorry, that was a typo:

It did _NOT_ get as far as fencing the other nodes!

Also Patrick Caulfield pointed me to bug #165160, which explains the kernel
panic when a node accessing GFS over dlm is told to leave the cluster, so that's
understood and probably not fixable in the near future.

What remains to be understood is why the filesystem desynced and developed
inconsistencies. The triggering event was the malfunctioning internode
communication. That led to cman removing nodes and sometimes panicking. Either
this already created the inconsistencies, or the loaded network directly
influenced GFS/dlm (but there are no logs indicating this, other than the
inconsistency detected upon unmounting the fs).

In both cases GFS should not desync. We demoted the cluster back to
experimental status, so I can run any tests you ask for to try to reproduce
this.
Comment 3 Corey Marthaler 2005-10-20 16:33:58 EDT
Just a note that bug 164331 is probably this same bug; I'll let Ben do the
actual dup, though, if he agrees.

FYI: QA has reproduced this issue.
Comment 4 Axel Thimm 2005-10-20 16:53:17 EDT
Perhaps the part with "fatal: filesystem consistency error" is the same as bug
#164331, but the way the cluster got partially shut down is different here,
e.g. CMAN losing its heartbeat and dlm's einval messages.
Comment 5 Ben Marzinski 2006-01-04 14:05:25 EST

*** This bug has been marked as a duplicate of 164331 ***
