Bug 210367 - Stalled unmount of GFS filesystem
Status: CLOSED DUPLICATE of bug 208836
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Ryan O'Hara
QA Contact: GFS Bugs
Reported: 2006-10-11 14:06 EDT by Lenny Maiorani
Modified: 2010-01-11 22:13 EST (History)
Doc Type: Bug Fix
Last Closed: 2006-11-14 12:09:05 EST

Attachments
/var/log/messages (68.80 KB, application/octet-stream)
2006-11-13 16:38 EST, Lenny Maiorani

Description Lenny Maiorani 2006-10-11 14:06:57 EDT
Description of problem:

From /var/log/messages
Sep  6 19:10:05 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: Unmount seems
to be stalled. Dumping lock state...
Sep  6 19:10:05 flsrv02 kernel: Glock (2, 995)
Sep  6 19:10:05 flsrv02 kernel:   gl_flags = 
Sep  6 19:10:05 flsrv02 kernel:   gl_count = 2
Sep  6 19:10:05 flsrv02 kernel:   gl_state = 0
Sep  6 19:10:05 flsrv02 kernel:   req_gh = no
Sep  6 19:10:05 flsrv02 kernel:   req_bh = no
Sep  6 19:10:05 flsrv02 kernel:   lvb_count = 0
Sep  6 19:10:05 flsrv02 kernel:   object = yes
Sep  6 19:10:05 flsrv02 kernel:   new_le = no
Sep  6 19:10:05 flsrv02 kernel:   incore_le = no
Sep  6 19:10:05 flsrv02 kernel:   reclaim = no
Sep  6 19:10:05 flsrv02 kernel:   aspace = 0
Sep  6 19:10:05 flsrv02 kernel:   ail_bufs = no
Sep  6 19:10:05 flsrv02 kernel:   Inode:
Sep  6 19:10:05 flsrv02 kernel:     num = 995/995
Sep  6 19:10:05 flsrv02 kernel:     type = 2
Sep  6 19:10:05 flsrv02 kernel:     i_count = 1
Sep  6 19:10:05 flsrv02 kernel:     i_flags = 
Sep  6 19:10:05 flsrv02 kernel:     vnode = yes
Sep  6 19:10:05 flsrv02 kernel: Glock (5, 995)
Sep  6 19:10:05 flsrv02 kernel:   gl_flags = 
Sep  6 19:10:05 flsrv02 kernel:   gl_count = 2
Sep  6 19:10:05 flsrv02 kernel:   gl_state = 3
Sep  6 19:10:05 flsrv02 kernel:   req_gh = no
Sep  6 19:10:05 flsrv02 kernel:   req_bh = no
Sep  6 19:10:05 flsrv02 kernel:   lvb_count = 0
Sep  6 19:10:05 flsrv02 kernel:   object = yes
Sep  6 19:10:05 flsrv02 kernel:   new_le = no
Sep  6 19:10:05 flsrv02 kernel:   incore_le = no
Sep  6 19:10:05 flsrv02 kernel:   reclaim = no
Sep  6 19:10:05 flsrv02 kernel:   aspace = no
Sep  6 19:10:05 flsrv02 kernel:   ail_bufs = no
Sep  6 19:10:05 flsrv02 kernel:   Holder
Sep  6 19:10:05 flsrv02 kernel:     owner = -1
Sep  6 19:10:05 flsrv02 kernel:     gh_state = 3
Sep  6 19:10:05 flsrv02 kernel:     gh_flags = 5 7 
Sep  6 19:10:06 flsrv02 kernel:     error = 0
Sep  6 19:10:06 flsrv02 kernel:     gh_iflags = 1 6 7 

This was happening over and over again before the node was rebooted. Also,
there was high load even though all but one of the services had been relocated
to other nodes by rgmanager so the node should have been idle.
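
For comparing repeated dumps like the one above, the lock state can be pulled into a structure. The following is a minimal sketch, assuming each syslog line has the shape "<timestamp> <host> kernel: <payload>" as in the excerpt; the function name parse_glock_dump is hypothetical, and the nested Inode and Holder fields are simply flattened into the enclosing glock.

```python
import re

# Assumed line shape, based only on the excerpt in this report:
#   "<timestamp> <host> kernel: <payload>"
KERNEL_RE = re.compile(r"kernel: (.*)$")
GLOCK_RE = re.compile(r"^Glock \((\d+), (\d+)\)$")
FIELD_RE = re.compile(r"^\s+(\w+) = ?(.*)$")


def parse_glock_dump(lines):
    """Group 'key = value' lines under the Glock header that precedes them."""
    glocks = []
    current = None
    for line in lines:
        m = KERNEL_RE.search(line)
        if not m:
            continue  # wrapped continuation lines carry no "kernel:" prefix
        payload = m.group(1)
        header = GLOCK_RE.match(payload)
        if header:
            current = {
                "type": int(header.group(1)),
                "number": int(header.group(2)),
                "fields": {},
            }
            glocks.append(current)
            continue
        field = FIELD_RE.match(payload)
        if field and current is not None:
            # Inode/Holder sub-fields are flattened; the first value seen wins.
            current["fields"].setdefault(field.group(1), field.group(2).strip())
    return glocks
```

Values are kept as strings because the report does not say how fields such as gl_flags or gh_iflags should be interpreted.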

Version-Release number of selected component (if applicable):
kernel 2.6.9-34
GFS 6.1.5-0
RHEL4 Update3

How reproducible:
unknown

Steps to Reproduce:
1. unknown
2.
3.
  
Actual results:
filesystem is inaccessible on this node

Expected results:
unmount should occur smoothly

Additional info:
Comment 1 Ryan O'Hara 2006-10-12 12:02:19 EDT
Did the node in question withdraw from the filesystem prior to this?

Comment 2 Lenny Maiorani 2006-10-12 15:27:17 EDT
We have seen this happen twice, and in fact both times there was a filesystem
withdrawal. I had not noticed this before as it was a few hours earlier in the
logs. Here are the logs that go along with the logs previously posted in this bug:

Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: fatal: invalid
metadata block
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   bh =
1206661312 (magic)
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   function =
gfs_rgrp_read
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   file =
fs/gfs/rgrp.c, line = 830
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   time = 1157582806
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: about to
withdraw from the cluster
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: waiting for
outstanding I/O
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: telling LM to
withdraw
Sep  6 16:46:49 flsrv02 kernel: lock_dlm: withdraw abandoned memory
Sep  6 16:46:49 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: withdrawn


Let me know if you wish to see the logs from this more recent occurrence. They
look nearly the same.
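
The "time = 1157582806" field in the withdraw message can be lined up with the syslog timestamps. This is a minimal sketch, assuming the value is seconds since the Unix epoch (the report does not state this); epoch_to_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Interpret a value as Unix epoch seconds (an assumption) and return UTC time."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Value copied from the "time = 1157582806" line in the withdraw message.
print(epoch_to_utc(1157582806).isoformat())  # 2006-09-06T22:46:46+00:00
```

The minutes and seconds agree with the 16:46:46 syslog stamp. The six-hour difference would be consistent with syslog recording local time on a node set to UTC-6, but that is an inference rather than something stated in the logs.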
Comment 3 Ryan O'Hara 2006-10-13 11:42:54 EDT
First, do you have any idea what caused the 'invalid metadata block'? That
obviously caused the withdraw, which may be a bug in itself. Moving on...

Second, as GFS currently exists, we do not guarantee that you will be able to
umount a filesystem that has withdrawn. This means that a reboot is required
(which is required anyway to have the node rejoin the filesystem). However, it
is our intention that a node that withdraws from the filesystem should be able
to be restarted by hand. The reason is that an administrator may want to shut
down some services by hand, debug the system, etc., none of which would be
possible if the node was just fenced. The catch here is that you wouldn't be
able to reboot the machine because during shutdown the node would try to
unmount the filesystem. It is for this reason that I am going to leave this bug
open. So, in short, it is a known issue that a umount may not work for a
filesystem from which the node has withdrawn.

Third, the rgmanager services should migrate from the node in the case that they
were using a filesystem from which the node withdrew. If the node had some
services that were using a different filesystem on the same node, those would
continue to run. Was this the case for you? Did you have multiple services using
multiple filesystems?
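
Since a stalled umount is expected after a withdraw, it may help to confirm from the logs that the withdraw came first. A minimal sketch, assuming the message texts match the excerpts in this bug ("withdrawn" and "Unmount seems ...") and that the lines are in chronological order; withdrawn_before_stall is a hypothetical name.

```python
import re

# Captures the fsid and the message text from lines such as
#   "... kernel: GFS: fsid=flsrv:CRSim_PSmith.1: withdrawn"
FSID_RE = re.compile(r"GFS: fsid=(\S+?): (.*)$")


def withdrawn_before_stall(lines):
    """Return fsids whose stalled-unmount message follows a 'withdrawn' message."""
    withdrawn = set()
    flagged = []
    for line in lines:
        m = FSID_RE.search(line)
        if not m:
            continue
        fsid, message = m.group(1), m.group(2).strip()
        if message == "withdrawn":
            withdrawn.add(fsid)
        elif message.startswith("Unmount seems") and fsid in withdrawn:
            if fsid not in flagged:
                flagged.append(fsid)
    return flagged
```

An fsid that reports a stalled unmount without an earlier withdraw would not be returned, which would point to a different cause than the known issue described above.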

Comment 4 Lenny Maiorani 2006-10-13 16:55:41 EDT
It is unknown what caused this "invalid metadata block". Actually, this has
happened 4 times in recent memory. Do you know of any causes? We have been using
gfs_fsck to repair.

I understand why the node wasn't fenced. That is good because it gave us the
ability to get in and grab logs, etc.

As for the services moving, the services were using other filesystems, so I
guess that is why they were not moved.

However, most interesting to me is how this "invalid metadata block" was achieved...
Comment 5 Ryan O'Hara 2006-11-13 16:08:10 EST
Has this continued to be a problem?

What type of hardware are you using? Do you have more log files you can provide?
Specifically I am interested in seeing if there are any SCSI errors prior to the
"invalid metadata block" message.
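
One way to check for this is to collect suspicious lines that precede the first "invalid metadata block" message. A minimal sketch follows; the function name and the keyword strings in the usage note are hypothetical, since the exact wording of SCSI error messages on this kernel is not shown anywhere in this bug.

```python
def lines_before_marker(lines, marker, keywords):
    """Collect lines mentioning any keyword that appear before the first marker line."""
    hits = []
    for line in lines:
        if marker in line:
            return hits
        lowered = line.lower()
        if any(keyword.lower() in lowered for keyword in keywords):
            hits.append(line)
    return []  # marker never seen, so nothing counts as "prior" to it
```

For example, lines_before_marker(log_lines, "invalid metadata block", ["scsi error", "i/o error"]) could be run on the unwrapped /var/log/messages. The excerpts pasted in this bug are wrapped mid-message, so the marker would not match them as they appear here.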

Comment 6 Ryan O'Hara 2006-11-13 16:11:32 EST
Also.. how large is the filesystem?

Comment 8 Lenny Maiorani 2006-11-13 16:37:17 EST
The filesystem is approx 1 TB. 

We have not seen this problem since running a patched version of gfs_fsck with
the patch from bz #208836 to fix invalid metadata blocks. 

Using x86_64. No, there are no earlier SCSI errors. I will attach the
/var/log/messages file.
Comment 9 Lenny Maiorani 2006-11-13 16:38:12 EST
Created attachment 141106
/var/log/messages

/var/log/messages file
Comment 10 Ryan O'Hara 2006-11-13 17:17:05 EST
Glad to hear that the fix referred to above fixed the corrupt RGs. Also, I
initially read the problem as something you had seen occur several times -- as
in you had a filesystem that became corrupt (invalid metadata) several (4)
times. Now I think I understand it differently -- is it more correct to say that
it became corrupt once and you saw the symptoms of this 4 times? That makes
more sense. Sorry about that.

So, the "invalid metadata blocks" are fixed, but we still do not know what
initially caused the corruption. The BZ you referred to above (#208836) mentions
that, in that case, the corruption may have been caused by a failed device.
Could there be a similar cause for your case? I'm not sure how we can determine
the cause of the corruption at this point.
Comment 11 Lenny Maiorani 2006-11-13 18:03:29 EST
I am also not sure about how to determine the cause of corruption. It may be a
similar case however.
Comment 12 Ryan O'Hara 2006-11-14 12:09:05 EST
I am going to close this bug. If you hit a corruption problem again, please let
me know. If that happens the logs should help diagnose the cause of the invalid
metadata.

*** This bug has been marked as a duplicate of 208836 ***

