Description of problem:

From /var/log/messages:

Sep  6 19:10:05 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: Unmount seems to be stalled. Dumping lock state...
Sep  6 19:10:05 flsrv02 kernel: Glock (2, 995)
Sep  6 19:10:05 flsrv02 kernel:   gl_flags =
Sep  6 19:10:05 flsrv02 kernel:   gl_count = 2
Sep  6 19:10:05 flsrv02 kernel:   gl_state = 0
Sep  6 19:10:05 flsrv02 kernel:   req_gh = no
Sep  6 19:10:05 flsrv02 kernel:   req_bh = no
Sep  6 19:10:05 flsrv02 kernel:   lvb_count = 0
Sep  6 19:10:05 flsrv02 kernel:   object = yes
Sep  6 19:10:05 flsrv02 kernel:   new_le = no
Sep  6 19:10:05 flsrv02 kernel:   incore_le = no
Sep  6 19:10:05 flsrv02 kernel:   reclaim = no
Sep  6 19:10:05 flsrv02 kernel:   aspace = 0
Sep  6 19:10:05 flsrv02 kernel:   ail_bufs = no
Sep  6 19:10:05 flsrv02 kernel:   Inode:
Sep  6 19:10:05 flsrv02 kernel:     num = 995/995
Sep  6 19:10:05 flsrv02 kernel:     type = 2
Sep  6 19:10:05 flsrv02 kernel:     i_count = 1
Sep  6 19:10:05 flsrv02 kernel:     i_flags =
Sep  6 19:10:05 flsrv02 kernel:     vnode = yes
Sep  6 19:10:05 flsrv02 kernel: Glock (5, 995)
Sep  6 19:10:05 flsrv02 kernel:   gl_flags =
Sep  6 19:10:05 flsrv02 kernel:   gl_count = 2
Sep  6 19:10:05 flsrv02 kernel:   gl_state = 3
Sep  6 19:10:05 flsrv02 kernel:   req_gh = no
Sep  6 19:10:05 flsrv02 kernel:   req_bh = no
Sep  6 19:10:05 flsrv02 kernel:   lvb_count = 0
Sep  6 19:10:05 flsrv02 kernel:   object = yes
Sep  6 19:10:05 flsrv02 kernel:   new_le = no
Sep  6 19:10:05 flsrv02 kernel:   incore_le = no
Sep  6 19:10:05 flsrv02 kernel:   reclaim = no
Sep  6 19:10:05 flsrv02 kernel:   aspace = no
Sep  6 19:10:05 flsrv02 kernel:   ail_bufs = no
Sep  6 19:10:05 flsrv02 kernel:   Holder
Sep  6 19:10:05 flsrv02 kernel:     owner = -1
Sep  6 19:10:05 flsrv02 kernel:     gh_state = 3
Sep  6 19:10:05 flsrv02 kernel:     gh_flags = 5 7
Sep  6 19:10:06 flsrv02 kernel:     error = 0
Sep  6 19:10:06 flsrv02 kernel:     gh_iflags = 1 6 7

This was happening over and over again before the node was rebooted.
Also, there was high load even though all but one of the services had been relocated to other nodes by rgmanager, so the node should have been idle.

Version-Release number of selected component (if applicable):
kernel 2.6.9-34
GFS 6.1.5-0
RHEL4 Update 3

How reproducible:
unknown

Steps to Reproduce:
1. unknown

Actual results:
The filesystem is inaccessible on this node.

Expected results:
The unmount should occur smoothly.

Additional info:
Did the node in question withdraw from the filesystem prior to this?
We have seen this happen twice, and in fact both times there was a filesystem withdrawal. I had not noticed this before, as it was a few hours earlier in the logs. Here are the log entries which go along with the logs previously posted in this bug:

Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: fatal: invalid metadata block
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   bh = 1206661312 (magic)
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   function = gfs_rgrp_read
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   file = fs/gfs/rgrp.c, line = 830
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1:   time = 1157582806
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: about to withdraw from the cluster
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: waiting for outstanding I/O
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: telling LM to withdraw
Sep  6 16:46:49 flsrv02 kernel: lock_dlm: withdraw abandoned memory
Sep  6 16:46:49 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: withdrawn

Let me know if you wish to see the logs from this more recent occurrence. They look nearly the same.
First, do you have any idea what caused the 'invalid metadata block'? That obviously caused the withdraw, which may be a bug in itself. Moving on...

Second, the way that GFS currently exists, we do not guarantee that you will be able to umount a filesystem that has withdrawn. This means that a reboot is required (which is required anyway to have the node rejoin the filesystem). However, it is our intention that a node that withdraws from the filesystem should be able to be restarted by hand. The reason is that an administrator may want to shut down some services by hand, debug the system, etc., none of which would be possible if the node were simply fenced. The catch here is that you wouldn't be able to reboot the machine, because during shutdown the node would try to unmount the filesystem. It is for this reason that I am going to leave this bug open. So, in short, it is a known issue that a umount may not work for a filesystem from which the node withdrew.

Third, the rgmanager services should migrate off the node in the case that they were using a filesystem from which the node withdrew. If the node had some services that were using a different filesystem on the same node, those would continue to run. Was this the case for you? Did you have multiple services using multiple filesystems?
It is unknown what caused this "invalid metadata block". In fact, this has happened 4 times in recent memory. Do you know of any causes? We have been using gfs_fsck to repair. I understand why the node wasn't fenced; that is good because it gave us the ability to get in and grab logs, etc. As for the services moving, the services were using other filesystems, so I guess that is why they were not moved. Most interesting to me, however, is how this "invalid metadata block" came about...
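For reference, the repair procedure mentioned above can be sketched as a dry run. The device path and mount point below are hypothetical stand-ins (they are not taken from this cluster's configuration); the sketch only records the steps instead of executing them, since gfs_fsck must never run on a mounted filesystem:

```shell
# Hypothetical sketch of the gfs_fsck repair cycle; DEV and MNT are
# made-up names, and plan() records each step rather than running it.
DEV=/dev/cluster_vg/crsim_lv
MNT=/mnt/CRSim_PSmith

steps=""
plan() { steps="${steps}+ $*"$'\n'; }   # dry-run recorder

plan umount "$MNT"              # unmount on EVERY node in the cluster first
plan gfs_fsck -y "$DEV"         # -y answers yes to all proposed repairs
plan mount -t gfs "$DEV" "$MNT" # remount once the check completes cleanly

printf '%s' "$steps"
```

Running the sketch just prints the three planned commands; in practice each would be executed by hand, with the unmount verified on all cluster nodes before the check starts.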
Has this continued to be a problem? What type of hardware are you using? Do you have more log files you can provide? Specifically I am interested in seeing if there are any SCSI errors prior to the "invalid metadata block" message.
Also.. how large is the filesystem?
The filesystem is approximately 1 TB. We have not seen this problem since running a patched version of gfs_fsck with the patch from bz #208836 to fix the invalid metadata blocks. We are using x86_64. No, there are no earlier SCSI errors. I will attach the /var/log/messages file.
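The "no earlier SCSI errors" check can be done mechanically by scanning the messages file for storage-layer errors logged before the GFS fatal message. The sketch below runs against a synthetic sample file (the file name and log lines are illustrative stand-ins, not content from the attached log):

```shell
# Build a synthetic /var/log/messages sample for illustration only.
cat > sample-messages <<'EOF'
Sep  6 16:40:01 flsrv02 kernel: SCSI error : <2 0 0 1> return code = 0x8000002
Sep  6 16:46:46 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: fatal: invalid metadata block
Sep  6 16:50:00 flsrv02 kernel: SCSI error : <2 0 0 1> return code = 0x8000002
EOF

# Collect SCSI/I-O error lines that appear BEFORE the GFS fatal message;
# errors logged after the withdraw are not relevant to the cause.
pre_errors=$(awk '/fatal: invalid metadata block/ { exit }
                  /SCSI error|I\/O error|end_request/ { print }' sample-messages)
printf '%s\n' "$pre_errors"
rm -f sample-messages
```

On the synthetic sample this prints only the 16:40:01 error; against the real log, an empty result supports the "no earlier SCSI errors" observation.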
Created attachment 141106 [details]
/var/log/messages file
Glad to hear that the fix referred to above fixed the corrupt RGs.

Also, I initially read the problem as something you had seen occur several times -- as in, you had a filesystem that became corrupt (invalid metadata) several (4) times. Now I think I understand it differently: is it more correct to say that it became corrupt once and you saw the symptoms of this 4 times? That makes more sense. Sorry about that.

So, the "invalid metadata blocks" are fixed, but we still do not know what initially caused the corruption. The BZ you referred to above (#208836) mentions that, in that case, the corruption may have been caused by a failed device. Could there be a similar cause in your case? I'm not sure how we can determine the cause of the corruption at this point.
I am also not sure about how to determine the cause of corruption. It may be a similar case however.
I am going to close this bug. If you hit a corruption problem again, please let me know. If that happens, the logs should help diagnose the cause of the invalid metadata.

*** This bug has been marked as a duplicate of 208836 ***