Bug 126538 - filesystem deadlock when recovery happens
Summary: filesystem deadlock when recovery happens
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Derek Anderson
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-06-22 22:11 UTC by Corey Marthaler
Modified: 2010-01-12 02:52 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-01-05 22:25:04 UTC
Embargoed:



Description Corey Marthaler 2004-06-22 22:11:27 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:
I started I/O to a healthy GFS filesystem on morph-01 through morph-06 and then took down morph-04. This caused the filesystem to hang.

Here are the messages from the different nodes in the cluster:

morph-01:
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Recovery started
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Setting lockspace nodes...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Set 6 nodes
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Rebuilding resource directory...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Rebuilt 3 resources
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Purging invalidated requests...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Purged 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Marking requests for resending...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Marked 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Processing held requests...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Processed 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Resending requests to new masters...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Resent 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Recovery done in 1

Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Recovery started
Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Set 6 nodes
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Rebuilding resource directory...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Rebuilt 3 resources
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Purging invalidated requests...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Purged 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Marking requests for resending...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Recovery done in 1

morph-02:
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Recovery done in 1
Jun 22 16:52:26 morph-02 sshd(pam_unix)[3757]: session opened for user root by (uid=0)
Jun 22 16:54:32 morph-02 login(pam_unix)[1929]: session opened for user root by LOGIN(uid=0)
Jun 22 16:54:32 morph-02  -- root[1929]: DIALUP AT ttyS0 BY root
Jun 22 16:54:32 morph-02  -- root[1929]: ROOT LOGIN ON ttyS0
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:55:11 morph-02 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Set 5 nodes

morph-03:
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Recovery done in 1
Jun 22 16:51:23 morph-03 sshd(pam_unix)[3761]: session opened for user root by (uid=0)
Jun 22 16:53:30 morph-03 login(pam_unix)[1933]: session opened for user root by LOGIN(uid=0)
Jun 22 16:53:30 morph-03  -- root[1933]: DIALUP AT ttyS0 BY root
Jun 22 16:53:30 morph-03  -- root[1933]: ROOT LOGIN ON ttyS0
dlm: gfs0: Recovery started
dlm: gfs0: Setting lockspace nodes...
dlm: gfs0: Set 5 nodes
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Recovery started
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Setting lockspace nodes...
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Set 5 nodes

morph-05:
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Recovery started
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Set 5 nodes
SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
SM: 00000000 process_reply ignored type=2 nodeid=2 id=1
SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=2 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1

morph-06:
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Recovery started
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Set 5 nodes
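
In the excerpts above, several nodes log "Recovery started" and "Set 5 nodes" for a lockspace but no later "Recovery done" line. A minimal sketch for spotting that pattern in collected syslog output follows. It is not part of any GFS or DLM tooling: the function name and the regular expression are assumptions based only on the line format quoted in this report, and console echoes without a timestamp and hostname are skipped.

```python
import re

# Assumed line shape, taken from the excerpts in this report:
#   "Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started"
DLM_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"dlm:\s+(?P<ls>\S+):\s+(?P<msg>.*)$"
)


def unfinished_recoveries(lines):
    """Return {(host, lockspace): timestamp} for every recovery that was
    started and never followed by a 'Recovery done' line on that host."""
    open_since = {}
    for line in lines:
        m = DLM_LINE.match(line.strip())
        if not m:
            continue  # not a timestamped dlm kernel message
        key = (m["host"], m["ls"])
        if m["msg"].startswith("Recovery started"):
            open_since[key] = m["ts"]
        elif m["msg"].startswith("Recovery done"):
            open_since.pop(key, None)
    return open_since


# Sample lines copied from the morph-01 and morph-02 excerpts above.
sample = [
    "Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Recovery started",
    "Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Recovery done in 1",
    "Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Recovery done in 1",
    "Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started",
    "Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Set 5 nodes",
]
print(unfinished_recoveries(sample))
```

An entry in the result only means the log excerpt ends before a completion message; a truncated log would look the same, so this narrows down where to look and does not prove a hang.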



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Start I/O on GFS
2. Shoot a node
3. Attempt I/O
    

Additional info:

Comment 1 David Teigland 2004-06-25 07:49:31 UTC
I fixed a bug with similar symptoms in Changeset 1.1667, although
that bug took quite some effort to trigger in my setup.  So, there's
a good chance this is resolved.

Comment 2 Dean Jansa 2004-07-13 19:43:23 UTC
As of July 13, the simplest case (start IO on all nodes, shoot one) 
still causes this hang. 
 
I only see: 
Jul 13 14:36:14 tank-02 kernel: dlm: gfs0: recover event 23 
Jul 13 14:36:14 tank-02 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:16 tank-03 kernel: dlm: gfs1: recover event 20 
Jul 13 14:36:16 tank-03 kernel: dlm: gfs1: remove node 6 
 
Jul 13 14:36:09 tank-04 kernel: dlm: gfs0: recover event 18 
Jul 13 14:36:09 tank-04 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:16 tank-05 kernel: dlm: gfs0: recover event 13 
Jul 13 14:36:16 tank-05 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:00 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com is not responding - removing from the cluster 
Jul 13 14:36:03 tank-06 kernel: dlm: gfs0: recover event 11 
Jul 13 14:36:03 tank-06 kernel: dlm: gfs0: remove node 6 
 

Comment 3 David Teigland 2004-07-14 09:53:32 UTC
I cannot get this to happen using my four nodes.  Could you get the
nodes into this state, leave them, and then let me log in to inspect?


Comment 4 Kiersten (Kerri) Anderson 2004-11-16 19:10:05 UTC
Updating version to the right level in the defects.  Sorry for the storm.

Comment 5 Corey Marthaler 2005-01-05 22:25:04 UTC
No one has seen this in about 6 months.

