Bug 126538

Summary: filesystem deadlock when recovery happens
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: gfs
Assignee: David Teigland <teigland>
Status: CLOSED WORKSFORME
QA Contact: Derek Anderson <danderso>
Severity: medium
Docs Contact:
Priority: medium
Version: 4
CC: djansa
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-01-05 22:25:04 UTC

Description Corey Marthaler 2004-06-22 22:11:27 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:
I started I/O to a healthy GFS filesystem on morph-01 through morph-06 and then took down morph-04. This caused the filesystem to hang.

Here are the messages from the different nodes in the cluster:

morph-01:
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Recovery started
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Setting lockspace nodes...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Set 6 nodes
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Rebuilding resource directory...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Rebuilt 3 resources
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Purging invalidated requests...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Purged 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Marking requests for resending...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Marked 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Processing held requests...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Processed 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Resending requests to new masters...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Resent 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Recovery done in 1

Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Recovery started
Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Set 6 nodes
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Rebuilding resource directory...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Rebuilt 3 resources
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Purging invalidated requests...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Purged 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Marking requests for resending...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Recovery done in 1

morph-02:
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Recovery done in 1
Jun 22 16:52:26 morph-02 sshd(pam_unix)[3757]: session opened for user root by (uid=0)
Jun 22 16:54:32 morph-02 login(pam_unix)[1929]: session opened for user root by LOGIN(uid=0)
Jun 22 16:54:32 morph-02  -- root[1929]: DIALUP AT ttyS0 BY root
Jun 22 16:54:32 morph-02  -- root[1929]: ROOT LOGIN ON ttyS0
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:55:11 morph-02 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Set 5 nodes

morph-03:
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Recovery done in 1
Jun 22 16:51:23 morph-03 sshd(pam_unix)[3761]: session opened for user root by (uid=0)
Jun 22 16:53:30 morph-03 login(pam_unix)[1933]: session opened for user root by LOGIN(uid=0)
Jun 22 16:53:30 morph-03  -- root[1933]: DIALUP AT ttyS0 BY root
Jun 22 16:53:30 morph-03  -- root[1933]: ROOT LOGIN ON ttyS0
dlm: gfs0: Recovery started
dlm: gfs0: Setting lockspace nodes...
dlm: gfs0: Set 5 nodes
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Recovery started
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Setting lockspace nodes...
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Set 5 nodes

morph-05:
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Recovery started
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Set 5 nodes
SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
SM: 00000000 process_reply ignored type=2 nodeid=2 id=1
SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=2 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1

morph-06:
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Recovery started
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Set 5 nodes



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Start I/O on GFS from all nodes.
2. Shoot (take down) one node.
3. Attempt I/O from the remaining nodes.

Additional info:
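The excerpts above show the pattern behind the hang: on the surviving nodes a lockspace logs "Recovery started" and "Set 5 nodes" but never reaches "Recovery done". A minimal sketch for finding such cases in collected syslog output follows. It is not part of the original report; the function name `unfinished_recoveries` is hypothetical, and the regular expression assumes the syslog line format seen in the excerpts above (timestamp, host, `kernel: dlm: <lockspace>: <message>`).

```python
import re

# Matches syslog lines such as
#   "Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started"
# The timestamp/host prefix is required, so raw console echoes of the same
# message (which carry no hostname) are skipped rather than counted twice.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"dlm:\s+(?P<lockspace>[^:\s]+):\s+(?P<msg>.*)$"
)


def unfinished_recoveries(lines):
    """Return {(host, lockspace): start timestamp} for every recovery that
    logged 'Recovery started' without a later 'Recovery done'."""
    pending = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        key = (m["host"], m["lockspace"])
        if m["msg"].startswith("Recovery started"):
            pending[key] = m["stamp"]
        elif m["msg"].startswith("Recovery done"):
            pending.pop(key, None)
    return pending


if __name__ == "__main__":
    # Lines copied from the morph-01 and morph-02 excerpts in this report.
    sample = [
        "Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Recovery started",
        "Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Recovery done in 1",
        "dlm: gfs1: Recovery started",
        "Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started",
        "Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Set 5 nodes",
    ]
    for (host, lockspace), stamp in sorted(unfinished_recoveries(sample).items()):
        print(f"{host} {lockspace}: recovery started {stamp}, never finished")
```

Run against the concatenated logs from all nodes, this would list each host and lockspace pair that is still stuck in recovery, which may help narrow down where to look first.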

Comment 1 David Teigland 2004-06-25 07:49:31 UTC
I fixed a bug with similar symptoms in Changeset 1.1667, although
that bug took quite some effort to trigger in my setup.  So, there's
a good chance this is resolved.

Comment 2 Dean Jansa 2004-07-13 19:43:23 UTC
As of July 13, the simplest case (start IO on all nodes, shoot one) 
still causes this hang. 
 
I only see: 
Jul 13 14:36:14 tank-02 kernel: dlm: gfs0: recover event 23 
Jul 13 14:36:14 tank-02 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:16 tank-03 kernel: dlm: gfs1: recover event 20 
Jul 13 14:36:16 tank-03 kernel: dlm: gfs1: remove node 6 
 
Jul 13 14:36:09 tank-04 kernel: dlm: gfs0: recover event 18 
Jul 13 14:36:09 tank-04 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:16 tank-05 kernel: dlm: gfs0: recover event 13 
Jul 13 14:36:16 tank-05 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:00 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com is not responding - removing from the cluster
Jul 13 14:36:03 tank-06 kernel: dlm: gfs0: recover event 11 
Jul 13 14:36:03 tank-06 kernel: dlm: gfs0: remove node 6 
 

Comment 3 David Teigland 2004-07-14 09:53:32 UTC
I cannot get this to happen using my four nodes.  Could you get the
nodes into this state, leave them, and then let me log in to inspect?


Comment 4 Kiersten (Kerri) Anderson 2004-11-16 19:10:05 UTC
Updating version to the right level in the defects.  Sorry for the storm.

Comment 5 Corey Marthaler 2005-01-05 22:25:04 UTC
No one has seen this in about 6 months.