Bug 126538 - filesystem deadlock when recovery happens
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Derek Anderson
Reported: 2004-06-22 18:11 EDT by Corey Marthaler
Modified: 2010-01-11 21:52 EST

Doc Type: Bug Fix
Last Closed: 2005-01-05 17:25:04 EST


Description Corey Marthaler 2004-06-22 18:11:27 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:
I started I/O to a healthy GFS filesystem on morph-01 - morph-06 and then took down morph-04. This caused the filesystem to hang.

Here are the messages from the different nodes in the cluster:

morph-01:
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Recovery started
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Setting lockspace nodes...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Set 6 nodes
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Rebuilding resource directory...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Rebuilt 3 resources
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Purging invalidated requests...
Jun 22 16:49:08 morph-01 kernel: dlm: gfs0: Purged 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Marking requests for resending...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Marked 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Processing held requests...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Processed 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Resending requests to new masters...
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Resent 0 requests
Jun 22 16:49:09 morph-01 kernel: dlm: gfs0: Recovery done in 1

Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Recovery started
Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:49:11 morph-01 kernel: dlm: gfs1: Set 6 nodes
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Rebuilding resource directory...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Rebuilt 3 resources
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Purging invalidated requests...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Purged 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Marking requests for resending...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:49:12 morph-01 kernel: dlm: gfs1: Recovery done in 1

morph-02:
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:49:30 morph-02 kernel: dlm: gfs1: Recovery done in 1
Jun 22 16:52:26 morph-02 sshd(pam_unix)[3757]: session opened for user root by (uid=0)
Jun 22 16:54:32 morph-02 login(pam_unix)[1929]: session opened for user root by LOGIN(uid=0)
Jun 22 16:54:32 morph-02  -- root[1929]: DIALUP AT ttyS0 BY root
Jun 22 16:54:32 morph-02  -- root[1929]: ROOT LOGIN ON ttyS0
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:55:11 morph-02 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Set 5 nodes

morph-03:
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Marked 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Processing held requests...
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Processed 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Resending requests to new masters...
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Resent 0 requests
Jun 22 16:48:28 morph-03 kernel: dlm: gfs1: Recovery done in 1
Jun 22 16:51:23 morph-03 sshd(pam_unix)[3761]: session opened for user root by (uid=0)
Jun 22 16:53:30 morph-03 login(pam_unix)[1933]: session opened for user root by LOGIN(uid=0)
Jun 22 16:53:30 morph-03  -- root[1933]: DIALUP AT ttyS0 BY root
Jun 22 16:53:30 morph-03  -- root[1933]: ROOT LOGIN ON ttyS0
dlm: gfs0: Recovery started
dlm: gfs0: Setting lockspace nodes...
dlm: gfs0: Set 5 nodes
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Recovery started
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Setting lockspace nodes...
Jun 22 16:54:08 morph-03 kernel: dlm: gfs0: Set 5 nodes

morph-05:
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Recovery started
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:55:12 morph-05 kernel: dlm: gfs1: Set 5 nodes
SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
Jun 22 16:55:14 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
SM: 00000000 process_reply ignored type=2 nodeid=2 id=1
SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
SM: 00000000 process_reply ignored type=2 nodeid=1 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=2 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=5 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=3 id=1
Jun 22 16:56:32 morph-05 kernel: SM: 00000000 process_reply ignored type=2 nodeid=1 id=1

morph-06:
dlm: gfs1: Recovery started
dlm: gfs1: Setting lockspace nodes...
dlm: gfs1: Set 5 nodes
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Recovery started
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Setting lockspace nodes...
Jun 22 16:53:42 morph-06 kernel: dlm: gfs1: Set 5 nodes
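
In the excerpts above, the last recovery shown on morph-02, morph-03, morph-05 and morph-06 logs "Recovery started" and "Set 5 nodes" but is not followed by a "Recovery done" line. A minimal sketch for spotting that pattern in collected syslog output, assuming the message format shown above. The function name stalled_recoveries and the regular expression are illustrative only and are not part of dlm or GFS:

```python
import re

# Matches timestamped syslog lines such as:
#   Jun 22 16:55:11 morph-02 kernel: dlm: gfs1: Recovery started
# Console echoes without a timestamp, sshd/login lines and SM lines do not match.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<node>\S+)\s+kernel:\s+dlm:\s+"
    r"(?P<lockspace>[^:\s]+):\s+(?P<msg>.+?)\s*$"
)


def stalled_recoveries(lines):
    """Return sorted (node, lockspace) pairs whose most recent
    'Recovery started' was not followed by 'Recovery done'."""
    in_progress = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        key = (m["node"], m["lockspace"])
        if m["msg"].startswith("Recovery started"):
            in_progress[key] = True
        elif m["msg"].startswith("Recovery done"):
            in_progress[key] = False
    return sorted(key for key, pending in in_progress.items() if pending)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python stalled.py < all-nodes-messages.txt
    for node, lockspace in stalled_recoveries(sys.stdin):
        print(f"{node}: recovery of {lockspace} started but never finished")
```

This only reports what the logs contain. A missing "Recovery done" line can also mean the excerpt was cut short, so it is a hint about where to look rather than proof of a hang.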



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Start I/O on GFS.
2. Shoot a node.
3. Attempt I/O.
    

Additional info:
Comment 1 David Teigland 2004-06-25 03:49:31 EDT
I fixed a bug with similar symptoms in Changeset 1.1667, although
that bug took quite some effort to trigger in my setup.  So, there's
a good chance this is resolved.
Comment 2 Dean Jansa 2004-07-13 15:43:23 EDT
As of July 13, the simplest case (start I/O on all nodes, shoot one)
still causes this hang.
 
I only see: 
Jul 13 14:36:14 tank-02 kernel: dlm: gfs0: recover event 23 
Jul 13 14:36:14 tank-02 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:16 tank-03 kernel: dlm: gfs1: recover event 20 
Jul 13 14:36:16 tank-03 kernel: dlm: gfs1: remove node 6 
 
Jul 13 14:36:09 tank-04 kernel: dlm: gfs0: recover event 18 
Jul 13 14:36:09 tank-04 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:16 tank-05 kernel: dlm: gfs0: recover event 13 
Jul 13 14:36:16 tank-05 kernel: dlm: gfs0: remove node 6 
 
Jul 13 14:36:00 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com is not responding - removing from the cluster
Jul 13 14:36:03 tank-06 kernel: dlm: gfs0: recover event 11 
Jul 13 14:36:03 tank-06 kernel: dlm: gfs0: remove node 6 
 
Comment 3 David Teigland 2004-07-14 05:53:32 EDT
I cannot get this to happen using my four nodes.  Could you get the
nodes into this state, leave them, and then let me log in to inspect?
Comment 4 Kiersten (Kerri) Anderson 2004-11-16 14:10:05 EST
Updating version to the right level in the defects.  Sorry for the storm.
Comment 5 Corey Marthaler 2005-01-05 17:25:04 EST
No one has seen this in about 6 months.
