Bug 440118 - Got message for unknown client 0x8000000
Summary: Got message for unknown client 0x8000000
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: lvm2-cluster
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-04-01 19:41 UTC by Nate Straz
Modified: 2010-07-07 11:44 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-07-07 11:44:29 UTC
Embargoed:


Attachments
clvmd -d logs from all nodes (9.27 KB, application/x-gzip)
2008-04-01 19:42 UTC, Nate Straz

Description Nate Straz 2008-04-01 19:41:22 UTC
Description of problem:

While trying to lvextend an LV for growfs testing, I'm seeing command timeouts:

extending LV gfs2 to 4G on morph-03
  Error locking on node morph-02: Command timed out
  Failed to suspend gfs2
couldn't extend LV gfs2 on morph-03

I'm running on a four-node dlm cluster.  When I turn on clvmd debugging I see
something like this on morph-03:

CLVMD[f6feb8e0]: Apr  1 14:20:51 Reply from node morph-01: 0 bytes
CLVMD[f6feb8e0]: Apr  1 14:20:51 Got 1 replies, expecting: 4
CLVMD[f6feb8e0]: Apr  1 14:20:51 Reply from node morph-04: 0 bytes
CLVMD[f6feb8e0]: Apr  1 14:20:51 Got 2 replies, expecting: 4
CLVMD[f65e9ba0]: Apr  1 14:20:51 Command return is 0
CLVMD[f65e9ba0]: Apr  1 14:20:51 Reply from node morph-03: 0 bytes
CLVMD[f65e9ba0]: Apr  1 14:20:51 Got 3 replies, expecting: 4
CLVMD[f65e9ba0]: Apr  1 14:20:51 LVM thread waiting for work
CLVMD[f6feb8e0]: Apr  1 14:21:51 Request timed-out (send: 1207077650, now: 1207077711)
CLVMD[f6feb8e0]: Apr  1 14:21:51 Request timed-out. padding
... eventually ...
CLVMD[f65e9ba0]: Apr  1 14:21:51 LVM thread waiting for work
CLVMD[f6feb8e0]: Apr  1 14:22:39 Got message for unknown client 0x8000000

Then on morph-02:

CLVMD[f654eba0]: Apr  1 14:20:57 process_remote_command 50 for clientid 0x8000000 XID 6 on node morph-03
CLVMD[f654eba0]: Apr  1 14:20:57 do_lock_lv: resource 'BmNmUgP4WKBW7dPUJ9sYtiHu0cOUSwdF8zF1gFTXiUP32q64gY79pDffUYBcQKrD', cmd = 0x1e, flags = 84
CLVMD[f654eba0]: Apr  1 14:20:57 Command return is 0
CLVMD[f654eba0]: Apr  1 14:20:57 LVM thread waiting for work
CLVMD[f6f508e0]: Apr  1 14:21:01 add_to_lvmqueue: cmd=0x8d67088. client=0x80a5dc0, msg=0xfefb8060, len=85, csid=0xfefb8034, xid=0
CLVMD[f654eba0]: Apr  1 14:21:02 process_work_item: remote
CLVMD[f654eba0]: Apr  1 14:21:02 process_remote_command 50 for clientid 0x8000000 XID 9 on node morph-03
CLVMD[f654eba0]: Apr  1 14:21:02 do_lock_lv: resource 'BmNmUgP4WKBW7dPUJ9sYtiHu0cOUSwdFbYCX8GacWZF47cUJ5aeO15GTU2vFtgyX', cmd = 0x1c, flags = 84
CLVMD[f654eba0]: Apr  1 14:22:50 Command return is 0


Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.27-6.el4 (first tried -2.el4)

How reproducible:
Easily

Steps to Reproduce:
Easy way:
 1. Use growfs test case from sts-rhel4.7
Harder way:
 1. Create a small LV, mkfs, mount, start I/O load
 2. lvextend LV
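
A rough shell sketch of the harder way (the VG name, mount point, sizes, and
filesystem flags here are illustrative, not taken from this report):

  # 1. create a small LV, mkfs, mount, start an I/O load
  lvcreate -n gfs2 -L 512M testvg
  mkfs.gfs2 -p lock_dlm -t mycluster:gfs2 -j 4 /dev/testvg/gfs2
  mount /dev/testvg/gfs2 /mnt/gfs2
  dd if=/dev/zero of=/mnt/gfs2/load bs=1M count=8192 &
  # 2. with the I/O still running, try to extend the LV
  lvextend -L 4G /dev/testvg/gfs2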
  
Actual results:
See above

Expected results:
The LV should extend.

Additional info:
I will attach full clvmd -d output from all nodes.

Comment 1 Nate Straz 2008-04-01 19:42:43 UTC
Created attachment 299956 [details]
clvmd -d logs from all nodes

Here are the clvmd -d logs from all nodes.  They start at the beginning of the
test case and end just after the command timeout.

Comment 2 Christine Caulfield 2008-04-02 08:06:35 UTC
It's all down to timeouts.

As you can see in the log from morph-02, the lv suspend (command 0x1c) takes
nearly two minutes. The default command timeout is 1 minute, so the local clvmd
gives up and tells the client the command has timed out.

50 seconds later the reply arrives back from the remote node, but the original
client has got bored and gone away.
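
The timestamps quoted in the morph-03 log above bear the 1-minute default out;
a quick check with plain shell arithmetic:

  echo $((1207077711 - 1207077650))   # 61 seconds, just past the 60s default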

Comment 3 Christine Caulfield 2008-04-16 14:18:43 UTC
What do we want to do about this? It's not a clvmd bug because it's doing all
it can. The most we can do there is extend the timeout period.

On the other hand, it might be possible to speed up the lvextend, though if it's
under heavy IO load, perhaps not...
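
If extending the timeout is the route taken, clvmd's -t option sets the command
timeout in seconds; a sketch, with an illustrative value:

  # restart clvmd with a 5-minute command timeout instead of the 60s default
  clvmd -t 300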

Comment 4 Nate Straz 2008-04-16 14:32:06 UTC
I think we need to figure out why one message is getting delayed so much.  The
rest of the messages come in right away.  I've seen this problem with RHEL5 on
the 64-node cluster too by doing a vgscan on all nodes.
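
A sketch of that reproducer (the node names and the ssh loop are illustrative):

  # fire off vgscan on every node at roughly the same time
  for n in $(seq -w 1 64); do ssh node-$n vgscan & done; wait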

Comment 5 Christine Caulfield 2008-04-16 14:47:18 UTC
See comment #2. The message is being delayed because the lv_suspend is taking a
very long time to complete. If you're running a high IO load, then dm will have
to flush a huge amount of outstanding IO to disk before it can suspend the device.
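
One way to watch how much outstanding IO the suspend is waiting on (standard
kernel counters, not something from this report):

  # system-wide dirty/writeback totals; the suspend can't finish until
  # this device's share of them has drained
  watch 'grep -E "^(Dirty|Writeback):" /proc/meminfo'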

Comment 8 RHEL Program Management 2010-07-07 11:44:29 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

