Bug 440118

Summary: Got message for unknown client 0x8000000
Product: [Retired] Red Hat Cluster Suite
Reporter: Nate Straz <nstraz>
Component: lvm2-cluster
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: low
Version: 4
CC: agk, ccaulfie, dwysocha, iannis, jbrassow, mbroz, prockai
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-07-07 11:44:29 UTC

Attachments: clvmd -d logs from all nodes (flags: none)

Description Nate Straz 2008-04-01 19:41:22 UTC
Description of problem:

While trying to lvextend an LV for growfs testing, I'm seeing command timeouts:

extending LV gfs2 to 4G on morph-03
  Error locking on node morph-02: Command timed out
  Failed to suspend gfs2
couldn't extend LV gfs2 on morph-03

I'm running on a four-node dlm cluster.  When I turn on clvmd debugging, I see
something like this on morph-03:

CLVMD[f6feb8e0]: Apr  1 14:20:51 Reply from node morph-01: 0 bytes
CLVMD[f6feb8e0]: Apr  1 14:20:51 Got 1 replies, expecting: 4
CLVMD[f6feb8e0]: Apr  1 14:20:51 Reply from node morph-04: 0 bytes
CLVMD[f6feb8e0]: Apr  1 14:20:51 Got 2 replies, expecting: 4
CLVMD[f65e9ba0]: Apr  1 14:20:51 Command return is 0
CLVMD[f65e9ba0]: Apr  1 14:20:51 Reply from node morph-03: 0 bytes
CLVMD[f65e9ba0]: Apr  1 14:20:51 Got 3 replies, expecting: 4
CLVMD[f65e9ba0]: Apr  1 14:20:51 LVM thread waiting for work
CLVMD[f6feb8e0]: Apr  1 14:21:51 Request timed-out (send: 1207077650, now:
1207077711)
CLVMD[f6feb8e0]: Apr  1 14:21:51 Request timed-out. padding
... eventually ...
CLVMD[f65e9ba0]: Apr  1 14:21:51 LVM thread waiting for work
CLVMD[f6feb8e0]: Apr  1 14:22:39 Got message for unknown client 0x8000000

Then on morph-02:

CLVMD[f654eba0]: Apr  1 14:20:57 process_remote_command 50 for clientid
0x8000000 XID 6 on node morph-03
CLVMD[f654eba0]: Apr  1 14:20:57 do_lock_lv: resource
'BmNmUgP4WKBW7dPUJ9sYtiHu0cOUSwdF8zF1gFTXiUP32q64gY79pDffUYBcQKrD', cmd = 0x1e,
flags = 84
CLVMD[f654eba0]: Apr  1 14:20:57 Command return is 0
CLVMD[f654eba0]: Apr  1 14:20:57 LVM thread waiting for work
CLVMD[f6f508e0]: Apr  1 14:21:01 add_to_lvmqueue: cmd=0x8d67088.
client=0x80a5dc0, msg=0xfefb8060, len=85, csid=0xfefb8034, xid=0
CLVMD[f654eba0]: Apr  1 14:21:02 process_work_item: remote
CLVMD[f654eba0]: Apr  1 14:21:02 process_remote_command 50 for clientid
0x8000000 XID 9 on node morph-03
CLVMD[f654eba0]: Apr  1 14:21:02 do_lock_lv: resource
'BmNmUgP4WKBW7dPUJ9sYtiHu0cOUSwdFbYCX8GacWZF47cUJ5aeO15GTU2vFtgyX', cmd = 0x1c,
flags = 84
CLVMD[f654eba0]: Apr  1 14:22:50 Command return is 0


Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.27-6.el4 (first tried -2.el4)

How reproducible:
Easily

Steps to Reproduce:
Easy way:
 1. Use growfs test case from sts-rhel4.7
Harder way:
 1. Create a small LV, mkfs, mount, start I/O load
 2. lvextend LV
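
A minimal sketch of the "harder way", assuming a clustered VG named vg0, a
cluster named mycluster, a GFS2 filesystem with four journals, and a mount
point of /mnt/gfs2; none of these names come from the test case:

lvcreate -n gfs2 -L 512M vg0                           # small LV to start from
mkfs.gfs2 -p lock_dlm -t mycluster:gfs2 -j 4 /dev/vg0/gfs2
mount /dev/vg0/gfs2 /mnt/gfs2
dd if=/dev/zero of=/mnt/gfs2/load bs=1M count=4096 &   # sustained write load
lvextend -L 4G /dev/vg0/gfs2                           # fails with "Command timed out"
gfs2_grow /mnt/gfs2                                    # only reached if the extend succeeds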
  
Actual results:
See above

Expected results:
The LV should extend.

Additional info:
I will attach full clvmd -d output from all nodes.

Comment 1 Nate Straz 2008-04-01 19:42:43 UTC
Created attachment 299956 [details]
clvmd -d logs from all nodes

Here are the clvmd -d logs from all nodes.  They start at the beginning of the
test case and end just after the command timeout.

Comment 2 Christine Caulfield 2008-04-02 08:06:35 UTC
It's all down to timeouts.

As you can see in the log from morph-02, the lv suspend (command 0x1c) takes
nearly two minutes. The default command timeout is one minute, so the local clvmd
gives up and tells the client that the command has timed out.

Fifty seconds later the reply arrives back from the remote node, but the original
client has got bored and gone away.

Comment 3 Christine Caulfield 2008-04-16 14:18:43 UTC
What do we want to do about this? It's not a clvmd bug because clvmd is doing all
it can. The most we can do there is extend the timeout period.

On the other hand, it might be possible to speed up the lvextend, though if it's
under heavy IO load, perhaps not...
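
For reference, the timeout extension mentioned above would be done when clvmd is
started, via its -t option (the default is 60 seconds). A sketch; the 180-second
figure is an arbitrary example, not a value recommended anywhere in this bug:

# on each node, restart clvmd with a longer cluster command timeout
clvmd -t 180

Raising the timeout only hides the delay; the suspend still has to finish before
the cluster-wide command can return.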

Comment 4 Nate Straz 2008-04-16 14:32:06 UTC
I think we need to figure out why one message is getting delayed so much.  The
rest of the messages come in right away.  I've seen this problem with RHEL5 on
the 64-node cluster too, by doing a vgscan on all nodes.

Comment 5 Christine Caulfield 2008-04-16 14:47:18 UTC
See comment #2. The message is being delayed because the lv_suspend is taking a
very long time to complete. If you're running a high IO load, then dm has to
flush a huge amount of outstanding IO to disk before it can suspend the device.

Comment 8 RHEL Product and Program Management 2010-07-07 11:44:29 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.