Description of problem: While trying to lvextend an LV for growfs testing I'm seeing command timeouts: extending LV gfs2 to 4G on morph-03 Error locking on node morph-02: Command timed out Failed to suspend gfs2 couldn't extend LV gfs2 on morph-03 I'm running on a four node dlm cluster. When I turn on clvmd debugging I see something like this on morph-03: CLVMD[f6feb8e0]: Apr 1 14:20:51 Reply from node morph-01: 0 bytes CLVMD[f6feb8e0]: Apr 1 14:20:51 Got 1 replies, expecting: 4 CLVMD[f6feb8e0]: Apr 1 14:20:51 Reply from node morph-04: 0 bytes CLVMD[f6feb8e0]: Apr 1 14:20:51 Got 2 replies, expecting: 4 CLVMD[f65e9ba0]: Apr 1 14:20:51 Command return is 0 CLVMD[f65e9ba0]: Apr 1 14:20:51 Reply from node morph-03: 0 bytes CLVMD[f65e9ba0]: Apr 1 14:20:51 Got 3 replies, expecting: 4 CLVMD[f65e9ba0]: Apr 1 14:20:51 LVM thread waiting for work CLVMD[f6feb8e0]: Apr 1 14:21:51 Request timed-out (send: 1207077650, now: 1207077711) CLVMD[f6feb8e0]: Apr 1 14:21:51 Request timed-out. padding ... eventually ... CLVMD[f65e9ba0]: Apr 1 14:21:51 LVM thread waiting for work CLVMD[f6feb8e0]: Apr 1 14:22:39 Got message for unknown client 0x8000000 Then on morph-02: CLVMD[f654eba0]: Apr 1 14:20:57 process_remote_command 50 for clientid 0x8000000 XID 6 on node morph-03 CLVMD[f654eba0]: Apr 1 14:20:57 do_lock_lv: resource 'BmNmUgP4WKBW7dPUJ9sYtiHu0cOUSwdF8zF1gFTXiUP32q64gY79pDffUYBcQKrD', cmd = 0x1e, flags = 84 CLVMD[f654eba0]: Apr 1 14:20:57 Command return is 0 CLVMD[f654eba0]: Apr 1 14:20:57 LVM thread waiting for work CLVMD[f6f508e0]: Apr 1 14:21:01 add_to_lvmqueue: cmd=0x8d67088. client=0x80a5dc0, msg=0xfefb8060, len=85, csid=0xfefb8034, xid=0 CLVMD[f654eba0]: Apr 1 14:21:02 process_work_item: remote CLVMD[f654eba0]: Apr 1 14:21:02 process_remote_command 50 for clientid 0x8000000 XID 9 on node morph-03 CLVMD[f654eba0]: Apr 1 14:21:02 do_lock_lv: resource 'BmNmUgP4WKBW7dPUJ9sYtiHu0cOUSwdFbYCX8GacWZF47cUJ5aeO15GTU2vFtgyX', cmd = 0x1c, flags = 84 CLVMD[f654eba0]: Apr 1 14:22:50 Command return is 0 Version-Release number of selected component (if applicable): lvm2-cluster-2.02.27-6.el4 (first tried -2.el4) How reproducible: Easily Steps to Reproduce: Easy way: 1. Use growfs test case from sts-rhel4.7 Harder way: 1. Create a small LV, mkfs, mount, start I/O load 2. lvextend LV Actual results: See above Expected results: The LV should extend. Additional info: I will attach full clvmd -d output from all nodes.
Created attachment 299956 [details] clvmd -d logs from all nodes Here are the clvmd -d logs from all nodes. They start at the beginning of the test case and end just after the command timeout.
It's all down to timeouts. As you can see in the log from morph-02, the lv suspend (command 0x1c) takes nearly two minutes. The default command timeout is 1 minute so the local clvmd gives up and tells the client the command has timed out. 50 seconds later the command arrives back from the remote node but the original client has got bored and gone away.
What do we want to do about this ? It's not a clvmd bug because it's doing all it can. The most we can do there is extend the timeout period. On the other hand it might be possible to speed up the lvextend, though if it's under heavy IO load, perhaps not...
I think we need to figure out why one message is getting delayed so much. The rest of the messages come in right away. I've seen this problem with RHEL5 on the 64 node cluster too by doing a vgscan on all nodes.
See comment #2. The message being delayed because the lv_suspend is taking a very long time to complete. If you're running a high IO load then dm will have to flush a huge amount of outstanding IO to disk before it can suspend the device.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.