Bug 406721 - clvmd can timeout and fail when starting on nodes in larger cluster
Status: CLOSED NEXTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: lvm2-cluster
Version: 4
Hardware/OS: All Linux
Priority: low   Severity: low
Target Milestone: rc
Assigned To: LVM and device-mapper development team
QA Contact: Cluster QE
Reported: 2007-11-30 12:39 EST by Corey Marthaler
Modified: 2010-07-07 07:22 EDT
CC List: 7 users

Doc Type: Bug Fix
Last Closed: 2010-07-07 07:21:38 EDT

Description Corey Marthaler 2007-11-30 12:39:37 EST
Description of problem:
On my 8-node (qa-xen-{13-20}) cluster with 1 cmirror:

[root@qa-xen-13 ~]# service cmirror start
Loading clustered mirror log:                              [  OK  ]
[root@qa-xen-13 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]

[root@qa-xen-14 ~]# service clvmd start
Starting clvmd: clvmd startup timed out
                                                           [FAILED]

[root@qa-xen-15 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]

[root@qa-xen-16 ~]# service clvmd start
Starting clvmd: clvmd startup timed out
                                                           [FAILED]
[root@qa-xen-17 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]

[root@qa-xen-18 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]

[root@qa-xen-19 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]

[root@qa-xen-20 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]


Version-Release number of selected component (if applicable):
lvm2-2.02.27-2.el4/lvm2-cluster-2.02.27-2.el4
Comment 1 Corey Marthaler 2007-11-30 17:56:18 EST
This is easily reproducible on my xen nodes with the updated 02.27-4 rpms as well.

lvm2-2.02.27-4.el4/lvm2-cluster-2.02.27-4.el4
Comment 2 Corey Marthaler 2007-12-12 12:35:52 EST
I hit this during a revolver run on taft-02, which is only a 4-node cluster with
6 GFS filesystems.
Comment 3 Christine Caulfield 2007-12-13 03:56:47 EST
I suspect this is more DLM-related than LVM-related. clvmd doesn't wait for the
LVs to be activated before returning to the command line; it does them in the
background.

Things to check are the DLM recovery debug logs and cman_tool services (now
known as group_tool).
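
For reference, something along these lines on each node should capture that
state while clvmd is stuck (the /proc path below is the RHEL 4 location and is
an assumption; newer kernels expose the DLM debug info via debugfs instead):

# service/lock group state as seen by the cluster manager
cman_tool services

# DLM recovery debug ring buffer
cat /proc/cluster/dlm_debug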
Comment 4 Corey Marthaler 2007-12-13 16:42:35 EST
I'm seeing revolver time out now due to clvmd not starting in a timely fashion
(taking over seven minutes), though I'm not positive that this is the same issue
since I'm not seeing clvmd itself time out.

Thu Dec 13 15:22:09 CST 2007
Starting clvmd: [  OK  ]
Thu Dec 13 15:27:55 CST 2007
Activating VGs: [  OK  ]
Thu Dec 13 15:29:23 CST 2007
Comment 5 Christine Caulfield 2007-12-14 04:28:35 EST
Looking at the init script, I suspect it's the vgscan that's taking all the time
here. It's the only thing between the clvmd start (which has the timeout) and
the VG activation.
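
Roughly, the sequence in that init script is (paraphrased from memory, not the
literal script):

# outline of 'service clvmd start':
daemon clvmd                              # "Starting clvmd:" - the only step with a timeout
vgscan > /dev/null 2>&1                   # rescan PVs/VGs - the suspected slow step
action "Activating VGs:" vgchange -a y    # activate the clustered VGs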

That's still probably clvmd or dlm related (unless it's disk IO) as it gathers
the locks at startup in the background before allowing things like vgscan to do
anything.

Can you start clvmd as 'clvmd -d2', please? This should send debugging output to
syslog (at level debug, so make sure it's filed somewhere!) and give us a chance
to see where the bottleneck is.
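
For example (the syslog facility clvmd logs to and the destination file are
assumptions; adjust them to the local syslog.conf):

# make sure debug-level daemon messages are actually written somewhere
echo 'daemon.debug    /var/log/clvmd-debug.log' >> /etc/syslog.conf
service syslog restart

# then start clvmd by hand with debugging enabled
clvmd -d2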

To be honest though, I doubt it will be easy to speed things like this up,
unless it's a really heinous bug that we haven't spotted before.
Comment 6 Christine Caulfield 2008-02-05 08:57:36 EST
I wonder if this is related to the number of PVs in the system. I created a VG
with a large number of PVs and it took FOREVER to run vgscan (well, I killed it
after 25 minutes to be honest).

It's also worth knowing that vgscan operations do not run in parallel. If you
run 8 vgscans on 8 nodes at the same time then the locking will serialise them.
So if one node takes a minute to vgscan, then 8 nodes booting together could
take 8 minutes.

Note that this only happens when the PVs are in a VG, and it seems to be
independent of the number of LVs.
Comment 7 Christine Caulfield 2008-02-05 09:29:39 EST
I meant to say "... could take 8 minutes on one of the nodes". It would likely
take 7 minutes on another, 6 on another etc. Only the first node to get the lock
would take 1 minute.
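
One quick way to see that serialisation (the node names and the ssh loop are
just an illustration, not part of the reproducer) is to start vgscan on every
node at once and compare the elapsed times; if the lock serialises them, the
times should step up roughly linearly from node to node:

# run vgscan simultaneously on all nodes and collect the timings
for node in qa-xen-13 qa-xen-14 qa-xen-15 qa-xen-16; do
    ssh $node 'time vgscan' > /tmp/vgscan.$node.log 2>&1 &
done
wait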
Comment 9 Milan Broz 2009-02-02 04:35:58 EST
There have been several changes in lvm2 since this BZ was reported which may
help here. Is it still reproducible?
Comment 11 Corey Marthaler 2009-05-11 15:47:36 EDT
I appear to have hit this bug with the latest 4.8 code on just a 3-node cluster.

[revolver] ================================================================================                    
[revolver] Scenario iteration 1.2 started at Mon May 11 07:09:21 CDT 2009                                      
[revolver] Sleeping 3 minute(s) to let the I/O get its lock count up...                                        
[revolver] Senario: DLM kill one node                                                                          
[revolver]                                                                                                     
[revolver] Those picked to face the revolver... hayes-01                                                       
[revolver] Feeling lucky hayes-01? Well do ya? Go'head make my day...                                          
[revolver] Didn't receive heartbeat for 5 seconds                                                              
[revolver]                                                                                                     
[revolver] Verify that hayes-01 has been removed from cluster on remaining nodes                               
[revolver] Verifying that the dueler(s) are alive                                                              
[revolver] Still not all alive, sleeping another 10 seconds                                                    
[revolver] Still not all alive, sleeping another 10 seconds                                                    
[revolver] All killed nodes are back up, making sure they're qarshable...                                      
[revolver] Still not all qarshable, sleeping another 10 seconds                                                
[revolver] Still not all qarshable, sleeping another 10 seconds                                                
[revolver] Still not all qarshable, sleeping another 10 seconds                                                
[revolver] Verifying that recovery properly took place on the node(s) which stayed in the cluster              
[revolver] checking Fence recovery...                                                                          
[revolver] checking DLM recovery...                                                                            
[revolver] checking GFS recovery...                                                                            
[revolver] Verifying that clvmd was started properly on the dueler(s)                                          
[revolver] mounting /dev/mapper/HAYES-HAYES0 on /mnt/HAYES0 on hayes-01                                        
[revolver] mount: special device /dev/mapper/HAYES-HAYES0 does not exist  

The LV failed to activate on hayes-01.

May 11 11:35:55 hayes-01 ccsd[3324]: Initial status:: Inquorate
May 11 11:35:58 hayes-01 kernel: CMAN: sending membership request
May 11 11:35:59 hayes-01 kernel: CMAN: got node hayes-02
May 11 11:35:59 hayes-01 kernel: CMAN: got node hayes-03
May 11 11:35:59 hayes-01 kernel: CMAN: Finished transition, generation 1
May 11 11:35:59 hayes-01 ccsd[3324]: Cluster is quorate.  Allowing connections.
May 11 11:35:59 hayes-01 kernel: CMAN: quorum regained, resuming activity
May 11 11:35:59 hayes-01 cman: startup succeeded
May 11 11:35:59 hayes-01 kernel: dm-cmirror: dm-cmirror 0.2.0 (built May  5 2009 15:26:19) installed
May 11 11:35:59 hayes-01 cmirror: startup succeeded
May 11 11:35:59 hayes-01 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
May 11 11:37:59 hayes-01 fenced: startup failed
May 11 11:37:59 hayes-01 kernel: aoe: AoE v71 initialised.
May 11 11:37:59 hayes-01 kernel: aoe: e1.1: setting 1024 byte data frames
May 11 11:37:59 hayes-01 kernel: aoe: 0030488d63d3 e1.1 v4467 has 19046937725 sectors
May 11 11:37:59 hayes-01 kernel:  etherd/e1.1: p1
May 11 11:38:04 hayes-01 udevd[1238]: udev done!
May 11 11:38:14 hayes-01 last message repeated 3 times
May 11 11:38:20 hayes-01 clvmd: clvmd startup timed out
May 11 11:38:20 hayes-01 clvmd: clvmd startup failed
