Description of problem:
In a 2-node cluster, after one node is fenced, any clvm command hangs on the remaining node. When the fenced node comes back into the cluster, any clvm command also hangs; moreover, the node does not activate any clustered VG, and so cannot access any shared device.

Version-Release number of selected component (if applicable):
RHEL 5.2, updated:
device-mapper-1.02.28-2.el5.x86_64.rpm
lvm2-2.02.40-6.el5.x86_64.rpm
lvm2-cluster-2.02.40-7.el5.x86_64.rpm

How reproducible:
Always

Steps to Reproduce:
1. 2-node cluster, quorum formed with qdisk
2. Cold boot node 2
3. Node 2 is evicted and fenced; services are taken over by node 1
4. Node 2 comes back into the cluster, quorate, but no clustered VGs are up and any lvm-related command hangs
5. At this step every lvm command hangs on node 1

Expected results:
Node 2 should be able to get back the lock on the clustered lvm volumes, and node 1 should be able to issue any lvm-related command.

Here are my cluster.conf and lvm.conf:

<?xml version="1.0"?>
<cluster alias="rome" config_version="53" name="rome">
    <fence_daemon clean_start="0" post_fail_delay="9" post_join_delay="6"/>
    <clusternodes>
        <clusternode name="romulus.fr" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="ilo172"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="remus.fr" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="ilo173"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3"/>
    <totem consensus="4800" join="60" token="21002" token_retransmits_before_loss_const="20"/>
    <fencedevices>
        <fencedevice agent="fence_ilo" hostname="X.X.X.X" login="Administrator" name="ilo172" passwd="X.X.X.X"/>
        <fencedevice agent="fence_ilo" hostname="XXXX" login="Administrator" name="ilo173" passwd="XXXX"/>
    </fencedevices>
    <rm>
        <failoverdomains/>
        <resources/>
        <vm autostart="1" exclusive="0" migrate="live" name="alfrescoP64" path="/etc/xen" recovery="relocate"/>
        <vm autostart="1" exclusive="0" migrate="live" name="alfrescoI64" path="/etc/xen" recovery="relocate"/>
        <vm autostart="1" exclusive="0" migrate="live" name="alfrescoS64" path="/etc/xen" recovery="relocate"/>
    </rm>
    <quorumd interval="3" label="quorum64" min_score="1" tko="30" votes="1">
        <heuristic interval="2" program="ping -c3 -t2 X.X.X.X" score="1"/>
    </quorumd>
</cluster>

Part of lvm.conf:

# Type 3 uses built-in clustered locking.
locking_type = 3

# If using external locking (type 2) and initialisation fails,
# with this set to 1 an attempt will be made to use the built-in
# clustered locking.
# If you are using a customised locking_library you should set this to 0.
fallback_to_clustered_locking = 0

# If an attempt to initialise type 2 or type 3 locking failed, perhaps
# because cluster components such as clvmd are not running, with this set
# to 1 an attempt will be made to use local file-based locking (type 1).
# If this succeeds, only commands against local volume groups will proceed.
# Volume Groups marked as clustered will be ignored.
fallback_to_local_locking = 1

# Local non-LV directory that holds file-based locks while commands are
# in progress. A directory like /tmp that may get wiped on reboot is OK.
locking_dir = "/var/lock/lvm"

# Other entries can go here to allow you to load shared libraries
# e.g. if support for LVM1 metadata was compiled as a shared library use
#   format_libraries = "liblvm2format1.so"
# Full pathnames can be given.

# Search this directory first for shared libraries.
# library_dir = "/lib"

# The external locking library to load if locking_type is set to 2.
# locking_library = "liblvm2clusterlock.so"
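As a quick sanity check of this configuration (a sketch only; these commands are not taken from the report, and the vg_attr layout is the usual lvm2 2.02.x one), the effective locking type and the clustered flag on each VG can be inspected with:

    # Confirm the locking type lvm is actually using (expect locking_type=3):
    lvm dumpconfig | grep locking_type

    # A 'c' in the sixth character of vg_attr marks a clustered VG,
    # i.e. one that needs a running clvmd for any access:
    vgs -o vg_name,vg_attr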
Part of the lvm log on the second node:

vgchange.c:165   Activated logical volumes in volume group "VolGroup00"
vgchange.c:172   7 logical volume(s) in volume group "VolGroup00" now active
cache/lvmcache.c:1220   Wiping internal VG cache
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:17:29 2009
commands/toolcontext.c:209   Set umask to 0077
locking/cluster_locking.c:83   connect() failed on local socket: Connection refused
locking/locking.c:259   WARNING: Falling back to local file-based locking.
locking/locking.c:261   Volume Groups with the clustered attribute will be inaccessible.
toollib.c:578   Finding all volume groups
toollib.c:491   Finding volume group "VGhomealfrescoS64"
metadata/metadata.c:2379   Skipping clustered volume group VGhomealfrescoS64
toollib.c:491   Finding volume group "VGhomealfS64"
metadata/metadata.c:2379   Skipping clustered volume group VGhomealfS64
toollib.c:491   Finding volume group "VGvmalfrescoS64"
metadata/metadata.c:2379   Skipping clustered volume group VGvmalfrescoS64
toollib.c:491   Finding volume group "VGvmalfrescoI64"
metadata/metadata.c:2379   Skipping clustered volume group VGvmalfrescoI64
toollib.c:491   Finding volume group "VGvmalfrescoP64"
metadata/metadata.c:2379   Skipping clustered volume group VGvmalfrescoP64
toollib.c:491   Finding volume group "VolGroup00"
libdm-report.c:981   VolGroup00
cache/lvmcache.c:1220   Wiping internal VG cache
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:17:29 2009
commands/toolcontext.c:209   Set umask to 0077
locking/cluster_locking.c:83   connect() failed on local socket: Connection refused
locking/locking.c:259   WARNING: Falling back to local file-based locking.
locking/locking.c:261   Volume Groups with the clustered attribute will be inaccessible.
toollib.c:542   Using volume group(s) on command line
toollib.c:491   Finding volume group "VolGroup00"
vgchange.c:117   7 logical volume(s) in volume group "VolGroup00" monitored
cache/lvmcache.c:1220   Wiping internal VG cache
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:20:45 2009
commands/toolcontext.c:209   Set umask to 0077
toollib.c:331   Finding all logical volumes
commands/toolcontext.c:188   Logging initialised at Wed Jun  3 15:20:50 2009
commands/toolcontext.c:209   Set umask to 0077
toollib.c:578   Finding all volume groups

group_tool on node 1:

type             level name       id       state
fence            0     default    00010001 none
[1 2]
dlm              1     clvmd      00010002 none
[1 2]
dlm              1     rgmanager  00020002 none
[1]

group_tool on node 2:

[root@remus ~]# group_tool
type             level name       id       state
fence            0     default    00010001 none
[1 2]
dlm              1     clvmd      00010002 none
[1 2]

Additional info:
The lvm log file you have included from node 2 shows the message:

locking/cluster_locking.c:83   connect() failed on local socket: Connection refused
locking/locking.c:259   WARNING: Falling back to local file-based locking.
locking/locking.c:261   Volume Groups with the clustered attribute will be inaccessible.

This says that clvmd is not running. If clvmd is not running on one node then clustered volumes will not be accessible on that node. Also, clustered LVM commands on node 1 will hang until clvmd is started on node 2 (or they will time out after a couple of minutes).

Having said that, it looks like clvmd either is running, or has been running and has maybe crashed or been stopped, because there is an active clvmd lockspace on both nodes. It might also be possible that clvmd is running and the lvm commands cannot talk to it (maybe SELinux?) - though that wouldn't explain the hangs on node 1.

Can you check whether clvmd is still running on node 2, please? If not, then it would be helpful to find out what happened to it. A core file or a 'clvmd -d' log would be useful here. If clvmd IS running, then an strace of it, to see if it is responding to the connect request from lvm clients (which should probably also be straced), is the thing to get.
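A minimal way to gather that (a sketch; the file path is illustrative, and these are the standard RHEL 5 cluster tools rather than commands from this report):

    # Is a clvmd process still alive, and does its lockspace still exist?
    ps -C clvmd -o pid,stat,cmd
    group_tool ls

    # If clvmd has died, restart it in the foreground with debug output
    # redirected to a file, so any crash or refusal is captured:
    clvmd -d1 2> /tmp/clvmd-debug.log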
We do not use either iptables or SELinux at all.

On node 1:

service clvmd status
clvmd (pid 8549) is running...
--> this command hangs, I have to Ctrl-C

No output from clvmd -d1; it returns to the command line immediately.

On node 2:

service clvmd status
clvmd (pid 9849 9832 8536) is running...

No output from clvmd -d1; it stays alive, I have to Ctrl-C.
Sorry, maybe I should have been clearer. clvmd will not print any output (from strace, or into the system log after a -d1) unless an lvm command is issued that tries to connect to it. Can you strace an lvm command while clvmd is also under strace, please?
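For example (a sketch assuming clvmd is already running; the output paths are illustrative):

    # Attach strace to the running clvmd (-f follows any threads it
    # spawns) and leave it tracing in the background:
    strace -f -p $(pidof clvmd) -o /tmp/strace_clvmd.out &

    # In another shell on the same node, run a clustered lvm command
    # under strace, so both ends of the local socket are captured:
    strace -o /tmp/strace_lvscan.out lvscan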
Created attachment 346505: strace_clvmd_while_strace_lvscan_node1
Created attachment 346506: strace_clvmd_while_strace_lvscan_node2
Created attachment 346507: strace_lvscan_node1
Created attachment 346508: strace_lvscan_node2
OK, here are the two commands I ran on both nodes:

strace -o /tmp/strace_lvscan_nodeX lvscan
strace -o /tmp/strace_clvmd_while_strace_lvscan_nodeX clvmd -d1

Results are in the attachments. Every command hung except:

strace -o /tmp/strace_clvmd_while_strace_lvscan_node1 clvmd -d1
- On node 1, clvmd is already running, so stracing another instance prints no useful info.
- On node 2 the problem is apparently elsewhere: clvmd dies on the socket connection before it does anything.

If this is still reproducible on a recent release (RHEL 5.5), please attach full logs and reopen; the ones here are incomplete. (Or use the Red Hat support channel to open a new issue, thanks.)
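For reference, a sketch of how fuller logs could be collected on a release of that vintage (lvmdump ships with the lvm2 package; the paths here are illustrative, not from this report):

    # Verbose client-side log of a hanging command, written to a file:
    lvscan -vvvv 2> /tmp/lvscan-vvvv.log

    # Bundle lvm.conf, metadata backups and device state into a tarball
    # suitable for attaching to a report:
    lvmdump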