Bug 632319 - dlm_controld daemon cpg_dispatch error 2
Summary: dlm_controld daemon cpg_dispatch error 2
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: cluster
Version: 13
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Fabio Massimo Di Nitto
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 631496
 
Reported: 2010-09-09 17:01 UTC by Michael Hagmann
Modified: 2011-02-08 11:06 UTC
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-08 11:06:45 UTC
Type: ---
Embargoed:


Attachments
Sosreport from second node Leo (435.93 KB, application/x-xz)
2010-09-09 17:03 UTC, Michael Hagmann
Sosreport from first node Scheat (3.89 MB, application/x-xz)
2010-09-09 17:05 UTC, Michael Hagmann
sosreport after cluster.conf change to 19 (437.50 KB, application/x-xz)
2010-09-09 19:31 UTC, Michael Hagmann
sosreport after cluster.conf change to 19 (3.73 MB, application/x-xz)
2010-09-09 19:32 UTC, Michael Hagmann

Description Michael Hagmann 2010-09-09 17:01:42 UTC
Description of problem:

I set up an extremely simple cluster with two virtual Fedora 13 nodes: one (Scheat) is a file server and the second (Leo) is just a minimal Fedora 13 installation for testing.

After installing with luci, I configured one service with an IP address for failover. After a short time corosync could not update the config (according to syslog), and today cman on Scheat died.

Version-Release number of selected component (if applicable):

see sosreport

How reproducible:

?

Steps to Reproduce:
1.
2.
3.
  
Actual results:

cluster not working

Expected results:

No problems at all; this is the simplest possible config!

Additional info:

On the file server Scheat it works at the beginning:

Sep 07 21:03:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 07 21:03:05 corosync [CLM   ] New Configuration:
Sep 07 21:03:05 corosync [CLM   ]       r(0) ip(192.168.1.5)
Sep 07 21:03:05 corosync [CLM   ] Members Left:
Sep 07 21:03:05 corosync [CLM   ] Members Joined:
Sep 07 21:03:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 07 21:03:05 corosync [CLM   ] New Configuration:
Sep 07 21:03:05 corosync [CLM   ]       r(0) ip(192.168.1.5)
Sep 07 21:03:05 corosync [CLM   ]       r(0) ip(192.168.1.6)
Sep 07 21:03:05 corosync [CLM   ] Members Left:
Sep 07 21:03:05 corosync [CLM   ] Members Joined:
Sep 07 21:03:05 corosync [CLM   ]       r(0) ip(192.168.1.6)
Sep 07 21:03:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 07 21:03:05 corosync [QUORUM] Members[1]: 1
Sep 07 21:03:05 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:03:05 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:03:05 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Sep 07 21:07:55 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:13:37 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:25:48 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:28:06 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:30:15 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:44:39 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:45:10 corosync [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration

Sep 07 20:26:38 dlm_controld dlm_controld 3.0.14 started
Sep 09 07:33:26 dlm_controld dlm_controld 3.0.14 started
Sep 09 07:33:35 dlm_controld daemon cpg_dispatch error 2
Sep 09 07:33:35 dlm_controld cluster is down, exiting
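
The "New configuration version has to be newer than current running configuration" message points at a mismatch between the cluster.conf on disk and the version corosync is already running. A minimal way to compare the two on each node, assuming the standard cluster 3.x command-line tools are installed, is something like:

# Version of the configuration cman/corosync is currently running
cman_tool version

# Version of the configuration stored on disk
grep config_version /etc/cluster/cluster.conf

If the numbers differ between the two nodes, they are not operating on the same configuration.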


[root@scheat cluster]# service modclusterd status
modclusterd is stopped
[root@scheat cluster]# service ricci status
ricci is stopped
[root@scheat cluster]# service rgmanger status
rgmanger: unrecognized service
[root@scheat cluster]# service cman status
Found stale pid file
[root@scheat cluster]# service --list | grep rgmanager
--list: unrecognized service
[root@scheat cluster]# service --status-all | grep rgmanager
rgmanager (pid  1779) is running...
[root@scheat cluster]# service rgmanager status
rgmanager (pid  1779) is running...
[root@scheat cluster]# clustat
Could not connect to CMAN: No such file or directory
[root@scheat cluster]# 


On the second node Leo, everything looks OK:
[root@leo cluster]# clustat
Cluster Status for gecco @ Thu Sep  9 18:58:15 2010
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 scheat                                                              1 Offline
 leo                                                                 2 Online, Local, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 service:Webserver                                                leo                                                              started       
[root@leo cluster]# 




Any idea?

Have I forgotten to configure something?

Thanks, Michael

Comment 1 Michael Hagmann 2010-09-09 17:03:35 UTC
Created attachment 446303 [details]
Sosreport from second node Leo

Comment 2 Michael Hagmann 2010-09-09 17:05:28 UTC
Created attachment 446304 [details]
Sosreport from first Node Scheat

Comment 3 Michael Hagmann 2010-09-09 17:11:12 UTC
I am also not able to start cman again:

[root@scheat cluster]# service cman start
Starting cluster: 
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain... 
                                                           [FAILED]
[root@scheat cluster]# 

How can I configure manual fencing via luci? I think that's the problem.
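
One way to see what the fence domain join is waiting on, assuming the fence_tool utility from the cluster 3.x tools is available, is to query fenced directly on the node that fails to join:

# Show fence domain membership and state as fenced sees it
fence_tool ls

# Dump fenced's internal debug buffer for more detail
fence_tool dump

This is only a diagnostic sketch; it does not configure fencing, it just shows why the join is blocked.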

Michael

Comment 4 Michael Hagmann 2010-09-09 18:37:25 UTC
After adding manual fencing, the cluster works fine:

[root@leo tmp]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="18" name="gecco">
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="scheat" nodeid="1" votes="1">
			<fence>
				<method name="single">
					<device name="human" nodename="scheat"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="leo" nodeid="2" votes="1">
			<fence>
				<method name="single">
					<device name="human" nodename="leo"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice name="human" agent="fence_manual"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="scheat" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="scheat" priority="1"/>
			</failoverdomain>
			<failoverdomain name="all" nofailback="0" ordered="1" restricted="0">
				<failoverdomainnode name="scheat" priority="1"/>
				<failoverdomainnode name="leo" priority="1"/>
			</failoverdomain>
			<failoverdomain name="leo" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="leo" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="192.168.1.111" sleeptime="10"/>
		</resources>
		<service autostart="1" domain="all" exclusive="1" name="Webserver" recovery="relocate">
			<ip ref="192.168.1.111"/>
		</service>
	</rm>
</cluster>
[root@leo tmp]# 
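
As a side note, a quick sanity check after editing cluster.conf by hand is to validate it against the cluster schema; a sketch, assuming the ccs_config_validate tool shipped with the cluster 3.x packages is installed:

# Validate the local /etc/cluster/cluster.conf before distributing it
ccs_config_validate

It reports schema or syntax errors that could otherwise leave the nodes disagreeing about the configuration.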

But luci doesn't show the cluster anymore!

https://bugzilla.redhat.com/show_bug.cgi?id=631496

Thanks for the help

Michael

Comment 5 Lon Hohberger 2010-09-09 18:59:52 UTC
Manual override is built-in; there is no need to configure it.  Also, there is no fence_manual agent.

I do not think fencing configuration was the problem -- I think somehow the two config files got out of sync.  The two sosreports have cluster.conf versions 15 and 16.  I am not sure why.

Updating the cluster config by adding manual fencing brought the config files back into sync, causing things to work again.
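
For reference, the usual way to get both nodes back onto the same version after editing cluster.conf is to raise config_version and ask cman to load the new copy. A sketch, assuming cman_tool from the cluster 3.x tools; the version number here is only illustrative:

# After raising config_version in /etc/cluster/cluster.conf:
cman_tool version -r 19

Depending on the setup, the updated file may still need to reach the other node (luci/ricci normally takes care of this); the point is that both nodes must end up reporting the same version in "cman_tool version".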

Comment 6 Michael Hagmann 2010-09-09 19:18:22 UTC
OK,

so I could deconfigure it again, and it should work?

I assumed fencing was the problem because when I tried to start cman, "Joining fence domain..." failed.

So this config version should also work?

<?xml version="1.0"?>
<cluster config_version="19" name="gecco">
 <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
 <clusternodes>
  <clusternode name="scheat" nodeid="1" votes="1">
  <fence/>
  </clusternode>
  <clusternode name="leo" nodeid="2" votes="1">
   <fence/>
  </clusternode>
 </clusternodes>
 <cman expected_votes="1" two_node="1"/>
 <fencedevices/>
 <rm>
  <failoverdomains>
   <failoverdomain name="scheat" nofailback="0" ordered="0" restricted="1">
    <failoverdomainnode name="scheat" priority="1"/>
   </failoverdomain>
   <failoverdomain name="all" nofailback="0" ordered="1" restricted="0">
    <failoverdomainnode name="scheat" priority="1"/>
    <failoverdomainnode name="leo" priority="1"/>
   </failoverdomain>
   <failoverdomain name="leo" nofailback="0" ordered="0" restricted="1">
    <failoverdomainnode name="leo" priority="1"/>
   </failoverdomain>
  </failoverdomains>
  <resources>
   <ip address="192.168.1.111" sleeptime="10"/>
  </resources>
  <service autostart="1" domain="all" exclusive="1" name="Webserver"
recovery="relocate">
   <ip ref="192.168.1.111"/>
  </service>
 </rm>
</cluster>

But why then does luci not allow me to administer the cluster?

--> https://bugzilla.redhat.com/show_bug.cgi?id=631496

Something is completely wrong.

Michael

Comment 7 Michael Hagmann 2010-09-09 19:26:55 UTC
Lon

You are right. I updated to no fence device and it works too.


Michael

Comment 8 Michael Hagmann 2010-09-09 19:29:34 UTC
And now luci works again too!

Is it normal that a small config error makes luci give a 500?

Michael

Comment 9 Michael Hagmann 2010-09-09 19:31:22 UTC
Created attachment 446354 [details]
sosreport after cluster.conf change to 19

Comment 10 Michael Hagmann 2010-09-09 19:32:16 UTC
Created attachment 446355 [details]
sosreport after cluster.conf change to 19

Comment 11 Fabio Massimo Di Nitto 2011-02-08 11:06:45 UTC
(In reply to comment #8)
> And now luci works again too!
>
> Is it normal that a small config error makes luci give a 500?
>
> Michael

This issue has been addressed in luci, and luci now has a much broader understanding of the configuration.

The configuration version problem could have been caused by the other issue you had, where luci was unable to talk to one of the ricci sessions (reported in another bug).

