Description of problem:

I set up an extremely simple cluster with two virtual Fedora 13 nodes: one (scheat) is a file server and the second (leo) is just a minimal Fedora 13 installation for testing. After installing with luci, I configured one service with an IP for failover. After a short time corosync could not update the config (according to syslog), and today cman on scheat died.

Version-Release number of selected component (if applicable):
see sosreport

How reproducible:
?

Steps to Reproduce:
1.
2.
3.

Actual results:
cluster not working

Expected results:
no problem at all; this is the simplest possible config!

Additional info:

On the file server scheat it works at the beginning:

Sep 07 21:03:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 07 21:03:05 corosync [CLM   ] New Configuration:
Sep 07 21:03:05 corosync [CLM   ] 	r(0) ip(192.168.1.5)
Sep 07 21:03:05 corosync [CLM   ] Members Left:
Sep 07 21:03:05 corosync [CLM   ] Members Joined:
Sep 07 21:03:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
Sep 07 21:03:05 corosync [CLM   ] New Configuration:
Sep 07 21:03:05 corosync [CLM   ] 	r(0) ip(192.168.1.5)
Sep 07 21:03:05 corosync [CLM   ] 	r(0) ip(192.168.1.6)
Sep 07 21:03:05 corosync [CLM   ] Members Left:
Sep 07 21:03:05 corosync [CLM   ] Members Joined:
Sep 07 21:03:05 corosync [CLM   ] 	r(0) ip(192.168.1.6)
Sep 07 21:03:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 07 21:03:05 corosync [QUORUM] Members[1]: 1
Sep 07 21:03:05 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:03:05 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:03:05 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Sep 07 21:07:55 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:13:37 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:25:48 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:28:06 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:30:15 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:44:39 corosync [QUORUM] Members[2]: 1 2
Sep 07 21:45:10 corosync [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Sep 07 20:26:38 dlm_controld dlm_controld 3.0.14 started
Sep 09 07:33:26 dlm_controld dlm_controld 3.0.14 started
Sep 09 07:33:35 dlm_controld daemon cpg_dispatch error 2
Sep 09 07:33:35 dlm_controld cluster is down, exiting

[root@scheat cluster]# service modclusterd status
modclusterd is stopped
[root@scheat cluster]# service ricci status
ricci is stopped
[root@scheat cluster]# service rgmanger status
rgmanger: unrecognized service
[root@scheat cluster]# service cman status
Found stale pid file
[root@scheat cluster]# service --list | grep rgmanager
--list: unrecognized service
[root@scheat cluster]# service --status-all | grep rgmanager
rgmanager (pid  1779) is running...
[root@scheat cluster]# service rgmanager status
rgmanager (pid  1779) is running...
[root@scheat cluster]# clustat
Could not connect to CMAN: No such file or directory
[root@scheat cluster]#

On the second node, leo, all looks OK:

[root@leo cluster]# clustat
Cluster Status for gecco @ Thu Sep 9 18:58:15 2010
Member Status: Quorate

 Member Name                    ID   Status
 ------ ----                    ---- ------
 scheat                            1 Offline
 leo                               2 Online, Local, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:Webserver              leo                            started
[root@leo cluster]#

Any idea? Have I forgotten to configure something?

Thanks,
Michael
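As an aside, the [QUORUM] membership lines in logs like the excerpt above can be extracted mechanically. Below is a minimal Python sketch; the regex is based only on the log format shown in this report, real corosync output may vary by version, and the `last_membership` helper is mine, not part of any cluster tool:

```python
import re

# Matches lines like the ones in the excerpt above:
#   Sep 07 21:03:05 corosync [QUORUM] Members[2]: 1 2
# (format assumed from this report; other corosync versions may differ)
QUORUM_RE = re.compile(r"\[QUORUM\]\s+Members\[(\d+)\]:\s+([\d ]+)")

def last_membership(log_text):
    """Return (member_count, node_ids) from the last [QUORUM] Members
    line in the log text, or None if no such line is present."""
    result = None
    for line in log_text.splitlines():
        m = QUORUM_RE.search(line)
        if m:
            result = (int(m.group(1)), [int(n) for n in m.group(2).split()])
    return result

if __name__ == "__main__":
    sample = (
        "Sep 07 21:03:05 corosync [QUORUM] Members[1]: 1\n"
        "Sep 07 21:03:05 corosync [QUORUM] Members[2]: 1 2\n"
    )
    print(last_membership(sample))  # -> (2, [1, 2])
```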
Created attachment 446303 [details]
Sosreport from second node, leo
Created attachment 446304 [details]
Sosreport from first node, scheat
I am also not able to start cman again:

[root@scheat cluster]# service cman start
Starting cluster:
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [FAILED]
[root@scheat cluster]#

How can I configure manual fencing via luci? I think that is the problem.

Michael
After adding manual fencing, the cluster works fine:

[root@leo tmp]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="18" name="gecco">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="scheat" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="scheat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="leo" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="leo"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="scheat" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="scheat" priority="1"/>
      </failoverdomain>
      <failoverdomain name="all" nofailback="0" ordered="1" restricted="0">
        <failoverdomainnode name="scheat" priority="1"/>
        <failoverdomainnode name="leo" priority="1"/>
      </failoverdomain>
      <failoverdomain name="leo" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="leo" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="192.168.1.111" sleeptime="10"/>
    </resources>
    <service autostart="1" domain="all" exclusive="1" name="Webserver" recovery="relocate">
      <ip ref="192.168.1.111"/>
    </service>
  </rm>
</cluster>
[root@leo tmp]#

But luci no longer shows the cluster! https://bugzilla.redhat.com/show_bug.cgi?id=631496

Thanks for the help,
Michael
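The fence wiring in a cluster.conf like the one above can be cross-checked mechanically: each node's <device> elements refer by name to entries under <fencedevices>. Here is an illustrative Python sketch using the stdlib XML parser against a trimmed copy of the config from this comment (the `fence_map` helper is mine, not a cluster utility):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf shown above (fencing parts only).
CLUSTER_CONF = """\
<?xml version="1.0"?>
<cluster config_version="18" name="gecco">
  <clusternodes>
    <clusternode name="scheat" nodeid="1" votes="1">
      <fence><method name="single"><device name="human" nodename="scheat"/></method></fence>
    </clusternode>
    <clusternode name="leo" nodeid="2" votes="1">
      <fence><method name="single"><device name="human" nodename="leo"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
</cluster>
"""

def fence_map(conf_xml):
    """Map each cluster node name to the fence agent(s) its fence
    devices resolve to via the <fencedevices> section."""
    root = ET.fromstring(conf_xml)
    agents = {fd.get("name"): fd.get("agent") for fd in root.iter("fencedevice")}
    return {
        node.get("name"): [agents.get(d.get("name")) for d in node.iter("device")]
        for node in root.iter("clusternode")
    }

if __name__ == "__main__":
    print(fence_map(CLUSTER_CONF))  # -> {'scheat': ['fence_manual'], 'leo': ['fence_manual']}
```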
Manual override is built-in; there is no need to configure it. Also, there is no fence_manual agent. I do not think fencing configuration was the problem -- I think somehow the two config files got out of sync. The two sosreports have cluster conf versions 15 and 16. I am not sure why. Updating the cluster config by adding manual fencing brought the config files back into sync, causing things to work again.
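To illustrate the out-of-sync situation described here: the log in the original report shows cman rejecting an update because the "New configuration version has to be newer than current running configuration", so two nodes holding different config_version values can wedge config updates until they agree again. A minimal Python sketch of that version comparison (the helper names are mine and this is not a cluster utility, just an illustration of the rule):

```python
import xml.etree.ElementTree as ET

def config_version(conf_xml):
    """Return the integer config_version attribute of a cluster.conf
    document (the root <cluster> element carries it)."""
    return int(ET.fromstring(conf_xml).get("config_version"))

def out_of_sync(conf_a, conf_b):
    """True when two nodes' cluster.conf copies disagree on version.

    cman refuses to load a config whose version is not strictly newer
    than the one already running, so a stale copy on one node blocks
    updates until the versions are brought back into sync -- which is
    what bumping the version (here, by adding fencing config) did.
    """
    return config_version(conf_a) != config_version(conf_b)

if __name__ == "__main__":
    a = '<cluster config_version="15" name="gecco"/>'  # scheat's sosreport
    b = '<cluster config_version="16" name="gecco"/>'  # leo's sosreport
    print(out_of_sync(a, b))  # -> True
```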
OK, so I could deconfigure it again and it should still work? I assumed fencing was the problem, because when I try to start cman, "Joining fence domain..." fails.

So this config version should also work?

<?xml version="1.0"?>
<cluster config_version="19" name="gecco">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="scheat" nodeid="1" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="leo" nodeid="2" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices/>
  <rm>
    <failoverdomains>
      <failoverdomain name="scheat" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="scheat" priority="1"/>
      </failoverdomain>
      <failoverdomain name="all" nofailback="0" ordered="1" restricted="0">
        <failoverdomainnode name="scheat" priority="1"/>
        <failoverdomainnode name="leo" priority="1"/>
      </failoverdomain>
      <failoverdomain name="leo" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="leo" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="192.168.1.111" sleeptime="10"/>
    </resources>
    <service autostart="1" domain="all" exclusive="1" name="Webserver" recovery="relocate">
      <ip ref="192.168.1.111"/>
    </service>
  </rm>
</cluster>

But why then does luci not let me administer the cluster? --> https://bugzilla.redhat.com/show_bug.cgi?id=631496

Something is completely wrong.

Michael
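One sanity check worth noting for two-node configs like the one above: in cman, two_node="1" is meant to go together with expected_votes="1", so that either node on its own remains quorate. Below is an illustrative Python sketch of that rule (the `check_two_node` helper is mine, not part of the cluster tooling, and real config validation would be done with the cluster's own tools):

```python
import xml.etree.ElementTree as ET

def check_two_node(conf_xml):
    """Check the cman two-node special case: two_node="1" should be
    paired with expected_votes="1" (and exactly two cluster nodes) so a
    single surviving node keeps quorum. Returns a list of problems;
    an empty list means the rule holds."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    problems = []
    if cman is not None and cman.get("two_node") == "1":
        if cman.get("expected_votes") != "1":
            problems.append('two_node="1" requires expected_votes="1"')
        if len(root.findall(".//clusternode")) != 2:
            problems.append('two_node="1" requires exactly 2 cluster nodes')
    return problems

if __name__ == "__main__":
    conf = (
        '<cluster config_version="19" name="gecco">'
        '<clusternodes>'
        '<clusternode name="scheat" nodeid="1" votes="1"/>'
        '<clusternode name="leo" nodeid="2" votes="1"/>'
        '</clusternodes>'
        '<cman expected_votes="1" two_node="1"/>'
        '</cluster>'
    )
    print(check_two_node(conf))  # -> [] (the config above satisfies the rule)
```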
Lon, you are right. I updated the config to have no fence device and it works too.

Michael
And now luci works again too!

Is it normal that a small config error makes luci return a 500?

Michael
Created attachment 446354 [details]
sosreport after cluster.conf change to 19
Created attachment 446355 [details]
sosreport after cluster.conf change to 19
(In reply to comment #8)
> and now also luci works again !
>
> is that normal that with a small configerror luci give a 500?
>
> Michael

This issue has been addressed in luci, and luci now has a much better understanding of the configuration. The configuration version problem could have been caused by the other issue you had, where luci was unable to talk to one of the ricci sessions (reported in another bug).