Bug 224462

Summary: clurgmgrd claims "service started" but it is not
Product: [Retired] Red Hat Cluster Suite
Reporter: Roger Pena-Escobio <orkcu>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint, tmarshal
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-01-26 17:58:54 UTC
Attachments:
- Original configuration
- rg_test output of original configuration
- altered configuration
- rg_test output of new configuration, clusterfs.sh not modified yet
- rg_test output of new configuration, clusterfs.sh modified to set mountpoint unique="0"
- clusterfs.sh with unique for mountpoint set to 0

Description Roger Pena-Escobio 2007-01-25 20:04:01 UTC
Description of problem:
If I have two identical services in different failover domains, one of the
services starts, but the other claims to have started while not actually doing
anything.

Version-Release number of selected component (if applicable):
rgmanager-1.9.54-1

Steps to Reproduce:
Just try this cluster.conf:

        <rm>
                <failoverdomains>
                        <failoverdomain name="mysql" ordered="0" restricted="1">
                                <failoverdomainnode name="blade21" priority="1"/>
                                <failoverdomainnode name="blade22" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="apache25" ordered="0" restricted="1">
                                <failoverdomainnode name="blade25" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="apache26" ordered="0" restricted="1">
                                <failoverdomainnode name="blade26" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="ftp" ordered="0" restricted="1">
                                <failoverdomainnode name="blade25" priority="1"/>
                                <failoverdomainnode name="blade26" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/httpd" name="apache start-stop"/>
                        <script file="/etc/init.d/vsftpd" name="vsftpd"/>
                </resources>
                <service autostart="1" domain="mysql" name="mysqld" recovery="restart">
                        <script file="/etc/init.d/mysqld" name="mysql start-stop">
                                <ip address="172.17.0.123" monitor_link="1"/>
                        </script>
                        <fs device="/dev/mapper/MysqlData-VarLibMysql" force_fsck="0" force_unmount="1" fsid="30618" fstype="ext3" mountpoint="/var/lib/mysql" name="MysqlData" options="" self_fence="1"/>
                </service>
                <service autostart="1" domain="apache25" name="apache25">
                        <clusterfs device="/dev/emcpowerd1" force_unmount="0" fsid="41106" fstype="gfs" mountpoint="/opt/www" name="WWWData" options="">
                                <script ref="vsftpd"/>
                        </clusterfs>
                        <clusterfs device="/dev/emcpowera1" force_unmount="0" fsid="30342" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft" options="">
                                <script ref="apache start-stop"/>
                        </clusterfs>
                </service>
                <service autostart="1" domain="apache26" name="apache26">
                        <clusterfs device="/dev/emcpowerd1" force_unmount="0" fsid="41107" fstype="gfs" mountpoint="/opt/www" name="WWWData" options="">
                                <script ref="vsftpd"/>
                        </clusterfs>
                        <clusterfs device="/dev/emcpowerb1" force_unmount="0" fsid="30343" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft" options="">
                                <script ref="apache start-stop"/>
                        </clusterfs>
                </service>
        </rm>
  
Actual results:
Jan 25 14:23:57 blade26 clurgmgrd[3494]: <notice> Starting disabled service apache26
Jan 25 14:23:57 blade26 clurgmgrd[3494]: <notice> Service apache26 started

Expected results:
Jan 25 15:32:31 blade25 clurgmgrd[3990]: <notice> Starting disabled service apache25
Jan 25 15:32:31 blade25 clurgmgrd: [3990]: <info> Executing /etc/init.d/vsftpd start
Jan 25 15:32:31 blade25 vsftpd: vsftpd vsftpd succeeded
Jan 25 15:32:31 blade25 clurgmgrd: [3990]: <info> Executing /etc/init.d/httpd start
Jan 25 15:32:31 blade25 httpd: httpd startup succeeded
Jan 25 15:32:31 blade25 clurgmgrd[3990]: <notice> Service apache25 started
Jan 25 15:32:40 blade25 clurgmgrd: [3990]: <info> Executing /etc/init.d/vsftpd status
Jan 25 15:32:40 blade25 clurgmgrd: [3990]: <info> Executing /etc/init.d/httpd status

Additional info:
I tried configuring another service with a dissimilar name but the same
resources, and the same thing happened.
As I said, the cluster claims everything is OK, but it isn't:
[root@blade26 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  blade25                                  Online, rgmanager
  blade26                                  Online, Local, rgmanager
  blade21                                  Online, rgmanager
  blade22                                  Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  mysqld               blade21                        started
  apache25             blade25                        started
  apache26             blade26                        started
[root@blade26 ~]# ps ax | grep http
16087 pts/0    S+     0:00 grep http
[root@blade26 ~]#

Comment 1 Lon Hohberger 2007-01-26 17:30:10 UTC
Quite the interesting configuration there :)  Ok, to start, have a look at:

# rg_test test /etc/cluster/cluster.conf
Unique/primary not unique type clusterfs, name=WWWData
Error storing clusterfs resource
Unique/primary not unique type clusterfs, name=WWWSoft
Error storing clusterfs resource
...

When rgmanager detects collisions between attributes of a resource type which
are required to be unique across the resource type, it stops parsing that branch
of the tree (so, your references to scripts in the apache26 service are not even
present in service trees that rgmanager constructs internally - see the bottom
of the output of rg_test).  The apache26 service has two resource collisions
with the apache25 service:

(1) WWWData is defined twice, with basically identical components (except fsid,
which does not affect your configuration).

You should put this one in your <resources> block and pass it by reference (like
you did with scripts).
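
The define-once, reference-by-name pattern for clusterfs would look roughly like this minimal sketch (a hypothetical fragment based on the names above, not the complete fix):

```xml
<resources>
        <!-- Define the shared GFS mount once in the resources block... -->
        <clusterfs device="/dev/emcpowerd1" force_unmount="0" fsid="41106"
                   fstype="gfs" mountpoint="/opt/www" name="WWWData" options=""/>
</resources>
<service autostart="1" domain="apache25" name="apache25">
        <!-- ...and reference it by name in each service that uses it. -->
        <clusterfs ref="WWWData"/>
</service>
<service autostart="1" domain="apache26" name="apache26">
        <clusterfs ref="WWWData"/>
</service>
```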

(2) WWWSoft is defined twice with a different device, but the same mount point,
causing a naming & mount point collision.

You need to rename one to something else to resolve the naming collision.

The mount point is also the same, and it must be unique.  However, you can
make it not required to be unique by tweaking the metadata in
/usr/share/cluster/clusterfs.sh:
* set "unique" to "0" for the "mountpoint" parameter
* restart rgmanager on both nodes

Most users should *not* do this, but in your case, it looks safe to do (since
the two services will never coexist on the same node due to restricted failover
domains).

Warning: do not change the primary attribute ("name", in most cases), or you
will probably break stuff.
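
For reference, the mountpoint entry in the metadata that clusterfs.sh emits looks roughly like the fragment below (layout and descriptions are illustrative, not quoted from the script; only the "unique" attribute changes):

```xml
<!-- Hypothetical sketch of the clusterfs.sh metadata entry; change
     unique="1" to unique="0" on the mountpoint parameter only. -->
<parameter name="mountpoint" unique="0" required="1">
        <longdesc lang="en">
                Path in the file system hierarchy at which to mount this
                file system.
        </longdesc>
        <shortdesc lang="en">Mount Point</shortdesc>
        <content type="string"/>
</parameter>
```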

Anyway, if you change the 'unique' flag of the 'mountpoint' parameter to 0 in
/usr/share/cluster/clusterfs.sh and restart rgmanager, the following
configuration should work:
        <rm>
                <failoverdomains>
                        <failoverdomain name="mysql" ordered="0" restricted="1">
                                <failoverdomainnode name="blade21" priority="1"/>
                                <failoverdomainnode name="blade22" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="apache25" ordered="0" restricted="1">
                                <failoverdomainnode name="blade25" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="apache26" ordered="0" restricted="1">
                                <failoverdomainnode name="blade26" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="ftp" ordered="0" restricted="1">
                                <failoverdomainnode name="blade25" priority="1"/>
                                <failoverdomainnode name="blade26" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/httpd" name="apache start-stop"/>
                        <script file="/etc/init.d/vsftpd" name="vsftpd"/>
                        <clusterfs device="/dev/emcpowerd1" force_unmount="0" fsid="41107" fstype="gfs" mountpoint="/opt/www" name="WWWData" options=""/>
                        </clusterfs>
                </resources>
                <service autostart="1" domain="mysql" name="mysqld" recovery="restart">
                        <fs device="/dev/mapper/MysqlData-VarLibMysql" force_fsck="0" force_unmount="1" fsid="30618" fstype="ext3" mountpoint="/var/lib/mysql" name="MysqlData" options="" self_fence="1"/>
                        <ip address="172.17.0.123" monitor_link="1"/>
                        <script file="/etc/init.d/mysqld" name="mysql start-stop"/>
                </service>
                <service autostart="1" domain="apache25" name="apache25">
                        <clusterfs ref="WWWData"/>
                        <clusterfs device="/dev/emcpowera1" force_unmount="0" fsid="30342" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft1" options=""/>
                        <script ref="vsftpd"/>
                        <script ref="apache start-stop"/>
                </service>
                <service autostart="1" domain="apache26" name="apache26">
                        <clusterfs ref="WWWData"/>
                        <clusterfs device="/dev/emcpowerb1" force_unmount="0" fsid="30343" fstype="gfs" mountpoint="/opt/soft" name="WWWSoft2" options=""/>
                        <script ref="vsftpd"/>
                        <script ref="apache start-stop"/>
                </service>
        </rm>

Now, if you don't change /usr/share/cluster/clusterfs.sh, you'll have to change
the mount points and make the apache scripts context-sensitive.  With the above
configuration, you can do this by checking "OCF_RESKEY_service_name" in the
script and starting apache with a different config based on it; i.e. (untested
example; the idea is that it starts httpd based on the service it is part of,
using /etc/httpd/conf/httpd-<service_name>.conf):

--- /etc/init.d/httpd.old       2007-01-26 12:08:59.000000000 -0500
+++ /etc/init.d/httpd   2007-01-26 12:10:33.000000000 -0500
@@ -57,6 +57,9 @@
 # when not running is also a failure.  So we just do it the way init scripts
 # are expected to behave here.
 start() {
+       if [ "$OCF_RESKEY_service_name" ]; then
+               OPTIONS="$OPTIONS -f /etc/httpd/conf/httpd-${OCF_RESKEY_service_name}.conf"
+       fi
         echo -n $"Starting $prog: "
         check13 || exit 1
         LANG=$HTTPD_LANG daemon $httpd $OPTIONS

If you choose to do it this way, WWWSoft1 and WWWSoft2 in the above example
configuration will need different mount points (/opt/soft1 and /opt/soft2, for
example), and whatever points at /opt/soft in /etc/httpd/conf/httpd-apache25.conf
and httpd-apache26.conf will need to be set accordingly.
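
The option-building step from the patch can be sketched standalone like this (the service name value is illustrative; in a real cluster rgmanager exports OCF_RESKEY_service_name to resource scripts):

```shell
#!/bin/sh
# Simulate the environment rgmanager provides to resource scripts
# (illustrative value, not taken from a live cluster).
OCF_RESKEY_service_name="apache25"

OPTIONS=""
# Append a per-service httpd config file only when a service name is set,
# so the script still behaves normally when run outside rgmanager.
if [ -n "$OCF_RESKEY_service_name" ]; then
        OPTIONS="$OPTIONS -f /etc/httpd/conf/httpd-${OCF_RESKEY_service_name}.conf"
fi

echo "httpd would start with:$OPTIONS"
```

Run outside the cluster (with the variable unset), OPTIONS stays empty and httpd falls back to its default configuration.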

While you get things up and running, I will investigate the possibility of
allowing non-primary (but unique) namespace collisions across disjoint
restricted failover domains.  This will not be solved overnight, mind you (and
may fall into the realm of the dependency code we're working on).



Comment 2 Lon Hohberger 2007-01-26 17:33:59 UTC
Generally, you should always design your services as though they can coexist,
unless there is a device disconnect between the nodes (for example,
/dev/emcpowera1 is not connected to blade26 and /dev/emcpowerb1 is not connected
to blade25).

Oh, the above configuration has an extraneous "</clusterfs>" in the
<resources> section.  Remove it before use ;)



Comment 3 Lon Hohberger 2007-01-26 17:36:29 UTC
Created attachment 146689 [details]
Original configuration

Comment 4 Lon Hohberger 2007-01-26 17:37:00 UTC
Created attachment 146690 [details]
rg_test output of original configuration

Comment 5 Lon Hohberger 2007-01-26 17:37:57 UTC
Created attachment 146691 [details]
altered configuration

Comment 6 Lon Hohberger 2007-01-26 17:38:45 UTC
Created attachment 146692 [details]
rg_test output of new configuration, clusterfs.sh not modified yet

Comment 7 Lon Hohberger 2007-01-26 17:39:30 UTC
Created attachment 146693 [details]
rg_test output of new configuration, clusterfs.sh modified to set mountpoint unique="0"

Comment 8 Lon Hohberger 2007-01-26 17:40:38 UTC
Created attachment 146694 [details]
clusterfs.sh with unique for mountpoint set to 0

[Note: from RHEL5 branch, but should work on RHEL4]

Comment 9 Lon Hohberger 2007-01-26 17:58:54 UTC
I've filed a separate bugzilla feature request to allow reuse of "unique"
attributes when the services can never collide, as well as to add syslog
logging (rather than just printf) when resource collisions occur:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=224608

The current behavior concerning resource collisions is not a bug, but it may be
possible to expand it as described previously (and in the bugzilla noted
above).  Additionally, the collisions might be something we can check for in
the GUIs (system-config-cluster and Conga), so that this does not quietly hit
other users.