Bug 1087286 - The cman service could not start after updating the RHEL version from 6.0 to RHEL6.3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.3
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: rc
: ---
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1075802
 
Reported: 2014-04-14 08:16 UTC by henry
Modified: 2018-12-03 21:06 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, errors generated while updating the resource-agents schema were sometimes not reported. As a consequence, if an error occurred when updating the resource-agents schema, the update failed silently and later attempts to start the cman service could fail as well. With this update, schema errors are reported, and remedial action can be taken at upgrade time in case of problems.
Clone Of:
Environment:
Last Closed: 2015-07-22 07:04:28 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1363 normal SHIPPED_LIVE cluster bug fix and enhancement update 2015-07-20 17:59:16 UTC

Description henry 2014-04-14 08:16:07 UTC
- Description of problem:

1. The customer deployed RHCS on RHEL 6.0. After running "yum update" to update from RHEL 6.0 to 6.3, the cman service could not start normally.

2. Having the customer copy in a clean cluster.rng or run "yum reinstall cman" manually did not solve the issue.

- Steps to Reproduce:

1. Deploy RHCS on RHEL 6.0.
2. Update RHEL 6.0 to RHEL 6.3 by running "yum update".
3. Start the cman service.

- Actual results:

Starting the cman service shows the error messages below:
----------------------------
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... 
/usr/share/cluster/cluster.rng:998: element ref: Relax-NG parser error : Reference VM has no matching definition
/usr/share/cluster/cluster.rng:995: element ref: Relax-NG parser error : Reference SERVICE has no matching definition
/usr/share/cluster/cluster.rng:995: element ref: Relax-NG parser error : Internal found no define for ref SERVICE
/usr/share/cluster/cluster.rng:998: element ref: Relax-NG parser error : Internal found no define for ref VM
Relax-NG schema /usr/share/cluster/cluster.rng failed to compile
[  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]
-------------------------------
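
The parser errors above name references (SERVICE, VM) that have no matching definition. As a rough way to confirm this on an affected file, the following sketch lists every <ref> whose name has no <define> in the same document. It assumes the schema uses the standard RELAX NG namespace, and it does not follow <include> or <externalRef>; the function name undefined_refs and the SAMPLE text are illustrative, not part of the cluster packages.

```python
import xml.etree.ElementTree as ET

# Standard RELAX NG structure namespace (assumed to be the default
# namespace of the schema being checked).
RNG_NS = "http://relaxng.org/ns/structure/1.0"


def undefined_refs(rng_text):
    """Return the sorted names used in <ref> that have no <define>."""
    root = ET.fromstring(rng_text)
    defined = {el.get("name") for el in root.iter("{%s}define" % RNG_NS)}
    referenced = {el.get("name") for el in root.iter("{%s}ref" % RNG_NS)}
    return sorted(referenced - defined)


# Hypothetical, heavily reduced schema that mimics the reported damage:
# SERVICE and VM are referenced but only IP is defined.
SAMPLE = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="rm">
      <choice>
        <ref name="SERVICE"/>
        <ref name="VM"/>
        <ref name="IP"/>
      </choice>
    </element>
  </start>
  <define name="IP"><element name="ip"><empty/></element></define>
</grammar>"""

print(undefined_refs(SAMPLE))  # ['SERVICE', 'VM']
```

To check a real file, read it and pass its text, for example undefined_refs(open("/usr/share/cluster/cluster.rng").read()).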

- Expected results: The cman service starts normally.

- Current status:
 
1. We worked around the issue by adding "bash -x" to the ccs_update_schema script, or by deleting ccs_update_schema outright.
2. In fact, once the md5sum changes and the criteria for triggering a refresh of resources.rng.cache are met, the script regenerates cluster.rng and the problem is resolved.

- Analysis with SEG:

1. ccs_update_schema checks whether resources.rng.hash has changed; if so, it triggers a refresh of resources.rng.cache. See generate_hash() and generate_ras() in ccs_update_schema for how it decides whether resources.rng.cache needs updating and how it generates resources.rng.cache.

2. Setting -x in ccs_update_schema resolves the issue because doing so changes the md5sum and therefore meets the criteria for triggering a refresh of resources.rng.cache.

3. Manually copying in a clean cluster.rng or running yum reinstall cman does not solve the issue because ccs_update_schema always uses resources.rng.cache to generate cluster.rng. Since the customer's resources.rng.cache is bad (it is missing the SERVICE and VM definitions), cluster.rng is always wrong after ccs_update_schema runs.
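
The three points above can be illustrated with a small simulation. This is only a sketch of the hash-gated caching pattern as described in this report, not the real ccs_update_schema (which is a shell script); it assumes the hash covers the script's own contents, and the function names compute_hash, update_schema and demo are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_hash(inputs):
    """Return an MD5 hex digest over the contents of all input files."""
    digest = hashlib.md5()
    for path in sorted(inputs):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def update_schema(inputs, hash_file, cache_file, schema_file, extract):
    """Refresh the cache only when the input hash changed, then always
    rebuild the schema from the cache. Returns True if the cache was
    regenerated."""
    current = compute_hash(inputs)
    stored = hash_file.read_text() if hash_file.exists() else None
    regenerated = False
    if stored != current or not cache_file.exists():
        cache_file.write_text(extract())
        hash_file.write_text(current)
        regenerated = True
    # The schema is always derived from the cache, so a bad cache
    # overwrites any clean copy of the schema.
    schema_file.write_text("<grammar>" + cache_file.read_text() + "</grammar>")
    return regenerated


def demo():
    """Replay the two situations from the report in a temporary directory."""
    good_defs = '<define name="SERVICE"/><define name="VM"/>'
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        script = root / "ccs_update_schema"
        hash_file = root / "resources.rng.hash"
        cache_file = root / "resources.rng.cache"
        schema_file = root / "cluster.rng"

        # Starting point: a bad (empty) cache whose stored hash still matches.
        script.write_text("#!/bin/bash\n")
        cache_file.write_text("")
        hash_file.write_text(compute_hash([script]))

        outcomes = []

        # Case 1: copy in a clean cluster.rng, then run the update.
        schema_file.write_text("<grammar>" + good_defs + "</grammar>")
        regenerated = update_schema(
            [script], hash_file, cache_file, schema_file, lambda: good_defs
        )
        outcomes.append((regenerated, "SERVICE" in schema_file.read_text()))

        # Case 2: edit the script (as adding -x does), then run the update.
        script.write_text("#!/bin/bash -x\n")
        regenerated = update_schema(
            [script], hash_file, cache_file, schema_file, lambda: good_defs
        )
        outcomes.append((regenerated, "SERVICE" in schema_file.read_text()))
        return outcomes
```

Calling demo() returns one (cache regenerated, SERVICE present in schema) pair per case: in the first case the clean cluster.rng is overwritten from the bad cache, and in the second the changed script hash forces the cache, and so the schema, to be rebuilt.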



- Additional info:

1. There are indeed some differences between the 6.0 and 6.3 directory structures for cluster files. For example, in 6.0 /var/lib/cluster is empty and cluster.rng is located in /usr/share/cluster. service.sh also differs slightly between 6.0 and 6.3.

2. Our guess is that after upgrading to 6.3, resources.rng.cache somehow ends up wrong and becomes the "bad guy" causing the problem.

3. We need help confirming whether this is a bug, and welcome any other ideas about it.


Henry Bai
GSS China

Comment 2 Christine Caulfield 2014-04-22 09:11:54 UTC
Is this reproducible at all or has it only happened to this one customer?

Comment 3 henry 2014-04-23 02:33:14 UTC
(In reply to Christine Caulfield from comment #2)
> Is this reproducible at all or has it only happened to this one customer?

Hi Christine, 

Thanks for your reply. The issue occurs in several HA clusters (4 or 6 two-node clusters) in this customer's production environment. So far, this is the only customer I know of who performed the cluster upgrade and hit the problem.


Henry Bai
GSS China

Comment 4 henry 2014-04-29 01:44:50 UTC
Hi Engineering team,

Is there any feedback on this issue? We held an sbr-cluster meeting last week and discussed it; Ryan Mitchell on the SEG team said another customer has also hit the same issue, so I think it is not an isolated case.
Would you please help me confirm whether it is a bug?

Thanks,

Henry Bai
GSS China

Comment 6 Jan Pokorný [poki] 2014-04-29 21:33:36 UTC
A small proposal to make this alleged situation easier to identify
next time (there can be variants):

https://www.redhat.com/archives/cluster-devel/2014-April/msg00089.html

Comment 7 Jan Pokorný [poki] 2014-04-29 21:34:12 UTC
(there can be *better* variants)

Comment 8 Jan Pokorný [poki] 2014-05-06 18:06:03 UTC
Patch landed upstream (STABLE32 branch).

Comment 9 Christine Caulfield 2014-07-04 11:58:08 UTC
commit 8fd4192e154384a4e5a7f4b16dc5365118ac98d1
Author: Jan Pokorný <jpokorny@redhat.com>
Date:   Tue Apr 29 23:24:30 2014 +0200

    xml: ccs_update_schema: be verbose about extraction fail
    
    Previously, the distillation of resource-agents' metadata could fail
    from unexpected reasons without any evidence ever being made, unlike
    in case of fence-agents.  Also "no metadata" and "issue with their
    extraction" will allegedly yield the same outcome, so it is reflected
    in the comments being emitted to the schema for both sorts of agents.
    
    Signed-off-by: Jan Pokorný <jpokorny@redhat.com>
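
The idea in the commit message can be sketched as follows. This is not the actual patch (ccs_update_schema is a shell script); it only illustrates reporting an extraction failure on stderr while leaving a comment in the generated schema fragment, with "no metadata" and "extraction failed" treated alike. The names build_resource_defs and fake_extract, and the wording of the messages, are hypothetical.

```python
import io
import sys


def build_resource_defs(agents, extract, log=None):
    """Collect schema fragments for each agent, reporting any failure.

    extract(name) is expected to return the fragment text for one agent.
    A failed or empty extraction is logged and recorded as an XML comment
    instead of being dropped silently.
    """
    log = log if log is not None else sys.stderr
    parts = []
    for name in agents:
        try:
            metadata = extract(name)
        except Exception as exc:
            print("warning: %s: extraction failed: %s" % (name, exc), file=log)
            metadata = ""
        if metadata:
            parts.append(metadata)
        else:
            # Both "no metadata" and "extraction failed" end up here.
            parts.append("<!-- %s: no metadata or extraction failed -->" % name)
    return "\n".join(parts)


def fake_extract(name):
    """Stand-in extractor: fails for 'vm', succeeds for anything else."""
    if name == "vm":
        raise RuntimeError("extraction exited with status 1")
    return '<define name="%s"/>' % name.upper()


captured = io.StringIO()
fragment = build_resource_defs(["ip", "vm"], fake_extract, log=captured)
```

After this runs, fragment holds the IP definition followed by a comment for vm, and captured holds the warning, so the failure is visible both at upgrade time and to anyone who later inspects the schema.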

Comment 14 Justin Payne 2015-03-27 20:46:42 UTC
Verified in cman-3.0.12.1-73.el6:

[root@host-134 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.0 (Santiago)
[root@host-134 ~]# rpm -q cman
cman-3.0.12-23.el6_0.7.x86_64
[root@host-134 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]

[root@host-135 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.0 (Santiago)
[root@host-135 ~]# rpm -q cman
cman-3.0.12-23.el6_0.7.x86_64
[root@host-135 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]

[root@host-136 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.0 (Santiago)
[root@host-136 ~]# rpm -q cman
cman-3.0.12-23.el6_0.7.x86_64
[root@host-136 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]

[root@host-133 ~]# for i in `seq 4 6`; do qarsh root@host-13$i rpm -q cman; done
cman-3.0.12.1-73.el6.x86_64
cman-3.0.12.1-73.el6.x86_64
cman-3.0.12.1-73.el6.x86_64

[root@host-134 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]

[root@host-135 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]

[root@host-136 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Starting gfs_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]

Comment 16 errata-xmlrpc 2015-07-22 07:04:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1363.html

