Bug 889098 - rgmanager did not recognize config updates from ccs
Summary: rgmanager did not recognize config updates from ccs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.8
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Ryan McCabe
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 822104
Depends On:
Blocks: 928849
 
Reported: 2012-12-20 08:37 UTC by Josef Zimek
Modified: 2018-12-03 18:04 UTC
CC List: 14 users

Fixed In Version: rgmanager-2.0.52-44.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-30 22:37:16 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 430813 0 None None None Never
Red Hat Knowledge Base (Solution) 446573 0 None None None Never
Red Hat Product Errata RHBA-2013:1316 0 normal SHIPPED_LIVE rgmanager bug fix and enhancement update 2013-09-30 21:13:21 UTC

Description Josef Zimek 2012-12-20 08:37:28 UTC
Description of problem:

After modifying cluster.conf and distributing it to the other nodes with ccs_tool update, cluster.conf is changed on all the nodes but the change is not applied in the cluster.

Version-Release number of selected component (if applicable):
cman-2.0.115-96.el5_8.3.x86_64 

How reproducible:
Always

Steps to Reproduce: 
1. modify cluster.conf (add new service)
2. propagate the changes with ccs_tool update
3. check cluster.conf on all nodes and verify that the change took effect (new service present in clustat output); see the sketch below
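
A minimal shell sketch of these steps (purely illustrative: the node and service names are hypothetical, and the verification loop assumes passwordless ssh between the nodes; these are not the exact commands used in this case):

# 1. edit cluster.conf on one node: add the new <service/> stanza and bump config_version
vi /etc/cluster/cluster.conf

# 2. propagate the updated file to the other cluster members
ccs_tool update /etc/cluster/cluster.conf

# 3. confirm the file landed everywhere and the new service shows up
for node in node1 node2 node3 node4; do        # hypothetical node names
    ssh "$node" 'grep config_version /etc/cluster/cluster.conf'
done
clustat | grep "service:myservice"             # hypothetical service name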
  
Actual results:
Service is not added

Expected results:
New service is visible

Additional info:

The rgmanager dumps (attached in case 00724487) show something interesting:

$ grep "SAP-BOP" sapproclt0*
sapproclt01_rgmanager-dump.mN27ID:    rg="service:SAP-BOP", View: 8, Size: 96, Address: 0x2aaab05c3060
sapproclt02_rgmanager-dump.HoDb9M:    rg="service:SAP-BOP", View: 8, Size: 96, Address: 0x2aaab0005740
sapproclt03_rgmanager-dump.OpsGLz:    rg="service:SAP-BOP", View: 8, Size: 96, Address: 0x4a2eb60
sapproclt04_rgmanager-dump.1P93BU:    rg="service:SAP-BOP", View: 8, Size: 96, Address: 0x2aaaac000bb0
sapproclt04_rgmanager-dump.1P93BU:  name = SAP-BOP [ primary unique required ]
sapproclt04_rgmanager-dump.1P93BU:  name = "SAP-BOP";

sapproclt04 sees the resource and the other nodes do not. The reason is that they are on a different configuration version:

$ head -n 1 sapproclt0*
==> sapproclt01_rgmanager-dump.mN27ID <==
Cluster configuration version 184

==> sapproclt02_rgmanager-dump.HoDb9M <==
Cluster configuration version 184

==> sapproclt03_rgmanager-dump.OpsGLz <==
Cluster configuration version 184

==> sapproclt04_rgmanager-dump.1P93BU <==
Cluster configuration version 187

All the cluster nodes appear to have received the updated configuration file:
$ grep -h "Update of cluster.conf complete" */var/log/messages | sort
Oct 22 22:00:06 sapproclt01 ccsd[15551]: Update of cluster.conf complete (version 185 -> 186). 
Oct 22 22:00:06 sapproclt02 ccsd[17157]: Update of cluster.conf complete (version 185 -> 186). 
Oct 22 22:00:06 sapproclt03 ccsd[15529]: Update of cluster.conf complete (version 185 -> 186). 
Oct 22 22:00:06 sapproclt04 ccsd[15664]: Update of cluster.conf complete (version 185 -> 186). 
Oct 22 22:31:06 sapproclt01 ccsd[15551]: Update of cluster.conf complete (version 186 -> 187). 
Oct 22 22:31:06 sapproclt02 ccsd[17157]: Update of cluster.conf complete (version 186 -> 187). 
Oct 22 22:31:06 sapproclt03 ccsd[15529]: Update of cluster.conf complete (version 186 -> 187). 
Oct 22 22:31:06 sapproclt04 ccsd[15664]: Update of cluster.conf complete (version 186 -> 187). 

-----

There are only two ways this can occur:
1) The file was not propagated to all the nodes in the cluster, which does not appear to be the case.
2) There is a bug that prevented the update from propagating.

Comment 1 Fabio Massimo Di Nitto 2012-12-20 08:42:41 UTC
Can you please check that the on-disk version of cluster.conf on all nodes is at 187, and please collect all of /var/log/messages from all the nodes.

The ccsd update appears to have succeeded, but cman is not using the configuration. We will need the full logs to try to understand why.

Also a copy of cluster.conf at 186 and 187 might be useful.
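
A hedged sketch of gathering that data from one host (node names taken from the dumps above; assumes root ssh/scp access to every node):

for node in sapproclt01 sapproclt02 sapproclt03 sapproclt04; do
    ssh "$node" 'grep config_version /etc/cluster/cluster.conf'   # on-disk config version
    scp "$node:/var/log/messages" "messages.$node"                # full syslog from each node
done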

Comment 2 Christine Caulfield 2012-12-20 08:59:24 UTC
Also, it's worth checking that cman has seen the update; it might just be rgmanager that is using the older version. Comparing the output of "cman_tool status" with the rgmanager dump will clear this up.
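
For reference, a minimal sketch of that comparison on one node, reusing the commands shown later in this bug (the dump file name is whatever rgmanager produced; here, the collected sapproclt0*_rgmanager-dump.* files):

$ cman_tool status | egrep "Node ID|Config Version"
$ grep "Cluster configuration version" sapproclt0*_rgmanager-dump.*

If cman_tool reports the new version while the dump still shows the old one, cman has picked up the update and only rgmanager is behind.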

Comment 4 Fabio Massimo Di Nitto 2012-12-20 09:17:03 UTC
I don't have access to the ticket. Can you please give me the information I asked for in comment #1, and also what Chrissie asked for in comment #2?

Comment 6 Fabio Massimo Di Nitto 2012-12-20 12:22:14 UTC
This looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=822104
but fencing is not in progress.

rgmanager is simply stuck at config version 184, versus 187 on disk (and in cman/ccsd).

Interestingly enough, the rgmanager daemon has not produced a single line of log output (despite being configured with log_level="7") since:

Aug 23 22:26:45 sapproclt01 clurgmgrd: [18949]: <notice> Getting status 

Aug 19 03:56:14 sapproclt02 clurgmgrd: [20392]: <notice> Getting status 

Aug 23 22:21:57 sapproclt03 clurgmgrd: [18512]: <notice> Getting status 


Aug 23 23:01:19 sapproclt04 clurgmgrd[18930]: <info> Starting changed resources. 
[huge no log gap]
Oct 22 22:00:16 sapproclt04 clurgmgrd[18930]: <notice> Reconfiguring 
...
Oct 22 22:31:21 sapproclt04 clurgmgrd: [18930]: <notice> Getting status 
[no more logs]

It appears that logging stopped working at the same time as:

sapproclt01 messages.9:Aug 23 23:01:03 sapproclt01 ccsd[15551]: Update of cluster.conf complete (version 184 -> 185). 

sapproclt02 ccsd has not logged the 184 -> 185 update, but based on the rgmanager dump the config is live.

sapproclt03 messages.9:Aug 23 23:01:03 sapproclt03 ccsd[15529]: Update of cluster.conf complete (version 184 -> 185). 

sapproclt04 messages.9:Aug 23 23:01:03 sapproclt04 ccsd[15664]: Update of cluster.conf complete (version 184 -> 185). 

Assuming that the cluster.conf files stored in the sosreports are the same as those pushed to production, the differences between 184 and 185 are confined to a few <fs/> services.

diff -u sapproclt03-92388/etc/cluster/cluster.conf.230812 sapproclt04-380775/etc/cluster/cluster.conf.221012

Comment 7 Julio Entrena Perez 2012-12-20 12:34:04 UTC
(In reply to comment #4)
> I don't have access to the ticket. Can you please give me the information I
> asked for in comment #1, and also what Chrissie asked for in comment #2?

The on-disk version of cluster.conf is updated on all nodes:

$ cat sapproclt0*-*/etc/cluster/cluster.conf|grep config_version
<cluster alias="cl_PepeJeans" config_version="187" name="cl_PepeJeans">
<cluster alias="cl_PepeJeans" config_version="187" name="cl_PepeJeans">
<cluster alias="cl_PepeJeans" config_version="187" name="cl_PepeJeans">
<cluster alias="cl_PepeJeans" config_version="187" name="cl_PepeJeans">


cman has the latest version on all nodes:

$ cat sapproclt0*-*/sos_commands/cluster/cman_tool_status | egrep "Node ID|Config Version"
Config Version: 187
Node ID: 1
Config Version: 187
Node ID: 2
Config Version: 187
Node ID: 3
Config Version: 187
Node ID: 4


rgmanager doesn't:

$ cat sapproclt0*_rgmanager-dump*|grep version
Cluster configuration version 184
Cluster configuration version 184
Cluster configuration version 184
Cluster configuration version 187

Comment 25 Ryan McCabe 2013-06-13 20:13:31 UTC
*** Bug 822104 has been marked as a duplicate of this bug. ***

Comment 29 errata-xmlrpc 2013-09-30 22:37:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1316.html

