Description of problem:
Getting the following error...

node1 clusvcmgrd[XXXX]: <emerg> readServiceBlock: Service number mismatch 6, 8.

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Intermittent

Steps to Reproduce:
1. Cause unknown; can't reproduce at will.
2. Note: logging is set to DEBUG level.

Actual results:
Things seem to keep working fine. I'd just like an explanation of what can cause this.
Background: The service manager reads from fixed offsets on the shared disk. The offsets are disk-block-sized, so operations on them are atomic. Basically, there's an offset for the first service block (block 0); the offsets of the remaining service blocks are calculated from their IDs and this initial offset.

The service blocks store information about a service. Some of the information is static, some is dynamic, and some is semi-static. The structure looks like this:

SharedServiceBlock {
  sb_magic = 0x19fad022
  ServiceBlock sb_svcblk {
    sb_id = 0
    sb_owner = 0
    sb_last_owner = 0
    sb_state = 1
    sb_restarts = 0
    sb_transition = 0x00000000407e1655 (00:57:57 Apr 15 2004)
  }
}

Why it happens: Simple. If the sb_id field does not match the block offset index, we complain loudly:

	if (svcblk->sb_svcblk.sb_id != svcNum) {
		clulog(LOG_EMERG,
		       "readServiceBlock: Service number mismatch %d, %d.\n",
		       svcNum, svcblk->sb_svcblk.sb_id);
		return (-1);
	}
	return (0);

Why so loud? This should simply never happen: the offsets are statically calculated (they're *always* the same), and the devices are read from and written to synchronously, in block-sized chunks. I saw this behavior once; interestingly, it was also off by two service blocks (1024 bytes).

Why is it non-fatal? The service manager simply aborts the operation that time around. Interestingly, it never happens more than once in a row, which makes me wonder whether the cause isn't below the cluster software (e.g. a hardware- or kernel-level bug).

Out of curiosity, what shared array are you using?
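To make the offset math concrete, here's a minimal sketch of the calculation and validation (the 512-byte block size, the field types, and the helper names are illustrative assumptions reconstructed from the dump above, not the actual clumanager source):

	#include <sys/types.h>

	#define SHARED_BLOCK_SIZE 512	/* assumption: "1024 bytes" == two blocks */
	#define SB_MAGIC 0x19fad022

	/* Minimal reconstruction of the on-disk layout from the dump above;
	 * the field types are guesses, not the real clumanager headers. */
	typedef struct {
		int sb_id;
		int sb_owner;
		int sb_last_owner;
		int sb_state;
		int sb_restarts;
		unsigned long long sb_transition;
	} ServiceBlock;

	typedef struct {
		unsigned int sb_magic;
		ServiceBlock sb_svcblk;
	} SharedServiceBlock;

	/* Offset of service block N: the fixed offset of block 0 plus
	 * N block-sized strides. Nothing here varies at runtime, which
	 * is why a mismatch should be impossible. */
	static off_t
	serviceBlockOffset(off_t block0Offset, int svcNum)
	{
		return block0Offset + (off_t)svcNum * SHARED_BLOCK_SIZE;
	}

	/* After the read, the stored ID must match the requested one. */
	static int
	validateServiceBlock(SharedServiceBlock *svcblk, int svcNum)
	{
		if (svcblk->sb_magic != SB_MAGIC)
			return -1;
		if (svcblk->sb_svcblk.sb_id != svcNum)
			return -1;	/* the "Service number mismatch" case */
		return 0;
	}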
Oh, and if you would like to view the shared data, try these:

shutil -p /service/X/state
shutil -p /lock/X/state

(where 'X' is a service ID)

shutil -p /cluster/header
shutil -d /cluster/config.xml
I meant:

shutil -p /service/X/status
shutil -p /lock/X/status
Wow, that was quick! I'm using an MSA1000 connected to two HP DL380s via QLogic 234x-series HBAs. The HBA driver is the one from QLogic, 6.06.10 if I remember correctly. My quorum disks are on /dev/sda1 (primary) and /dev/sda2 (shadow). So far it hasn't caused a problem, but it's a bit disconcerting.
If the cluster isn't in production, you can always stop the cluster software and run 'clurawtest' for a day or so on both members (one with the -r option). There should be no errors.
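For reference, here's a minimal sketch of the kind of raw, block-sized write/read-back check such a test performs (this is not clurawtest itself; the 512-byte block size, fixed test offset, pass count, and fill pattern are all assumptions, and it overwrites the device, so only run something like this against scratch storage):

	/* rawcheck.c - sketch of a block-sized read-back consistency test.
	 * Writes a pattern to one block, reads it back, and compares. */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define BLKSZ 512

	int main(int argc, char **argv)
	{
		void *wbuf, *rbuf;
		int fd, pass;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <scratch-device>\n", argv[0]);
			return 1;
		}

		/* O_DIRECT requires block-aligned buffers */
		if (posix_memalign(&wbuf, BLKSZ, BLKSZ) ||
		    posix_memalign(&rbuf, BLKSZ, BLKSZ))
			return 1;

		/* unbuffered, synchronous I/O, like the cluster's shared reads */
		fd = open(argv[1], O_RDWR | O_DIRECT | O_SYNC);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (pass = 0; pass < 1000; pass++) {
			memset(wbuf, pass & 0xff, BLKSZ);

			/* write one block at a fixed offset, then read it back */
			if (pwrite(fd, wbuf, BLKSZ, 0) != BLKSZ ||
			    pread(fd, rbuf, BLKSZ, 0) != BLKSZ) {
				perror("I/O");
				return 1;
			}
			if (memcmp(wbuf, rbuf, BLKSZ)) {
				fprintf(stderr, "pass %d: read-back mismatch\n", pass);
				return 1;
			}
		}
		printf("no mismatches in %d passes\n", pass);
		close(fd);
		return 0;
	}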
Out of curiosity, could you see if the following problem manifests on your configuration? https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128179
Not sure if this is the appropriate place to comment on this (since I'm actually running a RHEL3 recompile here), but I just want to mention that we are seeing the exact same issue here:

Oct 4 18:06:25 cludata001 clusvcmgrd[11097]: <emerg> readServiceBlock: Service number mismatch 1, 3.

The storage array is an IBM FAStT600 with two QLA2340s in each of the two nodes. Other than the strange error, the cluster seems to work fine...
It is indeed the proper place. Unfortunately, this has been impossible to reproduce. I suppose the absolute worst thing that can happen is that your enable/disable request gets thrown out the window, but the chances of that happening are slim. It may be possible to simply retry the read request once or twice to see if we can obtain the proper data before complaining.
We have also seen this several times (at least three times during the cluster's uptime). The last instance was:

Message from syslogd@lvr13 at Wed Nov 17 15:47:28 2004 ...
lvr13 clusvcmgrd[28167]: <emerg> readServiceBlock: Service number mismatch 6, 7.

We use a rather complicated setup: IBM SVC (SAN Volume Controller) on a Cisco MDS9500, with an IBM FAStT 600 as the storage backend. Apart from this, the cluster runs fine. It is also not possible for us to reproduce at will; note that this is a rather important university mailserver in full production!
This only seems to occur once in a while, and so far only during status checks (though nothing ties it specifically to the status-check path). It's not harmful: the service manager handles the fault and everything returns to normal. I'm trying to reproduce it by dedicating a cluster to it, but it's not been successful so far.
Ok, here's some data. I changed the shared_services.c code to do the following:

	/* 'retries' is initialized to 0 before the read; 'top' labels
	 * the point where the service block is re-read */
	if (svcblk->sb_svcblk.sb_id != svcNum && retries < 3) {
		clulog(LOG_WARNING, "BUG: Service number mismatch %d, %d.\n",
		       svcNum, svcblk->sb_svcblk.sb_id);
		++retries;
		goto top;	/* retry */
	}

	if (svcblk->sb_svcblk.sb_id != svcNum) {
		clulog(LOG_EMERG,
		       "readServiceBlock: Service number mismatch %d, %d.\n",
		       svcNum, svcblk->sb_svcblk.sb_id);
		return (-1);
	}

Basically, retry up to three times to re-read the service block we want. If all attempts fail, log at EMERG level. In the past week (with lots of services defined), I have gotten the following:

Nov 18 21:28:08 magenta clusvcmgrd[26497]: <warning> BUG: Service number mismatch 7, 10.
Nov 20 16:58:02 magenta clusvcmgrd[24362]: <warning> BUG: Service number mismatch 7, 10.
Nov 21 00:16:07 magenta clusvcmgrd[25085]: <warning> BUG: Service number mismatch 4, 8.

No messages were logged at EMERG level, and in fact no message ever appeared twice in a row. This means the cluster read the correct data on an immediate retry, which makes me think this problem lies below the cluster software itself somewhere.
1.2.23-0.4 has a bug that prevents IP addresses from starting properly in some cases. Try 1.2.24-0.1:

http://people.redhat.com/lhh/clumanager-1.2.24-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.24-0.1.src.rpm
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-047.html
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.