Description of problem:
Getting the following error...

node1 clusvcmgrd[XXXX]: <emerg> readServiceBlock: Service number mismatch 6, 8.

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Intermittent

Steps to Reproduce:
1. Cause unknown; can't reproduce at will.
2. Note: logging is set to DEBUG level.

Actual results:
Things seem to keep working fine. I'd just like an explanation of what can cause this.
Background: The service manager reads from fixed offsets on the shared disk. The offsets are disk-block-sized, so operations on them are atomic. Basically, there's an offset for the first service block (block 0); the offsets of the remaining service blocks are calculated from their IDs and this initial offset.

The service blocks store information about a service. Some of the information is static, some is dynamic, and some is semi-static. The structure looks like this:

SharedServiceBlock {
  sb_magic = 0x19fad022
  ServiceBlock sb_svcblk {
    sb_id = 0
    sb_owner = 0
    sb_last_owner = 0
    sb_state = 1
    sb_restarts = 0
    sb_transition = 0x00000000407e1655 (00:57:57 Apr 15 2004)
  }
}

Why it happens: Simple. If the sb_id field does not match the block offset index, we complain loudly:

	if (svcblk->sb_svcblk.sb_id != svcNum) {
		clulog(LOG_EMERG,
		       "readServiceBlock: Service number mismatch %d, %d.\n",
		       svcNum, svcblk->sb_svcblk.sb_id);
		return (-1);
	}
	return (0);

Why so loud? This should simply never happen: the offsets are statically calculated (they're *always* the same), and the devices are read from and written to synchronously, in block-sized chunks. I saw this behavior once; interestingly, it was also off by two service blocks (1024 bytes).

Why is it non-fatal? The service manager simply aborts the operation that time around. Interestingly, it never happens more than once in a row, which makes me wonder whether the cause isn't below the cluster software (e.g. a hardware- or kernel-level bug).

Out of curiosity, what shared array are you using?
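To make the offset math concrete, here's a minimal sketch of the calculation and validation (the 512-byte block size, the field types, and the helper names are illustrative assumptions reconstructed from the dump above, not the actual clumanager source):

	#include <sys/types.h>

	#define SHARED_BLOCK_SIZE 512	/* assumption: "1024 bytes" == two blocks */
	#define SB_MAGIC 0x19fad022

	/* Minimal reconstruction of the on-disk layout from the dump above;
	 * the field types are guesses, not the real clumanager headers. */
	typedef struct {
		int sb_id;
		int sb_owner;
		int sb_last_owner;
		int sb_state;
		int sb_restarts;
		unsigned long long sb_transition;
	} ServiceBlock;

	typedef struct {
		unsigned int sb_magic;
		ServiceBlock sb_svcblk;
	} SharedServiceBlock;

	/* Offset of service block N: the fixed offset of block 0 plus
	 * N block-sized strides. Nothing here varies at runtime, which
	 * is why a mismatch should be impossible. */
	static off_t
	serviceBlockOffset(off_t block0Offset, int svcNum)
	{
		return block0Offset + (off_t)svcNum * SHARED_BLOCK_SIZE;
	}

	/* After the read, the stored ID must match the requested one. */
	static int
	validateServiceBlock(SharedServiceBlock *svcblk, int svcNum)
	{
		if (svcblk->sb_magic != SB_MAGIC)
			return -1;
		if (svcblk->sb_svcblk.sb_id != svcNum)
			return -1;	/* the "Service number mismatch" case */
		return 0;
	}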
Oh, and if you would like to view the shared data, try these:

shutil -p /service/X/state
shutil -p /lock/X/state

(where 'X' is a service ID)

shutil -p /cluster/header
shutil -d /cluster/config.xml
I meant:

shutil -p /service/X/status
shutil -p /lock/X/status
Wow, that was quick! I'm using an MSA1000 connected to two HP DL380s via QLogic 234x-series HBAs. The HBA driver is the one from QLogic, 6.06.10 if I remember correctly. My quorum disks are on /dev/sda1 (primary) and /dev/sda2 (shadow). So far it hasn't caused a problem, but it's a bit disconcerting.
If the cluster isn't in production, you can always stop the cluster software and run 'clurawtest' for a day or so on both members (one with the -r option). There should be no errors.
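For reference, here's a minimal sketch of the kind of raw, block-sized write/read-back check such a test performs (this is not clurawtest itself; the 512-byte block size, fixed test offset, pass count, and fill pattern are all assumptions, and it overwrites the device, so only run something like this against scratch storage):

	/* rawcheck.c - sketch of a block-sized read-back consistency test.
	 * Writes a pattern to one block, reads it back, and compares. */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define BLKSZ 512

	int main(int argc, char **argv)
	{
		void *wbuf, *rbuf;
		int fd, pass;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <scratch-device>\n", argv[0]);
			return 1;
		}

		/* O_DIRECT requires block-aligned buffers */
		if (posix_memalign(&wbuf, BLKSZ, BLKSZ) ||
		    posix_memalign(&rbuf, BLKSZ, BLKSZ))
			return 1;

		/* unbuffered, synchronous I/O, like the cluster's shared reads */
		fd = open(argv[1], O_RDWR | O_DIRECT | O_SYNC);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (pass = 0; pass < 1000; pass++) {
			memset(wbuf, pass & 0xff, BLKSZ);

			/* write one block at a fixed offset, then read it back */
			if (pwrite(fd, wbuf, BLKSZ, 0) != BLKSZ ||
			    pread(fd, rbuf, BLKSZ, 0) != BLKSZ) {
				perror("I/O");
				return 1;
			}
			if (memcmp(wbuf, rbuf, BLKSZ)) {
				fprintf(stderr, "pass %d: read-back mismatch\n", pass);
				return 1;
			}
		}
		printf("no mismatches in %d passes\n", pass);
		close(fd);
		return 0;
	}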
Out of curiosity, could you see if the following problem manifests on your configuration? https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128179
Not sure if this is the appropriate place to comment on this (since I'm actually running a RHEL3 recompile here), but I just want to mention that we are seeing the exact same issue here:

Oct 4 18:06:25 cludata001 clusvcmgrd[11097]: <emerg> readServiceBlock: Service number mismatch 1, 3.

The storage array is an IBM FAStT600 with two QLA2340s in each of the two nodes. Other than the strange error, the cluster seems to work fine...
It is indeed the proper place. Unfortunately, this has been impossible to reproduce. I suppose the absolute worst thing that can happen is that your enable/disable request gets thrown out the window, but the chances of that happening are slim. It may be possible to simply retry the read request once or twice to see if we can obtain the proper data before complaining.
We have also seen this several times (at least three times during the cluster's uptime). The last instance was:

Message from syslogd@lvr13 at Wed Nov 17 15:47:28 2004 ...
lvr13 clusvcmgrd[28167]: <emerg> readServiceBlock: Service number mismatch 6, 7.

We use a rather complicated setup: IBM SVC (SAN Volume Controller) on a Cisco MDS9500, with an IBM FAStT 600 as the storage backend. Apart from this, the cluster runs fine. It is also not possible for us to reproduce at will; note that this is a rather important university mailserver in full production!
This only seems to occur once in a while, and so far only during status checks (though nothing ties it specifically to the status-check path). It's not harmful: the service manager handles the fault and everything returns to normal. I'm trying to reproduce it by dedicating a cluster to it, but it's not been successful so far.
Ok, here's some data. I changed the shared_services.c code to do the following:

	/* 'retries' is initialized to 0 before the read; 'top' labels
	 * the point where the service block is re-read */
	if (svcblk->sb_svcblk.sb_id != svcNum && retries < 3) {
		clulog(LOG_WARNING, "BUG: Service number mismatch %d, %d.\n",
		       svcNum, svcblk->sb_svcblk.sb_id);
		++retries;
		goto top;	/* retry */
	}

	if (svcblk->sb_svcblk.sb_id != svcNum) {
		clulog(LOG_EMERG,
		       "readServiceBlock: Service number mismatch %d, %d.\n",
		       svcNum, svcblk->sb_svcblk.sb_id);
		return (-1);
	}

Basically, retry up to three times to re-read the service block we want. If all attempts fail, log at EMERG level. In the past week (with lots of services defined), I have gotten the following:

Nov 18 21:28:08 magenta clusvcmgrd[26497]: <warning> BUG: Service number mismatch 7, 10.
Nov 20 16:58:02 magenta clusvcmgrd[24362]: <warning> BUG: Service number mismatch 7, 10.
Nov 21 00:16:07 magenta clusvcmgrd[25085]: <warning> BUG: Service number mismatch 4, 8.

No messages were logged at EMERG level, and in fact no message ever appeared twice in a row. This means the cluster read the correct data on an immediate retry, which makes me think this problem lies below the cluster software itself somewhere.
1.2.23-0.4 has a bug that prevents IP addresses from starting properly in some cases. Try 1.2.24-0.1:

http://people.redhat.com/lhh/clumanager-1.2.24-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.24-0.1.src.rpm
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-047.html
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.