Description of problem:
openais is continuously reporting the following error to /var/log/messages:

openais[14278]: [CKPT ] checkpoint_find returned 0 calling error_exit.

That is then followed by messages such as:

last message repeated 152 times
last message repeated 66 times

and this goes on.

Version-Release number of selected component (if applicable):
openais-0.80.3-7.el5

How reproducible:
This seems to clear after a reboot of the cluster, then randomly shows up again. I don't know how to reproduce it, what causes it, or how to even begin to debug it. The most recent event prior to this error starting was a reboot of 1 node of my cluster.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Because this is on a "disconnected network", I need to go through some steps to get logs, configurations, etc. burned to a CD and uploaded to bugzilla. So that I don't have to repeat that process multiple times, please let me know all the logs, configs, etc. that you need to see for this.
I found, in /etc/xen/ some "leftover" domU configuration files lying around. One user had been testing and copied an existing domU config to test with, failing to change the "name =" line or "uuid" line. Could this be what was causing this error? I've removed those errant files and rebooted... just waiting to see if the problem resurfaces.
I am seeing the error again, after a reboot of 1 of my nodes. I have verified that the /etc/xen/ directory is "clean", no duplicates, etc. This time the error showed up just after adding a new vm using Luci. We started getting the following error:

clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe20a90
clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe24b70
clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe2d4f0

and that continued until I rebooted node 1. Once node 1 was rebooted, those errors went away and I started getting the checkpoint_find error repeatedly.
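To get a rough idea of how often each of the two messages occurs, something like the following could be run over a copy of /var/log/messages. This is only a sketch: the function name `tally`, the `PATTERNS` keys, and the sample lines are made up for illustration, and it assumes that a syslog "last message repeated N times" line always refers to the line directly before it.

```python
import re
from collections import Counter

# syslog collapses identical consecutive messages into this summary line
REPEAT_RE = re.compile(r"last message repeated (\d+) times")

# Substrings taken from the messages quoted in this report;
# the dictionary keys are hypothetical labels.
PATTERNS = {
    "checkpoint_find": "checkpoint_find returned 0 calling error_exit",
    "rgmanager_header": "Error receiving header from",
}

def tally(lines):
    """Count matching messages, expanding 'last message repeated' lines."""
    counts = Counter()
    last_key = None  # label of the most recent non-summary line, if it matched
    for line in lines:
        m = REPEAT_RE.search(line)
        if m:
            # Credit the repeats to the preceding message, if it was tracked
            if last_key is not None:
                counts[last_key] += int(m.group(1))
            continue
        last_key = None
        for key, needle in PATTERNS.items():
            if needle in line:
                counts[key] += 1
                last_key = key
                break
    return counts

sample = [
    "openais[14278]: [CKPT ] checkpoint_find returned 0 calling error_exit.",
    "last message repeated 152 times",
    "last message repeated 66 times",
    "clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe20a90",
]
print(tally(sample))
```

For a real log, `tally(open("/var/log/messages"))` would be the equivalent call; the counts only show how noisy each message is, not what causes it.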
This is a synchronization error that is not yet understood. A clear description of how to reproduce the issue would help, since in 2 years of development I have never seen this in our labs. Until we have a solid QE reproducer or method to reproduce, I'm marking this needinfo. Regards -steve
I can give you some debug options to add to the cluster configuration that may provide more information to aid in debugging. Try adding:

<cluster config_version="3" name="brassow-xen">
  <logging debug="on" fileline="on" timestamp="on">
    <logger ident="CKPT" debug="on" tags="enter|leave">
    </logger>
  </logging>

Do not add the "cluster" tag itself; instead, put the logging and logger tags after your existing <cluster .....> tag. Then reload the config with ccs_tool "filename", where filename is the name of the hand-modified cluster.conf file containing the logger settings above.
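If hand-editing the file is error-prone, the logging block above could also be inserted with a short script. The following is a minimal sketch, not a supported tool: the function name `add_ckpt_logging` and the sample document are hypothetical, it assumes `<cluster>` is the root element of cluster.conf, and it leaves `config_version` and everything else untouched.

```python
import xml.etree.ElementTree as ET

def add_ckpt_logging(conf_xml):
    """Return conf_xml with the CKPT debug logging block as the first
    child of <cluster>. Does nothing if a <logging> element already exists."""
    root = ET.fromstring(conf_xml)
    if root.tag != "cluster":
        raise ValueError("expected <cluster> as the root element")
    if root.find("logging") is None:
        # Attribute values are copied from the comment above
        logging_el = ET.Element(
            "logging", {"debug": "on", "fileline": "on", "timestamp": "on"}
        )
        ET.SubElement(
            logging_el, "logger",
            {"ident": "CKPT", "debug": "on", "tags": "enter|leave"},
        )
        # Place the block directly after the opening <cluster ...> tag
        root.insert(0, logging_el)
    return ET.tostring(root, encoding="unicode")

# Hypothetical, heavily trimmed stand-in for a real cluster.conf
sample = '<cluster config_version="3" name="example"><clusternodes/></cluster>'
print(add_ckpt_logging(sample))
```

The result would still need to be written to a file and reloaded as described above; working on a copy of cluster.conf rather than the live file seems the safer choice.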
'Error receiving header' from clurgmgrd might be a fixed problem in the current release, and may or may not be related to the openais errors. Rgmanager doesn't use checkpointing (though I wish it did :) ), but it does use cman (openais) messaging to communicate.
*** Bug 436507 has been marked as a duplicate of this bug. ***
*** Bug 430296 has been marked as a duplicate of this bug. ***
fixed in openais-0.80-3.17
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0074.html