Red Hat Bugzilla – Bug 400941
openais reporting error continuously
Last modified: 2016-04-26 09:47:06 EDT
Description of problem:
openais is continuously reporting the following error to /var/log/messages:
openais: [CKPT ] checkpoint_find returned 0 calling error_exit.
That message is then followed by lines such as:
last message repeated 152 times
last message repeated 66 times
This continues indefinitely.
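To get a sense of how often the error is really firing, the "last message repeated N times" lines need to be folded back into the count. The following is only a rough sketch, assuming the log format matches the lines quoted above; the function name count_checkpoint_errors and the default log path are illustrative choices, not part of openais or syslog.

```python
import re
import sys

# Substring of the openais error quoted in this report.
TARGET = "checkpoint_find returned 0"
# syslog's compression of identical consecutive messages.
REPEAT = re.compile(r"last message repeated (\d+) times")


def count_checkpoint_errors(lines):
    """Count occurrences of the error, including syslog 'repeated' lines
    that directly follow it."""
    total = 0
    counting = False  # True while the most recent real message was the error
    for line in lines:
        if TARGET in line:
            total += 1
            counting = True
            continue
        match = REPEAT.search(line)
        if match and counting:
            total += int(match.group(1))
        elif not match:
            # Some other message arrived; later "repeated" lines refer to it.
            counting = False
    return total


if __name__ == "__main__":
    # Assumed default location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as handle:
        print(count_checkpoint_errors(handle))
```

For the three lines quoted above (one error, then 152 and 66 repeats), this would report 219 occurrences.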
Version-Release number of selected component (if applicable):
This seems to clear after a reboot of the cluster, then randomly shows up. I
don't know how to go about reproducing it, what causes it, or how to even begin
to debug this. The most recent event that occurred prior to this error starting
was when I rebooted 1 node of my cluster.
Steps to Reproduce:
Because this is on a "disconnected network", I need to go through some steps to
get logs, configurations, etc. burned to a CD and uploaded to bugzilla. So that I
don't have to repeat that process, please tell me up front which logs, configs,
etc. you need to see for this.
I found, in /etc/xen/ some "leftover" domU configuration files lying around. One
user had been testing and copied an existing domU config to test with, failing
to change the "name =" line or "uuid" line. Could this be what was causing this
error? I've removed those errant files and rebooted... just waiting to see if
the problem resurfaces.
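A check for this kind of leftover file can be scripted. The sketch below is an illustration only: it assumes the domU config files are plain text with quoted assignments such as name = "vm1" and uuid = "...", and the function name find_duplicate_ids is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches lines like:  name = "vm1"   or   uuid = 'abcd-1234'
ASSIGNMENT = re.compile(r"""^\s*(name|uuid)\s*=\s*["']([^"']+)["']""")


def find_duplicate_ids(config_dir):
    """Return {(key, value): [file names]} for every name/uuid value that
    appears in more than one file directly inside config_dir."""
    seen = defaultdict(list)
    for path in sorted(Path(config_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text(errors="replace").splitlines():
            match = ASSIGNMENT.match(line)
            if match:
                files = seen[(match.group(1), match.group(2))]
                if path.name not in files:
                    files.append(path.name)
    return {key: files for key, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    # /etc/xen is the directory mentioned in this report.
    for (key, value), files in find_duplicate_ids("/etc/xen").items():
        print(f"duplicate {key} {value!r} in: {', '.join(files)}")
```

An empty result only means no two files share a quoted name or uuid value; it says nothing about whether such duplicates are the cause of the openais error.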
I am seeing the error again, after a reboot of 1 of my nodes. I have verified
that the /etc/xen/ directory is "clean", no duplicates, etc. This time the error
showed up just after adding a new VM using Luci. We started getting the
following errors:
clurgmgrd: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe20a90
clurgmgrd: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe24b70
clurgmgrd: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe2d4f0
and that continued until I rebooted node 1. Once node 1 was rebooted, those
errors went away and I started getting the checkpoint_find error repeatedly.
This is an error in synchronization that is not yet understood. A clear
description of how to reproduce the issue would help, since in 2 years of
development I have never seen this in our labs. Until we have a solid QE
reproducer or another method to reproduce it, I'm marking this needinfo.
I can give you some debug options to add to the cluster configuration that may
help gather more information to aid in debugging:
<cluster config_version="3" name="brassow-xen">
<logging debug="on" fileline="on" timestamp="on">
<logger ident="CKPT" debug="on" tags="enter|leave"/>
</logging>
Do not copy the "cluster" tag itself; instead, put the logging and logger
tags after your existing <cluster .....> tag.
Then reload the config with ccs_tool "filename", where filename is the
name of the hand-modified cluster.conf file containing the above logger tags.
'Error receiving header' from clurgmgrd might be a fixed problem in the current
release, and may or may not be related to the openais errors. Rgmanager doesn't
use checkpointing (though I wish it did :) ), but it does use cman (openais)
messaging to communicate.
*** Bug 436507 has been marked as a duplicate of this bug. ***
*** Bug 430296 has been marked as a duplicate of this bug. ***
Fixed in openais-0.80-3.17.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.