Bug 400941

Summary:	openais reporting error continuously
Product:	Red Hat Enterprise Linux 5
Component:	openais
Version:	5.0
Hardware:	All
OS:	Linux
Status:	CLOSED ERRATA
Severity:	high
Priority:	urgent
Target Milestone:	rc
Keywords:	ZStream
Doc Type:	Bug Fix
Reporter:	Mark Nielsen <mnielsen>
Assignee:	Steven Dake <sdake>
CC:	cluster-maint, cmarthal, gavinf, ghelleks, lhh, sghosh
Last Closed:	2009-01-20 20:46:37 UTC
Bug Blocks:	509885

Description Mark Nielsen 2007-11-27 13:32:25 UTC

Description of problem:
openais continuously reports the following error in /var/log/messages:
openais[14278]: [CKPT ] checkpoint_find returned 0 calling error_exit.
That line is followed by repeat notices such as:
last message repeated 152 times
last message repeated 66 times
and the pattern keeps repeating.
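
To gauge how often the error fires, the repeat notices have to be added to the explicit error lines. The following is a minimal sketch, not a supported tool: the function name count_ckpt_errors is made up for illustration, and it assumes the log uses the exact message text and the "last message repeated N times" format quoted above.

```python
import re

# Message text taken from the log excerpt in this report.
CKPT_ERROR = "checkpoint_find returned 0 calling error_exit"
# syslog collapses identical consecutive lines into this notice.
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def count_ckpt_errors(lines):
    """Count CKPT errors, expanding the repeat notices that follow them."""
    total = 0
    last_was_ckpt = False
    for line in lines:
        repeat = REPEAT_RE.search(line)
        if repeat:
            # A repeat notice refers to the most recent real message.
            if last_was_ckpt:
                total += int(repeat.group(1))
            continue
        last_was_ckpt = CKPT_ERROR in line
        if last_was_ckpt:
            total += 1
    return total


sample = [
    "openais[14278]: [CKPT ] checkpoint_find returned 0 calling error_exit.",
    "last message repeated 152 times",
    "last message repeated 66 times",
]
print(count_ckpt_errors(sample))  # 1 + 152 + 66 = 219
```

To run it against a real log, pass an open file object, for example count_ckpt_errors(open("/var/log/messages")).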

Version-Release number of selected component (if applicable):
openais-0.80.3-7.el5

How reproducible:
The error seems to clear after a reboot of the cluster, then reappears at
random. I don't know how to reproduce it, what causes it, or even how to begin
debugging it. The most recent event before the error started was a reboot of
1 node of my cluster.

Additional info:
Because this is on a "disconnected network", I need to go through some steps to
get logs, configurations, etc. burned to a CD and uploaded to bugzilla. So that
I don't have to repeat that process, please tell me up front all the logs,
configs, etc. that you need to see for this.

Comment 1 Mark Nielsen 2007-11-28 14:41:31 UTC

I found, in /etc/xen/ some "leftover" domU configuration files lying around. One
user had been testing and copied an existing domU config to test with, failing
to change the "name =" line or "uuid" line. Could this be what was causing this
error? I've removed those errant files and rebooted... just waiting to see if
the problem resurfaces.
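
A quick way to check for copied domU configs is to look for "name" or "uuid" values that occur in more than one file. This is only a sketch under assumptions: it assumes each config uses simple quoted assignments such as name = "vm1", the helper names load_configs and find_duplicates are made up here, and the sample values are invented.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches lines such as: name = "vm1"   or   uuid = 'aaaa-1111'
ASSIGN_RE = re.compile(r"""^\s*(name|uuid)\s*=\s*["']([^"']+)["']""")


def load_configs(directory="/etc/xen"):
    """Read every regular file in the directory into {file name: text}."""
    configs = {}
    for path in Path(directory).iterdir():
        if path.is_file():
            configs[path.name] = path.read_text(errors="replace")
    return configs


def find_duplicates(configs):
    """Return {(key, value): [file names]} for values used by 2+ files."""
    seen = defaultdict(list)
    for filename, text in configs.items():
        for line in text.splitlines():
            match = ASSIGN_RE.match(line)
            if match:
                seen[(match.group(1), match.group(2))].append(filename)
    return {key: sorted(files) for key, files in seen.items() if len(files) > 1}


# Invented sample data: "vm1-test" is a copy of "vm1" with nothing changed.
sample_configs = {
    "vm1": 'name = "vm1"\nuuid = "aaaa-1111"\n',
    "vm1-test": 'name = "vm1"\nuuid = "aaaa-1111"\nmemory = 512\n',
    "vm2": 'name = "vm2"\nuuid = "bbbb-2222"\n',
}
for (key, value), files in sorted(find_duplicates(sample_configs).items()):
    print(f"duplicate {key} {value!r} in: {', '.join(files)}")
```

On a cluster node the check would be find_duplicates(load_configs()); an empty result means no name or uuid is shared between files.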

Comment 2 Mark Nielsen 2007-12-10 16:07:05 UTC

I am seeing the error again, after a reboot of 1 of my nodes. I have verified
that the /etc/xen/ directory is "clean", no duplicates, etc. This time the error
showed up just after adding a new vm using Luci. We started getting the
following error:
clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe20a90
clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe24b70
clurgmgrd[21756]: <err> #37: Error receiving header from 1 sz=0 CTX 0xbe2d4f0

and that continued until I rebooted node 1. Once node 1 was rebooted, those
errors went away and I started getting the checkpoint_find error repeatedly.

Comment 3 Steven Dake 2008-02-07 16:25:35 UTC

This is a synchronization error that is not yet understood.  A clear
definition of how to reproduce the issue would help, since in 2 years of dev I
have never seen this in our labs.  Until we have a solid QE reproducer or another
method to reproduce it, I'm marking this needinfo.

Regards
-steve

Comment 4 Steven Dake 2008-02-07 17:31:42 UTC

I can give you some debug options to add to the cluster info that may
help get more information to aid in debugging.

Try adding 
<cluster config_version="3" name="brassow-xen">
         <logging debug="on" fileline="on" timestamp="on">
                 <logger ident="CKPT" debug="on" tags="enter|leave">
                 </logger>
         </logging>


Do not add the "cluster" tag itself; it is shown only to mark the position.
Put the logging and logger tags directly after your existing <cluster .....>
tag.

Then reload the config with ccs_tool "filename", where filename is the
hand-modified cluster.conf file containing the logger settings above.
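
For anyone who prefers not to hand-edit the file, the change can be sketched with Python's standard XML library. This is an illustration only: the function name add_ckpt_logging is made up, the sample input is a stripped-down stand-in for a real cluster.conf, and the config_version bump is an assumption (updated cluster configurations are normally expected to carry a higher version number). Note that ElementTree drops XML comments when parsing, so keep a backup of the original file.

```python
import xml.etree.ElementTree as ET


def add_ckpt_logging(conf_xml):
    """Return cluster.conf text with the CKPT debug logging block added."""
    root = ET.fromstring(conf_xml)
    if root.tag != "cluster":
        raise ValueError("expected <cluster> as the root element")

    # Assumption: an updated config needs a higher config_version.
    root.set("config_version", str(int(root.get("config_version", "0")) + 1))

    # Remove any earlier <logging> block so the function can be re-run.
    for old in root.findall("logging"):
        root.remove(old)

    # Same attributes as the block quoted in this comment.
    logging_el = ET.Element(
        "logging", {"debug": "on", "fileline": "on", "timestamp": "on"}
    )
    ET.SubElement(
        logging_el, "logger", {"ident": "CKPT", "debug": "on", "tags": "enter|leave"}
    )
    # Place it first, directly after the opening <cluster ...> tag.
    root.insert(0, logging_el)
    return ET.tostring(root, encoding="unicode")


# Minimal stand-in for a real cluster.conf.
sample_conf = '<cluster config_version="3" name="brassow-xen"><clusternodes/></cluster>'
updated = add_ckpt_logging(sample_conf)
print(updated)
```

The result would be written to a new file and then loaded as described above.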

Comment 5 Lon Hohberger 2008-02-07 19:47:29 UTC

'Error receiving header' from clurgmgrd might be a fixed problem in the current
release, and may or may not be related to the openais errors.  Rgmanager doesn't
use checkpointing (though I wish it did :) ), but it does use cman (openais)
messaging to communicate.

Comment 7 Steven Dake 2008-05-18 21:05:54 UTC

*** Bug 436507 has been marked as a duplicate of this bug. ***

Comment 8 Corey Marthaler 2008-05-19 15:34:29 UTC

*** Bug 430296 has been marked as a duplicate of this bug. ***

Comment 9 Steven Dake 2008-06-11 21:52:11 UTC

fixed in openais-0.80-3.17

Comment 13 errata-xmlrpc 2009-01-20 20:46:37 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0074.html