Red Hat Bugzilla – Bug 499918
Last modified: 2009-06-01 04:47:37 EDT
Description of problem:
I think we may have lost something in transit between irc/email/svn:
Mar 26 16:10:20 <dct> confchg, node1 create ckpt, node2 open ckpt, node2
read ckpt -> fail
Mar 26 16:10:46 <dct> nodeid 1 creates the ckpt
Mar 26 16:13:42 <dct> saCkptCheckpointOpen() works,
then saCkptSectionIterationNext() fails
Mar 26 16:30:34 <sdake> wow iteration fails straight up single node
Mar 26 16:30:39 <sdake> that was working like 1 week ago or less
Mar 26 16:52:30 <sdake> dct found problem
Mar 26 16:52:32 <sdake> patch coming to list now
This looks like the patch, but I don't see it in svn
And I'm still getting error 9 (BAD_HANDLE) from saCkptSectionIterationNext().
Version-Release number of selected component (if applicable):
node1: mount gfs
node1: take plock on gfs
node2: mount gfs
When node2 mounts, node1 creates a checkpoint containing the info about the plock; node2 then opens that checkpoint and tries to read it. node2 will then show this error in /var/log/messages:
bull-02 dlm_controld: retrieve_plocks: ckpt iternext error 9 x
Steps to Reproduce:
Created attachment 345615 [details]
Patch for Makefile.am, so ipc_hdb is not included multiple times
Included is a patch for corosync's Makefile.am, so coroipcc.o is no
longer included in lib... directly, but rather the *.so is a dependency, so
ipc_hdb is no longer in multiple *.so files and multiple times in the binary what
Better (I hope) description of problem:
Functions from the ckpt library (like saCkptCheckpointOpen, saCkptSectionIterationInitialize, ...) internally use the corosync functions reply_receive, reply_receive_in_buf, ... These functions are defined in the coroipcc.c source file and use the global static variable ipc_hdb.
Without the patch, coroipcc is built as a shared library (libcoroipcc.so) AND is also linked into every corosync library (like cpg, ...), so the global variable ipc_hdb is included not only in libcoroipcc.so, but also in libcpg.so, ...
dlm_controld has the function retrieve_plocks, and the whole binary is linked with libcoroipcc and libcpg. So ipc_hdb is included TWICE (and so has TWO addresses).
The main problem causing the bug was that reply_receive uses the address from one library and reply_receive_in_buf uses the address from the other. This confuses the check in the hdb_get function. And this is what I don't understand 100%: why did the linker allow two instances of ipc_hdb, or rather, why did it choose different addresses in different functions (which are defined in the same module and called from the same module)? It looks as if the linker chooses the addresses just randomly.
After removing the linking of coroipcc.o into cpg and using the dynamic version instead (which means there is only one instance of ipc_hdb), the problem disappeared for me.
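To make the mechanism easier to follow, here is a toy model in Python of what I believe happens. It is only a sketch: HandleDb, open_then_iterate and the OK value are hypothetical names invented for the illustration and are not the corosync hdb API. The only things taken from this report are the idea of two copies of ipc_hdb and error number 9 (BAD_HANDLE). A handle registered in one copy of the database is unknown to the other copy, so a lookup there fails, while a single shared copy works.

```python
# Toy model only: HandleDb is a hypothetical stand-in for a handle
# database such as ipc_hdb; it is not the real corosync hdb code.
OK = 0            # arbitrary success value for this sketch
BAD_HANDLE = 9    # the error number reported in this bug


class HandleDb:
    """One copy of a handle database (one instance of the global)."""

    def __init__(self):
        self._entries = {}
        self._next_handle = 1

    def create(self, payload):
        # Register a payload and hand back a new handle for it.
        handle = self._next_handle
        self._next_handle += 1
        self._entries[handle] = payload
        return handle

    def get(self, handle):
        # Validate the handle against THIS copy of the database only.
        if handle not in self._entries:
            return BAD_HANDLE, None
        return OK, self._entries[handle]


def open_then_iterate(db_used_by_open, db_used_by_iterate):
    # "open" registers the handle in one database copy,
    # "iterate" validates it against a possibly different copy.
    handle = db_used_by_open.create("ipc context")
    status, _ = db_used_by_iterate.get(handle)
    return status


# Before the patch (as modelled here): two copies of the database.
copy_a = HandleDb()
copy_b = HandleDb()
before_patch = open_then_iterate(copy_a, copy_b)

# After the patch: only one instance, shared by both code paths.
shared = HandleDb()
after_patch = open_then_iterate(shared, shared)

print("two copies:", before_patch)
print("one copy:  ", after_patch)
```

In this model the lookup against the second copy returns 9, which matches the symptom seen in dlm_controld, and the lookup against the single shared copy succeeds. It does not explain why the linker resolved the two functions to different copies; it only shows why that would break the handle check.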
Committed to upstream, so I'm closing this bug.