Bug 499918

Summary: saCkptSectionIterationNext() error
Product: [Fedora] Fedora
Reporter: David Teigland <teigland>
Component: openais
Assignee: Jan Friesse <jfriesse>
Status: CLOSED UPSTREAM
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: rawhide
CC: agk, fdinitto, sdake
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-06-01 08:47:37 UTC
Attachments:
Patch for Makefile.am, so ipc_hdb is not included multiple times (flags: none)

Description David Teigland 2009-05-08 22:10:48 UTC
Description of problem:

I think we may have lost something in transit between irc/email/svn:

Mar 26 16:10:20 <dct>   confchg, node1 create ckpt, node2 open ckpt, node2
                        read ckpt -> fail

Mar 26 16:10:46 <dct>   nodeid 1 creates the ckpt

Mar 26 16:13:42 <dct>   saCkptCheckpointOpen() works,
                        saCkptSectionIterationInitialize() works,
                        then saCkptSectionIterationNext() fails

Mar 26 16:30:34 <sdake> wow iteration fails straight up single node
Mar 26 16:30:39 <sdake> that was working like 1 week ago or less
Mar 26 16:52:30 <sdake> dct found problem
Mar 26 16:52:32 <sdake> patch coming to list now

This looks like the patch, but I don't see it in svn:
https://lists.linux-foundation.org/pipermail/openais/2009-March/011048.html

And I'm still getting error 9 (BAD_HANDLE) from saCkptSectionIterationNext().   


Version-Release number of selected component (if applicable):


How reproducible:

node1: mount gfs
node1: take plock on gfs
node2: mount gfs

When node2 mounts, node1 creates a checkpoint containing the info about the plock; node2 then opens that checkpoint and tries to read it. node2 then shows this error in /var/log/messages:

bull-02 dlm_controld[8233]: retrieve_plocks: ckpt iternext error 9 x

Comment 1 Jan Friesse 2009-05-27 14:24:39 UTC
Created attachment 345615
Patch for Makefile.am, so ipc_hdb is not included multiple times

Attached is a patch for corosync's Makefile.am. With it, coroipcc.o is no
longer linked directly into lib...; the *.so is a dependency instead. That way
ipc_hdb no longer ends up in multiple *.so files, and so multiple times in the
binary, which is what causes the problem.

Comment 2 Jan Friesse 2009-05-28 12:09:13 UTC
A better (I hope) description of the problem:

Functions from the ckpt library (such as saCkptCheckpointOpen, saCkptSectionIterationInitialize, ...) internally use the corosync functions reply_receive, reply_receive_in_buf, ... These functions are defined in the coroipcc.c source file and use the global static variable ipc_hdb.

Without the patch, coroipcc is built as a shared library (libcoroipcc.so) AND is also linked into every corosync library (such as cpg, ...), so the global variable ipc_hdb is included not only in libcoroipcc.so but also in libcpg.so, ...

dlm_controld has the function retrieve_plocks, and the whole binary is linked with libcoroipcc and libcpg. So ipc_hdb is included TWICE (and has TWO addresses).

The main cause of the bug was that reply_receive uses the address from one library and reply_receive_in_buf uses the other. This confuses the check in the hdb_get function. This is the part I don't understand 100%: why did the linker allow two copies of ipc_hdb, or rather, why did it choose different addresses in different functions (which are defined in the same module and called from the same module)? It looks as if the linker chooses the addresses more or less at random.

After removing the linking of coroipcc.o into cpg and using the dynamic version instead (which means there is only one instance of ipc_hdb), the problem disappeared for me.
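
To illustrate why two copies of a handle database can end in BAD_HANDLE, here is a minimal sketch. It is not corosync's hdb code: struct toy_hdb, toy_hdb_create, toy_hdb_get and the two ipc_hdb_copy_* variables are hypothetical names invented for this example, and the two copies are passed explicitly, whereas in the real bug they were two same-named instances in different shared libraries. The only values taken from the report are the error number 9 and the idea that a handle is checked against the database it is looked up in.

```c
#include <stdint.h>

#define ERR_OK          1   /* SA_AIS_OK in the SA Forum headers */
#define ERR_BAD_HANDLE  9   /* the "error 9" seen in the report */
#define MAX_SLOTS       8

/* Hypothetical, simplified handle database; NOT corosync's real hdb. */
struct toy_hdb {
    uint32_t check[MAX_SLOTS];  /* per-slot check value */
    unsigned int used;          /* number of slots handed out so far */
    uint32_t next_check;        /* seed for the next check value */
};

/* Two copies of what the source code meant to be ONE global database. */
static struct toy_hdb ipc_hdb_copy_a = { {0}, 0, 0x1000 };
static struct toy_hdb ipc_hdb_copy_b = { {0}, 0, 0x2000 };

/* Allocate a slot in db and encode (check, index) into a 64-bit handle. */
static int toy_hdb_create(struct toy_hdb *db, uint64_t *handle)
{
    if (db->used >= MAX_SLOTS)
        return -1;
    unsigned int idx = db->used++;
    db->check[idx] = ++db->next_check;
    *handle = ((uint64_t)db->check[idx] << 32) | idx;
    return 0;
}

/* Validate a handle against db: the slot must exist and the check must match. */
static int toy_hdb_get(const struct toy_hdb *db, uint64_t handle)
{
    uint32_t idx = (uint32_t)(handle & 0xffffffffu);
    uint32_t check = (uint32_t)(handle >> 32);
    if (idx >= db->used || db->check[idx] != check)
        return ERR_BAD_HANDLE;
    return ERR_OK;
}

/* Handle created and looked up in the same copy: lookup succeeds. */
int create_then_get_same_copy(void)
{
    uint64_t h;
    if (toy_hdb_create(&ipc_hdb_copy_a, &h) != 0)
        return -1;
    return toy_hdb_get(&ipc_hdb_copy_a, h);
}

/* Handle created in copy A but looked up in copy B: lookup is rejected. */
int create_then_get_other_copy(void)
{
    uint64_t h;
    if (toy_hdb_create(&ipc_hdb_copy_a, &h) != 0)
        return -1;
    return toy_hdb_get(&ipc_hdb_copy_b, h);
}
```

In this toy model the second function fails because copy B never saw the handle being created, which matches the symptom described above: one function registers the handle at one address of ipc_hdb and another validates it at the other. Calling both functions from a small main() and printing the return values is enough to see the difference.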

Comment 3 Jan Friesse 2009-06-01 08:47:37 UTC
Committed to upstream, so I'm closing this bug.