499918 – saCkptSectionIterationNext() error

Summary: saCkptSectionIterationNext() error

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	openais
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Jan Friesse
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-05-08 22:10 UTC by David Teigland
Modified:	2009-06-01 08:47 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-06-01 08:47:37 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Patch for Makefile.am, so ipc_hdb is no included multiple times (3.14 KB, patch) 2009-05-27 14:24 UTC, Jan Friesse	no flags

Description David Teigland 2009-05-08 22:10:48 UTC

Description of problem:

I think we may have lost something in transit between irc/email/svn:

Mar 26 16:10:20 <dct>   confchg, node1 create ckpt, node2 open ckpt, node2
                        read ckpt -> fail

Mar 26 16:10:46 <dct>   nodeid 1 creates the ckpt

Mar 26 16:13:42 <dct>   saCkptCheckpointOpen() works,
                        saCkptSectionIterationInitialize() works,
                        then saCkptSectionIterationNext() fails

Mar 26 16:30:34 <sdake> wow iteration fails straight up single node
Mar 26 16:30:39 <sdake> that was working like 1 week ago or less
Mar 26 16:52:30 <sdake> dct found problem
Mar 26 16:52:32 <sdake> patch coming to list now

This looks like the patch, but I don't see it in svn
https://lists.linux-foundation.org/pipermail/openais/2009-March/011048.html

And I'm still getting error 9 (BAD_HANDLE) from saCkptSectionIterationNext().   


Version-Release number of selected component (if applicable):


How reproducible:

node1: mount gfs
node1: take plock on gfs
node2: mount gfs

When node2 mounts, node1 creates a checkpoint containing the info about the plock; node2 then opens that checkpoint and tries to read it. node2 will then show this error in /var/log/messages:

bull-02 dlm_controld[8233]: retrieve_plocks: ckpt iternext error 9 x
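
For reading log lines like the one above, a small lookup from the numeric code to its symbolic name can help. The following is only a sketch: the helper name toy_sa_strerror is hypothetical (it is not part of openais or corosync), and the numeric values are listed as I understand them from the SA Forum AIS header saAis.h, so check them against the header on your system.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not an openais/corosync API): map a numeric
 * SaAisErrorT value, as printed in logs, to its symbolic name.
 * Only a subset of codes is listed; the values are assumed to match
 * saAis.h and should be verified against the installed header.
 */
const char *toy_sa_strerror(int code)
{
    switch (code) {
    case 1:  return "SA_AIS_OK";
    case 2:  return "SA_AIS_ERR_LIBRARY";
    case 3:  return "SA_AIS_ERR_VERSION";
    case 4:  return "SA_AIS_ERR_INIT";
    case 5:  return "SA_AIS_ERR_TIMEOUT";
    case 6:  return "SA_AIS_ERR_TRY_AGAIN";
    case 7:  return "SA_AIS_ERR_INVALID_PARAM";
    case 8:  return "SA_AIS_ERR_NO_MEMORY";
    case 9:  return "SA_AIS_ERR_BAD_HANDLE";
    case 27: return "SA_AIS_ERR_NO_SECTIONS"; /* normal end of a section iteration */
    default: return "unknown SaAisErrorT value";
    }
}
```

With this, the "iternext error 9" in the log decodes to SA_AIS_ERR_BAD_HANDLE, which is different from the SA_AIS_ERR_NO_SECTIONS that a section iteration is expected to return when it simply runs out of sections.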


Comment 1 Jan Friesse 2009-05-27 14:24:39 UTC

Created attachment 345615
Patch for Makefile.am, so ipc_hdb is no included multiple times

Attached is a patch for corosync's Makefile.am. With it, coroipcc.o is no
longer linked directly into each lib...; instead, the *.so is a dependency.
As a result, ipc_hdb no longer ends up in multiple *.so files, and therefore
multiple times in the binary, which is what causes the problem.

Comment 2 Jan Friesse 2009-05-28 12:09:13 UTC

A better (I hope) description of the problem:

Functions from the ckpt library (such as saCkptCheckpointOpen, saCkptSectionIterationInitialize, ...) internally use the corosync functions reply_receive, reply_receive_in_buf, ... These functions are defined in the coroipcc.c source file and use the global static variable ipc_hdb.

Without the patch, coroipcc is built as a shared library (libcoroipcc.so) AND is also linked into every corosync library (such as cpg, ...), so the global variable ipc_hdb is included not only in libcoroipcc.so, but also in libcpg.so, ...

dlm_controld has the function retrieve_plocks, and the whole binary is linked with libcoroipcc and libcpg. So ipc_hdb is included TWICE (and therefore has TWO addresses).

The main problem causing the bug was that reply_receive uses the address from one library, and reply_receive_in_buf uses the address from the other. This confuses the check in the hdb_get function. What I don't understand 100% is why the linker allowed two instances of ipc_hdb to exist, or rather, why it chose different addresses in different functions (even though they are defined in the same module and called from the same module). It looks as if the linker chooses the addresses just randomly.

After removing coroipcc.o from the cpg link and using the dynamic version instead (which means there is only one instance of ipc_hdb), the problem disappeared for me.
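
To illustrate why two instances of a handle database lead to a bad-handle error, here is a minimal, self-contained sketch. It is not the corosync code: struct toy_hdb and all toy_* names are hypothetical, and the two copies of ipc_hdb are modelled as two ordinary variables in one file rather than as one symbol duplicated across shared libraries. It only assumes that a handle is valid solely in the database instance that created it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ERR_OK         1   /* mirrors SA_AIS_OK */
#define TOY_ERR_BAD_HANDLE 9   /* mirrors SA_AIS_ERR_BAD_HANDLE */
#define TOY_MAX_HANDLES    8

/* Hypothetical stand-in for a handle database such as ipc_hdb. */
struct toy_hdb {
    int in_use[TOY_MAX_HANDLES];
};

/* Two instances model the two copies of the "same" static variable that
 * exist in the process when the object file is linked into two libraries. */
static struct toy_hdb hdb_copy_in_libcoroipcc;
static struct toy_hdb hdb_copy_in_libcpg;

/* Allocate a handle in the given database; returns the slot index or -1. */
static int toy_handle_create(struct toy_hdb *db)
{
    for (int i = 0; i < TOY_MAX_HANDLES; i++) {
        if (!db->in_use[i]) {
            db->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Validate a handle against the given database. */
static int toy_handle_get(const struct toy_hdb *db, int handle)
{
    if (handle < 0 || handle >= TOY_MAX_HANDLES || !db->in_use[handle])
        return TOY_ERR_BAD_HANDLE;
    return TOY_ERR_OK;
}

/* Handle created and checked in the same copy: the check passes. */
int toy_lookup_same_copy(void)
{
    int h = toy_handle_create(&hdb_copy_in_libcoroipcc);
    assert(h >= 0);
    return toy_handle_get(&hdb_copy_in_libcoroipcc, h);
}

/* Handle created in one copy but checked in the other: the check fails. */
int toy_lookup_other_copy(void)
{
    int h = toy_handle_create(&hdb_copy_in_libcoroipcc);
    assert(h >= 0);
    return toy_handle_get(&hdb_copy_in_libcpg, h);
}
```

In this model, toy_lookup_same_copy() returns 1 and toy_lookup_other_copy() returns 9, which matches the symptom in the report: the handle itself is fine, but it is looked up in a database instance that never saw it. With a single instance of the database, only the first case can occur.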

Comment 3 Jan Friesse 2009-06-01 08:47:37 UTC

Committed to upstream, so I'm closing this bug.
