Description of problem:
Running openais/test/testlck while pacemaker is up on a system supported by corosync results in a backtrace.

Version-Release number of selected component (if applicable):
corosync-1.3.1 / openais 1.1.3 / pacemaker 1.0.0 (clusterlabs)
corosync-1.3.1 / openais 1.1.4 / pacemaker 1.1.5 (fedora-14)

How reproducible:
Run testlck (or testlck2). Generally the first run causes corosync to crash when testlck exits (after hitting enter).

Steps to Reproduce:
1. service openais restart
2. openais/test/testlck, <hit enter>; on exit corosync core dumps

Actual results:
(gdb) where
#0  0x00007ffff400e221 in fprintf () from /usr/libexec/lcrso/service_lck.lcrso
#1  0x0000000000407afe in ?? ()
#2  0x00007ffff7bcaeee in ?? () from /usr/lib64/libtotem_pg.so.4
#3  0x00007ffff7bcb40e in ?? () from /usr/lib64/libtotem_pg.so.4
#4  0x00007ffff7bc4163 in ?? () from /usr/lib64/libtotem_pg.so.4
#5  0x00007ffff7bc8c98 in ?? () from /usr/lib64/libtotem_pg.so.4
#6  0x00007ffff7bbf6f3 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
#7  0x00007ffff7bbb480 in ?? () from /usr/lib64/libtotem_pg.so.4
#8  0x00007ffff7bb7be8 in poll_run () from /usr/lib64/libtotem_pg.so.4
#9  0x000000000040735b in main ()

Expected results:
Locking works as specified, the application exits normally, and the corosync daemon continues to operate.

Additional info:
The fprintf in amfcomp.c clc_command_run() handling the exiting process is suspected, but this is purely a speculative guess. After corosync exits, all pacemaker daemons (attrd, crmd, stonithd, cib) consume 100% of the CPU. Additional bug: shutdown (service openais stop) generally does not guarantee that all services stopped cleanly. openais started in an idle state consumes approximately 5 KB/s of network bandwidth.
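For readers unfamiliar with the test client, the following is a minimal sketch of the kind of client-side sequence testlck exercises. It is not the actual testlck source; the SAF Lock Service calls shown (saLckInitialize, saLckResourceOpen, saLckFinalize) follow the published AIS B.01.01 signatures rather than anything verified against the openais tree, and the resource name is made up. The point is that the crash reported above happens on the corosync side while it cleans up this client's lock resources after the process exits.

#include <stdio.h>
#include <string.h>
#include <saAis.h>
#include <saLck.h>

int main (void)
{
	SaLckHandleT handle;
	SaLckResourceHandleT resource_handle;
	SaVersionT version = { 'B', 1, 1 };
	SaNameT resource_name;
	SaAisErrorT err;

	/* Connect to the lck service inside corosync/openais. */
	err = saLckInitialize (&handle, NULL, &version);
	if (err != SA_AIS_OK) {
		fprintf (stderr, "saLckInitialize failed: %d\n", err);
		return 1;
	}

	/* Open (create) a lock resource, roughly as testlck does. */
	memset (&resource_name, 0, sizeof (resource_name));
	strcpy ((char *)resource_name.value, "test_resource");
	resource_name.length = strlen ("test_resource");
	err = saLckResourceOpen (handle, &resource_name, SA_LCK_RESOURCE_CREATE,
		SA_TIME_END, &resource_handle);
	if (err != SA_AIS_OK) {
		fprintf (stderr, "saLckResourceOpen failed: %d\n", err);
		return 1;
	}

	/* testlck waits here; pressing enter lets the process exit. */
	printf ("resource open, press enter to exit\n");
	getchar ();

	/* On exit the daemon walks its per-connection resource cleanup
	 * list in services/lck.c, which is where the reported crash occurs. */
	saLckFinalize (handle);
	return 0;
}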
(In reply to comment #1)
> Since this issue was entered in bugzilla, the release flag has been
> set to ? to ensure that it is properly evaluated for this release.

(In reply to comment #0)
> Description of problem:
> run openais/test/testlck while pacemaker is up on a system supported by
> corosync resulting in a back trace:
>
> Version-Release number of selected component (if applicable):
> corosync-1.3.1/openais 1.1.3/pacemaker 1.0.0 (clusterlabs)
> corosync-1.3.1/openais 1.1.4/pacemaker 1.1.5 (fedora-14)

You filed this bug against RHEL6, but it appears that you are not using RHEL6. Is that correct?

> How reproducible:
> run testlck (or testlck2)
> generally first run causes corosync to crash when testlck exits (after hitting enter)

Can you define "generally"? Does that mean this always happens or only occasionally?

> Steps to Reproduce:
> 1. service openais restart
> 2. openais/test/testlck <hit enter> on exit corosync core dumps
>
> Actual results:
> (gdb) where
> #0 0x00007ffff400e221 in fprintf () from /usr/libexec/lcrso/service_lck.lcrso
> #1 0x0000000000407afe in ?? ()
> #2 0x00007ffff7bcaeee in ?? () from /usr/lib64/libtotem_pg.so.4
> #3 0x00007ffff7bcb40e in ?? () from /usr/lib64/libtotem_pg.so.4
> #4 0x00007ffff7bc4163 in ?? () from /usr/lib64/libtotem_pg.so.4
> #5 0x00007ffff7bc8c98 in ?? () from /usr/lib64/libtotem_pg.so.4
> #6 0x00007ffff7bbf6f3 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
> #7 0x00007ffff7bbb480 in ?? () from /usr/lib64/libtotem_pg.so.4
> #8 0x00007ffff7bb7be8 in poll_run () from /usr/lib64/libtotem_pg.so.4
> #9 0x000000000040735b in main ()
>
> Expected results:
> locking works as specified, application exits normally and corosync daemon
> continues to operate.
>
> Additional info:
> suspect fprintf in amfcomp.c clc_command_run() handling exiting process but
> purely speculative guess.

Why do you think AMF code comes into play?

> After corosync exits all pacemaker daemons consume 100% of cpu (attrd, crmd,
> stonithd, cib). additional bug, generally shutdown (service openais stop)
> does not guarantee all services stopped cleanly. Openais started in an idle
> state consumes approximately 5KB/s network bandwidth.

Side effect and/or different issue.
Delayed response due to not being on the cc list... now fixed. The available test applications were used to provide a simple way to recreate the problem.

(In reply to comment #2)
> (In reply to comment #1)
> > Since this issue was entered in bugzilla, the release flag has been
> > set to ? to ensure that it is properly evaluated for this release.
>
> (In reply to comment #0)
> > Description of problem:
> > run openais/test/testlck while pacemaker is up on a system supported by
> > corosync resulting in a back trace:
> >
> > Version-Release number of selected component (if applicable):
> > corosync-1.3.1/openais 1.1.3/pacemaker 1.0.0 (clusterlabs)
> > corosync-1.3.1/openais 1.1.4/pacemaker 1.1.5 (fedora-14)
>
> You filed this bug against RHEL6, but it appears that you are not using RHEL6.
> Is that correct?

My mistake. The release tested was Fedora-14, but my account will not let me change the category of the bug to a different product line. The problem MAY exist in RHEL6, but I did not specifically test it.

> > How reproducible:
> > run testlck (or testlck2)
> > generally first run causes corosync to crash when testlck exits (after hitting enter)
>
> Can you define "generally"? Does that mean this always happens or only
> occasionally?

I have only run tens of tests, and I believe the 'corosync' daemon crashes every time on the first run of the 'testlck' application. The expected result is that testlck and testlck2 can run over and over without crashing the corosync daemon. Would a crash of a core daemon (on what appears to be a debug print statement) caused by a routine application exiting be considered high priority, since it disables the node in pacemaker?

> > Steps to Reproduce:
> > 1. service openais restart
> > 2. openais/test/testlck <hit enter> on exit corosync core dumps
> >
> > Actual results:
> > (gdb) where
> > #0 0x00007ffff400e221 in fprintf () from /usr/libexec/lcrso/service_lck.lcrso
> > #1 0x0000000000407afe in ?? ()
> > #2 0x00007ffff7bcaeee in ?? () from /usr/lib64/libtotem_pg.so.4
> > #3 0x00007ffff7bcb40e in ?? () from /usr/lib64/libtotem_pg.so.4
> > #4 0x00007ffff7bc4163 in ?? () from /usr/lib64/libtotem_pg.so.4
> > #5 0x00007ffff7bc8c98 in ?? () from /usr/lib64/libtotem_pg.so.4
> > #6 0x00007ffff7bbf6f3 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
> > #7 0x00007ffff7bbb480 in ?? () from /usr/lib64/libtotem_pg.so.4
> > #8 0x00007ffff7bb7be8 in poll_run () from /usr/lib64/libtotem_pg.so.4
> > #9 0x000000000040735b in main ()
> >
> > Expected results:
> > locking works as specified, application exits normally and corosync daemon
> > continues to operate.
> >
> > Additional info:
> > suspect fprintf in amfcomp.c clc_command_run() handling exiting process but
> > purely speculative guess.
>
> Why do you think AMF code comes into play?

The speculation is based on a search of the code associated with the innermost frame (#0) of the stack trace in service_lck:
fprintf () from /usr/libexec/lcrso/service_lck.lcrso

> > After corosync exits all pacemaker daemons consume 100% of cpu (attrd, crmd,
> > stonithd, cib). additional bug, generally shutdown (service openais stop)
> > does not guarantee all services stopped cleanly. Openais started in an idle
> > state consumes approximately 5KB/s network bandwidth.
>
> Side effect and/or different issue.

Once the corosync daemon exits, the pacemaker daemons loop and dominate the CPU. This is an unacceptable error-condition response, IMHO.
Since this crash is the root cause of the following higher-priority bugs, shouldn't the priority of this crash of a service be higher?

706291 - pacemaker (stonithd, attrd, cib, crmd) spins and consumes 100% cpu after corosync crashes
706297 - service openais not stopping all services
(In reply to comment #3)
> Delayed response due to not being on cc list...now fixed. The use of the
> available test applications was used to provide a simple way to recreate a
> problem.
>
> (In reply to comment #2)
> > (In reply to comment #1)
> > > Since this issue was entered in bugzilla, the release flag has been
> > > set to ? to ensure that it is properly evaluated for this release.
> >
> > (In reply to comment #0)
> > > Description of problem:
> > > run openais/test/testlck while pacemaker is up on a system supported by
> > > corosync resulting in a back trace:
> > >
> > > Version-Release number of selected component (if applicable):
> > > corosync-1.3.1/openais 1.1.3/pacemaker 1.0.0 (clusterlabs)
> > > corosync-1.3.1/openais 1.1.4/pacemaker 1.1.5 (fedora-14)
> >
> > You filed this bug against RHEL6, but it appears that you are not using RHEL6.
> > Is that correct?
>
> My mistake. In this case the release that was tested was Fedora-14, but my
> account will not let me change the category of the bug to a different product
> line. The problem MAY exist in RHEL6 but I did not specifically test.

Moved to F14.

> > > How reproducible:
> > > run testlck (or testlck2)
> > > generally first run causes corosync to crash when testlck exits (after hitting enter)
> >
> > Can you define "generally"? Does that mean this always happens or only
> > occasionally?
>
> I have only run 10's of tests and I believe the 'corosync' daemon crashes every
> time on the first run of the 'testlck' application. Expected results are that
> testlck and testlck2 should run over and over without crashing the corosync
> daemon. Would a crash of a core daemon (on what appears to be a debug print
> statement) due to a routine application exiting be considered high priority
> since it disables the node in pacemaker?

No, because the LCK service is experimental.

> > > Steps to Reproduce:
> > > 1. service openais restart
> > > 2. openais/test/testlck <hit enter> on exit corosync core dumps

Are you also starting corosync?

> > > Actual results:
> > > (gdb) where
> > > #0 0x00007ffff400e221 in fprintf () from /usr/libexec/lcrso/service_lck.lcrso
> > > #1 0x0000000000407afe in ?? ()
> > > #2 0x00007ffff7bcaeee in ?? () from /usr/lib64/libtotem_pg.so.4
> > > #3 0x00007ffff7bcb40e in ?? () from /usr/lib64/libtotem_pg.so.4
> > > #4 0x00007ffff7bc4163 in ?? () from /usr/lib64/libtotem_pg.so.4
> > > #5 0x00007ffff7bc8c98 in ?? () from /usr/lib64/libtotem_pg.so.4
> > > #6 0x00007ffff7bbf6f3 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
> > > #7 0x00007ffff7bbb480 in ?? () from /usr/lib64/libtotem_pg.so.4
> > > #8 0x00007ffff7bb7be8 in poll_run () from /usr/lib64/libtotem_pg.so.4
> > > #9 0x000000000040735b in main ()
> > >
> > > Expected results:
> > > locking works as specified, application exits normally and corosync daemon
> > > continues to operate.
> > >
> > > Additional info:
> > > suspect fprintf in amfcomp.c clc_command_run() handling exiting process but
> > > purely speculative guess.
> >
> > Why do you think AMF code comes into play?
>
> The speculation is based on a search of the code associated with the bottom of
> the stack trace in service_lck.
> fprintf () from /usr/libexec/lcrso/service_lck.lcrso
>
> > > After corosync exits all pacemaker daemons consume 100% of cpu (attrd, crmd,
> > > stonithd, cib). additional bug, generally shutdown (service openais stop)
> > > does not guarantee all services stopped cleanly. Openais started in an idle
> > > state consumes approximately 5KB/s network bandwidth.
> >
> > Side effect and/or different issue.
>
> Once the corosync daemon exits the pacemaker daemons loop and dominate the cpu.
> This is an unacceptable error condition response IMHO. Since this is the root
> cause of the following higher priority bugs shouldn't the priority of this
> crash of a service be higher?
>
> 706291 - pacemaker (stonithd, attrd, cib, crmd) spins and consume 100%
> cpu after corosync crashes
> 706297 - service openais not stopping all services

If you are using corosync, you really should not need to use the openais init script. In fact, it should be removed.
The openais startup simply runs the corosync daemon with an extra set of flags, so there is no reason to start the corosync service separately. Starting openais sets up the services for use by openais (corosync shares the openais mailing list/group, and the services that are started include the pacemaker components). Are you sure you want to 'remove' the openais init script?

From the FAQ:
"We share a mailing list with the openais project. The openais mailing list should be used for all communication relating to The Corosync Cluster Engine."
The core dump usually happens after one run of testlck, and definitely after two runs in a row. A backtrace after a rebuild of the lcrso provides better information about the location of the problem (not in fprintf, and, as noted above, not in AMF).

(gdb) where
#0  0x00007feee8136e48 in lck_resource_cleanup_find (conn=0x6d5700, resource_name=0x7fff3c439960)
    at /vobs/opensource/expanded/openais-1.1.4/services/lck.c:1597
#1  0x00007feee81383c8 in message_handler_req_exec_lck_resourceclose (
    message=0x7fff3c439940, nodeid=990357696)
    at /vobs/opensource/expanded/openais-1.1.4/services/lck.c:2309
#2  0x0000000000407afe in ?? ()
#3  0x00007feeec138eee in ?? () from /usr/lib64/libtotem_pg.so.4
#4  0x00007feeec13940e in ?? () from /usr/lib64/libtotem_pg.so.4
#5  0x00007feeec132163 in ?? () from /usr/lib64/libtotem_pg.so.4
#6  0x00007feeec136c98 in ?? () from /usr/lib64/libtotem_pg.so.4
#7  0x00007feeec12d6f3 in rrp_deliver_fn () from /usr/lib64/libtotem_pg.so.4
#8  0x00007feeec129480 in ?? () from /usr/lib64/libtotem_pg.so.4
#9  0x00007feeec125be8 in poll_run () from /usr/lib64/libtotem_pg.so.4
#10 0x000000000040735b in main ()

Note that in lck.c:

static struct resource_cleanup *lck_resource_cleanup_find (
	void *conn,
	const mar_name_t *resource_name)
{
	struct lck_pd *lck_pd = (struct lck_pd *)api->ipc_private_data_get (conn);
	struct resource_cleanup *cleanup;
	struct list_head *cleanup_list;

1597--->	for (cleanup_list = lck_pd->resource_cleanup_list.next;
		cleanup_list != &lck_pd->resource_cleanup_list;
		cleanup_list = cleanup_list->next) {

The problem is that the value obtained through the api data structure is not correct: 'lck_pd' ends up as a garbage pointer. api->ipc_private_data_get should resolve to the function 'coroipcs_private_data_get', if I read the code correctly.

(gdb) p lck_pd
$3 = (struct lck_pd *) 0x636e7973
(gdb) p *lck_pd
Cannot access memory at address 0x636e7973
(gdb) p coroipcs_private_data_get
$4 = {<text variable, no debug info>} 0x7feeebd13890 <coroipcs_private_data_get>

From corosync apidef.c:

static struct corosync_api_v1 apidef_corosync_api_v1 = {
	...
	.ipc_private_data_get = coroipcs_private_data_get,

Any suggestions on places to look?
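One direction worth checking, offered only as a hedged sketch and not a verified fix: the crashing frame is an *_exec_* handler, and in the corosync/openais service model such handlers run on every node for every multicast message, while the conn pointer carried inside the message is only a live IPC connection on the node that originated the request, and only for as long as that connection still exists. If message_handler_req_exec_lck_resourceclose dereferences conn unconditionally via api->ipc_private_data_get(), a stale or foreign conn would produce exactly the garbage lck_pd seen above. The field and helper names below (req->source, message_source_is_local, lck_resource_cleanup_remove) are illustrative, not copied from lck.c.

/* Hypothetical guard: only the originating node touches per-connection state. */
static int message_source_is_local (const mar_message_source_t *source)
{
	/* The message source carries the originating nodeid plus a conn
	 * pointer that is only meaningful on that originating node. */
	return source->nodeid == api->totem_nodeid_get ();
}

static void message_handler_req_exec_lck_resourceclose (
	const void *message, unsigned int nodeid)
{
	const struct req_exec_lck_resourceclose *req = message;

	/* ... cluster-wide resource bookkeeping that every node performs ... */

	/* Per-connection cleanup must be limited to the node that owns the
	 * client connection; anywhere else conn is not a valid pointer. */
	if (!message_source_is_local (&req->source)) {
		return;
	}

	lck_resource_cleanup_remove (req->source.conn, &req->resource_name);
}

On a single-node cluster a variation of the same problem would apply: the client connection may already have been torn down by the time the multicast close message is delivered back, so the conn pointer would additionally need to be validated against the set of still-live connections before being dereferenced.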
(In reply to comment #5)
> The startup for openais simply runs the corosync daemon with an extra set of
> flags, there is no reason to start the service corosync.

I suggest you use the corosync service and configure corosync to load the services you are interested in using.

> Starting openais sets up the services for use by openais (corosync is in the
> openais mail group and the services that are started include the pacemaker
> components). Are you sure you want to 'remove' the openais init script?

Yes. In fact, it should be removed. See BZ 630110. We don't need an openais init script now that corosync exists.

> from the FAQ:
> We share a mailing list with the openais project. The openais mailing list
> should be used for all communication relating to The Corosync Cluster Engine.

I am well aware of this. I don't understand why you are mentioning it.
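For completeness, the suggestion above would look roughly like the following fragment when starting the corosync init service directly. The pacemaker stanza is the form commonly documented for pacemaker on corosync 1.x; the "openais_lck" service name is an assumption based on how the openais lcrso services appear to register themselves, so verify the exact name before relying on it.

# /etc/corosync/corosync.conf (fragment)
# Load only the services you actually need.
service {
	# pacemaker plugin on corosync 1.x
	name: pacemaker
	ver: 0
}

service {
	# assumed name for the experimental openais lock service; only load
	# this if you intend to exercise testlck/testlck2
	name: openais_lck
	ver: 0
}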
(In reply to comment #7)
> (In reply to comment #5)
> > The startup for openais simply runs the corosync daemon with an extra set of
> > flags, there is no reason to start the service corosync.
>
> I suggest you use the corosync service and configure corosync to load the
> services you are interested in using.

Yes, this does indeed circumvent the bug by not allowing the test application to load at all. That, however, seems to be an odd way to address an underlying bug that is causing the daemon to crash. Can we focus this discussion on how to change the software to fix the underlying crash?

> > Starting openais sets up the services for use by openais (corosync is in the
> > openais mail group and the services that are started include the pacemaker
> > components). Are you sure you want to 'remove' the openais init script?
>
> Yes. In fact, it should be removed. See BZ 630110. We don't need an openais
> init script now that corosync exists.

Perhaps the proper setup of pacemaker/corosync should include some documentation to this effect, instead of burying the proper setup in a bug.

> > from the FAQ:
> > We share a mailing list with the openais project. The openais mailing list
> > should be used for all communication relating to The Corosync Cluster Engine.
>
> I am well aware of this. I don't understand why you are mentioning this.

I mention the mailing list as one of the sources of confusion about properly setting up the system. The openais mailing list and project appear to be the umbrella under which corosync is working. As a result, the assumption was made that openais supports the AIS suite of interfaces, that it was provided for development, and that it is used by pacemaker to interface with corosync.

Again, the basic problem is that there is a path that causes the daemon to crash, with a very simple way to recreate the problem. Would it be OK to focus the discussion on helping to identify the logic problems in the daemon that could be repaired so that an application shutting down does not core dump the daemon?
Closing WONTFIX since openais will be going away in F17.