Bug 441413

Summary: aisexec dies during recovery when plock_ownership is enabled
Product: Red Hat Enterprise Linux 5
Reporter: Steven Dake <sdake>
Component: openais
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 5.3
CC: bstevens, ccaulfie, cluster-maint, edamato, jbrassow, nstraz, sdake, sghosh, syeghiay, tao, teigland
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 20:46:45 UTC
Bug Depends On: 432531
Bug Blocks: 509887

Comment 1 Nate Straz 2008-04-17 22:35:45 UTC
I was able to hit this using the PLOCK load without plock_ownership enabled.

Comment 5 Nate Straz 2008-08-06 14:14:43 UTC
I hit this on a long revolver run (on the 26th iteration) with 
openais-0.80.3-15.el5.

Comment 8 Steven Dake 2008-10-12 21:46:59 UTC
Really good progress to report.

The difficulty in debugging this problem in the past has been that the existing logging or printf tracing is too slow to allow the problem to reproduce.  Essentially, because typical printf processing overhead changes the timing characteristics of the system, the problem was impossible to reproduce.

I spent this last week writing a super high performance "flight recorder" tracing system and integrated it into whitetank (openais-0.80.3+patches) to understand this issue.  When an assert or segfault occurs, the most recent events that occurred in the system are recorded to disk.
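
As a rough illustration of the idea (not the actual whitetank code), a flight recorder of this kind can be as simple as a fixed-size in-memory ring of preformatted records dumped to disk from the failure handler.  The names fr_record, fr_log, and fr_dump below are hypothetical, and a real crash handler would need async-signal-safe I/O rather than stdio:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define FR_RECORDS  (128 * 1024)   /* records kept in memory */
#define FR_TEXT_LEN 128            /* fixed record size keeps logging cheap */

struct fr_record {
        uint64_t seq;              /* monotonically increasing record id */
        char text[FR_TEXT_LEN];    /* preformatted trace text */
};

static struct fr_record fr_ring[FR_RECORDS];
static uint64_t fr_next;           /* next record position (wraps) */

/* Record one event; only a counter bump and a bounded copy, so the timing
 * impact is far smaller than printf-style tracing. */
static void fr_log(const char *text)
{
        struct fr_record *rec = &fr_ring[fr_next % FR_RECORDS];

        rec->seq = fr_next++;
        strncpy(rec->text, text, FR_TEXT_LEN - 1);
        rec->text[FR_TEXT_LEN - 1] = '\0';
}

/* Called when an assert or segfault occurs: write the ring to disk so the
 * last events before the failure can be inspected. */
static void fr_dump(const char *path)
{
        FILE *fp = fopen(path, "w");
        uint64_t i, start;

        if (fp == NULL)
                return;
        start = (fr_next > FR_RECORDS) ? fr_next - FR_RECORDS : 0;
        for (i = start; i < fr_next; i++)
                fprintf(fp, "rec=[%llu] %s\n",
                        (unsigned long long)fr_ring[i % FR_RECORDS].seq,
                        fr_ring[i % FR_RECORDS].text);
        fclose(fp);
}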

The following events were recorded:

@ record position 22465 in the record log:

Section transmit checkpoint name=[gfsplock.bench-1230] id=[48] section=[r74.1]
Section transmit checkpoint name=[gfsplock.bench-1230] id=[48] section=[r72.1]
Section transmit checkpoint name=[gfsplock.bench-1230] id=[48] section=[r73.1]
Section transmit checkpoint name=[gfsplock.bench-1230] id=[48] section=[r80590950.1]
Section transmit checkpoint name=[gfsplock.bench-1230] id=[48] section=[r80590953.0]
Section transmit checkpoint name=[gfsplock.bench-1230] id=[48] section=[r389.1]
Checkpoint transmit name=[gfsplock.bench-1232] id=[15]
Section transmit checkpoint name=[gfsplock.bench-1232] id=[15] section=[r443.1]
Section transmit checkpoint name=[gfsplock.bench-1232] id=[15] section=[r460.1]
Section transmit checkpoint name=[gfsplock.bench-1232] id=[15] section=[r66372.1]

@26717 right before the segfault:

Section receive checkpoint name=[gfsplock.bench-1230] id=[48] section=[r74.1]
Section receive checkpoint name=[gfsplock.bench-1230] id=[48] section=[r72.1]
Section receive checkpoint name=[gfsplock.bench-1230] id=[48] section=[r73.1]
Section receive checkpoint name=[gfsplock.bench-1230] id=[48] section=[r80590950.1]
Section receive checkpoint name=[gfsplock.bench-1230] id=[48] section=[r80590953.0]
Section receive checkpoint name=[gfsplock.bench-1230] id=[48] section=[r389.1]
Section receive checkpoint name=[gfsplock.bench-1232] id=[15] section=[r443.1]
Finishing replay: records found [26716]

From the above trace it can be seen that checkpoint gfsplock.bench-1232 id 15 was transmitted, but not received!  The checkpoint section receive handler depends on the ordering of the totem protocol, so that section receives happen after checkpoint receives.  If the checkpoint receive doesn't happen, then aisexec asserts (because it is operating outside the defined behavior of the design).
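
To illustrate that dependency (with hypothetical names, not the actual ckpt service code): totem delivers messages in agreed order, so by the time a section message arrives, the matching checkpoint message must already have created the checkpoint instance, and the section handler asserts if it has not.

#include <assert.h>
#include <string.h>

struct checkpoint {
        char name[256];
        struct checkpoint *next;
};

static struct checkpoint *checkpoint_list;

static struct checkpoint *checkpoint_find(const char *name)
{
        struct checkpoint *ckpt;

        for (ckpt = checkpoint_list; ckpt != NULL; ckpt = ckpt->next)
                if (strcmp(ckpt->name, name) == 0)
                        return ckpt;
        return NULL;
}

static void handle_checkpoint_section_receive(const char *ckpt_name)
{
        struct checkpoint *ckpt = checkpoint_find(ckpt_name);

        /* Totem ordering guarantees the checkpoint message was delivered
         * first; if that message was lost, we are outside the defined
         * behavior of the design and this assert fires. */
        assert(ckpt != NULL);
        /* ... add the received section to ckpt ... */
}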

While this is not a full root cause, there is now a tracing system that indicates why the crash occurred and that can be used to understand the actual root cause.

The question to be answered now is why the message that was transmitted is missing from the incoming message stream.  While not a trivial question to answer, with more instrumentation the answer can be found.

Comment 9 Steven Dake 2008-10-24 22:32:22 UTC
Still working full time on this issue.  Still segfaults.

I have fixed various other issues hoping they would result in correction of this problem.  They were found via trace debugging using the flight recorder to understand where that missing message went.

First, in the current implementation, if a checkpoint synchronization is aborted after it has started, the synchronized checkpoint database can become corrupt.  An abort happens when a member joins or leaves while synchronization is taking place.  This is fixed in my test tree.

Second, totem messages are not delivered to the application in the GATHER or COMMIT states.  This can result in messages being delivered in the wrong configuration, confusing the checkpoint synchronization.  This has been fixed in my test tree.

Third, it is possible for the received_flg of the totem token to not be set properly in many circumstances.  This results in messages that could be recovered not being recovered at the end of a configuration.  I am not certain this has any effect on checkpoint synchronization.  This has been fixed in my test tree.

I have tried all kinds of things and inspected some of the code paths I am not pleased with.  They all seem to function according to their design, minus the significant flaws listed above.  I have more to examine, but as has been stated in this bugzilla before, figuring out what to trace after a failure takes 2-3 hours and running the test takes 3-4 hours, so if I'm lucky I can overlap two runs in a day while I'm sleeping, sometimes three if the test fails earlier than usual.

Bottom line: still peeling the onion, but on a side note, many other checkpoint-related defects have been resolved as inspection of this problem continues.

Comment 10 Steven Dake 2008-10-31 12:31:04 UTC
I have found through additional lengthy analysis that a proper error code is being received for the message transmitted, but the message is not in the delivered messages.  I was also suspicious of the assembly of fragmented packets.

The following trace:
rec=[617102] totemsrp_delv 53311
rec=[617103] totempg_deliver_fn assembly=[369] iov_delv=[774]
rec=[617104] xmit_id=[87003] Section receive checkpoint name=[gfsplock.bench-1232] id=[59] section=[r81965510.1]
rec=[617105] totempg_deliver_fn assembly=[774] iov_delv=[369]
rec=[617106] xmit_id=[87005] Section receive checkpoint name=[gfsplock.bench-1231] id=[58] section=[r81834705.1]
rec=[617107] LEAVING function [message_handler_req_exec_ckpt_sync_checkpoint_section] line [3992]
rec=[617108] totempg_deliver_fn assembly=[1143] iov_delv=[369]
rec=[617109] xmit_id=[87006] Section receive checkpoint name=[gfsplock.bench-1231] id=[58] section=[r81965552.1]
rec=[617110] LEAVING function [message_handler_req_exec_ckpt_sync_checkpoint_section] line [3992]
rec=[617111] LEAVING function [totempg_deliver_fn] line [612]

shows that at rec 617106, xmit_id 87004 (the missing message that causes the assertion) is missing from the stream of messages.  However, the totemsrp identifier is ordered properly.  It can be seen from rec 617103 that iov_delv=774.  This value is incorrect: 774 = sizeof section (369) + checkpoint (405).  It is this invalid iov_delv that results in the message being skipped.  In essence, two messages are delivered to one message handler and the second message is thrown away.  See below for where the magic 405 message size comes from on a proper checkpoint receive.

rec=[232459] totempg_deliver_fn assembly=[369] iov_delv=[405]
rec=[232460] xmit_id=[27197] Checkpoint receive checkpoint name=[gfsplock.bench-1231] id=[40]

Further analysis shows:

rec=[617099] totempg_deliver_fn assembly=[726] iov_delv=[363]
rec=[617100] xmit_id=[87002] Section receive checkpoint name=[gfsplock.bench-1232] id=[59] section=[r81.1]
rec=[617101] LEAVING function [totempg_deliver_fn] line [612]
rec=[617102] totemsrp_delv 53311
rec=[617103] totempg_deliver_fn assembly=[369] iov_delv=[774]

In rec 617099, assembly=726 plus a delivery length of 363 gives 1089 for the next assembly index, but instead it is 369.  It should be either zero or 1089, not 369.

This is why the message is lost.  There appears to be an error in totempg in some scenario.
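
For illustration only (the field and function names here are hypothetical, not the actual totempg structures), the bookkeeping the trace is checking amounts to:

/* Per-source reassembly state; 'index' counts the bytes currently
 * buffered for that source. */
struct assembly {
        int index;
};

/* After delivering iov_delv bytes, the next assembly index should be
 * either the old index plus iov_delv (data still buffered) or zero
 * (everything consumed); any other value means a message was skipped. */
static void assembly_advance(struct assembly *assembly, int iov_delv,
                             int fully_consumed)
{
        if (fully_consumed)
                assembly->index = 0;          /* e.g. 726 + 363 all consumed */
        else
                assembly->index += iov_delv;  /* e.g. 726 + 363 -> 1089 */
}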

Comment 11 Steven Dake 2008-11-06 04:16:26 UTC
More tracing reveals the following:

missing message at:
rec=[823369] totemsrp_delv 46662
rec=[823370] totempg_deliver_fn assembly=[364] iov_delv=[769] fragmented=[1]
rec=[823371] xmit_id=[41183] Section receive checkpoint name=[gfsplock.bench-1230] id=[45] section=[r426.1]
rec=[823372] totempg_deliver_fn assembly=[769] iov_delv=[367] fragmented=[1]
rec=[823373] xmit_id=[41185] Section receive checkpoint name=[gfsplock.bench-1232] id=[41] section=[r131901.1]
rec=[823374] LEAVING function [message_handler_req_exec_ckpt_sync_checkpoint_section] line [3992]

XMIT ID 41184 is missing.

If we look back at the transmission of this message we see:
rec=[802221] totempg_mcast_cond_two idx=[0] iov_len=[5] copy_base=[0] copy_len=[4] iov_len_two=[4]
rec=[802222] totempg_mcast_cond_two idx=[1] iov_len=[5] copy_base=[0] copy_len=[1] iov_len_two=[1]
rec=[802223] totempg_mcast_cond_two idx=[2] iov_len=[5] copy_base=[0] copy_len=[352] iov_len_two=[352]
rec=[802224] totempg_mcast_cond idx=[3] iov_len=[5] copy_base=[0] copy_len=[7] iov_len_two=[7]
rec=[802225] totemsrp_mcast 46661
rec=[802226] totempg_mcast_cond_two idx=[4] iov_len=[5] copy_base=[0] copy_len=[0] iov_len_two=[0]
rec=[802227] LEAVING function [mcast_msg] line [915]


We can see that the size of the requested IO vector is 0 in rec 802226.

Essentially, totempg is designed to handle iovector elements with 0 length.  But in a certain circumstance the logic incorrectly formats the fragments of the message.

The above message should _NOT_ be marked as a "fragment"; not marking it allows the refragmentation code to set the proper assembly->index as described in the previous comments.  Because there is an extra iovec (index 3 above) which "just fits in the message window", the message is marked as fragmented even though it is not, since the iovector for index 4 has a 0 length.  The logic does this because it thinks there is more data in the message contents (but there is not, since that iovector element length is zero).

This condition happens very rarely: only when the fragmented message contents exactly fill an entire totemsrp frame and the last iovector element has zero length.  The ckpt synchronization code performs this operation often, and it is expected and correct input to the totempg layer.

Going to test a workaround which simply removes any iovector elements which have zero length at the start of the fragmentation (multicasting) operation.
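
A sketch of that workaround, assuming a hypothetical helper name (applied before the fragmentation/multicasting logic runs): compact the caller's iovec array so zero-length elements never reach the fragmentation code and can no longer make a non-fragmented message look fragmented.

#include <sys/uio.h>

/* Drop zero-length iovec elements in place; returns the new count. */
static int iovec_strip_empty(struct iovec *iov, int iov_len)
{
        int in, out = 0;

        for (in = 0; in < iov_len; in++) {
                if (iov[in].iov_len == 0)
                        continue;        /* skip empty elements */
                iov[out++] = iov[in];
        }
        return out;
}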

Test results from this workaround tomorrow, 9am GMT-7.

Regards
-steve

Comment 12 Steven Dake 2008-11-10 08:26:36 UTC
The bug is fixed.  The bug requires the action -> exception flag to be set in order to be checked in.

Thanks
-steve

Comment 16 errata-xmlrpc 2009-01-20 20:46:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0074.html