Bug 101280 - reentrancy into versioning event processor causes stack overflow
Summary: reentrancy into versioning event processor causes stack overflow
Alias: None
Product: Red Hat Web Application Framework
Classification: Retired
Component: other
Version: nightly
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Vadim Nasardinov
QA Contact: Jon Orris
Depends On:
Blocks: 98546
TreeView+ depends on / blocked
Reported: 2003-07-30 16:20 UTC by Vadim Nasardinov
Modified: 2007-04-18 16:56 UTC (History)
2 users (show)

Clone Of:
Last Closed: 2003-08-04 20:38:23 UTC

Attachments (Terms of Use)

Description Vadim Nasardinov 2003-07-30 16:20:04 UTC
Date: 30 Jul 2003 10:38:41 -0400
From: Mike Bonnet <mikeb@redhat.com>
To: Vadim Nasardinov <vadimn@redhat.com>
Message-Id: <1059575921.6551.16.camel@kahoma.home>
Subject: stack overflow in versioning

Hey Vadim.  I'm seeing some weird behavior in versioning.  When an item
is being promoted from pending to live by the lifecycle thread, I'm
getting stack traces like the attached.  Now, it is a complex object
with a lot of associations, but it looks like versioning might be
getting into some kind of loop.

It appears that this causes the IBM JVM to abort with a SIGSEGV.  I
think this is actually poor handling of a stack overflow, but I'm not
100% sure.

You may want to look into the loop problem.  In the shorter term, does
it make sense to add a call to Versions.suspendVersioning() in

I'm on AIM and IRC if you have questions.


Comment 1 Vadim Nasardinov 2003-07-30 16:25:12 UTC
The stack trace:

2003-07-30 10:17:45,834 [ead-5] DEBUG db.Statement - Statement
constructor: created com.arsdigita.db.PreparedStatement@76d26703,
statement count is now 1

java.lang.Throwable: Stack trace
at_com.arsdigita.db.Statement.<init>(Statement.java(Compiled Code))
Compiled Code))
at_com.arsdigita.db.PreparedStatement.wrap(PreparedStatement.java(Compiled Code))
at_com.arsdigita.db.Connection.prepareStatement(Connection.java(Compiled Code))
at_com.redhat.persistence.SetEvent.dispatch(SetEvent.java(Compiled Code))
at_com.redhat.persistence.Session.flushInternal(Session.java(Compiled Code))
at_com.redhat.persistence.Session.flush(Session.java(Compiled Code))
at_com.redhat.persistence.Cursor.next(Cursor.java(Compiled Code))

[13000+ similar lines removed for the sake of brevity]

at_com.redhat.persistence.Session.flushInternal(Session.java(Compiled Code))
at_com.redhat.persistence.Session.flush(Session.java(Compiled Code))
at_com.redhat.persistence.Cursor.next(Cursor.java(Compiled Code))
at_com.redhat.persistence.Session.flushInternal(Session.java(Compiled Code))
at_com.redhat.persistence.Session.flush(Session.java(Compiled Code))

Comment 2 Vadim Nasardinov 2003-07-30 16:27:58 UTC
(11:35:02) ad Vadim: the short answer is, it's a feature

(11:35:25) ad Vadim: the longer answer is, I'd like to fix this. Here's how it

(11:36:23) ad Vadim: When the versioning event processor starts receiving
  events that need to be recorded in the versioning log, it has to retrieve
  the five kinds of event types from the vcx_event_types table first.

(11:36:47) ad Vadim: the act of retrieving a data object causes the
  persistence engine to flush all pending events.

(11:36:59) ad Vadim: so, more events are received by the versioning log.

(11:37:38) ad Vadim: in order to record this, it first must retrieve the
  appropriate event types. A retrieve is issued. This causes the persistence
  engine to flush all currently pending flushable events.

(11:37:45) ad Vadim: and on, and on, and on

(11:38:12) aD mikeb: couldn't (shouldn't?) those event types be cached at
  server startup?

(11:38:16) ad Vadim: this works out ok in the end, because eventually we get
  to the point where all flushable pending events have been flushed

(11:38:23) ad Vadim: yes, they could

(11:38:43) ad Vadim: the only problem is, the versioning initializer runs
  before the persistence initializer

(11:38:49) ad Vadim: (it must for a variety of reasons)

(11:39:10) ad Vadim: so, I am leaning towards adding a
  PostPersistenceInitializer to retrieve the events and cache them

(11:39:26) aD mikeb: could you just lazy-initialize them the first time
  they're needed?

(11:39:52) ad Vadim: This is exactly what's happenning now. They are being
  lazily initialized, which causes reentrance that you are seeing.

(11:40:36) ad Vadim: The problem is that in your particular case, the number
  of reentrant calls is so great, it exceeds the allocated stack capacity

(11:40:53) aD mikeb: yes

(11:41:05) ad Vadim: in my test cases, the stack trace is no more than 200
  piddly lines or so

(11:41:41) ad Vadim: so, I think, to be on the safe side, I do need to add a
  post-persistence initializer.

(11:41:47) aD mikeb: increasing the stack size to 2M seems to keep the JVM
  from aborting with the SIGSEGV, but it doesn't look like
  promotePendingVersion() is ever actually completing successfully.

(11:42:31) ad Vadim: well, that probably means that for whatever reason, there
  is always a fresh supply of pending flushable events.

(11:42:55) ad Vadim: I did not prove mathematically that the reentrancy always
  sorts itself out eventually

(11:43:03) ad Vadim: maybe it does not under certain circumstances

(11:43:30) ad Vadim: it would seem to me, that given a deep enough stack, it
  should sort itself out, but maybe not

(11:43:35) aD mikeb: like if there's a cycle in the object graph?

(11:43:53) ad Vadim: has nothing to do that, really

(11:43:55) aD mikeb: which is very possible in this case

(11:44:06) ad Vadim: it goes like this

(11:44:15) ad Vadim: you make some changes to your data objects...

(11:44:26) ad Vadim: this generates a bunch of Set, Add, Remove, Create, and
  Delete events

(11:44:43) ad Vadim: the versioning processor examines these events and
  decides if they need to be recorded

(11:45:11) ad Vadim: say, a Set event needs to be recorded. So, we create an
  Operation, whose Event type is "set".

(11:45:39) ad Vadim: the way the versioning log data model is laid out, the
  operations table has a foreigh key referencing vcx_event_types.

(11:45:57) ad Vadim: so, I need to retrieve the "set" event data object to set
  the operation's "eventType" attribute correctly.

(11:46:10) ad Vadim: I try to retrieve the "set" event data object.

(11:46:20) ad Vadim: this causes persistence to flush all pending flushable

(11:46:29) ad Vadim: more events end up coming the versioning processor's way

(11:46:53) ad Vadim: note that my "retrieve" call hasn't returned yet at this

(11:46:53) aD mikeb: Is an event flushable if it doesn't have it's "eventType"

(11:47:57) ad Vadim: no, an event is flushable if performing the corresponding
  database operation is possible, i.e. it doesn't violate referential
  integrity and such

(11:48:32) ad Vadim: so, for example, if you set a foreign key to a value, you
  can't write the foreign key, until the thing it references has been saved

(11:48:38) ad Vadim: persistence sorts these things out

(11:48:52) ad Vadim: all I see on the versioning side, is a steady stream of
  persistence events

(11:48:59) ad Vadim: there are five kinds.

(11:49:16) ad Vadim: to persist these events in the versioning log, I have to
  go through the process I described.

(11:49:47) ad Vadim: so, this is what happens

(11:50:17) ad Vadim: Persistence has 100 flushable pending events in its
  queue. It flushes them. The versioning processor receives event 1.

(11:50:37) ad Vadim: It's versionable, so it tries to save the corresponding
  versioning Operation.

(11:51:03) ad Vadim: The very first time this happens, versioning's "Event"
  data object have not been loaded yet.

(11:51:16) ad Vadim: so, we try to retrieve them.

(11:51:38) ad Vadim: This causes persistence to flush more stuff

(11:51:52) ad Vadim: The versioning processor receives event 2.

(11:52:00) ad Vadim: say, it's also versionable

(11:52:26) ad Vadim: we must initialize our "Event" data objects.

(Remember that our "retreive" has not returned yet at this point.)

(11:52:45) ad Vadim: so, we try to retrieve the five "Event" data objects

(11:52:51) ad Vadim: this flushes more events

(11:53:15) ad Vadim: so, I think basically, this goes until the queue of
  flushable events is exhausted.

(11:53:26) ad Vadim: if there is a lot of events in the queue, this may take a
  long time.

(11:54:09) ad Vadim: if you have a complex object structure and you do a lot
  of sets and adds without calling save(), then you end up building a long
  queue of events.

(11:54:19) ad Vadim: at some point, you call save(), and all these are flushed

(11:54:32) aD mikeb: which is exactly what happens in promotePendingVersion()

(11:54:46) ad Vadim: so, the versioning processor chokes on reentrancy

(11:55:19) ad Vadim: a quick fix is to make a couple of simple sets and adds
  and call save, if possible

(11:55:30) ad Vadim: since the queue is small, the reentrancy will sort itself

(11:56:05) ad Vadim: once the "Event" object have been successfully
  initialized, they are cached as disconnected data objects (since they are
  truly read-only and never change).

(11:56:22) ad Vadim: the right fix is to add the post-persistence initializer,
  I think.

(11:56:50) ad Vadim: I tried solving this before, and that's the only viable
  option I could come up with.

(11:57:09) ad Vadim: so, I am going to file this chat log as a BZ ticket and
  assign it to myself.

(11:57:32) ad Vadim: in the meantime, you can try the workaround I suggested

(11:57:40) ad Vadim: how does that sound?

(11:57:49) aD mikeb: shorter term, what do you think about adding
  suspendVersioning() to promotePendingVersion() ?

(12:01:19) ad Vadim: it's not just a matter of what I think. There is also
  Dr. Lutterkort's opinion that should be taken into account. Here's the deal.

(12:01:40) ad Vadim: Currently, we suspend versioning on entry into publish()
  and unpublish(), right?

(12:01:58) ad Vadim: This is based on the assumption that no important changes
  happen in these methods.

(12:02:07) ad Vadim: David was uneasy about this.

(12:02:27) ad Vadim: If someone overrides publish(), they may do stuff in the
  publish method that should be versioned.

(12:02:43) ad Vadim: They will be surprised to learn that it wasn't, next time
  they try to rollback.

(12:03:08) ad Vadim: The alternative was to try to do what the old versioning
  did, which is, decide on a per instance basis.

(12:03:28) ad Vadim: This doesn't really work, because you don't really know
  what constitutes an item.

(12:03:39) ad Vadim: It's not just the data object itself, it's the
  components, too.

(12:04:19) aD mikeb: hmm, okay, I'll try to do some sets and a save() before
  entering promotePendingVersion(), and see if that fixes the problem

(12:04:21) ad Vadim: So, if we were to provide callbacks to let you decide
  whether each individual event should be versioned or not, you'd have to
  basically have the same knowledge of the versioning dependency graph that
  versioning has

(12:04:34) ad Vadim: so, my point:

(12:04:54) ad Vadim: we decided that versioning should not be suspended on
  entry into publish()

(12:05:21) aD mikeb: ahhh, okay, I did not realize that had been decided

(12:05:24) ad Vadim: this is cheap, because publish() mostly creates new data
  objects, and these are cheap, since we hardly record anything for newly
  created objects

(12:05:43) ad Vadim: we do want to suspend versioning in unpublish(), since
  doing otherwise is too expensive.

(12:06:12) ad Vadim: so, it would seem to me that promotePendingVersion()
  falls into the publish() category, as far as Lutterkort would be concerned

(12:06:16) ad Vadim: what do you think?

(12:06:33) aD mikeb: well, all promotePendingVersion() does is update the
  "version" field

(12:06:49) aD mikeb: but it does it recursively, potentially affecting a lot
  of items

(12:08:08) ad Vadim: ah, but as David would point out, are you sure that no
  one will override it in the future? If it's a final method and has no
  attachable callbacks, then sure, you can probably suspend versioning

(12:09:09) aD mikeb: I'm not really sure how the semantics of versioning apply
  to live items. If you roll back an item with no live version to some point
  in the past when it did have a live version, do you resurrect the old live

(12:11:05) ad Vadim: If there is enough data to resurrect it, then yes.

(In theory. In practice, versioning does not currently work correctly if an
  object needs to be resurrected, but we don't have the data, due to
  suspendVersioning() having been called when the object was deleted.)

(12:13:34) ad Vadim: so, the bottom line is, there are some issues that need
  to be sorted out, and Lutterkort seems to be the only one keeping an eye on
  my p4 submit messages and asking questions. If it hadn't been for him, these
  things would have been swept under the rug. (Because I personally happen to
  think that suspending versioning on entry into publish() is ok.)

(12:14:17) aD mikeb: alright, I'll try your workaround for now. Will you be on
  AIM if I have more questions?

(12:14:29) ad Vadim: yes, I'll stay logged in.

Comment 3 Vadim Nasardinov 2003-07-30 17:04:37 UTC
(12:42:34) aD mikeb: Hey Vadim. I added the sets and saves your suggested, to
  no avail. I'm still getting the huge stack trace, followed by a JVM
  abort. The stack trace and subsequent flurry of versioning activity happens
  when trying to delete an object (the old live version of a pending item).
(12:54:27) ad Vadim: ok
(12:54:37) ad Vadim: I'll add the post-persistence initializer today
(12:54:54) ad Vadim: I'll try to add a test case that fails the same way.
(12:55:12) aD mikeb: I just added suspendVersioning() to
  PublishLifecycleListener.begin() and that fixed the problem.
(12:55:33) aD mikeb: The pending versions got successfully promoted to live as
(12:55:55) aD mikeb: I'm not convinced this is a versioning-reentrancy
(12:56:07) ad Vadim: what do you mean, it's not?
(12:56:27) ad Vadim: your stack trace was 14,000 lines
(12:56:36) ad Vadim: that looks like reentrancy to me all right
(12:57:17) aD mikeb: But I'm fairly sure that those event types must have been
  initialized already.
(12:57:51) ad Vadim: BTW, the Bugzilla ticket for this is
(12:58:36) ad Vadim:
  com.arsdigita.versioning.EventType.getDataObject(EventType.java:67) should
  return the cached disconnected data object, if event types had been
(12:58:56) aD mikeb: I added this code before promotePendingVersion() is
  final String originalName = master.getName();
  That should initialize everything, right?
(12:59:41) ad Vadim: probably, but not necessarily
(13:00:05) ad Vadim: if there are no versionable events, the flushed events
  will be ignored
(13:00:24) aD mikeb: Ahh, I see. So Session.retrieve() should never get called
  after EventType.getDataObject?
(13:00:47) ad Vadim: say what?
(13:01:38) aD mikeb: In the stack trace...if it was returning the cached data
  object, Session.retrieve wouldn't be getting called after
  EventType.getDataObject() ?
(13:01:47) ad Vadim: correct
(13:02:20) ad Vadim:

  synchronized DataObject getDataObject() {
      if ( m_dobj == null ) {
          m_dobj = SessionManager.getSession().retrieve(new OID(DATA_TYPE, m_id));
      return m_dobj;
(13:02:33) aD mikeb: okay. Let me know when that initializer has been
(13:02:34) ad Vadim: retrieve() is called only if m_dobj is null

Comment 4 Vadim Nasardinov 2003-07-30 19:12:07 UTC
In p4 34119, added a unit test that seems to reproduce this problem.
It creates 1000 data objects without calling save.  Once all 1000 data
objects have been created, the session is flushed.  This causes 1000
"create" events to be sent to the versioning event processor. (To be
precise, it's more than 1000, but all events other than "create" are
not versionable in the case of a newly created data object.)  This
causes massive reentrancy into the versioning event processor, as it
tries to initialize event types for the first time.  I got tired of
waiting for this process to die and had to kill it with "kill -3".
The resulting stack trace was similar to what Mike reported.

Lowering the number of data objects to 100, and adding some additional
logging to the EventType.getDataObject method, I see 100 cache misses.
From this I conclude that in the case 1000 objects, the depth of
reentrancy would be 1000, resulting in a stack over 11,000 lines long.
This is because the repeating chunk is 11 lines long:

com.redhat.persistence.Session.flushInternal(Session.java(Compiled Code))
com.redhat.persistence.Session.flush(Session.java(Compiled Code))
com.redhat.persistence.Cursor.next(Cursor.java(Compiled Code))

Comment 5 Vadim Nasardinov 2003-07-30 21:04:56 UTC
Fixed in 34126.

Note You need to log in before you can comment on or make changes to this bug.