Bug 116010 - Lucene search does not eliminate deleted content from its index
Lucene search does not eliminate deleted content from its index
Status: CLOSED WONTFIX
Product: Red Hat Enterprise CMS
Classification: Retired
Component: other (Show other bugs)
nightly
All Linux
medium Severity medium
: ---
: ---
Assigned To: Vadim Nasardinov
Jon Orris
:
Depends On:
Blocks: 108447
  Show dependency treegraph
 
Reported: 2004-02-17 11:18 EST by Jon Orris
Modified: 2007-04-18 13:03 EDT (History)
1 user (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-11-10 15:27:36 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Jon Orris 2004-02-17 11:18:42 EST
Description of problem:
@40422/Postgres

Search under lucene does not appear to eliminate deleted content from
the lucene index. The E-Test Search test creates a number of items &
after completion, deletes all of them. Even after this, searching for
the keywords in these items brings up a list. None of the links to the
items work, of course, as they've been deleted.

I've waited 10-15 minutes, and the indexer has run multiple times.
Parameters:

waf.search.indexer=lucene
waf.search.lazy_updates=false
waf.lucene.location=/var/ccm-devel/web/jorris/rickshaw/work/lucene
waf.lucene.interval=120

Log:
Indexer running after deletes:
2004-02-17 11:06:35,471 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:06:35,471 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:06:35,479 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:08:35,481 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:08:35,482 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:08:35,490 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:10:35,491 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:10:35,493 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:10:35,504 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:12:35,501 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:12:35,501 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:12:35,509 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.

Search being run
2004-02-17 11:13:33,333 [004-3] DEBUG search.Search - Processing
com.arsdigita.search.QuerySpecification@b45e718d
Specification: {
  Terms: foobar
  Filters:
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.VersionFilterSpecification@11436579
} using cache com.arsdigita.search.TimedResultCache@105bb5a on lucene
2004-02-17 11:13:33,334 [004-3] DEBUG search.TimedResultCache -
Retrieved result for com.arsdigita.search.QuerySpecification@b45e718d
Specification: {
  Terms: foobar
  Filters:
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.VersionFilterSpecification@11436579
} none cached
2004-02-17 11:13:33,335 [004-3] DEBUG search.Search - Processing query
specification com.arsdigita.search.QuerySpecification@b45e718d
Specification: {
  Terms: foobar
  Filters:
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.VersionFilterSpecification@11436579
}
2004-02-17 11:13:33,335 [004-3] DEBUG search.QueryEngineRegistry -
Lookup engine for lucene
2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry -  
Filter: creationUser (class
com.arsdigita.cms.search.CreationUserFilterType)
2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry -  
Filter: creationDate (class
com.arsdigita.cms.search.CreationDateFilterType)
2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry -  
Filter: contentType (class com.arsdigita.cms.search.ContentTypeFilterType)
2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry -  
Filter: lastModifiedUser (class
com.arsdigita.cms.search.LastModifiedUserFilterType)
2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry -  
Filter: lastModifiedDate (class
com.arsdigita.cms.search.LastModifiedDateFilterType)
2004-02-17 11:13:33,337 [004-3] DEBUG search.QueryEngineRegistry -  
Filter: version (class com.arsdigita.cms.search.VersionFilterType)
2004-02-17 11:13:33,435 [004-3] DEBUG lucene.IntersectionFilter -
Adding filter class com.arsdigita.search.lucene.TypeSpecificFilter
2004-02-17 11:13:33,447 [004-3] DEBUG lucene.LuceneResultSet - Result
set count is 10, query duration 0
2004-02-17 11:13:33,451 [004-3] INFO  search.CachedResultSet - Cache
size: 50, is greater than result count: 10
2004-02-17 11:13:33,453 [004-3] DEBUG lucene.LuceneResultSet -
Paginating at offset 0 for 10 rows
2004-02-17 11:13:33,456 [004-3] DEBUG lucene.LuceneResultSet - Peek
object 0
2004-02-17 11:13:33,456 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,461 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 10
2004-02-17 11:13:33,462 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,463 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 9
2004-02-17 11:13:33,463 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,464 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 8
2004-02-17 11:13:33,464 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,465 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 7
2004-02-17 11:13:33,465 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,466 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 6
2004-02-17 11:13:33,467 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,467 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 5
2004-02-17 11:13:33,468 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,469 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 4
2004-02-17 11:13:33,469 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,470 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 3
2004-02-17 11:13:33,470 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,471 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 2
2004-02-17 11:13:33,472 [004-3] DEBUG lucene.LuceneResultSet - Got
next object
2004-02-17 11:13:33,472 [004-3] DEBUG lucene.LuceneResultSet - Peeking
at result count is 1
2004-02-17 11:13:33,473 [004-3] DEBUG lucene.LuceneResultSet - Closing
search result set
2004-02-17 11:13:33,473 [004-3] DEBUG search.TimedResultCache -
Storing result for com.arsdigita.search.QuerySpecification@b45e718d
Specification: {
  Terms: foobar
  Filters:
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.VersionFilterSpecification@11436579
} count 10 time 0 engine lucene
2004-02-17 11:13:33,476 [004-3] DEBUG search.TimedResultCache -
Retrieved result for com.arsdigita.search.QuerySpecification@b45e718d
Specification: {
  Terms: foobar
  Filters:
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582
    com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04
    com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59
    com.arsdigita.cms.search.VersionFilterSpecification@11436579
} count 10 time 0 engine lucene
2004-02-17 11:13:33,476 [004-3] DEBUG ui.ResultsPane - Got result set
class com.arsdigita.search.CachedResultSet count: 10
2004-02-17 11:13:33,480 [004-3] DEBUG ui.ResultsPane - Paginator stats
  page number:1
  page count: 1
  page size: 10
 start 0
  end: 10
 count: 10
2004-02-17 11:13:33,484 [004-3] DEBUG ui.ResultsPane - Outputting
documents
2004-02-17 11:13:33,484 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.Minutes:{id=27019}] search minutes
2004-02-17 11:13:33,486 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.Service:{id=26055}] search service
2004-02-17 11:13:33,487 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.Event:{id=26020}] search event
2004-02-17 11:13:33,487 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.PressRelease:{id=26046}] press search
2004-02-17 11:13:33,488 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.Agenda:{id=26003}] search agenda
2004-02-17 11:13:33,489 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.Article:{id=26011}] search article
2004-02-17 11:13:33,490 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.LegalNotice:{id=27010}] search legal
2004-02-17 11:13:33,491 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.NewsItem:{id=26037}] search news
2004-02-17 11:13:33,502 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.MultiPartArticle:{id=26029}] search mpa
2004-02-17 11:13:33,503 [004-3] DEBUG ui.ResultsPane - One doc
[com.arsdigita.cms.contenttypes.Job:{id=27002}] search job
2004-02-17 11:14:35,511 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:14:35,511 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:14:35,518 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:16:35,521 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:16:35,521 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:16:35,529 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:18:35,531 [ead-5] INFO  lucene.Indexer - Lucene indexer
running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
2004-02-17 11:18:35,531 [ead-5] DEBUG lucene.Indexer - syncing
2004-02-17 11:18:35,538 [ead-5] INFO  lucene.Indexer - Lucene indexer
finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.
Comment 1 Carsten Clasohm 2004-03-09 05:38:48 EST
The 6.0 branch has the same problem.
Comment 2 Vadim Nasardinov 2004-03-10 12:56:16 EST
I can't reproduce this with simple testing.  I used the following
three indicators to verify that deleting content actually works as
advertised:

 1. Using the "Search" tab of the content section to see whether or
    not lucene search finds the content before and after it is
    deleted.

 2. I set the following two loggers to the DEBUG level:
      com.arsdigita.search.lucene.Indexer
      com.arsdigita.search.lucene.DocumentObserver

    and instrumented DocumentObserver like so:    

    |==== //core-platform/dev/src/com/arsdigita/search/lucene/DocumentObserver.java#2
    |@@ -118,12 +118,17 @@
    |      */
    |     public void onDelete(DomainObject dobj) {
    |         BigDecimal id = (BigDecimal)dobj.getOID().get("id");
    |+        s_log.debug("onDelete: id=" + id, new Throwable());
    |         Assert.exists(id, BigDecimal.class);
    |         Document doc = Document.retrieve(id);
    |         if (doc != null) {
    |+            s_log.debug("deleting document");
    |             doc.setDeleted(true);
    |             doc.setDirty(true);
    |             doc.save();
    |+        } else {
    |+            s_log.debug("document not found");
    |         }
    |+
    |     }
    | }

 3. I used Luke[1], the Lucene Index Browser by Andrzej Bialecki to
    independently view the lucene index.

My testing included creating two Articles and publishing one of them.
The first article was called "Draft Piece" and the second "Published
Article".  I published the second article, unpublished it, deleted it,
deleted the first article.  At each stage, I would search for either
"Draft" or "Published" (or both) to see whether the relevant content
was indexed and whether it was appropriately deleted.

All of the three indicators described above show that content is
properly added to and removed from the lucene index.  Removal is a
two-stage process.  Once a content item is deleted, the corresponding
lucene Document is marked for deletion like so:

  com.arsdigita.search.lucene.DocumentObserver.onDelete(DocumentObserver.java:121)
  com.arsdigita.search.SearchTransactionListener.beforeCommit(SearchTransactionListener.java:82)
  com.arsdigita.persistence.TransactionContext.fireBeforeCommitEvent(TransactionContext.java:180)
  com.arsdigita.persistence.TransactionContext.commitTxn(TransactionContext.java:87)
  com.arsdigita.kernel.DatabaseTransaction.commit(DatabaseTransaction.java:75)
  com.arsdigita.kernel.DatabaseTransaction.end(DatabaseTransaction.java:63)

The next time the lucene Indexer runs, the document that has been
scheduled for deletion is properly removed from the index.  The
"Search" tab returns no hits.  Luke shows that the number of entries
in the Lucene filesystem index is decreased by one.

References
 1. http://www.getopt.org/luke/
Comment 3 Vadim Nasardinov 2004-03-10 12:57:07 EST
I have two possible explanations for why you see the reported behavior
happening with E-Tester.  The simplest explanation is as follows.

  The problem is probably due to the fact that the Lucene index is
  maintained asynchronously.  When you add or delete a content item,
  we create (or update) an entry in lucene_docs that is _later_
  processed by a background thread that adds/removes/updates entries
  to/from the lucene filesystem index.

  Now, suppose E-Tester creates a bunch of content items.  These are
  properly indexed by Lucene in due time.  E-Tester deletes the
  content items and the server is shut down for whatever reason before
  the lucene Indexer had a chance to remove the corresponding entries
  from the filesystem index.  Suppose further that you reload the data
  model.  Next time you start the server, your lucene_docs table does
  not have any entries in it.  The stale lucene documents are still
  there in your Lucene index directory (specified by
  "waf.lucene.location"), but they will never be cleaned out by the
  background Indexer thread, because the corresponding entries in the
  lucene_docs table are gone.

So, to test this out, could you please check if you can reproduce this
problem on a freshly reloaded data model with your
/var/ccm-devel/web/foo/dev/data/lucene/ directory cleaned before the
server start?
Comment 4 Vadim Nasardinov 2004-03-10 13:10:39 EST
Dan suggested another possibility why you guys might be seeing this.
Quoting from today's transcript of the #ccm channel:

<danpb_not_so_broken>
  vadim: one situation in which it could well be broken
<danpb_not_so_broken>
  is if they have mutliple JVMs but didn't give them unique lucene
  index ids
<danpb_not_so_broken>
  in that situation only one of the JVMs would remove each piece of
  content
Comment 5 Jon Orris 2004-03-10 15:41:52 EST
I can't reproduce this with a single JVM either. Looking at the debug
log, it is definitely deleting the objects.

So in my case, one of vadim's explanations was likely the case.

Carsten, can you comment on whether multiple jvms are being used in
the 6.0 problem? If the aren't, possibly you need faster update
frequency on lucene indexing.
Comment 6 Carsten Clasohm 2004-03-11 07:12:57 EST
It is reproducable with a single JVM setup and 6.0 as follows:

If a content item is deleted by unpublishing a live version, the
corresponding entry in the Lucene index is correctly deleted. But when
you browse a content section and use the "Delete" link of a content
item, that item is not removed from the Lucene index. I ran a debugger
and found that the beforeDelete() and afterDelete() methods of
lucene.Observer are never called in this case.

According to Tomislav, this is a problem in
com.arsdigita.cms.ui.folder.FolderBrowser.ItemDeleter.cellSelected(),
which instantiates a ContentBundle and deletes that object. I guess
that the observer is not triggered in this case because the bundle's
components are automatically deleted by persistence.
Comment 7 Vadim Nasardinov 2004-03-11 08:37:56 EST
Carsten,  I am about to head out to Westford and will be there in
a little over an hour.  In the meantime, can you post more detailed
instructions on how to reproduce this?  If I understand you correctly,
you are talking about a scenario which I did test on the trunk.  As
you point out, item deletion occurs in two cases: (a) when an live
item is unpublished and (b) when a draft item (that has no associated
live item) is explicitly deleted from its parent folder.

I test both (a) and (b) on the trunk and they both work.  I'll
see if I can reproduce this on the 6.0 branch.  Does Tomislav know
if the 6.0 are trunk versions of FolderBrowser.ItemDeleter are 
different?  (I will check for myself, but I gotta run now.)
Comment 8 Carsten Clasohm 2004-03-11 10:32:02 EST
Steps to reproduce this with 6.0:

1. Configure your system to use Lucene for indexing.
2. Create a new Article.
3. Stop the workflow and edit the body text, putting in a term which
will be easy to find.
4. Search the content section for the term.
5. Browse the content section and delete the Article.
6. Wait a bit and again search for the term.
7. Notice that the deleted Article is still listed in the search results.
8. Also take a look at table lucene_docs and see that column
is_deleted still is "0" for the Article you deleted.
Comment 9 Vadim Nasardinov 2004-03-11 12:45:32 EST
Search was completely rewritten for Rickshaw.  There is little
common code between 6.0 and the current trunk:

    | $ find 6.0/src/com/arsdigita/search/ -type d
    | 6.0/src/com/arsdigita/search/
    | $ find dev/src/com/arsdigita/search/ -type d
    | dev/src/com/arsdigita/search/
    | dev/src/com/arsdigita/search/filters
    | dev/src/com/arsdigita/search/intermedia
    | dev/src/com/arsdigita/search/lucene
    | dev/src/com/arsdigita/search/ui
    | dev/src/com/arsdigita/search/ui/filters

Since everyone agrees this ticket is not reproducible on the trunk,
I'm moving it off the "richshaw rc1" tracker (bug 113496) and putting
it on the Troika errata tracker (bug 108447).
Comment 10 Vadim Nasardinov 2004-03-22 12:48:13 EST
Dan points out that this might be the root cause of this bug:
https://www.redhat.com/archives/redhat-ccm-list/2004-March/msg00057.html

(When deleting a content bundle, persistence automagically deletes
all of its components without instantiating them as domain objects.
Therefore, some of the content items don't have their observers
run on delete.)
Comment 11 Vadim Nasardinov 2004-03-22 15:17:16 EST
Carsten,

If you get a chance, could you check if change 41632 fixes the problem?
(See bug 118919.)
Comment 12 Carsten Clasohm 2004-03-24 03:44:32 EST
Yes, it fixes the problem. A change for Troika only containing the
necessary modifications can be found in 41689.
Comment 13 Vadim Nasardinov 2005-11-10 15:27:36 EST
This is ancient.  Closing.

Note You need to log in before you can comment on or make changes to this bug.