Bug 116010
| Summary: | Lucene search does not eliminate deleted content from its index | ||
|---|---|---|---|
| Product: | [Retired] Red Hat Enterprise CMS | Reporter: | Jon Orris <jorris> |
| Component: | other | Assignee: | Vadim Nasardinov <vnasardinov> |
| Status: | CLOSED WONTFIX | QA Contact: | Jon Orris <jorris> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | nightly | CC: | clasohm |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2005-11-10 20:27:36 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 108447 | ||
The 6.0 branch has the same problem. I can't reproduce this with simple testing. I used the following
three indicators to verify that deleting content actually works as
advertised:
1. Using the "Search" tab of the content section to see whether or
not lucene search finds the content before and after it is
deleted.
2. I set the following two loggers to the DEBUG level:
com.arsdigita.search.lucene.Indexer
com.arsdigita.search.lucene.DocumentObserver
and instrumented DocumentObserver like so:
|==== //core-platform/dev/src/com/arsdigita/search/lucene/DocumentObserver.java#2
|@@ -118,12 +118,17 @@
| */
| public void onDelete(DomainObject dobj) {
| BigDecimal id = (BigDecimal)dobj.getOID().get("id");
|+ s_log.debug("onDelete: id=" + id, new Throwable());
| Assert.exists(id, BigDecimal.class);
| Document doc = Document.retrieve(id);
| if (doc != null) {
|+ s_log.debug("deleting document");
| doc.setDeleted(true);
| doc.setDirty(true);
| doc.save();
|+ } else {
|+ s_log.debug("document not found");
| }
|+
| }
| }
3. I used Luke[1], the Lucene Index Browser by Andrzej Bialecki to
independently view the lucene index.
My testing included creating two Articles and publishing one of them.
The first article was called "Draft Piece" and the second "Published
Article". I published the second article, unpublished it, deleted it,
deleted the first article. At each stage, I would search for either
"Draft" or "Published" (or both) to see whether the relevant content
was indexed and whether it was appropriately deleted.
All of the three indicators described above show that content is
properly added to and removed from the lucene index. Removal is a
two-stage process. Once a content item is deleted, the corresponding
lucene Document is marked for deletion like so:
com.arsdigita.search.lucene.DocumentObserver.onDelete(DocumentObserver.java:121)
com.arsdigita.search.SearchTransactionListener.beforeCommit(SearchTransactionListener.java:82)
com.arsdigita.persistence.TransactionContext.fireBeforeCommitEvent(TransactionContext.java:180)
com.arsdigita.persistence.TransactionContext.commitTxn(TransactionContext.java:87)
com.arsdigita.kernel.DatabaseTransaction.commit(DatabaseTransaction.java:75)
com.arsdigita.kernel.DatabaseTransaction.end(DatabaseTransaction.java:63)
The next time the lucene Indexer runs, the document that has been
scheduled for deletion is properly removed from the index. The
"Search" tab returns no hits. Luke shows that the number of entries
in the Lucene filesystem index is decreased by one.
References
1. http://www.getopt.org/luke/
I have two possible explanations for why you see the reported behavior happening with E-Tester. The simplest explanation is as follows. The problem is probably due to the fact that the Lucene index is maintained asynchronously. When you add or delete a content item, we create (or update) an entry in lucene_docs that is _later_ processed by a background thread that adds/removes/updates entries to/from the lucene filesystem index. Now, suppose E-Tester creates a bunch of content items. These are properly indexed by Lucene in due time. E-Tester deletes the content items and the server is shut down for whatever reason before the lucene Indexer had a chance to remove the corresponding entries from the filesystem index. Suppose further that you reload the data model. Next time you start the server, your lucene_docs table does not have any entries in it. The stale lucene documents are still there in your Lucene index directory (specified by "waf.lucene.location"), but they will never be cleaned out by the background Indexer thread, because the corresponding entries in the lucene_docs table are gone. So, to test this out, could you please check if you can reproduce this problem on a freshly reloaded data model with your /var/ccm-devel/web/foo/dev/data/lucene/ directory cleaned before the server start? Dan suggested another possibility why you guys might be seeing this. Quoting from today's transcript of the #ccm channel: <danpb_not_so_broken> vadim: one situation in which it could well be broken <danpb_not_so_broken> is if they have mutliple JVMs but didn't give them unique lucene index ids <danpb_not_so_broken> in that situation only one of the JVMs would remove each piece of content I can't reproduce this with a single JVM either. Looking at the debug log, it is definitely deleting the objects. So in my case, one of vadim's explanations was likely the case. Carsten, can you comment on whether multiple jvms are being used in the 6.0 problem? If the aren't, possibly you need faster update frequency on lucene indexing. It is reproducable with a single JVM setup and 6.0 as follows: If a content item is deleted by unpublishing a live version, the corresponding entry in the Lucene index is correctly deleted. But when you browse a content section and use the "Delete" link of a content item, that item is not removed from the Lucene index. I ran a debugger and found that the beforeDelete() and afterDelete() methods of lucene.Observer are never called in this case. According to Tomislav, this is a problem in com.arsdigita.cms.ui.folder.FolderBrowser.ItemDeleter.cellSelected(), which instantiates a ContentBundle and deletes that object. I guess that the observer is not triggered in this case because the bundle's components are automatically deleted by persistence. Carsten, I am about to head out to Westford and will be there in a little over an hour. In the meantime, can you post more detailed instructions on how to reproduce this? If I understand you correctly, you are talking about a scenario which I did test on the trunk. As you point out, item deletion occurs in two cases: (a) when an live item is unpublished and (b) when a draft item (that has no associated live item) is explicitly deleted from its parent folder. I test both (a) and (b) on the trunk and they both work. I'll see if I can reproduce this on the 6.0 branch. Does Tomislav know if the 6.0 are trunk versions of FolderBrowser.ItemDeleter are different? (I will check for myself, but I gotta run now.) Steps to reproduce this with 6.0: 1. Configure your system to use Lucene for indexing. 2. Create a new Article. 3. Stop the workflow and edit the body text, putting in a term which will be easy to find. 4. Search the content section for the term. 5. Browse the content section and delete the Article. 6. Wait a bit and again search for the term. 7. Notice that the deleted Article is still listed in the search results. 8. Also take a look at table lucene_docs and see that column is_deleted still is "0" for the Article you deleted. Search was completely rewritten for Rickshaw. There is little
common code between 6.0 and the current trunk:
| $ find 6.0/src/com/arsdigita/search/ -type d
| 6.0/src/com/arsdigita/search/
| $ find dev/src/com/arsdigita/search/ -type d
| dev/src/com/arsdigita/search/
| dev/src/com/arsdigita/search/filters
| dev/src/com/arsdigita/search/intermedia
| dev/src/com/arsdigita/search/lucene
| dev/src/com/arsdigita/search/ui
| dev/src/com/arsdigita/search/ui/filters
Since everyone agrees this ticket is not reproducible on the trunk,
I'm moving it off the "richshaw rc1" tracker (bug 113496) and putting
it on the Troika errata tracker (bug 108447).
Dan points out that this might be the root cause of this bug: https://www.redhat.com/archives/redhat-ccm-list/2004-March/msg00057.html (When deleting a content bundle, persistence automagically deletes all of its components without instantiating them as domain objects. Therefore, some of the content items don't have their observers run on delete.) Carsten, If you get a chance, could you check if change 41632 fixes the problem? (See bug 118919.) Yes, it fixes the problem. A change for Troika only containing the necessary modifications can be found in 41689. This is ancient. Closing. |
Description of problem: @40422/Postgres Search under lucene does not appear to eliminate deleted content from the lucene index. The E-Test Search test creates a number of items & after completion, deletes all of them. Even after this, searching for the keywords in these items brings up a list. None of the links to the items work, of course, as they've been deleted. I've waited 10-15 minutes, and the indexer has run multiple times. Parameters: waf.search.indexer=lucene waf.search.lazy_updates=false waf.lucene.location=/var/ccm-devel/web/jorris/rickshaw/work/lucene waf.lucene.interval=120 Log: Indexer running after deletes: 2004-02-17 11:06:35,471 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:06:35,471 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:06:35,479 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:08:35,481 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:08:35,482 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:08:35,490 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:10:35,491 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:10:35,493 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:10:35,504 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:12:35,501 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:12:35,501 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:12:35,509 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. Search being run 2004-02-17 11:13:33,333 [004-3] DEBUG search.Search - Processing com.arsdigita.search.QuerySpecification@b45e718d Specification: { Terms: foobar Filters: com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582 com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.VersionFilterSpecification@11436579 } using cache com.arsdigita.search.TimedResultCache@105bb5a on lucene 2004-02-17 11:13:33,334 [004-3] DEBUG search.TimedResultCache - Retrieved result for com.arsdigita.search.QuerySpecification@b45e718d Specification: { Terms: foobar Filters: com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582 com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.VersionFilterSpecification@11436579 } none cached 2004-02-17 11:13:33,335 [004-3] DEBUG search.Search - Processing query specification com.arsdigita.search.QuerySpecification@b45e718d Specification: { Terms: foobar Filters: com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582 com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.VersionFilterSpecification@11436579 } 2004-02-17 11:13:33,335 [004-3] DEBUG search.QueryEngineRegistry - Lookup engine for lucene 2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry - Filter: creationUser (class com.arsdigita.cms.search.CreationUserFilterType) 2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry - Filter: creationDate (class com.arsdigita.cms.search.CreationDateFilterType) 2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry - Filter: contentType (class com.arsdigita.cms.search.ContentTypeFilterType) 2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry - Filter: lastModifiedUser (class com.arsdigita.cms.search.LastModifiedUserFilterType) 2004-02-17 11:13:33,336 [004-3] DEBUG search.QueryEngineRegistry - Filter: lastModifiedDate (class com.arsdigita.cms.search.LastModifiedDateFilterType) 2004-02-17 11:13:33,337 [004-3] DEBUG search.QueryEngineRegistry - Filter: version (class com.arsdigita.cms.search.VersionFilterType) 2004-02-17 11:13:33,435 [004-3] DEBUG lucene.IntersectionFilter - Adding filter class com.arsdigita.search.lucene.TypeSpecificFilter 2004-02-17 11:13:33,447 [004-3] DEBUG lucene.LuceneResultSet - Result set count is 10, query duration 0 2004-02-17 11:13:33,451 [004-3] INFO search.CachedResultSet - Cache size: 50, is greater than result count: 10 2004-02-17 11:13:33,453 [004-3] DEBUG lucene.LuceneResultSet - Paginating at offset 0 for 10 rows 2004-02-17 11:13:33,456 [004-3] DEBUG lucene.LuceneResultSet - Peek object 0 2004-02-17 11:13:33,456 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,461 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 10 2004-02-17 11:13:33,462 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,463 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 9 2004-02-17 11:13:33,463 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,464 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 8 2004-02-17 11:13:33,464 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,465 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 7 2004-02-17 11:13:33,465 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,466 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 6 2004-02-17 11:13:33,467 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,467 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 5 2004-02-17 11:13:33,468 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,469 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 4 2004-02-17 11:13:33,469 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,470 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 3 2004-02-17 11:13:33,470 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,471 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 2 2004-02-17 11:13:33,472 [004-3] DEBUG lucene.LuceneResultSet - Got next object 2004-02-17 11:13:33,472 [004-3] DEBUG lucene.LuceneResultSet - Peeking at result count is 1 2004-02-17 11:13:33,473 [004-3] DEBUG lucene.LuceneResultSet - Closing search result set 2004-02-17 11:13:33,473 [004-3] DEBUG search.TimedResultCache - Storing result for com.arsdigita.search.QuerySpecification@b45e718d Specification: { Terms: foobar Filters: com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582 com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.VersionFilterSpecification@11436579 } count 10 time 0 engine lucene 2004-02-17 11:13:33,476 [004-3] DEBUG search.TimedResultCache - Retrieved result for com.arsdigita.search.QuerySpecification@b45e718d Specification: { Terms: foobar Filters: com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.ContentTypeFilterSpecification@7111582 com.arsdigita.search.filters.PartyFilterSpecification@d0bcca04 com.arsdigita.search.filters.DateRangeFilterSpecification@213a2e59 com.arsdigita.cms.search.VersionFilterSpecification@11436579 } count 10 time 0 engine lucene 2004-02-17 11:13:33,476 [004-3] DEBUG ui.ResultsPane - Got result set class com.arsdigita.search.CachedResultSet count: 10 2004-02-17 11:13:33,480 [004-3] DEBUG ui.ResultsPane - Paginator stats page number:1 page count: 1 page size: 10 start 0 end: 10 count: 10 2004-02-17 11:13:33,484 [004-3] DEBUG ui.ResultsPane - Outputting documents 2004-02-17 11:13:33,484 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.Minutes:{id=27019}] search minutes 2004-02-17 11:13:33,486 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.Service:{id=26055}] search service 2004-02-17 11:13:33,487 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.Event:{id=26020}] search event 2004-02-17 11:13:33,487 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.PressRelease:{id=26046}] press search 2004-02-17 11:13:33,488 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.Agenda:{id=26003}] search agenda 2004-02-17 11:13:33,489 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.Article:{id=26011}] search article 2004-02-17 11:13:33,490 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.LegalNotice:{id=27010}] search legal 2004-02-17 11:13:33,491 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.NewsItem:{id=26037}] search news 2004-02-17 11:13:33,502 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.MultiPartArticle:{id=26029}] search mpa 2004-02-17 11:13:33,503 [004-3] DEBUG ui.ResultsPane - One doc [com.arsdigita.cms.contenttypes.Job:{id=27002}] search job 2004-02-17 11:14:35,511 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:14:35,511 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:14:35,518 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:16:35,521 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:16:35,521 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:16:35,529 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:18:35,531 [ead-5] INFO lucene.Indexer - Lucene indexer running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'. 2004-02-17 11:18:35,531 [ead-5] DEBUG lucene.Indexer - syncing 2004-02-17 11:18:35,538 [ead-5] INFO lucene.Indexer - Lucene indexer finished running for '/var/ccm-devel/web/jorris/rickshaw/work/lucene'.