Bug 1381720
| Summary: | Uninventory of resource with content leaves orphaned LOB data in the PostgreSQL pg_largeobject table | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | Larry O'Leary <loleary> |
| Component: | Database, Content | Assignee: | Michael Burman <miburman> |
| Status: | CLOSED ERRATA | QA Contact: | Hayk Hovsepyan <hhovsepy> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | JON 3.3.7 | CC: | fbrychta, hhovsepy, miburman, spinder |
| Target Milestone: | ER01 | Keywords: | Triaged |
| Target Release: | JON 3.3.8 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-16 18:45:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1306602 | | |
| Bug Blocks: | | | |
Description
Larry O'Leary
2016-10-04 19:57:26 UTC
There can be data from three different tables in the pg_largeobject:

rhq_raw_config
rhq_package_bits
rhq_drift_file

The knowledge base solution should do the trick, but we shouldn't trust that tool since it might not be on the execution path (and considering all the permission issues), so the functionality should be replicated in a lightweight way.

Find all the oids from those tables, then check pg_largeobject / pg_largeobject_metadata and remove those rows which are not in the three tables. Most likely easiest with a single SQL clause and pure JDBC.

The automatic unlinking would require a new trigger in the database and a database change, so there's a small problem when it comes to a z-stream release.

(In reply to Michael Burman from comment #1)
> There can be data from three different tables in the pg_largeobject:
>
> rhq_raw_config
> rhq_package_bits
> rhq_drift_file
>
> The knowledge base solution should do the trick, but we shouldn't trust that
> tool since it might not be on the execution path (and considering all the
> permission issues), so the functionality should be replicated in a
> lightweight way.
>
> Find all the oids from those tables, then check pg_largeobject /
> pg_largeobject_metadata and remove those rows which are not in the three
> tables. Most likely easiest with a single SQL clause and pure JDBC.

I suppose a separate workaround for newer versions of PostgreSQL could be created. The issue is that pg_largeobject cannot be accessed by a non-superuser and pg_largeobject_metadata was not added until the 9.x stream. This is why the vacuumlo option was chosen. However, the re-index will also require super-user, so perhaps my argument is moot.

> The automatic unlinking would require a new trigger in the database and a
> database change, so there's a small problem when it comes to a z-stream
> release.

Agreed. However, we could do the unlinking ourselves when a delete or update is called on these data types. It would seem Hibernate has to have some callback method or other option for handling this?

I'm going to back up a little bit and suggest that we don't do the "fix" part. I've now implemented the fixing strategy in many places (making Postgres have different delete code than Oracle), but I don't think I can cover all the bases. The problem is that we also have multiple "CASCADE" deletes, and things like the "Resource-delete" that runs more than a dozen different delete queries in a certain given order. Tracking and finding all the possible cascades is painful - I'm not sure I can reliably find all of them. And rewriting all the query blocks that might touch the given tables is also very error prone.

For example, the whole logic of the following query:

https://github.com/rhq-project/rhq/blob/master/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java#L726

must be rewritten in separate stages so that I can put some Postgres-specific code in the middle. The same sort of rewrite must be made in 6-7 other places as well, with some of them needing new exception paths that would also change interfaces (to catch errors). After doing a few of these, I'm no longer recommending going forward with that strategy. It would create a huge amount of QE testing and potential problems, and I'm not sure it would even solve the issue 100%.

Instead, as we need to create the purge job in any case, could we just run it on, say, a daily basis? That would clean up any orphan largeobjects from the database. It's simpler to test and it will catch all the places.
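For illustration, a minimal JDBC sketch of the cleanup query described in comment #1 might look like the following on PostgreSQL 9.0 or later. The LOB column names (contents, bits, data) and the connection details are assumptions for the sketch, not taken from the actual RHQ schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrphanLobCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust for the real RHQ database.
        String url = "jdbc:postgresql://localhost:5432/rhq";
        try (Connection conn = DriverManager.getConnection(url, "rhqadmin", "rhqadmin");
             Statement st = conn.createStatement()) {
            // Assumed LOB columns: rhq_raw_config.contents, rhq_package_bits.bits,
            // rhq_drift_file.data -- verify against the actual schema before use.
            String sql =
                "SELECT lo_unlink(m.oid) FROM pg_largeobject_metadata m " +
                "WHERE m.oid NOT IN (SELECT contents FROM rhq_raw_config WHERE contents IS NOT NULL) " +
                "AND m.oid NOT IN (SELECT bits FROM rhq_package_bits WHERE bits IS NOT NULL) " +
                "AND m.oid NOT IN (SELECT data FROM rhq_drift_file WHERE data IS NOT NULL)";
            int unlinked = 0;
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    unlinked++;
                }
            }
            System.out.println("Unlinked " + unlinked + " orphaned large objects");
        }
    }
}
```

lo_unlink() removes the large object itself, so its pg_largeobject pages disappear with it; on pre-9.0 servers that lack pg_largeobject_metadata, vacuumlo remains the more practical option.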
Obviously, the first clean up run will take a lot longer for customers who already have large databases. Should we trigger it from apply-updates also?

This PR adds a new operation to the postgres-plugin that cleans orphaned items from the database:

https://github.com/rhq-project/rhq/pull/276

The new operation will only work as a workaround and does not fix this issue.

Can we not explicitly remove content on resource uninventory instead of relying on the CASCADE? The workaround is still valid in case we miss the pgsql delete/unlink for some LOBs, but if we can at least handle an explicit unlink in the uninventory method, it will at least address the issue identified in this BZ. It seems package bits and drift would be the biggest offenders here.

Fixed in master (did we have a ticket for the rhq_drift_file?):
commit 012b4f48f0072a4df3995cc3279cdd0cabde6361
Author: Michael Burman <miburman>
Date: Wed Jan 11 16:03:13 2017 +0200
Unlink each object before purging the rows from rhq_drift_file
commit b594e547b05eaca22ac7b2f7cd9e65f07dc5eccb
Author: Michael Burman <miburman>
Date: Wed Jan 11 13:03:23 2017 +0200
RHQ_PACKAGE_BITS is unlinked before removing the orphaned bits
commit 48e839c1b1756da39dc0a00a2b9c5ae42ad23e4c
Author: Michael Burman <miburman>
Date: Wed Nov 30 12:07:22 2016 +0200
[BZ 1381720] Add ability to clean orphaned large objects from the database
Query pg_largeobject_metadata when used with Postgres >= 9.0
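The pattern the first two commits describe, unlinking the large object before deleting the row that owns it so no orphan is left behind, could look roughly like the sketch below. This is illustrative only; the "id" and "data" columns of rhq_drift_file are assumptions rather than the actual RHQ schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DriftFilePurge {

    // Sketch only: the "id" and "data" column names of rhq_drift_file are assumed.
    static void deleteDriftFile(Connection conn, int driftFileId) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement unlink = conn.prepareStatement(
                 "SELECT lo_unlink(data) FROM rhq_drift_file WHERE id = ? AND data IS NOT NULL");
             PreparedStatement delete = conn.prepareStatement(
                 "DELETE FROM rhq_drift_file WHERE id = ?")) {
            unlink.setInt(1, driftFileId);
            unlink.executeQuery();      // unlink the large object first ...
            delete.setInt(1, driftFileId);
            delete.executeUpdate();     // ... then remove the row that referenced it
            conn.commit();              // both steps succeed or fail together
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```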
Note on this and BZ 1381725: there are multiple paths that can cause additional issues with large objects (such as using the Java API). These fixes do not catch all the potential places.

Moving to ON_QA as available for test with build:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=534002

http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/110/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip maps to the ER01 build of JON 3.3.8.

Verified on the ER02 build of JON 3.3.8. No orphan content is left in the pg_largeobject table after uninventory of a resource.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0285.html