Bug 705935

Summary: Idle transactions kept open forever make VACUUM useless
Product: [Community] Spacewalk Reporter: Andreas Rogge <a.rogge>
Component: ServerAssignee: Jan Pazdziora (Red Hat) <jpazdziora>
Status: CLOSED CURRENTRELEASE QA Contact: Red Hat Satellite QA List <satqe-list>
Severity: high Docs Contact:
Priority: unspecified    
Version: 1.4CC: jpazdziora, mathieu-acct
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: osad-5.10.16-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-16 10:08:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 723481    
Attachments:
Description Flags
Patch that adds rhnSQL.commit() into the main loop of osa-dispatcher
none
Add a rhnSQL.commit() just after self._req_processor.process()
none
add rhnSQL.commit() after process() to flush dangling transaction
none
introduce rhnSQL.commit() after the both SELECT statements that seem to be the main loop none

Description Andreas Rogge 2011-05-18 22:34:03 UTC
Description of problem:
Seems like Spacewalk is keeping database transactions open which leads to unremovable dead rows in tables.

Version-Release number of selected component (if applicable):
spacewalk-postgresql-1.3.3-1.el5
postgresql84-server-8.4.5-1.el5_5.1

How reproducible:
always

Steps to Reproduce:
1. Restart spacewalk (spacewalk-service restart)

2. Run VACUUM VERBOSE rhnchannelnewestpackage
[...]
INFO:  "rhnchannelnewestpackage": found 661960 removable, 11794 nonremovable row versions in 5616 out of 5616 pages
[...]

3. Run explain analyze select * from rhn_channel.refresh_newest_package(101, E'Test'); several times. You'll notice somewhat increasing query runtimes (hence the "explain analyze")

4. Run VACUUM VERBOSE rhnchannelnewestpackage again
will look like the following:
[...]
INFO:  "rhnchannelnewestpackage": found 0 removable, 96818 nonremovable row versions in 838 out of 5616 pages
[...]

5. Restart spacewalk again

6. Run VACUUM VERBOSE rhnchannelnewestpackage once again
[...]
INFO:  "rhnchannelnewestpackage": found 88504 removable, 8314 nonremovable row versions in 838 out of 5616 pages
[...]
  
Actual results:
Because of PostgreSQL's MVCC rows that might be in use by a running transaction will not be removed (i.e. garbage collectd) by VACCUM.
As spacewalk keeps transactions idle for a long time (see SELECT * FROM pg_stat_activity ODER BY xact_start) only rows that are older than the oldest running transaction will be collected.

Expected results:
Spacewalk should not keep idle transactions. A transaction should only ever be started when there's work to do and should be comitted or rolled back immediately after the work is done.

Additional info:
Especially the example given here (rhnchannelnewestpackage) is a problem. Before I noticed this bug my table grew to more than 1 GB. After a VACUUM FULL with no concurrent transactions it is now around 40 MB.

As rhn_channel.refresh_newest_packages() rewrites large parts of the table each time it is called the number of dead rows grows really quick.

Possible Workaround:
Schedule a spacewalk restart every hour or so

Comment 1 Andreas Rogge 2011-06-01 08:53:26 UTC
Setting to high priority as this is somewhat like a memory leak. 
The database leaks diskspace until you restart the software. That's at least as bad as a memory leak.

Comment 2 Andreas Rogge 2011-06-02 22:57:00 UTC
I've identified three sources of these dangling transactions.

1. osa-dispatcher
2. task-o-matic
3. the /XMLRPC handler

"fixing" osa-dispatcher is simple. Adding rhnSQL.commit() calls after the two queries appearing in the main-loop makes the transactions close.

As osa-dispatcher is the "bad guy" (i.e. the component that keeps transactions open for days and weeks) this already helps a lot.

Comment 3 Andreas Rogge 2011-06-02 23:00:03 UTC
Created attachment 502679 [details]
Patch that adds rhnSQL.commit() into the main loop of osa-dispatcher

Please have a look at the attached patch and tell me wether that is a feasible workaround.

I can also provide a patch for git master if that helps.

Comment 4 Andreas Rogge 2011-06-02 23:36:18 UTC
Created attachment 502691 [details]
Add a rhnSQL.commit() just after self._req_processor.process()

This adds a rhnSQL.commit() just after the request has been processed. 
It does not look like it is a great place for that, but it is where rhnSQL.rollback() is done in case the request threw an exception, so I guess it should be ok.

Comment 5 Andreas Rogge 2011-06-03 04:00:35 UTC
Created attachment 502711 [details]
add rhnSQL.commit() after process() to flush dangling transaction

New patch against git master

Comment 6 Andreas Rogge 2011-06-03 04:01:32 UTC
Created attachment 502712 [details]
introduce rhnSQL.commit() after the both SELECT statements that seem to be the main loop

New patch against git master

Comment 7 Jan Pazdziora (Red Hat) 2011-06-07 11:11:30 UTC
(In reply to comment #6)
> Created attachment 502712 [details]
> introduce rhnSQL.commit() after the both SELECT statements that seem to be the
> main loop
> 
> New patch against git master

Committed to Spacewalk master as 5ac15a79d42a4a334bf162957a3de5f8c0536f50.

Comment 8 Jan Pazdziora (Red Hat) 2011-07-20 11:48:35 UTC
Aligning under space16.

Comment 9 Jan Pazdziora (Red Hat) 2011-09-16 10:03:41 UTC
Taking.

Comment 10 Jan Pazdziora (Red Hat) 2011-09-16 10:08:23 UTC
I'm going to move this bugzilla to CLOSED CURRENTRELEASE as the osa-dispatcher patch was implemented and released.

As for the apacheHandler.py patch, I am not too fond of the commit. There was more discussion about it in thread

http://www.redhat.com/archives/spacewalk-devel/2011-June/msg00004.html

and we will need to find out what is it that leaves the transactions behind. If we have more information or patch which would help us to find that out, please either open new bugzilla or followup in that thread.

We did not have any patch for the taskomatic side -- again, if anyone comes with an investigation or patch, please file new bugzilla or post to spacewalk-devel mailing list.