Bug 873379 - Out of memory with Spacewalk 1.8 and Postgres 8.4.13
Summary: Out of memory with Spacewalk 1.8 and Postgres 8.4.13
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 1.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jan Pazdziora
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On:
Blocks: space19
 
Reported: 2012-11-05 17:07 UTC by Valeriy Vasiliev
Modified: 2013-03-08 19:25 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-08 19:25:19 UTC
Embargoed:


Attachments
Postgres queries describing locks, process listing, file listings. (111.80 KB, text/plain)
2012-11-05 17:07 UTC, Valeriy Vasiliev
postgres statistics showing increase in connections and locks starting at 11pm (179.91 KB, application/pdf)
2012-11-14 09:52 UTC, James Hogarth
postgres statistics showing increase in connections and locks starting at 11pm (176.82 KB, application/pdf)
2012-11-15 10:26 UTC, James Hogarth
postgres statistics showing increase in connections and locks starting at 11pm (160.56 KB, application/pdf)
2012-11-16 10:18 UTC, James Hogarth
postgres statistics showing increase in connections and locks starting at 9am (162.14 KB, application/pdf)
2012-11-16 15:04 UTC, James Hogarth
Locks linked to pid that is IDLE in transaction (7.35 KB, text/plain)
2012-11-21 10:49 UTC, James Hogarth
details of the spawned httpd process holding the db locks/transaction (24.99 KB, text/plain)
2012-11-22 10:20 UTC, James Hogarth
postgres log for the transaction idle since 00:49 this morning and hanging the httpd process (35.94 KB, text/plain)
2012-11-23 11:22 UTC, James Hogarth

Description Valeriy Vasiliev 2012-11-05 17:07:30 UTC
Created attachment 638747 [details]
Postgres queries describing locks, process listing, file listings.

I'm running Spacewalk 1.8 on CentOS 6.3 with Postgres 8.4.13.

I performed the upgrade according to the trac article "HowToUpgrade".

The machine was set up with Spacewalk 1.7 and Postgres 8.4.12 a few months ago.

Around 50 hosts are registered with this Spacewalk server.
This Spacewalk installation has kickstarted all those hosts and is serving them with various channels. The channels are synced with reposync/quartz, and the sync has been configured through Spacewalk's UI; for example, one of the repos/channels is synced on Monday mornings.

This seems to be the time when the machine starts leaking memory.
According to `free` and `htop` there is almost no memory available anymore.
Inspecting the pg_locks table shows that Postgres is holding 26686 locks.

The Spacewalk Web-UI and the API become completely unresponsive.
But I'm still able to access the machine through ssh.
`cobbler list`, for example, hangs as well.

It seems as if there were at least 540 `postmaster` pgsql processes running.
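
For reference, counts like the ones above can be gathered with something along these lines (a sketch, not necessarily the exact commands used for the attachment):

su - postgres -c "psql -t -c 'select count(*) from pg_locks;'"   # total lock count
ps -C postmaster --no-headers | wc -l                            # postmaster process count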

I had hoped that the issue would go away with 1.8, since I observed the same behaviour with 1.7, but I can't account for the details of the 1.7 period: I only compiled the facts attached to this issue today, not back then.

A reboot seems to fix this issue for a few days.

I think it is somehow related to taskomatic/quartz, but I'm not certain.

I've made the attachment as detailed as possible. I won't reboot the machine for the next few days, so it's possible to inspect the behaviour live. If you need to see certain log files, let me know.

Comment 1 James Hogarth 2012-11-14 09:51:45 UTC
Interesting - I'm trying to track down a similar issue that appears to have occurred between 1.7 and 1.8 with a postgres backend...

Right now I'm trying to narrow it down, but around 11pm (so I'm thinking the config compare in the scheduled task list) the number of connections and locks increases significantly until the postgres server starts refusing new connections...

I've tried a new postgres jdbc jar (as suggested on the mailing list to solve the long-idle connections) but it hasn't really helped...

I'm wondering if OSAD is having an impact - I've only just recently enabled it... 

My next steps were to get some stats tonight on the old postgres jdbc jar and then tomorrow night to try disabling OSAD to check the behaviour then...

Attached is a PDF of the postgres statistics from my Icinga instance (graphs generated by PNP4Nagios), where you can see the steady increase in locks and connections until it just falls over...

Comment 2 James Hogarth 2012-11-14 09:52:34 UTC
Created attachment 644715 [details]
postgres statistics showing increase in connections and locks starting at 11pm

Comment 3 Valeriy Vasiliev 2012-11-14 11:07:33 UTC
James, I'd be thankful for any feedback regarding the jdbc/OSAD involvement.
I'm using the standard jdbc adapter, but I have been using OSAD since day 1 of the Spacewalk setup.

Comment 4 James Hogarth 2012-11-15 10:26:53 UTC
Created attachment 645559 [details]
postgres statistics showing increase in connections and locks starting at 11pm

Yesterday the small update for 1.8 was applied and the postgres jdbc jar was reverted to the older one from the standard repos, and as you can see similar behaviour was observed...

Starting at around 11pm there was a roughly linear increase in locks (with a peak of around 20,000 AccessShare locks), which corresponds with the increase in connections, waiting queries and long-idle connections...

Tonight I'll disable osad on all the hosts before heading home to determine whether that is having an impact or not...

Comment 5 Valeriy Vasiliev 2012-11-15 14:00:38 UTC
Looking forward to seeing your results.

Comment 6 James Hogarth 2012-11-16 10:18:31 UTC
Created attachment 646275 [details]
postgres statistics showing increase in connections and locks starting at 11pm

So this was fairly interesting... 

As you can see in the attached PDF with OSAD disabled the number of concurrent connections stayed roughly stable with just rhnsd on its default 4 hour schedule checking in (and all the configuration differences did indeed run overnight as per usual)...

My next step in trying to track down what triggers the edge case that causes things to fall over is to re-enable OSAD and then run a 'diff all' outside of the regular scheduled bundle, and see how things perform then - to see if I can get it to lock up on demand or if this is a bit of a red herring...

Comment 7 Valeriy Vasiliev 2012-11-16 14:16:23 UTC
For roughly ten days now I haven't experienced this behaviour.
This is getting strange.

Comment 8 James Hogarth 2012-11-16 15:04:18 UTC
Created attachment 646369 [details]
postgres statistics showing increase in connections and locks starting at 9am

That's especially weird - take a look at my statistics today...

I re-enabled OSAD starting around 9am or so and - bam - connections and locks jumped up...

I'm going to disable OSAD for a few days - I don't need the immediate response yet - and see how it goes...

Unfortunately, with the amount of time between 1.7 and 1.8 there are a lot of commits (if it is indeed a regression of some nature), so I think this might be tricky to track down...

Comment 9 Valeriy Vasiliev 2012-11-19 18:19:59 UTC
I noticed that Spacewalk+Postgresql gave up on Monday morning once again. Yesterday I received the daily Spacewalk mail; today the server OOM'd. I know that quartz starts reposyncing on Monday morning. In a few days I may reschedule the sync to Friday to see if that is really the main concern.

How are things looking at your end, James?

Comment 10 James Hogarth 2012-11-20 09:40:14 UTC
Well this was interesting...

So from the 16th through to now OSAD has been disabled, but it looks like that resulted in a slow leak as opposed to something that happened over a few hours...

After a few days running I was showing locks like:

764 Locks -- AccessShareLock(719), ExclusiveLock(7), RowExclusiveLock(37), RowShareLock(1), 

It hadn't quite fallen over but was obviously on its way there... interestingly the number of connections was relatively low (66 with 33 idle) ...

In terms of waiting statements on postgres, there were seven matching this pattern:

update rhnServerAction
set status = 1,
pickup_time = current_timestamp,
remaining_tries = 4 - 1
where action_id = 5944
and server_id = 1000010030

I'm not sure what is active and locking the DB right now...

In the next couple of days I'm going to add some more postgres logging to try and catch what's taking the locks etc. - I'm pretty sure there's some kind of leak at this point, but I think it's going to be tricky getting to the bottom of it.
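
For reference, both the waiting statements and the long-lived idle transactions can be listed with a query along these lines (a sketch using the PostgreSQL 8.4 column names; 'spaceschema' is the database name that appears later in this report):

-- waiting statements and idle-in-transaction backends, oldest transaction first
SELECT procpid, usename, waiting, xact_start, current_query
FROM pg_stat_activity
WHERE datname = 'spaceschema'
  AND (waiting OR current_query = '<IDLE> in transaction')
ORDER BY xact_start;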

Comment 11 James Hogarth 2012-11-21 10:49:45 UTC
Created attachment 649139 [details]
Locks linked to pid that is IDLE in transaction

This morning (still no OSAD) my icinga instance alerted on one waiting query - which seemed like a good time to take a more detailed look...

Checking running queries it seemed suspicious that there was an IDLE connection still in transaction:

 spaceschema | spaceuser |   18577 |             | f       | 2012-11-21 02:37:04.24375+00  | <IDLE> in transaction

Checking that pid there were 32 AccessShareLocks, 9 RowExclusiveLocks, 1 RowShareLock and 2 ExclusiveLocks - the latter linked to a transaction ID and a virtualxid (see attachment).

Interestingly there was nothing scheduled to run at 02:37 so this is reasonably likely to just be a rhn_check from rhnsd doing this...

Unfortunately so far as I can see postgres 8.4 does not appear to have a way to view a 'stalled' transaction as it were - all I can say is that it's not a prepared transaction...

I've got the pg_xlog segment that's been unlinked but held open by the relevant pid - but naturally it's a 16M (binary) file of all transactions that occurred at the time this pid was generated, so it appears to be of limited use (needle in a haystack comes to mind)...

I'm tempted to do some verbose logging on postgres - but I'm wary of how much disk space that could take until the leaked transaction makes a reappearance...

When I get some time tomorrow (or at the weekend) that seems like a decent next step though - it's pretty clear there is a leaked transaction somewhere that doesn't complete, leaves things locked and prevents other actions (this is on the postgres jar shipped with the OS and not the updated one mentioned on the mailing lists).
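
For reference, a per-pid lock breakdown like the one above can be produced with something along these lines (a sketch against the PostgreSQL 8.4 catalogs, where pg_stat_activity still uses procpid/current_query):

-- lock modes held by backends sitting "<IDLE> in transaction"
SELECT l.pid, l.mode, count(*) AS locks
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE a.current_query = '<IDLE> in transaction'
GROUP BY l.pid, l.mode
ORDER BY l.pid, count(*) DESC;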

Comment 12 James Hogarth 2012-11-21 11:12:44 UTC
As a small addition, to take a quick look at what might be holding this open I stopped the Spacewalk services one at a time, running:

select * from pg_stat_activity where current_query like '%in transaction%' and datname='spaceschema';

between each service stop ... oddly it was when httpd was stopped that the query returned nothing - and not tomcat6, monitoring, monitoringscout or rhn-search as I was expecting (note that I did not check to see if any tomcat processes were being held open by httpd, which in retrospect I should have)...


I'm going to set up a check to look for the IDLE in transaction symptom and when it next happens take a more detailed look at httpd, tomcat6, postgres and the relationships between them at that point...
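
A scripted version of that elimination approach might look roughly like this (a sketch only; the list of service names is an assumption based on a typical Spacewalk 1.8 install and should be adjusted):

#!/bin/bash
# Stop Spacewalk services one at a time and check after each stop
# whether the idle-in-transaction backend has gone away.
QUERY="select procpid, backend_start, current_query from pg_stat_activity \
where current_query like '%in transaction%' and datname='spaceschema';"

for svc in rhn-search osa-dispatcher Monitoring MonitoringScout tomcat6 httpd; do
    service "$svc" stop
    echo "=== after stopping $svc ==="
    su - postgres -c "psql -t -c \"$QUERY\""
done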

Comment 13 Valeriy Vasiliev 2012-11-21 13:34:51 UTC
James, thank you so much for taking the time to thoroughly investigate this issue!

Comment 14 James Hogarth 2012-11-22 10:20:31 UTC
Created attachment 649654 [details]
details of the spawned httpd process holding the db locks/transaction

A little more information again today....

Rather than killing everything outright I decided to take a scalpel rather than a sledgehammer approach.

This morning I came in to find my monitoring spamming my email about an IDLE in transaction and the locks related to it.

Restarting tomcat6 made no difference - while tomcat6 had been stopped by the service command, there were no apparent instances of httpd holding open a process and preventing it from stopping, etc.

Looking at httpd a little more closely after yesterday, there was one spawned process that stood out, as it had a start date of yesterday whereas all the others had been HUPed by logrotate this morning.

That process did not respond to TERM, but when a KILL was sent the idle-in-transaction connection (and its locks) in the postgres database went away.

The details for that PID (lsof -p and the ps that showed it as 'odd') are attached.

Looking at the time of the transaction start and comparing to apache logs and rhn_server_xmlrpc.log it would appear to coincide with one of my systems checking in and carrying out a configuration difference comparison...

At this time I don't know if this was the same system as yesterday, if this was the first system to check in, and so on... I do know that running a manual check of this system (initiated from spacewalk and run via rhn_check) does not result in the same behaviour.

Tonight I'll add some disk space and enable verbose database logging to see if/when it happens what exactly is going on at the database level...

Why would httpd hold open a transaction that appears to be associated with a configuration file diff? ...

Comment 15 Milan Zázrivec 2012-11-22 10:30:11 UTC
(In reply to comment #14)
> Why would httpd hold open a transaction that appears to be associated with a
> configuration file diff? ...

Because spacewalk-backend -- a mod_wsgi application -- runs in apache's
context and it is the part that the clients communicate with when doing
config management things.

Comment 16 James Hogarth 2012-11-22 10:57:45 UTC
Thanks Milan - that's very useful to know and I believe we're getting closer to seeing what's going on ...

Interestingly, the system that correlated with a compare event at the time this all happened was the only system in the logs (other systems did run their own checks both before and afterwards) that did not have an xmlrpc/registration.welcome_message('lang: None',) entry in rhn_server_xmlrpc.log - when it ran successfully later on (having kill -9'd that httpd process) this entry did then appear for the successful run.

Hopefully when(?) it happens tonight there will be more correlating factors that might narrow this down further and isolate the cause.

Comment 17 James Hogarth 2012-11-22 11:04:10 UTC
Looking at the changes that could have a potential impact on get server capabilities that landed in 1.8 I'm wondering if Commit d219b472d26564d961f9f1746a7ff0eae1119e07 could be related to this...

What I find even more intriguing is that this impacts both rhncfg and osad behaviour so far as I can see which again correlates with the symptoms.

Comment 18 Milan Zázrivec 2012-11-22 11:14:03 UTC
(In reply to comment #17)
> Looking at the changes that could have a potential impact on get server
> capabilities that landed in 1.8 I'm wondering if Commit
> d219b472d26564d961f9f1746a7ff0eae1119e07 could be related to this...

Unlikely I think.

This change just took the code from rhnlib and moved it to both
osad and rhncfg.

Comment 19 James Hogarth 2012-11-22 11:34:57 UTC
Sorry, that's a red herring - this is a server-side, not a client-side, issue and that commit only changed client-side behaviour (and just moved code, as you pointed out)...

Possibly then backend/server/rhnCapability.py could be a key file... and the only change there not in 1.7 was the new abrt stuff in Commit c978e55e676c2be9c1fdb01ac0313c754d70edf1 ... and I'm not sure how that could end up with an uncommitted transaction left idling, holding open the locks and the httpd process...

Given how hard it seems to be to replicate the behaviour I think this will now have to wait for similar behaviour in the next 24 hours or so with the more detailed postgres logging.

Comment 20 James Hogarth 2012-11-23 11:22:46 UTC
Created attachment 650335 [details]
postgres log for the transaction idle since 00:49 this morning and hanging the httpd process

So my monitoring consisted of a cronjob that ran every 5 minutes to check for idle in transaction and print locks (then mail it all to me)...
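
A job along those lines might look like this (a minimal sketch; the actual script is not attached, so the queries and the mail recipient are assumptions):

#!/bin/bash
# Run from cron every 5 minutes: mail a report if any backend is
# sitting "<IDLE> in transaction" (PostgreSQL 8.4 column names).
IDLE=$(su - postgres -c "psql -t -c \"select procpid, xact_start, client_addr, client_port \
from pg_stat_activity where current_query = '<IDLE> in transaction';\"")

if [ -n "$(echo "$IDLE" | tr -d '[:space:]')" ]; then
    {
        echo "Idle-in-transaction backends:"
        echo "$IDLE"
        echo
        echo "Lock summary:"
        su - postgres -c "psql -t -c \"select mode, count(*) from pg_locks group by mode order by count(*) desc;\""
    } | mail -s "spacewalk: IDLE in transaction detected" root
fi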

What this has shown (not attached for now since it's a lot of data for little gain) is that idle in transaction is a normal thing - it happens fairly often overnight, but the vast majority of the time it goes away and only gets stuck on occasion... once stuck it stays there, though. Interestingly, most of these connections are over 127.0.0.1:5432, but the stuck one is over a unix socket (there was also another during the night over a unix socket that did not stick but appeared in my monitoring).

I again see the symptom of the stuck process in the rhn_web logs as being unique in not having a registration.welcome_message associated with it:

49:11 +01:00 29053 10.139.8.5: xmlrpc/queue.get(1000010030, 2, 'checkins enabled')
2012/11/23 00:49:11 +01:00 32703 10.139.8.5: xmlrpc/up2date.login(1000010030,)
2012/11/23 00:49:12 +01:00 32508 10.139.8.5: xmlrpc/queue.submit(1000010030, 6081, 0)

The system in question has:

2012/11/23 08:49:11 +01:00 25102 10.139.8.5: xmlrpc/queue.get(1000010030, 2, 'checkins enabled')

Roughly every 4 hours since the 'stick', but nothing else.

This is not the same system as yesterday.

There is a similar 'stuck' httpd process as yesterday.

strace shows the httpd process doing nothing at all - and /proc/28402/status shows the state as sleeping.

The action ID involved in the log is 'show differences between files' according to the spacewalk schedule screen and this remains in progress.

Again a regular TERM did nothing for the sleeping httpd process. 

A KILL resulted in this in the postgres logs (which I would expect is normal):

spaceuser%spaceschema%1468%10767046%22/64LOG:  unexpected EOF on client connection

No transaction rollback or anything like that in the logs though but as expected the IDLE in transaction and associated locks were gone.

Oddly, when trying to view the history for a system I'm getting this - which I have not seen before, so I don't necessarily think it's linked, but I'm including it for completeness:

[Fri Nov 23 11:15:06 2012] [error] [client 10.243.20.36] APR::Request::Error=HASH(0x7fa873924428), referer: https://ucbdspacewalk.bigdata.bskyb.com/rhn/schedule/CompletedSystems.do?aid=6099

That's after a full satellite and postgres restart, on any system at all, even just trying to go into the event page for a system and not a particular event...

Comment 21 James Hogarth 2012-11-23 11:42:14 UTC
Sometimes I wish you could edit comments - ignore the APR::Request::Error thing ... cookies in my Chrome got weird for the spacewalk servers I'm responsible for apparently ...

I don't think this needs a large number of servers registered to show symptoms however... a smaller instance of only 14 systems showed 8 IDLE in transaction connections (that had been idle 1 day so definitely hanging) ...

I'm going to add monitoring for this behaviour here as well although I've had to turn down the postgres logging due to disk space concerns.

Comment 22 Valeriy Vasiliev 2012-11-27 15:13:23 UTC
James, were you able to identify the issue?
I'm still seeing the behaviour as described.

Comment 23 James Hogarth 2012-11-29 10:45:59 UTC
I'm seeing the behaviour as described - I'm pretty sure now it's related to the config management part of things...

For now I'm doing daily restarts as I don't have sufficient free time to look at this with other work demands at the moment...

I'll be able to dig in more again next week...

Comment 24 James Hogarth 2012-12-04 10:02:57 UTC
Valeriy, if you want to do a similar thing, I'm running this script as root every 10 minutes...

#!/bin/bash
# Restart Spacewalk and postgresql if the number of AccessShareLocks
# grows beyond a threshold (a symptom of the leaked transactions above).

chk_cmd="psql -t -c \"select count(*) from pg_locks where mode='AccessShareLock';\""

lock_count="$(su - postgres -c "$chk_cmd")"

if (( $lock_count > 2000 ))
then
    rhn-satellite stop
    service postgresql stop
    sleep 5
    service postgresql start
    rhn-satellite start
fi

It's very hacky but keeps things working for now with minimal intervention....
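
For reference, an /etc/cron.d entry along these lines would run it every 10 minutes (the script path and file name are assumptions):

# /etc/cron.d/spacewalk-lock-watchdog
*/10 * * * * root /usr/local/sbin/spacewalk-lock-watchdog.sh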

I'm in the process of building out puppet to replace spacewalk configuration management, and when that's done I'll be removing all the config management in spacewalk - we'll see then if the symptoms persist (so not definitely config management related) or if they then go...

Comment 25 Valeriy Vasiliev 2012-12-04 10:26:47 UTC
I've thought about implementing a similar hacky solution... yet I've encountered a different challenge involving osa-dispatcher and a few clients that for whatever reason stopped responding to the jabber server. It's only a small set of hosts - no plausible explanation.
Freakishly annoying - I can't deploy config files properly anymore.
I dug deep into the jabberd, osa-dispatcher and osad logs, removed the jabberd DB from /var, but couldn't find out what the exact problem was.

So I'd be glad to hear about your transition to Puppet, the removal of config channels from Spacewalk and how that affects the locking issue.

In my opinion Spacewalk offers a great overview of the config files deployed on the servers, as well as the various revisions of similar configs across different types of systems. It also integrates tightly with provisioning. And yet you are switching to Puppet - is Puppet a better solution in this case, does it offer more control and more structure?

Thank you James for your investigations and findings.

Comment 26 James Hogarth 2012-12-07 13:03:55 UTC
FYI someone on the mailing lists mentioned this patch yesterday:

https://bugzilla.redhat.com/show_bug.cgi?id=843723

I've just built it and initial checks look good...

I've removed my 'hack' and am letting it run like this over the weekend (osad enabled too).

The patch applies cleanly to the python-psycopg2 SRPM in CentOS 6.3 ...
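
For anyone wanting to reproduce the build, the usual SRPM rebuild steps look roughly like this (a sketch; the patch file name is an assumption, yumdownloader needs the CentOS source repo available, and rpmbuild may ask for BuildRequires first):

# fetch and unpack the source package
yumdownloader --source python-psycopg2      # from yum-utils
rpm -ivh python-psycopg2-*.src.rpm

# copy the patch from bug 843723 into ~/rpmbuild/SOURCES, add a
# "Patch0: psycopg2-connection-cleanup.patch" line plus "%patch0 -p1"
# in %prep to the spec file, then rebuild and install:
cd ~/rpmbuild/SPECS
rpmbuild -ba python-psycopg2.spec
rpm -Uvh ~/rpmbuild/RPMS/*/python-psycopg2-*.rpm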

Comment 27 James Hogarth 2012-12-10 08:59:51 UTC
Well, although with this patch in place the system didn't outright fall over with max connections being reached and so on, there are still IDLE connections stuck in transaction holding open some locks...

So it's better, but there still appears to be some sort of underlying problem leaking transactions/connections.

Comment 28 Valeriy Vasiliev 2012-12-10 22:45:19 UTC
This problem is definitely related to config diffs running at night.
I implemented a warning based on your script; the diffs ran, a few of them hung, and shortly after that I got a warning that the locks had increased and had to reboot the machine.
The same machines that cause the locks to rise are also unresponsive over osa-dispatcher for me.
It's strange and there is no fix yet, which is sad.

Comment 29 Stefan Kutzke 2013-01-20 14:07:35 UTC
> This problem definitely related to config diffs running at night.
Yes, I'm seeing the same issue and can't find a solution. There are one or more systems (not the same every night) with blocking PIDs in the database when the config diff is running:
 update rhnServerAction
 set status = 1,
 pickup_time = current_timestamp,
 remaining_tries = 4 - 1
 where action_id = 5944
 and server_id = 1000010067
I'm not seeing the registration.welcome_message for these systems in /var/log/rhn/rhn_server_xmlrpc.log.
I will remove the compare-configs-default schedule to keep my Spacewalk running.
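
For reference, the backend holding the lock such an UPDATE is waiting on can be identified with a query along these lines (a sketch against the PostgreSQL 8.4 catalogs; it only catches waits on another backend's transaction ID, which matches the pattern above):

-- pair each waiting statement with the backend holding the conflicting transaction lock
SELECT blocked.pid                AS blocked_pid,
       blocked_act.current_query  AS blocked_query,
       blocker.pid                AS blocking_pid,
       blocker_act.current_query  AS blocking_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked_act ON blocked_act.procpid = blocked.pid
JOIN pg_locks blocker ON blocker.transactionid = blocked.transactionid
                     AND blocker.pid <> blocked.pid
                     AND blocker.granted
JOIN pg_stat_activity blocker_act ON blocker_act.procpid = blocker.pid
WHERE NOT blocked.granted;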

Comment 30 Stefan Kutzke 2013-01-20 16:30:07 UTC
I have just tested a config diff via SSM (all systems and all configs selected) and I got the same issue: there are 2 systems on which rhn_check doesn't finish. I'm seeing the following message in /var/log/osad (debug_level = 1 in /etc/sysconfig/osad.conf):
 2013-01-20 17:01:37 osad_client._check_signature: Dropping message, drift is too big 122.814021111
It's not a problem with time - NTP is running fine. I think this message occurs because rhn_check could not finish within 120 seconds (max_time_drift = 120 in osad.conf). Running rhn_check manually results in an error about a communication problem with jabberd. To solve this problem I do the following steps (sketched as a script after the list):
1. Stop osad
2. Kill rhn_check process
3. Remove /etc/sysconfig/rhn/osa-auth.conf
4. Terminate blocking Postgres processes
5. Start osad
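
Scripted, those steps look roughly like this (a sketch only: steps 1-3 and 5 run on the affected client, step 4 on the Spacewalk server, and pg_terminate_backend() - available from PostgreSQL 8.4 - stands in for manually killing the blocking backends):

# -- on the affected client --
service osad stop
pkill -f rhn_check
rm -f /etc/sysconfig/rhn/osa-auth.conf

# -- on the Spacewalk server: terminate the stuck backends --
su - postgres -c "psql -t -c \"select pg_terminate_backend(procpid) \
from pg_stat_activity where current_query = '<IDLE> in transaction';\""

# -- back on the client --
service osad start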

Comment 31 Valeriy Vasiliev 2013-01-21 02:51:40 UTC
Stefan, take a look at this thread: https://www.redhat.com/archives/spacewalk-list/2012-December/msg00055.html
Rebuilding python-psycopg2 with the patch from https://bugzilla.redhat.com/show_bug.cgi?id=843723 seems to solve the problem.

Comment 32 Stefan Kutzke 2013-01-21 11:02:34 UTC
Valeriy, thank you for the hint. I have built a new python-psycopg2 package with the patch for my Spacewalk machine (running the latest CentOS 6.3). After updating python-psycopg2 the problem seems to be solved. Single schedules of compare-configs-bunch and config diffs via SSM were successful. No hanging Postgres processes or OSAD errors up to now. I will keep an eye on it...

Comment 33 Jan Pazdziora 2013-03-08 16:08:53 UTC
(In reply to comment #31)
> Stefan, take a look at this thread
> https://www.redhat.com/archives/spacewalk-list/2012-December/msg00055.html 
> rebuilding python-psycopg2 with the patch from
> https://bugzilla.redhat.com/show_bug.cgi?id=843723 seems to solve the
> problem.

Valeriy, can we mark this bugzilla as resolved?

Comment 34 Valeriy Vasiliev 2013-03-08 16:23:01 UTC
I'd say it's solved - I've been running on the patched psycopg2 for weeks now without any issue.

Comment 35 Jan Pazdziora 2013-03-08 19:25:19 UTC
Thank you for the confirmation.

And kudos to you guys for the detailed investigation on the issue.

