Bug 1168328 - Satellite crashes on large system sets
Summary: Satellite crashes on large system sets
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: WebUI
Version: 2.2
Hardware: All
OS: Linux
high
high
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On: 1161274 1168322
Blocks: space23
TreeView+ depends on / blocked
 
Reported: 2014-11-26 16:41 UTC by Stephen Herr
Modified: 2015-04-14 19:02 UTC (History)
4 users (show)

Fixed In Version: spacewalk-java-2.3.108-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1168322
Environment:
Last Closed: 2015-04-14 19:02:42 UTC
Embargoed:


Attachments (Terms of Use)

Description Stephen Herr 2014-11-26 16:41:56 UTC
+++ This bug was initially created as a clone of Bug #1168322 +++

+++ This bug was initially created as a clone of Bug #1161274 +++

Description of problem:
Satellite crashes on large system sets. When trying actions on systems (100 or so or more) in ssm, the satellite crashes and services need restarted.
basically moving 100 systems to a different base channel, the system runs a timeout issue. it seems the database is not performing as expected for several operations


Version-Release number of selected component (if applicable):
satellite-schema-5.6.0.19-3.el5sat


How reproducible:
100% kinda sorta - It never completes and sometimes will cause the services to stop working.

https://<hostname>/rhn/channel/ssm/ChildSubscriptionsConfirm.do

Steps to Reproduce:
1. On the satellite, use ssm or select the systems you want to change the channel for.
2. try to change the channel
3. https://<hostname>/rhn/ssm/ViewAllLog.do shows a few in progress. 

Actual results:
The satellite currently has a few showing in progress, so maybe that's why it's not really showing the exact error as before, but either way, the channel change never completes and it usually will cause the satellite to stop working.

Expected results:
The channels to change correctly.

Additional info:

--- Additional comment from Marcelo Moreira de Mello on 2014-11-11 13:16:02 EST 
catalina.out
===============
2014-11-12 19:37:12,626 [TP-Processor7] DEBUG net.rkbloom.logdriver.LogCallableStatement - executing CallableStatement: '{call rhn_channel.unsubscribe_server(?, ?)}' with bind parameters: {1=1000016731, 2=365} out parameters: {}

a lot of hibernate activity when changing the channels


postgresql
---------
2014-11-12 19:37:30.930 EST LOG:  duration: 18303.816 ms  execute <unnamed>: select * from rhn_channel.unsubscribe_server($1, $2) as result
2014-11-12 19:37:30.930 EST DETAIL:  parameters: $1 = '1000016731', $2 = '365'

2014-11-12 19:37:41.622 EST LOG:  duration: 6179.462 ms  execute <unnamed>: select * from rhn_channel.unsubscribe_server($1, $2) as result
2014-11-12 19:37:41.622 EST DETAIL:  parameters: $1 = '1000041275', $2 = '3550'



top - 19:39:12 up 4 days,  2:27,  3 users,  load average: 1.27, 0.46, 0.16
Tasks: 168 total,   2 running, 166 sleeping,   0 stopped,   0 zombie
Cpu(s): 45.3%us,  0.5%sy,  0.0%ni, 42.3%id, 11.6%wa,  0.2%hi,  0.2%si,  0.0%st
Mem:   4050448k total,  3760620k used,   289828k free,   156812k buffers
Swap:  1020116k total,      104k used,  1020012k free,  2747500k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                        
28386 postgres  18   0  550m 247m 233m R 88.5  6.3   1:06.99 postmaster                     
27533 tomcat    20   0  533m 341m 7760 S  2.3  8.6   0:17.52 java                           
    1 root      15   0 10372  692  580 S  0.0  0.0   0:00.37 init

--- Additional comment from Stephen Herr on 2014-11-17 10:39:56 EST ---

Hmm, rhn_channel.unsubscribe_server is a fairly complicated stored procedure with lots of room for calling other stored procs or looping, so it's difficult to know exactly why it's so slow without doing a lot more investigation into this particular dataset.

One possibility though is that by default it appears to recalculate the updates available to the server in-line (as opposed to asynchronously), which that by itself can take a long time if you're doing it on many systems. I'm not sure if that behavior is required or there because it's the default and we've never changed it.

It's worth experimenting with recalculating available updates asynchronously instead of in-line to see if that makes a difference or if there are negative side-effects (this can be done by calling unsubscribe_server(server_id, channel_id, 0) to set the optional arg immediate_in to 0)

--- Additional comment from Stephen Herr on 2014-11-19 10:54:14 EST ---

(In reply to Xixi from comment #5)
> 2) "used the the api to change the base channel for the group of systems and
> that worked just fine using system.setBaseChannel" <- if the problem only
> happens in Web UI, are the code paths / queries different between Web UI and
> API?

The API operates on a single system at a time. It's "system".setBaseChannel. So the client calling the API loops over all the systems it cares about and calls that method for each of them. So the difference is that the API is setting the base channel for these systems one at a time, and each one is relatively fast and completes long before a timeout is reached. The problem you're seeing with System Set Manager is that setting the base channels for many systems all at once is slow and a timeout is reached.

--- Additional comment from Xixi on 2014-11-19 11:53:35 EST ---

(In reply to Peter Gervase from comment #0)

Logged on to Satellite above -> SSM -> Change Base Channels -> selected "Set to Default Channel" for 10 systems selected -> Submit

After about 5-10 minutes, https://<hostname>/rhn/channel/ssm/BaseChannelSubscribe.do returns 500 ISE Timeout.

--- Additional comment from Stephen Herr on 2014-11-20 12:55:36 EST ---

I have an update ready for testing that makes two material changes:

1) The Base Channel change SSM action now happens asynchronously. This will solve the timeout / ISE problem and allow the user to go do other things while this happens in the background.

2) The action now commits after each server is migrated. This allows other processes that are waiting on locks for relevant tables (like system registration) to happen after only waiting a second or two, instead of multiple minutes (causing them to timeout or fail).

What this will not do is reduce the total amount of work that the database has to do. It's possible that there will be less strain on the database now due to much smaller rollback buffer requirements, but in general I would not expect to see much difference in database load.

Comment 1 Stephen Herr 2014-11-26 16:49:52 UTC
Committing to Spacewalk master:
90b419978977534d7e4418613d85484ba543650e

Comment 2 Stephen Herr 2014-12-12 19:59:02 UTC
An additional commit that 1) makes scheduling of the base channel change action faster and 2) prevents some base channel changes from failing due to uninitialized log id:
8cfebb37d1770f766a335cdf89899b90cf875e34

Comment 3 Grant Gainey 2015-03-23 16:58:45 UTC
Moving bugs to ON_QA as we move to release Spacewalk 2.3

Comment 4 Grant Gainey 2015-04-14 19:02:42 UTC
Spacewalk 2.3 has been released. See

https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23


Note You need to log in before you can comment on or make changes to this bug.