+++ This bug was initially created as a clone of Bug #1168322 +++ +++ This bug was initially created as a clone of Bug #1161274 +++ Description of problem: Satellite crashes on large system sets. When trying actions on systems (100 or so or more) in ssm, the satellite crashes and services need restarted. basically moving 100 systems to a different base channel, the system runs a timeout issue. it seems the database is not performing as expected for several operations Version-Release number of selected component (if applicable): satellite-schema-5.6.0.19-3.el5sat How reproducible: 100% kinda sorta - It never completes and sometimes will cause the services to stop working. https://<hostname>/rhn/channel/ssm/ChildSubscriptionsConfirm.do Steps to Reproduce: 1. On the satellite, use ssm or select the systems you want to change the channel for. 2. try to change the channel 3. https://<hostname>/rhn/ssm/ViewAllLog.do shows a few in progress. Actual results: The satellite currently has a few showing in progress, so maybe that's why it's not really showing the exact error as before, but either way, the channel change never completes and it usually will cause the satellite to stop working. Expected results: The channels to change correctly. Additional info: --- Additional comment from Marcelo Moreira de Mello on 2014-11-11 13:16:02 EST catalina.out =============== 2014-11-12 19:37:12,626 [TP-Processor7] DEBUG net.rkbloom.logdriver.LogCallableStatement - executing CallableStatement: '{call rhn_channel.unsubscribe_server(?, ?)}' with bind parameters: {1=1000016731, 2=365} out parameters: {} a lot of hibernate activity when changing the channels postgresql --------- 2014-11-12 19:37:30.930 EST LOG: duration: 18303.816 ms execute <unnamed>: select * from rhn_channel.unsubscribe_server($1, $2) as result 2014-11-12 19:37:30.930 EST DETAIL: parameters: $1 = '1000016731', $2 = '365' 2014-11-12 19:37:41.622 EST LOG: duration: 6179.462 ms execute <unnamed>: select * from rhn_channel.unsubscribe_server($1, $2) as result 2014-11-12 19:37:41.622 EST DETAIL: parameters: $1 = '1000041275', $2 = '3550' top - 19:39:12 up 4 days, 2:27, 3 users, load average: 1.27, 0.46, 0.16 Tasks: 168 total, 2 running, 166 sleeping, 0 stopped, 0 zombie Cpu(s): 45.3%us, 0.5%sy, 0.0%ni, 42.3%id, 11.6%wa, 0.2%hi, 0.2%si, 0.0%st Mem: 4050448k total, 3760620k used, 289828k free, 156812k buffers Swap: 1020116k total, 104k used, 1020012k free, 2747500k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 28386 postgres 18 0 550m 247m 233m R 88.5 6.3 1:06.99 postmaster 27533 tomcat 20 0 533m 341m 7760 S 2.3 8.6 0:17.52 java 1 root 15 0 10372 692 580 S 0.0 0.0 0:00.37 init --- Additional comment from Stephen Herr on 2014-11-17 10:39:56 EST --- Hmm, rhn_channel.unsubscribe_server is a fairly complicated stored procedure with lots of room for calling other stored procs or looping, so it's difficult to know exactly why it's so slow without doing a lot more investigation into this particular dataset. One possibility though is that by default it appears to recalculate the updates available to the server in-line (as opposed to asynchronously), which that by itself can take a long time if you're doing it on many systems. I'm not sure if that behavior is required or there because it's the default and we've never changed it. It's worth experimenting with recalculating available updates asynchronously instead of in-line to see if that makes a difference or if there are negative side-effects (this can be done by calling unsubscribe_server(server_id, channel_id, 0) to set the optional arg immediate_in to 0) --- Additional comment from Stephen Herr on 2014-11-19 10:54:14 EST --- (In reply to Xixi from comment #5) > 2) "used the the api to change the base channel for the group of systems and > that worked just fine using system.setBaseChannel" <- if the problem only > happens in Web UI, are the code paths / queries different between Web UI and > API? The API operates on a single system at a time. It's "system".setBaseChannel. So the client calling the API loops over all the systems it cares about and calls that method for each of them. So the difference is that the API is setting the base channel for these systems one at a time, and each one is relatively fast and completes long before a timeout is reached. The problem you're seeing with System Set Manager is that setting the base channels for many systems all at once is slow and a timeout is reached. --- Additional comment from Xixi on 2014-11-19 11:53:35 EST --- (In reply to Peter Gervase from comment #0) Logged on to Satellite above -> SSM -> Change Base Channels -> selected "Set to Default Channel" for 10 systems selected -> Submit After about 5-10 minutes, https://<hostname>/rhn/channel/ssm/BaseChannelSubscribe.do returns 500 ISE Timeout. --- Additional comment from Stephen Herr on 2014-11-20 12:55:36 EST --- I have an update ready for testing that makes two material changes: 1) The Base Channel change SSM action now happens asynchronously. This will solve the timeout / ISE problem and allow the user to go do other things while this happens in the background. 2) The action now commits after each server is migrated. This allows other processes that are waiting on locks for relevant tables (like system registration) to happen after only waiting a second or two, instead of multiple minutes (causing them to timeout or fail). What this will not do is reduce the total amount of work that the database has to do. It's possible that there will be less strain on the database now due to much smaller rollback buffer requirements, but in general I would not expect to see much difference in database load.
Committing to Spacewalk master: 90b419978977534d7e4418613d85484ba543650e
An additional commit that 1) makes scheduling of the base channel change action faster and 2) prevents some base channel changes from failing due to uninitialized log id: 8cfebb37d1770f766a335cdf89899b90cf875e34
Moving bugs to ON_QA as we move to release Spacewalk 2.3
Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23