Bug 1103822 - API method systemgroup.listSystems() is not scalling when a group has +6k systems
Summary: API method systemgroup.listSystems() is not scalling when a group has +6k sys...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: API
Version: 2.2
Hardware: All
OS: All
high
high
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On: 1099948
Blocks: space22
TreeView+ depends on / blocked
 
Reported: 2014-06-02 16:05 UTC by Stephen Herr
Modified: 2014-07-17 08:41 UTC (History)
3 users (show)

Fixed In Version: spacewalk-java-2.2.86-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1099948
Environment:
Last Closed: 2014-07-17 08:41:42 UTC
Embargoed:


Attachments (Terms of Use)

Description Stephen Herr 2014-06-02 16:05:59 UTC
+++ This bug was initially created as a clone of Bug #1099948 +++

Description of problem:
API method systemsgroup.listSystems() is not scalling when a group has +6k systems. 

During our troubleshooting process, we were able to see that the issue is being caused by the framework used by the satellite called Hibernate. When calling the API method systemsgroup.listSystems() with a large number of systems (currently you have +6k), the Hibernate framework is not scaling as expected. 
    
    We restored the Oracle database on the same box that is currently the Red Hat Satellite server and we could observe an increase in Java overhead. The Java overhead looks directly related to the Hibernate overhead of setting up objects for each of the 6k systems in the loop.

    During our tests, it took 3 hours to return the API 

# rpm -q spacewalk-java
spacewalk-java-2.0.2-46.el5sat  (note: the issue also happens running the latest Satellite 5.6)

# rpm -qa | grep schema
spacewalk-schema-2.0.2-13.el5sat
satellite-schema-5.6.0.10-1.el5sat


[root@sterling-01022737 ~]# python -m trace --count -C . listSystemsInGroupV2.py 
2014-05-20 19:38:17.647049

web_api.log
[2014-05-20 19:38:17,565] INFO  - REQUESTED FROM: 10.13.145.110 CALL: auth.login(satserv, ********) CALLER: (none) TIME: 0.064 seconds
[2014-05-20 19:39:03,823] INFO  - REQUESTED FROM: 10.13.145.110 CALL: systemgroup.listSystems(695080x8e72c851ac348de33450e7ea52b7a7d1, Production) CALLER: (satserv) TIME: 46.101 seconds

java again consuming the cpu
top - 20:14:32 up 2 days, 23:31,  7 users,  load average: 1.58, 1.60, 1.55
Tasks: 166 total,   2 running, 162 sleeping,   2 stopped,   0 zombie
Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4051248k total,  3716020k used,   335228k free,   163076k buffers
Swap:  1020116k total,   170480k used,   849636k free,  2090752k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                   
 4289 tomcat    25   0  725m 414m  14m S 99.4 10.5  35:46.27 java   
 
 since the java is consuming CPU and not oracle, looking at the code we have:
 
 
API method
-------
131     /**
132      * List the systems that are associated to the given system group.
133      * @param sessionKey The sessionKey containing the logged in user
134      * @param systemGroupName the name of the system group
135      * @return a list of systems associated to a given system group.
136      *
137      * @xmlrpc.doc Return a list of systems associated with this system group.
138      * User must have access to this system group.
139 
140      * @xmlrpc.param #session_key()
141      * @xmlrpc.param #param("string", "systemGroupName")
142      * @xmlrpc.returntype
143      *      #array()
144      *          $ServerSerializer
145      *      #array_end()
146      */
147     public List listSystems(String sessionKey, String systemGroupName) {
148         User loggedInUser = getLoggedInUser(sessionKey);
149         ServerGroupManager manager = ServerGroupManager.getInstance();
150         ManagedServerGroup group = manager.lookup(systemGroupName, loggedInUser);
151         return group.getServers();
152     }

system/ServerGroupManager.java
================
 61     /**
 62      * Lookup a ServerGroup by ID and organization.
 63      * @param id Server group id
 64      * @param user logged in user needed for authentication
 65      * @return Server group requested
 66      */
 67     public ManagedServerGroup lookup(Long id, User user) {
 68         ManagedServerGroup sg = ServerGroupFactory.
 69                                 lookupByIdAndOrg(id, user.getOrg());
 70         if (sg == null) {
 71             validateAccessCredentials(user, sg, id.toString());
 72         }
 73         else {
 74             validateAccessCredentials(user, sg, sg.getName());
 75         }
 76 
 77         return sg;
 78     }

domain/server/ServerGroup.hbm.xml
=======================
 55     <query name="ServerGroup.lookupByIdAndOrg">
 56         <![CDATA[from com.redhat.rhn.domain.server.ServerGroup as s where s.id = :id
 57                  and s.org = :org and (s.groupType is null)]]>
 58     </query>


domain/server/ServerGroupFactory.java
==========
    /**
     * Returns the servers of a given serverGroup.
     * @param sg the serverGroup to find the servers of
     * @return list of Server objects that are part of
     *                      the server group
     */
    public static List listServers(ServerGroup sg) {
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("sgid", sg.getId());
        params.put("org_id", sg.getOrg().getId());
        List ids = SINGLETON.listObjectsByNamedQuery(
                "ServerGroup.lookupServerIds", params);
        List servers = new ArrayList(ids.size());
        for (Iterator itr = ids.iterator(); itr.hasNext();) {
            Number id = (Number) itr.next();
            servers.add(ServerFactory.lookupById(new Long(id.longValue())));
        }
        return servers;
    }
    

domain/server/ServerGroup.hbm.xml
=======================
120     <sql-query name="ServerGroup.lookupServerIds">
121         <![CDATA[ select s.id
122 >--->--->--->--->---from rhnServer s
123 >--->--->--->--->---inner join rhnServerGroupMembers sgm on s.id = sgm.server_id
124                   where
125 >---               >sgm.server_group_id = :sgid
126 >---               >and s.org_id = :org_id
127 >---           ]]>
128     </sql-query>

  Looks like the hibernate query is being used for each system (total 6400) systems and that explains why the java process is starving the cpu.


Version-Release number of selected component (if applicable):
spacewalk-java-2.0.2-64.el5sat



How reproducible:
100%


Steps to Reproduce:
1. Install a Satellite 5.6 
2. Create a group with +6k 
3. Call the API script

Actual results:
API is not performing


Expected results:
API should return the information fast. 


Additional info:

/var/lib/tomcat5/webapps/rhn/WEB-INF/classes/log4j.properties
## Hibernate logging
log4j.appender.HibernateAppender=org.apache.log4j.FileAppender
log4j.appender.HibernateAppender.file=/var/log/rhn/rhn_web_hibernate.log
log4j.appender.HibernateAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.HibernateAppender.layout.ConversionPattern=[%d] %-5p - %m%n
log4j.additivity.org.hibernate=false
log4j.additivity.net.rkbloom.logdriver=false


;/etc/rhn/rhn.conf
----------
#debug
hibernate.show_sql = true
debug = 6


++++++++++++++++++++++ part 1 +++++++++++++++++++++
---using satserv on Development group (56 systems)
[root@sterling-01022737 ~]# date ; time ./listSystemsInGroupV2.py  ; date
Tue May 20 20:40:39 EDT 2014
2014-05-20 20:40:39.233119
2014-05-20 20:40:40.661285

		**** debug -> systemgrouplist query time: 428166 microseconds ****


real	0m1.615s
user	0m0.084s
sys	0m0.013s

[2014-05-20 20:40:39,154] INFO  - REQUESTED FROM: 10.13.145.110 CALL: auth.login(satserv, ********) CALLER: (none) TIME: 0.012 seconds
[2014-05-20 20:40:39,949] INFO  - REQUESTED FROM: 10.13.145.110 CALL: systemgroup.listSystems(695086x6fb842a25de3eb801b3218b87cb37b3a, Development) CALLER: (satserv) TIME: 0.651 seconds

++++++++++++++++++++++ part 2 +++++++++++++++++++++
---using satserv on Production group (6028 systems)



--- after running the API script, we can see the catalina.out beeing populated 

==> /var/log/rhn/rhn_web_api.log <==
[2014-05-20 20:41:59,882] INFO  - REQUESTED FROM: 10.13.145.110 CALL: auth.login(satserv, ********) CALLER: (none) TIME: 0.0090 seconds
[2014-05-20 20:42:46,232] INFO  - REQUESTED FROM: 10.13.145.110 CALL: systemgroup.listSystems(695087x5b13dccc927b79a696dde506390e6ba7, Production) CALLER: (satserv) TIME: 46.263 seconds


  Here we can see that Hibernate is checking each system (+6k). The query lasts for +- 3 hours. 


==> /var/log/tomcat5/catalina.out <==
Hibernate: select servergrou2_.id as id212_, servergrou2_.name as name212_, servergrou2_.description as descript4_212_, servergrou2_.created as created212_, servergrou2_.modified as modified212_, servergrou2_.group_type as group2_212_, servergrou2_.org_id as org7_212_, servergrou2_.max_members as max8_212_ from rhnServer server0_ inner join rhnServerGroupMembers groups1_ on server0_.id=groups1_.server_id inner join rhnServerGroup servergrou2_ on groups1_.server_group_id=servergrou2_.id where server0_.id=? and (servergrou2_.group_type is not null)
Hibernate: select networks0_.server_id as server7_229_1_, networks0_.id as id1_, networks0_.id as id203_0_, networks0_.hostname as hostname203_0_, networks0_.ipaddr as ipaddr203_0_, networks0_.ip6addr as ip4_203_0_, networks0_.created as created203_0_, networks0_.modified as modified203_0_, networks0_.server_id as server7_203_0_ from rhnServerNetwork networks0_ where networks0_.server_id=?
Hibernate: select servergrou2_.id as id212_, servergrou2_.name as name212_, servergrou2_.description as descript4_212_, servergrou2_.created as created212_, servergrou2_.modified as modified212_, servergrou2_.group_type as group2_212_, servergrou2_.org_id as org7_212_, servergrou2_.max_members as max8_212_ from rhnServer server0_ inner join rhnServerGroupMembers groups1_ on server0_.id=groups1_.server_id inner join rhnServerGroup servergrou2_ on groups1_.server_group_id=servergrou2_.id where server0_.id=? and (servergrou2_.group_type is not null)

--- Additional comment from Stephen Herr on 2014-05-29 14:52:00 EDT ---

Given the data that this method returns it would be very difficult to convert this hibernate lookup directly into sql; we would have to create a query that would be the equivalent of several object's worth of hibernate mappings and the process would be error prone. Nowhere else do we get this data through direct sql queries.

What we can do to speed up performance though is:
1) I can make hibernate lookup the servers 1000 at a time, instead of one at a time. This should reduce at least some overhead.

2) I can create a new method that returns a smaller amount of data that can be direct sql (and very fast). If the user needs the extra data they can look it up through system.getDetails. Perhaps this would only return system_id, system_name, and last_checkin, similar to the system.list*Systems methods.

--- Additional comment from Stephen Herr on 2014-06-02 11:26:10 EDT ---

I will create a new API method, systemgroup.listSystemsMinimal, that returns only the id, name, and last_checkin of every system in the system group. This should be:

1) fast

2) return consistent information with other listSystem methods, like the ones in the system handler. This means that the user can easily wrap it inside their scripts that may call either system.listSystems or systemgroup.listSystemsMinimal depending on if a group name was supplied or not, and not have to worry about different sets of data being returned.

The old systemgroup.listSystems method will continue to exist and continue to return the same data it currently does for compatibility reasons, and also continue to be slow (unfortunately).

Comment 1 Stephen Herr 2014-06-02 16:57:08 UTC
Committing to Spacewalk master:
dc30f0dbf4b7cd3d5ebae37455800e6c6c877f1b

Comment 2 Stephen Herr 2014-06-03 12:24:03 UTC
And:
20f24980dd1067039cd6ef547f078f4e61a3b5bb

Comment 3 Milan Zázrivec 2014-07-17 08:41:42 UTC
Spacewalk 2.2 has been released:

    https://fedorahosted.org/spacewalk/wiki/ReleaseNotes22


Note You need to log in before you can comment on or make changes to this bug.