Bug 751818 - Investigate why aviary_query_server memory footprint is double that of condor_job_server
Summary: Investigate why aviary_query_server memory footprint is double that of condor_job_server
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-aviary
Version: 2.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: 2.1.1
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-11-07 16:57 UTC by Pete MacKinnon
Modified: 2011-12-08 19:55 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-30 20:39:24 UTC
Target Upstream Version:


Attachments
massif output file for getJobDetails from 500 jobs (44.60 KB, application/x-gzip)
2011-11-30 20:25 UTC, Pete MacKinnon


Links
Red Hat Bugzilla 758864 (high, CLOSED): Explanation for aviary_query_server memory usage (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 758864

Description Pete MacKinnon 2011-11-07 16:57:37 UTC
This was observed from a load test on grid0:

[root@grid0 spool]# ps auxw | grep condor_job_server | grep -v grep
condor   30990  2.5  5.5 2971544 2760632 ?     Ssl  10:04   2:46 condor_job_server -f -f
[root@grid0 spool]# ps auxw | grep aviary_query_server | grep -v grep
condor   15505  1.8 11.3 5674816 5595980 ?     Ss   Nov06  29:09 aviary_query_server -f

The internal data structures of these two components are nearly identical, except for a Codec that is instantiated per submission in the aviary version. This is the first area to profile.
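
Purely as illustration (the type and field names below are invented, not the condor-aviary source): a Codec instantiated per submission multiplies whatever per-instance state a codec carries by the number of submissions being tracked, a cost the job server does not pay.

#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-instance state a codec might carry
// (scratch buffers, serialization context, etc.); the size is invented.
struct Codec {
    std::vector<char> scratch = std::vector<char>(64 * 1024);
};

// Hypothetical submission record: one Codec per submission means total
// codec memory scales with the number of live submissions.
struct Submission {
    std::string name;
    Codec codec;
};

std::map<std::string, Submission> submissions_by_name;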

Comment 1 Pete MacKinnon 2011-11-28 23:42:21 UTC
I see this after a restart:

16880 condor    15   0  315m 108m 6188 S  0.0  0.2   0:21.59 condor_job_serv
16669 condor    18   0  167m  91m 3372 S  0.0  0.2   0:05.10 aviary_query_se

We cache summaries in the aviary query server and currently don't GC them regularly. However, in my testing the biggest RES growth is associated with calling getJobDetails on live jobs from the queue.

500 "sleep 20" jobs in the queue.

After invoking getJobSummary on all jobs:

22054 pmackinn  20   0 67460  15m 7936 S  0.0  0.8   0:01.52 condor_job_serv
22052 pmackinn  20   0 23196  14m 5304 S  0.0  0.7   0:01.45 aviary_query_se

After invoking getJobDetails on all jobs:

21183 pmackinn  20   0 67320  15m 7940 S  0.0  0.8   0:01.50 condor_job_serv
21182 pmackinn  20   0 31708  23m 5372 S  0.0  1.2   0:06.76 aviary_query_se

These peak RES levels appear to stay constant with repeated get* invocations on the same job set.
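
To make the caching point at the top of this comment concrete, a minimal hypothetical sketch (invented names, not the actual condor-aviary code): summaries are inserted as jobs are processed, and without a periodic GC pass nothing removes them, so the cache grows with the total number of jobs seen.

#include <map>
#include <string>

struct JobSummary {
    std::string id;      // e.g. "cluster.proc"
    std::string status;  // fields are illustrative only
};

// Cache keyed by job id; entries are added on update but, absent a
// timer-driven sweep, never evicted.
std::map<std::string, JobSummary> summary_cache;

void cacheSummary(const JobSummary& s) {
    summary_cache[s.id] = s;  // insert or overwrite; no eviction path
}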

Comment 2 Pete MacKinnon 2011-11-29 14:48:37 UTC
valgrind from getJobDetails on 500 jobs (500 submissions):

==32243== HEAP SUMMARY:
==32243==     in use at exit: 742,648 bytes in 16,861 blocks
==32243==   total heap usage: 3,869,478 allocs, 3,852,617 frees, 5,806,216,557 bytes allocated
==32243== 
==32243== LEAK SUMMARY:
==32243==    definitely lost: 236 bytes in 7 blocks
==32243==    indirectly lost: 72 bytes in 3 blocks
==32243==      possibly lost: 123,201 bytes in 3,611 blocks
==32243==    still reachable: 619,139 bytes in 13,240 blocks
==32243==         suppressed: 0 bytes in 0 blocks
==32243== Rerun with --leak-check=full to see details of leaked memory
==32243== 
==32243== For counts of detected and suppressed errors, rerun with: -v
==32243== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 82 from 10)

Comment 3 Pete MacKinnon 2011-11-29 14:51:52 UTC
valgrind from getJobSummary on 500 jobs (500 submissions):

==32320== HEAP SUMMARY:
==32320==     in use at exit: 742,379 bytes in 16,860 blocks
==32320==   total heap usage: 1,340,780 allocs, 1,323,920 frees, 92,293,704 bytes allocated
==32320== 
==32320== LEAK SUMMARY:
==32320==    definitely lost: 236 bytes in 7 blocks
==32320==    indirectly lost: 72 bytes in 3 blocks
==32320==      possibly lost: 122,870 bytes in 3,609 blocks
==32320==    still reachable: 619,201 bytes in 13,241 blocks
==32320==         suppressed: 0 bytes in 0 blocks
==32320== Rerun with --leak-check=full to see details of leaked memory
==32320== 
==32320== For counts of detected and suppressed errors, rerun with: -v
==32320== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 81 from 10)

Comment 4 Pete MacKinnon 2011-11-29 14:57:41 UTC
valgrind from getJobStatus on 500 jobs (500 submissions):

==32373== HEAP SUMMARY:
==32373==     in use at exit: 742,379 bytes in 16,860 blocks
==32373==   total heap usage: 1,295,591 allocs, 1,278,731 frees, 81,635,751 bytes allocated
==32373== 
==32373== LEAK SUMMARY:
==32373==    definitely lost: 236 bytes in 7 blocks
==32373==    indirectly lost: 72 bytes in 3 blocks
==32373==      possibly lost: 122,870 bytes in 3,609 blocks
==32373==    still reachable: 619,201 bytes in 13,241 blocks
==32373==         suppressed: 0 bytes in 0 blocks
==32373== Rerun with --leak-check=full to see details of leaked memory
==32373== 
==32373== For counts of detected and suppressed errors, rerun with: -v
==32373== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 81 from 10)

Comment 5 Pete MacKinnon 2011-11-29 17:00:18 UTC
valgrind from getJobDetails on 500 jobs (1 submission):

==1724== HEAP SUMMARY:
==1724==     in use at exit: 645,732 bytes in 17,690 blocks
==1724==   total heap usage: 4,625,485 allocs, 4,607,795 frees, 6,159,873,460 bytes allocated
==1724== 
==1724== LEAK SUMMARY:
==1724==    definitely lost: 16,718 bytes in 2,243 blocks
==1724==    indirectly lost: 24,122 bytes in 1,591 blocks
==1724==      possibly lost: 87,806 bytes in 2,614 blocks
==1724==    still reachable: 517,086 bytes in 11,242 blocks
==1724==         suppressed: 0 bytes in 0 blocks
==1724== Rerun with --leak-check=full to see details of leaked memory
==1724== 
==1724== For counts of detected and suppressed errors, rerun with: -v
==1724== ERROR SUMMARY: 8455 errors from 14 contexts (suppressed: 81 from 10)

Comment 6 Pete MacKinnon 2011-11-30 20:25:59 UTC
Created attachment 538791 [details]
massif output file for getJobDetails from 500 jobs

Comment 7 Pete MacKinnon 2011-11-30 20:39:24 UTC
The Axis2/C engine retains a stream buffer for responses. It does not reallocate the buffer to match the size of each new response; it simply checks whether the existing buffer can hold the data to be written. If not, a new buffer is allocated and the old one freed. Over time, the buffer therefore sits at the high-water mark of the largest response seen so far. In the aviary_query_server, depending on the operations invoked, the RES of the process grows because that heap memory remains in use. For example, a request for 1 job status after a request for 500 job details will not grow the RES of the process (other non-RPC memory allocations notwithstanding).
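
The pattern, as an illustrative sketch only (not the actual Axis2/C source):

#include <cstdlib>
#include <cstring>

// The buffer is reused when the next response fits and replaced only when
// it does not, so its capacity settles at the high-water mark of the
// largest response written so far.
struct ResponseStream {
    char*  buf = nullptr;
    size_t capacity = 0;
};

void writeResponse(ResponseStream& s, const char* data, size_t len) {
    if (len > s.capacity) {
        std::free(s.buf);                             // drop the smaller buffer
        s.buf = static_cast<char*>(std::malloc(len)); // allocate at the new peak
        s.capacity = len;                             // capacity never shrinks
    }
    std::memcpy(s.buf, data, len);                    // smaller responses reuse it
}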

This is not a memory leak.

Closing and adding Release Note for 2.2.

