Bug 751818 - Investigate why aviary_query_server memory footprint is double that of condor_job_server
Summary: Investigate why aviary_query_server memory footprint is double that of condor_job_server
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-aviary
Version: 2.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: 2.1.1
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-11-07 16:57 UTC by Pete MacKinnon
Modified: 2011-12-08 19:55 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-30 20:39:24 UTC
Target Upstream Version:


Attachments
massif output file for getJobDetails from 500 jobs (44.60 KB, application/x-gzip)
2011-11-30 20:25 UTC, Pete MacKinnon


Links
Red Hat Bugzilla 758864 (high, CLOSED): Explanation for aviary_query_server memory usage (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 758864

Description Pete MacKinnon 2011-11-07 16:57:37 UTC
This was observed from a load test on grid0:

[root@grid0 spool]# ps auxw | grep condor_job_server | grep -v grep
condor   30990  2.5  5.5 2971544 2760632 ?     Ssl  10:04   2:46 condor_job_server -f -f
[root@grid0 spool]# ps auxw | grep aviary_query_server | grep -v grep
condor   15505  1.8 11.3 5674816 5595980 ?     Ss   Nov06  29:09 aviary_query_server -f

The internal data structures of these two components are nearly identical, except for a Codec that is instantiated per submission in the aviary version. This is the first area to profile.
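
Purely as illustration (the type and field names below are invented, not the condor-aviary source): a Codec instantiated per submission multiplies whatever per-instance state a codec carries by the number of submissions being tracked, a cost the job server does not pay.

#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-instance state a codec might carry
// (scratch buffers, serialization context, etc.); the size is invented.
struct Codec {
    std::vector<char> scratch = std::vector<char>(64 * 1024);
};

// Hypothetical submission record: one Codec per submission means total
// codec memory scales with the number of live submissions.
struct Submission {
    std::string name;
    Codec codec;
};

std::map<std::string, Submission> submissions_by_name;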

Comment 1 Pete MacKinnon 2011-11-28 23:42:21 UTC
I see this after a restart:

16880 condor    15   0  315m 108m 6188 S  0.0  0.2   0:21.59 condor_job_serv
16669 condor    18   0  167m  91m 3372 S  0.0  0.2   0:05.10 aviary_query_se

We cache summaries in the aviary query server and currently don't GC them regularly. However, in my testing the biggest RES growth is associated with calling getJobDetails on live jobs from the queue.

500 "sleep 20" jobs in the queue.

After invoking getJobSummary on all jobs:

22054 pmackinn  20   0 67460  15m 7936 S  0.0  0.8   0:01.52 condor_job_serv
22052 pmackinn  20   0 23196  14m 5304 S  0.0  0.7   0:01.45 aviary_query_se

After invoking getJobDetails on all jobs:

21183 pmackinn  20   0 67320  15m 7940 S  0.0  0.8   0:01.50 condor_job_serv
21182 pmackinn  20   0 31708  23m 5372 S  0.0  1.2   0:06.76 aviary_query_se

These peak RES levels appear to stay constant with repeated get* invocations on the same job set.
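
To make the caching point at the top of this comment concrete, a minimal hypothetical sketch (invented names, not the actual condor-aviary code): summaries are inserted as jobs are processed, and without a periodic GC pass nothing removes them, so the cache grows with the total number of jobs seen.

#include <map>
#include <string>

struct JobSummary {
    std::string id;      // e.g. "cluster.proc"
    std::string status;  // fields are illustrative only
};

// Cache keyed by job id; entries are added on update but, absent a
// timer-driven sweep, never evicted.
std::map<std::string, JobSummary> summary_cache;

void cacheSummary(const JobSummary& s) {
    summary_cache[s.id] = s;  // insert or overwrite; no eviction path
}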

Comment 2 Pete MacKinnon 2011-11-29 14:48:37 UTC
valgrind from getJobDetails on 500 jobs (500 submissions):

==32243== HEAP SUMMARY:
==32243==     in use at exit: 742,648 bytes in 16,861 blocks
==32243==   total heap usage: 3,869,478 allocs, 3,852,617 frees, 5,806,216,557 bytes allocated
==32243== 
==32243== LEAK SUMMARY:
==32243==    definitely lost: 236 bytes in 7 blocks
==32243==    indirectly lost: 72 bytes in 3 blocks
==32243==      possibly lost: 123,201 bytes in 3,611 blocks
==32243==    still reachable: 619,139 bytes in 13,240 blocks
==32243==         suppressed: 0 bytes in 0 blocks
==32243== Rerun with --leak-check=full to see details of leaked memory
==32243== 
==32243== For counts of detected and suppressed errors, rerun with: -v
==32243== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 82 from 10)

Comment 3 Pete MacKinnon 2011-11-29 14:51:52 UTC
valgrind from getJobSummary on 500 jobs (500 submissions):

==32320== HEAP SUMMARY:
==32320==     in use at exit: 742,379 bytes in 16,860 blocks
==32320==   total heap usage: 1,340,780 allocs, 1,323,920 frees, 92,293,704 bytes allocated
==32320== 
==32320== LEAK SUMMARY:
==32320==    definitely lost: 236 bytes in 7 blocks
==32320==    indirectly lost: 72 bytes in 3 blocks
==32320==      possibly lost: 122,870 bytes in 3,609 blocks
==32320==    still reachable: 619,201 bytes in 13,241 blocks
==32320==         suppressed: 0 bytes in 0 blocks
==32320== Rerun with --leak-check=full to see details of leaked memory
==32320== 
==32320== For counts of detected and suppressed errors, rerun with: -v
==32320== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 81 from 10)

Comment 4 Pete MacKinnon 2011-11-29 14:57:41 UTC
valgrind from getJobStatus on 500 jobs (500 submissions):

==32373== HEAP SUMMARY:
==32373==     in use at exit: 742,379 bytes in 16,860 blocks
==32373==   total heap usage: 1,295,591 allocs, 1,278,731 frees, 81,635,751 bytes allocated
==32373== 
==32373== LEAK SUMMARY:
==32373==    definitely lost: 236 bytes in 7 blocks
==32373==    indirectly lost: 72 bytes in 3 blocks
==32373==      possibly lost: 122,870 bytes in 3,609 blocks
==32373==    still reachable: 619,201 bytes in 13,241 blocks
==32373==         suppressed: 0 bytes in 0 blocks
==32373== Rerun with --leak-check=full to see details of leaked memory
==32373== 
==32373== For counts of detected and suppressed errors, rerun with: -v
==32373== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 81 from 10)

Comment 5 Pete MacKinnon 2011-11-29 17:00:18 UTC
valgrind from getJobDetails on 500 jobs (1 submission):

==1724== HEAP SUMMARY:
==1724==     in use at exit: 645,732 bytes in 17,690 blocks
==1724==   total heap usage: 4,625,485 allocs, 4,607,795 frees, 6,159,873,460 bytes allocated
==1724== 
==1724== LEAK SUMMARY:
==1724==    definitely lost: 16,718 bytes in 2,243 blocks
==1724==    indirectly lost: 24,122 bytes in 1,591 blocks
==1724==      possibly lost: 87,806 bytes in 2,614 blocks
==1724==    still reachable: 517,086 bytes in 11,242 blocks
==1724==         suppressed: 0 bytes in 0 blocks
==1724== Rerun with --leak-check=full to see details of leaked memory
==1724== 
==1724== For counts of detected and suppressed errors, rerun with: -v
==1724== ERROR SUMMARY: 8455 errors from 14 contexts (suppressed: 81 from 10)

Comment 6 Pete MacKinnon 2011-11-30 20:25:59 UTC
Created attachment 538791 [details]
massif output file for getJobDetails from 500 jobs

Comment 7 Pete MacKinnon 2011-11-30 20:39:24 UTC
The Axis2/C engine retains a stream buffer for responses. It does not reallocate the buffer to match the size of each new response; it simply checks whether the existing buffer can hold the data to be written. If not, a new buffer is allocated and the old one freed. Over time, the buffer therefore sits at the high-water mark of the largest response seen so far. In the aviary_query_server, depending on the operations invoked, the RES of the process grows because that heap memory remains in use. For example, a request for 1 job status after a request for 500 job details will not grow the RES of the process (other non-RPC memory allocations notwithstanding).
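
The pattern, as an illustrative sketch only (not the actual Axis2/C source):

#include <cstdlib>
#include <cstring>

// The buffer is reused when the next response fits and replaced only when
// it does not, so its capacity settles at the high-water mark of the
// largest response written so far.
struct ResponseStream {
    char*  buf = nullptr;
    size_t capacity = 0;
};

void writeResponse(ResponseStream& s, const char* data, size_t len) {
    if (len > s.capacity) {
        std::free(s.buf);                             // drop the smaller buffer
        s.buf = static_cast<char*>(std::malloc(len)); // allocate at the new peak
        s.capacity = len;                             // capacity never shrinks
    }
    std::memcpy(s.buf, data, len);                    // smaller responses reuse it
}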

This is not a memory leak.

Closing and adding Release Note for 2.2.

