Bug 131769

Summary: Memory leak in krb5_cc_get_principal (was leak in libpq)
Product: [Fedora] Fedora Reporter: Bojan Smojver <bojan>
Component: krb5Assignee: Nalin Dahyabhai <nalin>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 2CC: tgl
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-01-12 19:28:30 EST Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
Table creation script
none
strace output of around 10,000 iterations
none
first pmap output
none
second pmap output (see diff in comment)
none
All RPM packages installed on this system none

Description Bojan Smojver 2004-09-04 02:44:03 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2)
Gecko/20040803 Galeon/1.3.17

Description of problem:
When I compile this program:

-------------------------------
#include <stdio.h>
#include <libpq-fe.h>

int main(int argc,char **argv){
  int i;
  PGconn *conn;
  PGresult *res;

  for(i=0;i<1000000;i++){
    conn=PQconnectdb("dbname=spintest");

    res=PQexec(conn,"select * from spintest");
    printf("%d\r",i);
    PQclear(res);

    PQfinish(conn);
  }
  printf("\n");

  return 0;
}
------------------------------- 

And run it, I'm seeing constant increase in the memory usage of the
program, around 4 kB every second on my 2 GHz P4-based Celeron
machine. I am either doing something incredibly stupid in the above
code (I would not be surprised ;-), or there is a memory leak in the
library.

Version-Release number of selected component (if applicable):
postgresql-7.4.2-1

How reproducible:
Always

Steps to Reproduce:
1. Compile the above program on FC2.
2. Run it.
3. Observe memory usage in top. 

Actual Results:  Memory usage increases.

Expected Results:  Memory usage should be steady.

Additional info:
Comment 1 Bojan Smojver 2004-09-04 03:55:05 EDT
Just to put things in perspective, by about 600,000 iterations, the
memory leak is almost 9 MB!
Comment 2 Tom Lane 2004-09-04 12:26:18 EDT
I couldn't reproduce this in a first attempt.  Do you have any connection-related settings 
that libpq is making use of beyond what is shown in your test program?  For instance, 
PGfoo environment variables, or a .pgpass file?
Comment 3 Bojan Smojver 2004-09-04 19:32:44 EDT
I have no .pg* files at all. There are no PG* environment variables.
As far as I'm aware, there is nothing special about this user account
in terms of PostgreSQL.

The only thing in pg_hba.conf that is not commented out is this:

---------------------------------
local  all    all             ident   sameuser
---------------------------------

The contents of postgresql.conf is this:

---------------------------------
max_connections = 100
shared_buffers = 1000           # min 16, at least max_connections*2,
8KB each
lc_messages = 'en_AU.UTF-8'             # locale for system error
message strings
lc_monetary = 'en_AU.UTF-8'             # locale for monetary formatting
lc_numeric = 'en_AU.UTF-8'              # ou locale for number formatting
lc_time = 'en_AU.UTF-8'                 # locale for time formatting
---------------------------------

Pretty much the defaults. Is there anything else you need?

BTW, I'm going to try this on another FC2 machine, just to see if I
can replicate it there.
Comment 4 Bojan Smojver 2004-09-04 19:45:04 EDT
Created attachment 103471 [details]
Table creation script

This is the script used to create the table used in the test. Before that, the
database also needs to be created, of course, and user (which has to exist) has
to be granted select privilege on the table.
Comment 5 Bojan Smojver 2004-09-04 19:50:26 EDT
I'm seeing the same thing on my other FC2 system. Same version of
PostgreSQL RPM, as shipped with FC2.

On this machine, there is no .pg* files. Here, the environment
variables are set like this:

PGLIB=/usr/lib/pgsql
PGDATA=/var/lib/pgsql

I'm guessing this is from /etc/profile.d/postgresql.sh, which is
something that's been hanging around from previous version of
PostgreSQL (this particular box has been upgraded all the way from Red
Hat Linux 5.0 :-).
Comment 6 Bojan Smojver 2004-09-04 19:51:46 EDT
On both systems, the program has been compiled like this:

gcc -o test test.c -lpq

Sorry, I forgot to mention that before.
Comment 7 Tom Lane 2004-09-05 00:38:37 EDT
I've just spent a couple hours trying to duplicate this, with no success at all.

I think possibly there is some version difference between your machines and the one I was 
testing on.  What version of glibc are you using, for instance?

Another line of attack is to try to find out what makes the leak go away.  For instance, if 
you use something besides IDENT authentication, does the leak go away?
Comment 8 Bojan Smojver 2004-09-05 06:23:06 EDT
Both machines are simply FC2 installations, fully updated. I don't
normally upgrade RPMS on those machines from development branches of
Fedora. The only difference may be RPMS that don't exist in Fedora by
default.

I'll install a brand new system, pure vanilla FC2, and then try again.

BTW, on the first machine I mentioned, I also have problems with
krb5-libs functions, called from pqlib code. They intermittently
segfault, but I was unable to pinpoint the exact cause (and therefore
never reported that as a bug). I have seen some recent security
advisories with double-free in krb5, which does look very similar to
what I have been experiencing (all my segfaults were in krb5 calling
free()). However, to replicate those tests, a complicated set of
software needs to be set up (httpd, libapreq2, mod_spin etc.), making
in unpractical for you guys to test.

That's how I came up with this small example - trying to replicate
those segfaults with a straightforward test. Turns out, it ended up in
memory leaks. Go figure... BTW, I have experienced memory leaks of
similar nature in my mod_spin applications, which do not appear unless
I keep opening and closing PostgreSQL connections.

I'll keep you posted.
Comment 9 Bojan Smojver 2004-09-06 01:05:31 EDT
I found this sentence in the release notes for 7.4.3:

Fix temporary memory leak when using non-hashed aggregates (Tom)

That's on this page:

http://www.postgresql.org/docs/current/interactive/release-7-4-3.html

Could that have anything to do with what I'm seeing?
Comment 10 Tom Lane 2004-09-06 01:28:09 EDT
No.
Comment 11 Bojan Smojver 2004-09-09 03:44:31 EDT
Just testing on the third machine (which doesn't even run a PostgreSQL
server, only has libs and devel, so all request are failing for sure)
and I'm seeing the same thing. Still have to do the vanilla install to
verify...
Comment 12 Tom Lane 2004-09-09 04:04:00 EDT
Hm, just to confirm I read that right: you have a loop in which PQconnectdb is failing to 
begin with (never mind the PQexec), and it's still leaking memory?

I'm still clueless why I can't reproduce this :-(.  I suspected from the start that the leak is 
associated with the connection and not with the query (please confirm whether it still leaks 
if you take out the PQexec and PQclear lines) but I sure do not know why ...
Comment 13 Bojan Smojver 2004-09-09 21:02:40 EDT
Yep, exactly. I have removed PQexec() and PQclear() calls altogether
and recompiled. Still leaks. I'll attach the output of two pmap runs,
just so you can see where exactly the memory goes. Not sure if that
helps though.

I did run strace and the only thing I can see in relation to memory
allocation is repeated calls to mmap2() by krb5-libs. There is a
matching munmap() and it does appear to be successful, so I'm not sure
if it's related at all. I'll attach a snippet of that too.
Comment 14 Bojan Smojver 2004-09-09 21:08:47 EDT
Created attachment 103662 [details]
strace output of around 10,000 iterations
Comment 15 Bojan Smojver 2004-09-09 21:10:40 EDT
Created attachment 103663 [details]
first pmap output
Comment 16 Bojan Smojver 2004-09-09 21:12:11 EDT
Created attachment 103664 [details]
second pmap output (see diff in comment)

Here is a diff of the first and second pmap output, which shows where the
memory is going:

--------------------------------
--- test-pmap-1.txt	2004-09-10 11:12:12.603360095 +1000
+++ test-pmap-2.txt	2004-09-10 11:12:30.256620406 +1000
@@ -46,9 +46,9 @@
 06300000     12K rw---  /libssl.so.0.9.7a
 08048000      4K r-x--  /test
 08049000      4K rw---  /test
-08d55000    644K rw---    [ anon ]
+08d55000   1028K rw---    [ anon ]
 f6feb000     16K rw---    [ anon ]
-f6fff000      4K rw---    [ anon ]
+f6ffe000      8K rw---    [ anon ]
 fee95000   1452K rw---    [ stack ]
 ffffd000      4K -----    [ anon ]
- total     5820K
+ total     6208K
--------------------------------
Comment 17 Bojan Smojver 2004-09-09 21:22:11 EDT
Created attachment 103665 [details]
All RPM packages installed on this system

This particular system is an HP Vectra VL 420 DT, Pentium 4, 1.7 GHz with 384
MB or RAM and 20 GB disk. In terms of glibc, I guess it would be running i686
version of it.
Comment 18 Bojan Smojver 2004-09-09 21:23:53 EDT
Output of "ldd ./test" looks like this:

--------------------------------
libpq.so.3 => /usr/lib/libpq.so.3 (0x00618000)
libc.so.6 => /lib/tls/libc.so.6 (0x0011a000)
libssl.so.4 => /lib/libssl.so.4 (0x062cf000)
libcrypto.so.4 => /lib/libcrypto.so.4 (0x061a1000)
libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x00c70000)
libcom_err.so.2 => /lib/libcom_err.so.2 (0x02e8e000)
libcrypt.so.1 => /lib/libcrypt.so.1 (0x00b5c000)
libresolv.so.2 => /lib/libresolv.so.2 (0x02dde000)
libnsl.so.1 => /lib/libnsl.so.1 (0x00b8b000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x02dca000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x00101000)
libgssapi_krb5.so.2 => /usr/lib/libgssapi_krb5.so.2 (0x062b8000)
libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x06294000)
libdl.so.2 => /lib/libdl.so.2 (0x02e71000)
libz.so.1 => /usr/lib/libz.so.1 (0x02df2000)
--------------------------------
Comment 19 Bojan Smojver 2004-09-09 21:28:22 EDT
I didn't mention this and it may be important, pmap and strace outputs
are from *different* runs. If you want me to do it on the same one,
let me know. And, sorry for so many posts today... :-(
Comment 20 Tom Lane 2004-09-10 19:54:38 EDT
Ah-hah.  I've been able to reproduce this problem on one FC2 machine
in the Toronto office, but the one I first tried does not have it!

They are running different kernels:
no bug:  2.6.8-1.521
bug: 2.6.8-1.521smp
but I suppose the problem is actually in userspace.  The only package
that libpq depends on that is different between the two is krb5:
no bug seen with 1.3.3-7, bug seen with 1.3.4-6.  So your gut
suspicion of krb5 seems confirmed.

Some further digging shows that each time through the loop, libpq
tries to get the username from Kerberos.  On both of my test systems,
krb5_cc_get_principal() fails with "No credentials cache found".  It
appears that something gets leaked in that code path in 1.3.4-6, but
not in 1.3.3-7.

I'm going to reassign this to the krb5 package owner and see what
he/she has to say.  It's possible that the true bug is still in libpq,
but if so it now takes the form of misuse of the kerberos library, so
I need a hand anyway ...
Comment 21 Bojan Smojver 2004-09-10 21:13:15 EDT
Awesome! Thanks for all your help.
Comment 22 Bojan Smojver 2004-09-12 22:43:10 EDT
Just for the record, this is how I get pinched by segfaults when
running  a lot of requests to Apache that runs my software inside it
(mod_spin + spin_app):

---------------------------------
(gdb) set args -X
(gdb) r
Starting program: /usr/local/apache2/bin/httpd -X
[Thread debugging using libthread_db enabled]
[New Thread -151078176 (LWP 4395)]
Detaching after fork from child process 4404.
[New Thread -151295056 (LWP 4405)]
[New Thread -161784912 (LWP 4406)]
[New Thread -172274768 (LWP 4407)]
[New Thread -182764624 (LWP 4408)]
[New Thread -193254480 (LWP 4409)]
[Thread -151295056 (zombie) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -161784912 (LWP 4406)]
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x043a40a9 in krb5_cc_close (context=0x935f3c8, cache=0x935f3c8)
    at ccfns.c:61
#2  0x0084c871 in pg_krb5_init (
    PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No
credentials c
ache found\n") at fe-auth.c:329
#3  0x0084c92f in pg_krb5_authname (PQerrormsg=0x935f3c8 "\001")
    at fe-auth.c:348
#4  0x0084d20b in fe_getauthname (
    PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No
credentials c
ache found\n") at fe-auth.c:731
#5  0x0084f8dd in conninfo_parse (conninfo=0x935f3c8 "\001", 
    errorMessage=0xf3c00780) at fe-connect.c:2720
#6  0x0084d2fc in connectOptions1 (conn=0xf3c00480, conninfo=0x935f3c8
"\001")
    at fe-connect.c:328
#7  0x0084d295 in PQconnectStart (conninfo=0x935f3c8 "\001")
    at fe-connect.c:287
#8  0x0084d242 in PQconnectdb (conninfo=0x935f3c8 "\001") at
fe-connect.c:245
#9  0x00115257 in rxv_spin_db_connect (pool=0x93a7eb0, cpool=0x0, 
    conninfo=0xeb5a15 "dbname=spintest", type=1 '\001') at db.c:255
#10 0x00eb4f96 in rxv_spin_service (ctx=0x93a94a8) at app.c:61
#11 0x00ccf9a1 in handler (r=0x93a7ee8) at mod_spin.c:390
#12 0x0807f40a in ap_run_handler (r=0x93a7ee8) at config.c:151
#13 0x0807f91e in ap_invoke_handler (r=0x93a7ee8) at config.c:358
#14 0x0806d1a7 in ap_process_request (r=0x93a7ee8) at http_request.c:246
#15 0x080692b5 in ap_process_http_connection (c=0x931add0) at
http_core.c:250
#16 0x08088472 in ap_run_process_connection (c=0x931add0) at
connection.c:42
#17 0x0807c5f1 in process_socket (p=0x931aca8, sock=0x931ace0, 
    my_child_num=71132672, my_thread_num=154530760,
bucket_alloc=0x9322cc8)
    at worker.c:520
#18 0x0807cc3a in worker_thread (thd=0x92d5fa0, dummy=0x935f3c8)
    at worker.c:834
#19 0x00d22068 in dummy_worker (opaque=0x935f3c8) at thread.c:88
#20 0x02f6d98c in start_thread () from /lib/tls/libpthread.so.0
#21 0x0237416a in clone () from /lib/tls/libc.so.6
(gdb)
---------------------------------

This is Apache 2.0.50 with worker MPM, but it is just a single thread
doing everything (due to -X). When I go up the call stack a bit, to
verify what is being passed into PQconnectdb() function by my software
(to eliminate any stupidites on my behalf), this is what I see:

---------------------------------
(gdb) up
#1  0x043a40a9 in krb5_cc_close (context=0x935f3c8, cache=0x935f3c8)
    at ccfns.c:61
61          return cache->ops->close(context, cache);
(gdb) up
#2  0x0084c871 in pg_krb5_init (
    PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No
credentials
cache found\n") at fe-auth.c:329
329                     krb5_cc_close(pg_krb5_context, pg_krb5_ccache);
(gdb) up
#3  0x0084c92f in pg_krb5_authname (PQerrormsg=0x935f3c8 "\001")
    at fe-auth.c:348
348             if (pg_krb5_init(PQerrormsg) != STATUS_OK)
(gdb) up
#4  0x0084d20b in fe_getauthname (
    PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No
credentials
cache found\n") at fe-auth.c:731
731                     name = pg_krb5_authname(PQerrormsg);
(gdb) up
#5  0x0084f8dd in conninfo_parse (conninfo=0x935f3c8 "\001", 
    errorMessage=0xf3c00780) at fe-connect.c:2720
2720                            option->val = fe_getauthname(errortmp);
(gdb) up
#6  0x0084d2fc in connectOptions1 (conn=0xf3c00480, conninfo=0x935f3c8
"\001")
    at fe-connect.c:328
328             connOptions = conninfo_parse(conninfo,
&conn->errorMessage);
(gdb) up
#7  0x0084d295 in PQconnectStart (conninfo=0x935f3c8 "\001")
    at fe-connect.c:287
287             if (!connectOptions1(conn, conninfo))
(gdb) up
#8  0x0084d242 in PQconnectdb (conninfo=0x935f3c8 "\001") at
fe-connect.c:245
245             PGconn     *conn = PQconnectStart(conninfo);
(gdb) up
#9  0x00115257 in rxv_spin_db_connect (pool=0x93a7eb0, cpool=0x0, 
    conninfo=0xeb5a15 "dbname=spintest", type=1 '\001') at db.c:255
255           if(!(conn->pgconn=PQconnectdb(conninfo)))
(gdb) p conninfo
$1 = 0xeb5a15 "dbname=spintest"
---------------------------------

I have no idea how the conninfo (which in my code is a constant
string) gets screwed after that... BTW, this is just *one* way those
segfaults happen. Sometimes, it is when free() is called from various
krb5 library functions on an invalid pointer.

Not sure if this is helpful, but here it is nevertheless.
Comment 23 Bojan Smojver 2004-10-17 20:12:47 EDT
I'm hoping krb5-1.3.5 will eventually become part of FC3, at which
point I can retest with it...
Comment 24 Bojan Smojver 2004-12-22 18:55:04 EST
Still present in recently updates krb5-libs-1.3.6-2.
Comment 25 Bojan Smojver 2005-01-05 22:15:49 EST
According to Kerberos developer Tom Yu, these memory leaks have been
fixed in 1.4, which is currently at beta 3.
Comment 26 Bojan Smojver 2005-01-12 06:01:18 EST
I did a little bit more snooping around and it turns out:

- worker MPM: segfaults
- prefork MPM: no segfaults

This is related to thread safety issues in relation to the way
PostgreSQL uses krb5 libraries. This is what Tom said:

-----------------------------
I took a closer look at
postgresql-7.4.5/src/interfaces/libpq/fe-auth.c and discovered that a
lot of krb5 state is kept as file-scope static variables.  The problem
is therefore mostly postgresql's fault, since it looks like libpq
itself isn't thread-safe.
-----------------------------

In relation to memory leaks, this is allegedly fixed in 1.4 beta.
Comment 27 Bojan Smojver 2005-01-12 19:28:30 EST
BTW, do I close this bug with with UPSTREAM? I'm not sure what the
policy of including krb5-libs is and if 1.4 is going to go into the
FC3 updates... Maybe I should close it with NEXTRELEASE?