From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803 Galeon/1.3.17 Description of problem: When I compile this program: ------------------------------- #include <stdio.h> #include <libpq-fe.h> int main(int argc,char **argv){ int i; PGconn *conn; PGresult *res; for(i=0;i<1000000;i++){ conn=PQconnectdb("dbname=spintest"); res=PQexec(conn,"select * from spintest"); printf("%d\r",i); PQclear(res); PQfinish(conn); } printf("\n"); return 0; } ------------------------------- And run it, I'm seeing constant increase in the memory usage of the program, around 4 kB every second on my 2 GHz P4-based Celeron machine. I am either doing something incredibly stupid in the above code (I would not be surprised ;-), or there is a memory leak in the library. Version-Release number of selected component (if applicable): postgresql-7.4.2-1 How reproducible: Always Steps to Reproduce: 1. Compile the above program on FC2. 2. Run it. 3. Observe memory usage in top. Actual Results: Memory usage increases. Expected Results: Memory usage should be steady. Additional info:
Just to put things in perspective, by about 600,000 iterations, the memory leak is almost 9 MB!
I couldn't reproduce this in a first attempt. Do you have any connection-related settings that libpq is making use of beyond what is shown in your test program? For instance, PGfoo environment variables, or a .pgpass file?
I have no .pg* files at all. There are no PG* environment variables. As far as I'm aware, there is nothing special about this user account in terms of PostgreSQL. The only thing in pg_hba.conf that is not commented out is this: --------------------------------- local all all ident sameuser --------------------------------- The contents of postgresql.conf is this: --------------------------------- max_connections = 100 shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each lc_messages = 'en_AU.UTF-8' # locale for system error message strings lc_monetary = 'en_AU.UTF-8' # locale for monetary formatting lc_numeric = 'en_AU.UTF-8' # ou locale for number formatting lc_time = 'en_AU.UTF-8' # locale for time formatting --------------------------------- Pretty much the defaults. Is there anything else you need? BTW, I'm going to try this on another FC2 machine, just to see if I can replicate it there.
Created attachment 103471 [details] Table creation script This is the script used to create the table used in the test. Before that, the database also needs to be created, of course, and user (which has to exist) has to be granted select privilege on the table.
I'm seeing the same thing on my other FC2 system. Same version of PostgreSQL RPM, as shipped with FC2. On this machine, there is no .pg* files. Here, the environment variables are set like this: PGLIB=/usr/lib/pgsql PGDATA=/var/lib/pgsql I'm guessing this is from /etc/profile.d/postgresql.sh, which is something that's been hanging around from previous version of PostgreSQL (this particular box has been upgraded all the way from Red Hat Linux 5.0 :-).
On both systems, the program has been compiled like this: gcc -o test test.c -lpq Sorry, I forgot to mention that before.
I've just spent a couple hours trying to duplicate this, with no success at all. I think possibly there is some version difference between your machines and the one I was testing on. What version of glibc are you using, for instance? Another line of attack is to try to find out what makes the leak go away. For instance, if you use something besides IDENT authentication, does the leak go away?
Both machines are simply FC2 installations, fully updated. I don't normally upgrade RPMS on those machines from development branches of Fedora. The only difference may be RPMS that don't exist in Fedora by default. I'll install a brand new system, pure vanilla FC2, and then try again. BTW, on the first machine I mentioned, I also have problems with krb5-libs functions, called from pqlib code. They intermittently segfault, but I was unable to pinpoint the exact cause (and therefore never reported that as a bug). I have seen some recent security advisories with double-free in krb5, which does look very similar to what I have been experiencing (all my segfaults were in krb5 calling free()). However, to replicate those tests, a complicated set of software needs to be set up (httpd, libapreq2, mod_spin etc.), making in unpractical for you guys to test. That's how I came up with this small example - trying to replicate those segfaults with a straightforward test. Turns out, it ended up in memory leaks. Go figure... BTW, I have experienced memory leaks of similar nature in my mod_spin applications, which do not appear unless I keep opening and closing PostgreSQL connections. I'll keep you posted.
I found this sentence in the release notes for 7.4.3: Fix temporary memory leak when using non-hashed aggregates (Tom) That's on this page: http://www.postgresql.org/docs/current/interactive/release-7-4-3.html Could that have anything to do with what I'm seeing?
No.
Just testing on the third machine (which doesn't even run a PostgreSQL server, only has libs and devel, so all request are failing for sure) and I'm seeing the same thing. Still have to do the vanilla install to verify...
Hm, just to confirm I read that right: you have a loop in which PQconnectdb is failing to begin with (never mind the PQexec), and it's still leaking memory? I'm still clueless why I can't reproduce this :-(. I suspected from the start that the leak is associated with the connection and not with the query (please confirm whether it still leaks if you take out the PQexec and PQclear lines) but I sure do not know why ...
Yep, exactly. I have removed PQexec() and PQclear() calls altogether and recompiled. Still leaks. I'll attach the output of two pmap runs, just so you can see where exactly the memory goes. Not sure if that helps though. I did run strace and the only thing I can see in relation to memory allocation is repeated calls to mmap2() by krb5-libs. There is a matching munmap() and it does appear to be successful, so I'm not sure if it's related at all. I'll attach a snippet of that too.
Created attachment 103662 [details] strace output of around 10,000 iterations
Created attachment 103663 [details] first pmap output
Created attachment 103664 [details] second pmap output (see diff in comment) Here is a diff of the first and second pmap output, which shows where the memory is going: -------------------------------- --- test-pmap-1.txt 2004-09-10 11:12:12.603360095 +1000 +++ test-pmap-2.txt 2004-09-10 11:12:30.256620406 +1000 @@ -46,9 +46,9 @@ 06300000 12K rw--- /libssl.so.0.9.7a 08048000 4K r-x-- /test 08049000 4K rw--- /test -08d55000 644K rw--- [ anon ] +08d55000 1028K rw--- [ anon ] f6feb000 16K rw--- [ anon ] -f6fff000 4K rw--- [ anon ] +f6ffe000 8K rw--- [ anon ] fee95000 1452K rw--- [ stack ] ffffd000 4K ----- [ anon ] - total 5820K + total 6208K --------------------------------
Created attachment 103665 [details] All RPM packages installed on this system This particular system is an HP Vectra VL 420 DT, Pentium 4, 1.7 GHz with 384 MB or RAM and 20 GB disk. In terms of glibc, I guess it would be running i686 version of it.
Output of "ldd ./test" looks like this: -------------------------------- libpq.so.3 => /usr/lib/libpq.so.3 (0x00618000) libc.so.6 => /lib/tls/libc.so.6 (0x0011a000) libssl.so.4 => /lib/libssl.so.4 (0x062cf000) libcrypto.so.4 => /lib/libcrypto.so.4 (0x061a1000) libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x00c70000) libcom_err.so.2 => /lib/libcom_err.so.2 (0x02e8e000) libcrypt.so.1 => /lib/libcrypt.so.1 (0x00b5c000) libresolv.so.2 => /lib/libresolv.so.2 (0x02dde000) libnsl.so.1 => /lib/libnsl.so.1 (0x00b8b000) libpthread.so.0 => /lib/tls/libpthread.so.0 (0x02dca000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x00101000) libgssapi_krb5.so.2 => /usr/lib/libgssapi_krb5.so.2 (0x062b8000) libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x06294000) libdl.so.2 => /lib/libdl.so.2 (0x02e71000) libz.so.1 => /usr/lib/libz.so.1 (0x02df2000) --------------------------------
I didn't mention this and it may be important, pmap and strace outputs are from *different* runs. If you want me to do it on the same one, let me know. And, sorry for so many posts today... :-(
Ah-hah. I've been able to reproduce this problem on one FC2 machine in the Toronto office, but the one I first tried does not have it! They are running different kernels: no bug: 2.6.8-1.521 bug: 2.6.8-1.521smp but I suppose the problem is actually in userspace. The only package that libpq depends on that is different between the two is krb5: no bug seen with 1.3.3-7, bug seen with 1.3.4-6. So your gut suspicion of krb5 seems confirmed. Some further digging shows that each time through the loop, libpq tries to get the username from Kerberos. On both of my test systems, krb5_cc_get_principal() fails with "No credentials cache found". It appears that something gets leaked in that code path in 1.3.4-6, but not in 1.3.3-7. I'm going to reassign this to the krb5 package owner and see what he/she has to say. It's possible that the true bug is still in libpq, but if so it now takes the form of misuse of the kerberos library, so I need a hand anyway ...
Awesome! Thanks for all your help.
Just for the record, this is how I get pinched by segfaults when running a lot of requests to Apache that runs my software inside it (mod_spin + spin_app): --------------------------------- (gdb) set args -X (gdb) r Starting program: /usr/local/apache2/bin/httpd -X [Thread debugging using libthread_db enabled] [New Thread -151078176 (LWP 4395)] Detaching after fork from child process 4404. [New Thread -151295056 (LWP 4405)] [New Thread -161784912 (LWP 4406)] [New Thread -172274768 (LWP 4407)] [New Thread -182764624 (LWP 4408)] [New Thread -193254480 (LWP 4409)] [Thread -151295056 (zombie) exited] Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -161784912 (LWP 4406)] 0x00000000 in ?? () (gdb) bt #0 0x00000000 in ?? () #1 0x043a40a9 in krb5_cc_close (context=0x935f3c8, cache=0x935f3c8) at ccfns.c:61 #2 0x0084c871 in pg_krb5_init ( PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No credentials c ache found\n") at fe-auth.c:329 #3 0x0084c92f in pg_krb5_authname (PQerrormsg=0x935f3c8 "\001") at fe-auth.c:348 #4 0x0084d20b in fe_getauthname ( PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No credentials c ache found\n") at fe-auth.c:731 #5 0x0084f8dd in conninfo_parse (conninfo=0x935f3c8 "\001", errorMessage=0xf3c00780) at fe-connect.c:2720 #6 0x0084d2fc in connectOptions1 (conn=0xf3c00480, conninfo=0x935f3c8 "\001") at fe-connect.c:328 #7 0x0084d295 in PQconnectStart (conninfo=0x935f3c8 "\001") at fe-connect.c:287 #8 0x0084d242 in PQconnectdb (conninfo=0x935f3c8 "\001") at fe-connect.c:245 #9 0x00115257 in rxv_spin_db_connect (pool=0x93a7eb0, cpool=0x0, conninfo=0xeb5a15 "dbname=spintest", type=1 '\001') at db.c:255 #10 0x00eb4f96 in rxv_spin_service (ctx=0x93a94a8) at app.c:61 #11 0x00ccf9a1 in handler (r=0x93a7ee8) at mod_spin.c:390 #12 0x0807f40a in ap_run_handler (r=0x93a7ee8) at config.c:151 #13 0x0807f91e in ap_invoke_handler (r=0x93a7ee8) at config.c:358 #14 0x0806d1a7 in ap_process_request (r=0x93a7ee8) at http_request.c:246 #15 0x080692b5 in ap_process_http_connection (c=0x931add0) at http_core.c:250 #16 0x08088472 in ap_run_process_connection (c=0x931add0) at connection.c:42 #17 0x0807c5f1 in process_socket (p=0x931aca8, sock=0x931ace0, my_child_num=71132672, my_thread_num=154530760, bucket_alloc=0x9322cc8) at worker.c:520 #18 0x0807cc3a in worker_thread (thd=0x92d5fa0, dummy=0x935f3c8) at worker.c:834 #19 0x00d22068 in dummy_worker (opaque=0x935f3c8) at thread.c:88 #20 0x02f6d98c in start_thread () from /lib/tls/libpthread.so.0 #21 0x0237416a in clone () from /lib/tls/libc.so.6 (gdb) --------------------------------- This is Apache 2.0.50 with worker MPM, but it is just a single thread doing everything (due to -X). When I go up the call stack a bit, to verify what is being passed into PQconnectdb() function by my software (to eliminate any stupidites on my behalf), this is what I see: --------------------------------- (gdb) up #1 0x043a40a9 in krb5_cc_close (context=0x935f3c8, cache=0x935f3c8) at ccfns.c:61 61 return cache->ops->close(context, cache); (gdb) up #2 0x0084c871 in pg_krb5_init ( PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No credentials cache found\n") at fe-auth.c:329 329 krb5_cc_close(pg_krb5_context, pg_krb5_ccache); (gdb) up #3 0x0084c92f in pg_krb5_authname (PQerrormsg=0x935f3c8 "\001") at fe-auth.c:348 348 if (pg_krb5_init(PQerrormsg) != STATUS_OK) (gdb) up #4 0x0084d20b in fe_getauthname ( PQerrormsg=0xf65b5380 "pg_krb5_init: krb5_cc_get_principal: No credentials cache found\n") at fe-auth.c:731 731 name = pg_krb5_authname(PQerrormsg); (gdb) up #5 0x0084f8dd in conninfo_parse (conninfo=0x935f3c8 "\001", errorMessage=0xf3c00780) at fe-connect.c:2720 2720 option->val = fe_getauthname(errortmp); (gdb) up #6 0x0084d2fc in connectOptions1 (conn=0xf3c00480, conninfo=0x935f3c8 "\001") at fe-connect.c:328 328 connOptions = conninfo_parse(conninfo, &conn->errorMessage); (gdb) up #7 0x0084d295 in PQconnectStart (conninfo=0x935f3c8 "\001") at fe-connect.c:287 287 if (!connectOptions1(conn, conninfo)) (gdb) up #8 0x0084d242 in PQconnectdb (conninfo=0x935f3c8 "\001") at fe-connect.c:245 245 PGconn *conn = PQconnectStart(conninfo); (gdb) up #9 0x00115257 in rxv_spin_db_connect (pool=0x93a7eb0, cpool=0x0, conninfo=0xeb5a15 "dbname=spintest", type=1 '\001') at db.c:255 255 if(!(conn->pgconn=PQconnectdb(conninfo))) (gdb) p conninfo $1 = 0xeb5a15 "dbname=spintest" --------------------------------- I have no idea how the conninfo (which in my code is a constant string) gets screwed after that... BTW, this is just *one* way those segfaults happen. Sometimes, it is when free() is called from various krb5 library functions on an invalid pointer. Not sure if this is helpful, but here it is nevertheless.
I'm hoping krb5-1.3.5 will eventually become part of FC3, at which point I can retest with it...
Still present in recently updates krb5-libs-1.3.6-2.
According to Kerberos developer Tom Yu, these memory leaks have been fixed in 1.4, which is currently at beta 3.
I did a little bit more snooping around and it turns out: - worker MPM: segfaults - prefork MPM: no segfaults This is related to thread safety issues in relation to the way PostgreSQL uses krb5 libraries. This is what Tom said: ----------------------------- I took a closer look at postgresql-7.4.5/src/interfaces/libpq/fe-auth.c and discovered that a lot of krb5 state is kept as file-scope static variables. The problem is therefore mostly postgresql's fault, since it looks like libpq itself isn't thread-safe. ----------------------------- In relation to memory leaks, this is allegedly fixed in 1.4 beta.
BTW, do I close this bug with with UPSTREAM? I'm not sure what the policy of including krb5-libs is and if 1.4 is going to go into the FC3 updates... Maybe I should close it with NEXTRELEASE?