Description of problem: Not sure that this is really gdb's problem, it may be in the kernel or somewhere else. I can no longer debug the core files produced by abort() calls in F14. This used to work in F13 and before. It complains that the core file is too short: Reading symbols from /home/tgl/testversion/bin/postgres...done. BFD: Warning: /home/tgl/testversion/data/core.23152 is truncated: expected core file size >= 43823104, found: 6680576. warning: core file may not match specified executable file. [New Thread 23152] Cannot access memory at address 0x7f3a4d3e26e0 (gdb) bt #0 0x00000038cd6330c5 in ?? () Cannot access memory at address 0x7fff9277e658 I am suspicious that the difference between the "expected" and actual core sizes has something to do with Postgres' use of a SysV shared memory segment, but I'm only guessing. Version-Release number of selected component (if applicable): gdb-7.2-51.fc14.x86_64 kernel-2.6.35.13-92.fc14.x86_64 How reproducible: Seems consistent when the core file is produced by an abort() call, but it doesn't happen with other methods such as kill -ABRT. Steps to Reproduce: 1. Booby-trap Postgres with an abort() call for debugging purposes. 2. Run it under ulimit -c unlimited. 3. Provoke the abort. 4. Try to use gdb on resulting core file. Actual results: Fails, complaining core file is too short. Expected results: Should work. Additional info: I'm sure this isn't enough info for a useful report; please let me know what other info might be relevant.
It is almost definitely a kernel Bug but could you attach / provide the core file before I try to reproduce it?
(In reply to comment #1) > It is almost definitely a kernel Bug but could you attach / provide the core > file before I try to reproduce it? Sure, but the examples that I have are several MB in size. Is that too large to attach to a BZ?
Even after xz? BZ accepts something up to about 10MB IIRC.
Created attachment 518074 [details] core file, along with executable it was made from OK, here it is.
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align [...] LOAD 0x0000000002993000 0xffffffffff600000 0x0000000000000000 ^^^^^^^^^^^^^^^^^^ 0x0000000000000000 0x0000000000001000 R E 1000 The core file size should be 0x2993000 == 43593728 bytes != 8019968 bytes it is. This is unrelated to GDB, this is a kernel core file producer bug.
Created attachment 518088 [details] Attempted reproducer with shmat(). The shmat idea does not work, it will be more complicated.
Comment on attachment 518088 [details] Attempted reproducer with shmat(). The reproducer is invalid.
Further testing has shown that sometimes I don't get a core file at all. I found some seemingly related messages in /var/log/messages: Aug 15 11:44:31 rh3 abrt[32559]: bad old pattern '636f72', ignoring and using 'core' Aug 15 11:44:31 rh3 abrt[32559]: can't cd to (null) Aug 15 11:44:31 rh3 abrt[32559]: can't read /proc/32551/exe link and another time Aug 15 15:10:37 rh3 abrt[31546]: bad old pattern '636f72', ignoring and using 'core' Aug 15 15:10:37 rh3 abrt[31546]: saved core dump of pid 31542 (/home/tgl/testversion/bin/postgres) to /var/spool/abrt/ccpp-1313435437-31542.new/coredump (65536 bytes) Aug 15 15:10:37 rh3 abrtd: Directory 'ccpp-1313435437-31542' creation detected Aug 15 15:10:37 rh3 abrtd: Executable '/home/tgl/testversion/bin/postgres' doesn't belong to any package Aug 15 15:10:37 rh3 abrtd: Corrupted or bad crash /var/spool/abrt/ccpp-1313435437-31542 (res:4), deleting Is it possible that abrt is messing things up? abrt components currently installed: abrt-plugin-logger-1.1.18-1.fc14.x86_64 abrt-libs-1.1.18-1.fc14.x86_64 abrt-addon-python-1.1.18-1.fc14.x86_64 abrt-addon-ccpp-1.1.18-1.fc14.x86_64 abrt-desktop-1.1.18-1.fc14.x86_64 abrt-plugin-runapp-1.1.18-1.fc14.x86_64 abrt-addon-kerneloops-1.1.18-1.fc14.x86_64 abrt-plugin-bugzilla-1.1.18-1.fc14.x86_64 abrt-1.1.18-1.fc14.x86_64 abrt-gui-1.1.18-1.fc14.x86_64
Yes, it is probably due to ABRT, there were bugs of this kind. Try to disable it temporarily and then reassign it to ABRT. # /etc/init.d/abrtd start Starting abrt daemon: [ OK ] # cat /proc/sys/kernel/core_pattern |/usr/libexec/abrt-hook-ccpp /var/spool/abrt %s %c %p %u %g %t %h %e 636f726500 # /etc/init.d/abrtd stop Stopping abrt daemon: [ OK ] # cat /proc/sys/kernel/core_pattern core
Yes, it seems so: I did my test case about ten times with abrtd off, and didn't get a failure, but it seems to fail about one time in two with abrtd running. Also, that testing convinces me that the failure does not manifest unless there are multiple postgres processes running --- I can usually get it to fail if I provoke a crash during postgres's "parallel" regression tests, but not when the tests are run serially. This is consistent with the original report, because I was testing high-concurrency scenarios at the time. Note that only one of the processes is aborting, the others are just other executions of the same executable. Reassigning to abrt component.
Is this still an issue for you? Fedora 14 and abrt-1.x are a bit old now and this might be fixed in newer version of kernel and/or abrt as there are few fixed bugs that might cause this (e.g. #618602). If so, please try to reproduce it with ProcessUnpackaged set to yes in /etc/abrt/abrt.conf and MakeCompatCore set to yes in /etc/abrt/plugins/CCpp.conf
Yes, it still happens with a fully up-to-date F14 machine: kernel 2.6.35.14-96.fc14.x86_64, abrt versions as stated in comment #8. > Fedora 14 and abrt-1.x are a bit old now Should I interpret that as "we're not fixing bugs in F14's abrtd anymore"? > If so, please try to reproduce it with ProcessUnpackaged set to yes in Please note that the problem being complained of is exactly that abrtd is interfering with core dumps of an unpackaged program. Why would changing that setting make it better? > and MakeCompatCore set to yes in /etc/abrt/plugins/CCpp.conf That already is set to yes in the stock configuration (which I've not touched).
This is the simplest reproducer I've found so far: hack the Postgres source code like so: diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c index a367421..e4d33bd 100644 *** a/src/backend/utils/adt/int.c --- b/src/backend/utils/adt/int.c *************** int4div(PG_FUNCTION_ARGS) *** 725,731 **** if (arg2 == 0) { ! ereport(ERROR, (errcode(ERRCODE_DIVISION_BY_ZERO), errmsg("division by zero"))); /* ensure compiler realizes we mustn't reach the division (gcc bug) */ --- 725,731 ---- if (arg2 == 0) { ! ereport(PANIC, (errcode(ERRCODE_DIVISION_BY_ZERO), errmsg("division by zero"))); /* ensure compiler realizes we mustn't reach the division (gcc bug) */ build, install, then run the parallel regression tests ("make installcheck-parallel"). Doesn't fail every time, but seems to be at least one time in two. One possibly interesting point is that if only a single core file is produced, it's (almost?) always OK. If more than one core file is produced, usually all but one are corrupted. Maybe the bug is related to dumping of concurrently produced cores? The PG regression tests contain several distinct scripts that will exercise this crash concurrently, so depending on timing they could be expected to tickle such a bug pretty often.
After a bit of further testing, I was able to reproduce the problem on F-15 up-to-date as of today, kernel 2.6.40.4-5.fc15.x86_64, abrt versions 2.0.3-1.fc15.x86_64. So "F-14 is too old" isn't getting you off the hook ;-) [tgl@rhlap regress]$ ls -l $PGDATA/core* -rw-------. 1 tgl tgl 34676736 Sep 7 19:17 /home/tgl/testversion/data/core.26717 -rw-------. 1 tgl tgl 26222592 Sep 7 19:17 /home/tgl/testversion/data/core.26913 [tgl@rhlap regress]$ gdb ~/testversion/bin/postgres /home/tgl/testversion/data/core.26913 GNU gdb (GDB) Fedora (7.3-43.fc15) Copyright (C) 2011 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/tgl/testversion/bin/postgres...done. BFD: Warning: /home/tgl/testversion/data/core.26913 is truncated: expected core file size >= 35442688, found: 26222592. warning: core file may not match specified executable file. [New LWP 26913] Cannot access memory at address 0x7fd4b17026e0 (gdb) It seems to be harder to reproduce on F-15, but that may be because the laptop I have F-15 on is scrawnier than my F-14 workstation and so it's harder to get multiple core dumps going concurrently. BTW, why does gdb *invariably* claim "core file may not match specified executable file"? The warning seems entirely useless, if it's unable to resolve the issue with more than 0% reliability.
There is a file /proc/sys/kernel/core_pipe_limit where one can specify how many coredump hooks can be run at the same time, by default it's set to 4, so if there is more than 4 crashes at the same time it's not guaranteed that kernel will wait for the hook to finish - this might or might not be the case here, but definitely worth investigating.
(In reply to comment #15) > There is a file /proc/sys/kernel/core_pipe_limit where one can specify how many > coredump hooks can be run at the same time, by default it's set to 4 Interesting, I didn't know about that. I do not think it's relevant to the immediate issue --- in the reproducer shown above, I've never seen more than three core files produced at once, and examination of the test scripts suggests that that's the most number of tests that could hit that PANIC concurrently. > so if > there is more than 4 crashes at the same time it's not guaranteed that kernel > will wait for the hook to finish - this might or might not be the case here, > but definitely worth investigating. Definitely seems like a potential cause of failures, even if it's not the issue right here.
I've modified int4div as you suggested but the test run (installcheck-parallel) produces only one core dump at a time (according to the logfile, postgres terminates all other running processes and starts recovery proces preventing further connections). My steps: /usr/local/pgsql/bin/initdb -D $PGDATA /usr/local/pgsql/bin/postgres -D $PGDATA > logfile 2>&1 make installcheck-parallel Is there something more required to get multiple dumps?
(In reply to comment #17) > I've modified int4div as you suggested but the test run (installcheck-parallel) > produces only one core dump at a time Sometimes that's what you get. Try multiple times. Try it on beefier hardware --- as I mentioned, my workstation shows the problem much more often than my laptop. [ reflects for a bit... ] You could perhaps increase the probability of concurrent core dumps if you modified the parallel_schedule file to make sure all of the potentially vulnerable tests are run in the same test group. A look at the expected outputs says that errors, case, guc, transactions, and plpgsql are the potential crashers. However, there are ordering dependencies between some of the tests which means that just altering parallel_schedule arbitrarily won't work. So I'd suggest just trying a few times before you resort to fooling with that.
(In reply to comment #18) > [ reflects for a bit... ] You could perhaps increase the probability of > concurrent core dumps if you modified the parallel_schedule file to make sure > all of the potentially vulnerable tests are run in the same test group. Or even easier, just insert "select 1/0;" at the head of several sql/*.sql files that are already run concurrently by the parallel_schedule file.
(In reply to comment #19) > (In reply to comment #18) > > [ reflects for a bit... ] You could perhaps increase the probability of > > concurrent core dumps if you modified the parallel_schedule file to make sure > > all of the potentially vulnerable tests are run in the same test group. > > Or even easier, just insert "select 1/0;" at the head of several sql/*.sql > files that are already run concurrently by the parallel_schedule file. This helped and resulted in truncated core dumps created. As Jiri suggested core_pipe_limit is causing this. dmesg contains: [104315.551500] Pid 15665(postgres) over core_pipe_limit [104315.551502] Skipping core dump [104320.986235] Pid 16436(postgres) over core_pipe_limit [104320.986238] Skipping core dump Try increasing /proc/sys/kernel/core_pipe_limit to 20 or more - echo 20 > /proc/sys/kernel/core_pipe_limit this solved the issue for me.
(In reply to comment #20) > dmesg contains: > [104315.551500] Pid 15665(postgres) over core_pipe_limit > [104315.551502] Skipping core dump > [104320.986235] Pid 16436(postgres) over core_pipe_limit > [104320.986238] Skipping core dump Hmm, I do not see that here ... > Try increasing /proc/sys/kernel/core_pipe_limit to 20 or more - > echo 20 > /proc/sys/kernel/core_pipe_limit > this solved the issue for me. That does *not* fix it for me. In any case, if there are only three processes crashing, why wouldn't a setting of 4 be enough?
(In reply to comment #21) > (In reply to comment #20) > > dmesg contains: > > [104315.551500] Pid 15665(postgres) over core_pipe_limit > > [104315.551502] Skipping core dump > > [104320.986235] Pid 16436(postgres) over core_pipe_limit > > [104320.986238] Skipping core dump > > Hmm, I do not see that here ... Then it might be another issue but I'm unable to reproduce that one (maybe on f14|stronger machine). > > > Try increasing /proc/sys/kernel/core_pipe_limit to 20 or more - > > echo 20 > /proc/sys/kernel/core_pipe_limit > > this solved the issue for me. > > That does *not* fix it for me. In any case, if there are only three processes > crashing, why wouldn't a setting of 4 be enough? In my case, all of the sql files contained select "select 1/0;" so I had to use bigger value. Exact number is not relevant as this is probably caused by something else.
BTW, I did not have any luck with increasing the probability of the failure by adding "select 1/0;" as I suggested above. This suggests strongly to me that the problem is correlated with use of SysV shared memory: if the first command fed to a hacked backend is just "select 1/0", it will probably crash after very little access to the shared memory block. I did find that adjusting the parallel_schedule file as in the attached patch increased the probability of a failure quite a bit. This ensures that several of the crashable tests are launched in parallel. In the unmodified schedule, you'll only see concurrent core dumps if the test driver manages to launch a followon parallel set before the postmaster has gotten notified of the crash of the "errors" test, which isn't that easy to do.
Created attachment 522962 [details] patch postgres to cause problem at issue
-> moving to rawhide, so this won't get killed by the bugzapper
*** Bug 633359 has been marked as a duplicate of this bug. ***
Filed Bug 759213
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
Note: bug 696166 is probably the same.
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could affect also pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19