Bug 730169 - postgresql generated core file is incomplete
postgresql generated core file is incomplete
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: abrt (Show other bugs)
19
x86_64 Unspecified
high Severity unspecified
: ---
: ---
Assigned To: Richard Marko
Fedora Extras Quality Assurance
: Regression
: 633359 (view as bug list)
Depends On: 759213 773215
Blocks: ABRTF18
  Show dependency treegraph
 
Reported: 2011-08-11 21:20 EDT by Tom Lane
Modified: 2016-01-31 21:21 EST (History)
19 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-08-09 06:50:07 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
core file, along with executable it was made from (4.89 MB, application/octet-stream)
2011-08-12 13:52 EDT, Tom Lane
no flags Details
Attempted reproducer with shmat(). (497 bytes, text/plain)
2011-08-12 14:39 EDT, Jan Kratochvil
no flags Details
patch postgres to cause problem at issue (1.82 KB, patch)
2011-09-13 12:21 EDT, Tom Lane
no flags Details | Diff

  None (edit)
Description Tom Lane 2011-08-11 21:20:02 EDT
Description of problem:
Not sure that this is really gdb's problem, it may be in the kernel or somewhere else.

I can no longer debug the core files produced by abort() calls in F14.  This used to work in F13 and before.  It complains that the core file is too short:

Reading symbols from /home/tgl/testversion/bin/postgres...done.
BFD: Warning: /home/tgl/testversion/data/core.23152 is truncated: expected core file size >= 43823104, found: 6680576.

warning: core file may not match specified executable file.
[New Thread 23152]
Cannot access memory at address 0x7f3a4d3e26e0
(gdb) bt
#0  0x00000038cd6330c5 in ?? ()
Cannot access memory at address 0x7fff9277e658

I am suspicious that the difference between the "expected" and actual core sizes has something to do with Postgres' use of a SysV shared memory segment, but I'm only guessing.

Version-Release number of selected component (if applicable):
gdb-7.2-51.fc14.x86_64
kernel-2.6.35.13-92.fc14.x86_64

How reproducible:
Seems consistent when the core file is produced by an abort() call, but it doesn't happen with other methods such as kill -ABRT.

Steps to Reproduce:
1. Booby-trap Postgres with an abort() call for debugging purposes.
2. Run it under ulimit -c unlimited.
3. Provoke the abort.
4. Try to use gdb on resulting core file.
  
Actual results:
Fails, complaining core file is too short.

Expected results:
Should work.

Additional info:
I'm sure this isn't enough info for a useful report; please let me know what other info might be relevant.
Comment 1 Jan Kratochvil 2011-08-12 02:24:41 EDT
It is almost definitely a kernel Bug but could you attach / provide the core file before I try to reproduce it?
Comment 2 Tom Lane 2011-08-12 08:37:55 EDT
(In reply to comment #1)
> It is almost definitely a kernel Bug but could you attach / provide the core
> file before I try to reproduce it?

Sure, but the examples that I have are several MB in size.  Is that too large to attach to a BZ?
Comment 3 Jan Kratochvil 2011-08-12 08:53:01 EDT
Even after xz? BZ accepts something up to about 10MB IIRC.
Comment 4 Tom Lane 2011-08-12 13:52:26 EDT
Created attachment 518074 [details]
core file, along with executable it was made from

OK, here it is.
Comment 5 Jan Kratochvil 2011-08-12 14:38:19 EDT
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
[...]
  LOAD           0x0000000002993000 0xffffffffff600000 0x0000000000000000
                 ^^^^^^^^^^^^^^^^^^
                 0x0000000000000000 0x0000000000001000  R E    1000

The core file size should be 0x2993000 == 43593728 bytes != 8019968 bytes it is.

This is unrelated to GDB, this is a kernel core file producer bug.
Comment 6 Jan Kratochvil 2011-08-12 14:39:10 EDT
Created attachment 518088 [details]
Attempted reproducer with shmat().

The shmat idea does not work, it will be more complicated.
Comment 7 Jan Kratochvil 2011-08-12 14:39:35 EDT
Comment on attachment 518088 [details]
Attempted reproducer with shmat().

The reproducer is invalid.
Comment 8 Tom Lane 2011-08-16 12:20:45 EDT
Further testing has shown that sometimes I don't get a core file at all.  I found some seemingly related messages in /var/log/messages:

Aug 15 11:44:31 rh3 abrt[32559]: bad old pattern '636f72', ignoring and using 'core'
Aug 15 11:44:31 rh3 abrt[32559]: can't cd to (null)
Aug 15 11:44:31 rh3 abrt[32559]: can't read /proc/32551/exe link

and another time

Aug 15 15:10:37 rh3 abrt[31546]: bad old pattern '636f72', ignoring and using 'core'
Aug 15 15:10:37 rh3 abrt[31546]: saved core dump of pid 31542 (/home/tgl/testversion/bin/postgres) to /var/spool/abrt/ccpp-1313435437-31542.new/coredump (65536 bytes)
Aug 15 15:10:37 rh3 abrtd: Directory 'ccpp-1313435437-31542' creation detected
Aug 15 15:10:37 rh3 abrtd: Executable '/home/tgl/testversion/bin/postgres' doesn't belong to any package
Aug 15 15:10:37 rh3 abrtd: Corrupted or bad crash /var/spool/abrt/ccpp-1313435437-31542 (res:4), deleting

Is it possible that abrt is messing things up?  abrt components currently installed:

abrt-plugin-logger-1.1.18-1.fc14.x86_64
abrt-libs-1.1.18-1.fc14.x86_64
abrt-addon-python-1.1.18-1.fc14.x86_64
abrt-addon-ccpp-1.1.18-1.fc14.x86_64
abrt-desktop-1.1.18-1.fc14.x86_64
abrt-plugin-runapp-1.1.18-1.fc14.x86_64
abrt-addon-kerneloops-1.1.18-1.fc14.x86_64
abrt-plugin-bugzilla-1.1.18-1.fc14.x86_64
abrt-1.1.18-1.fc14.x86_64
abrt-gui-1.1.18-1.fc14.x86_64
Comment 9 Jan Kratochvil 2011-08-16 12:28:37 EDT
Yes, it is probably due to ABRT, there were bugs of this kind.
Try to disable it temporarily and then reassign it to ABRT.

# /etc/init.d/abrtd start
Starting abrt daemon:                                      [  OK  ]
# cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp /var/spool/abrt %s %c %p %u %g %t %h %e 636f726500
# /etc/init.d/abrtd stop
Stopping abrt daemon:                                      [  OK  ]
# cat /proc/sys/kernel/core_pattern
core
Comment 10 Tom Lane 2011-08-23 18:58:25 EDT
Yes, it seems so: I did my test case about ten times with abrtd off, and didn't get a failure, but it seems to fail about one time in two with abrtd running.

Also, that testing convinces me that the failure does not manifest unless there are multiple postgres processes running --- I can usually get it to fail if I provoke a crash during postgres's "parallel" regression tests, but not when the tests are run serially.  This is consistent with the original report, because I was testing high-concurrency scenarios at the time.  Note that only one of the processes is aborting, the others are just other executions of the same executable.

Reassigning to abrt component.
Comment 11 Richard Marko 2011-09-07 11:22:43 EDT
Is this still an issue for you? Fedora 14 and abrt-1.x are a bit old now and this  might be fixed in newer version of kernel and/or abrt as there are few fixed bugs that might cause this (e.g. #618602).

If so, please try to reproduce it with ProcessUnpackaged set to yes in /etc/abrt/abrt.conf and MakeCompatCore set to yes in /etc/abrt/plugins/CCpp.conf
Comment 12 Tom Lane 2011-09-07 18:18:05 EDT
Yes, it still happens with a fully up-to-date F14 machine: kernel 2.6.35.14-96.fc14.x86_64, abrt versions as stated in comment #8.

> Fedora 14 and abrt-1.x are a bit old now

Should I interpret that as "we're not fixing bugs in F14's abrtd anymore"?

> If so, please try to reproduce it with ProcessUnpackaged set to yes in

Please note that the problem being complained of is exactly that abrtd is interfering with core dumps of an unpackaged program.  Why would changing that setting make it better?

> and MakeCompatCore set to yes in /etc/abrt/plugins/CCpp.conf

That already is set to yes in the stock configuration (which I've not touched).
Comment 13 Tom Lane 2011-09-07 18:53:31 EDT
This is the simplest reproducer I've found so far: hack the Postgres source code like so:

diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index a367421..e4d33bd 100644
*** a/src/backend/utils/adt/int.c
--- b/src/backend/utils/adt/int.c
*************** int4div(PG_FUNCTION_ARGS)
*** 725,731 ****
  
    if (arg2 == 0)
    {
!       ereport(ERROR,
                (errcode(ERRCODE_DIVISION_BY_ZERO),
                 errmsg("division by zero")));
        /* ensure compiler realizes we mustn't reach the division (gcc bug) */
--- 725,731 ----
  
    if (arg2 == 0)
    {
!       ereport(PANIC,
                (errcode(ERRCODE_DIVISION_BY_ZERO),
                 errmsg("division by zero")));
        /* ensure compiler realizes we mustn't reach the division (gcc bug) */

build, install, then run the parallel regression tests ("make installcheck-parallel").  Doesn't fail every time, but seems to be at least one time in two.


One possibly interesting point is that if only a single core file is produced, it's (almost?) always OK.  If more than one core file is produced, usually all but one are corrupted.  Maybe the bug is related to dumping of concurrently produced cores?  The PG regression tests contain several distinct scripts that will exercise this crash concurrently, so depending on timing they could be expected to tickle such a bug pretty often.
Comment 14 Tom Lane 2011-09-07 19:26:53 EDT
After a bit of further testing, I was able to reproduce the problem on F-15 up-to-date as of today, kernel 2.6.40.4-5.fc15.x86_64, abrt versions 2.0.3-1.fc15.x86_64.  So "F-14 is too old" isn't getting you off the hook ;-)

[tgl@rhlap regress]$ ls -l $PGDATA/core*
-rw-------. 1 tgl tgl 34676736 Sep  7 19:17 /home/tgl/testversion/data/core.26717
-rw-------. 1 tgl tgl 26222592 Sep  7 19:17 /home/tgl/testversion/data/core.26913
[tgl@rhlap regress]$ gdb ~/testversion/bin/postgres /home/tgl/testversion/data/core.26913
GNU gdb (GDB) Fedora (7.3-43.fc15)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/tgl/testversion/bin/postgres...done.
BFD: Warning: /home/tgl/testversion/data/core.26913 is truncated: expected core file size >= 35442688, found: 26222592.

warning: core file may not match specified executable file.
[New LWP 26913]
Cannot access memory at address 0x7fd4b17026e0
(gdb) 


It seems to be harder to reproduce on F-15, but that may be because the laptop I have F-15 on is scrawnier than my F-14 workstation and so it's harder to get multiple core dumps going concurrently.

BTW, why does gdb *invariably* claim "core file may not match specified executable file"?  The warning seems entirely useless, if it's unable to resolve the issue with more than 0% reliability.
Comment 15 Jiri Moskovcak 2011-09-08 06:38:13 EDT
There is a file /proc/sys/kernel/core_pipe_limit where one can specify how many coredump hooks can be run at the same time, by default it's set to 4, so if there is more than 4 crashes at the same time it's not guaranteed that kernel will wait for the hook to finish - this might or might not be the case here, but definitely worth investigating.
Comment 16 Tom Lane 2011-09-08 10:56:54 EDT
(In reply to comment #15)
> There is a file /proc/sys/kernel/core_pipe_limit where one can specify how many
> coredump hooks can be run at the same time, by default it's set to 4

Interesting, I didn't know about that.  I do not think it's relevant to the immediate issue --- in the reproducer shown above, I've never seen more than three core files produced at once, and examination of the test scripts suggests that that's the most number of tests that could hit that PANIC concurrently.

> so if
> there is more than 4 crashes at the same time it's not guaranteed that kernel
> will wait for the hook to finish - this might or might not be the case here,
> but definitely worth investigating.

Definitely seems like a potential cause of failures, even if it's not the issue right here.
Comment 17 Richard Marko 2011-09-12 07:41:06 EDT
I've modified int4div as you suggested but the test run (installcheck-parallel) produces only one core dump at a time (according to the logfile, postgres terminates all other running processes and starts recovery proces preventing further connections).

My steps:
/usr/local/pgsql/bin/initdb -D $PGDATA
/usr/local/pgsql/bin/postgres -D $PGDATA > logfile 2>&1
make installcheck-parallel

Is there something more required to get multiple dumps?
Comment 18 Tom Lane 2011-09-12 09:56:13 EDT
(In reply to comment #17)
> I've modified int4div as you suggested but the test run (installcheck-parallel)
> produces only one core dump at a time

Sometimes that's what you get.  Try multiple times.  Try it on beefier hardware --- as I mentioned, my workstation shows the problem much more often than my laptop.

[ reflects for a bit... ]  You could perhaps increase the probability of concurrent core dumps if you modified the parallel_schedule file to make sure all of the potentially vulnerable tests are run in the same test group.  A look at the expected outputs says that errors, case, guc, transactions, and plpgsql are the potential crashers.  However, there are ordering dependencies between some of the tests which means that just altering parallel_schedule arbitrarily won't work.  So I'd suggest just trying a few times before you resort to fooling with that.
Comment 19 Tom Lane 2011-09-12 10:23:30 EDT
(In reply to comment #18)
> [ reflects for a bit... ]  You could perhaps increase the probability of
> concurrent core dumps if you modified the parallel_schedule file to make sure
> all of the potentially vulnerable tests are run in the same test group.

Or even easier, just insert "select 1/0;" at the head of several sql/*.sql files that are already run concurrently by the parallel_schedule file.
Comment 20 Richard Marko 2011-09-12 10:47:46 EDT
(In reply to comment #19)
> (In reply to comment #18)
> > [ reflects for a bit... ]  You could perhaps increase the probability of
> > concurrent core dumps if you modified the parallel_schedule file to make sure
> > all of the potentially vulnerable tests are run in the same test group.
> 
> Or even easier, just insert "select 1/0;" at the head of several sql/*.sql
> files that are already run concurrently by the parallel_schedule file.

This helped and resulted in truncated core dumps created. As Jiri suggested core_pipe_limit is causing this.

dmesg contains:
[104315.551500] Pid 15665(postgres) over core_pipe_limit
[104315.551502] Skipping core dump
[104320.986235] Pid 16436(postgres) over core_pipe_limit
[104320.986238] Skipping core dump

Try increasing /proc/sys/kernel/core_pipe_limit to 20 or more - 

echo 20 > /proc/sys/kernel/core_pipe_limit 

this solved the issue for me.
Comment 21 Tom Lane 2011-09-13 12:07:00 EDT
(In reply to comment #20)
> dmesg contains:
> [104315.551500] Pid 15665(postgres) over core_pipe_limit
> [104315.551502] Skipping core dump
> [104320.986235] Pid 16436(postgres) over core_pipe_limit
> [104320.986238] Skipping core dump

Hmm, I do not see that here ...

> Try increasing /proc/sys/kernel/core_pipe_limit to 20 or more - 
> echo 20 > /proc/sys/kernel/core_pipe_limit 
> this solved the issue for me.

That does *not* fix it for me.  In any case, if there are only three processes crashing, why wouldn't a setting of 4 be enough?
Comment 22 Richard Marko 2011-09-13 12:16:37 EDT
(In reply to comment #21)
> (In reply to comment #20)
> > dmesg contains:
> > [104315.551500] Pid 15665(postgres) over core_pipe_limit
> > [104315.551502] Skipping core dump
> > [104320.986235] Pid 16436(postgres) over core_pipe_limit
> > [104320.986238] Skipping core dump
> 
> Hmm, I do not see that here ...

Then it might be another issue but I'm unable to reproduce that one (maybe on f14|stronger machine).

> 
> > Try increasing /proc/sys/kernel/core_pipe_limit to 20 or more - 
> > echo 20 > /proc/sys/kernel/core_pipe_limit 
> > this solved the issue for me.
> 
> That does *not* fix it for me.  In any case, if there are only three processes
> crashing, why wouldn't a setting of 4 be enough?

In my case, all of the sql files contained select "select 1/0;" so I had to use bigger value. Exact number is not relevant as this is probably caused by something else.
Comment 23 Tom Lane 2011-09-13 12:20:37 EDT
BTW, I did not have any luck with increasing the probability of the failure by adding "select 1/0;" as I suggested above.  This suggests strongly to me that the problem is correlated with use of SysV shared memory: if the first command fed to a hacked backend is just "select 1/0", it will probably crash after very little access to the shared memory block.

I did find that adjusting the parallel_schedule file as in the attached patch increased the probability of a failure quite a bit.  This ensures that several of the crashable tests are launched in parallel.  In the unmodified schedule, you'll only see concurrent core dumps if the test driver manages to launch a followon parallel set before the postmaster has gotten notified of the crash of the "errors" test, which isn't that easy to do.
Comment 24 Tom Lane 2011-09-13 12:21:33 EDT
Created attachment 522962 [details]
patch postgres to cause problem at issue
Comment 25 Jiri Moskovcak 2011-11-25 03:37:03 EST
-> moving to rawhide, so this won't get killed by the bugzapper
Comment 26 Jiri Moskovcak 2011-11-29 09:01:04 EST
*** Bug 633359 has been marked as a duplicate of this bug. ***
Comment 27 Richard Marko 2011-12-01 12:26:00 EST
Filed Bug 759213
Comment 28 Fedora Admin XMLRPC Client 2011-12-19 12:41:59 EST
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 29 Fedora Admin XMLRPC Client 2011-12-19 12:44:49 EST
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 30 Denys Vlasenko 2012-01-11 11:27:30 EST
Note: bug 696166 is probably the same.
Comment 31 Fedora End Of Life 2013-04-03 14:07:38 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Note You need to log in before you can comment on or make changes to this bug.