Bug 537700 - yum got stuck after "double linked list corrupted" when under tight memory pressure
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: python
Version: 12
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Dave Malcolm
QA Contact: Fedora Extras Quality Assurance
Blocks: 608710
Reported: 2009-11-15 21:13 UTC by Michal Jaegermann
Modified: 2014-01-21 23:12 UTC (History)
Last Closed: 2010-12-04 03:18:14 UTC

Description Michal Jaegermann 2009-11-15 21:13:40 UTC
Description of problem:

An attempt to update to the latest rawhide failed in the following manner:
....

  Updating       : glibc-2.11.90-1.x86_64                               26/1254 
error: Couldn't fork %post(glibc-2.11.90-1.x86_64): Cannot allocate memory
  Updating       : bash-4.0.35-1.fc13.x86_ [######                  ]   27/1254*** glibc detected *** /usr/bin/python: malloc(): smallbin double linked list corrupted: 0x0000000009241810 ***

At this point yum got stuck in state "S+" and made no further progress whatsoever, even after a long wait (around 20 minutes).  Moreover, a plain 'kill' on the stuck process was ineffective.  'kill -9' did work, but the rpm database was not so happy after this.

By hook and by crook it was still possible to upgrade all python, yum and glibc packages, but this had no real effect on the problem.  A subsequent attempt at 'yum-complete-transaction' got stuck in a similar manner, i.e. on every try, debugging or not, things fell apart after "smallbin double linked list corrupted".

With the system in a really inconsistent state, some packages partially updated and a raft of duplicates, in order to recover I had to run 'rpm -Fvh --nodeps ...' on all packages in the cache one by one, followed by as many 'yum-complete-transaction' runs as required until no transactions were left.  That worked.  Also, smaller yum transactions, for example 'yum update kernel\*', did not run into this list corruption issue.

Version-Release number of selected component (if applicable):
yum-3.2.24-9.fc12.noarch
python-2.6.2-2.fc12.x86_64
glibc-2.11-2.x86_64

and also, after updates:

yum-3.2.25-1.fc13.noarch
python-2.6.4-3.fc13.x86_64
glibc-2.11.90-1.x86_64

How reproducible:
Not sure.  I could not get past the problem until I used the workarounds described above.

Steps to Reproduce:
Hm?  Get into "big enough" yum transaction and wait for a corruption to happen?

Additional info:
The machine in question has 512 MB of on-board memory.  When this happened, 'free -m' indicated that quite a bit of memory was still available, even discounting buffers/cache.  Swap was used only minimally.  Runs of 'memtest' on the hardware have so far not turned up any problems.

The update was certainly sizeable: 637 packages adding up to 809 MB to install.  Still, in the past the same machine has occasionally "survived" yum transactions of roughly the same size.

Comment 1 James Antill 2009-11-16 14:21:15 UTC
I'm going to assign this to python, but my guess is that it's a corner case due to:

error: Couldn't fork %post(glibc-2.11.90-1.x86_64): Cannot allocate memory

Comment 2 Bug Zapper 2009-11-16 15:34:08 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Dave Malcolm 2009-11-16 16:34:09 UTC
Thanks for filing this bug.

The message:
  error: Couldn't fork %post(glibc-2.11.90-1.x86_64): Cannot allocate memory
came from inside librpm, in rpm/psm.c:runScript:
    if (psm->sq.child == (pid_t)-1) {
        rpmlog(RPMLOG_ERR, _("Couldn't fork %s: %s\n"), sname, strerror(errno));
        goto exit;
    }

This error path is taken when the fork() syscall in rpmio/rpmsq.c:rpmsqFork fails.  According to the fork(2) manpage, on failure fork() returns -1 and sets errno, so you're seeing this error (again from the fork(2) manpage):

  ENOMEM fork() failed to allocate the necessary kernel structures because memory is tight.
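
For illustration only, here is a minimal Python sketch of the same failure mode as seen from a Python process: where the C code sees fork() return -1, os.fork() raises OSError carrying the same errno. The helper names try_fork and fork_out_of_memory are hypothetical, and the injectable fork argument is an assumption of this sketch so that ENOMEM can be simulated without actually running out of memory. This is not how rpm or yum handle the error.

```python
import errno
import os


def try_fork(fork=os.fork):
    """Call fork() and report a failure in the style of the rpm message.

    `fork` is injectable (an assumption of this sketch) so that a failure
    can be simulated without actually exhausting memory.
    Returns (pid, None) on success or (None, message) on failure.
    """
    try:
        pid = fork()
    except OSError as exc:
        # Python raises OSError where C code sees fork() return -1;
        # exc.errno carries the same errno value, e.g. ENOMEM.
        return None, "Couldn't fork: %s" % os.strerror(exc.errno)
    return pid, None


def fork_out_of_memory():
    """Hypothetical stand-in for os.fork() that always fails with ENOMEM."""
    raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))


if __name__ == "__main__":
    # Simulate the low-memory case instead of really forking.
    pid, error = try_fork(fork_out_of_memory)
    print(error)
```

The point of the sketch is that a failed fork is an ordinary, reportable error; it does not by itself explain the later heap corruption.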

You mentioned that you've done upgrades of similar sizes in the past; have you ever seen these "fork" error messages before?

The error message
  *** glibc detected *** /usr/bin/python: malloc(): smallbin double linked list corrupted: 0x0000000009241810 ***
came from inside:
  Void_t* _int_malloc(mstate av, size_t bytes)

This is an integrity check performed inside the malloc implementation.  It was added on 2009-06-19 with this commit:
http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=f6887a0d9a55f5c80c567d9cb153c1c6582410f9
although similar checks have been present in the malloc code for a long time.

Something had become corrupted within the internal representation of the heap.  I suspect that a write occurred outside the range of an allocated block of memory _somewhere_ within the python process.  Unfortunately, this kind of thing is hard to track down; such bugs can lurk in rarely-used code paths in one of the modules (or a library they link to), perhaps code paths that are only reached when available memory is extremely low.
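
As a rough illustration of what such a check guards against, here is a toy Python model of a doubly linked free list. It is not glibc's code: the Chunk class, the helper names and the exception are all hypothetical, and only the fd/bk link names and the idea of verifying that the neighbour's forward pointer still points back at the chunk being removed are borrowed from malloc.

```python
class CorruptedList(Exception):
    """Raised when the list's forward/backward links disagree."""


class Chunk:
    """Toy free-list node; fd/bk mimic malloc's forward/backward links."""

    def __init__(self, name):
        self.name = name
        self.fd = self  # a lone node points at itself (empty circular list)
        self.bk = self


def insert_front(bin_head, chunk):
    """Link `chunk` in right after the list head."""
    first = bin_head.fd
    chunk.fd = first
    chunk.bk = bin_head
    first.bk = chunk
    bin_head.fd = chunk


def take_last(bin_head):
    """Unlink and return the last chunk, checking link integrity first."""
    victim = bin_head.bk
    if victim is bin_head:
        return None  # list is empty
    bck = victim.bk
    # Integrity check: the previous node must point forward to the victim.
    if bck.fd is not victim:
        raise CorruptedList("double linked list corrupted")
    bin_head.bk = bck
    bck.fd = bin_head
    return victim


def demo_corruption():
    """Overwrite one link, as a stray write might, and trigger the check."""
    head, a, b = Chunk("head"), Chunk("a"), Chunk("b")
    insert_front(head, a)
    insert_front(head, b)
    b.fd = Chunk("stray")  # simulated out-of-bounds write over a link
    try:
        take_last(head)
    except CorruptedList as exc:
        return str(exc)
    return "no corruption detected"
```

In this model the corruption is only noticed on the next removal from the list, which is one reason the real culprit can be far away from the place where the check fires.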

Are you able to reproduce this at all?

Comment 4 Michal Jaegermann 2009-11-16 18:03:09 UTC
> Are you able to reproduce this at all?

While this sizeable transaction was still pending, all attempts at 'yum-complete-transaction' (three or four of them, interspersed with updates of every python, yum and glibc package that needed or could take an update, in attempts to get around the problem) consistently ended with "smallbin double linked list corrupted" and a stuck process.  I also ran 'rpm --rebuilddb' in between, also to no avail.

After "manually" fixing up a rather messed-up system, smaller yum transactions were not an issue.  Also, some of the partial intermediate updates mentioned above were done with the help of yum (after the aborted transaction glibc ended up as a tangled mess where yum could go neither forward nor back).

If yum just aborted under memory pressure, even with plenty of virtual memory still available, that would not make me very happy, but the event would at least be possible to explain.  Just stopping with structure corruption smells to me like a serious bug (although likely hard to track down and probably in python).

As I mentioned, yum transactions of a similar size have happened on the same box in the past, if not that frequently, and before this they went through.

Comment 5 Bug Zapper 2010-11-04 06:22:52 UTC
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 6 Bug Zapper 2010-12-04 03:18:14 UTC
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

