+++ This bug was initially created as a clone of Bug #1246713 +++
Description of problem:
At Facebook we had an app that started hanging and crashing weirdly when going from glibc-2.12-1.149.el6.x86_64 to glibc-2.12-1.163.el6.x86_64. Turns out this patch introduced the problem.
You added the following bit to _int_malloc():
+  /* There are no usable arenas.  Fall back to sysmalloc to get a chunk from
+     mmap.  */
+  if (__glibc_unlikely (av == NULL))
+    {
+      void *p = sYSMALLOc (nb, av);
+      if (p != NULL)
+        alloc_perturb (p, bytes);
+      return p;
+    }
But this isn't OK: alloc_perturb unconditionally memsets the chunk to 0xff, unlike upstream, where it checks whether perturb_byte is set. This needs to be changed to
if (p != NULL && __builtin_expect (perturb_byte, 0))
  alloc_perturb (p, bytes);
The patch I've attached fixes the problem for me.
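To make the difference concrete, here is a minimal standalone sketch of the two behaviours. It is not the glibc source: perturb_byte is modelled as a plain static variable, and the function names alloc_perturb_unguarded, alloc_perturb_guarded and all_bytes_equal are made up for illustration. The fill value follows the XOR with 0xff mentioned elsewhere in this bug.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for glibc's internal perturb_byte; 0 means perturbation is off.
   (Assumption: modelled here as an ordinary static variable.)  */
static int perturb_byte = 0;

/* Unguarded variant, as described in the report: always fills the chunk,
   so with perturb_byte == 0 the memory is overwritten with 0xff.  */
static void alloc_perturb_unguarded (void *p, size_t n)
{
  memset (p, (perturb_byte ^ 0xff) & 0xff, n);
}

/* Guarded variant: leaves the chunk untouched when perturb_byte is 0.  */
static void alloc_perturb_guarded (void *p, size_t n)
{
  if (__builtin_expect (perturb_byte, 0))
    memset (p, (perturb_byte ^ 0xff) & 0xff, n);
}

/* Helper for checking a buffer: returns 1 if every byte equals VALUE.  */
static int all_bytes_equal (const unsigned char *buf, size_t n,
                            unsigned char value)
{
  for (size_t i = 0; i < n; i++)
    if (buf[i] != value)
      return 0;
  return 1;
}
```

With perturb_byte left at 0, a zero-filled buffer passed to the guarded variant stays zero-filled, while the unguarded variant turns every byte into 0xff. Presumably that is what hurts any caller that relies on freshly mmap()'ed memory still being zero.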
This problem is exacerbated by the fact that any sort of lock contention on the arenas results in us falling back to mmap()'ing a new chunk. This is because we check whether the uncontended arena we picked is corrupt, and if it is we loop through the list; if we loop back around to the beginning we know we didn't find anything. Except that if our initial arena isn't actually corrupt we are still at the beginning, so we'll still return NULL. That means we fall back on this mmap() path more often, which really makes things unstable.
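The selection logic described above can be modelled roughly as follows. This is a simplified sketch under assumptions, not the actual arena code: struct arena, pick_arena_buggy and pick_arena_checked are hypothetical names, locking is left out entirely, and the "checked" variant is just one possible way to avoid the false NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified arena: one node in a circular list.  */
struct arena
{
  bool corrupt;
  struct arena *next;
};

/* Models the logic described in the report: skip corrupt arenas and treat
   "we are back at the starting arena" as "nothing usable was found".  */
static struct arena *pick_arena_buggy (struct arena *start)
{
  struct arena *result = start;
  while (result->corrupt)
    {
      result = result->next;
      if (result == start)
        break;
    }
  /* Bug: this is also true when the loop never ran because START was
     perfectly usable, so a healthy arena is reported as a failure.  */
  if (result == start)
    return NULL;
  return result;
}

/* One possible correction: only give up after walking the whole ring.  */
static struct arena *pick_arena_checked (struct arena *start)
{
  struct arena *result = start;
  while (result->corrupt)
    {
      result = result->next;
      if (result == start)
        return NULL;    /* Every arena in the ring is corrupt.  */
    }
  return result;
}
```

In this model, a ring whose starting arena is healthy makes pick_arena_buggy return NULL, which would send the caller down the mmap() fallback, while pick_arena_checked returns the starting arena.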
Please get this fixed as soon as possible; I'd even go so far as to call it a possible security issue.
--- Additional comment from Carlos O'Donell on 2015-07-24 23:11:01 EDT ---
(In reply to Josef Bacik from comment #0)
> Created attachment 1055966 [details]
> patch to fix the problem.
> Description of problem:
> At Facebook we had an app that started hanging and crashing weirdly when
> going from glibc-2.12-1.149.el6.x86_64 to glibc-2.12-1.163.el6.x86_64.
Please note that there is already a RHEL 6.7.z errata that fixes this, and it was released two days ago:
Please update to glibc-2.12-1.166.el6_7.1.
One question: when you write "glibc-2.12-1.163.el6.x86_64", do you actually mean "glibc-2.12-1.166.el6.x86_64" (note .166, not .163)?
Lastly, the robust malloc support has been backed out for the release, but we plan to put it back in as soon as we are certain we've corrected the remaining issues. Would you be interested in testing an unsupported non-production build with the new feature?
--- Additional comment from Josef Bacik on 2015-07-25 11:44:01 EDT ---
We're on Centos, not RHEL, we just happened to end up with the .163 release (I'm not sure how) before 6.7 was released. Give me whatever package you want me to test, we don't care about unsupported, obviously we are capable of supporting ourselves ;). I do need to have an src.rpm tho so I can build and test it on our systems and verify the issue I was seeing is actually fixed.
--- Additional comment from Carlos O'Donell on 2015-08-04 14:59:49 EDT ---
(In reply to Josef Bacik from comment #5)
> We're on Centos, not RHEL, we just happened to end up with the .163 release
> (I'm not sure how) before 6.7 was released. Give me whatever package you
> want me to test, we don't care about unsupported, obviously we are capable
> of supporting ourselves ;). I do need to have an src.rpm tho so I can build
> and test it on our systems and verify the issue I was seeing is actually
> fixed.
Sounds good. We'll get you something when we're ready. Thanks for agreeing to test :-)
--- Additional comment from Siddhesh Poyarekar on 2015-08-20 00:43:00 EDT ---
Removing the "already fixed in 6.7.z" from the title because it confused me the couple of times I read it.
--- Additional comment from Florian Weimer on 2015-08-24 05:58:22 EDT ---
This issue has been addressed in the following products:
Red Hat Enterprise Linux 6
Via RHBA-2015:1465 https://rhn.redhat.com/errata/RHBA-2015-1465.html
The SRPM is available here: ftp://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/os/SRPMS/glibc-2.12-1.166.el6_7.1.src.rpm
RHEL7 seems to have the same problem as Josef reported here. Attaching a reproducer and a patch that fixes this.
Created attachment 1109015 [details]
reproducer for the problem
Reproducer. Build this and watch the asserts pop.
Created attachment 1109016 [details]
patch - malloc: only do the alloc_perturb if the perturb_byte isn't 0
This patch fixes the problem for me.
Note that we have an official support contract as well and have opened a case for this:
Oh and to be clear, this is seen in glibc-2.17-106.el7_2.1.
...and appears to be a regression from RHEL7.1 (just by inspection, I haven't actually tested it on RHEL7.1 to confirm it though).
*** Bug 1294080 has been marked as a duplicate of this bug. ***
This is indeed a regression between rhel-7.1 and rhel-7.2. We dropped one of the fixes that went into rhel-6.7/rhel-6.8, and we should not have. To be specific, it was this fix:
Author: Ondřej Bílka <email@example.com>
Date: Mon Dec 9 17:25:19 2013 +0100
Simplify perturb_byte logic.
This is the correct way to resolve the problem: it simplifies the perturbation logic to match upstream by moving the perturb_byte == 0 check into one central place.
We consider this a serious issue and are looking into resolving this as quickly as possible. If you have any questions please don't hesitate to ask.
One possible workaround on affected systems may be to set the environment variable MALLOC_PERTURB_ to 255 to nullify the perturbation effects, i.e. 0xff ^ 0xff == 0. This has performance implications, though, since the additional operation and write have a non-zero cost.
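The arithmetic behind the workaround can be checked with a tiny helper. This is only an illustration of the XOR above; alloc_fill_byte is a hypothetical name, not a glibc function.

```c
/* Byte written over a fresh allocation for a given perturb value,
   following the XOR with 0xff described above.  */
static unsigned char alloc_fill_byte (int perturb)
{
  return (unsigned char) ((perturb ^ 0xff) & 0xff);
}
```

A perturb value of 0 gives a fill byte of 0xff, which is the harmful case on affected builds, while 255 gives 0x00, so the unconditional memset writes zeros and the memory still looks untouched.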
Yes, commit e8349efd466c seems like a much better fix since you always want to check the perturb_byte value before applying it.
We're also aware of the workaround and it does work. We decided instead to just build a glibc package for now with the patch above to use internally until RH releases an official fix.
Let us know if you'd like us to test out anything in the meantime.
(In reply to Jeff Layton from comment #15)
> Yes, commit e8349efd466c seems like a much better fix since you always want
> to check the perturb_byte value before applying it.
> We're also aware of the workaround and it does work. We decided instead to
> just build a glibc package for now with the patch above to use internally
> until RH releases an official fix.
> Let us know if you'd like us to test out anything in the meantime.
Thank you for the feedback Jeff.
(In reply to Jeff Layton from comment #2)
> Created attachment 1109015 [details]
> reproducer for the problem
> Reproducer. Build this and watch the asserts pop.
Thanks, I expanded it and committed it upstream:
Actually the reproducer was provided by Idan Kedar <firstname.lastname@example.org>...I cc'ed him here (I should have credited him when I uploaded it originally, my bad...)
We're at over a month since I first reported this. Any ETA on when the fixed packages will show up in the repo?
(In reply to Jeff Layton from comment #20)
> We're at over a month since I first reported this. Any ETA on when the fixed
> packages will show up in the repo?
Do you need an immediate fix for this in rhel-7.2.z? Is this a production issue for you? Please talk to GSS if the workaround (set MALLOC_PERTURB_=255) is not working.
We plan to fix this for rhel-7.3.0 and rhel-7.2.z. The general availability of RHEL 7.3 is sometime in the future. The release of a rhel-7.2.z is also sometime in the future. I can't give you any specific dates unfortunately.
All I can say is that we are releasing a rhel-7.2.z update as soon as we possibly can. I apologize for the delay.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.