Description of problem:
On Red Hat Enterprise Linux 7.0+ the Vertica database server process may fail.
1.1 This error appears in the <CATALOG_DIRECTORY>/dbLog file after the failure:
*** Error in `/opt/vertica/bin/vertica': invalid fastbin entry (free): 0x00007ef70f209800 ***
======= Backtrace: =========
0x7f0614f0efe1(/lib64/libc.so.6): + 0x7cfe1
0x2a1e014(/opt/vertica/bin/vertica) CAT::TabColPair_pairToBytes2(void const*, void*, unsigned long)
1.2 The vertica.log file appears as if it was truncated at an arbitrary place, sometimes in the middle of a line.
1.3 In the core file for the failure, the following pattern appears at the top of the stack:
CAT::TabColPair_pairToBytes2(void const*, void*, unsigned long)
2.0 Root cause
It appears that RHEL has not taken the following glibc bug fix:
3.0 How to check you have the affected glibc
3.1 Find your libc.so file.
ldd /opt/vertica/bin/vertica | grep libc.so
libc.so.6 => /lib64/libc.so.6 (0x00007ff6dd99e000)
3.2 Run this command to determine whether the fix has been applied:
## example of buggy libc
objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 | (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
7ca16: 48 85 c9 test %rcx,%rcx
Your libc is likely buggy.
## example of good libc
objdump -r -d /lib/x86_64-linux-gnu/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 | (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
Your libc looks OK.
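As a rough sketch, the check in 3.2 can be wrapped in shell functions so it can be pointed at any libc path. The function names below (libc_path_from_ldd, classify_int_free, check_libc) are made up for illustration and are not part of any Vertica or Red Hat tooling. The filter is the same heuristic as the one-liner above, so it inherits the same limits.

```shell
#!/bin/sh
# Hypothetical helper names; the filter is the one-liner from step 3.2.

# Read `ldd` output on stdin and print the path of the resolved libc.
libc_path_from_ldd() {
    awk '/libc\.so/ { print $3; exit }'
}

# Read `objdump -r -d` output on stdin and classify it with the same
# heuristic as step 3.2: take the third line after the cmpxchg near
# _int_free and report "buggy" if it mentions a register starting with %r.
classify_int_free() {
    grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 |
        grep -A 3 cmpxchg | tail -1 |
        (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
}

# Convenience wrapper: disassemble the given libc and classify it.
check_libc() {
    objdump -r -d "$1" | classify_int_free
}
```

For example: check_libc "$(ldd /opt/vertica/bin/vertica | libc_path_from_ldd)"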
3.3 Complete examples
You can also examine your libc and identify whether the fix has been applied or not. The following example contains the string ‘test %dil,%dil’. This means that the fix has been applied:
objdump -r -d /lib64/libc-2.12.so | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21
32cd8786cb: 40 20 f7 and %sil,%dil
32cd8786ce: 74 0c je 32cd8786dc <_int_free+0xec>
32cd8786d0: 4c 8b 42 08 mov 0x8(%rdx),%r8
32cd8786d4: 41 c1 e8 04 shr $0x4,%r8d
32cd8786d8: 41 83 e8 02 sub $0x2,%r8d
32cd8786dc: 48 89 53 10 mov %rdx,0x10(%rbx)
32cd8786e0: 48 89 d0 mov %rdx,%rax
32cd8786e3: 64 83 3c 25 18 00 00 cmpl $0x0,%fs:0x18
32cd8786ea: 00 00
32cd8786ec: 74 01 je 32cd8786ef <_int_free+0xff>
32cd8786ee: f0 48 0f b1 19 lock cmpxchg %rbx,(%rcx)
32cd8786f3: 48 39 c2 cmp %rax,%rdx
32cd8786f6: 75 c0 jne 32cd8786b8 <_int_free+0xc8>
32cd8786f8: 40 84 ff test %dil,%dil <==** likely good**==
32cd8786fb: 74 09 je 32cd878706 <_int_free+0x116>
32cd8786fd: 41 39 e8 cmp %ebp,%r8d
32cd878700: 0f 85 05 07 00 00 jne 32cd878e0b <_int_free+0x81b>
32cd878706: 48 83 c4 28 add $0x28,%rsp
32cd87870a: 5b pop %rbx
32cd87870b: 5d pop %rbp
32cd87870c: 41 5c pop %r12
The following example does not contain the string ‘test %dil,%dil’. This means the fix has not been applied:
objdump -r -d /lib64/libc-2.17.so | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21
7c9ec: 48 85 c9 test %rcx,%rcx
7c9ef: 74 09 je 7c9fa <_int_free+0xda>
7c9f1: 8b 41 08 mov 0x8(%rcx),%eax
7c9f4: c1 e8 04 shr $0x4,%eax
7c9f7: 8d 70 fe lea -0x2(%rax),%esi
7c9fa: 48 89 4b 10 mov %rcx,0x10(%rbx)
7c9fe: 48 89 c8 mov %rcx,%rax
7ca01: 64 83 3c 25 18 00 00 cmpl $0x0,%fs:0x18
7ca08: 00 00
7ca0a: 74 01 je 7ca0d <_int_free+0xed>
7ca0c: f0 48 0f b1 1a lock cmpxchg %rbx,(%rdx)
7ca11: 48 39 c1 cmp %rax,%rcx
7ca14: 75 ca jne 7c9e0 <_int_free+0xc0>
7ca16: 48 85 c9 test %rcx,%rcx <==**likely buggy**===
7ca19: 74 09 je 7ca24 <_int_free+0x104>
7ca1b: 44 39 e6 cmp %r12d,%esi
7ca1e: 0f 85 84 08 00 00 jne 7d2a8 <_int_free+0x988>
7ca24: 48 83 c4 48 add $0x48,%rsp
7ca28: 5b pop %rbx
7ca29: 5d pop %rbp
7ca2a: 41 5c pop %r12
Version-Release number of selected component (if applicable):
Occurs randomly; not reliably reproducible.
Steps to Reproduce:
It appears that RHEL has *NOT* taken/back-ported the following glibc bug fix into the glibc 2.17 release stream.
Can someone confirm whether RHEL took this glibc fix?
Our observations of the sources and the disassembly suggest that RHEL is missing this crucial fix, which may result in application crashes.
Yes, we know RHEL patched it in the 2.12 stream of glibc on RHEL 6.x,
but that does not mean RHEL patched the 2.17 stream on RHEL 7.x.
Check the glibc sources from the glibc src rpm for 2.17: the patch is not there.
We later discovered that depending on how the bug is fixed,
the “test %rcx,%rcx” is part of a valid fix PROVIDED there are 4 conditional jumps after the “cmpxchg” rather than the 3 as shown in the description.
The *only* reason the “fastbin” error shows up on stdout is the code’s erroneous belief that ABA is an error. Therefore, *all* “fastbin” errors on stdout are caused by this missed patch.
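The refined heuristic above (count the conditional jumps that follow the cmpxchg) could be sketched roughly as below. The function name and the default 8-instruction window are assumptions made for illustration; nothing here says how far past the cmpxchg to look. The filter only considers the first cmpxchg in its input, so feed it just the _int_free region.

```shell
#!/bin/sh
# Hypothetical helper: read a disassembly excerpt on stdin and print how many
# conditional jumps (j* mnemonics other than jmp) appear in the WINDOW lines
# that follow the first cmpxchg. WINDOW defaults to 8, which is an assumption.
count_cond_jumps_after_cmpxchg() {
    awk -v window="${1:-8}" '
        # Lines after the first cmpxchg, up to the window size.
        found && seen < window {
            seen++
            for (i = 1; i <= NF; i++)
                if ($i ~ /^j[a-z]+$/ && $i !~ /^jmp/) { jumps++; break }
        }
        # Remember that the first cmpxchg has been seen.
        !found && /cmpxchg/ { found = 1 }
        END { print jumps + 0 }
    '
}
```

For example: objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | count_cond_jumps_after_cmpxchg
Applied to the libc-2.17 listing in the description, this prints 3.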
Also, here are the specific lines from the sources (from the src.rpm) that show the patch isn't there:
3835 while ((old = catomic_compare_and_exchange_val_rel (fb, p, fd)) != fd);
3837 if (fd != NULL && __builtin_expect (old_idx != idx, 0))
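To repeat this source check on another build, one could grep an unpacked malloc.c for the unpatched condition quoted above. This is only a sketch: the function name is hypothetical, the path to malloc.c is whatever your unpacked and patched source tree uses (for example after rpmbuild -bp), and an exact-string match will miss the line if its formatting differs.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if FILE still contains the
# unpatched fastbin check quoted from the 2.17 src.rpm in this thread.
has_unpatched_fastbin_check() {
    grep -Fq 'if (fd != NULL && __builtin_expect (old_idx != idx, 0))' "$1"
}
```

For example: has_unpatched_fastbin_check path/to/malloc.c && echo "patch appears to be missing"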
This is a regression from RHEL6 and we will be looking into this to make sure RHEL7 doesn't have the same flaw.
Current RHEL 7.x does have this issue, at least up to glibc-2.17-106
(based on observations of the sources and disassembly above).
Operationally, processes do get killed with SIGABRT (sometimes several times a day).
Do you have a time estimate as to when this can be patched and released to the repositories?
(In reply to Sumeet Keswani from comment #8)
> Do you have a time estimate as to when this can be patched and released to
> the repositories?
We confirm that this issue exists. A future update may address it.
Please open a support case if you can. It helps us to prioritize this issue.
thanks - will do
Please confirm your HPE email address.
(In reply to Joseph Kachuck from comment #12)
> Hello Sumeet,
> Please confirm your HPE email address.
> Thank You
> Joe Kachuck
Do you want me to put my email in the BZ?
I just changed the email of my account to the hpe email, is that sufficient?
Although we ran into this on both RHEL 7.0 and RHEL 7.1.
A fix for this is desired on the following streams:
I believe a glibc update should be pushed to repositories on all three streams.
This would be a minor change which should not require an end user to upgrade to a higher version of the kernel/OS.
In order to request this for the Z stream, please provide a client impact statement.
Please confirm what a client would be doing to see the issue.
Please also note that in order for a client to get a 7.1.z update they would need to have an EUS entitlement. Do you know of a client that would be willing to purchase EUS to get this update? If a client already has EUS, would you be able to have them file a support case requesting this update?
Please also note this cannot be requested for Z until it is accepted and verified for the current release.
My apologies, I am not very familiar with internal RHEL processes and business practices.
1. This is a regression. I presume that should mean something.
2. I will ask three customers who have confirmed RHEL entitlements to open support cases referring to this bug and escalate via their internal IT departments.
Hi JoeK, can the customer with EUS subscription request RH to create the 7.0 or 7.1 EUS ZStream they need?
We have put out a request to our customers who are hitting this and have RHEL support to escalate and ask for a patch. There are a few of them.
Hopefully you will get a support request on this.
(In reply to Trinh Dao from comment #22)
> Hi JoeK, can the customer with EUS subscription request RH to create the 7.0
> or 7.1 EUS ZStream they need?
To clarify, there is no EUS stream for 7.0, and no future updates will be released. Please refer to this resource for detailed product life-cycle information:
The link states:
In Red Hat Enterprise Linux 7, EUS is available for the following releases:
•7.1 (ends March 31, 2017)
•7.2 (ends November 30, 2017)
7.0 end date is not listed. What is the end date for 7.0?
A client with EUS can open a support case and request that the support case be connected to this BZ. They would also need to state they need this fix for RHEL 7.1.z.
> This is a regression. I presume that should mean something.
Please confirm the latest RHEL 7.x release in which this worked correctly. I apologize for asking for details, but in order for me to flag this BZ as a regression, it must have worked previously in the same major release.
RHEL 7.0 did not have EUS support. No errata updates would have been released once RHEL 7.1 was released.
> This is a regression. I presume that should mean something.
we saw this exact same thing in RHEL 6.5, which was ultimately fixed
( https://bugzilla.redhat.com/show_bug.cgi?id=1027101 )
Now we ran into it in RHEL 7.x again.
That is, something was broken in 6.5, was fixed, and is now broken again in 7.x.
Hence I referred to it as a regression.
Alen, you set the needinfo? flag on this bug. What kind of information do you need?
Please download and test the packages from:
Please note these are test packages only.
Please note they may be removed in 3 days.
Thanks, i got them.
We will try to test. This is a random failure, so will take a bit to know for sure. But if our RHEL 6.5 experience is any indication, this is just the fix we need(ed).
Any update on if the test package corrected this issue?
It's a race condition, so we don't have proof positive with your rpm specifically just yet.
We have built our own glibc rpm with the patch from sourceware and that has been working without any issues for a while - if that is any indication.
(In reply to Sumeet Keswani from comment #39)
> It's a race condition, so we don't have proof positive with your rpm
> specifically just yet.
> We have built our own glibc rpm with the patch from sourceware and that has
> been working without any issues for a while - if that is any indication.
Thanks. That does help indicate that the patch is a step in the right direction. I agree that concurrency defects are difficult to validate.
Please confirm whether you were able to verify that the test package corrected this issue.
We need to run for a bit longer before we will know with a reasonable degree of certainty. We have also asked a customer who hits this often to try the rpms.
I will update you with the outcome in a week or so.
glibc patch/rpm looks stable and good for weeks without issues.
I feel this is good.
(I would like to test for a few more weeks, but I think I am ready to declare success here, since we could go on forever.)
My customer is requesting the i686 version of this glibc patch.
Do you know how to obtain the packages?
Thanks & Regards
RedHat Global Support
A 7.2.z bug has already been requested: bug 1313308.
I have sent the i686 version of this glibc patch to the customer.
Thanks for re-generating it.
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see firstname.lastname@example.org with any questions
No crashes since we started using the patched glibc.
AFAIK, this fixes it.
Now the only thing left is to get it out there proactively for all relevant RHEL streams.
Hi Joe, will this patch fix be in RHEL7.3 Alpha?
(In reply to Trinh Dao from comment #73)
> Hi Joe, will this patch fix be in RHEL7.3 Alpha?
Yes, we have already addressed the issue in 7.2.z (via RHBA-2016:1030-1), and 7.3 will inherit the fix.
Hi, I'm an HPE customer and an AWS customer using Red Hat.
How can I get this package (glibc-2.17-121.el7)?
From HPE, from AWS, or directly from Red Hat?
This issue was addressed for Red Hat Enterprise Linux 7.2 in this erratum:
If you need assistance in obtaining it and have an active Red Hat Enterprise Linux subscription, please file a support request at:
In the fixed in field for this bug I see:
In the errata I see:
This is what led to my confusion. Should we update the fixed in field of this BZ with the errata's build number or is the -121 the correct build number for this fix?
(In reply to Ben Turner from comment #85)
> In the fixed in field for this bug I see:
> In the errata I see:
> This is what led to my confusion. Should we update the fixed in field of
> this BZ with the errata's build number or is the -121 the correct build
> number for this fix?
The fix for Red Hat Enterprise Linux 7.2.z is tracked in bug 1313308, but we also need to fix it in Red Hat Enterprise Linux 7.3 because 7.3 branched from 7.2 after the fix went into 7.2.z.
*** Bug 1371228 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.