Bug 1296722 - qemu-kvm crashes with double free or corruption in cephx code
Summary: qemu-kvm crashes with double free or corruption in cephx code
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 1.3.2
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1319075
 
Reported: 2016-01-08 00:57 UTC by Brad Hubbard
Modified: 2019-10-10 10:50 UTC (History)

Fixed In Version: RHEL: ceph-0.94.3-6.el7cp, Ubuntu: ceph_0.94.3.3-2redhat1trusty
Doc Type: Bug Fix
Doc Text:
A race condition sporadically occurred in cephx's interactions with libnss, which could cause Ceph applications (for example, qemu-kvm with librbd) to crash. The cephx NSS code has been refactored, and Ceph applications no longer crash in the described scenario.
Clone Of:
Environment:
Last Closed: 2016-02-08 21:28:57 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Ceph Project Bug Tracker | 6480 | 0 | None | None | None | 2016-01-08 00:58:52 UTC
Red Hat Bugzilla | 1327540 | 0 | high | CLOSED | qemu-kvm crashes with double free or corruption in cephx code after hotfix in bz1296722 | 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) | 2115601 | 0 | None | None | None | 2016-01-08 00:58:32 UTC
Red Hat Product Errata | RHBA-2016:0133 | 0 | normal | SHIPPED_LIVE | ceph bug fix update | 2016-02-09 02:28:36 UTC

Internal Links: 1327540

Description Brad Hubbard 2016-01-08 00:57:09 UTC
Description of problem:

Looks like http://tracker.ceph.com/issues/6480

*** Error in `/usr/libexec/qemu-kvm': invalid fastbin entry (free): 0x00007fe37806cde0 ***

Program terminated with signal 6, Aborted.
#0  0x00007fe52d47c5d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007fe52d47c5d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe52d47dcc8 in __GI_abort () at abort.c:90
#2  0x00007fe52d4bce07 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fe52d5c58c8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fe52d4c41fd in malloc_printerr (ptr=<optimized out>, str=0x7fe52d5c3001 "invalid fastbin entry (free)", action=3) at malloc.c:4972
#4  _int_free (av=0x7fe378000020, p=<optimized out>, have_lock=0) at malloc.c:3804
#5  0x00007fe537a722b0 in PK11_GetBestSlotMultipleWithAttributes (type=type@entry=0x7fe3d20d6168, mechanismInfoFlags=mechanismInfoFlags@entry=0x0, keySize=keySize@entry=0x0, mech_count=mech_count@entry=1, wincx=0x0) at pk11slot.c:2119
#6  0x00007fe537a7233f in PK11_GetBestSlot (type=4229, wincx=<optimized out>) at pk11slot.c:2142
#7  0x00007fe53178b4bc in nss_aes_operation (op=260, secret=..., in=..., out=..., error="") at auth/Crypto.cc:110
#8  0x00007fe53178a220 in CryptoKey::encrypt (this=this@entry=0x7fe3780750e8, cct=cct@entry=0x7fe539a25930, in=..., out=..., error="") at auth/Crypto.cc:358
#9  0x00007fe531782f8f in encode_encrypt_enc_bl<ceph::buffer::list> (error="", out=..., key=..., t=..., cct=0x7fe539a25930) at auth/cephx/CephxProtocol.h:465
#10 encode_encrypt<ceph::buffer::list> (cct=0x7fe539a25930, t=..., key=..., out=..., error="") at auth/cephx/CephxProtocol.h:490
#11 0x00007fe53178241e in CephxSessionHandler::sign_message (this=0x7fe3780750d0, m=0x7fe1900c94f0) at auth/cephx/CephxSessionHandler.cc:48
#12 0x00007fe53171b036 in Pipe::writer (this=0x7fe1900673d0) at msg/simple/Pipe.cc:1812
#13 0x00007fe5317273fd in Pipe::Writer::entry (this=<optimized out>) at msg/simple/Pipe.h:62
#14 0x00007fe536f95df5 in start_thread (arg=0x7fe3d20d7700) at pthread_create.c:308
#15 0x00007fe52d53d1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Version-Release number of selected component (if applicable):
ceph-common-0.94.1-13.el7cp.x86_64

How reproducible:
Intermittent. Under OpenStack it seems to recur in the same instances rather than affecting all instances.

Additional info:

This looks like a race and may indicate a threading issue with libnss, but that has not yet been positively identified. https://github.com/ceph/ceph/commit/973cd1c00a7811e95ff0406a90386f6ead5491c4 is an optimization for Infernalis that should stop this issue from being seen; backporting it may be a solution.

Comment 12 Ken Dreyer (Red Hat) 2016-01-15 19:13:15 UTC
For the record, the patches Josh cherry-picked for this issue are:

auth: return error code from encrypt/decrypt; make error string optional
auth: optimize crypto++ key context
auth/Crypto: optimize libnss key
auth: refactor crypto key context
auth/cephx: optimize signature check
auth/cephx: move signature calc into helper
auth/Crypto: avoid memcpy on libnss crypto operation
auth: make CryptoHandler implementations totally private

which are part of https://github.com/ceph/ceph/pull/3896/commits

Let's file an upstream ticket to ensure these get backported to Hammer upstream as well.
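
For readers unfamiliar with the pattern, the patch titles above ("optimize libnss key", "refactor crypto key context") and the backtrace, where PK11_GetBestSlot is reached from nss_aes_operation inside CryptoKey::encrypt, suggest that key setup was moved out of each encrypt/decrypt call and into a per-key context that is set up once and reused. The following is only a minimal sketch of that general pattern, not Ceph's code: KeyContext, setup_key_context, encrypt_per_call, and CachedKeyHandler are hypothetical names, the XOR transform is a toy stand-in for a real cipher, and the counter exists purely to show how often setup runs.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for per-key state that is costly to set up.
// In the crash above, the per-call setup included an NSS slot lookup.
struct KeyContext {
    std::vector<std::uint8_t> key;
};

// Counts how often the setup path runs. Illustration only.
static std::atomic<int> g_setup_calls{0};

// Hypothetical setup step: builds the context from the raw secret.
KeyContext setup_key_context(const std::string& secret) {
    if (secret.empty())
        throw std::invalid_argument("secret must not be empty");
    ++g_setup_calls;
    return KeyContext{std::vector<std::uint8_t>(secret.begin(), secret.end())};
}

// Toy XOR transform standing in for a real cipher. NOT secure.
std::string xor_transform(const KeyContext& ctx, const std::string& in) {
    std::string out = in;
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(out[i] ^ ctx.key[i % ctx.key.size()]);
    return out;
}

// "Before" shape: every operation repeats the setup step.
std::string encrypt_per_call(const std::string& secret, const std::string& in) {
    KeyContext ctx = setup_key_context(secret);
    return xor_transform(ctx, in);
}

// "After" shape: setup runs once at construction; the context is
// read-only afterwards and reused by every operation.
class CachedKeyHandler {
public:
    explicit CachedKeyHandler(const std::string& secret)
        : ctx_(setup_key_context(secret)) {}

    std::string encrypt(const std::string& in) const {
        return xor_transform(ctx_, in);
    }

private:
    const KeyContext ctx_;
};
```

With this shape, a long-lived handler runs the setup path once per key rather than once per message, so a hot path such as message signing spends far less time in the shared setup code. Whether that removes the underlying race or only makes it less likely to be hit is not established by this sketch.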

Comment 15 Brad Hubbard 2016-01-15 22:32:16 UTC
(In reply to Ken Dreyer (Red Hat) from comment #12)
> Let's file an upstream ticket to ensure these get backported to Hammer
> upstream as well.

http://tracker.ceph.com/issues/6480 attached under "External Trackers"

Comment 22 Ken Dreyer (Red Hat) 2016-01-19 15:46:14 UTC
Ubuntu build with this patch is ceph_0.94.3.3-1redhat1trusty

Comment 24 Ken Dreyer (Red Hat) 2016-01-19 19:28:14 UTC
(In reply to Ken Dreyer (Red Hat) from comment #22)
> Ubuntu build with this patch is ceph_0.94.3.3-1redhat1trusty

I had to bump the version number, so it's ceph_0.94.3.3-2redhat1trusty

Comment 28 Tanay Ganguly 2016-02-02 06:41:52 UTC
Marking this bug as Verified, as this was tested as part of the 1.3.1 Async Release.

Comment 30 errata-xmlrpc 2016-02-08 21:28:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0133

Comment 31 Vikhyat Umrao 2016-02-09 04:11:32 UTC
I have checked the errata, and the issue is fixed in version ceph-0.94.3-6.el7cp.

