Bug 1296722 - qemu-kvm crashes with double free or corruption in cephx code
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Hardware: x86_64 Linux
Priority: urgent Severity: urgent
Target Milestone: rc
Target Release: 1.3.2
Assigned To: Josh Durgin
Depends On:
Blocks: 1319075
Reported: 2016-01-07 19:57 EST by Brad Hubbard
Modified: 2017-07-30 11:17 EDT
CC List: 10 users

See Also:
Fixed In Version: RHEL: ceph-0.94.3-6.el7cp, Ubuntu: ceph_0.94.3.3-2redhat1trusty
Doc Type: Bug Fix
Doc Text:
A race condition occurred sporadically in cephx's interactions with libnss, which could cause Ceph applications (for example, qemu-kvm with librbd) to crash. The cephx NSS code has been refactored, and Ceph no longer crashes in the described scenario.
Last Closed: 2016-02-08 16:28:57 EST
Type: Bug

External Trackers
Tracker                            ID       Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  2115601  None      None    None     2016-01-07 19:58 EST
Ceph Project Bug Tracker           6480     None      None    None     2016-01-07 19:58 EST

Description Brad Hubbard 2016-01-07 19:57:09 EST
Description of problem:

This looks like the same issue as http://tracker.ceph.com/issues/6480

*** Error in `/usr/libexec/qemu-kvm': invalid fastbin entry (free): 0x00007fe37806cde0 ***

Program terminated with signal 6, Aborted.
#0  0x00007fe52d47c5d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007fe52d47c5d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe52d47dcc8 in __GI_abort () at abort.c:90
#2  0x00007fe52d4bce07 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fe52d5c58c8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fe52d4c41fd in malloc_printerr (ptr=<optimized out>, str=0x7fe52d5c3001 "invalid fastbin entry (free)", action=3) at malloc.c:4972
#4  _int_free (av=0x7fe378000020, p=<optimized out>, have_lock=0) at malloc.c:3804
#5  0x00007fe537a722b0 in PK11_GetBestSlotMultipleWithAttributes (type=type@entry=0x7fe3d20d6168, mechanismInfoFlags=mechanismInfoFlags@entry=0x0, keySize=keySize@entry=0x0, mech_count=mech_count@entry=1, wincx=0x0) at pk11slot.c:2119
#6  0x00007fe537a7233f in PK11_GetBestSlot (type=4229, wincx=<optimized out>) at pk11slot.c:2142
#7  0x00007fe53178b4bc in nss_aes_operation (op=260, secret=..., in=..., out=..., error="") at auth/Crypto.cc:110
#8  0x00007fe53178a220 in CryptoKey::encrypt (this=this@entry=0x7fe3780750e8, cct=cct@entry=0x7fe539a25930, in=..., out=..., error="") at auth/Crypto.cc:358
#9  0x00007fe531782f8f in encode_encrypt_enc_bl<ceph::buffer::list> (error="", out=..., key=..., t=..., cct=0x7fe539a25930) at auth/cephx/CephxProtocol.h:465
#10 encode_encrypt<ceph::buffer::list> (cct=0x7fe539a25930, t=..., key=..., out=..., error="") at auth/cephx/CephxProtocol.h:490
#11 0x00007fe53178241e in CephxSessionHandler::sign_message (this=0x7fe3780750d0, m=0x7fe1900c94f0) at auth/cephx/CephxSessionHandler.cc:48
#12 0x00007fe53171b036 in Pipe::writer (this=0x7fe1900673d0) at msg/simple/Pipe.cc:1812
#13 0x00007fe5317273fd in Pipe::Writer::entry (this=<optimized out>) at msg/simple/Pipe.h:62
#14 0x00007fe536f95df5 in start_thread (arg=0x7fe3d20d7700) at pthread_create.c:308
#15 0x00007fe52d53d1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Version-Release number of selected component (if applicable):

How reproducible:
Intermittent, but it seems to recur in the same instances under OpenStack rather than affecting all instances.

Additional info:

This looks like a race and may indicate a threading issue with libnss, but that has not yet been positively identified. https://github.com/ceph/ceph/commit/973cd1c00a7811e95ff0406a90386f6ead5491c4 is an optimization for Infernalis that should stop this issue from being seen; backporting it may be a solution.
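
As a rough illustration of why such an optimization could stop the crash from being seen: in the backtrace, nss_aes_operation calls PK11_GetBestSlot on every encrypt. The sketch below assumes the optimization amounts to doing that lookup once per key instead of once per operation; that is an assumption for illustration, not a description of the actual commit. FakeCryptoBackend, encrypt_per_call, KeyHandler, and xor_with are hypothetical names, no NSS calls are made, and the XOR transform is only a placeholder, not real cryptography.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for an expensive lookup that touches shared library
// state (like the slot selection seen in the backtrace). It only counts calls.
struct FakeCryptoBackend {
  int slot_lookups = 0;
  void acquire_slot() { ++slot_lookups; }
};

// Placeholder transform (NOT real encryption): XOR each byte with the secret.
inline std::string xor_with(const std::string& secret, const std::string& in) {
  if (secret.empty()) return in;
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out.push_back(static_cast<char>(in[i] ^ secret[i % secret.size()]));
  }
  return out;
}

// Per-operation pattern: every call repeats the lookup, so every signed
// message passes through the shared lookup path.
inline std::string encrypt_per_call(FakeCryptoBackend& backend,
                                    const std::string& secret,
                                    const std::string& in) {
  backend.acquire_slot();
  return xor_with(secret, in);
}

// Cached pattern (assumed shape of the optimization): the lookup happens once
// when the key handler is built; later operations never repeat it.
class KeyHandler {
 public:
  KeyHandler(FakeCryptoBackend& backend, std::string secret)
      : secret_(std::move(secret)) {
    backend.acquire_slot();
  }
  std::string encrypt(const std::string& in) const {
    return xor_with(secret_, in);
  }

 private:
  std::string secret_;
};
```

With this shape, signing N messages triggers one lookup instead of N. That narrows the window in which concurrent writer threads can collide in the lookup path, but it would not by itself prove that the underlying race is gone.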
Comment 12 Ken Dreyer (Red Hat) 2016-01-15 14:13:15 EST
For the record, the patches Josh cherry-picked for this issue are:

auth: return error code from encrypt/decrypt; make error string optional
auth: optimize crypto++ key context
auth/Crypto: optimize libnss key
auth: refactor crypto key context
auth/cephx: optimize signature check
auth/cephx: move signature calc into helper
auth/Crypto: avoid memcpy on libnss crypto operation
auth: make CryptoHandler implementations totally private

which are part of https://github.com/ceph/ceph/pull/3896/commits

Let's file an upstream ticket to ensure these get backported to Hammer upstream as well.
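
For readers unfamiliar with the pattern named in the first patch title above ("return error code from encrypt/decrypt; make error string optional"), here is a minimal C++ sketch of what such an interface can look like. The function toy_encrypt, its signature, and the XOR transform are hypothetical and are not taken from the Ceph source; the point is only that the return value carries the status, and the human-readable message is filled in when the caller asks for it.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical example: returns 0 on success or a negative errno-style code.
// The error string is optional; pass nullptr (the default) to skip it.
inline int toy_encrypt(const std::string& secret, const std::string& in,
                       std::string& out, std::string* error = nullptr) {
  if (secret.empty()) {
    if (error) *error = "empty secret";
    return -EINVAL;
  }
  out.clear();
  out.reserve(in.size());
  // Placeholder transform (NOT real encryption).
  for (std::size_t i = 0; i < in.size(); ++i) {
    out.push_back(static_cast<char>(in[i] ^ secret[i % secret.size()]));
  }
  return 0;
}
```

A caller on a hot path, such as message signing, can check the integer result and avoid constructing an error string for every operation.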
Comment 15 Brad Hubbard 2016-01-15 17:32:16 EST
(In reply to Ken Dreyer (Red Hat) from comment #12)
> Let's file an upstream ticket to ensure these get backported to Hammer
> upstream as well.

http://tracker.ceph.com/issues/6480 attached under "External Trackers"
Comment 22 Ken Dreyer (Red Hat) 2016-01-19 10:46:14 EST
Ubuntu build with this patch is ceph_0.94.3.3-1redhat1trusty
Comment 24 Ken Dreyer (Red Hat) 2016-01-19 14:28:14 EST
(In reply to Ken Dreyer (Red Hat) from comment #22)
> Ubuntu build with this patch is ceph_0.94.3.3-1redhat1trusty

I had to bump the version number, so it's ceph_0.94.3.3-2redhat1trusty
Comment 28 Tanay Ganguly 2016-02-02 01:41:52 EST
Marking this bug as Verified, as it was tested as part of the 1.3.1 Async Release.
Comment 30 errata-xmlrpc 2016-02-08 16:28:57 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

Comment 31 Vikhyat Umrao 2016-02-08 23:11:32 EST
I have checked the errata, and the issue is fixed in version ceph-0.94.3-6.el7cp.
