
Bug 1358697

Summary: common: thread_pool worker queue crash in void *item = wq->_void_dequeue();
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RBD
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 1.3.2
Reporter: Vikhyat Umrao <vumrao>
Assignee: Jason Dillaman <jdillama>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: aathomas, ceph-eng-bugs, flucifre, jbiao, jdillama, kdreyer, nlevine, pablo.iranzo, pneedle, tchandra, tserlin, vakulkar
Status: CLOSED ERRATA
Fixed In Version: RHEL: ceph-0.94.5-15.el7cp; Ubuntu: ceph_0.94.5-9redhat1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-09-01 21:36:02 UTC

Description Vikhyat Umrao 2016-07-21 09:49:13 UTC
Description of problem:
common: thread_pool worker queue crash in void *item = wq->_void_dequeue();

Version-Release number of selected component (if applicable):

Red Hat Ceph Storage 1.3.2

ceph-common-0.94.5-14.el7cp.x86_64
librbd1-0.94.5-14.el7cp.x86_64
python-rbd-0.94.5-14.el7cp.x86_64
librados2-0.94.5-14.el7cp.x86_64
python-rados-0.94.5-14.el7cp.x86_64


Core was generated by `/usr/bin/python2 /usr/bin/nova-compute --config-file /etc/nova/nova.conf --conf'.
Program terminated with signal 11, Segmentation fault.
#0  ThreadPool::worker (this=0x726e5a0, wt=0x2d9a4f0) at common/WorkQueue.cc:120
120		void *item = wq->_void_dequeue();

(gdb) bt
#0  ThreadPool::worker (this=0x726e5a0, wt=0x2d9a4f0) at common/WorkQueue.cc:120
#1  0x00007f1e59b614d0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:318
#2  0x00007f1ef0abedc5 in start_thread (arg=0x7f1e599fa700) at pthread_create.c:308
#3  0x00007f1ef00e31cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) l
115	      while (tries--) {
116		last_work_queue++;
117		last_work_queue %= work_queues.size();
118		wq = work_queues[last_work_queue];
119		
120		void *item = wq->_void_dequeue();
121		if (item) {
122		  processing++;
123		  ldout(cct,12) << "worker wq " << wq->name << " start processing " << item
124				<< " (" << processing << " active)" << dendl;

(gdb) p wq
$2 = (ThreadPool::WorkQueue_ *) 0x0

It looks like 'wq' was NULL, and dereferencing that NULL pointer at line 120 caused the segfault:

120		void *item = wq->_void_dequeue();
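The crash is consistent with the worker's scan of work_queues racing against an unsynchronized add/remove of a queue. A minimal sketch of the safe pattern follows; the Pool/WQ names are illustrative stand-ins, not the real Ceph classes. When both the worker pass and the add/remove mutations hold the same mutex, the worker can never observe a stale or NULL slot, which is the interleaving that produced the NULL 'wq' above.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct WQ {
  std::deque<int> items;
};

struct Pool {
  std::mutex lock;                        // analogous to the pool's _lock
  std::vector<WQ*> work_queues;
  size_t last_work_queue = 0;

  void add_work_queue(WQ *wq) {
    std::lock_guard<std::mutex> g(lock);  // mutation under the pool lock
    work_queues.push_back(wq);
  }
  void remove_work_queue(WQ *wq) {
    std::lock_guard<std::mutex> g(lock);
    for (auto it = work_queues.begin(); it != work_queues.end(); ++it)
      if (*it == wq) { work_queues.erase(it); break; }
  }

  // One locked pass over the loop at WorkQueue.cc lines 115-124.
  void worker_pass() {
    std::lock_guard<std::mutex> g(lock);
    size_t tries = work_queues.size();
    while (tries--) {
      last_work_queue = (last_work_queue + 1) % work_queues.size();
      WQ *wq = work_queues[last_work_queue];
      assert(wq != nullptr);              // never fires when mutations are locked
      if (!wq->items.empty())
        wq->items.pop_front();
    }
  }
};

// Stress: one worker scanning while another thread churns queues.
int stress() {
  Pool pool;
  WQ stable;                              // one queue that stays registered
  pool.add_work_queue(&stable);
  std::thread worker([&] {
    for (int i = 0; i < 10000; ++i) pool.worker_pass();
  });
  std::thread churn([&] {                 // queues appearing and disappearing
    for (int i = 0; i < 10000; ++i) {
      WQ tmp;
      pool.add_work_queue(&tmp);
      pool.remove_work_queue(&tmp);
    }
  });
  worker.join();
  churn.join();
  return (int)pool.work_queues.size();    // only 'stable' should remain
}
```

Without the lock in add/remove, the same churn would mutate the vector mid-scan, and the indexed read in worker_pass could return garbage, matching the observed backtrace.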

Comment 1 Vikhyat Umrao 2016-07-21 09:55:36 UTC
- We have found two potential fixes upstream which I think we need to backport to our Red Hat Ceph Storage 1.3.2 (0.94.5-14.el7cp.x86_64).

1. WorkQueue: add/remove_work_queue methods now thread safe
   Main tracker: http://tracker.ceph.com/issues/12662
   Hammer backport tracker: http://tracker.ceph.com/issues/13042
   Hammer backport PR: https://github.com/ceph/ceph/pull/5889

2. WorkQueue: new PointerWQ base class for ContextWQ
  Main tracker: http://tracker.ceph.com/issues/13636
  Hammer backport tracker: http://tracker.ceph.com/issues/13758
  Hammer backport PR: https://github.com/ceph/ceph/pull/6587
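
The second backport's idea can be sketched roughly as follows; the names and signatures here are illustrative (a stand-in Context and a simplified PointerWQ), not the actual Ceph API. A pointer-queue base class hands the pool a type-erased `void *` from `_void_dequeue()`, and processing casts it back to the concrete type, so a ContextWQ just queues callback pointers:

```cpp
#include <cassert>
#include <deque>

struct Context {                      // stand-in for Ceph's Context callback
  bool finished = false;
  void complete() { finished = true; }
};

template <typename T>
class PointerWQ {
public:
  virtual ~PointerWQ() = default;
  void queue(T *item) { m_items.push_back(item); }
  void *_void_dequeue() {             // type-erased hook the pool calls
    if (m_items.empty())
      return nullptr;
    T *item = m_items.front();
    m_items.pop_front();
    return item;
  }
  void _void_process(void *item) {    // cast back and dispatch
    process(static_cast<T *>(item));
  }
protected:
  virtual void process(T *item) = 0;
private:
  std::deque<T *> m_items;
};

class ContextWQ : public PointerWQ<Context> {
protected:
  void process(Context *ctx) override { ctx->complete(); }
};
```

Because dequeue returns exactly the pointer that was enqueued (or nullptr when empty), the pool's generic worker loop needs no knowledge of the item type.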

Comment 3 Jason Dillaman 2016-07-21 13:43:53 UTC
Agree with backport suggestions.

Comment 4 Vikhyat Umrao 2016-07-21 13:51:35 UTC
(In reply to Jason Dillaman from comment #3)
> Agree with backport suggestions.

Thank you Jason for your confirmation.

Comment 24 Tejas 2016-09-01 13:08:10 UTC
Verified this on VMs backed by Ceph storage.
The VMs were started and stopped nearly a thousand times.

Comment 26 errata-xmlrpc 2016-09-01 21:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1805.html

Comment 27 Paul Needle 2017-09-07 16:16:03 UTC
*** Bug 1482112 has been marked as a duplicate of this bug. ***