1247752 – Log::reopen_log_file() must take the flusher lock to avoid closing an fd ::_flush() is still using

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	1.2.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	1.2.4
Assignee:	Ken Dreyer (Red Hat)
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-07-28 18:47 UTC by Ken Dreyer (Red Hat)
Modified:	2017-07-31 14:15 UTC
CC List:	9 users
Fixed In Version:	ceph-0.80.8-16.el6cp ceph-0.80.8-16.el7cp
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1246694
Environment:
Last Closed:	2015-07-31 12:54:17 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	12465	None	None	None	Never
Red Hat Knowledge Base (Solution)	1551063	None	None	None	Never
Red Hat Product Errata	RHBA-2015:1527	normal	SHIPPED_LIVE	Ceph bug fix and enhancement update	2015-07-31 16:54:05 UTC

Comment 1 Ken Dreyer (Red Hat) 2015-07-29 14:04:19 UTC

Upstream's firefly patch is https://github.com/ceph/ceph/pull/5406, so we'll use that.

Comment 4 Tamil 2015-07-30 01:24:20 UTC

Steps to reproduce:
Set debug logging in ceph.conf:
 
debug ms = 20
debug osd = 20
debug filestore = 20
 
 
sudo rbd create image --size 1000000000
 
sudo rbd bench-write image --io-threads 256 --io-size 4096 --io-total 1000000000000 2>&1 >/dev/n
 
While the workload runs in the background, add these settings to /root/.gdbinit:
 
sudo vi /root/.gdbinit
set pagination off
set target-async on
set non-stop on
 
ps -ef | grep ceph-osd   (look for the pid of osd.0)
 
sudo gdb
attach <pid of osd.0>
 
b Log.cc:117
 
c -a
 
Check for SIGHUP and note where it breaks.
 
sudo ceph osd dump
 
sudo ceph pg dump
 
sudo ceph pg scrub <pg.id>
 
Watch for objects becoming corrupt.
 
 
[ubuntu@magna016 ~]$ sudo ceph -s
    cluster 8c89dca4-2ad2-46f9-b38f-d8450a2c6e0a
     health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; recovery 3920/34408 objects degraded (11
     monmap e1: 1 mons at {magna016=10.8.128.16:6789/0}, election epoch 2, quorum 0 magna016
     osdmap e37: 3 osds: 2 up, 2 in
      pgmap v1723: 193 pgs, 4 pools, 60233 MB data, 15733 objects
            106 GB used, 1745 GB / 1852 GB avail
            3920/34408 objects degraded (11.393%)
                 192 active+degraded
                   1 active+clean
 
  client io 2336 kB/s wr, 1168 op/s

Comment 5 Samuel Just 2015-07-30 02:16:35 UTC

Degraded objects aren't what you are looking for.  What happened here is the osd died.  That might actually have been due to the bug causing corruption in something the osd then read back, or it might just be that the thread stopped by the gdb session eventually caused a timeout to fail and kill the osd.  You'll have to try it again and keep the osd log.

Comment 6 Ken Dreyer (Red Hat) 2015-07-30 04:00:28 UTC

For non-RHEL, the fix will be in the Ceph v0.80.8.4 packages.

Comment 7 Tamil 2015-07-31 00:45:23 UTC

works fine.

Comment 9 errata-xmlrpc 2015-07-31 12:54:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1527
