Bug 1326323

Summary: [RBD] rbd crashes if deep-flatten is done while writing on Clone
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tejas <tchandra>
Component: RBD
Assignee: Jason Dillaman <jdillama>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 2.0
CC: ceph-eng-bugs, hyelloji, jdillama, kdreyer, nlevine, tchandra, tganguly, vakulkar
Target Milestone: rc
Target Release: 2.0
Hardware: Unspecified
OS: Linux
Fixed In Version: ceph-10.2.0-1.el7cp
Doc Type: Bug Fix
Last Closed: 2016-08-23 19:35:55 UTC
Type: Bug

Description Tejas 2016-04-12 12:22:36 UTC
Description of problem:
When a flatten is performed on a clone while bench-write is running against it, the rbd client crashes with a segmentation fault.

Version-Release number of selected component (if applicable):
Ceph 10.1.1

How reproducible:
Always

Steps to Reproduce:
1. Create an rbd image, write some data to it, create a snapshot, and protect the snapshot.
2. Create a clone, and create a snapshot on the clone.
3. Start bench-write on the clone, then start a flatten; the rbd client crashes (see the consolidated sketch below).
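
A consolidated sketch of the reproducer (image names follow the transcript in Additional info; sizes are illustrative):

rbd create Tejas/b1 --size 20G --image-feature layering,deep-flatten,object-map,fast-diff,exclusive-lock
rbd bench-write Tejas/b1 --io-total 128M    # write some data to the parent
rbd snap create Tejas/b1@s1
rbd snap protect Tejas/b1@s1
rbd clone Tejas/b1@s1 Tejas/c1 --image-feature layering,deep-flatten,object-map,fast-diff,exclusive-lock
rbd snap create Tejas/c1@s2
rbd bench-write Tejas/c1 &                  # start I/O on the clone
rbd flatten Tejas/c1                        # start the flatten while bench-write is running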

Actual results:
The rbd client crashes with a segmentation fault (backtrace below).

Expected results:
Even if this is not a supported scenario, it should be handled gracefully rather than crash.

Additional info:

[root@magna080 ~]# rbd create Tejas/b1 --size 20G --image-feature layering,deep-flatten,object-map,fast-diff,exclusive-lock
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# rbd ls -l Tejas
NAME       SIZE PARENT         FMT PROT LOCK 
b1       20480M                  2           
cln      51200M Tejas/imgio@s1   2           
imgio    51200M                  2           
imgio@s1 51200M                  2 yes       
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# rbd snap create Tejas/b1@s1
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# rbd snap protect Tejas/b1@s1
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# rbd clone Tejas/b1@s1 Tejas/c1 --image-feature layering,deep-flatten,fast-diff,object-map,exclusive-lock
[root@magna080 ~]# 

[root@magna080 ~]# rbd snap create Tejas/c1@s2
[root@magna080 ~]# 
[root@magna080 ~]# 
[root@magna080 ~]# rbd ls -l Tejas
NAME       SIZE PARENT         FMT PROT LOCK 
b1       20480M                  2           
b1@s1    20480M                  2 yes       
c1       20480M Tejas/b1@s1      2           
c1@s2    20480M Tejas/b1@s1      2           
cln      51200M Tejas/imgio@s1   2           
imgio    51200M                  2           
imgio@s1 51200M                  2 yes       
[root@magna080 ~]#
[root@magna080 ~]# rbd bench-write Tejas/c1
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     17760  15950.51  65333292.38
    2     31429  15629.65  64019054.23
    3     42086  13979.88  57261593.23
    4     50589  12467.14  51065405.08
    5     59024  11738.08  48079179.17
    6     69654  10583.35  43349403.61
    7     85448  10764.23  44090296.24
    8     98077  11037.82  45210896.94
    9    109663  11858.98  48574396.06
   10    114938  10180.47  41699187.63
   11    121117  10261.95  42032936.55
   12    129627   8791.30  36009168.47
   13    137002   7882.85  32288140.16
   14    152782   8686.71  35580762.05
   15    167413  11514.71  47164251.05
   16    176866  11118.85  45542827.88
   17    184264  11041.19  45224716.02
   18    187511   9959.37  40793562.29
   19    190675   7153.20  29299490.49
   20    192821   5050.29  20685992.49
   21    194953   3622.35  14837153.76
   22    201266   3248.50  13305841.29
   23    205507   3650.07  14950689.25
   24    211847   4318.73  17689500.71
   25    216092   4496.31  18416876.09
   26    219299   4685.22  19190663.60
   27    223551   4584.67  18778805.12
   28    229907   4844.05  19841223.47
   29    237270   5132.72  21023608.74
   30    240447   5161.99  21143492.33
   31    243645   4935.39  20215367.20
   32    247864   4609.86  18882000.58
   33    251023   4215.27  17265759.52
   34    254227   3459.52  14170186.42
   35    259476   3657.96  14982999.34
elapsed:    36  ops:   262144  ops/sec:  7136.75  bytes/sec: 29232112.21
*** Caught signal (Segmentation fault) **
 in thread 7f5ee2bca700 thread_name:fn_anonymous
 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x1d8dea) [0x7f5f02302dea]
 2: (()+0xf100) [0x7f5eee6ff100]
 3: (()+0x221a04) [0x7f5ef880aa04]
 4: (()+0x91b07) [0x7f5ef867ab07]
 5: (()+0x82554) [0x7f5ef866b554]
 6: (()+0x82a85) [0x7f5ef866ba85]
 7: (()+0x134049) [0x7f5ef871d049]
 8: (()+0x87516) [0x7f5ef8670516]
 9: (()+0x8765b) [0x7f5ef867065b]
 10: (()+0x71f29) [0x7f5ef865af29]
 11: (()+0x815d7) [0x7f5ef866a5d7]
 12: (()+0x89979) [0x7f5ef8672979]
 13: (()+0x8bf0a) [0x7f5ef8674f0a]
 14: (()+0x9d00d) [0x7f5eeef0d00d]
 15: (()+0x85529) [0x7f5eeeef5529]
 16: (()+0x16eb46) [0x7f5eeefdeb46]
 17: (()+0x7dc5) [0x7f5eee6f7dc5]
 18: (clone()+0x6d) [0x7f5eec80d28d]
2016-04-12 10:35:20.101837 7f5ee2bca700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5ee2bca700 thread_name:fn_anonymous

 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x1d8dea) [0x7f5f02302dea]
 2: (()+0xf100) [0x7f5eee6ff100]
 3: (()+0x221a04) [0x7f5ef880aa04]
 4: (()+0x91b07) [0x7f5ef867ab07]
 5: (()+0x82554) [0x7f5ef866b554]
 6: (()+0x82a85) [0x7f5ef866ba85]
 7: (()+0x134049) [0x7f5ef871d049]
 8: (()+0x87516) [0x7f5ef8670516]
 9: (()+0x8765b) [0x7f5ef867065b]
 10: (()+0x71f29) [0x7f5ef865af29]
 11: (()+0x815d7) [0x7f5ef866a5d7]
 12: (()+0x89979) [0x7f5ef8672979]
 13: (()+0x8bf0a) [0x7f5ef8674f0a]
 14: (()+0x9d00d) [0x7f5eeef0d00d]
 15: (()+0x85529) [0x7f5eeeef5529]
 16: (()+0x16eb46) [0x7f5eeefdeb46]
 17: (()+0x7dc5) [0x7f5eee6f7dc5]
 18: (clone()+0x6d) [0x7f5eec80d28d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -14> 2016-04-12 10:34:43.260938 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command perfcounters_dump hook 0x7f5f0bdcbb90
   -13> 2016-04-12 10:34:43.260949 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command 1 hook 0x7f5f0bdcbb90
   -12> 2016-04-12 10:34:43.260954 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command perf dump hook 0x7f5f0bdcbb90
   -11> 2016-04-12 10:34:43.260956 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command perfcounters_schema hook 0x7f5f0bdcbb90
   -10> 2016-04-12 10:34:43.260959 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command 2 hook 0x7f5f0bdcbb90
    -9> 2016-04-12 10:34:43.260961 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command perf schema hook 0x7f5f0bdcbb90
    -8> 2016-04-12 10:34:43.260965 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command perf reset hook 0x7f5f0bdcbb90
    -7> 2016-04-12 10:34:43.260981 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command config show hook 0x7f5f0bdcbb90
    -6> 2016-04-12 10:34:43.260984 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command config set hook 0x7f5f0bdcbb90
    -5> 2016-04-12 10:34:43.260990 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command config get hook 0x7f5f0bdcbb90
    -4> 2016-04-12 10:34:43.260994 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command config diff hook 0x7f5f0bdcbb90
    -3> 2016-04-12 10:34:43.260996 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command log flush hook 0x7f5f0bdcbb90
    -2> 2016-04-12 10:34:43.261000 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command log dump hook 0x7f5f0bdcbb90
    -1> 2016-04-12 10:34:43.261005 7f5f020f0d80  5 asok(0x7f5f0bdc6e00) register_command log reopen hook 0x7f5f0bdcbb90
     0> 2016-04-12 10:35:20.101837 7f5ee2bca700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5ee2bca700 thread_name:fn_anonymous

 ceph version 10.1.1-1.el7cp (61adb020219fbad4508050b5f0a792246ba74dae)
 1: (()+0x1d8dea) [0x7f5f02302dea]
 2: (()+0xf100) [0x7f5eee6ff100]
 3: (()+0x221a04) [0x7f5ef880aa04]
 4: (()+0x91b07) [0x7f5ef867ab07]
 5: (()+0x82554) [0x7f5ef866b554]
 6: (()+0x82a85) [0x7f5ef866ba85]
 7: (()+0x134049) [0x7f5ef871d049]
 8: (()+0x87516) [0x7f5ef8670516]
 9: (()+0x8765b) [0x7f5ef867065b]
 10: (()+0x71f29) [0x7f5ef865af29]
 11: (()+0x815d7) [0x7f5ef866a5d7]
 12: (()+0x89979) [0x7f5ef8672979]
 13: (()+0x8bf0a) [0x7f5ef8674f0a]
 14: (()+0x9d00d) [0x7f5eeef0d00d]
 15: (()+0x85529) [0x7f5eeeef5529]
 16: (()+0x16eb46) [0x7f5eeefdeb46]
 17: (()+0x7dc5) [0x7f5eee6f7dc5]
 18: (clone()+0x6d) [0x7f5eec80d28d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent       500
  max_new         1000
  log_file /var/log/rbd-clients/qemu-guest-16083.log
--- end dump of recent events ---
Segmentation fault (core dumped)
[root@magna080 ~]#

Comment 2 Jason Dillaman 2016-04-12 17:53:21 UTC
Please attach the core dump or provide backtraces with symbols.  I haven't been able to recreate the segfault, but I was able to create a different issue when running against a starved OSD.
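
(For reference, one typical way to obtain symbolized backtraces from the core on RHEL 7 -- package names and the core path are illustrative:)

debuginfo-install -y ceph-common librbd1 librados2    # assumes debuginfo repos are enabled
gdb -batch -ex 'thread apply all bt' /usr/bin/rbd /path/to/core.<pid>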

Comment 3 Jason Dillaman 2016-04-13 01:13:13 UTC
Cancel that ... I believe I recreated the crash you witnessed. It occurs when "rbd bench-write" completes while the flatten is still in progress.
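
(Illustrative timing sketch -- size the bench-write so it finishes while the flatten is still copying parent data; sizes are hypothetical:)

rbd bench-write Tejas/c1 --io-total 256M &   # short run that completes first
rbd flatten Tejas/c1                         # on 10.1.1 the bench-write process segfaults as it completes
wait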

Comment 4 Jason Dillaman 2016-04-13 01:15:57 UTC
Upstream PR: https://github.com/ceph/ceph/pull/8565

Comment 5 Jason Dillaman 2016-04-13 11:31:44 UTC
*** Bug 1326650 has been marked as a duplicate of this bug. ***

Comment 6 Ken Dreyer (Red Hat) 2016-04-26 20:51:17 UTC
The above PR is present in v10.2.0.
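
(To confirm the fixed build is installed, e.g.:)

rpm -q ceph-common    # expect ceph-common-10.2.0-1.el7cp or later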

Comment 8 Hemanth Kumar 2016-05-31 10:58:35 UTC
Unable to reproduce the crash. Moving to Verified state.

Comment 10 errata-xmlrpc 2016-08-23 19:35:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html