Bug 1381463

Summary: RADOS bench crashes while doing sequential read operations
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: RADOS
Assignee: Loic Dachary <ldachary>
Status: CLOSED ERRATA
QA Contact: Vidushi Mishra <vimishra>
Severity: medium
Priority: unspecified
Version: 1.3.3
CC: ceph-eng-bugs, dzafman, hnallurv, jdurgin, kchai, kdreyer
Target Milestone: rc   
Target Release: 2.3   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: RHEL: ceph-10.2.7-2.el7cp Ubuntu: ceph_10.2.7-3redhat1xenial
Last Closed: 2017-06-19 13:27:09 UTC
Type: Bug
Attachments:
out.txt (flags: none)

Description Vasishta 2016-10-04 07:51:30 UTC
Created attachment 1207098: out.txt

Description of problem:
The rados bench sequential read crashes when it is run against a pool on which the rados bench write command was executed from a different node.

Version-Release number of selected component (if applicable):
ceph version 0.94.9-3.el7cp

How reproducible:
Always

Steps to Reproduce:
1. Create a pool from node 1.

2. Run rados bench write test on the newly created pool from node 1:
rados bench -p <pool_name> 10 write --no-cleanup

3. From a different node (node 2), run a sequential read test on the same pool:
rados bench -p <pool_name> 10 seq

Actual results: (see the attachment out.txt for the entire log)

+known_if_redirected e249) v5 -- ?+0 0x35f7090 con 0x35ca0d0
    -5> 2016-10-03 12:13:13.112353 7fe13c2cd700  1 -- 10.8.128.111:0/3815556558 <== osd.5 10.8.128.110:6808/4307 2 ==== osd_op_reply(2 benchmark_data_magna111_19058_object0 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 204+0+0 (3295514241 0 0) 0x7fe12c000ca0 con 0x35b0000
    -4> 2016-10-03 12:13:13.112429 7fe13c2cd700  1 -- 10.8.128.111:0/3815556558 <== osd.5 10.8.128.110:6808/4307 3 ==== osd_op_reply(4 benchmark_data_magna111_19058_object2 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 204+0+0 (1055835100 0 0) 0x7fe12c000bc0 con 0x35b0000
    -3> 2016-10-03 12:13:13.112571 7fe13c2cd700  1 -- 10.8.128.111:0/3815556558 <== osd.5 10.8.128.110:6808/4307 4 ==== osd_op_reply(9 benchmark_data_magna111_19058_object7 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 204+0+0 (2382207108 0 0) 0x7fe12c000bc0 con 0x35b0000
    -2> 2016-10-03 12:13:13.112676 7fe13c2cd700  1 -- 10.8.128.111:0/3815556558 <== osd.5 10.8.128.110:6808/4307 5 ==== osd_op_reply(11 benchmark_data_magna111_19058_object9 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 204+0+0 (614169484 0 0) 0x7fe12c000bc0 con 0x35b0000
    -1> 2016-10-03 12:13:13.112833 7fe126ef8700  1 -- 10.8.128.111:0/3815556558 <== osd.4 10.8.128.110:6804/4088 1 ==== osd_op_reply(3 benchmark_data_magna111_19058_object1 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 204+0+0 (1600347836 0 0) 0x7fe100000b00 con 0x35ca0d0
     0> 2016-10-03 12:13:13.112853 7fe14766f7c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7fe14766f7c0

 ceph version 0.94.9-3.el7cp (7358f71bebe44c463df4d91c2770149e812bbeaa)
 1: rados() [0x4efb72]
 2: (()+0xf100) [0x7fe144330100]
 3: (()+0x15dd20) [0x7fe14348ed20]
 4: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0xbeb) [0x4e274b]
 5: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, bool)+0x307) [0x4e7ad7]
 6: (main()+0x9195) [0x4c5f85]
 7: (__libc_start_main()+0xf5) [0x7fe143352b15]
 8: rados() [0x4ca549]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent       500
  max_new         1000
  log_file 
--- end dump of recent events ---


Expected results:
rados bench read should execute without any errors.

Observation:
The rados bench sequential read looks for objects whose names contain the hostname of the node from which the read was executed, but the pool has no objects named after that node.
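
The object names in the log (for example benchmark_data_magna111_19058_object0) suggest that each name embeds the hostname of the node running the benchmark, followed by a process id and an object index. A minimal C++ sketch of that pattern follows. bench_object_name is a hypothetical helper written for illustration; it is inferred from the log output and is not the function used in obj_bencher.cc.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the naming pattern seen in the log:
//   benchmark_data_<hostname>_<pid>_object<index>
// The real naming code lives in Ceph's obj_bencher.cc and may differ.
std::string bench_object_name(const std::string& hostname, int pid, int index) {
    return "benchmark_data_" + hostname + "_" + std::to_string(pid) +
           "_object" + std::to_string(index);
}
```

Under this pattern, a write run on a node called node1 (an assumed name) would create objects starting with benchmark_data_node1_, while a read started on magna111 asks for benchmark_data_magna111_ objects. That is consistent with the -2 (No such file or directory) replies shown in the log.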

Comment 2 Loic Dachary 2016-10-06 16:31:56 UTC
(gdb) bt
bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:69
#1  0x0000000000872829 in ObjBencher::seq_read_bench (this=0x7fffffffe730,
    seconds_to_run=20, num_objects=36, concurrentios=16, pid=1587,
    no_verify=false) at common/obj_bencher.cc:592
#2  0x00000000008703d5 in ObjBencher::aio_bench (this=0x7fffffffe730,
    operation=2, secondsToRun=20, maxObjectsToCreate=0, concurrentios=16,
    op_size=4194304, cleanup=true, run_name=0x0, no_verify=false)
    at common/obj_bencher.cc:208
Python Exception <class 'IndexError'> list index out of range:
#3  0x000000000084b340 in rados_tool_common (opts=std::map with 1 elements,
    nargs=std::vector of length 3, capacity 8 = {...})
    at tools/rados/rados.cc:2271
#4  0x0000000000850e61 in main (argc=6, argv=0x7fffffffecd8)
    at tools/rados/rados.cc:2732
(gdb) frame 1
frame 1
#1  0x0000000000872829 in ObjBencher::seq_read_bench (this=0x7fffffffe730,
    seconds_to_run=20, num_objects=36, concurrentios=16, pid=1587,
    no_verify=false) at common/obj_bencher.cc:592
592           if (memcmp(data.object_contents, cur_contents->c_str(), data.object_size) != 0) {
(gdb) list
list
587         // invalidate internal crc cache
588         cur_contents->invalidate_crc();
589
590         if (!no_verify) {
591           snprintf(data.object_contents, data.object_size, "I'm the %16dth object!", current_index);
592           if (memcmp(data.object_contents, cur_contents->c_str(), data.object_size) != 0) {
593             cerr << name[slot] << " is not correct!" << std::endl;
594             ++errors;
595           }
596         }
(gdb) print cur_contents->c_str()
print cur_contents->c_str()
$3 = 0x0
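
The session above shows memcmp being handed a null pointer (cur_contents->c_str() is 0x0), presumably because the failed read returned no data. One way to avoid this class of crash is to check the buffer before comparing. The sketch below is illustrative only: contents_match is a hypothetical helper, not code from Ceph, and it is not claimed to be the change that went into the fixed version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the verification step at obj_bencher.cc:592.
// Returns true only if `actual` is non-null, has the expected length,
// and matches `expected` byte for byte.
bool contents_match(const char* expected, const char* actual,
                    std::size_t actual_len, std::size_t object_size) {
    if (actual == nullptr || actual_len != object_size) {
        return false;  // failed or short read: nothing safe to compare
    }
    return std::memcmp(expected, actual, object_size) == 0;
}
```

With a guard like this, a read that fails with ENOENT would be counted as a verification error instead of dereferencing a null buffer.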

Comment 3 Loic Dachary 2017-01-05 15:55:04 UTC
http://tracker.ceph.com/issues/17526 was incorrectly set to Resolved and not backported. This is now fixed and should be ready for 2.3.

Comment 7 Josh Durgin 2017-04-01 02:16:33 UTC
The rados man page update is included in 10.2.6.

Comment 13 errata-xmlrpc 2017-06-19 13:27:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1497