Bug 763347 (GLUSTER-1615)

Summary: dbench fails on rdma
Product: [Community] GlusterFS
Reporter: Anush Shetty <anush>
Component: rdma
Assignee: Raghavendra G <raghavendra>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: low
Version: mainline
CC: amarts, gluster-bugs, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: fuse
Documentation: DNR
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Anush Shetty 2010-09-15 21:49:13 EDT
Client log:

[2010-09-15 10:49:56.504216] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279807 sent = 2010-09-15 10:19:54. timeout = 1800
[2010-09-15 10:53:06.695340] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279809 sent = 2010-09-15 10:23:04. timeout = 1800
[2010-09-15 10:53:06.695472] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279808 sent = 2010-09-15 10:23:04. timeout = 1800
[2010-09-15 10:53:06.696597] D [afr-common.c:562:afr_lookup_collect_xattr] sep15_replica-replicate-1: entry self-heal is pending for /clients/client8/~dmtmp/ACCESS.
[2010-09-15 11:00:37.150115] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(ENTRYLK(31)) xid = 23279810 sent = 2010-09-15 10:30:35. timeout = 1800
[2010-09-15 11:00:37.150303] D [afr-lk-common.c:989:afr_lock_blocking] sep15_replica-replicate-0: we're done locking
[2010-09-15 11:00:37.150341] D [afr-transaction.c:935:afr_post_blocking_entrylk_cbk] sep15_replica-replicate-0: Blocking entrylks done. Proceeding to FOP
[2010-09-15 11:00:37.151531] D [dht-rename.c:274:dht_rename_unlink_cbk] sep15_replica-dht: unlink on sep15_replica-replicate-0 failed (No such file or directory)
[2010-09-15 11:00:37.151556] W [fuse-bridge.c:1297:fuse_rename_cbk] glusterfs-fuse: 20692474: /clients/client6/~dmtmp/EXCEL/0BBC0000 -> /clients/client6/~dmtmp/EXCEL/RESULTS.XLS => -1 (Transport endpoint is not connected)
[2010-09-15 11:00:37.151754] D [afr-lk-common.c:410:transaction_lk_op] sep15_replica-replicate-0: lk op is for a transaction
[2010-09-15 11:05:17.431195] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(FSYNC(16)) xid = 23279811 sent = 2010-09-15 10:35:15. timeout = 1800
[2010-09-15 11:06:37.511926] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(SETATTR(38)) xid = 23279812 sent = 2010-09-15 10:36:35. timeout = 1800
[2010-09-15 11:06:37.511989] D [afr-self-heal-data.c:107:afr_sh_data_flush_cbk] sep15_replica-replicate-0: flush or setattr failed on /clients/client5/~dmtmp/PM/PMD383.TMP on subvolume sep15_replica-client-0: Transport endpoint is not connected
[2010-09-15 11:06:37.512014] I [afr-self-heal-common.c:1583:afr_self_heal_completion_cbk] sep15_replica-replicate-0: background  meta-data data entry self-heal completed on /clients/client5/~dmtmp/PM/PMD383.TMP
[2010-09-15 11:19:58.317274] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279813 sent = 2010-09-15 10:49:56. timeout = 1800
[2010-09-15 11:23:08.511380] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279815 sent = 2010-09-15 10:53:06. timeout = 1800
[2010-09-15 11:23:08.511520] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279814 sent = 2010-09-15 10:53:06. timeout = 1800
[2010-09-15 11:23:08.513528] D [afr-common.c:568:afr_lookup_collect_xattr] sep15_replica-replicate-1: data self-heal is pending for /clients/client8/~dmtmp/ACCESS/LABELS.PRN.
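
Not part of the original report: when triaging a log like the one above, it can help to pull out the `call_bail` entries and compare the time between `sent` and the bail-out against the reported timeout. The following is a minimal sketch. The regular expression is an assumption derived only from the lines quoted here (it also tolerates a line wrap after `sent =`), and `parse_bails` is a hypothetical helper name, not a GlusterFS tool.

```python
import re
from datetime import datetime

# Assumed pattern, based solely on the call_bail lines quoted in this report.
BAIL_RE = re.compile(
    r"\[(?P<logged>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\.\d+\]\s+E\s+"
    r"\[rpc-clnt\.c:\d+:call_bail\]\s+(?P<client>\S+):\s+bailing out frame\s+"
    r"type\((?P<type>[^)]*)\)\s+op\((?P<op>\w+)\((?P<opnum>\d+)\)\)\s+"
    r"xid\s*=\s*(?P<xid>\d+)\s+"
    r"sent\s*=\s*(?P<sent>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\.\s+"
    r"timeout\s*=\s*(?P<timeout>\d+)"
)
FMT = "%Y-%m-%d %H:%M:%S"


def parse_bails(log_text):
    """Return one dict per call_bail entry, with the sent-to-bail delay in seconds."""
    rows = []
    for m in BAIL_RE.finditer(log_text):
        logged = datetime.strptime(m["logged"], FMT)
        sent = datetime.strptime(m["sent"], FMT)
        rows.append({
            "client": m["client"],
            "op": m["op"],
            "xid": int(m["xid"]),
            "elapsed_s": int((logged - sent).total_seconds()),
            "timeout_s": int(m["timeout"]),
        })
    return rows


# Two entries copied from the client log above (the first one wrapped as pasted).
SAMPLE = """\
[2010-09-15 10:49:56.504216] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 23279807 sent = 
2010-09-15 10:19:54. timeout = 1800
[2010-09-15 11:05:17.431195] E [rpc-clnt.c:196:call_bail] sep15_replica-client-0: bailing out frame type(GlusterFS 3.1) op(FSYNC(16)) xid = 23279811 sent = 2010-09-15 10:35:15. timeout = 1800
"""

if __name__ == "__main__":
    for row in parse_bails(SAMPLE):
        print(row)
```

For both sample entries the delay works out to 1802 seconds, i.e. each frame was bailed out just after the 1800-second timeout expired, and all of them are on `sep15_replica-client-0`.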
Comment 1 Anush Shetty 2010-09-16 00:45:09 EDT
This was on a distribute+replicate setup, and dbench was set to run for 6 hours.

  10    146265     0.12 MB/sec  execute 10575 sec  latency 8983676.996 ms
  10    146265     0.12 MB/sec  execute 10576 sec  latency 8984678.020 ms
  10    146265     0.12 MB/sec  execute 10577 sec  latency 8985679.040 ms
  10    146265     0.12 MB/sec  execute 10578 sec  latency 8986680.061 ms
  10    146265     0.12 MB/sec  execute 10579 sec  latency 8987681.082 ms
  10    146265     0.12 MB/sec  execute 10580 sec  latency 8988682.105 ms
  10    146265     0.12 MB/sec  execute 10581 sec  latency 8989683.126 ms
  10    146265     0.12 MB/sec  execute 10582 sec  latency 8990684.148 ms
  10    146265     0.12 MB/sec  execute 10583 sec  latency 8991685.170 ms
  10    146265     0.12 MB/sec  execute 10584 sec  latency 8992686.192 ms
  10    146265     0.12 MB/sec  execute 10585 sec  latency 8993687.213 ms
  10    146265     0.12 MB/sec  execute 10586 sec  latency 8994688.235 ms
  10    146265     0.12 MB/sec  execute 10587 sec  latency 8995689.257 ms
  10    146265     0.12 MB/sec  execute 10588 sec  latency 8996690.279 ms
  10    146265     0.12 MB/sec  execute 10589 sec  latency 8997691.300 ms
  10    146265     0.12 MB/sec  execute 10590 sec  latency 8998692.322 ms
  10    146265     0.12 MB/sec  execute 10591 sec  latency 8999693.344 ms
  10    146265     0.12 MB/sec  execute 10592 sec  latency 9000694.366 ms
  10    146265     0.12 MB/sec  execute 10593 sec  latency 9001695.387 ms
  10    146265     0.12 MB/sec  execute 10594 sec  latency 9002696.409 ms
  10    146265     0.12 MB/sec  execute 10595 sec  latency 9003697.430 ms
  10    146265     0.12 MB/sec  execute 10596 sec  latency 9004698.453 ms
  10    146265     0.12 MB/sec  execute 10597 sec  latency 9005699.474 ms
  10    146265     0.12 MB/sec  execute 10598 sec  latency 9006700.496 ms
  10    146265     0.12 MB/sec  execute 10599 sec  latency 9007701.517 ms
  10    146265     0.12 MB/sec  execute 10600 sec  latency 9008702.540 ms
  10    146265     0.12 MB/sec  execute 10601 sec  latency 9009703.561 ms
[144300] rename ./clients/client6/~dmtmp/EXCEL/0BBC0000 ./clients/client6/~dmtmp/EXCEL/RESULTS.XLS failed (Transport endpoint is not connected) - expected NT_STATUS_OK
ERROR: child 6 failed at line 144300
Child failed with status 1
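
Not part of the original comment: in the progress lines above, the second column and the throughput never change, while the reported latency grows by roughly 1001 ms for every second of execution. That pattern is what one would expect when a single operation is stuck rather than when the run is merely slow. A minimal sketch of that check follows. The column meanings (second column taken as dbench's running operation count) and the helper name `looks_stalled` are assumptions for illustration.

```python
import re

# Assumed layout of a dbench progress line, based on the output quoted above.
PROGRESS_RE = re.compile(
    r"^\s*(?P<procs>\d+)\s+(?P<count>\d+)\s+(?P<mbps>[\d.]+) MB/sec\s+"
    r"execute\s+(?P<sec>\d+) sec\s+latency\s+(?P<lat_ms>[\d.]+) ms",
    re.MULTILINE,
)


def looks_stalled(text, rel_tol=0.01):
    """True if the count is frozen and latency grows in step with wall-clock time."""
    rows = [
        (int(m["count"]), int(m["sec"]), float(m["lat_ms"]))
        for m in PROGRESS_RE.finditer(text)
    ]
    if len(rows) < 2:
        return False
    first, last = rows[0], rows[-1]
    count_frozen = all(r[0] == first[0] for r in rows)
    wall_ms = (last[1] - first[1]) * 1000.0
    latency_growth_ms = last[2] - first[2]
    if wall_ms <= 0:
        return False
    return count_frozen and abs(latency_growth_ms - wall_ms) <= rel_tol * wall_ms


# First, second, and last progress lines from the output above.
SAMPLE = """\
  10    146265     0.12 MB/sec  execute 10575 sec  latency 8983676.996 ms
  10    146265     0.12 MB/sec  execute 10576 sec  latency 8984678.020 ms
  10    146265     0.12 MB/sec  execute 10601 sec  latency 9009703.561 ms
"""

if __name__ == "__main__":
    print(looks_stalled(SAMPLE))
```

Over the 26 seconds covered by the sample, the latency figure grows by about 26027 ms, which matches the elapsed time to within 0.1%.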
Comment 2 Amar Tumballi 2010-09-20 23:50:17 EDT
RDMA is not a blocker, but it is critical.
Comment 3 Raghavendra G 2010-11-10 21:55:40 EST
On a two-node replicate setup, dbench ran successfully. We should run the tests on a distributed replicate setup.
Comment 4 Raghavendra G 2010-11-10 21:57:57 EST
The tests were run on git commit id eaf0618e47b4e575180a9cbdbeda6ff5.
Comment 5 Raghavendra G 2010-11-13 21:27:28 EST
dbench also ran successfully on a distributed replicate setup. Hence closing this bug.