Bug 1472273

Summary: noticing WriteTimeoutException with Cassandra write query
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Prasanna Kumar Kalever <prasanna.kalever>
Component: gluster-block
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: unspecified
Priority: unspecified
Docs Contact:
Version: rhgs-3.3
CC: amukherj, kdhananj, knarra, mpillai, pkarampu, rcyriac, rhs-bugs, sabose, sasundar
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-37
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 05:02:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417151
Attachments: test results and volume profile info (Flags: none)

Description Prasanna Kumar Kalever 2017-07-18 10:42:41 UTC
Created attachment 1300401 [details]
test results and volume profile info

Description of problem:
Cassandra, with its data directories placed on an XFS filesystem backed by a gluster-block device (replica 3 volume), throws WriteTimeoutExceptions under a cassandra-stress write workload. Setup details:
[root@gprfs046 ~]# df -Th
Filesystem                     Type      Size  Used Avail Use% Mounted on
[...]
/dev/mapper/mpathn             xfs        40G  834M   40G   3% /tmp/block_replica3
[...]

[root@gprfs046 ~]# lsblk -S
NAME HCTL       TYPE VENDOR   MODEL             REV TRAN
sda  0:2:0:0    disk DELL     PERC H710P       3.13 
sdb  0:2:1:0    disk DELL     PERC H710P       3.13 
[...]
sdp  65:0:0:0   disk LIO-ORG  TCMU device      0002 iscsi
sdq  66:0:0:0   disk LIO-ORG  TCMU device      0002 iscsi
sdr  67:0:0:0   disk LIO-ORG  TCMU device      0002 iscsi

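For reference, the /dev/mapper/mpathn mount above sits on a gluster-block device exported over iSCSI by tcmu-runner/LIO (the LIO-ORG TCMU entries in lsblk). A rough sketch of how such a device is typically created and attached follows; the block name, size and discovery host are illustrative, not the exact commands used for this run:

# create a block device on the replica3 volume, highly available across the 3 nodes
gluster-block create replica3/block0 ha 3 10.16.154.213,10.16.154.219,10.16.154.221 40GiB

# on the client: discover and log in to the iSCSI targets exported by LIO/tcmu-runner
iscsiadm -m discovery -t sendtargets -p 10.16.154.213
iscsiadm -m node -l

# format the resulting multipath device and mount it where Cassandra expects its data
mkfs.xfs /dev/mapper/mpathn
mkdir -p /tmp/block_replica3
mount /dev/mapper/mpathn /tmp/block_replica3

cassandra.yaml is then pointed at this mount, as the excerpts below show.
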
# hints_directory: /var/lib/cassandra/hints
hints_directory: /tmp/block_replica3/glusterfs/cassandra/hints
--
data_file_directories:
    - /tmp/block_replica3/glusterfs/cassandra/data
--
#commitlog_directory: /var/lib/cassandra/commitlog
commitlog_directory: /tmp/block_replica3/glusterfs/cassandra/commitlog
--
#saved_caches_directory: /var/lib/cassandra/saved_caches
saved_caches_directory: /tmp/block_replica3/glusterfs/cassandra/saved_caches
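
The cassandra.yaml deltas above only repoint the hints, data, commitlog and saved_caches directories at the block mount. A minimal sketch of preparing that layout (the conf path, service name and cassandra user are assumed from the cassandra30 packages, not taken from this setup):

# create the directory layout referenced by the cassandra.yaml settings above
BASE=/tmp/block_replica3/glusterfs/cassandra
mkdir -p $BASE/hints $BASE/data $BASE/commitlog $BASE/saved_caches
chown -R cassandra:cassandra /tmp/block_replica3/glusterfs

# after editing /etc/cassandra/conf/cassandra.yaml as shown above, restart and re-check the node
systemctl restart cassandra
nodetool status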


[root@gprfs046]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  668.08 MB  256          100.0%            975f97b9-3842-4b1b-84dc-7a687a34f7dc  rack1


# cassandra-stress write duration=2m -rate threads=200
Connected to cluster: Test Cluster, max pending requests per connection 128, max connections per host 8
Datatacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 50000 iterations...
Running WRITE with 200 threads 2 minutes
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,         27002,   25955,   25955,   25955,     7.7,     3.8,    19.9,    61.6,   220.1,   242.1,    1.0,  0.00000,      0,      1,     193,     193,       0,    1590
total,         86637,   54430,   54430,   54430,     3.8,     2.4,    10.7,    24.1,    72.0,   127.9,    2.1,  0.18632,      0,      0,       0,       0,       0,       0
total,        153993,   60128,   60128,   60128,     3.2,     2.2,     7.1,    16.1,   122.7,   133.6,    3.3,  0.16573,      0,      1,     116,     116,       0,    1534
total,        237245,   71887,   71887,   71887,     2.8,     2.1,     6.3,    12.4,    51.1,    76.9,    4.4,  0.13822,      0,      0,       0,       0,       0,       0
total,        302847,   59854,   59854,   59854,     3.3,     2.2,     6.9,    15.6,   160.2,   179.9,    5.5,  0.10977,      0,      1,     151,     151,       0,    1463
[...]
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
[...]

[root@gprfc076 ~]# gluster vol info replica3
Volume Name: replica3
Type: Replicate
Volume ID: 4787dea8-b60f-4f5e-9ae2-1685a455353f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.154.213:/brickR3
Brick2: 10.16.154.219:/brickR3
Brick3: 10.16.154.221:/brickR3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.ensure-durability: off
server.allow-insecure: on
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.readdir-ahead: off
performance.open-behind: off
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
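
The options above are essentially the gluster-block profile plus the diagnostics counters. Assuming the group-gluster-block file shipped with these builds, a comparable volume could be set up roughly as follows (treat this as a sketch, not the exact commands used here):

gluster volume create replica3 replica 3 \
    10.16.154.213:/brickR3 10.16.154.219:/brickR3 10.16.154.221:/brickR3 force
gluster volume set replica3 group gluster-block      # applies the block profile options in one shot
gluster volume start replica3

# turn on per-FOP accounting so latencies can be inspected later
gluster volume set replica3 diagnostics.latency-measurement on
gluster volume set replica3 diagnostics.count-fop-hits on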


Version-Release number of selected component (if applicable):
Client:
[root@gprfs046 test_results]# rpm -qa | grep cassandra
cassandra30-3.0.9-1.noarch
cassandra30-tools-3.0.9-1.noarch

Server:
[root@gprfc076 ~]# rpm -qa | grep glusterfs
glusterfs-3.8.4-33.el7rhgs.x86_64
glusterfs-server-3.8.4-33.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-33.el7rhgs.x86_64
glusterfs-api-3.8.4-33.el7rhgs.x86_64
glusterfs-cli-3.8.4-33.el7rhgs.x86_64
glusterfs-libs-3.8.4-33.el7rhgs.x86_64
glusterfs-fuse-3.8.4-33.el7rhgs.x86_64
[root@gprfc076 ~]# rpm -qa | grep tcmu-runner
tcmu-runner-debuginfo-1.2.0-5.el7rhgs.x86_64
tcmu-runner-1.2.0-10.el7rhgs.x86_64
[root@gprfc076 ~]# rpm -qa | grep targetcli
targetcli-2.1.fb46-2.el7rhgs.noarch
[root@gprfc076 ~]# rpm -qa | grep gluster-block
gluster-block-0.2.1-6.el7rhgs.x86_64
gluster-block-debuginfo-0.2.1-5.el7rhgs.x86_64
[root@gprfc076 ~]# rpm -qa | grep rtslib
python-rtslib-2.1.fb63-2.el7.noarch
[root@gprfc076 ~]# rpm -qa | grep configshell
python-configshell-1.1.fb23-3.el7.noarch
[root@gprfc076 ~]# uname -a
Linux gprfc076.sbu.lab.eng.bos.redhat.com 3.10.0-686.el7.test.x86_64 #1 SMP Fri Jun 23 03:08:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux


Actual results:
WriteTimeoutExceptions (not hard errors; they are latency-related timeout exceptions)

Expected results:
Lower latency, without WriteTimeoutExceptions

Additional info:
This behavior is not noticed with non-replica gluster volumes.

Volume combinations tried:
replica 1X3 with block profile -> HIT
replica 1X3 with block profile except sharding -> HIT
plain distribute (3 nodes) with block profile -> Miss (no exceptions, working as expected)

From the volume profile, Pranith helped spot the following:

| There is a big INODELK delay:
| I think this is the fxattrop taking metadata lock in afr :-/
|      82.85   11123.59 us      15.00 us 1811871.00 us          65658    FINODELK

but this needs more investigation at the moment.
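
For completeness, the per-FOP latencies quoted above come from the standard volume profiling commands; a sketch of how they are typically collected around a test run:

gluster volume profile replica3 start
# ... run the cassandra-stress workload on the block mount ...
gluster volume profile replica3 info > /tmp/replica3-profile.txt
gluster volume profile replica3 stop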

Please find the attachment containing:
1. Cassandra test output with exceptions on the replica-volume-backed block device
2. Cassandra test output without exceptions on the distribute-volume-backed block device
3. Replica volume profile (taken while running the Cassandra test) on the block mount
4. Distribute volume profile (taken while running the Cassandra test) on the block mount

Note: 
WriteTimeoutExceptions are not errors; they indicate write acknowledgements in the Cassandra DB taking longer than the configured timeout.
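
Put differently, the driver reports a timeout whenever the coordinator does not get the write acknowledged within Cassandra's write timeout, which lines up with the multi-second lock latencies discussed below. The relevant knob lives in cassandra.yaml (the conf path is assumed; 2000 ms is believed to be the 3.0 default):

# server-side limit on how long the coordinator waits for write acks before
# surfacing a WriteTimeoutException to the client
grep write_request_timeout_in_ms /etc/cassandra/conf/cassandra.yaml
# expected output (assuming the stock default): write_request_timeout_in_ms: 2000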

Comment 4 Pranith Kumar K 2017-07-22 11:01:40 UTC
Some updates:
We tried recreating this issue on EC and AFR, because these two xlators are similar, to see what the differences are. We saw that on both the bricks it fails. When we attached strace to the brick process, we realized all files are opened with O_RDWR without any O_DIRECT, so we found one of the problems to be remote-dio being 'enable'. After disabling this, EC started passing the cassandra runs, but AFR is still failing because metadata locks are taking a long time, ~2 seconds. Next steps are to implement eager-locking in AFR and try out the test again.
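
The check described above can be repeated with a plain strace on a brick and a one-line option change; a rough sketch (brick PID selection and the volume name are illustrative):

# watch open flags on a brick process to see whether O_DIRECT survives to the backend
strace -f -e trace=open,openat -p $(pgrep -f glusterfsd | head -1) 2>&1 | grep O_DIRECT

# stop filtering O_DIRECT at the client protocol layer so it reaches the bricks
gluster volume set replica3 network.remote-dio disable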

Sahina/Krutika/sas/Kasturi,
     I copied the profile from virt and modified some of the options. I was wondering why we are enabling remote-dio in the profile; isn't it causing any problems in virt?


Pranith

Comment 5 Pranith Kumar K 2017-07-22 11:03:09 UTC
I just wanted to increase the odds of getting a response over the weekend; feel free to clear the needinfo on everyone when you answer.

Comment 6 Pranith Kumar K 2017-07-22 11:05:34 UTC
(In reply to Pranith Kumar K from comment #4)
> Some updates:
> We tried recreating this issue on EC and AFR because both these xlators are
> similar and to see what are the differences. We saw that on both the bricks

I meant on both EC and AFR it fails.

> it fails. [...]

Comment 7 Krutika Dhananjay 2017-07-22 14:18:20 UTC
(In reply to Pranith Kumar K from comment #4)
> Some updates:
> We tried recreating this issue on EC and AFR because both these xlators are
> similar and to see what are the differences. We saw that on both the bricks
> it fails. When we added strace of the brick process, we realized all files
> are opened with O_RDWR without any O_DIRECT. So we found one of the problems
> to be remote-dio being 'enable' After disabling this, EC started passing the
> cassandra runs, but AFR is still failing because metadata locks are taking a
> long time ~2seconds. Next steps are to implement eager-locking in AFR and
> try out the test again.
> 
> Sahina/Krutika/sas/Kasturi,
>      I copied the profile from virt and modified some of the options, I was
> wondering why we are enabling remote-dio in the profile, isn't it causing
> any problems in virt?
> 
> 
> Pranith

Hmmm, so IIRC, when the last change was made to group virt for Grafton GA, we decided to have o-direct enabled only for intense workloads, and even to make it part of a different profile. I think this was to give users the option to decide whether they want to use O_DIRECT or not. But maybe it's not necessary anymore: if a user doesn't want o-direct, we can assume they would not be using cache=none either, which means opens won't carry the o-direct flag, so there is no need for protocol/client to even be filtering O_DIRECT.


FWIW, for Grafton, I think cockpit/gdeploy sets the o-direct options (which include disabling remote-dio) after volume creation, at the time of installation.
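
For context, the o-direct options referred to here would be along these lines (option names are from the gluster CLI; whether cockpit/gdeploy sets exactly this pair is what the question above is trying to confirm):

# honour O_DIRECT end to end instead of filtering it at the client
gluster volume set VOLNAME network.remote-dio disable
gluster volume set VOLNAME performance.strict-o-direct on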

Kasturi/Sas/Sahina may correct me if I'm wrong.

-Krutika

Comment 8 RamaKasturi 2017-07-24 05:56:40 UTC
(In reply to Krutika Dhananjay from comment #7)
> Hmmm so IIRC, when the last change was made to group virt for Grafton GA, we
> decided to have odirect enabled only for intense workloads and even make it
> part of a different profile. I think this was to give users the option to
> decide whether they want to use O_DIRECT or not. But maybe it's not
> necessary anymore in the sense that if a user doesn't want o-direct we can
> assume that they would not even be using cache=none, which means opens won't
> have o-direct flag, so there is no need for protocol/client to be even
> filtering O_DIRECT.
> 
> 
> FWIW, for Grafton, I think cockpit/gdeploy sets o-direct options (which
> includes disabling remote-dio) after volume creation at the time of
> installation.

Yes, cockpit sets the o-direct options, which include disabling remote-dio.
> 
> Kasturi/Sas/Sahina may correct me if I'm wrong.
> 
> -Krutika

Comment 9 Pranith Kumar K 2017-08-02 02:22:08 UTC
remote:   https://code.engineering.redhat.com/gerrit/114102 extras: Disable remote-dio in gluster-block profile
remote:   https://code.engineering.redhat.com/gerrit/114103 group-gluster-block: Set default shard-block-size to 64MB
remote:   https://code.engineering.redhat.com/gerrit/114104 extras: Turn eager-lock off for gluster-block

Comment 10 Pranith Kumar K 2017-08-02 02:29:34 UTC
These are the final set of options after all these patches:
performance.quick-read=off
performance.read-ahead=off
performance.io-cache=off
performance.stat-prefetch=off
performance.open-behind=off
performance.readdir-ahead=off
network.remote-dio=disable
cluster.eager-lock=disable
cluster.quorum-type=auto
cluster.data-self-heal-algorithm=full
cluster.locking-scheme=granular
cluster.shd-max-threads=8
cluster.shd-wait-qlength=10000
features.shard=on
features.shard-block-size=64MB
user.cifs=off
server.allow-insecure=on
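
Relative to the options on the volume in the description, the deltas boil down to three settings (remote-dio, eager-lock and the shard block size). A sketch of applying and verifying them on an existing volume:

gluster volume set replica3 network.remote-dio disable
gluster volume set replica3 cluster.eager-lock disable
gluster volume set replica3 features.shard-block-size 64MB

# confirm the effective values
gluster volume get replica3 network.remote-dio
gluster volume get replica3 cluster.eager-lock
gluster volume get replica3 features.shard-block-size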

Comment 11 Atin Mukherjee 2017-08-02 05:28:41 UTC
(In reply to Pranith Kumar K from comment #9)
> remote:   https://code.engineering.redhat.com/gerrit/114102 extras: Disable
> remote-dio in gluster-block profile
> remote:   https://code.engineering.redhat.com/gerrit/114103
> group-gluster-block: Set default shard-block-size to 64MB
> remote:   https://code.engineering.redhat.com/gerrit/114104 extras: Turn
> eager-lock off for gluster-block

3.12 patches:

https://review.gluster.org/#/c/17918/
https://review.gluster.org/#/c/17919/
https://review.gluster.org/#/c/17955/

Comment 13 krishnaram Karthick 2017-09-19 05:29:48 UTC
This issue is not seen on CNS builds above cns-deploy-5.0.0-38.el7rhgs.x86_64, i.e., with tcmu-runner-1.2.0-14.el7rhgs.x86_64.

Logging and metrics using gluster-block in the OCP environment have been up for close to 4 days now without any issues.

Moving the bug to verified.

Comment 15 errata-xmlrpc 2017-09-21 05:02:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774