Bug 1472273
Summary: noticing WriteTimeoutException with Cassandra write query
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: gluster-block
Version: rhgs-3.3
Target Release: RHGS 3.3.0
Status: CLOSED ERRATA
Fixed In Version: glusterfs-3.8.4-37
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Prasanna Kumar Kalever <prasanna.kalever>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: krishnaram Karthick <kramdoss>
CC: amukherj, kdhananj, knarra, mpillai, pkarampu, rcyriac, rhs-bugs, sabose, sasundar
Bug Blocks: 1417151
Last Closed: 2017-09-21 05:02:13 UTC
Some updates:
We tried recreating this issue on EC and AFR because both these xlators are similar, and to see what the differences are. We saw that on both the bricks it fails. When we attached strace to the brick process, we realized all files are opened with O_RDWR without any O_DIRECT. So we found one of the problems to be remote-dio being set to 'enable'. After disabling this, EC started passing the Cassandra runs, but AFR is still failing because metadata locks are taking a long time, ~2 seconds. Next steps are to implement eager-locking in AFR and try the test again.

Sahina/Krutika/sas/Kasturi,
I copied the profile from virt and modified some of the options. I was wondering why we are enabling remote-dio in the profile; isn't it causing any problems in virt?

Pranith

I just wanted to increase the odds of getting a response over the weekend; feel free to clear the needinfo on everyone when you answer.

(In reply to Pranith Kumar K from comment #4)
> Some updates:
> We tried recreating this issue on EC and AFR because both these xlators are
> similar, and to see what the differences are. We saw that on both the bricks

I meant it fails on both EC and AFR.

> it fails. When we attached strace to the brick process, we realized all files
> are opened with O_RDWR without any O_DIRECT. So we found one of the problems
> to be remote-dio being set to 'enable'. After disabling this, EC started
> passing the Cassandra runs, but AFR is still failing because metadata locks
> are taking a long time, ~2 seconds. Next steps are to implement eager-locking
> in AFR and try the test again.
>
> Sahina/Krutika/sas/Kasturi,
> I copied the profile from virt and modified some of the options. I was
> wondering why we are enabling remote-dio in the profile; isn't it causing
> any problems in virt?
>
>
> Pranith

(In reply to Pranith Kumar K from comment #4)
> Some updates:
> We tried recreating this issue on EC and AFR because both these xlators are
> similar, and to see what the differences are. We saw that on both the bricks
> it fails. When we attached strace to the brick process, we realized all files
> are opened with O_RDWR without any O_DIRECT. So we found one of the problems
> to be remote-dio being set to 'enable'. After disabling this, EC started
> passing the Cassandra runs, but AFR is still failing because metadata locks
> are taking a long time, ~2 seconds. Next steps are to implement eager-locking
> in AFR and try the test again.
>
> Sahina/Krutika/sas/Kasturi,
> I copied the profile from virt and modified some of the options. I was
> wondering why we are enabling remote-dio in the profile; isn't it causing
> any problems in virt?
>
>
> Pranith

Hmmm so IIRC, when the last change was made to group virt for Grafton GA, we decided to have o-direct enabled only for intense workloads and even make it part of a different profile. I think this was to give users the option to decide whether they want to use O_DIRECT or not. But maybe it's not necessary anymore, in the sense that if a user doesn't want o-direct we can assume they would not even be using cache=none, which means opens won't have the o-direct flag, so there is no need for protocol/client to even be filtering O_DIRECT.

FWIW, for Grafton, I think cockpit/gdeploy sets the o-direct options (which includes disabling remote-dio) after volume creation, at the time of installation.

Kasturi/Sas/Sahina may correct me if I'm wrong.

-Krutika
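For anyone retracing the diagnosis in comment #4, a minimal sketch of the two checks it describes (the pgrep-based PID lookup and the strace filter are illustrative assumptions; the volume name comes from this report):

# Watch how the brick process opens files. With remote-dio enabled, the
# client-side protocol translator strips O_DIRECT, so opens show up as
# plain O_RDWR:
strace -f -e trace=open,openat -p "$(pgrep -f glusterfsd | head -1)"

# Let O_DIRECT pass through to the bricks again:
gluster volume set replica3 network.remote-dio disable

With remote-dio disabled, O_DIRECT opens reach the bricks unfiltered, which is what made the EC runs start passing.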
(In reply to Krutika Dhananjay from comment #7)
> (In reply to Pranith Kumar K from comment #4)
> > Some updates:
> > We tried recreating this issue on EC and AFR because both these xlators are
> > similar, and to see what the differences are. We saw that on both the bricks
> > it fails. When we attached strace to the brick process, we realized all files
> > are opened with O_RDWR without any O_DIRECT. So we found one of the problems
> > to be remote-dio being set to 'enable'. After disabling this, EC started
> > passing the Cassandra runs, but AFR is still failing because metadata locks
> > are taking a long time, ~2 seconds. Next steps are to implement eager-locking
> > in AFR and try the test again.
> >
> > Sahina/Krutika/sas/Kasturi,
> > I copied the profile from virt and modified some of the options. I was
> > wondering why we are enabling remote-dio in the profile; isn't it causing
> > any problems in virt?
> >
> >
> > Pranith
>
> Hmmm so IIRC, when the last change was made to group virt for Grafton GA, we
> decided to have o-direct enabled only for intense workloads and even make it
> part of a different profile. I think this was to give users the option to
> decide whether they want to use O_DIRECT or not. But maybe it's not
> necessary anymore, in the sense that if a user doesn't want o-direct we can
> assume they would not even be using cache=none, which means opens won't have
> the o-direct flag, so there is no need for protocol/client to even be
> filtering O_DIRECT.
>
>
> FWIW, for Grafton, I think cockpit/gdeploy sets the o-direct options (which
> includes disabling remote-dio) after volume creation, at the time of
> installation.

Yes, cockpit sets the o-direct options, which includes disabling remote-dio.

> Kasturi/Sas/Sahina may correct me if I'm wrong.
>
> -Krutika

remote: https://code.engineering.redhat.com/gerrit/114102  extras: Disable remote-dio in gluster-block profile
remote: https://code.engineering.redhat.com/gerrit/114103  group-gluster-block: Set default shard-block-size to 64MB
remote: https://code.engineering.redhat.com/gerrit/114104  extras: Turn eager-lock off for gluster-block

These are the final set of options after all these patches:

performance.quick-read=off
performance.read-ahead=off
performance.io-cache=off
performance.stat-prefetch=off
performance.open-behind=off
performance.readdir-ahead=off
network.remote-dio=disable
cluster.eager-lock=disable
cluster.quorum-type=auto
cluster.data-self-heal-algorithm=full
cluster.locking-scheme=granular
cluster.shd-max-threads=8
cluster.shd-wait-qlength=10000
features.shard=on
features.shard-block-size=64MB
user.cifs=off
server.allow-insecure=on

(In reply to Pranith Kumar K from comment #9)
> remote: https://code.engineering.redhat.com/gerrit/114102  extras: Disable
> remote-dio in gluster-block profile
> remote: https://code.engineering.redhat.com/gerrit/114103
> group-gluster-block: Set default shard-block-size to 64MB
> remote: https://code.engineering.redhat.com/gerrit/114104  extras: Turn
> eager-lock off for gluster-block

3.12 patches:
https://review.gluster.org/#/c/17918/
https://review.gluster.org/#/c/17919/
https://review.gluster.org/#/c/17955/

This issue is not seen on CNS (builds above cns-deploy-5.0.0-38.el7rhgs.x86_64), i.e., tcmu-runner-1.2.0-14.el7rhgs.x86_64. Logging and metrics using gluster-block in an OCP environment have been up for close to 4 days now without any issues. Moving the bug to verified.
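A usage note on the finalized profile: once the group file from the 'extras' patches above is installed (under /var/lib/glusterd/groups/, which is an assumption about where it lands), the whole option set can be applied in one step rather than option by option. The volume name below is hypothetical:

# Apply the gluster-block option group to a volume in one shot:
gluster volume set blockvol group gluster-block

# Spot-check the two options this bug flipped:
gluster volume get blockvol network.remote-dio
gluster volume get blockvol cluster.eager-lock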
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774
Created attachment 1300401 [details]
test results and volume profile info

Description of problem:

[root@gprfs046 ~]# df -Th
Filesystem          Type  Size  Used Avail Use% Mounted on
[...]
/dev/mapper/mpathn  xfs    40G  834M   40G   3% /tmp/block_replica3
[...]

[root@gprfs046 ~]# lsblk -S
NAME HCTL       TYPE VENDOR   MODEL            REV  TRAN
sda  0:2:0:0    disk DELL     PERC H710P       3.13
sdb  0:2:1:0    disk DELL     PERC H710P       3.13
[...]
sdp  65:0:0:0   disk LIO-ORG  TCMU device      0002 iscsi
sdq  66:0:0:0   disk LIO-ORG  TCMU device      0002 iscsi
sdr  67:0:0:0   disk LIO-ORG  TCMU device      0002 iscsi

Cassandra directories relocated onto the block mount (cassandra.yaml):

# hints_directory: /var/lib/cassandra/hints
hints_directory: /tmp/block_replica3/glusterfs/cassandra/hints
--
data_file_directories:
    - /tmp/block_replica3/glusterfs/cassandra/data
--
#commitlog_directory: /var/lib/cassandra/commitlog
commitlog_directory: /tmp/block_replica3/glusterfs/cassandra/commitlog
--
#saved_caches_directory: /var/lib/cassandra/saved_caches
saved_caches_directory: /tmp/block_replica3/glusterfs/cassandra/saved_caches

[root@gprfs046]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  668.08 MB  256     100.0%            975f97b9-3842-4b1b-84dc-7a687a34f7dc  rack1

# cassandra-stress write duration=2m -rate threads=200
Connected to cluster: Test Cluster, max pending requests per connection 128, max connections per host 8
Datatacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 50000 iterations...
Running WRITE with 200 threads 2 minutes
type,   total ops,   op/s,   pk/s,  row/s,  mean,  med,  .95,  .99,  .999,   max,  time,  stderr, errors,  gc: #, max ms, sum ms, sdv ms,   mb
total,      27002,  25955,  25955,  25955,   7.7,  3.8, 19.9, 61.6, 220.1, 242.1,   1.0, 0.00000,      0,      1,    193,    193,      0, 1590
total,      86637,  54430,  54430,  54430,   3.8,  2.4, 10.7, 24.1,  72.0, 127.9,   2.1, 0.18632,      0,      0,      0,      0,      0,    0
total,     153993,  60128,  60128,  60128,   3.2,  2.2,  7.1, 16.1, 122.7, 133.6,   3.3, 0.16573,      0,      1,    116,    116,      0, 1534
total,     237245,  71887,  71887,  71887,   2.8,  2.1,  6.3, 12.4,  51.1,  76.9,   4.4, 0.13822,      0,      0,      0,      0,      0,    0
total,     302847,  59854,  59854,  59854,   3.3,  2.2,  6.9, 15.6, 160.2, 179.9,   5.5, 0.10977,      0,      1,    151,    151,      0, 1463
[...]
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
[...]
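Worth noting when reading the stress output: the coordinator raises WriteTimeoutException when a write is not acknowledged within write_request_timeout_in_ms (2000 ms by default in Cassandra 3.0), which is the same order as the ~2 s metadata-lock stalls identified later in this bug; the writes are slow, not lost. A rough sketch of the client-side setup behind the output above, with the portal address taken from the brick list and the rest assumed:

# Discover and log in to the gluster-block iSCSI target:
iscsiadm -m discovery -t st -p 10.16.154.213
iscsiadm -m node -l

# Put a filesystem on the resulting multipath device and mount it where
# cassandra.yaml points:
mkfs.xfs /dev/mapper/mpathn
mount /dev/mapper/mpathn /tmp/block_replica3

# Drive the write-heavy workload that surfaces the timeouts:
cassandra-stress write duration=2m -rate threads=200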
[root@gprfc076 ~]# gluster vol info replica3

Volume Name: replica3
Type: Replicate
Volume ID: 4787dea8-b60f-4f5e-9ae2-1685a455353f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.154.213:/brickR3
Brick2: 10.16.154.219:/brickR3
Brick3: 10.16.154.221:/brickR3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.ensure-durability: off
server.allow-insecure: on
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: enable
performance.readdir-ahead: off
performance.open-behind: off
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on

Version-Release number of selected component (if applicable):

Client:
[root@gprfs046 test_results]# rpm -qa | grep cassandra
cassandra30-3.0.9-1.noarch
cassandra30-tools-3.0.9-1.noarch

Server:
[root@gprfc076 ~]# rpm -qa | grep glusterfs
glusterfs-3.8.4-33.el7rhgs.x86_64
glusterfs-server-3.8.4-33.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-33.el7rhgs.x86_64
glusterfs-api-3.8.4-33.el7rhgs.x86_64
glusterfs-cli-3.8.4-33.el7rhgs.x86_64
glusterfs-libs-3.8.4-33.el7rhgs.x86_64
glusterfs-fuse-3.8.4-33.el7rhgs.x86_64

[root@gprfc076 ~]# rpm -qa | grep tcmu-runner
tcmu-runner-debuginfo-1.2.0-5.el7rhgs.x86_64
tcmu-runner-1.2.0-10.el7rhgs.x86_64

[root@gprfc076 ~]# rpm -qa | grep targetcli
targetcli-2.1.fb46-2.el7rhgs.noarch

[root@gprfc076 ~]# rpm -qa | grep gluster-block
gluster-block-0.2.1-6.el7rhgs.x86_64
gluster-block-debuginfo-0.2.1-5.el7rhgs.x86_64

[root@gprfc076 ~]# rpm -qa | grep rtslib
python-rtslib-2.1.fb63-2.el7.noarch

[root@gprfc076 ~]# rpm -qa | grep configshell
python-configshell-1.1.fb23-3.el7.noarch

[root@gprfc076 ~]# uname -a
Linux gprfc076.sbu.lab.eng.bos.redhat.com 3.10.0-686.el7.test.x86_64 #1 SMP Fri Jun 23 03:08:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

Actual results:
WriteTimeoutExceptions (not errors, though; they are just latency exceptions)

Expected results:
Lower latency

Additional info:
This behavior is not noticed with non-replica gluster volumes.

Volume combinations tried:
replica 1x3 with block profile                 -> HIT
replica 1x3 with block profile except sharding -> HIT
plain distribute (3 nodes) with block profile  -> Miss (no exceptions, working as expected)

From the volume profile, Pranith helped to find:
| There is a big INODELK delay:
| I think this is the fxattrop taking the metadata lock in AFR :-/
| 82.85  11123.59 us  15.00 us  1811871.00 us  65658  FINODELK
but it needs some more investigation at the moment.

Please find the attachment containing:
1. Cassandra test output with exceptions, on the replica-volume-backed block device
2. Cassandra test output without exceptions, on the distribute-volume-backed block device
3. Replica volume profile (taken during the Cassandra test) on the block mount
4. Distribute volume profile (taken during the Cassandra test) on the block mount

Note: WriteTimeoutExceptions are not errors; they are latencies caught with respect to write acknowledgements in the Cassandra DB.
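For completeness, a sketch of how a profile like the FINODELK snippet above is gathered (the volume already has diagnostics.latency-measurement and diagnostics.count-fop-hits enabled, per the volume info):

# Start collecting per-FOP latency statistics on the volume:
gluster volume profile replica3 start

# ...run the cassandra-stress workload against the block device...

# Dump cumulative statistics; the FINODELK row carries the suspect
# metadata-lock latency:
gluster volume profile replica3 info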