Bug 1482376

Summary: IO errors on gluster-block device
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manoj Pillai <mpillai>
Component: posix
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: Sweta Anandpara <sanandpa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, andcosta, dwojslaw, mpillai, pkarampu, prasanna.kalever, psuriset, rcyriac, rhinduja, rhs-bugs, sanandpa, sankarshan, shberry, sheggodu, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:35:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1472757, 1583464
Bug Blocks: 1503134

Description Manoj Pillai 2017-08-17 06:18:11 UTC
Description of problem:

A fio sequential write test on a gluster-block device produces I/O error messages in /var/log/messages.

Version-Release number of selected component (if applicable):

On server:
glusterfs-3.8.4-41.el7rhgs.x86_64
glusterfs-api-3.8.4-41.el7rhgs.x86_64
glusterfs-cli-3.8.4-41.el7rhgs.x86_64
glusterfs-server-3.8.4-41.el7rhgs.x86_64
glusterfs-libs-3.8.4-41.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-41.el7rhgs.x86_64
glusterfs-fuse-3.8.4-41.el7rhgs.x86_64

gluster-block-0.2.1-6.el7rhgs.x86_64

libtcmu-1.2.0-11.el7rhgs.x86_64
tcmu-runner-1.2.0-11.el7rhgs.x86_64

On client:
iscsi-initiator-utils-6.2.0.874-4.el7.x86_64


Steps to Reproduce:

1. Created a gluster-block device /dev/sdb with xfs FS:
<quote>
# lsscsi
[1:0:0:0]    disk    LIO-ORG  TCMU device      0002  /dev/sdb

# lsblk
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                      8:0    0 465.8G  0 disk 
├─sda1                   8:1    0     1G  0 part /boot
└─sda2                   8:2    0 464.8G  0 part 
  ├─rhel_gprfc013-root 253:0    0    50G  0 lvm  /
  ├─rhel_gprfc013-swap 253:1    0  23.6G  0 lvm  [SWAP]
  └─rhel_gprfc013-home 253:2    0 391.1G  0 lvm  /home
sdb                      8:16   0   300G  0 disk /mnt/glustervol
</quote>

2. Ran fio sequential write test:
<quote>
# cat job.fio.write 
[global]
rw=write
create_on_open=1
fsync_on_close=1
size=10g
bs=1024k
openfiles=1
startdelay=0
ioengine=sync
nrfiles=1

[lgf-write]
directory=/mnt/glustervol/${HOSTNAME}
filename_format=f.$jobnum.$filenum
numjobs=8
</quote>

# ./fio --output=out.fio.write job.fio.write

Actual results:

Errors in /var/log/messages:

<quote>
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] Sense Key : Not Ready [current]
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] Add. Sense: Logical unit communication failure
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] CDB: Write(10) 2a 00 00 66 03 00 00 00 80 00
Aug 17 01:50:13 localhost kernel: blk_update_request: I/O error, dev sdb, sector 6685440
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834528, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834529, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834530, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834531, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834532, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834533, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834534, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834535, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834536, lost async page write
Aug 17 01:50:13 localhost kernel: Buffer I/O error on dev sdb, logical block 834537, lost async page write
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] Sense Key : Not Ready [current]
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] Add. Sense: Logical unit communication failure
Aug 17 01:50:13 localhost kernel: sd 1:0:0:0: [sdb] CDB: Write(10) 2a 00 01 0a 04 80 00 00 80 00
Aug 17 01:50:13 localhost kernel: blk_update_request: I/O error, dev sdb, sector 17433728
Aug 17 01:50:14 localhost kernel: sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 17 01:50:14 localhost kernel: sd 1:0:0:0: [sdb] Sense Key : Not Ready [current]
Aug 17 01:50:14 localhost kernel: sd 1:0:0:0: [sdb] Add. Sense: Logical unit communication failure
Aug 17 01:50:14 localhost kernel: sd 1:0:0:0: [sdb] CDB: Write(10) 2a 00 00 4a 03 80 00 00 80 00
Aug 17 01:50:14 localhost kernel: blk_update_request: I/O error, dev sdb, sector 4850560
Aug 17 01:50:15 localhost kernel: sd 1:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 17 01:50:15 localhost kernel: sd 1:0:0:0: [sdb] Sense Key : Not Ready [current]
Aug 17 01:50:15 localhost kernel: sd 1:0:0:0: [sdb] Add. Sense: Logical unit communication failure
Aug 17 01:50:15 localhost kernel: sd 1:0:0:0: [sdb] CDB: Write(10) 2a 00 01 0c 02 80 00 00 80 00
Aug 17 01:50:15 localhost kernel: blk_update_request: I/O error, dev sdb, sector 17564288
...
</quote>
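For anyone cross-checking the log above: the CDB bytes printed for each failed Write(10) can be decoded to confirm they refer to the same locations as the blk_update_request lines. The following is a minimal sketch, not part of the original report. It assumes the standard WRITE(10) layout (opcode 0x2a, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8); the function name decode_write10 is made up for illustration.

```python
from typing import Tuple


def decode_write10(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, transfer_length_in_blocks) for a WRITE(10) CDB.

    cdb_hex is the space-separated hex string as printed by the kernel,
    e.g. "2a 00 00 66 03 00 00 00 80 00".
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


# The four CDBs quoted in the log excerpt above.
LOGGED_CDBS = [
    "2a 00 00 66 03 00 00 00 80 00",
    "2a 00 01 0a 04 80 00 00 80 00",
    "2a 00 00 4a 03 80 00 00 80 00",
    "2a 00 01 0c 02 80 00 00 80 00",
]

for cdb in LOGGED_CDBS:
    lba, blocks = decode_write10(cdb)
    print(f"{cdb} -> LBA {lba}, {blocks} blocks")
```

The decoded LBAs (6685440, 17433728, 4850560, 17564288) equal the sector numbers in the corresponding blk_update_request lines, and each request covers 128 blocks. The LBA matching the 512-byte sector number is consistent with a 512-byte logical block size on /dev/sdb, but that is an inference from the log and was not measured here.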

Expected results:

No I/O errors

Additional info:

gluster v info perfvol:

Volume Name: perfvol
Type: Distribute
Volume ID: 0a1381e8-3cae-4f6a-8151-4171a28edd56
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: smerf04-10ge:/mnt/rhs_brick1
Options Reconfigured:
server.allow-insecure: on
user.cifs: off
features.shard-block-size: 64MB
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.quorum-type: auto
cluster.eager-lock: disable
network.remote-dio: disable
performance.readdir-ahead: off
performance.open-behind: off
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on

Comment 9 Manoj Pillai 2017-08-23 12:15:10 UTC
Forgot to put in the OS version. On both client and server:

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

kernel rpm:
kernel-3.10.0-693.el7.x86_64

Comment 14 Pranith Kumar K 2017-08-29 11:27:43 UTC
Manoj and I looked at the setup today and found this issue to be the same race fixed by https://review.gluster.org/17821. It is seen only with plain distribute volumes, not with replication.

Manoj is doing some more tests and will update again with more information. I am clearing the needinfo on us for now.

Comment 16 Manoj Pillai 2017-08-30 18:39:10 UTC
Switched to 1x3 volume configuration. No I/O errors seen after that.
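
Since the errors were tied to the volume type, it may help to check the type of a volume before running a similar test. The following is a rough sketch that parses saved `gluster v info` text in the format shown in the description and reports whether the volume is plain distribute. The helper names parse_volume_info and is_plain_distribute are hypothetical, and the sample is an excerpt of the perfvol output above. It does not call the gluster CLI.

```python
from typing import Dict

# Excerpt of the `gluster v info perfvol` output from the description.
SAMPLE = """\
Volume Name: perfvol
Type: Distribute
Status: Started
Number of Bricks: 1
Bricks:
Brick1: smerf04-10ge:/mnt/rhs_brick1
"""


def parse_volume_info(text: str) -> Dict[str, str]:
    """Collect "Key: value" lines into a dict, skipping value-less headings."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so brick paths keep their own colon.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def is_plain_distribute(info: Dict[str, str]) -> bool:
    """True when the Type field is exactly "Distribute", as in this report."""
    return info.get("Type") == "Distribute"


info = parse_volume_info(SAMPLE)
print(info["Volume Name"], "plain distribute:", is_plain_distribute(info))
```

The check compares only against the exact string "Distribute" seen in this report. How other volume types are spelled in the Type field is not shown here, so treat any extension to them as an assumption to verify against real output.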

Comment 29 errata-xmlrpc 2018-09-04 06:35:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607