Description of problem:
------------------------
Installed RHEL6 on the VM. Added a 100G disk and formatted it with ext4. Ran IO (dd, fio, Linux kernel untar and directory creations). I see input/output errors during IO. Listing the contents shows the following:

-?????????? ? ? ? ? ? 189dhcp46-189-512k-vdb-write-seq.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-512k-vdb-write-seq.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randread-para.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randread-para.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randread-seq.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randread-seq.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randwrite-para.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randwrite-para.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randwrite-seq.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-randwrite-seq.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-read-para.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-read-para.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-read-seq.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-read-seq.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-write-para.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-write-para.results_iops.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-write-seq.results_bw.log
-?????????? ? ? ? ? ? 189dhcp46-189-64k-vdb-write-seq.results_iops.log
drwxr-xr-x. 1 root root 30154752 May 11 18:22 dirs
drwxr-xr-x. 2 root root 11284480 May 10 14:58 files
drwxr-xr-x. 4 root root     4096 May 11 16:09 iso
drwxr-xr-x. 6 root root     4096 May 11 16:26 li?x
d?????????? ? ? ? ? ? lost+found

There was data in the iso directory but I don't see that either, although I have not deleted anything from it.

[root@rhsqa13 .shard]# gluster v status
Status of volume: data
Gluster process                                           TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp43-201.lab.eng.blr.redhat.com:/rhgs/data/data   49154     0          Y       3670
Brick dhcp43-219.lab.eng.blr.redhat.com:/rhgs/data/data   49154     0          Y       6335
Brick dhcp43-220.lab.eng.blr.redhat.com:/rhgs/data/data   49154     0          Y       3295
NFS Server on localhost                                   2049      0          Y       7007
Self-heal Daemon on localhost                             N/A       N/A        Y       7023
NFS Server on dhcp43-219.lab.eng.blr.redhat.com           2049      0          Y       8702
Self-heal Daemon on dhcp43-219.lab.eng.blr.redhat.com     N/A       N/A        Y       8710
NFS Server on dhcp43-220.lab.eng.blr.redhat.com           2049      0          Y       11181
Self-heal Daemon on dhcp43-220.lab.eng.blr.redhat.com     N/A       N/A        Y       11189

Task Status of Volume data
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: engine_vol
Gluster process                                           TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp43-201.lab.eng.blr.redhat.com:/rhgs/engine/ev   49152     0          Y       17634
Brick dhcp43-219.lab.eng.blr.redhat.com:/rhgs/engine/ev   49152     0          Y       8682
Brick dhcp43-220.lab.eng.blr.redhat.com:/rhgs/engine/ev   49152     0          Y       17048
NFS Server on localhost                                   2049      0          Y       7007
Self-heal Daemon on localhost                             N/A       N/A        Y       7023
NFS Server on dhcp43-219.lab.eng.blr.redhat.com           2049      0          Y       8702
Self-heal Daemon on dhcp43-219.lab.eng.blr.redhat.com     N/A       N/A        Y       8710
NFS Server on dhcp43-220.lab.eng.blr.redhat.com           2049      0          Y       11181
Self-heal Daemon on dhcp43-220.lab.eng.blr.redhat.com     N/A       N/A        Y       11189

Task Status of Volume engine_vol
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vmstore
Gluster process                                           TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp43-201.lab.eng.blr.redhat.com:/rhgs/vmstore/vms 49153     0          Y       69218
Brick dhcp43-219.lab.eng.blr.redhat.com:/rhgs/vmstore/vms 49153     0          Y       18353
Brick dhcp43-220.lab.eng.blr.redhat.com:/rhgs/vmstore/vms 49153     0          Y       26617
NFS Server on localhost                                   2049      0          Y       7007
Self-heal Daemon on localhost                             N/A       N/A        Y       7023
NFS Server on dhcp43-219.lab.eng.blr.redhat.com           2049      0          Y       8702
Self-heal Daemon on dhcp43-219.lab.eng.blr.redhat.com     N/A       N/A        Y       8710
NFS Server on dhcp43-220.lab.eng.blr.redhat.com           2049      0          Y       11181
Self-heal Daemon on dhcp43-220.lab.eng.blr.redhat.com     N/A       N/A        Y       11189

Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhsqa13 .shard]# gluster v info

Volume Name: data
Type: Replicate
Volume ID: 12b4d188-85c7-429b-81ba-59b641efd15e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: dhcp43-201.lab.eng.blr.redhat.com:/rhgs/data/data
Brick2: dhcp43-219.lab.eng.blr.redhat.com:/rhgs/data/data
Brick3: dhcp43-220.lab.eng.blr.redhat.com:/rhgs/data/data
Options Reconfigured:
diagnostics.client-log-level: DEBUG
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
cluster.shd-max-threads: 4

Volume Name: engine_vol
Type: Replicate
Volume ID: b98cda8e-19a0-4372-9518-361d9e2d8315
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: dhcp43-201.lab.eng.blr.redhat.com:/rhgs/engine/ev
Brick2: dhcp43-219.lab.eng.blr.redhat.com:/rhgs/engine/ev
Brick3: dhcp43-220.lab.eng.blr.redhat.com:/rhgs/engine/ev
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.shd-max-threads: 4

Volume Name: vmstore
Type: Replicate
Volume ID: 27f1afb4-6fe8-4cf1-9d29-8deefbcdb43f
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: dhcp43-201.lab.eng.blr.redhat.com:/rhgs/vmstore/vms
Brick2: dhcp43-219.lab.eng.blr.redhat.com:/rhgs/vmstore/vms
Brick3: dhcp43-220.lab.eng.blr.redhat.com:/rhgs/vmstore/vms
Options Reconfigured:
diagnostics.client-log-level: DEBUG
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
cluster.shd-max-threads: 4
[root@rhsqa13 .shard]#

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.7.9-3

How reproducible:
-----------------
100%

Steps to Reproduce:
-------------------
As in the description (an illustrative command sketch is given at the end of this report).

Actual results:

Expected results:

Additional info:
----------------
sosreports will be attached.
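Illustrative sketch of the workload (referenced from "Steps to Reproduce" above). These are not the exact commands that were run; the device name /dev/vdb, the mount point /mnt/test and the fio parameters are assumptions inferred from the description and the result-file names:

    mkfs.ext4 /dev/vdb                   # 100G disk attached to the VM from the gluster-backed storage
    mount /dev/vdb /mnt/test

    # sequential dd writes
    dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=10240 oflag=direct

    # random/sequential fio runs with 64k and 512k block sizes
    fio --name=vdb-randwrite --filename=/mnt/test/fiofile --size=10G \
        --bs=64k --rw=randwrite --ioengine=libaio --direct=1 --iodepth=16

    # kernel untar and directory creation to exercise metadata operations
    tar -xf linux.tar.xz -C /mnt/test
    mkdir -p /mnt/test/dirs/dir{1..1000}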
Input/output error would most likely be due to a split-brain. Did you check for that? If the issue is split-brain, then the bug is not in sharding. -Krutika
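(For reference, split-brain can be checked per volume with the heal-info commands below; 'vmstore' is just one of the three volumes from the status output above, and the same check applies to 'data' and 'engine_vol':)

    # list files currently in split-brain on the volume
    gluster volume heal vmstore info split-brain

    # broader view of entries pending heal
    gluster volume heal vmstore info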
My bad for delaying this. I have not seen any split-brain during this corruption.
Could you please attach the sosreports?
That VM is no longer available due to setup issues. I will have to re-create it and will provide the sosreports once that is done.
Created attachment 1160452 [details] sosreport
Hi Bhaskar, I checked the attachment. There are no directories '/var/log/glusterfs' or '/var/lib/glusterd' in the sosreport. Could you attach the correct sosreports? -Krutika
All the sosreports are copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1335156/
I checked your setup. For the disks with uuid badff993-fe76-4147-97b7-dfd53293907f and 7ff05ee6-b76d-46cd-a98a-17569e2bd318, there are corresponding directories under images/ in the volume 'vmstore', each of which contains another uuid file plus a .lease and a .meta file for that uuid file. For 7ff05ee6-b76d-46cd-a98a-17569e2bd318, it turns out there were two such files, with two .meta and two .lease files. Is this normal? Also, one of those files - namely 8bd7bc1d-8ed3-49e4-92ff-72cd009cd6a6 - was empty and all its shards were also 0 bytes. In essence, the file contains no data at all. Not sure if this is normal either. Still needs investigation. -Krutika
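(For reference, a rough sketch of how the base file and its shards can be located and size-checked; the fuse-mount path under images/ is illustrative, while the brick path and the .shard layout follow from the volume configuration above:)

    # On a fuse mount of 'vmstore': read the gfid of the base image file
    # (glusterfs.gfid.string is a virtual xattr exposed by the gluster client; the path shown is abbreviated)
    getfattr -n glusterfs.gfid.string <mountpoint>/<domain>/images/7ff05ee6-b76d-46cd-a98a-17569e2bd318/8bd7bc1d-8ed3-49e4-92ff-72cd009cd6a6

    # On any brick: blocks beyond the first 512MB (features.shard-block-size) are stored as
    # <gfid>.1, <gfid>.2, ... under the hidden .shard directory at the brick root
    ls -lh /rhgs/vmstore/vms/.shard/<gfid>.*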
One more thing: I found the files associated with the uuids of the corrupt data disks given by Bhaskarakiran on the volume 'vmstore' as opposed to 'data'. Is this okay, or is that incorrect?
The primary disk, i.e. the OS disk, is on 'data', while the IO is run on the disk attached from 'vmstore'. The corruption is seen on the disk that comes from 'vmstore'.
https://code.engineering.redhat.com/gerrit/74760, which fixes bug 1330044, is the patch that fixes the I/O errors seen even when there were no files in split-brain. -Pranith
Tested on the 3.7.9-6 build and didn't see the corruption. Marking this as fixed for now. Will re-open if it's seen again.