Bug 1416321

Summary: read error: Input/output error, but not in split-brain state, while the file on the bricks has different sizes
Product: [Community] GlusterFS
Reporter: likunbyl
Component: replicate
Assignee: bugs <bugs>
Status: CLOSED EOL
Severity: high
Priority: unspecified
Version: 3.8
CC: bugs, hchiramm, khiremat, likunbyl, pkarampu, ravishankar
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2017-11-07 10:42:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Attachments:
  volume info and volume status/heal info, brick logs at that time
  glusterfs client mount log

Description likunbyl 2017-01-25 09:39:08 UTC
Created attachment 1244198 [details]
volume info and volume status/heal info, brick logs at that time

Description of problem:

One of our glusterfs consumers reported an error: java.io.FileNotFoundException: /mnt/glusterfs/gps/data/2017-01/02/4.txt (Input/output error).

Then I tried to access this file:
# cd /mnt/glusterfs/gps/data/2017-01/02
# more 4.txt
# cat 4.txt
cat: read error: Input/output error

Then I logged into the bricks that hold this file; the size of 4.txt differed between them:

core@ab09 /export/gluster/sdb3/vol/gps/data/2017-01/02 $ ls -l 4.txt
-rw-r--r--. 2 root root  7490 Jan  2 00:00 4.txt

core@ac08 /export/gluster/sdb3/vol/gps/data/2017-01/02 $ ls -l 4.txt 
-rw-r--r--. 2 root root 9046 Jan  2 00:00 4.txt

core@ad08 /export/gluster/sdb3/vol/gps/data/2017-01/02 $ ls -l 4.txt 
-rw-r--r--. 2 root root 9046 Jan  2 00:00 4.txt

And the getfattr output:

core@ab09 /export/gluster/sdb3/vol/gps/data/2017-01/02 #  getfattr -m . -d -e hex 4.txt
# file: 4.txt
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.gvol0-client-51=0x000000000000000000000000
trusted.afr.gvol0-client-52=0x000000000000000000000000
trusted.afr.gvol0-client-53=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005850b03d00072c12
trusted.gfid=0xa61d7b2fd53f4eb9be7a656b0915dbea
trusted.glusterfs.quota.363d3c2d-5ccb-497e-9ecc-6eaf5f435026.contri.1=0x0000000000001e000000000000000001
trusted.pgfid.363d3c2d-5ccb-497e-9ecc-6eaf5f435026=0x00000001

core@ac08 /export/gluster/sdb3/vol/gps/data/2017-01/02 #  getfattr -m . -d -e hex 4.txt
# file: 4.txt
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.gvol0-client-51=0x000000000000000000000000
trusted.afr.gvol0-client-52=0x000000000000000000000000
trusted.afr.gvol0-client-53=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005850b03d0007711e
trusted.gfid=0xa61d7b2fd53f4eb9be7a656b0915dbea
trusted.glusterfs.quota.363d3c2d-5ccb-497e-9ecc-6eaf5f435026.contri.1=0x00000000000024000000000000000001
trusted.pgfid.363d3c2d-5ccb-497e-9ecc-6eaf5f435026=0x00000001

core@ad08 /export/gluster/sdb3/vol/gps/data/2017-01/02 #  getfattr -m . -d -e hex 4.txt
# file: 4.txt
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.gvol0-client-51=0x000000000000000000000000
trusted.afr.gvol0-client-52=0x000000000000000000000000
trusted.afr.gvol0-client-53=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005861dcf900054117
trusted.gfid=0xa61d7b2fd53f4eb9be7a656b0915dbea
trusted.glusterfs.quota.363d3c2d-5ccb-497e-9ecc-6eaf5f435026.contri.1=0x00000000000024000000000000000001
trusted.pgfid.363d3c2d-5ccb-497e-9ecc-6eaf5f435026=0x00000001

gluster volume heal info didn’t report anything. 

Although a “gluster volume heal gvol0 full” command handled this and synced the file, the impact on the glusterfs consumers was real, and I suppose it will happen again in the near future.
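
(For reference, a minimal way to check whether AFR itself flags the file as split-brain, assuming the volume name gvol0 as above; illustrative commands, not output from this incident:)

# gluster volume heal gvol0 info
# gluster volume heal gvol0 info split-brain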

Setup:

OS: coreos 1185.5.0
kubernetes: v1.5.1
Image: official gluster-centos:gluster3u8_centos7
Gluster: 3.8.5

A 20-node cluster running a 132-brick distributed-replicate 44x3 volume.

Version-Release number of selected component (if applicable):
3.8.5

How reproducible:
Not always


Additional info:
Will attach the volume info, volume status/heal info, and brick logs from that time.

Comment 1 Ravishankar N 2017-01-31 13:00:05 UTC
Hi, I think you already mailed gluster-users about this, but:

1. Can you also attach the mount log from the time you got EIO?
2. Are the getfattr outputs from before you ran heal full?
3. When you ran heal full, did you get messages similar to "Completed data self-heal on 0xa61d7b2fd53f4eb9be7a656b0915dbea" etc. in the mount log or glustershd.log?

Comment 2 likunbyl 2017-02-04 08:33:50 UTC
1. The mount log from the time I got EIO is in my attached file.
2. Yes, the getfattr outputs were taken before I ran heal full.
3. At ac08, I got the following log:
[2017-01-03 02:30:55.147054] I [MSGID: 108026] [afr-self-heal-common.c:1076:afr_log_selfheal] 0-gvol0-replicate-17: Completed data selfheal on a61d7b2f-d53f-4eb9-be7a-656b0915dbea. sources=[2]  sinks=0 1 

a61d7b2f-d53f-4eb9-be7a-656b0915dbea is the same inode as the error file. There is no such log on the other two servers.

2017-01-03 02:30:55 is the time I ran the heal full command.

Comment 3 likunbyl 2017-02-04 09:22:30 UTC
Today, one of the glusterfs servers rebooted.

from glusterfsd.log: 
06:20:32, gluster selfheal began.
07:18:21, gluster selfheal finished.
08:15,    my java application report EIO
08:19:08, start gluster volume heal gvol0 full
08:25:18, heal full finished.

Between 08:19:08 and 08:25:18, I got 46 "Completed data self-heal" logs. That means there were 46 files that had not been healed by the self-heal daemon, right?
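
(For reference, a simple way to count such messages in a given log file, illustrative only:)

# grep -c "Completed data selfheal" /var/log/glusterfs/glustershd.log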

Comment 4 Ravishankar N 2017-02-06 09:29:15 UTC
(In reply to likunbyl from comment #2)
> 1. There is mount log at that time I got EIO in my attached file.
I only see brick logs in https://bugzilla.redhat.com/attachment.cgi?id=1244198 , no mount (client) logs. Can you provide the mount log? I'm trying to see which translator returned EIO. The brick logs you provided also do not contain EIO related messages.

Some other comments:
The afr extended attributes you provided are all zeroes, meaning there was no pending heal. If you still got a "Completed data selfheal" message, some FOP must have failed after you obtained the getfattr output, causing non-zero afr xattrs and thus a heal.
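
(For context, a sketch of how to read those trusted.afr values, assuming the usual layout of three big-endian 32-bit counters for pending data, metadata and entry operations; the non-zero value below is hypothetical:)

val=000000020000000000000000   # hypothetical trusted.afr.* value, without the 0x prefix
echo "data=$((16#${val:0:8})) metadata=$((16#${val:8:8})) entry=$((16#${val:16:8}))"
# prints: data=2 metadata=0 entry=0, i.e. two pending data operations against that brick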

Comment 5 likunbyl 2017-02-07 07:15:34 UTC
(In reply to Ravishankar N from comment #4)
> Can you provide the mount log? 
> 

Where is the mount log located? I am running kubernetes to manage all my apps, and mount the glusterfs volume into the pod with the following configuration:

 {
     name: "glusterfs",
     glusterfs: {
         endpoints: "glusterfs-cluster",
         path: "gvol0",
     },
 },

Comment 6 Ravishankar N 2017-02-07 08:41:31 UTC
It is normally at /var/log/glusterfs/ on the machine where you have mounted the volume.

Humble, would you know if the log location is any different for container setups?
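
(One way to search for the client log on the node running the pod, illustrative only:)

# find /var/log/glusterfs /var/lib/kubelet /var/lib/origin -name '*glusterfs*.log' 2>/dev/null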

Comment 7 likunbyl 2017-02-09 03:11:18 UTC
There are no glusterfs mount logs in the container or on the servers that run the pod. Is this problem related to the client side? Before upgrading to glusterfs 3.8.5, I ran glusterfs 3.7 for several months without the EIO error at all. The differences between the two stages are as follows:

1. Different glusterfs versions.
2. Before, we mounted the volume through FUSE; later, we mount it through k8s.
3. Before, the servers were stable; later, several servers rebooted abnormally.

Comment 8 Pranith Kumar K 2017-02-11 06:58:23 UTC
> 
> core@ab09 /export/gluster/sdb3/vol/gps/data/2017-01/02 $ ls -l 4.txt
> -rw-r--r--. 2 root root  7490 Jan  2 00:00 4.txt
> 
> core@ac08 /export/gluster/sdb3/vol/gps/data/2017-01/02 $ ls -l 4.txt 
> -rw-r--r--. 2 root root 9046 Jan  2 00:00 4.txt
> 
> core@ad08 /export/gluster/sdb3/vol/gps/data/2017-01/02 $ ls -l 4.txt 
> -rw-r--r--. 2 root root 9046 Jan  2 00:00 4.txt
> 
> And the getfattr output:
> 
> core@ab09 /export/gluster/sdb3/vol/gps/data/2017-01/02 #  getfattr -m . -d
> -e hex 4.txt
> # file: 4.txt
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.afr.gvol0-client-51=0x000000000000000000000000
> trusted.afr.gvol0-client-52=0x000000000000000000000000
> trusted.afr.gvol0-client-53=0x000000000000000000000000
> trusted.bit-rot.version=0x02000000000000005850b03d00072c12
> trusted.gfid=0xa61d7b2fd53f4eb9be7a656b0915dbea
> trusted.glusterfs.quota.363d3c2d-5ccb-497e-9ecc-6eaf5f435026.contri.
> 1=0x0000000000001e000000000000000001
> trusted.pgfid.363d3c2d-5ccb-497e-9ecc-6eaf5f435026=0x00000001
> 
> core@ac08 /export/gluster/sdb3/vol/gps/data/2017-01/02 #  getfattr -m . -d
> -e hex 4.txt
> # file: 4.txt
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.afr.gvol0-client-51=0x000000000000000000000000
> trusted.afr.gvol0-client-52=0x000000000000000000000000
> trusted.afr.gvol0-client-53=0x000000000000000000000000
> trusted.bit-rot.version=0x02000000000000005850b03d0007711e
> trusted.gfid=0xa61d7b2fd53f4eb9be7a656b0915dbea
> trusted.glusterfs.quota.363d3c2d-5ccb-497e-9ecc-6eaf5f435026.contri.
> 1=0x00000000000024000000000000000001
> trusted.pgfid.363d3c2d-5ccb-497e-9ecc-6eaf5f435026=0x00000001
> 
> core@ad08 /export/gluster/sdb3/vol/gps/data/2017-01/02 #  getfattr -m . -d
> -e hex 4.txt
> # file: 4.txt
> security.
> selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.afr.gvol0-client-51=0x000000000000000000000000
> trusted.afr.gvol0-client-52=0x000000000000000000000000
> trusted.afr.gvol0-client-53=0x000000000000000000000000
> trusted.bit-rot.version=0x02000000000000005861dcf900054117
> trusted.gfid=0xa61d7b2fd53f4eb9be7a656b0915dbea
> trusted.glusterfs.quota.363d3c2d-5ccb-497e-9ecc-6eaf5f435026.contri.
> 1=0x00000000000024000000000000000001
> trusted.pgfid.363d3c2d-5ccb-497e-9ecc-6eaf5f435026=0x00000001
> 

I see that the bit-rot versions are different on all three bricks. Did anyone get a chance to find out why? Ideally they should be the same for the same content, right?

Kotresh,
   Do you suspect there to be any issue?

Comment 9 likunbyl 2017-02-20 02:11:39 UTC
So, since I run glusterfs under kubernetes, no client mount log can be provided. Is there anything else I can do?

Comment 10 Humble Chirammal 2017-02-20 09:51:44 UTC
(In reply to likunbyl from comment #9)
> So, since I ran glusterfs in kubernetes, no client mount log can be
> provided, is there anything else I can do?

Is this a kubernetes setup or an openshift setup?

For example, in Openshift setup gluster logs are kept under below path:

/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/gluster-pv-local/glusterfs.log

Where

"/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/" is common for all the gluster logs. This common path is appended with  'volume name' and 'pod name' as shown below.

--snip--

p := path.Join(b.glusterfs.plugin.host.GetPluginDir(glusterfsPluginName), b.glusterfs.volName)
	....
log := path.Join(p, b.pod.Name+"-glusterfs.log")

--/snip--
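
(Based on the snippet above, on a plain kubelet setup the analogous location would be something like the hypothetical path below; the plugin dir, volume name and pod name depend on your environment:)

# ls /var/lib/kubelet/plugins/kubernetes.io/glusterfs/<volume-name>/<pod-name>-glusterfs.log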

Comment 11 Kotresh HR 2017-02-20 10:17:05 UTC
> 
> Kotresh,
>    Do you suspect there to be any issue?

Bit-rot versions can be different on different replica bricks. This can happen if one of the bricks goes down and I/O happens: the bit-rot version increases on the brick which is up, and if multiple writes happen, the version keeps incrementing on it. When the brick that was down comes back up, AFR heals it in one shot, incrementing the version only once. So even though the data is consistent, the versions can differ. The signature, however, should be consistent across replicas.
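
(If the versions alone are suspect, the on-disk signature can be compared across the three bricks; an illustrative check, assuming the bitrot signature xattr is present on the brick backend:)

# getfattr -n trusted.bit-rot.signature -e hex /export/gluster/sdb3/vol/gps/data/2017-01/02/4.txt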

Comment 12 likunbyl 2017-02-21 03:48:57 UTC
(In reply to Humble Chirammal from comment #10)
>
> Is this a kubernetes setup or openshift setup ?

Thank you very much. I have found the glusterfs client mount log; it is located under the /var/lib/kubelet/plugins/kubernetes.io/glusterfs/glusterfs/ directory.

I have excerpted the logs about the EIO at that time into an attached file.

Several servers access these files at almost the same time.

Comment 13 likunbyl 2017-02-21 03:56:50 UTC
Created attachment 1255924 [details]
glusterfs client mount log

Comment 14 Ravishankar N 2017-02-21 06:41:49 UTC
What version of gluster are you using for the clients?

[2017-01-01 16:00:02.990799] E [afr-self-heal-common.c:233:afr_sh_print_split_brain_log] 0-gvol0-replicate-17: Unable to self-heal contents of '/gps/up_data/2017-01/02/4.txt' (possible split-brain). Please delete the file from all but the preferred subvolume.- Pending matrix:  [ [ 0 0 0 ] [ 0 0 0 ] [ 0 0 0 ] ]
[2017-01-01 16:00:02.991044] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gvol0-replicate-17:  backgroung data self heal  failed,   on /gps/up_data/2017-01/02/4.txt
[2017-01-01 16:00:02.991239] W [afr-open.c:209:afr_open] 0-gvol0-replicate-17: failed to open as split brain seen, returning EIO
[2017-01-01 16:00:02.991265] W [fuse-bridge.c:2209:fuse_writev_cbk] 0-glusterfs-fuse: 340493338: WRITE => -1 (Input/output error)

From the client logs, AFR seems to have returned the EIO despite a pending matrix of all zeroes. What is noteworthy is that the messages correspond to an older version of afr (i.e. afr-v1 in glusterfs-3.5 or older).

1. What version of gluster are you using for the clients?
2. It is recommended to update all the clients also to the same version (3.8.x) as the servers.
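
(A quick, illustrative way to confirm the versions on both sides:)

# glusterfs --version    # on the client host or inside the container doing the mount
# gluster --version      # on the brick servers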

Comment 15 likunbyl 2017-02-21 09:09:34 UTC
(In reply to Ravishankar N from comment #14)
> What version of gluster are you using for the clients?
> 
> 1. What version of gluster are you using for the clients?
> 2. It is recommended to update all the clients also to the same version
> (3.8.x) as the servers.

The client's glusterfs version is 3.5.2.

We are now using coreos kubernetes 1.5.1 to manage our containers, and the glusterfs client software is packaged into hyperkube, so the glusterfs client we can use depends on the update status of hyperkube.

So, is the bug related to the glusterfs client version? Could keeping the client version the same as the server's avoid this problem?

Comment 16 Ravishankar N 2017-02-21 09:59:33 UTC
(In reply to likunbyl from comment #15)
 
> so, is the bug related to the glusterfs client version ? could keeping the
> client version same to the server avoid this problem ?

It could likely be, since the EIO is originating from AFR. gluster-3.5 is end of life[1] now (as are 3.6 and, soon, 3.7 too, once 3.10 is out), and even if it were a bug in it, there won't be any fix available.

I would urge you to have all gluster bits at 3.8 and see if the issue is still hit.

[1] https://www.gluster.org/community/release-schedule/

Comment 17 likunbyl 2017-02-21 13:57:01 UTC
(In reply to Ravishankar N from comment #16)
> (In reply to likunbyl from comment #15)
> It could likely be, since the EIO is originating from AFR. gluster-3.5 is
> end of life[1] now, (as is 3.6 and soon 3.7 too, once 3.10 is out) and even
> it if were a bug in it, there won't be any fix available.
> 
> I would urge you to have all gluster bits at 3.8 and see if the issue it hit.
> 

So, back to the original problem: what was the root cause of the inconsistency of file 4.txt between the three replicas, client side or server side?

Comment 18 Ravishankar N 2017-02-21 15:19:16 UTC
(In reply to likunbyl from comment #17)
> (In reply to Ravishankar N from comment #16)
> > (In reply to likunbyl from comment #15)
> > It could likely be, since the EIO is originating from AFR. gluster-3.5 is
> > end of life[1] now, (as is 3.6 and soon 3.7 too, once 3.10 is out) and even
> > it if were a bug in it, there won't be any fix available.
> > 
> > I would urge you to have all gluster bits at 3.8 and see if the issue it hit.
> > 
> 
> So, back to the original problem, what was the root cause of the
> inconsistence of file 4.txt between three replicas, client side or server
> side ?

I thought I already answered that. AFR==client side.

Comment 19 likunbyl 2017-02-22 03:20:22 UTC
(In reply to Ravishankar N from comment #18)
> 
> I thought I already answered that. AFR==client side.

Do I need to close this case?

Comment 20 Pranith Kumar K 2017-02-22 03:31:48 UTC
(In reply to likunbyl from comment #17)
> (In reply to Ravishankar N from comment #16)
> > (In reply to likunbyl from comment #15)
> > It could likely be, since the EIO is originating from AFR. gluster-3.5 is
> > end of life[1] now, (as is 3.6 and soon 3.7 too, once 3.10 is out) and even
> > it if were a bug in it, there won't be any fix available.
> > 
> > I would urge you to have all gluster bits at 3.8 and see if the issue it hit.
> > 
> 
> So, back to the original problem, what was the root cause of the
> inconsistence of file 4.txt between three replicas, client side or server
> side ?

Ravi,
      I am also curious to know what led to the inconsistency to begin with. Do we know what might have happened that led to different file sizes but no xattrs to indicate it?

Pranith

Comment 21 Pranith Kumar K 2017-02-22 03:43:30 UTC
(In reply to likunbyl from comment #19)
> (In reply to Ravishankar N from comment #18)
> > 
> > I thought I already answered that. AFR==client side.
> 
> Do I need to close this case ?

I think we don't yet have any theory as to why this might have happened. So while upgrading to a supported version is recommended, it doesn't give us any confidence about whether the problem seen in the existing setup is already solved in the newer version.

We used to see this kind of problem when one of the disks crashed: the sizes would mismatch, but there wouldn't be any extended attributes to indicate that this had happened.

Comment 22 Pranith Kumar K 2017-02-22 03:50:41 UTC
(In reply to Pranith Kumar K from comment #21)
> (In reply to likunbyl from comment #19)
> > (In reply to Ravishankar N from comment #18)
> > > 
> > > I thought I already answered that. AFR==client side.
> > 
> > Do I need to close this case ?
> 
> I think we don't yet have any theory as to why this might have happened, so
> while upgrading to a supported version is recommended, it doesn't give us
> any confidence about if the problem we experienced in the existing setup is
> already solved or not in the newer version.
> 
> We used to see this kind of problem when any of the disks crash where the
> sizes would mismatch but there wouldn't be any extended attributes to tell
> that this happened.

Just to clarify: if a file ends up in the state you mentioned in the bz, it will be automatically healed in versions > 3.6.x, so if you upgrade to 3.8.x you won't get EIO when a file ends up in this state. So an upgrade is recommended.

What I am curious to know is how the file ended up in this state. We used to see this kind of issue only when there was a disk/filesystem crash or an ungraceful shutdown of the machine.

Did any of the above happen on the brick servers where the first brick is present?

Comment 23 likunbyl 2017-02-22 06:46:42 UTC
(In reply to Pranith Kumar K from comment #22)
> 
> Did any of the above happen on the brick servers where the first brick is
> present?

Yes, one of the servers, ad08, rebooted abnormally on Dec 27, 2016.

Comment 24 Pranith Kumar K 2017-02-22 07:27:48 UTC
(In reply to likunbyl from comment #23)
> (In reply to Pranith Kumar K from comment #22)
> > 
> > Did any of the above happen on the brick servers where the first brick is
> > present?
> 
> yes, one of the servers, ad08, rebooted at Dec 27, 2016, abnormally.

Okay, that probably explains the bit-rot version mismatch even when the file sizes were the same. It still doesn't explain the smaller size on ab09. Did anything happen on that machine?

Comment 25 likunbyl 2017-02-22 09:05:43 UTC
(In reply to Pranith Kumar K from comment #24)
> It still doesn't explain the reason for less size in ab09.
> Anything happened on that machine?

Nope.

Comment 26 Niels de Vos 2017-11-07 10:42:53 UTC
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.