On an EC volume (4+2) which had md-cache enabled, I created about 1 lakh (100,000) 20-byte files and then brought down one brick by killing the brick process.
A lookup after that (using ls -l) seemed to display all files without issue.
I then brought down another brick by forcefully unmounting the brick (the XFS LV) and issued an ls on the client. The client displayed the files as below.

For some files:

[root@dhcp35-180 dir1]# time ls -lt
ls: cannot access c.4047: Input/output error
ls: cannot access c.4048: Input/output error
ls: cannot access c.4049: Input/output error
ls: cannot access c.4050: Input/output error
ls: cannot access c.4051: Input/output error
ls: cannot access c.4052: Input/output error
ls: cannot access c.4053: Input/output error
ls: cannot access c.4054: Input/output error
ls: cannot access c.4055: Input/output error
ls: cannot access c.4056: Input/output error
ls: cannot access c.4057: Input/output error
ls: cannot access c.4058: Input/output error
ls: cannot access c.4059: Input/output error
ls: cannot access c.4060: Input/output error
ls: cannot access c.4061: Input/output error
ls: cannot access c.4062: Input/output error
ls: cannot access c.4063: Input/output error

For some files:

-?????????? ? ? ? ? ? c.4248
-?????????? ? ? ? ? ? c.4249
-?????????? ? ? ? ? ? c.4250
-?????????? ? ? ? ? ? c.4251
-?????????? ? ? ? ? ? c.4252
-?????????? ? ? ? ? ? c.4253
-?????????? ? ? ? ? ? c.4254

For some files:

-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2571
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2570
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2569
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2568
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2567
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2566
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2565
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2564
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2563
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2562
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2561
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2560
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2559
-rw-r--r--. 1 root root 20 Oct 19 16:14 a.2558

In total there were 1,10,000 (110,000) files, but the ls displayed only about 25,000 files (the complete output is attached).
However, on a second ls, the files were displayed correctly.

I did see this once even on an AFR volume with md-cache enabled (with just one brick of a replica pair down, brought down by killing the brick process).

Status of volume: ecvol
Gluster process                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.86:/rhs/brick2/ecvol     N/A       N/A        N       N/A
Brick 10.70.35.9:/rhs/brick2/ecvol      49155     0          Y       18173
Brick 10.70.35.153:/rhs/brick2/ecvol    49154     0          Y       15230
Brick 10.70.35.79:/rhs/brick2/ecvol     49154     0          Y       13959
Brick 10.70.35.86:/rhs/brick3/ecvol     49154     0          Y       17610   ===> above brick was forcefully unmounted using umount -l
Brick 10.70.35.9:/rhs/brick3/ecvol      49156     0          Y       18193
Self-heal Daemon on localhost           N/A       N/A        Y       17631
Self-heal Daemon on 10.70.35.153        N/A       N/A        Y       15856
Self-heal Daemon on 10.70.35.79         N/A       N/A        Y       14590
Self-heal Daemon on 10.70.35.9          N/A       N/A        Y       18858

Task Status of Volume ecvol
------------------------------------------------------------------------------
There are no active volume tasks

Volume Name: ecvol
Type: Disperse
Volume ID: 809177ca-258a-4262-9ec5-7744ea4f7564
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.86:/rhs/brick2/ecvol
Brick2: 10.70.35.9:/rhs/brick2/ecvol
Brick3: 10.70.35.153:/rhs/brick2/ecvol
Brick4: 10.70.35.79:/rhs/brick2/ecvol
Brick5: 10.70.35.86:/rhs/brick3/ecvol
Brick6: 10.70.35.9:/rhs/brick3/ecvol
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 60
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 60
[root@dhcp35-86 ~]#

BUILD: It was taken from what was mentioned in http://etherpad.corp.redhat.com/md-cache-3-2

nfs-ganesha-gluster-2.4.0-2.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-api-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-events-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-libs-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-cli-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-server-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-rdma-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
python-gluster-3.8.4-2.26.git0a405a4.el7rhgs.noarch
glusterfs-fuse-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
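For reference, a minimal bash sketch of the reproduction steps described above. The mount point (/mnt/ecvol), test directory, single file-name prefix and file-creation loop are assumptions for illustration only; the volume name, brick path and brick states are taken from this report.

# on the client (FUSE mount of ecvol, md-cache options enabled as above)
mkdir -p /mnt/ecvol/dir1 && cd /mnt/ecvol/dir1
for i in $(seq 1 110000); do printf '%020d' "$i" > c.$i; done   # ~110,000 files of 20 bytes each

# on server 10.70.35.86: bring down one brick by killing its process
kill -9 <brick-pid>          # pid taken from 'gluster volume status ecvol'
ls -l /mnt/ecvol/dir1        # at this point all files were still listed fine

# on the same server: force-unmount the XFS LV backing a second brick
umount -l /rhs/brick3/ecvol
ls -l /mnt/ecvol/dir1        # observed: EIO for some files, "?????" attributes for others, partial listing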
Created attachment 1212137 [details]
complete file list (but ls displayed only a partial list)
The EC volume is used with redundancy count 2, correct? If so, we expect the data to remain intact even after bringing down 2 bricks. Does a fresh mount also display the same error when 'ls' is issued? Also, did you happen to try the same test case with the md-cache features disabled?
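In case it helps with that retest, a minimal sketch of turning off the md-cache related options via the standard gluster CLI; the option names are the ones listed under "Options Reconfigured" in the description above, and the exact set to toggle is an assumption:

gluster volume set ecvol performance.stat-prefetch off
gluster volume set ecvol performance.cache-invalidation off
gluster volume set ecvol features.cache-invalidation off
# then remount the client and repeat the same ls -l test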
I see that the other brick was brought down by forcefully unmounting the brick. If something goes wrong on the backend without the client's knowledge, I am not sure the client is intelligent enough to identify the failure the first time and re-read from a different brick. Also, md-cache doesn't cache the readdirp data; readdir-ahead may have some role to play. If the second brick is brought down by killing the brick process, this issue may not be seen. My understanding is that this issue will be reproducible even without the md-cache feature on. Can we check with the EC maintainers whether this is expected behaviour?
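To rule readdir-ahead in or out, one possible check (a sketch, assuming the volume name from the description above) would be to disable it and re-run the same ls test:

gluster volume set ecvol performance.readdir-ahead off
# remount the client and repeat ls -l on the test directory with two bricks down as before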
Pranith/Ashish - could you provide your inputs on comment 4?
This looks like an EC issue more than an md-cache one. We will take a look first, and if we find that it is not an EC issue, we can hand it over to the correct component.
As discussed offline, Nag mentioned he was able to see the same issue with AFR, and that killing bricks also showed the same error. Can you please confirm?
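For that confirmation, a rough sketch of an equivalent AFR test; the volume name, server names and brick paths below are placeholders, not taken from this bug:

gluster volume create afrvol replica 2 <server1>:/rhs/brick1/afrvol <server2>:/rhs/brick1/afrvol
gluster volume start afrvol
# enable the same md-cache options as in the description, FUSE-mount the volume and create the files,
# then kill one brick process of the replica pair and run ls -l on the directory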
Based on comment 7 I am moving the component to disperse.
Nag, sosreport is also missing for this bug.
Nag, I have enough reasons to doubt that this sosreport belongs to the same issue you have mentioned in comment#1.

[root@apandey bug.1386678]#
[root@apandey bug.1386678]# ll server/
total 24
drwxrwxrwx. 3 root root 4096 Nov 25 16:57 dhcp35-116.lab.eng.blr.redhat.com
drwxrwxrwx. 3 root root 4096 Nov 25 16:58 dhcp35-135.lab.eng.blr.redhat.com
drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-196.lab.eng.blr.redhat.com
drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-239.lab.eng.blr.redhat.com
drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-37.lab.eng.blr.redhat.com
drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-8.lab.eng.blr.redhat.com

The IPs of these servers do not match the volume info you have provided:

1: volume ecvol-client-0
2:     type protocol/client
3:     option ping-timeout 42
4:     option remote-host 10.70.35.37
5:     option remote-subvolume /rhs/brick1/ecvol
6:     option transport-type socket
7:     option transport.address-family inet
8:     option send-gids true
9: end-volume
10:
11: volume ecvol-client-1
12:     type protocol/client
13:     option ping-timeout 42
14:     option remote-host 10.70.35.116
15:     option remote-subvolume /rhs/brick1/ecvol
16:     option transport-type socket
17:     option transport.address-family inet
18:     option send-gids true
19: end-volume
20:
21: volume ecvol-client-2
22:     type protocol/client
23:     option ping-timeout 42
24:     option remote-host 10.70.35.239
25:     option remote-subvolume /rhs/brick1/ecvol
26:     option transport-type socket
27:     option transport.address-family inet
28:     option send-gids true
29: end-volume
30:
31: volume ecvol-client-3
32:     type protocol/client
33:     option ping-timeout 42
34:     option remote-host 10.70.35.135
35:     option remote-subvolume /rhs/brick1/ecvol
36:     option transport-type socket
37:     option transport.address-family inet
38:     option send-gids true
39: end-volume

Also, on one of the bricks I saw crash logs. Please confirm that this sosreport has the logs for this bug only. If you don't have the sosreport, please reproduce the issue and provide it. Also, as Poornima asked in comment#7, did you try it with AFR?
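If the issue is reproduced again, a minimal sketch of the data collection that would help (sosreport is the standard RHEL tool; hostnames and paths should of course come from the actual setup):

sosreport --batch                 # on each server of the volume and on the client that saw the errors
gluster volume info ecvol         # on one server, for cross-checking the bricks against the client volfile
gluster volume status ecvol
# also attach the client's FUSE mount log from /var/log/glusterfs/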
As per the triaging, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.
Currently I am blocked as I am hitting https://bugzilla.redhat.com/show_bug.cgi?id=1397667. I will upgrade to the build with the fix for that bug and then update.
I have tried to reproduce this issue, but could not hit it.