| Summary: | [md-cache]: forceful brick down shows i/o error for some files and unknown permissions for some files | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> |
| Component: | disperse | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED WORKSFORME | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, aspandey, nchilaka, pgurusid, pkarampu, rhinduja, rhs-bugs, sbhaloth, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-19 12:44:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

Description (Nag Pavan Chilakam, 2016-10-19 12:36:57 UTC)
Created attachment 1212137 [details]
complete file list (but displayed only partial list)
EC volume is used with redundancy count 2, is it? If so, we expect the data to remain intact even after bringing down 2 bricks. Does a fresh mount also display the same error when 'ls' is issued? Also, did you happen to try the same test case with the md-cache features disabled? I see that the other brick was brought down by "forcefully umounting the brick". If something goes wrong on the backend without the client's knowledge, I am not sure the client is intelligent enough to identify the failure the first time and re-read from a different brick. Also, md-cache doesn't cache the readdirp data, so readdir-ahead may have some role to play. If the second brick is brought down by killing the brick process, this issue may not be seen. My understanding is that this issue will be reproducible even without the md-cache feature enabled. Can we check with the EC maintainers whether this is expected behaviour?

Pranith/Ashish, could you provide your inputs on comment 4? This looks like an EC issue more than an md-cache issue.

We will take a look first, and if we find that it is not an EC issue, we can hand it over to the correct component.

As discussed offline, Nag mentioned that he was able to see the same issue with AFR and that killing bricks also showed the same error. Can you please confirm?

Based on comment 7, I am moving the component to disperse. Nag, the sosreport is also missing for this bug.

Nag, I have enough reason to doubt that this sosreport belongs to the same issue you mentioned in comment #1:

    [root@apandey bug.1386678]# ll server/
    total 24
    drwxrwxrwx. 3 root root 4096 Nov 25 16:57 dhcp35-116.lab.eng.blr.redhat.com
    drwxrwxrwx. 3 root root 4096 Nov 25 16:58 dhcp35-135.lab.eng.blr.redhat.com
    drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-196.lab.eng.blr.redhat.com
    drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-239.lab.eng.blr.redhat.com
    drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-37.lab.eng.blr.redhat.com
    drwxrwxrwx. 3 root root 4096 Nov 25 16:59 dhcp35-8.lab.eng.blr.redhat.com

The IPs of the servers do not match the volume info you have provided:

    1: volume ecvol-client-0
    2: type protocol/client
    3: option ping-timeout 42
    4: option remote-host 10.70.35.37
    5: option remote-subvolume /rhs/brick1/ecvol
    6: option transport-type socket
    7: option transport.address-family inet
    8: option send-gids true
    9: end-volume
    10:
    11: volume ecvol-client-1
    12: type protocol/client
    13: option ping-timeout 42
    14: option remote-host 10.70.35.116
    15: option remote-subvolume /rhs/brick1/ecvol
    16: option transport-type socket
    17: option transport.address-family inet
    18: option send-gids true
    19: end-volume
    20:
    21: volume ecvol-client-2
    22: type protocol/client
    23: option ping-timeout 42
    24: option remote-host 10.70.35.239
    25: option remote-subvolume /rhs/brick1/ecvol
    26: option transport-type socket
    27: option transport.address-family inet
    28: option send-gids true
    29: end-volume
    30:
    31: volume ecvol-client-3
    32: type protocol/client
    33: option ping-timeout 42
    34: option remote-host 10.70.35.135
    35: option remote-subvolume /rhs/brick1/ecvol
    36: option transport-type socket
    37: option transport.address-family inet
    38: option send-gids true
    39: end-volume

Also, on one of the bricks I saw crash logs. Please confirm that the sosreport has the logs for this bug only. If you don't have a sosreport, please reproduce the issue and provide one. Also, as Poornima asked in comment #7, did you try it with AFR?

As per the triaging, we all agree that this BZ has to be fixed in rhgs-3.2.0.
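For reference, a minimal sketch of the retest asked for in the comments above: fresh mount, md-cache and readdir-ahead disabled, and the second brick taken down by killing its brick process rather than force-unmounting the backend. The volume name "ecvol" and server 10.70.35.37 come from the volfile snippet; the mount point and the brick PID are placeholders.

```sh
# Disable the client-side caching discussed above
# (performance.stat-prefetch toggles the md-cache translator).
gluster volume set ecvol performance.stat-prefetch off
gluster volume set ecvol performance.readdir-ahead off

# Bring one brick down by killing its brick process instead of
# force-unmounting the backend filesystem.
gluster volume status ecvol        # note the PID of the target brick
kill -9 <brick-pid>                # placeholder: the PID printed above

# Retest from a fresh mount so no metadata cached by the old mount is reused.
mkdir -p /mnt/ecvol-fresh
mount -t glusterfs 10.70.35.37:/ecvol /mnt/ecvol-fresh
ls -l /mnt/ecvol-fresh
```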
Providing qa_ack.

Currently I am blocked, as I am hitting https://bugzilla.redhat.com/show_bug.cgi?id=1397667. I will upgrade to the build with the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1397667 and then update.

I have tried to reproduce this issue, but couldn't hit it.
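For completeness, a rough sketch of the reproduction flow implied by the summary and the comments above: forcefully unmount the filesystem backing one brick, then list files from the client. The 4+2 disperse geometry, the fifth and sixth servers, the brick backend mount point, and the client mount point are assumptions; the other hosts and brick paths come from the volfile snippet.

```sh
# Create and start a disperse volume (4+2 geometry is an assumption).
gluster volume create ecvol disperse 6 redundancy 2 \
    10.70.35.37:/rhs/brick1/ecvol  10.70.35.116:/rhs/brick1/ecvol \
    10.70.35.239:/rhs/brick1/ecvol 10.70.35.135:/rhs/brick1/ecvol \
    <server5>:/rhs/brick1/ecvol    <server6>:/rhs/brick1/ecvol
gluster volume start ecvol

# Mount on a client and create some files.
mount -t glusterfs 10.70.35.37:/ecvol /mnt/ecvol
for i in $(seq 1 100); do echo data > /mnt/ecvol/file$i; done

# On one server, forcefully unmount the filesystem backing the brick
# (the "forceful brick down" from the summary; /rhs/brick1 is assumed
# to be the backend mount point).
umount -l /rhs/brick1

# Back on the client, list the files and look for I/O errors or
# question marks in place of permissions.
ls -l /mnt/ecvol
```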