| Summary: | brick with missing directory causes errors | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Joe Julian <joe> |
| Component: | distribute | Assignee: | shishir gowda <sgowda> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.1.3 | CC: | aavati, gluster-bugs, jonathansteffan, nsathyan |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | All |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
I've also confirmed this is still an issue with 3.2.1qa3. Here are TRACE level logs from a failing mkdir(): http://jsteffan.fedorapeople.org/logs/glusterfs-3.2.1qa3-trace-mkdir-failure.log

As an added note, I was able to successfully create this directory when loading these bricks with 3.0.7. However, we can't go back to 3.0.7, as it seems content created in 3.1.3 is not properly read by 3.0.7.

The "simulation" of losing an inode (as part of a "partial node failure") is not really real-world in this test. What you are doing is directly manipulating the backend with an `rm -rf`. In the real world you would have lost the inode as part of an fsck run while gluster was not running. To match the real-world case, restart the brick process on the node where you are creating the "partial failure", and gluster should recreate the missing directories.

Checking for missing directories in every lookup is an expensive operation, and we believe that performing the check on first access of a directory after a new server connect/reconnect is a very reasonable middle ground (only hand-crafted tests like this would fail, and real-world situations would be captured).

Avati
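Avati's proposed middle ground — verify a directory once per server connection rather than on every lookup — can be sketched with a simple generation counter. This is an illustrative model, not GlusterFS internals; the class and method names are invented for the sketch:

```python
class DirCheckPolicy:
    """Sketch of a "check on first access after (re)connect" policy.

    Instead of verifying a directory's presence on every lookup
    (expensive), remember the connection generation at which each
    directory was last checked; a reconnect bumps the generation,
    forcing a re-check on the next access.
    """

    def __init__(self):
        self.generation = 0    # bumped on every server (re)connect
        self.checked_at = {}   # path -> generation of last check

    def on_reconnect(self):
        self.generation += 1

    def needs_check(self, path):
        if self.checked_at.get(path) == self.generation:
            return False       # already verified on this connection
        self.checked_at[path] = self.generation
        return True
```

Under this policy, a backend directory deleted while the brick process stays up (as in this test) is not noticed, but one lost to an fsck during downtime is caught on the first lookup after the brick comes back.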
STEPS TO REPRODUCE:

1. Create a distributed volume (my test used `gluster volume create dtest centos{2,3,4}:/var/spool/gluster/{0,1,2,3}`).
2. Mount the volume: `mount -t glusterfs centos2:/dtest /mnt/dtest`
3. Create a directory tree, i.e. `mkdir -p /mnt/dtest/{1,2,3,4}/{5,6,7,8}/{1,2,3,4,5,6,7,8}`
4. Take a look at the directory structure on one of the servers:

```
centos2# ls -l /var/spool/gluster/*/1
/var/spool/gluster/0/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

/var/spool/gluster/1/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 21:58 5
drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
drwxr-xr-x 10 root root 4096 Apr  2 21:58 7
drwxr-xr-x 10 root root 4096 Apr  2 21:58 8

/var/spool/gluster/2/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

/var/spool/gluster/3/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8
```

5. Simulate a partially failed brick that, perhaps, lost an inode for a directory: `centos2# rm -rf /var/spool/gluster/1/1/6`
6. Remove that directory through the client: `rm -rf /mnt/dtest/1/6`

RESULTS:

```
rm: cannot remove directory `/mnt/dtest/1/6/3': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/4': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/6': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/7': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/2': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/1': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/5': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/8': No such file or directory

ls -ld /mnt/dtest/1/6
drwxr-xr-x 2 root root 4096 Apr  2 22:59 /mnt/dtest/1/6

centos2# ls -l /var/spool/gluster/*/1
/var/spool/gluster/0/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x  2 root root 4096 Apr  2 22:59 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

/var/spool/gluster/1/1:
total 24
drwxr-xr-x 10 root root 4096 Apr  2 21:58 5
drwxr-xr-x 10 root root 4096 Apr  2 21:58 7
drwxr-xr-x 10 root root 4096 Apr  2 21:58 8

/var/spool/gluster/2/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x  2 root root 4096 Apr  2 22:59 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

/var/spool/gluster/3/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x  2 root root 4096 Apr  2 22:59 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

rm -rf /mnt/dtest/1/6
rm: cannot remove directory `/mnt/dtest/1/6': No such file or directory
```

EXPECTATION: When a directory is missing on a dht brick and that directory exists on other bricks, that directory should be recreated to match its peers.
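The expected behavior — recreate a directory on any brick where it is missing so it matches its peers — can be simulated outside GlusterFS with a small script. This is a hypothetical illustration of the expectation, not how DHT self-heal is actually implemented; `heal_directory`, its arguments, and the mode-copying detail are assumptions:

```python
import os
import stat

def heal_directory(bricks, relpath):
    """If relpath exists as a directory on at least one brick,
    recreate it on every brick where it is missing, copying the
    permission bits from an existing copy.

    Returns the list of bricks that were healed."""
    sources = [b for b in bricks
               if os.path.isdir(os.path.join(b, relpath))]
    if not sources:
        return []  # directory exists on no brick: nothing to copy from
    # Take the permission bits from the first surviving copy.
    mode = stat.S_IMODE(os.stat(os.path.join(sources[0], relpath)).st_mode)
    healed = []
    for b in bricks:
        target = os.path.join(b, relpath)
        if not os.path.isdir(target):
            os.makedirs(target, mode=mode)
            healed.append(b)
    return healed
```

In the reproduction above, this would restore `1/6` on brick `/var/spool/gluster/1` before the client-side `rm -rf` descends into it, avoiding the ENOENT errors.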