Bug 764392 (GLUSTER-2660) - brick with missing directory causes errors
Summary: brick with missing directory causes errors
Keywords:
Status: CLOSED NOTABUG
Alias: GLUSTER-2660
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.1.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: shishir gowda
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-04-03 06:04 UTC by Joe Julian
Modified: 2013-12-09 01:24 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: All
Documentation: ---
CRM:
Verified Versions:



Description Joe Julian 2011-04-03 06:04:00 UTC
1. Create distributed volume (my test used "gluster volume create dtest centos{2,3,4}:/var/spool/gluster/{0,1,2,3}")
2. Mount the volume: mount -t glusterfs centos2:/dtest /mnt/dtest
3. Create a directory tree, e.g. mkdir -p /mnt/dtest/{1,2,3,4}/{5,6,7,8}/{1,2,3,4,5,6,7,8}
4. Take a look at the directory structure on one of the servers: centos2# ls -l /var/spool/gluster/*/1
 /var/spool/gluster/0/1:
 total 32
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
 drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 8
 
 /var/spool/gluster/1/1:
 total 32
 drwxr-xr-x 10 root root 4096 Apr  2 21:58 5
 drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
 drwxr-xr-x 10 root root 4096 Apr  2 21:58 7
 drwxr-xr-x 10 root root 4096 Apr  2 21:58 8
 
 /var/spool/gluster/2/1:
 total 32
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
 drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 8
 
 /var/spool/gluster/3/1:
 total 32
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
 drwxr-xr-x 10 root root 4096 Apr  2 22:56 6
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
 drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

5. Simulate a partially failed brick that, perhaps, lost an inode for a directory: centos2# rm -rf /var/spool/gluster/1/1/6
6. Remove that directory through the client: rm -rf /mnt/dtest/1/6 (these steps are collected as a single session below)
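
For convenience, the steps above as one shell session. The "gluster volume start" step and the client mount target (centos2:/dtest on /mnt/dtest) are not spelled out in the steps; they are filled in here based on the volume name and the client paths used in the rest of this report.

 # On one of the servers: create and start the distributed volume
 # (brace expansion yields twelve bricks across centos2-4).
 centos2# gluster volume create dtest centos{2,3,4}:/var/spool/gluster/{0,1,2,3}
 centos2# gluster volume start dtest

 # On the client: mount the volume.
 client# mount -t glusterfs centos2:/dtest /mnt/dtest

 # Build the directory tree through the mount.
 client# mkdir -p /mnt/dtest/{1,2,3,4}/{5,6,7,8}/{1,2,3,4,5,6,7,8}

 # Simulate the partial failure on one brick of centos2, then try to
 # remove the affected directory through the client.
 centos2# rm -rf /var/spool/gluster/1/1/6
 client# rm -rf /mnt/dtest/1/6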

RESULTS:
rm: cannot remove directory `/mnt/dtest/1/6/3': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/4': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/6': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/7': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/2': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/1': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/5': No such file or directory
rm: cannot remove directory `/mnt/dtest/1/6/8': No such file or directory

ls -ld /mnt/dtest/1/6
drwxr-xr-x 2 root root 4096 Apr  2 22:59 /mnt/dtest/1/6

centos2# ls -l /var/spool/gluster/*/1
/var/spool/gluster/0/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x  2 root root 4096 Apr  2 22:59 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

/var/spool/gluster/1/1:
total 24
drwxr-xr-x 10 root root 4096 Apr  2 21:58 5
drwxr-xr-x 10 root root 4096 Apr  2 21:58 7
drwxr-xr-x 10 root root 4096 Apr  2 21:58 8

/var/spool/gluster/2/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x  2 root root 4096 Apr  2 22:59 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

/var/spool/gluster/3/1:
total 32
drwxr-xr-x 10 root root 4096 Apr  2 20:49 5
drwxr-xr-x  2 root root 4096 Apr  2 22:59 6
drwxr-xr-x 10 root root 4096 Apr  2 20:49 7
drwxr-xr-x 10 root root 4096 Apr  2 20:49 8

rm -rf /mnt/dtest/1/6
rm: cannot remove directory `/mnt/dtest/1/6': No such file or directory

EXPECTATION:
When a directory is missing on a DHT brick but still exists on the other bricks, it should be recreated to match its peers.
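
As a stopgap, a hypothetical manual workaround (a sketch only, assuming that recreating the directory by hand on the affected brick is enough for the client-side removal to go through; hostnames and paths are the ones from the steps above):

 # On the brick where the directory went missing, recreate it by hand.
 centos2# mkdir /var/spool/gluster/1/1/6

 # A fresh lookup from the client should then see a consistent
 # directory, and the removal should succeed.
 client# stat /mnt/dtest/1/6
 client# rm -rf /mnt/dtest/1/6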

Comment 1 Jonathan Steffan 2011-04-03 19:50:28 UTC
I've also confirmed this is still an issue with 3.2.1qa3. Here are TRACE level logs from a failing mkdir():

http://jsteffan.fedorapeople.org/logs/glusterfs-3.2.1qa3-trace-mkdir-failure.log

Comment 2 Jonathan Steffan 2011-04-03 21:20:23 UTC
As an added note, I was able to create this directory successfully when serving these bricks with 3.0.7. However, we can't go back to 3.0.7, as content created under 3.1.3 does not appear to be read properly by 3.0.7.

Comment 3 Anand Avati 2011-04-05 01:59:04 UTC
The "simulation" of losing an inode (as part of a "partial node failure") is not really real-world in this test. What you are doing is directly manipulating the backend with an rm -rf. In the real world you would have lost the inode as part of an fsck when gluster was not running. To "match" the real world test, restart the brick process on the node where you are trying to create the "partial failure" and gluster should recreate the missing directories.

Checking for missing directories in every lookup is an expensive operation. We believe that performing the check on the first access of a directory after a server connect/reconnect is a very reasonable middle ground: only hand-crafted tests like this one would fail, while real-world situations would be captured.

Avati

