Description of problem:
3-brick replica volume used as a common file system for a web application platform.

Version-Release number of selected component (if applicable):
3.2.5-1

How reproducible:
Trash a single brick (i.e. hard power off) and continue to run the web application with active requests.

Actual results:
Unusable mounted GlusterFS file system; logs follow.

[2012-02-17 10:31:32.721910] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 88: LOOKUP() /public_html/status.html => -1 (Input/output error)
[2012-02-17 10:31:32.744218] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-atom.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.744266] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 89: LOOKUP() /public_html/wp-atom.php => -1 (Input/output error)
[2012-02-17 10:31:32.761374] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/.changelog, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.761415] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 90: LOOKUP() /public_html/.changelog => -1 (Input/output error)
[2012-02-17 10:31:32.781385] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.781440] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 91: LOOKUP() /public_html/intercept => -1 (Input/output error)
[2012-02-17 10:31:32.801367] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-rdf.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.801409] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 94: LOOKUP() /public_html/wp-rdf.php => -1 (Input/output error)
[2012-02-17 10:31:32.852227] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-rss2.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.852341] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 99: LOOKUP() /public_html/wp-rss2.php => -1 (Input/output error)
[2012-02-17 10:31:32.899269] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/rapp, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.899321] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 103: LOOKUP() /public_html/rapp => -1 (Input/output error)
[2012-02-17 10:31:32.923432] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-commentsrss2.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.923512] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 106: LOOKUP() /public_html/wp-commentsrss2.php => -1 (Input/output error)
[2012-02-17 10:31:32.942341] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/email, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.942390] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 107: LOOKUP() /public_html/email => -1 (Input/output error)
[2012-02-17 10:31:32.961476] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/parseit-soap.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.961558] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 108: LOOKUP() /public_html/parseit-soap.php => -1 (Input/output error)
[2012-02-17 10:31:33.1512] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-pass.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:33.1582] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 109: LOOKUP() /public_html/wp-pass.php => -1 (Input/output error)
[2012-02-17 10:31:33.23808] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/robot.txt, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:33.23874] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 110: LOOKUP() /public_html/robot.txt => -1 (Input/output error)
[2012-02-17 10:33:33.875848] I [glusterfsd-mgmt.c:62:mgmt_cbk_spec] 0-mgmt: Volume file changed

Expected results:
Continued file system usage between the active nodes.

Additional info:
In my case we have 2 mountpoints:
1. GlusterFS mountpoint for operations needing distributed locks.
2. NFS mountpoint; this is the main webroot for the web application.

Resolution is to manually:
1. Remove the brick from the volume.
2. Detach the peer once removed.
3. Active service then resumes.
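The manual resolution steps above can be sketched as gluster CLI calls. This is a sketch only: the volume name (STAGING) and the failed peer/brick (deadnode:/export/staging) are hypothetical placeholders, and remove-brick syntax varies between GlusterFS releases, so verify names against `gluster volume info` before running anything for real.

```shell
# Placeholder names (assumptions, not taken from this report).
set -eu
VOLUME=STAGING
DEADNODE=deadnode
BRICK="$DEADNODE:/export/staging"

# Echoed rather than executed so the sequence can be reviewed first:
# 1. remove the dead brick from the volume, 2. detach the dead peer.
echo gluster volume remove-brick "$VOLUME" "$BRICK"
echo gluster peer detach "$DEADNODE"
```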
How did you get files without a GFID on your bricks? Is it possible they were written directly to the bricks rather than through a client mount point? Could you post a comment with the xattrs of one such file from a backend brick? For example:

server# getfattr -m . -d -e hex /brick/dir/public_html/robot.txt

(which, by the way, should probably be named "robots.txt")

Also, if you have brick logs covering the same time period, please post them as well. Regards.
These are production environments. The files are written to via an NFS mount point at each node for general operations, and via a glusterfs mount point for operations requiring distributed locks.

I used this Python script initially, due to mirror connectivity issues:

---
cat ./listxattr.py
import xattr
import sys
from base64 import b64encode

xattrs = xattr.listxattr(sys.argv[1])
for attr in xattrs:
    print('%s: %s %s' % (sys.argv[1], attr, b64encode(xattr.getxattr(sys.argv[1], attr))))
---

/media/staging/public_html/robots.txt: trusted.gfid RTjKo5iaSWqBdZkpT83jBQ==
/media/staging/public_html/robots.txt: trusted.afr.STAGING-client-0 AAAAAAAAAAAAAAAA
/media/staging/public_html/robots.txt: trusted.afr.STAGING-client-1 AAAAAAAAAAAAAAAA

getfattr results:

# file: media/staging/public_html/robots.txt
trusted.afr.STAGING-client-0=0x000000000000000000000000
trusted.afr.STAGING-client-1=0x000000000000000000000000
trusted.gfid=0x4538caa3989a496a817599294fcde305

Logs. These are not full, as the last message continues to repeat onto the logs above; I have also had to remove folder names etc., as this is a production environment.

---
I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: size differs for /public_html/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:27:11.260040] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background data self-heal triggered. path: /public_html/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:27:11.277804] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background data self-heal completed on /public_html/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:40:04.592528] I [socket.c:1768:socket_event_handler] 0-transport: disconnecting now
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:20.768389] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: size differs for /public_html/******/******/******/******/******/******/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:20.768454] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background data self-heal triggered. path: /public_html/******/******/******/******/******/******/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:20.776064] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background data self-heal completed on /public_html/******/******/******/******/******/******/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.388719] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.389578] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.444574] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.445357] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.446815] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.448471] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
More logs follow; this appears to be the same issue, as a temporary inability to contact a peer again resulted in gluster failure. This time I have crash data: using git to pull from a Subversion SCM resulted in a crash this morning, although this has been done multiple times without issue. This was run on a single node on a glusterfs mountpoint.

---
brick log:

[2012-03-01 11:03:35.650450] I [server-resolve.c:571:server_resolve] 0-STAGING-server: pure path resolution for /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab (LOOKUP)
[2012-03-01 11:03:35.650882] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(resolve_deep_cbk+0x2ea) [0x2aaaabebe04a]))) 0-: inode not found
[2012-03-01 11:03:35.651130] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.652456] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.653327] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.657752] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.658766] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.673056] I [server-resolve.c:571:server_resolve] 0-STAGING-server: pure path resolution for /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 (LOOKUP)
[2012-03-01 11:03:35.673439] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(resolve_deep_cbk+0x2ea) [0x2aaaabebe04a]))) 0-: inode not found
[2012-03-01 11:03:35.673565] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.675429] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.677083] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:36.595921] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:36.795902] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:36.796400] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:43.224646] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.STAGING-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.176.65.165:1008)
[2012-03-01 11:03:43.224819] I [server-helpers.c:485:do_fd_cleanup] 0-STAGING-server: fd cleanup on /public_html/.git/refs/remotes/trunk
[2012-03-01 11:03:43.224978] I [server.c:438:server_rpc_notify] 0-STAGING-server: disconnected connection from 10.176.65.165:1008
[2012-03-01 11:03:43.225020] I [server-helpers.c:783:server_connection_destroy] 0-STAGING-server: destroyed connection of espn-web1.tv.espn.co.uk-31390-2012/02/22-10:31:26:127915-STAGING-client-0
[2012-03-01 11:51:50.171214] W [socket.c:1494:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-03-01 11:51:50.171411] W [socket.c:1494:__socket_proto_state_machine] 0-socket.glusterfsd: reading from socket failed. Error (Transport endpoint is not connected), peer ()
[2012-03-01 11:51:51.230846] W [glusterfsd.c:727:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3505ed44bd] (-->/lib64/libpthread.so.0 [0x3506e0673d] (-->/opt/glusterfs/3.2.5/sbin/glusterfsd(glusterfs_sigwaiter+0x17c) [0x40441c]))) 0-: received signum (15), shutting down

--
mounted path log:

[2012-03-01 11:03:32.915337] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/HEAD
[2012-03-01 11:03:32.953482] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/HEAD: gfid different on subvolume
[2012-03-01 11:03:32.953526] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data gfid self-heal triggered. path: /public_html/.git/HEAD
[2012-03-01 11:03:33.44144] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/HEAD. proceeding to metadata check
[2012-03-01 11:03:33.61319] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data gfid self-heal completed on /public_html/.git/HEAD
[2012-03-01 11:03:33.670340] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/refs/heads/master
[2012-03-01 11:03:33.670385] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/refs/heads/master: gfid different on subvolume
[2012-03-01 11:03:33.670432] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data gfid self-heal triggered. path: /public_html/.git/refs/heads/master
[2012-03-01 11:03:33.682099] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/refs/heads/master. proceeding to metadata check
[2012-03-01 11:03:33.684003] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data gfid self-heal completed on /public_html/.git/refs/heads/master
[2012-03-01 11:03:33.918485] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-STAGING-client-0: remote operation failed: Stale NFS file handle
[2012-03-01 11:03:33.918528] W [stat-prefetch.c:1549:sp_open_helper] 0-STAGING-stat-prefetch: lookup-behind has failed for path (/public_html/.git/refs/remotes/trunk)(Stale NFS file handle), unwinding open call waiting on it
[2012-03-01 11:03:33.938122] W [fuse-bridge.c:588:fuse_fd_cbk] 0-glusterfs-fuse: 884462: OPEN() /public_html/.git/refs/remotes/trunk => -1 (Stale NFS file handle)
[2012-03-01 11:03:33.973142] I [afr-dir-read.c:179:afr_examine_dir_readdir_cbk] 0-STAGING-replicate-0: entry self-heal triggered. path: /public_html/.git/refs/remotes/tags, reason: checksums of directory differ, forced merge option set
[2012-03-01 11:03:34.66127] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.66170] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:34.66204] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.66232] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:34.67946] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background entry self-heal failed on /public_html/.git/refs/remotes/tags
[2012-03-01 11:03:34.123743] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53.
[2012-03-01 11:03:34.123768] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53
[2012-03-01 11:03:34.124894] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.125216] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.196968] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:34.199266] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53
[2012-03-01 11:03:34.219384] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54.
[2012-03-01 11:03:34.219426] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54
[2012-03-01 11:03:34.220670] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.220983] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.224895] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:34.227635] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54
[2012-03-01 11:03:35.610559] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/24.
[2012-03-01 11:03:35.610591] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/24
[2012-03-01 11:03:35.611421] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.611761] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.651086] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.651196] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:35.651652] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal failed on /public_html/.git/objects/24
[2012-03-01 11:03:35.652540] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab.
[2012-03-01 11:03:35.652564] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab
[2012-03-01 11:03:35.653426] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.653755] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.657206] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:35.657931] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/79.
[2012-03-01 11:03:35.657954] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/79
[2012-03-01 11:03:35.658932] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.659331] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.659662] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab
[2012-03-01 11:03:35.674023] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.674045] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:35.674684] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal failed on /public_html/.git/objects/79
[2012-03-01 11:03:35.675949] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747.
[2012-03-01 11:03:35.675978] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747
[2012-03-01 11:03:35.677473] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.677794] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.681611] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:35.779422] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747
[2012-03-01 11:03:36.596060] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/svn/.metadata
[2012-03-01 11:03:36.596088] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/svn/.metadata: gfid different on subvolume
[2012-03-01 11:03:36.596107] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data gfid self-heal triggered. path: /public_html/.git/svn/.metadata
[2012-03-01 11:03:36.596884] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/svn/.metadata. proceeding to metadata check
[2012-03-01 11:03:36.598896] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data gfid self-heal completed on /public_html/.git/svn/.metadata
[2012-03-01 11:03:36.771596] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: size differs for /public_html/.git/svn/refs/remotes/trunk/.rev_map.9733698e-6b0f-0410-ab9e-a50e8268ea71
[2012-03-01 11:03:36.771659] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background data self-heal triggered. path: /public_html/.git/svn/refs/remotes/trunk/.rev_map.9733698e-6b0f-0410-ab9e-a50e8268ea71
[2012-03-01 11:03:36.783395] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background data self-heal completed on /public_html/.git/svn/refs/remotes/trunk/.rev_map.9733698e-6b0f-0410-ab9e-a50e8268ea71
[2012-03-01 11:03:36.796016] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-STAGING-client-0: remote operation failed: Stale NFS file handle
[2012-03-01 11:03:36.796571] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/refs/remotes/trunk
[2012-03-01 11:03:36.796591] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/refs/remotes/trunk: gfid different on subvolume
[2012-03-01 11:03:36.796608] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background meta-data gfid self-heal triggered. path: /public_html/.git/refs/remotes/trunk
[2012-03-01 11:03:36.798873] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/refs/remotes/trunk. proceeding to metadata check
[2012-03-01 11:03:36.800810] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background meta-data gfid self-heal completed on /public_html/.git/refs/remotes/trunk

time of crash: 2012-03-01 11:03:36
pending frames:
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-03-01 11:03:36
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.5
/lib64/libc.so.6[0x3505e302d0]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/io-cache.so(ioc_open_cbk+0x9b)[0x2aaaaba40a8b]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/read-ahead.so(ra_open_cbk+0x213)[0x2aaaab835a73]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_open_cbk+0x111)[0x2aaaab624a21]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_open_cbk+0x26b)[0x2aaaab3e3dab]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_open_cbk+0x1b1)[0x2aaaab1a73d1]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x2ba2e04cd432]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x2ba2e04cd62d]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x2ba2e04c82a7]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaaaad70ecf]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0x188)[0x2aaaaad71078]
/opt/glusterfs/3.2.5/lib64/libglusterfs.so.0[0x2ba2e028c9a1]
/opt/glusterfs/3.2.5/sbin/glusterfs(main+0x42f)[0x4052af]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3505e1d994]
/opt/glusterfs/3.2.5/sbin/glusterfs[0x403649]
---------
Further testing via CloudServers.
Setup: n2n providing a p2p VPN with TLS.
NOTE: This is a new experimental setup I am trialling, not the original above.
New volume: TEST
iozone -R -l 5 -u 5 -r 4k -s 100m -F /media/glusterfs/test/{1,2,3,4,5} -b /tmp/test3.xls | tee -a /tmp/test3.txt
node2 is forcefully rebooted.
Log sample from /var/log/gluster/media-glusterfs-test.log:
[2012-03-06 09:57:26.655328] W [rpc-clnt.c:1417:rpc_clnt_submit] 0-TEST-client-1: failed to submit rpc-request (XID: 0x2078193x Program: GlusterFS 3.1, ProgVers: 310, Proc: 30) to rpc-transport (TEST-client-1)
[2012-03-06 09:57:26.655347] E [client3_1-fops.c:1264:client3_1_finodelk_cbk] 0-TEST-client-1: remote operation failed: Transport endpoint is not connected
[2012-03-06 09:57:26.655390] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7fac44ccb568] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7fac44ccacfd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fac44ccac5e]))) 0-TEST-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2012-03-06 09:56:18.425191
[2012-03-06 09:57:26.655422] W [rpc-clnt.c:1417:rpc_clnt_submit] 0-TEST-client-1: failed to submit rpc-request (XID: 0x2078194x Program: GlusterFS 3.1, ProgVers: 310, Proc: 30) to rpc-transport (TEST-client-1)
[2012-03-06 09:57:26.655451] E [client3_1-fops.c:1264:client3_1_finodelk_cbk] 0-TEST-client-1: remote operation failed: Transport endpoint is not connected
[2012-03-06 09:57:26.655491] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7fac44ccb568] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7fac44ccacfd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fac44ccac5e]))) 0-TEST-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-03-06 09:56:40.75430
[2012-03-06 09:57:26.655516] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-TEST-client-1: remote operation failed: Transport endpoint is not connected
[2012-03-06 09:57:26.656369] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7fac44ccb568] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7fac44ccacfd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fac44ccac5e]))) 0-TEST-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-03-06 09:56:44.573794
[2012-03-06 09:57:26.656418] W [client-handshake.c:264:client_ping_cbk] 0-TEST-client-1: timer must have expired
[2012-03-06 09:57:26.656449] I [client.c:1883:client_rpc_notify] 0-TEST-client-1: disconnected
[2012-03-06 09:57:26.657715] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.657754] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.657844] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.657864] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.658061] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.658086] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.658963] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.658997] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.663857] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3935430561): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.663893] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.663975] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3935430561): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.663997] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.664068] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3935430561): failed to get fd ctx. EBADFD
Note: in this scenario the filesystem became deadlocked on the surviving node (node1). Once node2 was returned to service, operations continued seamlessly. Shouldn't node1 continue operating in a replica volume?
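One factor worth ruling out here (an assumption on my part, not something these logs confirm): a FUSE client holds in-flight operations against a dead brick until the volume's network.ping-timeout expires (42 seconds by default) before treating that brick as down and carrying on with the surviving replica. A sketch of checking and lowering it on the TEST volume:

```shell
# Show the volume's current options; network.ping-timeout is listed
# under "Options Reconfigured" only if it has been explicitly set.
gluster volume info TEST

# Lower the ping timeout so a hard-powered-off brick is declared
# dead sooner. Very low values are discouraged on unreliable
# networks (e.g. over a VPN), since spurious disconnects trigger
# unnecessary self-heals.
gluster volume set TEST network.ping-timeout 10
```

This would shorten the window in which node1 appears hung, but it would not explain a hang that persists until node2 returns, so at best it is a partial explanation.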
With regard to the above, I have a theory as to how this failure occurs: an NFS mount point does not deadlock during the loss of a node, so file operations continue against the surviving node. Once the downed node is returned to service, Gluster does not appear to be able to handle the resulting differences between the two copies of the file system, and that leads to this issue.
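One way to test this theory (a diagnostic sketch; the brick path /export/brick is made up for illustration) is to compare the AFR changelog xattrs for an affected file directly on both bricks:

```shell
# Run on each brick server. The trusted.afr.* extended attributes
# record pending operations against the other replica; non-zero
# counters on BOTH bricks for the same file indicate diverged
# copies (split-brain) that background self-heal cannot resolve
# on its own.
getfattr -m . -d -e hex /export/brick/public_html/.git/svn/.metadata
```

If both bricks blame each other after node2 returns, that would support the theory that the post-outage differences, not the outage itself, are what Gluster fails to handle.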
Am I correct in thinking that you have mounted node2:/public_html as /public_html on your web server? If node2 dies, how would the NFS client on the web server know to fail over from node2 to node1? Surely it's logical to assume that if node2 dies, any NFS filesystem you have mounted from it will also die? Does your GlusterFS FUSE client mount fail at the same time?
Whilst I no longer have access to this setup: each node was addressing an NFS mount locally, such that Gluster was providing a common filesystem for the two web app servers.
Amar, assigning because of the crash pointed out in comment 3.
package-string: glusterfs 3.2.5
/lib64/libc.so.6[0x3505e302d0]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/io-cache.so(ioc_open_cbk+0x9b)[0x2aaaaba40a8b]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/read-ahead.so(ra_open_cbk+0x213)[0x2aaaab835a73]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_open_cbk+0x111)[0x2aaaab624a21]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_open_cbk+0x26b)[0x2aaaab3e3dab]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_open_cbk+0x1b1)[0x2aaaab1a73d1]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x2ba2e04cd432]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x2ba2e04cd62d]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x2ba2e04c82a7]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaaaad70ecf]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0x188)[0x2aaaaad71078]
/opt/glusterfs/3.2.5/lib64/libglusterfs.so.0[0x2ba2e028c9a1]
/opt/glusterfs/3.2.5/sbin/glusterfs(main+0x42f)[0x4052af]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3505e1d994]
/opt/glusterfs/3.2.5/sbin/glusterfs[0x403649]
We are aware of these crashes on 3.2.x branch, but haven't noticed it on 3.4.x branch yet. Please re-open the bug if the 3.4.x release (now @ Alpha) doesn't fix it.