Bug 794699 - Cannot tolerate failure of a brick in replica volume
Summary: Cannot tolerate failure of a brick in replica volume
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: io-cache
Version: 3.2.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 952693
 
Reported: 2012-02-17 10:52 UTC by David Busby
Modified: 2013-04-26 09:48 UTC
CC: 7 users

Fixed In Version: glusterfs-3.4.x
Clone Of:
Environment:
Last Closed: 2013-04-26 09:48:37 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description David Busby 2012-02-17 10:52:28 UTC
Description of problem:

A 3-brick replica volume used as a common file system for a web application platform.

Version-Release number of selected component (if applicable):

3.2.5-1


How reproducible:

Trash a single brick (i.e. hard power off its node) and continue to run the web application with active requests.
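
A rough sketch of the kind of setup used (hostnames and brick paths below are placeholders, not the real production layout; the volume name matches the STAGING volume seen in the logs):

---
# create and start a 3-way replica volume across three nodes (placeholder names)
gluster volume create STAGING replica 3 node1:/export/brick node2:/export/brick node3:/export/brick
gluster volume start STAGING

# mount it on a client and drive web-application traffic against it
mount -t glusterfs node1:/STAGING /media/staging

# then hard power off one of the brick nodes (e.g. node2) while requests are in flight
---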
  
Actual results:

The mounted GlusterFS file system becomes unusable; logs follow.

[2012-02-17 10:31:32.721910] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 88: LOOKUP() /public_html/status.html => -1 (Input/output error)
[2012-02-17 10:31:32.744218] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-atom.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.744266] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 89: LOOKUP() /public_html/wp-atom.php => -1 (Input/output error)
[2012-02-17 10:31:32.761374] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/.changelog, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.761415] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 90: LOOKUP() /public_html/.changelog => -1 (Input/output error)
[2012-02-17 10:31:32.781385] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.781440] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 91: LOOKUP() /public_html/intercept => -1 (Input/output error)
[2012-02-17 10:31:32.801367] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-rdf.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.801409] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 94: LOOKUP() /public_html/wp-rdf.php => -1 (Input/output error)
[2012-02-17 10:31:32.852227] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-rss2.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.852341] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 99: LOOKUP() /public_html/wp-rss2.php => -1 (Input/output error)
[2012-02-17 10:31:32.899269] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/rapp, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.899321] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 103: LOOKUP() /public_html/rapp => -1 (Input/output error)
[2012-02-17 10:31:32.923432] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-commentsrss2.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.923512] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 106: LOOKUP() /public_html/wp-commentsrss2.php => -1 (Input/output error)
[2012-02-17 10:31:32.942341] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/email, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.942390] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 107: LOOKUP() /public_html/email => -1 (Input/output error)
[2012-02-17 10:31:32.961476] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/parseit-soap.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:32.961558] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 108: LOOKUP() /public_html/parseit-soap.php => -1 (Input/output error)
[2012-02-17 10:31:33.1512] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/wp-pass.php, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:33.1582] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 109: LOOKUP() /public_html/wp-pass.php => -1 (Input/output error)
[2012-02-17 10:31:33.23808] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/robot.txt, LOOKUP on a file without gfid is not allowed when some of the children are down
[2012-02-17 10:31:33.23874] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 110: LOOKUP() /public_html/robot.txt => -1 (Input/output error)
[2012-02-17 10:33:33.875848] I [glusterfsd-mgmt.c:62:mgmt_cbk_spec] 0-mgmt: Volume file changed


Expected results:

Continued file system operation on the surviving nodes.

Additional info:

In my case we have 2 mount points:

1. A GlusterFS mount point for operations needing distributed locks.
2. An NFS mount point, which is the main webroot for the web application.


The resolution is to manually (rough command sketch below):

1. Remove the brick from the volume.
2. Detach the peer once the brick is removed.
3. Active service then resumes.
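
Roughly, the manual recovery was the equivalent of the following (hostnames and brick paths are placeholders; exact remove-brick syntax varies between GlusterFS releases, newer ones expect a new replica count and/or "force"):

---
# drop the dead brick from the volume (placeholder names)
gluster volume remove-brick STAGING node2:/export/brick

# once the brick is out of the volume, detach the dead peer
gluster peer detach node2
---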

Comment 1 Louis Zuckerman 2012-02-21 12:24:42 UTC
How did you get files without GFID on your bricks?  Is it possible they were written directly to the bricks and not written through a client mount point?

Could you post a comment with the xattrs of one such file from a backend brick?

For example...

server# getfattr -m . -d -e hex /brick/dir/public_html/robot.txt

(which by the way should probably be named "robots.txt")

Also, if you have brick logs covering the same time period please post them as well.

Regards.

Comment 2 David Busby 2012-02-22 11:16:26 UTC
These are production environments; the files are written via an NFS mount point on each node for general operations, and via a GlusterFS mount point for operations requiring distributed locks.

I used this Python script initially, due to mirror connectivity issues:
---
cat ./listxattr.py
# List all extended attributes of the path given as argv[1],
# printing each value base64-encoded.
import sys
from base64 import b64encode

import xattr

path = sys.argv[1]
for attr in xattr.listxattr(path):
    value = xattr.getxattr(path, attr)
    print('%s: %s %s' % (path, attr, b64encode(value)))

---

/media/staging/public_html/robots.txt: trusted.gfid RTjKo5iaSWqBdZkpT83jBQ==
/media/staging/public_html/robots.txt: trusted.afr.STAGING-client-0 AAAAAAAAAAAAAAAA
/media/staging/public_html/robots.txt: trusted.afr.STAGING-client-1 AAAAAAAAAAAAAAAA

getfattr results:


# file: media/staging/public_html/robots.txt
trusted.afr.STAGING-client-0=0x000000000000000000000000
trusted.afr.STAGING-client-1=0x000000000000000000000000
trusted.gfid=0x4538caa3989a496a817599294fcde305
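
For what it's worth, the base64 value from the Python script and the hex value from getfattr describe the same gfid; a quick sanity check (assuming base64 and xxd are available on the node):

---
# decode the base64 trusted.gfid printed by listxattr.py and show it as hex
echo 'RTjKo5iaSWqBdZkpT83jBQ==' | base64 --decode | xxd -p
# prints: 4538caa3989a496a817599294fcde305  (matches the getfattr value above)
---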

Logs follow. These are not complete, as the last message continues to repeat as in the logs above; I have also had to remove folder names etc., as this is a production environment.

---

I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: size differs for /public_html/******.php 
/var/log/glusterfs/nfs.log:[2012-02-17 06:27:11.260040] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  data self-heal triggered. path: /public_html/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:27:11.277804] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  data self-heal completed on /public_html/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:40:04.592528] I [socket.c:1768:socket_event_handler] 0-transport: disconnecting now
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:20.768389] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: size differs for /public_html/******/******/******/******/******/******/******.php 
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:20.768454] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  data self-heal triggered. path: /public_html/******/******/******/******/******/******/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:20.776064] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  data self-heal completed on /public_html/******/******/******/******/******/******/******.php
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.388719] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.389578] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.444574] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.445357] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.446815] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down
/var/log/glusterfs/nfs.log:[2012-02-17 06:45:24.448471] E [afr-common.c:1410:afr_lookup_done] 0-STAGING-replicate-0: Failing lookup for /public_html/intercept, LOOKUP on a file without gfid is not allowed when some of the children are down

Comment 3 David Busby 2012-03-01 12:01:29 UTC
More logs follow. This appears to be the same issue, as a temporary inability to contact a peer resulted in a gluster failure; this time I have crash data:



Using git to pull from a Subversion SCM resulted in a crash this morning; this has been done multiple times before without issue. OS is 

This was run on a single node against a glusterfs mount point.

--- brick log:

[2012-03-01 11:03:35.650450] I [server-resolve.c:571:server_resolve] 0-STAGING-server: pure path resolution for /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab (LOOKUP)
[2012-03-01 11:03:35.650882] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(resolve_deep_cbk+0x2ea) [0x2aaaabebe04a]))) 0-: inode not found
[2012-03-01 11:03:35.651130] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.652456] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.653327] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.657752] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.658766] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.673056] I [server-resolve.c:571:server_resolve] 0-STAGING-server: pure path resolution for /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 (LOOKUP)
[2012-03-01 11:03:35.673439] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(resolve_deep_cbk+0x2ea) [0x2aaaabebe04a]))) 0-: inode not found
[2012-03-01 11:03:35.673565] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.675429] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:35.677083] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:36.595921] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:36.795902] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:36.796400] W [inode.c:844:inode_lookup] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/features/marker.so(marker_lookup_cbk+0xda) [0x2aaaaba8886a] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xe4) [0x2aaaabca4fa4] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/server.so(server_lookup_cbk+0x791) [0x2aaaabed4f81]))) 0-: inode not found
[2012-03-01 11:03:43.224646] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.STAGING-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.176.65.165:1008)
[2012-03-01 11:03:43.224819] I [server-helpers.c:485:do_fd_cleanup] 0-STAGING-server: fd cleanup on /public_html/.git/refs/remotes/trunk
[2012-03-01 11:03:43.224978] I [server.c:438:server_rpc_notify] 0-STAGING-server: disconnected connection from 10.176.65.165:1008
[2012-03-01 11:03:43.225020] I [server-helpers.c:783:server_connection_destroy] 0-STAGING-server: destroyed connection of espn-web1.tv.espn.co.uk-31390-2012/02/22-10:31:26:127915-STAGING-client-0
[2012-03-01 11:51:50.171214] W [socket.c:1494:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:24007)
[2012-03-01 11:51:50.171411] W [socket.c:1494:__socket_proto_state_machine] 0-socket.glusterfsd: reading from socket failed. Error (Transport endpoint is not connected), peer ()
[2012-03-01 11:51:51.230846] W [glusterfsd.c:727:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3505ed44bd] (-->/lib64/libpthread.so.0 [0x3506e0673d] (-->/opt/glusterfs/3.2.5/sbin/glusterfsd(glusterfs_sigwaiter+0x17c) [0x40441c]))) 0-: received signum (15), shutting down

--- mounted path log:

[2012-03-01 11:03:32.915337] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/HEAD 
[2012-03-01 11:03:32.953482] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/HEAD: gfid different on subvolume
[2012-03-01 11:03:32.953526] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data gfid self-heal triggered. path: /public_html/.git/HEAD
[2012-03-01 11:03:33.44144] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/HEAD. proceeding to metadata check
[2012-03-01 11:03:33.61319] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data gfid self-heal completed on /public_html/.git/HEAD
[2012-03-01 11:03:33.670340] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/refs/heads/master 
[2012-03-01 11:03:33.670385] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/refs/heads/master: gfid different on subvolume
[2012-03-01 11:03:33.670432] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data gfid self-heal triggered. path: /public_html/.git/refs/heads/master
[2012-03-01 11:03:33.682099] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/refs/heads/master. proceeding to metadata check
[2012-03-01 11:03:33.684003] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data gfid self-heal completed on /public_html/.git/refs/heads/master
[2012-03-01 11:03:33.918485] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-STAGING-client-0: remote operation failed: Stale NFS file handle
[2012-03-01 11:03:33.918528] W [stat-prefetch.c:1549:sp_open_helper] 0-STAGING-stat-prefetch: lookup-behind has failed for path (/public_html/.git/refs/remotes/trunk)(Stale NFS file handle), unwinding open call waiting on it
[2012-03-01 11:03:33.938122] W [fuse-bridge.c:588:fuse_fd_cbk] 0-glusterfs-fuse: 884462: OPEN() /public_html/.git/refs/remotes/trunk => -1 (Stale NFS file handle)
[2012-03-01 11:03:33.973142] I [afr-dir-read.c:179:afr_examine_dir_readdir_cbk] 0-STAGING-replicate-0:  entry self-heal triggered. path: /public_html/.git/refs/remotes/tags, reason: checksums of directory differ, forced merge option set
[2012-03-01 11:03:34.66127] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.66170] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:34.66204] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.66232] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:34.67946] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  entry self-heal failed on /public_html/.git/refs/remotes/tags
[2012-03-01 11:03:34.123743] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53.
[2012-03-01 11:03:34.123768] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53
[2012-03-01 11:03:34.124894] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.125216] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.196968] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:34.199266] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.53
[2012-03-01 11:03:34.219384] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54.
[2012-03-01 11:03:34.219426] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54
[2012-03-01 11:03:34.220670] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.220983] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:34.224895] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:34.227635] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/refs/remotes/tags/client_name_removed_3.1_1.54
[2012-03-01 11:03:35.610559] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/24.
[2012-03-01 11:03:35.610591] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/24
[2012-03-01 11:03:35.611421] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.611761] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.651086] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.651196] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:35.651652] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal failed on /public_html/.git/objects/24
[2012-03-01 11:03:35.652540] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab.
[2012-03-01 11:03:35.652564] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab
[2012-03-01 11:03:35.653426] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.653755] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.657206] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:35.657931] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/79.
[2012-03-01 11:03:35.657954] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/79
[2012-03-01 11:03:35.658932] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.659331] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.659662] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/objects/24/74704d088689cfb1151debf9429a2739cb78ab
[2012-03-01 11:03:35.674023] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.674045] I [afr-self-heal-entry.c:1741:afr_sh_entry_common_lookup_done] 0-STAGING-replicate-0: /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747: Skipping entry self-heal because of gfid absence
[2012-03-01 11:03:35.674684] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal failed on /public_html/.git/objects/79
[2012-03-01 11:03:35.675949] I [afr-common.c:917:afr_detect_self_heal_by_lookup_status] 0-STAGING-replicate-0: entries are missing in lookup of /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747.
[2012-03-01 11:03:35.675978] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747
[2012-03-01 11:03:35.677473] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.677794] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 0-STAGING-replicate-0: path /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747 on subvolume STAGING-client-1 => -1 (No such file or directory)
[2012-03-01 11:03:35.681611] W [dict.c:418:dict_unref] (-->/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2) [0x2ba2e04cd432] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_fstat_cbk+0x2c9) [0x2aaaab1b0e09] (-->/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x1d0) [0x2aaaab3e9510]))) 0-dict: dict is NULL
[2012-03-01 11:03:35.779422] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data data entry missing-entry gfid self-heal completed on /public_html/.git/objects/79/3822932d49da73a879f1f79efc7e1ca8319747
[2012-03-01 11:03:36.596060] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/svn/.metadata 
[2012-03-01 11:03:36.596088] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/svn/.metadata: gfid different on subvolume
[2012-03-01 11:03:36.596107] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data gfid self-heal triggered. path: /public_html/.git/svn/.metadata
[2012-03-01 11:03:36.596884] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/svn/.metadata. proceeding to metadata check
[2012-03-01 11:03:36.598896] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data gfid self-heal completed on /public_html/.git/svn/.metadata
[2012-03-01 11:03:36.771596] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: size differs for /public_html/.git/svn/refs/remotes/trunk/.rev_map.9733698e-6b0f-0410-ab9e-a50e8268ea71 
[2012-03-01 11:03:36.771659] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  data self-heal triggered. path: /public_html/.git/svn/refs/remotes/trunk/.rev_map.9733698e-6b0f-0410-ab9e-a50e8268ea71
[2012-03-01 11:03:36.783395] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  data self-heal completed on /public_html/.git/svn/refs/remotes/trunk/.rev_map.9733698e-6b0f-0410-ab9e-a50e8268ea71
[2012-03-01 11:03:36.796016] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-STAGING-client-0: remote operation failed: Stale NFS file handle
[2012-03-01 11:03:36.796571] I [afr-common.c:869:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: ownership differs for /public_html/.git/refs/remotes/trunk 
[2012-03-01 11:03:36.796591] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-STAGING-replicate-0: /public_html/.git/refs/remotes/trunk: gfid different on subvolume
[2012-03-01 11:03:36.796608] I [afr-common.c:1038:afr_launch_self_heal] 0-STAGING-replicate-0: background  meta-data gfid self-heal triggered. path: /public_html/.git/refs/remotes/trunk
[2012-03-01 11:03:36.798873] I [afr-self-heal-common.c:1203:sh_missing_entries_create] 0-STAGING-replicate-0: no missing files - /public_html/.git/refs/remotes/trunk. proceeding to metadata check
[2012-03-01 11:03:36.800810] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-STAGING-replicate-0: background  meta-data gfid self-heal completed on /public_html/.git/refs/remotes/trunk
pending frames:
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-03-01 11:03:36
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.5
/lib64/libc.so.6[0x3505e302d0]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/io-cache.so(ioc_open_cbk+0x9b)[0x2aaaaba40a8b]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/read-ahead.so(ra_open_cbk+0x213)[0x2aaaab835a73]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_open_cbk+0x111)[0x2aaaab624a21]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_open_cbk+0x26b)[0x2aaaab3e3dab]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_open_cbk+0x1b1)[0x2aaaab1a73d1]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x2ba2e04cd432]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x2ba2e04cd62d]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x2ba2e04c82a7]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaaaad70ecf]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0x188)[0x2aaaaad71078]
/opt/glusterfs/3.2.5/lib64/libglusterfs.so.0[0x2ba2e028c9a1]
/opt/glusterfs/3.2.5/sbin/glusterfs(main+0x42f)[0x4052af]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3505e1d994]
/opt/glusterfs/3.2.5/sbin/glusterfs[0x403649]
---------

Comment 4 David Busby 2012-03-06 10:04:01 UTC
Further testing via CloudServers.

Setup:

n2n providing a p2p VPN & TLS

NOTE: This is a new experimental setup I am trialling, not the original one above.

New volume TEST

iozone -R -l 5 -u 5 -r 4k -s 100m -F /media/glusterfs/test/{1,2,3,4,5} -b /tmp/test3.xls | tee -a /tmp/test3.txt

node2 is forcefully rebooted

log sample of /var/log/gluster/media-glusterfs-test.log

[2012-03-06 09:57:26.655328] W [rpc-clnt.c:1417:rpc_clnt_submit] 0-TEST-client-1: failed to submit rpc-request (XID: 0x2078193x Program: GlusterFS 3.1, ProgVers: 310, Proc: 30) to rpc-transport (TEST-client-1)
[2012-03-06 09:57:26.655347] E [client3_1-fops.c:1264:client3_1_finodelk_cbk] 0-TEST-client-1: remote operation failed: Transport endpoint is not connected
[2012-03-06 09:57:26.655390] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7fac44ccb568] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7fac44ccacfd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fac44ccac5e]))) 0-TEST-client-1: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2012-03-06 09:56:18.425191
[2012-03-06 09:57:26.655422] W [rpc-clnt.c:1417:rpc_clnt_submit] 0-TEST-client-1: failed to submit rpc-request (XID: 0x2078194x Program: GlusterFS 3.1, ProgVers: 310, Proc: 30) to rpc-transport (TEST-client-1)
[2012-03-06 09:57:26.655451] E [client3_1-fops.c:1264:client3_1_finodelk_cbk] 0-TEST-client-1: remote operation failed: Transport endpoint is not connected
[2012-03-06 09:57:26.655491] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7fac44ccb568] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7fac44ccacfd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fac44ccac5e]))) 0-TEST-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-03-06 09:56:40.75430
[2012-03-06 09:57:26.655516] E [client3_1-fops.c:2228:client3_1_lookup_cbk] 0-TEST-client-1: remote operation failed: Transport endpoint is not connected
[2012-03-06 09:57:26.656369] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7fac44ccb568] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7fac44ccacfd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fac44ccac5e]))) 0-TEST-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-03-06 09:56:44.573794
[2012-03-06 09:57:26.656418] W [client-handshake.c:264:client_ping_cbk] 0-TEST-client-1: timer must have expired
[2012-03-06 09:57:26.656449] I [client.c:1883:client_rpc_notify] 0-TEST-client-1: disconnected
[2012-03-06 09:57:26.657715] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.657754] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.657844] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.657864] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.658061] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.658086] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.658963] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3503340455): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.658997] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.663857] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3935430561): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.663893] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.663975] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3935430561): failed to get fd ctx. EBADFD
[2012-03-06 09:57:26.663997] W [client3_1-fops.c:4915:client3_1_finodelk] 0-TEST-client-1: failed to send the fop: File descriptor in bad state
[2012-03-06 09:57:26.664068] W [client3_1-fops.c:4868:client3_1_finodelk] 0-TEST-client-1: (-3935430561): failed to get fd ctx. EBADFD


Note: in this scenario the file system became deadlocked on the surviving node (node1).

Once node2 was returned to service, operations continued seamlessly. Shouldn't node1 have continued operating in a replica volume?

Comment 5 David Busby 2012-03-06 10:13:42 UTC
With regard to the above, I have a theory as to how this failure occurs.

An NFS mount point does not deadlock during the loss of a node, so file operations continue.

Once the downed node is returned to service, gluster does not appear to be able to handle the resulting differences in the file system, which leads to this issue.

Comment 6 Richard 2013-01-11 15:37:04 UTC
Am I correct in thinking that you have mounted node2:/public_html as /public_html on your web server?

If Node2 dies, how would your NFS client on the web server know to fail over from Node2 to Node1? Surely it's logical to assume that if Node2 dies, any NFS filesystem you have mounted from it will also die?

Does your GlusterFS FUSE Client mount fail at the same time?

Comment 7 David Busby 2013-03-11 09:34:49 UTC
Whilst I no longer have access to this setup:

Each node was addressing an NFS mount locally, such that gluster was providing a common file system for the 2 web app servers.

Comment 8 Pranith Kumar K 2013-03-21 12:22:09 UTC
Amar,
     Assigning because of the crash pointed out in comment 3.

package-string: glusterfs 3.2.5
/lib64/libc.so.6[0x3505e302d0]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/io-cache.so(ioc_open_cbk+0x9b)[0x2aaaaba40a8b]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/read-ahead.so(ra_open_cbk+0x213)[0x2aaaab835a73]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_open_cbk+0x111)[0x2aaaab624a21]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/cluster/replicate.so(afr_open_cbk+0x26b)[0x2aaaab3e3dab]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_open_cbk+0x1b1)[0x2aaaab1a73d1]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x2ba2e04cd432]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x2ba2e04cd62d]
/opt/glusterfs/3.2.5/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x2ba2e04c82a7]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaaaad70ecf]
/opt/glusterfs/3.2.5/lib64/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0x188)[0x2aaaaad71078]
/opt/glusterfs/3.2.5/lib64/libglusterfs.so.0[0x2ba2e028c9a1]
/opt/glusterfs/3.2.5/sbin/glusterfs(main+0x42f)[0x4052af]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3505e1d994]
/opt/glusterfs/3.2.5/sbin/glusterfs[0x403649]

Comment 9 Amar Tumballi 2013-04-26 09:48:37 UTC
We are aware of these crashes on the 3.2.x branch, but haven't noticed them on the 3.4.x branch yet. Please re-open the bug if the 3.4.x release (now at Alpha) doesn't fix it.

