Bug 763802 (GLUSTER-2070) - AFR not being triggered on some files
Summary: AFR not being triggered on some files
Keywords:
Status: CLOSED WORKSFORME
Alias: GLUSTER-2070
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-11-09 18:07 UTC by Jeff Darcy
Modified: 2011-04-13 08:28 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: DNR
CRM:
Verified Versions:



Description Jeff Darcy 2010-11-09 18:07:51 UTC
I have a very simple two-node AFR setup, in which I start five test processes on each node.  Each test process opens a random file (from among 5K), writes to it for a while, then closes it and moves on.  I leave these running and use iptables to simulate a network partition.  After a while I let the two nodes reunite, and check (via a scan of the xattrs on the server bricks) to see which files need self-heal.  A stat(2) on each of those files generally causes self-heal, but there are a few stragglers, and nothing I do prior to an unmount/remount on either node seems to make self-heal happen.  Their xattrs and contents remain out of sync indefinitely.

The configuration is totally standard (from "gluster volume create") and there's no background self-heal activity at this point.  Nothing particularly interesting is in the logs.  I suspect that one of the "performance" translators is keeping the stat(2) from triggering self-heal.  I looked for similar bugs, but none quite fit; #1041 and #1365 seem closest.
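
For reference, the xattr scan and stat(2) trigger described above can be approximated with a shell sketch like the one below.  The brick and mount paths and the non-zero test on the trusted.afr.* changelog xattrs are assumptions for illustration, not the reporter's actual script.

  # Hypothetical sketch: list files on a brick whose AFR changelog xattrs
  # (trusted.afr.<volname>-client-N) are non-zero, i.e. still flagged as
  # needing self-heal, then stat them through the client mount to try to
  # trigger self-heal.  Paths are assumptions, not taken from the report.
  BRICK=/bricks/afr
  MOUNT=/mnt/afr
  find "$BRICK" -type f | while read -r f; do
      if getfattr -d -m 'trusted.afr.' -e hex "$f" 2>/dev/null \
             | grep -qE '=0x0*[1-9a-f]'; then
          stat "$MOUNT/${f#$BRICK/}" > /dev/null   # should trigger self-heal
      fi
  done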

Comment 1 Anand Avati 2010-11-10 02:05:15 UTC
Was this on 3.1.0? There was an issue in 3.1.0 where a reconnect wouldn't succeed if a directory was held open at the time of disconnection. This was fixed quickly after 3.1.0 and will be pushed out in 3.1.1.

Avati

Comment 2 Jeff Darcy 2010-11-10 10:17:49 UTC
(In reply to comment #1)
> Was this on 3.1.0?

This was on trunk, as of a few days ago.  Do you have a bug or patch number for that fix?  I can check to make sure I have it.

Overall, recovery after this kind of disconnection seems a bit unreliable.  In addition to the problem noted here, I've also seen cases where writing to one of the "broken" files after the reconnect hangs unless the volume is stopped and restarted.  I'm still trying to characterize that one; I mention it here only in case it might be another symptom of the same underlying issue.

Comment 3 Jeff Darcy 2010-11-10 13:44:39 UTC
I rebuilt, reinstalled and reconfigured from scratch using a new git clone.  I still see files that won't heal in response to a stat/fstat.  When I actually "touch" those files it has no effect before a remount, and hangs afterward.  In other words, no change.  There is no unusual configuration beyond what is generated by the following command:

  gluster volume create <volname> replica 2 host1:/bricks/afr host2:/bricks/afr

After that I start and mount.  Everything works normally unless/until I simulate a network partition.  While partitioned, I let each node "dirty" a few more files by writing only to the local copy, then kill the tests and restore the network.  The nodes seem to reconnect almost immediately (implying that connection retries are happening rather fast BTW) and then I check for "injured" files.
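
As a rough illustration of the sequence in this comment, the steps might look like the sketch below.  The volume name (testvol), mount point, and the particular iptables rules are assumptions for illustration; the report does not specify them.

  # Hypothetical reproduction outline (names and addresses are illustrative).
  gluster volume start testvol                    # after the create above
  mount -t glusterfs host1:/testvol /mnt/afr      # on each node

  # Simulate the partition on host1 by dropping traffic to/from host2,
  # let each side keep writing to its local copy, then lift the rules.
  iptables -A INPUT  -s host2 -j DROP
  iptables -A OUTPUT -d host2 -j DROP
  # run the test load on both sides for a while, then restore the network
  iptables -D INPUT  -s host2 -j DROP
  iptables -D OUTPUT -d host2 -j DROP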

Comment 4 Jeff Darcy 2010-11-12 14:40:54 UTC
From Freenode/#gluster today, possibly related.


[11:49] <Disco> but my nodes are out of sync after that
[11:52] <Disco> is there any command to force consistency check ?
[11:52] <EdWyse_Home1> Today, ls -laR {mountpoint}
[11:54] <Disco> well, tried, no effect
[11:54] <Disco> gluster peer status returns correct results
[11:54] <Disco> both nodes are up
[11:58] <EdWyse_Home1> I wonder if you're seeing bug 763802

Comment 5 Anand Avati 2010-12-01 12:23:39 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Was this on 3.1.0?
> 
> This was on trunk, as of a few days ago.  Do you have a bug or patch number for
> that fix?  I can check to make sure I have it.
> 
> Overall, recovery after this kind of disconnection seems a bit unreliable.  In
> addition to the problem noted here, I've also seen cases where writing to one
> of the "broken" files after the reconnect hangs unless the volume is stopped
> and restarted.  I'm still trying to characterize that one; I mention it here
> only in case it might be another symptom of the same underlying issue.

This seems to be an issue with stale locks. It sounds like the network partition happened in the middle of a transaction and there are stale locks left over on the server. Clients are generally more proactive about detecting network splits, as they keep tabs on pending frames, time out, and quickly reconnect. Servers, on the other hand, rely on TCP keepalive to disconnect 'broken' client connections (unless, of course, there is a TCP RST) and only then clean up the connection state (which includes open fds and held locks).

The problem here seems to be that the client is reconnecting before the server has fired the TCP keepalive on the old connection, so the new connection has 'inherited' the connection state of the "broken" connection at the server (as both connections carry the same process uuid). This behaviour, in which a new connection inherits connection state from an older version of itself that has not yet been detected as disconnected, is still lingering around for historical reasons, though it is no longer needed.

This problem was "fixed" (using that word hesitantly) in patch b2f195720b27d9e69f7b851478515781e5786469 (git describe: v3.1.0-50-gb2f1957), where the keepalive times were made more aggressive on the server side so that clients and servers reach the same decision about the state of a connection (both at around 40 seconds of non-response). This patch, however, only narrows the window for the race in which server and client reach conflicting conclusions; it does not eliminate it, though the reduction is significant.
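
For context, a TCP keepalive detection time of roughly 40 seconds comes from the idle time plus the number of probes times the probe interval. GlusterFS sets these options per socket; the sketch below only shows the system-wide Linux sysctl equivalents with assumed values that add up to about 40 seconds, purely to illustrate the arithmetic.

  # Illustrative only: detection time ~= keepalive_time + probes * intvl
  #                                    = 20s            + 10     * 2s   = 40s
  # These are the global Linux knobs, not the per-socket options the
  # GlusterFS server actually uses; the values are assumptions.
  sysctl -w net.ipv4.tcp_keepalive_time=20
  sysctl -w net.ipv4.tcp_keepalive_intvl=2
  sysctl -w net.ipv4.tcp_keepalive_probes=10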

I will be happy to take the connection-state-inheritance feature away in the release branch (and thereby make this race-proof), provided we can verify and confirm that the root cause of the bug is actually the theory stated above. To verify that, please fast-forward your tree to v3.1.0-50-gb2f1957, or better yet, upgrade to 3.1.1, and during your testing make sure you do not re-join the network split until there is a disconnection line for the client in the server log.

Avati
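
One minimal way to carry out the verification Avati asks for is sketched below: before lifting the partition, wait until the brick log shows that the server has dropped the stale client connection. The log path and the message text matched are assumptions and vary by installation and release.

  # Hypothetical check before healing the partition: wait until the brick
  # (server) log records that the stale client connection was torn down,
  # so the reconnect cannot inherit its open fds and held locks.
  # Log file name and message wording are assumptions.
  BRICK_LOG=/var/log/glusterfs/bricks/bricks-afr.log
  until grep -qi 'disconnect' "$BRICK_LOG"; do
      sleep 5
  done
  echo "server has noticed the disconnection; safe to restore the network"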

Comment 6 Pranith Kumar K 2011-02-04 03:58:34 UTC
Jeff Darcy,
      Could you please let us know whether you are still facing this problem on the 3.1.x releases? If so, could you please confirm that the root cause is what Avati described above.

Pranith

Comment 7 Jeff Darcy 2011-02-04 13:43:42 UTC
I haven't seen this for a while, nor have I seen any reports from other users, and I don't have the time/resources to do an exhaustive re-test right now.  It's OK by me if we close it.  If it - or something like it - does recur, we can deal with it then.

Comment 8 Amar Tumballi 2011-04-13 05:28:15 UTC
As it's not happening now, I am marking it DNR (document not required). I will update the known issues (or FAQ) section if the bug is reopened.

