Bug 761918 (GLUSTER-186)

Summary: With /home on afr/replicate, Firefox bookmarks don't get saved
Product: [Community] GlusterFS
Component: replicate
Reporter: Gordan Bobic <gordan>
Assignee: Pavan Vilas Sondur <pavan>
Status: CLOSED DUPLICATE
Severity: medium
Priority: low
Version: 2.0.9
CC: amarts, gluster-bugs, gordan, pavan
Hardware: All
OS: Linux
Attachments:
  server1 volume spec
  server2 volume spec
  server3 volume spec
  client volume spec
  Server-side AFR
  Client for server-side AFR

Description Gordan Bobic 2009-08-03 23:40:01 UTC
I mentioned this several times before. The problem went away in Firefox 3.0.11 for some reason, but it has now returned with FF 3.0.12. The problem doesn't arise with /home on any other file system (ext3, NFS, GFS). It only arises with /home on gluster. Since the problem has returned with the latest FF, I am filing a bug report here.

To reproduce:
1) Mount /home on gluster with afr/replicate.
2) Fire up FF.
3) Save a bookmark.
4) Exit FF.
5) Start FF again - bookmark will not have been saved.
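
For reference, a replicate mount of the kind step 1 describes might be issued like this (the volfile path is illustrative, not taken from this report):

```
# Hypothetical: mount /home from a client volfile that stacks
# cluster/replicate (AFR) over several protocol/client subvolumes.
glusterfs --volfile=/etc/glusterfs/client.vol /home
```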

This is reproducible 100% of the time on my CentOS 5 systems.

See bug 761858 for volume spec files.

Comment 1 Gordan Bobic 2009-08-22 23:06:55 UTC
The problem appears to have gone away with 2.0.6. Perhaps this bug should be closed for now, and if the issue arises again in future versions, I can re-open it.

Comment 2 Gordan Bobic 2009-08-25 20:29:33 UTC
I spoke too soon - the problem is still present, it just doesn't seem to happen _every_ time, only most of the time.

Comment 3 Gordan Bobic 2009-09-21 20:02:22 UTC
It would appear that mounting with direct-io-mode=off solves the problem. I haven't been able to reproduce this issue since I made this change.
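
For the record, the direct-io toggle mentioned here is passed at mount time; a typical invocation (volfile path illustrative) would be something like:

```
# Disable FUSE direct I/O for the whole mount
glusterfs --volfile=/etc/glusterfs/client.vol --direct-io-mode=off /home
```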

Comment 4 Gordan Bobic 2009-09-24 05:27:15 UTC
I spoke too soon again - the problem still occurs even with direct-io-mode=off. :(

Comment 5 Pavan Vilas Sondur 2009-11-19 03:59:17 UTC
Is this seen on GlusterFS 2.0.8 too? Can it be reproduced easily?

Comment 6 Gordan Bobic 2009-11-19 09:56:47 UTC
I haven't had it come up on 2.0.8 yet, but I only upgraded yesterday so it's not very conclusive.

Comment 7 Gordan Bobic 2009-11-19 15:33:58 UTC
I can now confirm that the issue still happens with 2.0.8, it just doesn't seem to happen as regularly. The problem appears to be related to I/O to FF's SQLite files.

Comment 8 Vikas Gorur 2009-11-19 17:21:43 UTC
We've been trying to reproduce this but it remains elusive. Can you give us a few more details?

1) The volume spec files on the other bug (126) seem to indicate server-side replicate. Did you reproduce this with 2.0.8 with server-side afr? If not, what were the volume files used?

2) Were your servers on the same machine or on different physical machines (shouldn't matter, but stranger things happen)?

Comment 9 Vikas Gorur 2009-11-19 17:26:01 UTC
Also, we suspect that this is an afr issue. Can you confirm that by removing all performance translators both on the server and the client and then reproducing it?
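
A stripped-down client volfile of the sort being asked for, with only protocol/client and cluster/replicate and no performance translators, might look like this (volume names and addresses are placeholders, not the reporter's actual config):

```
volume home1
  type protocol/client
  option transport-type tcp
  option remote-host 10.2.0.11
  option remote-subvolume brick
end-volume

volume home2
  type protocol/client
  option transport-type tcp
  option remote-host 10.2.0.12
  option remote-subvolume brick
end-volume

volume home-replicate
  type cluster/replicate
  subvolumes home1 home2
end-volume
```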

Comment 10 Gordan Bobic 2009-11-19 21:16:50 UTC
Created attachment 105 [details]
A newer version of the 'sl' file

Comment 11 Gordan Bobic 2009-11-19 21:17:16 UTC
Created attachment 106 [details]
Notes from my search for a fix

Comment 12 Gordan Bobic 2009-11-19 21:17:57 UTC
Created attachment 107 [details]
patch to add "emptycheck" to pam_pwdb

Comment 13 Gordan Bobic 2009-11-19 21:18:15 UTC
Created attachment 108 [details]
new patch for lpr-0.50

Comment 14 Gordan Bobic 2009-11-19 21:27:15 UTC
All servers and clients are separate physical machines. As you can see from the volume spec files I attached, there are no performance translators used anywhere.

At the same time I upgraded from the patched fuse 2.7.4-glfs11 kernel module (ABI 7.8) to the vanilla RHEL5 fuse kernel module in the 2.6.18-164.6.1.el5 kernel (ABI 7.10), and GlusterFS from 2.0.6 to 2.0.8. With this combination the issue isn't as easy to reproduce as with the previous setup, but it still occurs eventually (I noticed things broke after a few hours of general usage).

Deleting places.sqlite* resolves the problem, but it just happens again later. Whenever the file places.sqlite-journal is present, it indicates that things have broken: from there onward, saving new bookmarks doesn't work and the back button in FF becomes permanently disabled.

Another thing of note is that frequently the stale .parentlock files get left behind on FF shutdown (same happens with Thunderbird), and the only way to get it to start up again is to clear them manually.

I suspect all of the above issues may be related. AIO, perhaps?

Comment 15 Gordan Bobic 2009-12-31 00:25:41 UTC
I have some new information that may be relevant. I recently switched from using glfs on the clients to NFS. The servers are a single-process glfs client-server setup. Due to fuse limitations (kernel NFS requires the patched fuse module), I am using unfsd (a user-space NFS server). Note: unfsd doesn't support file locking!
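
The unfsd export for such a setup is a one-line /etc/exports entry; this is an illustrative entry with made-up addresses, not the reporter's actual config:

```
# /etc/exports for unfsd: export the GlusterFS mount point read-write
/home 10.2.0.0/24(rw,no_root_squash)
```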

Ever since I switched to this setup (about a week ago), this issue hasn't recurred. Previously it would occur within a few hours. This leads me to suspect one of the following:

1) Having all access happen via one node (rather than having the client push writes to all 3 nodes) makes the problem go away. This sort of makes sense because the node I'm accessing via NFS is the favorite-child.

2) The extra buffering/ordering that happens due to the NFS conduit solves the problem and makes sure everything is committed.

I have, however, just noticed something new: a weird time-out when connecting to one of the server nodes. This was in the logs:

[2009-12-31 03:03:58] E [client-protocol.c:457:client_ping_timer_expired] home3: Server 10.2.0.13:6997 has not responded in the last 10 seconds, disconnecting.

I don't know what caused this, but if this sort of thing happens with any regularity (I can't see anything similar in the logs except the known reboot times, but this last one definitely wasn't caused by a reboot), could it be that client AFR doesn't handle this scenario gracefully for open files, leading to writes being mis-ordered or dropped when a server leaves and re-joins? Since NFS always connects to the same AFR node (which also happens to be the primary node and read-subvolume), it is possible that this error condition is avoided.

How plausible does this sound?
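
For reference, the 10-second ping timer in the log line above is governed by a protocol/client setting in some GlusterFS releases; if the option is available in 2.0.x it would be set roughly like this (option name and value should be verified against the release, and the volume shown is illustrative):

```
volume home3
  type protocol/client
  option transport-type tcp
  option remote-host 10.2.0.13
  option ping-timeout 30   # assumed option name; the log shows a 10s default
end-volume
```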

Comment 16 Gordan Bobic 2010-01-06 09:50:37 UTC
Here is another config I tried:

unfsd exporting glfs resulted in no similar failures over the month that I used it.

I have now switched to using server-side AFR and importing only a single volume on the client, because I figured it's conceptually more similar to the way glfs+unfsd works.

Firefox's and Thunderbird's .parentlock files don't get deleted when the application closes. This is 100% reproducible; it happens every time. I haven't seen the bookmark/history database get corrupted yet, though (but it has only been a few hours since I switched to this setup).

I have attached the new server and client configs I use for server-side AFR.

.parentlock file was never left behind when the clients were connecting via the unfsd export.

Comment 17 Gordan Bobic 2010-01-06 09:51:45 UTC
Created attachment 131 [details]
updated test program

Comment 18 Gordan Bobic 2010-01-06 09:52:17 UTC
Created attachment 132 [details]
Patch to allow ping to lookup names from IP addresses

Comment 19 Amar Tumballi 2010-03-09 08:41:42 UTC
Pavan to verify and change the status accordingly.

Comment 20 Pavan Vilas Sondur 2010-03-23 02:48:32 UTC

*** This bug has been marked as a duplicate of bug 6 ***