I mentioned this several times before. The problem went away in Firefox 3.0.11 for some reason, but it has now returned with FF 3.0.12. The problem doesn't arise with /home on any other file system (ext3, NFS, GFS); it only arises with /home on gluster. Since the problem has returned with the latest FF, I am filing a bug report here.

To reproduce:
1) Mount /home on gluster with afr/replicate.
2) Fire up FF.
3) Save a bookmark.
4) Exit FF.
5) Start FF again - the bookmark will not have been saved.

This is reproducible 100% of the time on my CentOS 5 systems. See bug 761858 for volume spec files.
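For anyone trying to reproduce this, the steps above look roughly like the following. This is a sketch: the volfile path is an assumed example, and the volfile's top-level volume is assumed to use cluster/replicate (afr).

```sh
# Mount /home from a GlusterFS replicate (afr) volume.
# /etc/glusterfs/client.vol is an assumed client volfile path.
glusterfs -f /etc/glusterfs/client.vol /home

# Reproduce: save a bookmark in Firefox, quit, then restart.
firefox    # save a bookmark, then exit Firefox
firefox    # on restart the bookmark is missing

umount /home    # clean up
```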
The problem appears to have gone away with 2.0.6. Perhaps this bug should be closed for now, and if the issue arises again in future versions, I can re-open it.
I spoke too soon - the problem is still present, it just doesn't seem to happen _every_ time, only most of the time.
It would appear that mounting with direct-io-mode=off solves the problem. I haven't been able to reproduce this issue since I made this change.
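For reference, the workaround being tried here is the fuse direct-I/O mount option. With the 2.0.x mount helper it would look something like this (the volfile path is an assumed example, and the option value follows the wording above):

```sh
# Disable direct I/O on the fuse mount so writes go through the
# kernel page cache. Volfile path is an assumed example.
mount -t glusterfs -o direct-io-mode=off /etc/glusterfs/client.vol /home
```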
I spoke too soon again - the problem still occurs even with direct-io-mode=off. :(
Is this seen on GlusterFS 2.0.8 too? Can it be reproduced easily?
I haven't had it come up on 2.0.8 yet, but I only upgraded yesterday so it's not very conclusive.
I can now confirm that the issue still happens with 2.0.8, it just doesn't seem to happen as regularly. The problem appears to be related to I/O to FF's SQLite files.
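Since the symptom points at SQLite I/O, a minimal way to exercise a similar write pattern outside Firefox is a script like this. It is only a sketch: point the directory argument at the gluster mount to test; a temp dir is used here so the script runs anywhere.

```python
import os
import sqlite3
import tempfile

def bookmark_roundtrip(directory):
    """Write a row the way Firefox writes bookmarks (a journaled
    transaction), close the database, reopen it, and check that
    the row survived."""
    path = os.path.join(directory, "places_test.sqlite")
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS bookmarks (url TEXT)")
    con.execute("INSERT INTO bookmarks VALUES ('http://example.com/')")
    con.commit()    # SQLite creates and deletes a -journal file here
    con.close()

    con = sqlite3.connect(path)
    rows = con.execute("SELECT url FROM bookmarks").fetchall()
    con.close()
    return rows

# Replace tempfile.mkdtemp() with a directory on the gluster mount
# (e.g. somewhere under /home) to test the affected file system.
print(bookmark_roundtrip(tempfile.mkdtemp()))
```

If the returned rows are empty or the -journal file lingers after close, the file system is mishandling the journaled write.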
We've been trying to reproduce this but it remains elusive. Can you give us a few more details?
1) The volume spec files on the other bug (126) seem to indicate server-side replicate. Did you reproduce this on 2.0.8 with server-side afr? If not, what volume files were used?
2) Were your servers on the same machine or on different physical machines (it shouldn't matter, but stranger things have happened)?
Also, we suspect that this is an afr issue. Can you confirm that by removing all performance translators both on the server and the client and then reproducing it?
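For reference, a stripped-down client volfile with replicate and no performance translators would look something like this (GlusterFS 2.0.x volfile syntax; the hostnames and subvolume names are assumptions, not taken from the attached specs):

```
volume home1
  type protocol/client
  option transport-type tcp
  option remote-host server1        # assumed hostname
  option remote-subvolume brick
end-volume

volume home2
  type protocol/client
  option transport-type tcp
  option remote-host server2        # assumed hostname
  option remote-subvolume brick
end-volume

volume home-replicate
  type cluster/replicate
  subvolumes home1 home2
end-volume
```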
All servers and clients are separate physical machines. As you can see from the volume spec files I attached, there are no performance translators used anywhere.

At the same time I upgraded from the patched fuse 2.7.4-glfs11 kernel module (ABI 7.8) to the vanilla RHEL5 fuse kernel module in the 2.6.18-164.6.1.el5 kernel (ABI 7.10), and GlusterFS from 2.0.6 to 2.0.8. With this combination the issue isn't as easy to reproduce as with the previous setup, but it still occurs eventually (I noticed things broke after a few hours of general usage). Deleting places.sqlite* resolves the problem, but it just happens again later. Whenever the file places.sqlite-journal is present, it indicates that things have broken; from there onward saving new bookmarks doesn't work, and the back button in FF becomes permanently disabled.

Another thing of note is that stale .parentlock files frequently get left behind on FF shutdown (the same happens with Thunderbird), and the only way to get the application to start up again is to clear them manually. I suspect all of the above issues may be related. AIO, perhaps?
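A small script along these lines can spot the tell-tale leftovers described above (the profile directory used in the demo is an assumed example path):

```python
import os

# Files whose presence indicates the broken state described above:
# a lingering SQLite journal, or a stale Firefox/Thunderbird lock.
SUSPECT_NAMES = ("places.sqlite-journal", ".parentlock")

def find_stale(profile_dir):
    """Return paths of leftover journal/lock files under a profile dir."""
    stale = []
    for root, _dirs, files in os.walk(profile_dir):
        for name in files:
            if name in SUSPECT_NAMES:
                stale.append(os.path.join(root, name))
    return stale

if __name__ == "__main__":
    # ~/.mozilla/firefox is the usual profile root; adjust as needed.
    for path in find_stale(os.path.expanduser("~/.mozilla/firefox")):
        print("stale:", path)
```

Deleting the reported files (with the application closed) matches the manual cleanup described above.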
I have some new information that may be relevant to this. I recently switched from using glfs on the clients to NFS. The servers are a single-process glfs client-server setup. Due to fuse limitations (the kernel NFS server requires the fuse patch), I am using unfsd (user-space NFS server). Note: unfsd doesn't support file locking! Ever since I switched to this setup (about a week now), this issue hasn't re-occurred. Previously it would occur within a few hours. This leads me to suspect one of the following:

1) Having all access happen via one node (rather than having the client push writes to all 3 nodes) makes the problem go away. This sort of makes sense because the node I'm accessing via NFS is the favorite-child.

2) The extra buffering/ordering that happens due to the NFS conduit solves the problem and makes sure everything is committed.

I have, however, just noticed something that I didn't notice before: a weird time-out when connecting to one of the server nodes. This was in the logs:

[2009-12-31 03:03:58] E [client-protocol.c:457:client_ping_timer_expired] home3: Server 10.2.0.13:6997 has not responded in the last 10 seconds, disconnecting.

I don't know what caused this, but if this sort of thing happens with any regularity (I can't see anything similar in the logs except at the known reboot times, but this last one definitely wasn't caused by a reboot), could it be that client-side AFR doesn't handle this scenario gracefully for open files, and writes can be mis-ordered or dropped when a server leaves and re-joins? Since NFS always connects to the same AFR node (which also happens to be the primary node and read-subvolume), it is possible that this error condition is avoided. How plausible does this sound?
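To check whether these ping time-outs recur, the client log can be scanned for the expiry message. A sketch that pulls out the timestamp, volume, and server (the message format is taken from the single line quoted above, so treat the pattern as an assumption for other GlusterFS versions):

```python
import re

# Matches lines like:
# [2009-12-31 03:03:58] E [client-protocol.c:457:client_ping_timer_expired]
# home3: Server 10.2.0.13:6997 has not responded in the last 10 seconds, ...
TIMEOUT_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] E \[.*client_ping_timer_expired\] "
    r"(?P<volume>\S+): Server (?P<server>\S+) has not responded"
)

def ping_timeouts(lines):
    """Yield (timestamp, volume, server) for every ping-timeout line."""
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            yield m.group("when"), m.group("volume"), m.group("server")

# Demo on the log line quoted above; in practice, pass an open
# client log file (path varies by setup) instead of this list.
sample = ("[2009-12-31 03:03:58] E [client-protocol.c:457:"
          "client_ping_timer_expired] home3: Server 10.2.0.13:6997 "
          "has not responded in the last 10 seconds, disconnecting.")
print(list(ping_timeouts([sample])))
```

If more than the one known occurrence shows up, that would support the disconnect/re-join theory.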
Here is another config I tried. unfsd exporting glfs resulted in no similar failures over the month that I used it. I have now switched to using server-side AFR and importing only a single volume on the client, because I figured it's conceptually closer to the way glfs+unfsd works. Firefox's and Thunderbird's .parentlock files don't get deleted when the application closes; this is 100% reproducible, it happens every time. I haven't seen the bookmark/history database get corrupted yet, though (but it has only been a few hours since I switched to this setup). I have attached the new server and client configs I use for server-side AFR. The .parentlock file was never left behind when the clients were connecting via the unfsd export.
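For comparison with the client-side layout, a server-side AFR arrangement along these lines is what is being described (a sketch only; hostnames, brick paths, and volume names are assumptions - see the attached configs for the real ones):

```
# Server volfile (sketch): replicate runs on the server, and the
# client imports a single volume.
volume brick
  type storage/posix
  option directory /export/home     # assumed brick path
end-volume

volume remote-home2
  type protocol/client
  option transport-type tcp
  option remote-host server2        # assumed peer hostname
  option remote-subvolume brick
end-volume

volume home-afr
  type cluster/replicate
  subvolumes brick remote-home2
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes home-afr
  option auth.addr.home-afr.allow *
end-volume
```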
Pavan to verify and change the status accordingly.
*** This bug has been marked as a duplicate of bug 6 ***