Created attachment 249 [details] new spec file
Here are more details. The corruption seems to happen under load. The workload is 200 or more lame encoding processes running on the mount point. The processes run normally at first and seem to freeze after a while. After the freeze, the files appear corrupted. The files are mp3, jpeg and xml files, from a few KB to 9 MB in size, and there are millions of them. When files are locked or a process is frozen, a reboot seems to fix the corruption. The files created or modified during heavy load are the ones that seem to be corrupted. The xml files hold metadata for the lame encoder, and when an xml file is corrupted, lame segfaults or dumps core. On another machine with the same gluster mount there is no corruption. That server is not under load.
Chida, what is the application used for encoding? I want to incorporate this into my tests.
(In reply to comment #3)
> Chida, what is the application used for encoding? I want to incorporate this
> into my tests.

The application is lame: http://lame.sourceforge.net/

You can launch 200+ instances of the encoding process in parallel. While they are running, create hundreds of small txt files with some content, then check whether the txt files and/or the mp3s are intact. There are other tools to verify the integrity of mp3 files, such as id3 or http://checkmate.gissen.nl/, or even a checksum. I will try to get a more accurate test case.
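A minimal Python sketch of the small-file check described above, offered as one possible way to do the checksum comparison rather than the exact test in use. It assumes the encoders are started separately; the function names, the `probe_*.txt` file names and the file contents are made up for illustration.

```python
import hashlib
import os


def write_probe_files(directory, count=100):
    """Write small text files with known content.

    Returns a dict mapping each path to the SHA-256 of what was written.
    """
    expected = {}
    for i in range(count):
        path = os.path.join(directory, f"probe_{i:04d}.txt")
        data = f"probe file {i}\n".encode() * 64
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push the data out before we trust it
        expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def verify_probe_files(expected):
    """Re-read every probe file and return the paths whose content changed."""
    damaged = []
    for path, digest in expected.items():
        with open(path, "rb") as fh:
            actual = hashlib.sha256(fh.read()).hexdigest()
        if actual != digest:
            damaged.append(path)
    return damaged
```

Call `write_probe_files()` on a directory under the mount point while the encoders are running, then call `verify_probe_files()` on the returned dict once the load has built up. An empty list means every probe file still matches.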
Files appear as zero bytes on the mount point. When the files are closed and reopened, `\n' characters are seen. After a reboot, everything seems okay. Attaching the client vol file.
Here are more details: it happens when lame is encoding mp3 to mp3 in many directories. The sources and destinations are scattered among many directories more or less at random. At some point under heavy load, access to certain files on the mount point is tainted. Some files read back as a null for every byte in the file, or read back as empty even though ls -la returns the expected result. This only affects some files on the mount point; other files work fine. At the same time, other client machines that mount the same gluster mount point through the same gluster node have no problem: the files that read as corrupted on the failing client are accessible as normal from the other clients. Some processes on the failing client are zombied and disk bound. A reboot of the client machine fixes the problem until it hits heavy load again.
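To list the affected files on a client, something like the following Python sketch could be used. It is only an illustration of the two symptoms described above (all bytes null, or an empty read when the reported size is non-zero); the function names and the verdict strings are invented here.

```python
import os


def classify(path, chunk_size=1 << 20):
    """Return 'ok', 'empty-read' or 'all-nulls' for one file.

    'empty-read': stat reports a non-zero size but nothing can be read.
    'all-nulls':  data can be read but every byte is 0x00.
    """
    stat_size = os.stat(path).st_size
    bytes_read = 0
    only_nulls = True
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            bytes_read += len(chunk)
            if chunk.count(b"\x00") != len(chunk):
                only_nulls = False
    if stat_size > 0 and bytes_read == 0:
        return "empty-read"
    if bytes_read > 0 and only_nulls:
        return "all-nulls"
    return "ok"


def scan(root):
    """Walk a directory tree and return {path: verdict} for suspect files."""
    suspects = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            verdict = classify(path)
            if verdict != "ok":
                suspects[path] = verdict
    return suspects
```

Running `scan()` on the same directory from the failing client and from a healthy client would show which files differ between the two views.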
*** This bug has been marked as a duplicate of bug 815 ***