Created attachment 1666585 [details]
Reproducer

Description of problem:

I recently noticed data corruption with a ZFS-on-Linux VM running in qemu-kvm with its storage in qcow2 on a gluster cluster. Since then I have been trying to create a reproducer that does not involve running a guest machine with ZoL and streaming gigabytes of data into it, and I think I have finally succeeded. At this point I just hope that no one can point out a bug in my reproducer code that would explain the corruption.

For the original discussion, see <https://lists.gluster.org/pipermail/integration/2020-February/000257.html>.

Version-Release number of selected component (if applicable):

I tested mostly with a Fedora 31 client, with both the distro libgfapi and git master. On the server side I used both CentOS 7 and Fedora 31 with the 7.3 releases. The production cluster where I originally witnessed this problem is running rather older versions of everything, so my impression is that the same thing happens with basically any version of the glusterfs code. I'm not sure about the libgfapi component; the problem might also be caused by glusterd.

How reproducible:

Run real.c and check the resulting data file.

Actual results:

The verifier complains.

Expected results:

The verifier does not complain.

Additional info:

Information about the reproducer: it writes a specific pattern into a data file repeatedly, spaced 2097152 (0x1000*512) bytes apart. Some of these repetitions always turn out wrong. Details about the pattern are here: <https://lists.gluster.org/pipermail/integration/2020-February/000263.html>.

I compile on Fedora 31 like this:

$ gcc -O2 -g -pthread -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -std=c11 real.c -lgfapi

The code might look a bit unwieldy and is not super clean, but it should be fairly straightforward after a few words of explanation. I also mix all kinds of integer types freely, but no huge quantities of anything are involved, so that should not cause problems.

The reproducer is intended to employ two writer threads, so it keeps two sets of data for the pwritev calls and marks each set busy after dispatching it. It then waits for a set to become available again (the completion function clears the busy bit) before dispatching the next data item. The get_worker function handles this waiting and worker-set selection. It can wait either for both sets to become idle (IDLE, nothing in flight), for any one of them to become idle (ANY), or for ANY with the additional restriction that a specific sequence number must not be in flight (required for the 8704 request, which overwrites data from 8703). A rough sketch of this selection logic is included at the end of this comment.

As this kind of asynchronous code is always a little tricky to write, I first tested it against a very simple fake AIO interface in order to gain confidence (fake.c). That version writes to a local file and always produces the correct output.

All the file, volume and host names are hard-coded, and the data file ("testfile0") needs to exist; it will be overwritten.

I also attach a crude verifier (Python 2). For a correct file, it should just output "256".
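To make the dispatch mechanism easier to follow, here is a minimal sketch of the worker selection described above, assuming two worker sets with a per-set busy flag protected by a mutex/condvar pair. The names get_worker, IDLE and ANY mirror the description; everything else (the struct layout, the forbidden_seq parameter, the locking details) is my own simplification and may not match the attached real.c exactly.

/* Minimal sketch of the worker selection described above.  The names
   get_worker, IDLE and ANY follow the description; everything else is
   a simplification and may differ from the attached real.c. */
#include <pthread.h>
#include <stdint.h>

enum wait_mode { IDLE, ANY };

struct worker {
    int busy;                 /* set while a pwritev is in flight        */
    uint64_t seq;             /* sequence number of the in-flight write  */
    /* iovecs / buffers for the request would live here */
};

static struct worker workers[2];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* The completion callback clears the busy bit and wakes the dispatcher. */
static void mark_done(struct worker *w)
{
    pthread_mutex_lock(&lock);
    w->busy = 0;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/* Wait until a worker set can be reused.  mode == IDLE waits for both
   sets to be idle; mode == ANY returns the first idle set.  If
   forbidden_seq >= 0, no set is returned while that sequence number is
   still in flight (needed for the 8704 request, which overwrites data
   written by 8703). */
static struct worker *get_worker(enum wait_mode mode, int64_t forbidden_seq)
{
    struct worker *w = NULL;

    pthread_mutex_lock(&lock);
    for (;;) {
        int idle0 = !workers[0].busy;
        int idle1 = !workers[1].busy;
        int blocked = (forbidden_seq >= 0) &&
                      ((workers[0].busy && workers[0].seq == (uint64_t)forbidden_seq) ||
                       (workers[1].busy && workers[1].seq == (uint64_t)forbidden_seq));

        if (mode == IDLE && idle0 && idle1) {
            w = &workers[0];
            break;
        }
        if (mode == ANY && !blocked && (idle0 || idle1)) {
            w = idle0 ? &workers[0] : &workers[1];
            break;
        }
        pthread_cond_wait(&cond, &lock);
    }
    w->busy = 1;
    pthread_mutex_unlock(&lock);
    return w;
}

In the real reproducer, the completion function passed to the asynchronous write is what clears the busy bit and wakes up this wait (the role of mark_done here).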
From my last run, I get this output:

('bad', ([('\x04', 41472)], 31))
('bad', ([('\x04', 41472)], 68))
('bad', ([('\x04', 41472)], 91))
('bad', ([('\x04', 41472)], 92))
('bad', ([('\x04', 41472)], 93))
('bad', ([('\x04', 41472)], 94))
('bad', ([('\x04', 41472)], 103))
('bad', ([('\x04', 41472)], 118))
('bad', ([('\x04', 41472)], 151))
('bad', ([('\x04', 41472)], 169))
('bad', ([('\x04', 41472)], 175))
('bad', ([('\x04', 41472)], 207))
('bad', ([('\x04', 41472)], 214))
('bad', ([('\x04', 41472)], 228))
256

This means that the 31st, 68th, and so on, repetitions of the pattern are wrong, each containing a block of 41472 fours instead of fives.
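For anyone who wants to inspect the corrupted regions directly (with dd, hexdump or similar), a small helper like the one below translates a repetition index from the verifier output into a byte offset. This is only a convenience sketch based on the 2097152-byte spacing described in the report; whether the reported index is 0- or 1-based depends on the verifier script, so the offset may be off by one repetition.

/* Convenience sketch: map a repetition index from the verifier output
   to a byte offset in testfile0, assuming the repetitions start at
   offset 0 and are spaced 0x1000 * 512 = 2097152 bytes apart.  The
   exact layout inside each repetition is defined by real.c and the
   linked mailing-list post. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long long spacing = 0x1000LL * 512;              /* 2097152 bytes */
    long long rep = argc > 1 ? atoll(argv[1]) : 31;  /* index from verifier */

    printf("repetition %lld starts at byte offset %lld (0x%llx)\n",
           rep, rep * spacing, rep * spacing);
    return 0;
}

For example, with the index 31 from the output above it prints an offset of 65011712 (0x3e00000).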
Created attachment 1666586 [details]
fake.c (the version with the fake AIO interface)
Created attachment 1666587 [details]
Verifier
This bug has been moved to https://github.com/gluster/glusterfs/issues/884 and will be tracked there from now on. Visit the GitHub issue URL for further details.