Created attachment 1666585 [details]
Reproducer

Description of problem:

I recently noticed data corruption with a ZFS-on-Linux VM running in qemu-kvm with its storage in qcow2 on a gluster cluster. Since then I have been trying to create a reproducer that does not involve running a guest machine with ZoL and streaming gigabytes of data into it, and I think I have finally succeeded. At this point I just hope that no one can point out a bug in my reproducer code that would explain the corruption.

For the original discussion, see <https://lists.gluster.org/pipermail/integration/2020-February/000257.html>.

Version-Release number of selected component (if applicable):

I tested mostly with a Fedora 31 client, with both the distro libgfapi and git master. On the server side I used both CentOS 7 and Fedora 31 with the 7.3 releases. The production cluster where I originally witnessed this problem is running rather older versions of everything, so my impression is that the same thing happens with basically any version of the glusterfs code. I'm not sure about the libgfapi component; the problem might also be caused by glusterd.

How reproducible:

Run real.c and check the resulting data file.

Actual results:

The verifier complains.

Expected results:

The verifier does not complain.

Additional info:

Information about the reproducer: it writes a specific pattern into a data file repeatedly, spaced 2097152 (0x1000*512) bytes apart. Some of these repetitions always turn out wrong. Details about the pattern are here: <https://lists.gluster.org/pipermail/integration/2020-February/000263.html>.

I compile on Fedora 31 like this:

$ gcc -O2 -g -pthread -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -std=c11 real.c -lgfapi

The code might look a bit unwieldy and is not super clean, but it should be fairly straightforward after a few words of explanation. I also mix all kinds of integer types freely, but no huge quantities of anything are involved, so that should not cause problems.

The reproducer is intended to employ two writer threads, so it keeps two sets of data for the pwritev calls and marks each set busy after dispatching it. It then waits for a set to become available again (the completion function clears the busy bit) before dispatching the next data item. The get_worker function handles this waiting and worker-set selection. It can wait either for both sets to become idle (IDLE, nothing in flight), for any one of them to become idle (ANY), or for ANY with the additional restriction that a specific sequence number must not be in flight (required for the 8704 request, which overwrites data from 8703). A rough sketch of this selection logic is included at the end of this comment.

As this kind of asynchronous code is always a little tricky to write, I first tested it against a very simple fake AIO interface in order to gain confidence (fake.c). That version writes to a local file and always produces the correct output.

All the file, volume and host names are hard-coded, and the data file ("testfile0") needs to exist; it will be overwritten.

I also attach a crude verifier (Python 2). For a correct file, it should just output "256".
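To make the dispatch mechanism easier to follow, here is a minimal sketch of the worker selection described above, assuming two worker sets with a per-set busy flag protected by a mutex/condvar pair. The names get_worker, IDLE and ANY mirror the description; everything else (the struct layout, the forbidden_seq parameter, the locking details) is my own simplification and may not match the attached real.c exactly.

/* Minimal sketch of the worker selection described above.  The names
   get_worker, IDLE and ANY follow the description; everything else is
   a simplification and may differ from the attached real.c. */
#include <pthread.h>
#include <stdint.h>

enum wait_mode { IDLE, ANY };

struct worker {
    int busy;                 /* set while a pwritev is in flight        */
    uint64_t seq;             /* sequence number of the in-flight write  */
    /* iovecs / buffers for the request would live here */
};

static struct worker workers[2];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* The completion callback clears the busy bit and wakes the dispatcher. */
static void mark_done(struct worker *w)
{
    pthread_mutex_lock(&lock);
    w->busy = 0;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/* Wait until a worker set can be reused.  mode == IDLE waits for both
   sets to be idle; mode == ANY returns the first idle set.  If
   forbidden_seq >= 0, no set is returned while that sequence number is
   still in flight (needed for the 8704 request, which overwrites data
   written by 8703). */
static struct worker *get_worker(enum wait_mode mode, int64_t forbidden_seq)
{
    struct worker *w = NULL;

    pthread_mutex_lock(&lock);
    for (;;) {
        int idle0 = !workers[0].busy;
        int idle1 = !workers[1].busy;
        int blocked = (forbidden_seq >= 0) &&
                      ((workers[0].busy && workers[0].seq == (uint64_t)forbidden_seq) ||
                       (workers[1].busy && workers[1].seq == (uint64_t)forbidden_seq));

        if (mode == IDLE && idle0 && idle1) {
            w = &workers[0];
            break;
        }
        if (mode == ANY && !blocked && (idle0 || idle1)) {
            w = idle0 ? &workers[0] : &workers[1];
            break;
        }
        pthread_cond_wait(&cond, &lock);
    }
    w->busy = 1;
    pthread_mutex_unlock(&lock);
    return w;
}

In the real reproducer, the completion function passed to the asynchronous write is what clears the busy bit and wakes up this wait (the role of mark_done here).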
From my last run, I get this output:

('bad', ([('\x04', 41472)], 31))
('bad', ([('\x04', 41472)], 68))
('bad', ([('\x04', 41472)], 91))
('bad', ([('\x04', 41472)], 92))
('bad', ([('\x04', 41472)], 93))
('bad', ([('\x04', 41472)], 94))
('bad', ([('\x04', 41472)], 103))
('bad', ([('\x04', 41472)], 118))
('bad', ([('\x04', 41472)], 151))
('bad', ([('\x04', 41472)], 169))
('bad', ([('\x04', 41472)], 175))
('bad', ([('\x04', 41472)], 207))
('bad', ([('\x04', 41472)], 214))
('bad', ([('\x04', 41472)], 228))
256

This means that the 31st, 68th, and so on, repetitions of the pattern are wrong, each containing a block of 41472 fours instead of fives.
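For anyone who wants to inspect the corrupted regions directly (with dd, hexdump or similar), a small helper like the one below translates a repetition index from the verifier output into a byte offset. This is only a convenience sketch based on the 2097152-byte spacing described in the report; whether the reported index is 0- or 1-based depends on the verifier script, so the offset may be off by one repetition.

/* Convenience sketch: map a repetition index from the verifier output
   to a byte offset in testfile0, assuming the repetitions start at
   offset 0 and are spaced 0x1000 * 512 = 2097152 bytes apart.  The
   exact layout inside each repetition is defined by real.c and the
   linked mailing-list post. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long long spacing = 0x1000LL * 512;              /* 2097152 bytes */
    long long rep = argc > 1 ? atoll(argv[1]) : 31;  /* index from verifier */

    printf("repetition %lld starts at byte offset %lld (0x%llx)\n",
           rep, rep * spacing, rep * spacing);
    return 0;
}

For example, with the index 31 from the output above it prints an offset of 65011712 (0x3e00000).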
Created attachment 1666586 [details]
fake.c (the version with the fake AIO interface)
Created attachment 1666587 [details]
Verifier
This bug has been moved to https://github.com/gluster/glusterfs/issues/884 and will be tracked there from now on. Visit the GitHub issue URL for further details.