Description of problem:
When setting up a Gluster volume with default volume options, the volume causes data corruption when used as a storage target for the PostgreSQL RDBMS.

Version-Release number of selected component (if applicable):
RHGS 3.3

How reproducible:
Running simple database transactions produces warnings and errors.

Steps to Reproduce:
1. Install RHGS 3.3 on two nodes (node1, node2) with 2 bricks each
2. Set up a 2-node cluster and create a 2x2 volume with gdeploy
3. Set up a RHEL7 client and install postgresql-server and postgresql-contrib from the RHEL repo (the latter is required for the "pgbench" utility)
4. Initialize the database system:
# mount -t glusterfs node1:/vol01 /var/lib/pgsql
# postgresql-setup initdb
# systemctl start postgresql.service
5. Set up a benchmark database:
# su - postgres
# createdb bench-db
6. Run the benchmark:
# pgbench -i -s 15 bench-db
# pgbench -c 4 -j 2 -T 120 bench-db

Actual results:
The following errors and warnings come up, indicating that data is corrupted:

starting vacuum...end.
Client 1 aborted in state 12: ERROR: could not read block 2 in file "base/16384/24702_fsm": read only 0 of 8192 bytes
Client 0 aborted in state 12: ERROR: unexpected data beyond EOF in block 1 of relation base/16384/24702
HINT: This has been seen to occur with buggy kernels; consider updating your system.
Client 3 aborted in state 12: ERROR: unexpected data beyond EOF in block 1 of relation base/16384/24702
HINT: This has been seen to occur with buggy kernels; consider updating your system.

Expected results:
No default volume option should create a situation where any application produces corrupted data on a Gluster volume. Output should look like this:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 15
query mode: simple
number of clients: 4
number of threads: 2
duration: 600 s
number of transactions actually processed: 44691
tps = 74.482169 (including connections establishing)
tps = 74.491468 (excluding connections establishing)

Additional info:
When the following volume options are turned off, the benchmark runs smoothly without errors and delivers the expected output above:
performance.open-behind: off
performance.write-behind: off

I have also tried the other performance.* options, and none of them made much difference or produced errors like the above. We have seen the same errors occur on Gluster when running real-world database applications; I just used the pgbench utility to reproduce this consistently.
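The workaround described in the additional info can be applied per volume. A sketch, assuming the volume name vol01 from the reproduction steps:

```shell
# Disable the two translators that trigger the corruption:
gluster volume set vol01 performance.open-behind off
gluster volume set vol01 performance.write-behind off

# Verify the settings took effect:
gluster volume get vol01 performance.open-behind
gluster volume get vol01 performance.write-behind
```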
I want to understand whether it's data or metadata (stat) that is corrupted. What would your results be like if you:

1. turn off performance.stat-prefetch
2. turn on performance.write-behind and performance.open-behind

There is an upstream patch which fixes stale stats due to write-behind [1]. Can you check whether this patch helps?

Also, while running tests, is it possible to collect the request/response traffic between the kernel and glusterfs? This can be done by mounting glusterfs with the option:

  --dump-fuse=PATH    Dump fuse traffic to PATH

Meanwhile we'll try to reproduce the bug locally.

[1] http://review.gluster.org/15757
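A fuse-dump mount invocation might look like this (server and volume names are taken from the reproduction steps; the dump path is illustrative):

```shell
umount /var/lib/pgsql
glusterfs --volfile-server=node1 --volfile-id=vol01 \
    --dump-fuse=/tmp/fuse-dump.bin /var/lib/pgsql
```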
(In reply to Raghavendra G from comment #4)
> I want to understand whether its data or metadata (stat) that is corrupted.
> What would be your results like if you,
>
> 1. turn off performance.stat-prefetch
> 2. turn on performance.write-behind and performance.open-behind

Note that all three of the above settings should be in effect during a single test run.

> There is an upstream patch which fixes stale stats due to write-behind [1].
> Can you check whether this patch helps?
>
> Also, while running tests is it possible to collect the request/response
> traffic between kernel and glusterfs? This can be done by mounting glusterfs
> with the option,
> --dump-fuse=PATH Dump fuse traffic to PATH
>
> Meanwhile we'll try to reproduce the bug locally.
>
> [1] http://review.gluster.org/15757
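Setting the requested combination in one go could be sketched as follows (assuming the volume name vol01 from the original report):

```shell
gluster volume set vol01 performance.stat-prefetch off
gluster volume set vol01 performance.open-behind on
gluster volume set vol01 performance.write-behind on
```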
(In reply to Raghavendra G from comment #4)
> I want to understand whether its data or metadata (stat) that is corrupted.
> What would be your results like if you,
>
> 1. turn off performance.stat-prefetch
> 2. turn on performance.write-behind and performance.open-behind
>
> There is an upstream patch which fixes stale stats due to write-behind [1].
> Can you check whether this patch helps?
>
> Also, while running tests is it possible to collect the request/response
> traffic between kernel and glusterfs? This can be done by mounting glusterfs
> with the option,
> --dump-fuse=PATH Dump fuse traffic to PATH

Please attach the binary fuse-dump to the bug once tests are over.

> Meanwhile we'll try to reproduce the bug locally.
>
> [1] http://review.gluster.org/15757
(In reply to Raghavendra G from comment #4)
> I want to understand whether its data or metadata (stat) that is corrupted.
> What would be your results like if you,
>
> 1. turn off performance.stat-prefetch
> 2. turn on performance.write-behind and performance.open-behind
>
> There is an upstream patch which fixes stale stats due to write-behind [1].
> Can you check whether this patch helps?
>
> Also, while running tests is it possible to collect the request/response
> traffic between kernel and glusterfs? This can be done by mounting glusterfs
> with the option,
> --dump-fuse=PATH Dump fuse traffic to PATH
>
> Meanwhile we'll try to reproduce the bug locally.
>
> [1] http://review.gluster.org/15757

stat-prefetch: off
open-behind + write-behind: on
=> Same errors and warnings as before.

FYI: Before I submitted this BZ, I tested all performance.* options, turning them on/off individually. Only open-behind and write-behind caused these issues.
Created attachment 1362642 [details]
Fuse-dump with open-behind=on, write-behind=on, stat-prefetch=off
Created attachment 1362643 [details]
Fuse-dump with open-behind=off, write-behind=off, stat-prefetch=off
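For a first look at the attached binary fuse-dumps, the request headers can be decoded with a short script. This is a hypothetical helper, not part of glusterfs; it assumes the dump is a raw concatenation of FUSE messages, which may not match the exact framing glusterfs writes, so treat it as a starting point only. The header layout itself is the standard struct fuse_in_header from the Linux FUSE kernel ABI.

```python
import struct

# struct fuse_in_header from <linux/fuse.h>: len, opcode, unique,
# nodeid, uid, gid, pid, padding -- 40 bytes, little-endian on x86.
FUSE_IN_HEADER = struct.Struct("<IIQQIIII")

def parse_in_header(buf, offset=0):
    """Decode one fuse_in_header at `offset` in `buf`.

    Returns a dict of the header fields; the "len" field is the total
    message length, so a caller can skip ahead by it to reach the next
    record (assuming back-to-back messages in the dump).
    """
    length, opcode, unique, nodeid, uid, gid, pid, _pad = \
        FUSE_IN_HEADER.unpack_from(buf, offset)
    return {"len": length, "opcode": opcode, "unique": unique,
            "nodeid": nodeid, "uid": uid, "gid": gid, "pid": pid}
```

For example, `parse_in_header(open("/tmp/fuse-dump.bin", "rb").read())` would show the opcode of the first captured request (1 is FUSE_LOOKUP).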
Created attachment 1362655 [details]
Fuse-log with open-behind=on, write-behind=on, stat-prefetch=off

I also mounted the Gluster volume directly with the glusterfs command to get the fuse debug output:

glusterfs --volfile-server=node1 --volfile-id=vol01 --log-file=/tmp/fuse-log2.txt --log-level=DEBUG /var/lib/pgsql

Volume options set:
performance.open-behind: on
performance.write-behind: on
performance.stat-prefetch: off
Created attachment 1362656 [details]
Fuse-log with open-behind=off, write-behind=off, stat-prefetch=off

I also mounted the Gluster volume directly with the glusterfs command to get the fuse debug output:

glusterfs --volfile-server=node1 --volfile-id=vol01 --log-file=/tmp/fuse-log2.txt --log-level=DEBUG /var/lib/pgsql

Volume options set:
performance.open-behind: off
performance.write-behind: off
performance.stat-prefetch: off
When I reviewed the fuse logs from my last test runs, some entries led me to a suspicion, and I ran some more tests. My suspicion was that the issues might have to do with times being out of sync between the Gluster nodes.

I'm running these tests on VMs, and occasionally I just save the VMs instead of doing a full reboot. My fault, but not too uncommon. This results in times differing across the VMs, as no NTP/chrony is running on these nodes by default (the RHGS setup doesn't configure that).

So I set up chrony to synchronize all VMs and ran the tests again with the following volume option settings:
open-behind: on
write-behind: on
stat-prefetch: on

I ran the same, shortened test procedure:
# pgbench -i -s 1
# pgbench -c 4 -j 2 -T 10

Result: No more WARNINGs. No more ERRORs. Data seems to be OK.

My colleague, who first experienced these issues during a customer PoC with RHGS on AWS, also used the default RHGS deployment and volume settings, which effectively led to the same result: data corruption.

My advice:
1. Make time synchronization between Gluster nodes mandatory. The only options during installation should be to synchronize against an internal or an external source. Having no time sync should be unsupported! There are some occurrences in the RHGS Admin Guide about using time synchronization, but nowhere in the context of replicas running out of sync.
2. Document a strong warning in the volume options section for open-behind, write-behind, and stat-prefetch, and mention the necessity of setting up time synchronization.
3. IMO it would be worth investigating and documenting which time deviation is actually still acceptable.
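The time synchronization described above can be set up on RHEL7-based nodes roughly as follows (a sketch; the time sources and defaults depend on your environment):

```shell
yum install -y chrony
systemctl enable chronyd
systemctl start chronyd

chronyc sources    # list configured time sources
chronyc tracking   # check the current offset from the reference clock
```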
I re-ran the tests with the open-behind/write-behind options on, and the warnings and errors did occur again. So time sync seems to have some influence, but there is still more to it.