Description of problem:
When setting up a Gluster volume with default volume options, the volume causes data corruption when used as a storage target for the PostgreSQL RDBMS.

Version-Release number of selected component (if applicable):
RHGS 3.3

How reproducible:
Running simple database transactions produces warnings and errors.

Steps to Reproduce:
1. Install RHGS 3.3 on two nodes (node1, node2) with 2 bricks each
2. Set up a 2-node cluster and create a 2x2 volume with gdeploy
3. Set up a RHEL7 client and install postgresql-server and postgresql-contrib from the RHEL repo (the latter is required for the "pgbench" utility)
4. Initialize the database system:
# mount -t glusterfs node1:/vol01 /var/lib/pgsql
# postgresql-setup initdb
# systemctl start postgresql.service
5. Set up a benchmark database:
# su - postgres
# createdb bench-db
6. Run the benchmark:
# pgbench -i -s 15 bench-db
# pgbench -c 4 -j 2 -T 120 bench-db

Actual results:
The following errors and warnings come up, indicating that data is corrupted:

starting vacuum...end.
Client 1 aborted in state 12: ERROR: could not read block 2 in file "base/16384/24702_fsm": read only 0 of 8192 bytes
Client 0 aborted in state 12: ERROR: unexpected data beyond EOF in block 1 of relation base/16384/24702
HINT: This has been seen to occur with buggy kernels; consider updating your system.
Client 3 aborted in state 12: ERROR: unexpected data beyond EOF in block 1 of relation base/16384/24702
HINT: This has been seen to occur with buggy kernels; consider updating your system.

Expected results:
No default volume option should create a situation where any application produces corrupted data on a Gluster volume. Output should look like this:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 15
query mode: simple
number of clients: 4
number of threads: 2
duration: 600 s
number of transactions actually processed: 44691
tps = 74.482169 (including connections establishing)
tps = 74.491468 (excluding connections establishing)

Additional info:
When the following volume options are turned off, the benchmark runs smoothly without errors and delivers the expected output above:
performance.open-behind: off
performance.write-behind: off

I have also tried the other performance.* options, and none of them made much difference or produced errors like the above. We have seen the same errors occur on Gluster when running real-world database applications; I just used the pgbench utility to reproduce this consistently.
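The workaround described in the additional info can be applied per volume. A sketch, assuming the volume name vol01 from the reproduction steps:

```shell
# Disable the two translators that trigger the corruption:
gluster volume set vol01 performance.open-behind off
gluster volume set vol01 performance.write-behind off

# Verify the settings took effect:
gluster volume get vol01 performance.open-behind
gluster volume get vol01 performance.write-behind
```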
I want to understand whether it's data or metadata (stat) that is corrupted. What would your results be like if you:

1. turn off performance.stat-prefetch
2. turn on performance.write-behind and performance.open-behind

There is an upstream patch which fixes stale stats due to write-behind [1]. Can you check whether this patch helps?

Also, while running tests, is it possible to collect the request/response traffic between the kernel and glusterfs? This can be done by mounting glusterfs with the option:

  --dump-fuse=PATH    Dump fuse traffic to PATH

Meanwhile we'll try to reproduce the bug locally.

[1] http://review.gluster.org/15757
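A fuse-dump mount invocation might look like this (server and volume names are taken from the reproduction steps; the dump path is illustrative):

```shell
umount /var/lib/pgsql
glusterfs --volfile-server=node1 --volfile-id=vol01 \
    --dump-fuse=/tmp/fuse-dump.bin /var/lib/pgsql
```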
(In reply to Raghavendra G from comment #4)
> I want to understand whether its data or metadata (stat) that is corrupted.
> What would be your results like if you,
>
> 1. turn off performance.stat-prefetch
> 2. turn on performance.write-behind and performance.open-behind

Note that all three of the above settings should be in effect during a single test run.

> There is an upstream patch which fixes stale stats due to write-behind [1].
> Can you check whether this patch helps?
>
> Also, while running tests is it possible to collect the request/response
> traffic between kernel and glusterfs? This can be done by mounting glusterfs
> with the option,
> --dump-fuse=PATH Dump fuse traffic to PATH
>
> Meanwhile we'll try to reproduce the bug locally.
>
> [1] http://review.gluster.org/15757
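Setting the requested combination in one go could be sketched as follows (assuming the volume name vol01 from the original report):

```shell
gluster volume set vol01 performance.stat-prefetch off
gluster volume set vol01 performance.open-behind on
gluster volume set vol01 performance.write-behind on
```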
(In reply to Raghavendra G from comment #4)
> I want to understand whether its data or metadata (stat) that is corrupted.
> What would be your results like if you,
>
> 1. turn off performance.stat-prefetch
> 2. turn on performance.write-behind and performance.open-behind
>
> There is an upstream patch which fixes stale stats due to write-behind [1].
> Can you check whether this patch helps?
>
> Also, while running tests is it possible to collect the request/response
> traffic between kernel and glusterfs? This can be done by mounting glusterfs
> with the option,
> --dump-fuse=PATH Dump fuse traffic to PATH

Please attach the binary fuse-dump to the bug once tests are over.

> Meanwhile we'll try to reproduce the bug locally.
>
> [1] http://review.gluster.org/15757
(In reply to Raghavendra G from comment #4)
> I want to understand whether its data or metadata (stat) that is corrupted.
> What would be your results like if you,
>
> 1. turn off performance.stat-prefetch
> 2. turn on performance.write-behind and performance.open-behind
>
> There is an upstream patch which fixes stale stats due to write-behind [1].
> Can you check whether this patch helps?
>
> Also, while running tests is it possible to collect the request/response
> traffic between kernel and glusterfs? This can be done by mounting glusterfs
> with the option,
> --dump-fuse=PATH Dump fuse traffic to PATH
>
> Meanwhile we'll try to reproduce the bug locally.
>
> [1] http://review.gluster.org/15757

stat-prefetch: off
open-behind + write-behind: on
=> Same errors and warnings as before.

FYI: Before I submitted this BZ, I tested all performance.* options, turning them on/off individually. Only open-behind and write-behind caused these issues.
Created attachment 1362642 [details]
Fuse-dump with open-behind=on, write-behind=on, stat-prefetch=off
Created attachment 1362643 [details]
Fuse-dump with open-behind=off, write-behind=off, stat-prefetch=off
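For a first look at the attached binary fuse-dumps, the request headers can be decoded with a short script. This is a hypothetical helper, not part of glusterfs; it assumes the dump is a raw concatenation of FUSE messages, which may not match the exact framing glusterfs writes, so treat it as a starting point only. The header layout itself is the standard struct fuse_in_header from the Linux FUSE kernel ABI.

```python
import struct

# struct fuse_in_header from <linux/fuse.h>: len, opcode, unique,
# nodeid, uid, gid, pid, padding -- 40 bytes, little-endian on x86.
FUSE_IN_HEADER = struct.Struct("<IIQQIIII")

def parse_in_header(buf, offset=0):
    """Decode one fuse_in_header at `offset` in `buf`.

    Returns a dict of the header fields; the "len" field is the total
    message length, so a caller can skip ahead by it to reach the next
    record (assuming back-to-back messages in the dump).
    """
    length, opcode, unique, nodeid, uid, gid, pid, _pad = \
        FUSE_IN_HEADER.unpack_from(buf, offset)
    return {"len": length, "opcode": opcode, "unique": unique,
            "nodeid": nodeid, "uid": uid, "gid": gid, "pid": pid}
```

For example, `parse_in_header(open("/tmp/fuse-dump.bin", "rb").read())` would show the opcode of the first captured request (1 is FUSE_LOOKUP).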
Created attachment 1362655 [details]
Fuse-log with open-behind=on, write-behind=on, stat-prefetch=off

I also mounted the Gluster volume directly with the glusterfs command to get the fuse debug output:

glusterfs --volfile-server=node1 --volfile-id=vol01 --log-file=/tmp/fuse-log2.txt --log-level=DEBUG /var/lib/pgsql

Volume options set:
performance.open-behind: on
performance.write-behind: on
performance.stat-prefetch: off
Created attachment 1362656 [details]
Fuse-log with open-behind=off, write-behind=off, stat-prefetch=off

I also mounted the Gluster volume directly with the glusterfs command to get the fuse debug output:

glusterfs --volfile-server=node1 --volfile-id=vol01 --log-file=/tmp/fuse-log2.txt --log-level=DEBUG /var/lib/pgsql

Volume options set:
performance.open-behind: off
performance.write-behind: off
performance.stat-prefetch: off
When I reviewed the fuse logs from my last test runs, some entries led me to a suspicion, and I ran some more tests. My suspicion was that the issues might have to do with times being out of sync between the Gluster nodes.

I'm running these tests on VMs, and occasionally I just save the VMs instead of doing a full reboot. My fault, but not too uncommon. This results in times differing across the VMs, as no NTP/chrony is running on these nodes by default (the RHGS setup doesn't configure that).

So I set up chrony to synchronize all VMs and ran the tests again with the following volume option settings:
open-behind: on
write-behind: on
stat-prefetch: on

I ran the same, shortened test procedure:
# pgbench -i -s 1
# pgbench -c 4 -j 2 -T 10

Result: No more WARNINGs. No more ERRORs. Data seems to be OK.

My colleague, who first experienced these issues during a customer PoC with RHGS on AWS, also used the default RHGS deployment and volume settings, which effectively led to the same result: data corruption.

My advice:
1. Make time synchronization between Gluster nodes mandatory. The only options during installation should be to synchronize against an internal or an external source. Having no time sync should be unsupported! There are some occurrences in the RHGS Admin Guide about using time synchronization, but nowhere in the context of replicas running out of sync.
2. Document a strong warning in the volume options section for open-behind, write-behind, and stat-prefetch, and mention the necessity of setting up time synchronization.
3. IMO it would be worth investigating and documenting which time deviation is actually still acceptable.
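The time synchronization described above can be set up on RHEL7-based nodes roughly as follows (a sketch; the time sources and defaults depend on your environment):

```shell
yum install -y chrony
systemctl enable chronyd
systemctl start chronyd

chronyc sources    # list configured time sources
chronyc tracking   # check the current offset from the reference clock
```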
I re-ran the tests with the open-behind/write-behind options on, and the warnings and errors did occur again. So time sync seems to have some influence, but there is still more to it.