Bug 1512691 - PostgreSQL DB Restore: unexpected data beyond EOF
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: fuse
Version: mainline
Hardware: x86_64 Linux
Priority: high  Severity: high
Assigned To: Csaba Henk
Whiteboard: GLUSTERFS_METADATA_INCONSISTENCY
Reported: 2017-11-13 15:54 EST by norman.j.maul
Modified: 2018-06-20 13:57 EDT (History)
11 users

Fixed In Version: glusterfs-v4.1.0
Clone Of: 1435832
Last Closed: 2018-06-20 13:57:11 EDT
Type: Bug

Attachments: None
Description norman.j.maul 2017-11-13 15:54:31 EST
Per bug 1435832, I'm re-reporting this under a newer version (3.12). I'm not the original submitter, but I can confirm that this problem still occurs in 3.12.1.

My steps to reproduce:

1. Install GlusterFS on 3+ nodes running RHEL 7.3 (I used 3.12.1 packages from the CentOS SIG). I used Heketi to set up the cluster, but AFAICT that shouldn't matter.
2. Create a volume (again via Heketi): distributed-replicated, 1x3.
3. Mount it somewhere (e.g. /mnt).

Then you can run the "pgbench" tool like this (pass the right mountpoint):

docker run -d --net host -v /mnt:/var/lib/postgresql/data --name pgbench postgres:alpine
docker exec -it pgbench psql -U postgres
create database pgbench;
time docker exec -it pgbench pgbench pgbench -U postgres -i -s 100
time docker exec -it pgbench pgbench pgbench -U postgres -c 50 -j 8 -t 20000 -r -P 10

Everything up to and including "create database" works fine. The first pgbench command *almost* works... it adds all the rows, but then fails at the end, around when it would do a vacuum. The second pgbench command fails spectacularly right away.

If you reduce the second command to "-c1 -j1", it will work (at least for a while; it's incredibly slow for me, so I didn't wait around to see if it completes).

If /mnt is a regular local filesystem (or Ceph-RBD), this works fine. It only fails if that's a GlusterFS volume.


+++ This bug was initially created as a clone of Bug #1435832 +++

Description of problem:

I'm running Gluster in a Kubernetes cluster with the help of https://github.com/gluster/gluster-kubernetes.  I have a postgresql container where the /var/lib/postgresql/data/pgdata directory is a GlusterFS-mounted persistent volume.  I then run another container to restore a PostgreSQL backup.  It successfully restores all tables except one, which happens to be the largest table (>100 MB).  The error given for that table is:

```
ERROR:  unexpected data beyond EOF in block 14917 of relation base/78620/78991
HINT:  This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT:  COPY es_config_app_solutiondraft, line 906
```

I have tried several different containers to perform the restore, including ubuntu:16.04, postgresql:9.6.2, and alpine:3.5.  All have the same issue.  Interestingly, the entire restore works, including the large table, if I run it directly on the postgresql container.  That makes me think this is related to container-to-container networking and not necessarily Gluster's fault, but I wanted to report it in case there are any suggestions or kernel setting tweaks to fix the issue.

Version-Release number of selected component (if applicable):

GlusterFS 3.8.5

PostgreSQL 9.6.2 Container:
uname -a
Linux develop-postgresql-3992946951-3srqg 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 GNU/Linux

Other containers used for the restore are running the same 4.4.0-65-generic kernel.

Kubernetes 1.5.1
Docker 1.12.6

How reproducible:

First, get kubernetes working with gluster and heketi.  See https://github.com/gluster/gluster-kubernetes

Steps to Reproduce:
1. Start a PostgreSQL "pod" with the /var/lib/postgresql/data/pgdata set up as persistent volume.
2. Start a second container that can access the postgres container.
3. Attempt to restore a backup containing a large table >100MB.

Actual results:

Restore fails on large table with above error.

Expected results:

Restore applies cleanly, even for large tables.

Additional info:

Volume is mounted as type fuse.glusterfs.  From postgresql container:
# mount
10.163.148.196:vol_6d09a586370e26a718a74d5d280f8dfd on /var/lib/postgresql/data/pgdata type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

Googling this error does return some fairly old results that don't really have anything conclusive.

--- Additional comment from  on 2017-03-28 14:19:25 EDT ---

My thoughts about it being a container networking issue were incorrect.  I now believe this is truly a GlusterFS + PostgreSQL issue.  I confirmed that I occasionally get restore failures on the postgresql container itself, which rules out the container networking interface (CNI).  I also get occasional successful restores on separate restore containers, which further rules out CNI.  The "unexpected data beyond EOF" error occurs intermittently, with about a ~30% success rate regardless of how the restore is attempted.

Also, the table size for the failing table is actually 244MB.  All other tables that do successfully restore are under 10MB.

--- Additional comment from Zbigniew Kostrzewa on 2017-09-28 01:29:38 EDT ---

Just recently I bumped into the same error using GlusterFS 3.10.5 and 3.12.1 (from SIG repositories).
I have created a cluster of 3 VMs with CentOS 7.2 (uname below) and spun up a PostgreSQL 9.6.2 Docker (v17.06) container. The GlusterFS volume was bind-mounted into the container at the default location where PostgreSQL stores its data (/var/lib/postgresql/data). When filling the database with data, at some point I got this "unexpected data beyond EOF" error.

On PostgreSQL's mailing list a similar issue was discussed, but concerning PostgreSQL on NFS. In fact, such an issue was already reported and fixed in RHEL 5 (https://bugzilla.redhat.com/show_bug.cgi?id=672981).

I tried using the latest PostgreSQL docker image (i.e. 9.6.5), unfortunately with the same results.

uname -a:
Linux node-10-9-4-109 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

--- Additional comment from Rui on 2017-10-23 06:28:54 EDT ---

I'm having the same problem here.

I have installed PostgreSQL 9.6.5 on 3.10.0-693.2.2.el7.x86_64 and executed pgbench with a scale factor of 1000, i.e. 10,000,000 accounts.

The first run was executed on the OS filesystem. Everything went well.

After that I stopped PostgreSQL, created a GlusterFS replicated volume (3 replicas), and copied the PostgreSQL data directory onto the GlusterFS volume. The volume is mounted as type fuse.glusterfs.

10.112.76.37:gv0 on /mnt/batatas type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)


After that I tried to run pgbench. With a concurrency level of one, things work fine. However, with a concurrency level > 1, this error occurs:

client 1 aborted in state 9: ERROR:  unexpected data beyond EOF in block 316 of relation base/16384/16516
HINT:  This has been seen to occur with buggy kernels; consider updating your system.

I'm using glusterfs 3.12.2.

Any idea?

--- Additional comment from Niels de Vos on 2017-11-07 05:40:31 EST ---

This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
Comment 1 Milo 2018-01-10 08:17:40 EST
Same problem here. I read that most databases are not supported on Gluster but could we fix that?
See "Gluster does not support so called “structured data”, meaning live, SQL databases." in http://docs.gluster.org/en/latest/Install-Guide/Overview/.

We are running:
- glusterfs 3.12.3
- kernel 3.10.0-693.el7.x86_64
- Red Hat Enterprise Linux Server release 7.4 (Maipo)

It seems to be a problem with lseek(SEEK_END) not taking into account the last write in high-traffic situations.
See postgres source code in "src/backend/storage/buffer/bufmgr.c" around line 806.
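The invariant being violated can be sketched outside Postgres. The following is a hypothetical shell sketch (not the Postgres code; the file name and MNT variable are assumptions): it appends 8 KiB blocks and immediately re-checks the file size through a fresh stat(), analogous to Postgres sizing the relation with lseek(SEEK_END) before extending it. Being a single process, it may well not reproduce the race on its own (the real failure involves concurrent clients), but it shows what a stale size looks like:

```shell
#!/bin/sh
# Hedged sketch of the failing access pattern -- not Postgres code.
# MNT is an assumption; point it at the suspect GlusterFS mount to test.
MNT=${MNT:-/tmp}
F="$MNT/eof-demo.dat"
: > "$F"
i=1
while [ "$i" -le 100 ]; do
  # Writer: append one 8 KiB block (PostgreSQL's page size).
  dd if=/dev/zero of="$F" bs=8192 count=1 oflag=append conv=notrunc status=none
  # Reader: size as observed via a separate open right after the write,
  # analogous to Postgres checking relation size with lseek(SEEK_END).
  sz=$(stat -c %s "$F")
  if [ "$sz" -ne $((i * 8192)) ]; then
    echo "stale size after block $i: saw $sz, expected $((i * 8192))"
  fi
  i=$((i + 1))
done
echo "final size: $(stat -c %s "$F")"
```

On a healthy local filesystem this prints no "stale size" lines and a final size of 100 x 8192 = 819200 bytes; a reader observing a size smaller than what was just appended is exactly the mismatch PostgreSQL reports as "unexpected data beyond EOF".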
Comment 2 Amar Tumballi 2018-01-30 00:26:48 EST
Thanks for the pointer, Milo. We will pick this up in the 4.x releases (starting in March).
Comment 3 Csaba Henk 2018-01-30 04:01:38 EST
Hi Norman,

there was another bug related to Postgres: Bug 1518710. There the OP indicated that turning off open-behind and write-behind mitigated the issue. So I'm asking you too:

1) Does turning {open,write}-behind off make a difference for you?
2) If it helps to the degree that you no longer see the issue, is it also a viable workaround performance-wise?

Thanks,
Csaba
Comment 4 norman.j.maul 2018-02-02 18:14:12 EST
Hi Csaba,

Sorry, I don't have a test environment for GlusterFS anymore. This problem and others (including the lack of action on this bug; see the original bug I cloned from) convinced us to give up and look elsewhere. We still haven't found a suitable open-source system, but I doubt we'll be looking back at GlusterFS any time soon, so it's unlikely I'll be of any further help.

Sorry I can't light the way any further. Hopefully the steps I left in comment 0 will help someone else replicate it and proceed further.

Cheers,
Norman
Comment 5 Amar Tumballi 2018-02-04 05:52:27 EST
Thanks for getting back, Norman. We understand the reasoning. We will let you know once this workload is well supported on top of GlusterFS.
Comment 6 Worker Ant 2018-03-05 07:17:16 EST
REVIEW: https://review.gluster.org/19673 (fuse: enable proper \"fgetattr\"-like semantics) posted (#1) for review on master by Csaba Henk
Comment 7 Worker Ant 2018-03-05 22:45:31 EST
COMMIT: https://review.gluster.org/19673 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- fuse: enable proper "fgetattr"-like semantics

GETATTR FUSE message can carry a file handle
reference in which case it serves as a hint
for the FUSE server that the stat data is
preferably acquired in context of the given
filehandle (which we call '"fgetattr"-like
semantics').

So far FUSE ignored the GETATTR provided
filehandle and grabbed a file handle
heuristically. This caused confusion in the
caching layers, which has been tracked down
as one of the reasons of referred BUG.

As of the BUG, this is just a partial fix.

BUG: 1512691
Change-Id: I67eebbf5407ca725ed111fbda4181ead10d03f6d
Signed-off-by: Csaba Henk <csaba@redhat.com>
Comment 8 Philip Chan 2018-03-13 13:56:41 EDT
(In reply to Milo from comment #1)
> Same problem here. I read that most databases are not supported on Gluster
> but could we fix that?
> See "Gluster does not support so called “structured data”, meaning live, SQL
> databases." in http://docs.gluster.org/en/latest/Install-Guide/Overview/.
> 
> We are running:
> - glusterfs 3.12.3
> - kernel 3.10.0-693.el7.x86_64
> - Red Hat Enterprise Linux Server release 7.4 (Maipo)
> 
> It seems to be a problem with lseek(SEEK_END) not taking into account the
> last write in hi-traffic situations.
> See postgres source code in "src/backend/storage/buffer/bufmgr.c" around
> line 806.

We have also hit the same problem under heavy load when populating our database inside a PostgreSQL container mapped to a replicated GlusterFS volume.  Our environment consists of:
- glusterfs 3.12.3
- kernel 4.4.0-31-generic
- Ubuntu Ubuntu 16.04.3 LTS

We applied the workaround mentioned in Bug #1518710 (turning off both performance.open-behind and performance.write-behind), which got rid of the errors under load.
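For reference, these two translator toggles can be applied with the standard gluster CLI. A sketch, assuming a volume named gv0 (a placeholder; substitute your own volume name):

```shell
# "gv0" is a placeholder volume name -- substitute your own.
# Disable the two caching translators implicated in the bug:
gluster volume set gv0 performance.open-behind off
gluster volume set gv0 performance.write-behind off

# Confirm the settings took effect:
gluster volume get gv0 performance.open-behind
gluster volume get gv0 performance.write-behind

# To undo the workaround once a fixed release is in place:
gluster volume reset gv0 performance.open-behind
gluster volume reset gv0 performance.write-behind
```

Option changes propagate to mounted clients via a graph switch; remounting the clients is a conservative way to be certain they are in effect.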

My questions are:
1) Can Red Hat please clarify what the statement "Gluster does not support so called “structured data”, meaning live, SQL databases." means? Is this to say the configuration we have (postgres using a glusterfs volume) is not supported?
2) In what exact version of Gluster is this problem targeted to be fixed, and in what time frame?

Thanks,
-Phil
Comment 9 Worker Ant 2018-05-24 23:01:39 EDT
REVIEW: https://review.gluster.org/20082 (Revert \"performance/write-behind: fix flush stuck by former failed writes\") posted (#1) for review on master by Raghavendra G
Comment 10 Worker Ant 2018-05-24 23:26:59 EDT
REVIEW: https://review.gluster.org/20083 (performance/read-ahead: throwaway read-ahead cache of all fds on writes on any fd) posted (#1) for review on master by Raghavendra G
Comment 11 Worker Ant 2018-05-28 22:28:04 EDT
COMMIT: https://review.gluster.org/20082 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- Revert "performance/write-behind: fix flush stuck by former failed writes"

This reverts commit 9340b3c7a6c8556d6f1d4046de0dbd1946a64963.

operations/writes across different fds of the same file cannot be
considered independent. For example, man 2 fsync states:

<man 2 fsync>

fsync()  transfers  ("flushes")  all  modified  in-core  data of
(i.e., modified buffer cache pages for) the file referred to by the
file descriptor fd to the disk device

</man>

This means fsync is an operation on file and fd is just a way to reach
file. So, it has to sync writes done on other fds too. Patch
9340b3c7a6c, prevents this.

The problem fixed by patch 9340b3c7a6c - a flush on an fd is hung on a
failed write (held in cache for retrying) on a different fd - is
solved in this patch by making sure __wb_request_waiting_on considers
failed writes on any fd as dependent on flush/fsync on any fd (not
just the fd on which writes happened) opened on the same file. This
means failed writes on any fd are either synced or thrown away on
witnessing flush/fsync on any fd of the same file.

Change-Id: Iee748cebb6d2a5b32f9328aff2b5b7cbf6c52c05
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 12 Worker Ant 2018-05-28 22:44:13 EDT
COMMIT: https://review.gluster.org/20083 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/read-ahead: throwaway read-ahead cache of all fds on writes on any fd

This is to make sure applications that read and write on different fds
of the same file work.

This patch also fixes two other issues:
1. while iterating over the list of open fds on an inode, initialize
tmp_file to 0 for each iteration before fd_ctx_get to make sure we
don't carry over the history from previous iterations.
2. remove flushing of cache in flush and fsync as by themselves, they
don't modify the data

Change-Id: Ib9959eb73702a3ebbf90badccaa16b2608050eff
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 13 Shyamsundar 2018-06-20 13:57:11 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
