Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1512691 - PostgreSQL DB Restore: unexpected data beyond EOF
PostgreSQL DB Restore: unexpected data beyond EOF
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: fuse (Show other bugs)
mainline
x86_64 Linux
high Severity high
: ---
: ---
Assigned To: Csaba Henk
GLUSTERFS_METADATA_INCONSISTENCY
: Reopened
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-11-13 15:54 EST by norman.j.maul
Modified: 2018-10-23 11:06 EDT (History)
14 users (show)

See Also:
Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1435832
Environment:
Last Closed: 2018-10-23 11:06:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
strace output (2.46 MB, text/plain)
2018-06-29 20:06 EDT, Vijay Bellur
no flags Details

  None (edit)
Description norman.j.maul 2017-11-13 15:54:31 EST
Per bug 1435832, I'm re-reporting this under a newer version (3.12). I'm not the original submitter, but I can confirm that this problem still occurs in 3.12.1.

My steps to reproduce:

Install GlusterFS on 3+ nodes (I used 3.12.1 packages from CentOS-SIG), on RHEL7.3
I used Heketi to set up the cluster, but AFAICT it shouldn't matter
create a volume... I used heketi again, distributed-replicated, 1x3.
mount it somewhere (ex: /mnt)

Then you can run the "pgbench" tool like this (pass the right mountpoint):

docker run -d --net host -v /mnt:/var/lib/postgresql/data --name pgbench postgres:alpine
docker exec -it pgbench psql -U postgres
create database pgbench;
time docker exec -it pgbench pgbench pgbench -U postgres -i -s 100
time docker exec -it pgbench pgbench pgbench -U postgres -c 50 -j 8 -t 20000 -r -P 10

Everything up to and including "create database" works fine. The first pgbench command *almost* works... it adds all the rows, but then fails at the end, around when it would do a vacuum. The second pgbench command fails spectacularly right away.

If you reduce the second command to "-c1 -j1", it will work (at least for a while- it's incredibly slow for me so I didn't wait around to see if it works completely).

If /mnt is a regular local filesystem (or Ceph-RBD), this works fine. It only fails if that's a GlusterFS volume.


+++ This bug was initially created as a clone of Bug #1435832 +++

Description of problem:

I'm running Gluster in a kubernetes cluster with the help of https://github.com/gluster/gluster-kubernetes.  I have a postgresql container where the /var/lib/postgresql/data/pgdata directory is a glusterfs mounted persistent volume.  I then run another container to restore a PostgreSQL backup.  It successfully restores all tables except one, which happens to be the largest sized table >100MB.  The error given for that table is:

```
ERROR:  unexpected data beyond EOF in block 14917 of relation base/78620/78991
HINT:  This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT:  COPY es_config_app_solutiondraft, line 906
```

I have tried several different containers to perform the restore on including ubuntu:16:04, postgresql:9.6.2, and alpine:3.5.  All have the same issue.  Interestingly, the entire restore works including the large table if I run it directly on the postgresql container.  That makes me think this is related to container to container networking and not necessarily gluster's fault but wanted to report it in case there are any suggestions or kernel setting tweaks to fix the issue.

Version-Release number of selected component (if applicable):

GlusterFS 3.8.5

PostgreSQL 9.6.2 Container:
uname -a
Linux develop-postgresql-3992946951-3srqg 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 GNU/Linux

Other containers used for the restore are running the same 4.4.0-65-generic kernel.

Kubernetes 1.5.1
Docker 1.12.6

How reproducible:

First, get kubernetes working with gluster and heketi.  See https://github.com/gluster/gluster-kubernetes

Steps to Reproduce:
1. Start a PostgreSQL "pod" with the /var/lib/postgresql/data/pgdata set up as persistent volume.
2. Start a second container that can access the postgres container.
3. Attempt to restore a backup containing a large table >100MB.

Actual results:

Restore fails on large table with above error.

Expected results:

Restore applies cleanly, even for large tables.

Additional info:

Volume is mounted as type fuse.glusterfs.  From postgresql container:
# mount
10.163.148.196:vol_6d09a586370e26a718a74d5d280f8dfd on /var/lib/postgresql/data/pgdata type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

Googling this error does return some fairly old results that don't really have anything conclusive.

--- Additional comment from  on 2017-03-28 14:19:25 EDT ---

My thoughts about it being a container networking issue were incorrect.  I now believe this is truly a glusterfs + postgresql issue.  I confirmed that I occasionally do get restore failures on the postgresql container itself which eliminates the container networking interface (CNI).  I also get occasional successful restores on separate restore containers which further eliminates CNI.  The "unexpected data beyond EOF" error occurs intermittently with about a ~30% success rate regardless of how the restore is attempted.

Also, the table size for the failing table is actually 244MB.  All other tables that do successfully restore are under 10MB.

--- Additional comment from Zbigniew Kostrzewa on 2017-09-28 01:29:38 EDT ---

Just recently I bumped into the same error using GlusterFS 3.10.5 and 3.12.1 (from SIG repositories).
I have created a cluster of 3 VMs with CentOS 7.2 (uname below) and spin up a PostgreSQL 9.6.2 docker (v17.06) container. GlusterFS volume was bind-mounted into the container to default location where PostgreSQL stores its data (/var/lib/postgresql/data). When filling up the database with data at some point I got this "unexpected data beyond EOF" error.

On PostgreSQL's mailing list similar issue was discussed but about PostgreSQL on NFS. In fact such issue was reported and fixed already in RHEL5 (https://bugzilla.redhat.com/show_bug.cgi?id=672981).

I tried using latest PostgreSQL's docker image (i.e. 9.6.5), unfortunately with the same results.

uname -a:
Linux node-10-9-4-109 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

--- Additional comment from Rui on 2017-10-23 06:28:54 EDT ---

I'm having the same problem here.

I have installed postgresql 9.6.5 on 3.10.0-693.2.2.el7.x86_64 and executed pgbench with a scale factor of 1000, i.e. 10.000.000 accounts. 

First run was executed using the O.S filesystem. Everything went well.

After that I have stopped postgresql, created a GlusterFS replicated volume (3 replicas), and copied postgresql data directory into the GlusterFS volume.The volume is mounted as type fuse.glusterfs.

10.112.76.37:gv0 on /mnt/batatas type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)


After that I've tried to run pgbench. Running with concurrency level of one, things work fine. However, running with concurrency level > 1, this error occurs:

client 1 aborted in state 9: ERROR:  unexpected data beyond EOF in block 316 of relation base/16384/16516
HINT:  This has been seen to occur with buggy kernels; consider updating your system.

I'm using glusterfs 3.12.2.

Any idea?

--- Additional comment from Niels de Vos on 2017-11-07 05:40:31 EST ---

This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
Comment 1 Milo 2018-01-10 08:17:40 EST
Same problem here. I read that most databases are not supported on Gluster but could we fix that?
See "Gluster does not support so called “structured data”, meaning live, SQL databases." in http://docs.gluster.org/en/latest/Install-Guide/Overview/.

We are running:
- glusterfs 3.12.3
- kernel 3.10.0-693.el7.x86_64
- Red Hat Enterprise Linux Server release 7.4 (Maipo)

It seems to be a problem with lseek(SEEK_END) not taking into account the last write in hi-traffic situations.
See postgres source code in "src/backend/storage/buffer/bufmgr.c" around line 806.
Comment 2 Amar Tumballi 2018-01-30 00:26:48 EST
Thanks for the pointer Milo. We will pick this up with 4.x version (Starting March).
Comment 3 Csaba Henk 2018-01-30 04:01:38 EST
Hi Norman,

there was another bug related to Postgres: Bug 1518710. Thereby OP indicated that turning off open-behind and write-behind mitigated the issue. So I'm asking you too,

1) does {open,write}-behind off make a difference for you?
2) if it helps to the degree of not seeing the issue with that setting, is it also a viable workable workaround performance-wise?

Thanks,
Csaba
Comment 4 norman.j.maul 2018-02-02 18:14:12 EST
Hi Csaba,

Sorry, I don't have a test environment for GlusterFS anymore. This problem and others (including the lack of action on this bug- see the original bug I cloned from) convinced caused us to give up and look elsewhere. We still haven't found a suitable open-source system, but I doubt we'll be looking back at GlusterFS any time soon, so it's unlikely I'll be of any further help.

Sorry I can't light the way any further. Hopefully the steps I left in comment 0 will help someone else replicate it and proceed further.

Cheers,
Norman
Comment 5 Amar Tumballi 2018-02-04 05:52:27 EST
Thanks for getting back Norman. We understand the reasoning. We will let you know once the work load would be well supported on top of GlusterFS.
Comment 6 Worker Ant 2018-03-05 07:17:16 EST
REVIEW: https://review.gluster.org/19673 (fuse: enable proper \"fgetattr\"-like semantics) posted (#1) for review on master by Csaba Henk
Comment 7 Worker Ant 2018-03-05 22:45:31 EST
COMMIT: https://review.gluster.org/19673 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- fuse: enable proper "fgetattr"-like semantics

GETATTR FUSE message can carry a file handle
reference in which case it serves as a hint
for the FUSE server that the stat data is
preferably acquired in context of the given
filehandle (which we call '"fgetattr"-like
semantics').

So far FUSE ignored the GETTATTR provided
filehandle and grabbed a file handle
heuristically. This caused confusion in the
caching layers, which has been tracked down
as one of the reasons of referred BUG.

As of the BUG, this is just a partial fix.

BUG: 1512691
Change-Id: I67eebbf5407ca725ed111fbda4181ead10d03f6d
Signed-off-by: Csaba Henk <csaba@redhat.com>
Comment 8 Philip Chan 2018-03-13 13:56:41 EDT
(In reply to Milo from comment #1)
> Same problem here. I read that most databases are not supported on Gluster
> but could we fix that?
> See "Gluster does not support so called “structured data”, meaning live, SQL
> databases." in http://docs.gluster.org/en/latest/Install-Guide/Overview/.
> 
> We are running:
> - glusterfs 3.12.3
> - kernel 3.10.0-693.el7.x86_64
> - Red Hat Enterprise Linux Server release 7.4 (Maipo)
> 
> It seems to be a problem with lseek(SEEK_END) not taking into account the
> last write in hi-traffic situations.
> See postgres source code in "src/backend/storage/buffer/bufmgr.c" around
> line 806.

We have also hit the same problem under heavy load when populating ou database inside a Postgresql container mapped to a Replicated Glusterfs volume.  Our environment consists of:
- glusterfs 3.12.3
- kernel 4.4.0-31-generic
- Ubuntu Ubuntu 16.04.3 LTS

We applied the workaround as mentioned in Bug#1518710 (turning off both performance.open-behind and performance.write-behind) that helped rid the load errors.

My questions are:
1) Can RedHat please clarify the statement "Gluster does not support so called “structured data”, meaning live, SQL databases." means? Is this to say the configuration we have(postgres using a glusterfs volume) is not supported?
2) What exact version of Gluster is this problem targeted to be fixed? About what time frame?

Thanks,
-Phil
Comment 9 Worker Ant 2018-05-24 23:01:39 EDT
REVIEW: https://review.gluster.org/20082 (Revert \"performance/write-behind: fix flush stuck by former failed writes\") posted (#1) for review on master by Raghavendra G
Comment 10 Worker Ant 2018-05-24 23:26:59 EDT
REVIEW: https://review.gluster.org/20083 (performance/read-ahead: throwaway read-ahead cache of all fds on writes on any fd) posted (#1) for review on master by Raghavendra G
Comment 11 Worker Ant 2018-05-28 22:28:04 EDT
COMMIT: https://review.gluster.org/20082 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- Revert "performance/write-behind: fix flush stuck by former failed writes"

This reverts commit 9340b3c7a6c8556d6f1d4046de0dbd1946a64963.

operations/writes across different fds of the same file cannot be
considered as independent. For eg., man 2 fsync states,

<man 2 fsync>

fsync()  transfers  ("flushes")  all  modified  in-core  data of
(i.e., modified buffer cache pages for) the file referred to by the
file descriptor fd to the disk device

</man>

This means fsync is an operation on file and fd is just a way to reach
file. So, it has to sync writes done on other fds too. Patch
9340b3c7a6c, prevents this.

The problem fixed by patch 9340b3c7a6c - a flush on an fd is hung on a
failed write (held in cache for retrying) on a different fd - is
solved in this patch by making sure __wb_request_waiting_on considers
failed writes on any fd as dependent on flush/fsync on any fd (not
just the fd on which writes happened) opened on the same file. This
means failed writes on any fd are either synced or thrown away on
witnessing flush/fsync on any fd of the same file.

Change-Id: Iee748cebb6d2a5b32f9328aff2b5b7cbf6c52c05
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 12 Worker Ant 2018-05-28 22:44:13 EDT
COMMIT: https://review.gluster.org/20083 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/read-ahead: throwaway read-ahead cache of all fds on writes on any fd

This is to make sure applications that read and write on different fds
of the same file work.

This patch also fixes two other issues:
1. while iterating over the list of open fds on an inode, initialize
tmp_file to 0 for each iteration before fd_ctx_get to make sure we
don't carry over the history from previous iterations.
2. remove flushing of cache in flush and fsync as by themselves, they
don't modify the data

Change-Id: Ib9959eb73702a3ebbf90badccaa16b2608050eff
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 13 Shyamsundar 2018-06-20 13:57:11 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
Comment 14 Raghavendra G 2018-06-27 20:23:12 EDT
Reopening, as (write-behind + stat-prefetch) on combination is still giving errors.
Comment 15 Raghavendra G 2018-06-27 20:39:20 EDT
[2018-06-27 16:36:23.969018] T [fuse-bridge.c:862:fuse_attr_cbk] 0-glusterfs-fuse: 2929207: FSTAT() ERR => 13592650803908775745stat = gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 ino=13592650803908775745, mode=1006
00, nlink=1, uid=26, gid=26, size=202088448, blocks=394704, atime=2018-06-27-16:36:23 mtime=2018-06-27-16:36:23 ctime=2018-06-27-16:36:23 atime_sec=1530117383, atime_nsec=955733539, mtime_sec=1530117383, mtime_n
sec=967733546, ctime_sec=1530117383, ctime_nsec=967733546
[2018-06-27 16:36:23.969160] T [fuse-bridge.c:2503:fuse_write_resume] 0-glusterfs-fuse: 2929209: WRITE (0x7feb203abd90, size=8192, offset=202088448)
[2018-06-27 16:36:23.969211] T [fuse-bridge.c:2459:fuse_writev_cbk] 0-glusterfs-fuse: 2929209: gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 WRITE => 540720528/8192,8192/202088448

Above write extends file by 8192. So size should be 20208848 + 8192

[2018-06-27 16:36:23.969241] T [fuse-bridge.c:862:fuse_attr_cbk] 0-glusterfs-fuse: 2929206: FSTAT() ERR => 13592650803908775745stat = gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 ino=13592650803908775745, mode=1006
00, nlink=1, uid=26, gid=26, size=202088448, blocks=394704, atime=2018-06-27-16:36:23 mtime=2018-06-27-16:36:23 ctime=2018-06-27-16:36:23 atime_sec=1530117383, atime_nsec=955733539, mtime_sec=1530117383, mtime_n
sec=967733546, ctime_sec=1530117383, ctime_nsec=967733546
[2018-06-27 16:36:23.969305] T [fuse-bridge.c:2399:fuse_readv_resume] 0-glusterfs-fuse: 2929210: READ (0x7feb20562ff0, size=8192, offset=41058304)

[2018-06-27 16:36:23.969464] T [fuse-bridge.c:970:fuse_getattr_resume] 0-glusterfs-fuse: 2929211: FGETATTR 140648101387600 ((null)/0x7feb203abd90)

Note that this getattr was issued after write extended the file

[2018-06-27 16:36:23.969496] T [fuse-bridge.c:862:fuse_attr_cbk] 0-glusterfs-fuse: 2929211: FSTAT() ERR => 13592650803908775745stat = gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 ino=13592650803908775745, mode=1006
00, nlink=1, uid=26, gid=26, size=202088448, blocks=394704, atime=2018-06-27-16:36:23 mtime=2018-06-27-16:36:23 ctime=2018-06-27-16:36:23 atime_sec=1530117383, atime_nsec=955733539, mtime_sec=1530117383, mtime_n
sec=967733546, ctime_sec=1530117383, ctime_nsec=967733546

Size is still 202088448. Effect of write (id 2929209) is not reflected in a stat (id 2929211) issued after write. This is the bug.

The bug is in md-cache.  Imagine following series of operations:
    * A stat s1 was issued from fuse layer when size of file was s1
    * stat s1 completes on brick stack, but yet to reach md-cache
    * A write w1 extends file to size s2, but write is cached in
      write-behind and response is unwound
    * md-cache while handling write-cbk, invalidates cache of file
    * md-cache receives response for s1, updates cache with stale stat
    
    Fix is to remember whether s1 was older than w1. If yes, s1 won't
    update cache in write-behind.
Comment 16 Raghavendra G 2018-06-27 20:40:11 EDT
(In reply to Raghavendra G from comment #15)  
>     Fix is to remember whether s1 was older than w1. If yes, s1 won't
>     update cache in write-behind.

s/write-behind/md-cache
Comment 17 Worker Ant 2018-06-27 20:42:57 EDT
REVIEW: https://review.gluster.org/20410 (performance/md-cache: update cache only from fops issued after previous invalidation) posted (#2) for review on master by Raghavendra G
Comment 18 Raghavendra G 2018-06-28 00:42:14 EDT
(In reply to Raghavendra G from comment #15)
 
> The bug is in md-cache.  Imagine following series of operations:
>     * A stat s1 was issued from fuse layer when size of file was s1
>     * stat s1 completes on brick stack, but yet to reach md-cache
>     * A write w1 extends file to size s2, but write is cached in
>       write-behind and response is unwound

write-behind has no role. The bug can happen even without write-behind. In fact its necessary that two fops that would fetch stat (like a write and stat or two stats on the same file) to be invoked from two different threads. Note that caching and premature unwinding in write-behind cannot cause this class of bugs.

Updated RCA is below:

Imagine following series of operations from two application threads:
    * A stat s1 was issued from application thread t1 when size of file
      was s1
    * stat s1 completes on brick stack, but yet to reach md-cache
    * A write w1 from application thread t2 extends file to size s2
    * md-cache while handling write-cbk, updates size to s2
    * md-cache receives response for s1, updates cache with stale stat
      with size of s1 overwriting a valid stat s2
    
    Similar race can occur between any fops that fetch stat done by two
    application threads.
    
    Fix is to remember whether s1 was older than w1. If yes, s1 won't
    update cache in md-cache.
Comment 19 Worker Ant 2018-06-28 01:11:21 EDT
REVIEW: https://review.gluster.org/20413 (performance/write-behind: better invalidation in readdirp) posted (#1) for review on master by Raghavendra G
Comment 20 Raghavendra G 2018-06-28 04:38:14 EDT
With patch #20413 and patch #20410 applied I was able to successfully run around 70 iterations of the test case presented in this bz. No performance xlator was turned off in this exercise.
Comment 21 Vijay Bellur 2018-06-29 20:06 EDT
Created attachment 1455587 [details]
strace output
Comment 22 Vijay Bellur 2018-06-29 20:12:00 EDT
While running tests on a build with patches 20413 and 20410, I observed the following:

$pgbench -c 8 -j 2 -T 30 bench-db
starting vacuum...WARNING:  corrupted statistics file "pg_stat_tmp/pgstat.stat"
end.

This is usually seen in the second iteration of a pgbench loop. First iteration ran clean quite a few times.

Disabling quick-read seems to eliminate this problem. Relevant parts from attached strace output (comment 21):

586436 open("pg_stat_tmp/pgstat.stat", O_RDONLY) = 12
586436 fstat(12, {st_mode=S_IFREG|0600, st_size=13479, ...}) = 0
586436 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f57c5dec000
586436 read(12, "\232\274\245\1\353\203Q\361\316\22\2\0\2\0\0\0\0\0\0\0\245\0\0\0\0\0\0\0+l\"\0"..., 8192) = 8192
586436 close(12)                        = 0
586436 munmap(0x7f57c5dec000, 8192)     = 0
586436 sendto(10, "\1\0\0\0\20\0\0\0])W\362\316\22\2\0", 16, 0, NULL, 0) = 16
586436 select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
586436 open("pg_stat_tmp/pgstat.stat", O_RDONLY) = 12
586436 fstat(12, 0x7fff84995980)        = -1 ESTALE (Stale file handle)
586436 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f57c5dec000
586436 read(12, 0x7f57c5dec000, 8192)   = -1 ESTALE (Stale file handle)
586436 write(2, "\0\0>\0\304\362\10\0tWARNING:  corrupted sta"..., 71) = 71
586436 sendto(11, "N\0\0\0{SWARNING\0C01000\0Mcorrupted "..., 124, 0, NULL, 0) = 124
586436 close(12)
Comment 23 Vijay Bellur 2018-06-29 20:15:13 EDT
ESTALE observations in the client log file:

[2018-06-30 00:13:35.907136] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 8498054: FSTAT() ERR => -1 (Stale file handle)
[2018-06-30 00:13:35.981005] E [MSGID: 109040] [dht-helper.c:1386:dht_migration_complete_check_task] 26-db-dht: 02886b52-168b-47af-8567-3192a3410cd9: failed to lookup the file on db-dht [Stale file handle]
[2018-06-30 00:13:35.981086] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 8498122: FSTAT() <gfid:02886b52-168b-47af-8567-3192a3410cd9> => -1 (Stale file handle)
Comment 24 Vijay Bellur 2018-06-29 20:43:58 EDT
More client logs:

2018-06-30 00:39:57.083002] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk]
 4-db-client-1: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.083717] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk]
 4-db-client-0: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.083720] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk]
 4-db-client-1: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.083834] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-2: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.085388] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-0: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.085521] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-1: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.085610] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-2: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.086641] E [MSGID: 109040] [dht-helper.c:1386:dht_migration_complete_check_task] 4-db-dht: 3fd00188-08b6-47bb-883a-c8fe87f29b91: failed to lookup the file on db-dht [Stale file handle]
[2018-06-30 00:39:57.086715] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 97871: FSTAT() <gfid:3fd00188-08b6-47bb-883a-c8fe87f29b91> => -1 (Stale file handle)
Comment 25 Raghavendra G 2018-06-30 10:24:18 EDT
Note the pattern of "postgres: autovacuum launcher process"

openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.002975>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000007>
read(4, "\235\274\245\1\207\366\206\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 8192) = 2457 <0.000079>
close(4)                                = 0 <0.000114>
sendto(10, "\1\0\0\0 \0\0\0\nN\331\371\332\22\2\0\352\254\321\371\332\22\2\0\0\0\0\0\0\0\0\0", 32, 0, NULL, 0) = 32 <0.000025>
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) <0.010212>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000870>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000008>
read(4, "\235\274\245\1\207\366\206\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 8192) = 2457 <0.000106>
close(4)                                = 0 <0.000091>
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) <0.010089>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000704>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000008>
read(4, "\235\274\245\1\364^\331\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 8192) = 2457 <0.000268>
close(4)                                = 0 <0.000076>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000433>

Another process "postgres: stats collector process", acts as a writer to global.stat (and other .stat files). It,
1. creates a pg_stat_tmp/global.tmp
2. writes to the file
3. rename (pg_stat_tmp/global.tmp, pg_stat_tmp/global.stat)

openat(AT_FDCWD, "pg_stat_tmp/global.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.004589>
fstat(3, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000011>
openat(AT_FDCWD, "pg_stat_tmp/db_0.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 <0.001744>
fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000005>
write(4, "\235\274\245\1T\275\4\0\0\0\0\0\0\24\0\0\0\0\0\0\0(\0\0\0\0\0\0\0\0\0\0"..., 2033) = 2033 <0.000099>
close(4)                                = 0 <0.000706>
rename("pg_stat_tmp/db_0.tmp", "pg_stat_tmp/db_0.stat") = 0 <0.003799>
write(3, "\235\274\245\1\7C9\370\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 2457) = 2457 <0.000346>
close(3)                                = 0 <0.001612>
rename("pg_stat_tmp/global.tmp", "pg_stat_tmp/global.stat") = 0 <0.003582>

Basically, this I/O pattern expects that fd opened on a file before rename to  be valid after rename overwrote that file (fd is opened on dst of rename before rename is done).

While looking at open-behind for rename I found that while rename tries to open fd on dst before rename is wound down, I found a bug in that code. Note that,

1. Only first ever fd opened on an inode is opened-behind. Rest of the opens are unwound only after open is done on bricks.
2. So, ob_rename tries to open _that_ first fd, by doing this:

<ob_rename>
        stub = fop_rename_stub (frame, default_rename_resume, src, dst, xdata);
        if (!stub)
                goto err;

        if (dst->inode)
		fd = fd_lookup (dst->inode, 0);

        open_and_resume (this, fd, stub);
</ob_rename>

Note how it gets the fd to be opened - by doing fd_lookup (with pid 0). However, fd_lookup with pid = 0 returns the first fd in the list of fds opened on the inode.

<__fd_lookup>

	list_for_each_entry (iter_fd, &inode->fd_list, inode_list) {
                if (iter_fd->anonymous)
			/* If someone was interested in getting an                                                                                                                                                 
                           anonymous fd (or was OK getting an anonymous fd),                                                                                                                                       
                           they can as well call fd_anonymous() directly */
	                continue;

               	if (!pid || iter_fd->pid == pid) {
			fd = __fd_ref (iter_fd);
	                break;
                }
        }

</__fd_lookup>

But, if we look at __fd_bind which ads an fd to this list,

<__fd_bind>

        list_del_init (&fd->inode_list);
        list_add (&fd->inode_list, &fd->inode->fd_list);
        fd->inode->fd_count++;

</__fd_bind>

So, the first member of inode->fd_list is the recently opened fd on inode. But the opened-behind fd is the oldest fd that is opened on the inode and is the last member of inode->fd_list.

So, when more than one fds are opened on a file ob_rename actually tries to open the fd which is already opened on bricks missing the fd which is opened-behind and not opened on brick. When rename is done, dst is overwritten with a new gfid . When operations like fstat/readv are done on the opened-behind fd, they fail with ESTALE as the gfid is no longer present on bricks being overwritten by rename.

Similar bug is present in ob_unlink too.

I couldn't reproduce the corruption while stracing (may be tracing altered the race), but I could reproduce the bug without stracing (with all perf xlators tuned on). When I turned off open-behind, the bug went away. Given the reasoning and the fact that I couldn't reproduce the bug with open-behind off, I conclude the RCA for corruption is the one I explained above.

I am going to send a fix to ob_rename and ob_unlink and report back the results with the fix.
Comment 26 Worker Ant 2018-06-30 21:40:48 EDT
REVIEW: https://review.gluster.org/20428 (performance/ob: check the oldest fd(s) for pending opens in rename and unlink) posted (#1) for review on master by Raghavendra G
Comment 27 Worker Ant 2018-07-03 02:01:40 EDT
REVIEW: https://review.gluster.org/20444 (Revert \"performance/quick-read: Use generation numbers to avoid updating the cache with stale data\") posted (#1) for review on master by Raghavendra G
Comment 28 Worker Ant 2018-07-03 02:02:46 EDT
REVIEW: https://review.gluster.org/20445 (performance/quick-read: Don't update cache with stale data) posted (#1) for review on master by Raghavendra G
Comment 29 Worker Ant 2018-07-03 21:57:41 EDT
REVIEW: https://review.gluster.org/20456 (performance/write-behind: adjust generation numbers while collapsing small writes) posted (#1) for review on master by Raghavendra G
Comment 30 Worker Ant 2018-07-07 10:33:48 EDT
REVIEW: https://review.gluster.org/20478 (performance/md-cache: update cache only from fops issued after previous invalidation) posted (#1) for review on master by Raghavendra G
Comment 31 Worker Ant 2018-07-22 04:10:17 EDT
REVIEW: https://review.gluster.org/20549 (performance/md-cache: update cache only from fops issued after previous invalidation) posted (#1) for review on master by Raghavendra G
Comment 32 Worker Ant 2018-07-27 05:25:20 EDT
REVIEW: https://review.gluster.org/20576 (performance/write-behind: synchronize rename with cached writes on src) posted (#1) for review on master by Raghavendra G
Comment 33 Worker Ant 2018-07-27 06:02:46 EDT
REVIEW: https://review.gluster.org/20579 (performance/quick-read: don't update with stale data after invalidation) posted (#1) for review on master by Raghavendra G
Comment 34 Worker Ant 2018-07-28 21:50:24 EDT
COMMIT: https://review.gluster.org/20413 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/write-behind: better invalidation in readdirp

Current invalidation of stats in wb_readdirp_cbk is prone to races. As
the deleted comment explains,

<snip>
We cannot guarantee integrity of entry->d_stat as there are cached
writes. The stat is most likely stale as it doesn't account the cached
writes. However, checking for non-empty liability list here is not a
fool-proof solution as there can be races like,
  1. readdirp is successful on posix
  2. sync of cached write is successful on posix
  3. write-behind received sync response and removed the request from
     liability queue
  4. readdirp response is processed at write-behind.

In the above scenario, stat for the file is sent back in readdirp
response but it is stale.
</snip>

Change-Id: I6ce170985cc6ce3df2382ec038dd5415beefded5
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 35 Worker Ant 2018-08-01 23:20:12 EDT
COMMIT: https://review.gluster.org/20576 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/write-behind: synchronize rename with cached writes on src

rename response contains a postbuf stat of src-inode. Since md-cache
caches stat in rename codepath too, we've to make sure stat accounts
any cached writes in write-behind. So, we make sure rename is resumed
only after any cached writes are committed to backend.

Change-Id: Ic9f2adf8edd0b58ebaf661f3a8d0ca086bc63111
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 36 Worker Ant 2018-08-02 03:01:47 EDT
COMMIT: https://review.gluster.org/20549 committed in master by "Poornima G" <pgurusid@redhat.com> with a commit message- performance/md-cache: update cache only from fops issued after previous invalidation

Invalidations are triggered mainly by two codepaths - upcall and
write-behind unwinding a cached write with zeroed out stat. For the
case of upcall, following race can happen:
* stat s1 is fetched from brick
* invalidation is detected on brick
* invalidation is propagated to md-cache and cache is invalidated
* s1 updates md-cache with a stale state

For the case of write-behind, imagine following sequence of operations,
* A stat s1 was issued from application thread t1 when size of file
  was s1
* stat s1 completes on brick stack, but yet to reach md-cache
* A write w1 from application thread t2 extends file to size s2 is
  cached in write-behind and response is unwound with zeroed out stat
* md-cache while handling write-cbk, invalidates cache
* md-cache receives response for s1, updates cache with stale stat
  with size of s1 overwriting invalidation state

Fix is to remember when s1 was incident on md-cache and update cache
with results of s1 only if the it was incident after invalidation of
cache.

This patch identified some bugs in regression tests which is tracked
in https://bugzilla.redhat.com/show_bug.cgi?id=1608158. As a stop gap
measure I am marking following tests as bad
        basic/afr/split-brain-resolution.t
        bugs/bug-1368312.t
        bugs/replicate/bug-1238398-split-brain-resolution.t
        bugs/replicate/bug-1417522-block-split-brain-resolution.t
        bugs/replicate/bug-1438255-do-not-mark-self-accusing-xattrs.t

Change-Id: Ia4bb9dd36494944e2d91e9e71a79b5a3974a8c77
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 37 Worker Ant 2018-08-03 10:47:35 EDT
COMMIT: https://review.gluster.org/20428 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/ob: stringent synchronization between rename/unlink and open

Issue 1:
========
open all pending fds before resuming rename and unlink

currently ob uses fd_lookup to find out the opened-behind. But,
fd_lookup gives the recent fd opened on the inode, but the oldest
fd(s) (there can be multiple fds opened-behind when the very first
opens on an inode are issued in parallel) are the candidates for fds
with pending opens on backend. So, this patch explictily tracks the
opened-behind fds on an inode and opens them before resuming rename or
unlink.

similar code changes are also done for setattr and setxattr to make
sure pending opens are complete before permission change.

This patch also adds a check for an open-in-progress to
ob_get_wind_fd. If there is already an open-in-progress,
ob_get_wind_fd won't return an anonymous fd as a result. This is done
to make sure that rename/unlink/setattr/setxattr don't race with an
operation like readv/fstat on an anonymous fd already in progress.

Issue 2:
========
once renamed/unlinked, don't open-behind any future opens on the same
inode.

Issue 3:
========
Don't use anonymous fds by default. Note that rename/unlink can race
with a read/fd on anonymous fds and these operations can fail with
ESTALE. So, for better consistency in default mode, don't use
anonymous fds. If performance is needed with tradeoff of consistency,
one can switch on the option "use-anonymous-fd"

Change-Id: Iaf130db71ce61ac37269f422e348a45f6ae6e82c
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 38 Worker Ant 2018-08-05 03:40:28 EDT
COMMIT: https://review.gluster.org/20579 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/quick-read: don't update with stale data after invalidation

Once invalidated, make sure that only ops incident after invalidation
update the cache. This makes sure that ops before invalidation don't
repopulate cache with stale data. This patch also uses an internal
counter instead of frame->root->unique for keeping track of
generations.

Change-Id: I6b38b141985283bd54b287775f3ec67b88bf6cb8
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 39 Worker Ant 2018-08-11 09:54:13 EDT
REVIEW: https://review.gluster.org/20705 (performance/quick-read: handle rollover of generation counter) posted (#1) for review on master by Raghavendra G
Comment 40 Worker Ant 2018-08-13 07:40:14 EDT
COMMIT: https://review.gluster.org/20705 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/quick-read: handle rollover of generation counter

Change-Id: I37a6e0efda430b70d03dd431c35bef23b3d16361
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
Comment 41 Worker Ant 2018-08-14 05:50:56 EDT
REVIEW: https://review.gluster.org/20737 (performance/read-ahead: flush cache on fsync/flush failures) posted (#1) for review on master by Raghavendra G
Comment 42 Worker Ant 2018-08-21 00:19:51 EDT
REVIEW: https://review.gluster.org/20871 (Revert \"performance/write-behind: better invalidation in readdirp\") posted (#1) for review on master by Raghavendra G
Comment 43 Worker Ant 2018-08-21 00:21:04 EDT
REVIEW: https://review.gluster.org/20872 (performance/write-behind: fix fulfill and readdirp race) posted (#1) for review on master by Raghavendra G
Comment 44 Worker Ant 2018-08-21 07:37:27 EDT
COMMIT: https://review.gluster.org/20871 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- Revert "performance/write-behind: better invalidation in readdirp"

This reverts commit 4d3c62e71f3250f10aa0344085a5ec2d45458d5c.

Traversing all children of a directory in wb_readdirp caused
significant performance regression. Hence reverting this patch

Change-Id: I6c3b6cee2dd2aca41d49fe55ecdc6262e7cc5f34
updates: bz#1512691
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Comment 45 Worker Ant 2018-08-23 11:41:35 EDT
COMMIT: https://review.gluster.org/20872 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- performance/write-behind: fix fulfill and readdirp race

Current invalidation of stats in wb_readdirp_cbk is prone to races. As
the deleted comment explains,

<snip>
    We cannot guarantee integrity of entry->d_stat as there are cached
    writes. The stat is most likely stale as it doesn't account the cached
    writes. However, checking for non-empty liability list here is not a
    fool-proof solution as there can be races like,
      1. readdirp is successful on posix
      2. sync of cached write is successful on posix
      3. write-behind received sync response and removed the request from
         liability queue
      4. readdirp response is processed at write-behind.

    In the above scenario, stat for the file is sent back in readdirp
    response but it is stale.
</snip>

The fix is to mark readdirp sessions (tracked in this patch by
non-zero value of "readdirps" on parent inode) and if fulfill
completes when one or more readdirp sessions are in progress, mark the
inode so that wb_readdirp_cbk doesn't send iatts for that in inode in
readdirp response. Note that wb_readdirp_cbk already checks for
presence of a non-empty liability queue and invalidates iatt. Since
the only way a liability queue can shrink is by fulfilling requests in
liability queue, wb_fulfill_cbk indicates wb_readdirp_cbk that a
potential race could've happened b/w readdirp and fulfill.

Change-Id: I12d167bf450648baa64be1cbe1ca0fddf5379521
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
updates: bz#1512691
Comment 46 Worker Ant 2018-08-23 11:51:38 EDT
REVIEW: https://review.gluster.org/20980 (performance/read-ahead: add more trace msgs) posted (#1) for review on master by Raghavendra G
Comment 47 Worker Ant 2018-08-23 11:52:54 EDT
REVIEW: https://review.gluster.org/20981 (performance/read-ahead: update next expected offset before unwind) posted (#1) for review on master by Raghavendra G
Comment 48 Worker Ant 2018-08-27 02:16:07 EDT
REVIEW: https://review.gluster.org/21006 (performance/read-ahead: flush old data only when read is unwound) posted (#1) for review on master by Raghavendra G
Comment 49 Worker Ant 2018-09-04 23:10:25 EDT
REVIEW: https://review.gluster.org/21035 (performance/write-behind: serve stats locally) posted (#2) for review on master by Raghavendra G
Comment 50 Worker Ant 2018-10-12 05:05:21 EDT
REVIEW: https://review.gluster.org/21398 (performance/io-cache: update pages with write data) posted (#1) for review on master by Raghavendra G
Comment 51 Shyamsundar 2018-10-23 11:06:35 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.