Per bug 1435832, I'm re-reporting this under a newer version (3.12). I'm not the original submitter, but I can confirm that this problem still occurs in 3.12.1.

My steps to reproduce:
1. Install GlusterFS on 3+ nodes (I used 3.12.1 packages from CentOS-SIG, on RHEL 7.3). I used Heketi to set up the cluster, but AFAICT it shouldn't matter.
2. Create a volume. I used Heketi again: distributed-replicated, 1x3.
3. Mount it somewhere (e.g. /mnt).
4. Run the "pgbench" tool like this (pass the right mountpoint):

docker run -d --net host -v /mnt:/var/lib/postgresql/data --name pgbench postgres:alpine
docker exec -it pgbench psql -U postgres
create database pgbench;
time docker exec -it pgbench pgbench pgbench -U postgres -i -s 100
time docker exec -it pgbench pgbench pgbench -U postgres -c 50 -j 8 -t 20000 -r -P 10

Everything up to and including "create database" works fine. The first pgbench command *almost* works: it adds all the rows, but then fails at the end, around when it would do a vacuum. The second pgbench command fails spectacularly right away. If you reduce the second command to "-c 1 -j 1", it will work, at least for a while; it's incredibly slow for me, so I didn't wait around to see if it completes.

If /mnt is a regular local filesystem (or Ceph RBD), this works fine. It only fails if it's a GlusterFS volume.

+++ This bug was initially created as a clone of Bug #1435832 +++

Description of problem:

I'm running Gluster in a Kubernetes cluster with the help of https://github.com/gluster/gluster-kubernetes. I have a PostgreSQL container where the /var/lib/postgresql/data/pgdata directory is a GlusterFS-mounted persistent volume. I then run another container to restore a PostgreSQL backup. It successfully restores all tables except one, which happens to be the largest table, at >100MB.
The error given for that table is:

```
ERROR: unexpected data beyond EOF in block 14917 of relation base/78620/78991
HINT: This has been seen to occur with buggy kernels; consider updating your system.
CONTEXT: COPY es_config_app_solutiondraft, line 906
```

I have tried several different containers to perform the restore, including ubuntu:16.04, postgresql:9.6.2, and alpine:3.5. All have the same issue. Interestingly, the entire restore works, including the large table, if I run it directly on the postgresql container. That makes me think this is related to container-to-container networking and not necessarily Gluster's fault, but I wanted to report it in case there are any suggestions or kernel setting tweaks to fix the issue.

Version-Release number of selected component (if applicable):
GlusterFS 3.8.5
PostgreSQL 9.6.2

Container:
uname -a
Linux develop-postgresql-3992946951-3srqg 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 GNU/Linux

Other containers used for the restore are running the same 4.4.0-65-generic kernel.
Kubernetes 1.5.1
Docker 1.12.6

How reproducible:
First, get Kubernetes working with Gluster and Heketi. See https://github.com/gluster/gluster-kubernetes

Steps to Reproduce:
1. Start a PostgreSQL "pod" with /var/lib/postgresql/data/pgdata set up as a persistent volume.
2. Start a second container that can access the postgres container.
3. Attempt to restore a backup containing a large table (>100MB).

Actual results:
Restore fails on the large table with the above error.

Expected results:
Restore applies cleanly, even for large tables.

Additional info:
The volume is mounted as type fuse.glusterfs. From the postgresql container:

# mount
10.163.148.196:vol_6d09a586370e26a718a74d5d280f8dfd on /var/lib/postgresql/data/pgdata type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

Googling this error does return some fairly old results that don't really have anything conclusive.
--- Additional comment from on 2017-03-28 14:19:25 EDT ---

My thoughts about it being a container networking issue were incorrect. I now believe this is truly a GlusterFS + PostgreSQL issue. I confirmed that I occasionally do get restore failures on the postgresql container itself, which eliminates the container networking interface (CNI). I also get occasional successful restores on separate restore containers, which further eliminates CNI. The "unexpected data beyond EOF" error occurs intermittently, with about a ~30% success rate, regardless of how the restore is attempted. Also, the failing table is actually 244MB; all other tables that do restore successfully are under 10MB.

--- Additional comment from Zbigniew Kostrzewa on 2017-09-28 01:29:38 EDT ---

Just recently I bumped into the same error using GlusterFS 3.10.5 and 3.12.1 (from SIG repositories). I created a cluster of 3 VMs with CentOS 7.2 (uname below) and spun up a PostgreSQL 9.6.2 Docker (v17.06) container. The GlusterFS volume was bind-mounted into the container at the default location where PostgreSQL stores its data (/var/lib/postgresql/data). When filling up the database with data, at some point I got this "unexpected data beyond EOF" error. A similar issue was discussed on PostgreSQL's mailing list, but about PostgreSQL on NFS. In fact, such an issue was reported and fixed in RHEL5 (https://bugzilla.redhat.com/show_bug.cgi?id=672981). I tried using the latest PostgreSQL docker image (i.e. 9.6.5), unfortunately with the same results.

uname -a:
Linux node-10-9-4-109 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

--- Additional comment from Rui on 2017-10-23 06:28:54 EDT ---

I'm having the same problem here. I installed PostgreSQL 9.6.5 on 3.10.0-693.2.2.el7.x86_64 and executed pgbench with a scale factor of 1000, i.e. 10,000,000 accounts. The first run was executed on the OS filesystem. Everything went well.
After that I stopped PostgreSQL, created a GlusterFS replicated volume (3 replicas), and copied the PostgreSQL data directory into the GlusterFS volume. The volume is mounted as type fuse.glusterfs:

10.112.76.37:gv0 on /mnt/batatas type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

After that I tried to run pgbench. With a concurrency level of one, things work fine. However, with a concurrency level > 1, this error occurs:

client 1 aborted in state 9: ERROR: unexpected data beyond EOF in block 316 of relation base/16384/16516
HINT: This has been seen to occur with buggy kernels; consider updating your system.

I'm using glusterfs 3.12.2. Any idea?

--- Additional comment from Niels de Vos on 2017-11-07 05:40:31 EST ---

This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
Same problem here. I read that most databases are not supported on Gluster, but could we fix that? See "Gluster does not support so called “structured data”, meaning live, SQL databases." in http://docs.gluster.org/en/latest/Install-Guide/Overview/.

We are running:
- glusterfs 3.12.3
- kernel 3.10.0-693.el7.x86_64
- Red Hat Enterprise Linux Server release 7.4 (Maipo)

It seems to be a problem with lseek(SEEK_END) not taking the last write into account in high-traffic situations. See the postgres source code in "src/backend/storage/buffer/bufmgr.c" around line 806.
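For readers unfamiliar with the bufmgr.c code path: PostgreSQL computes a relation's size with lseek(SEEK_END) on its own fd while another backend may have just extended the file through a different fd. The sketch below (not Gluster or PostgreSQL code; just the POSIX two-fd pattern the database relies on, with a hypothetical temp path) shows the coherence guarantee that a local filesystem provides and that the caching xlators on a fuse-mounted volume were breaking:

```python
import os
import tempfile

# Backend 1 extends the relation file through its own fd; backend 2
# computes the relation size with lseek(SEEK_END) on a *different* fd
# of the same file.
path = os.path.join(tempfile.mkdtemp(), "relfile")  # hypothetical path

fd_writer = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
fd_reader = os.open(path, os.O_RDONLY)

BLCKSZ = 8192                          # PostgreSQL's default block size
os.write(fd_writer, b"\0" * BLCKSZ)    # backend 1 appends one block

# POSIX requires the new size to be visible here immediately:
size_seen = os.lseek(fd_reader, 0, os.SEEK_END)
print(size_seen)                       # 8192 on any coherent filesystem

os.close(fd_writer)
os.close(fd_reader)
```

If the second fd sees the old, smaller size, PostgreSQL writes a block "past EOF" that already contains data, which is exactly the "unexpected data beyond EOF" error reported here.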
Thanks for the pointer, Milo. We will pick this up with the 4.x version (starting March).
Hi Norman, there was another bug related to Postgres: Bug 1518710. There the OP indicated that turning off open-behind and write-behind mitigated the issue. So I'm asking you too:

1) Does turning {open,write}-behind off make a difference for you?
2) If it helps to the degree of not seeing the issue with that setting, is it also a viable workaround performance-wise?

Thanks,
Csaba
Hi Csaba,

Sorry, I don't have a test environment for GlusterFS anymore. This problem and others (including the lack of action on this bug; see the original bug I cloned from) caused us to give up and look elsewhere. We still haven't found a suitable open-source system, but I doubt we'll be looking back at GlusterFS any time soon, so it's unlikely I'll be of any further help. Sorry I can't light the way any further. Hopefully the steps I left in comment 0 will help someone else replicate it and proceed further.

Cheers,
Norman
Thanks for getting back to us, Norman. We understand the reasoning. We will let you know once this workload is well supported on top of GlusterFS.
REVIEW: https://review.gluster.org/19673 (fuse: enable proper "fgetattr"-like semantics) posted (#1) for review on master by Csaba Henk
COMMIT: https://review.gluster.org/19673 committed in master by "Raghavendra G" <rgowdapp> with a commit message- fuse: enable proper "fgetattr"-like semantics

The GETATTR FUSE message can carry a file handle reference, in which case it serves as a hint for the FUSE server that the stat data is preferably acquired in the context of the given filehandle (which we call '"fgetattr"-like semantics'). So far FUSE ignored the GETATTR-provided filehandle and grabbed a file handle heuristically. This caused confusion in the caching layers, which has been tracked down as one of the causes of the referred BUG.

As of the BUG, this is just a partial fix.

BUG: 1512691
Change-Id: I67eebbf5407ca725ed111fbda4181ead10d03f6d
Signed-off-by: Csaba Henk <csaba>
(In reply to Milo from comment #1)
> Same problem here. I read that most databases are not supported on Gluster
> but could we fix that?
> See "Gluster does not support so called “structured data”, meaning live, SQL
> databases." in http://docs.gluster.org/en/latest/Install-Guide/Overview/.
>
> We are running:
> - glusterfs 3.12.3
> - kernel 3.10.0-693.el7.x86_64
> - Red Hat Enterprise Linux Server release 7.4 (Maipo)
>
> It seems to be a problem with lseek(SEEK_END) not taking into account the
> last write in hi-traffic situations.
> See postgres source code in "src/backend/storage/buffer/bufmgr.c" around
> line 806.

We have also hit the same problem under heavy load when populating our database inside a PostgreSQL container mapped to a replicated GlusterFS volume.

Our environment consists of:
- glusterfs 3.12.3
- kernel 4.4.0-31-generic
- Ubuntu 16.04.3 LTS

We applied the workaround mentioned in Bug #1518710 (turning off both performance.open-behind and performance.write-behind), which got rid of the errors under load.

My questions are:
1) Can Red Hat please clarify what the statement "Gluster does not support so called “structured data”, meaning live, SQL databases." means? Is this to say the configuration we have (postgres using a glusterfs volume) is not supported?
2) In what version of Gluster is this problem targeted to be fixed? In about what time frame?

Thanks,
-Phil
REVIEW: https://review.gluster.org/20082 (Revert "performance/write-behind: fix flush stuck by former failed writes") posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20083 (performance/read-ahead: throwaway read-ahead cache of all fds on writes on any fd) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/20082 committed in master by "Raghavendra G" <rgowdapp> with a commit message- Revert "performance/write-behind: fix flush stuck by former failed writes"

This reverts commit 9340b3c7a6c8556d6f1d4046de0dbd1946a64963.

Operations/writes across different fds of the same file cannot be considered independent. For example, man 2 fsync states:

<man 2 fsync>
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device
</man>

This means fsync is an operation on the file, and the fd is just a way to reach the file. So, it has to sync writes done on other fds too. Patch 9340b3c7a6c prevents this. The problem fixed by patch 9340b3c7a6c - a flush on an fd hung on a failed write (held in cache for retrying) on a different fd - is solved in this patch by making sure __wb_request_waiting_on considers failed writes on any fd as dependent on flush/fsync on any fd (not just the fd on which the writes happened) opened on the same file. This means failed writes on any fd are either synced or thrown away on witnessing flush/fsync on any fd of the same file.

Change-Id: Iee748cebb6d2a5b32f9328aff2b5b7cbf6c52c05
Signed-off-by: Raghavendra G <rgowdapp>
Updates: bz#1512691
COMMIT: https://review.gluster.org/20083 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/read-ahead: throwaway read-ahead cache of all fds on writes on any fd

This is to make sure applications that read and write on different fds of the same file work.

This patch also fixes two other issues:
1. While iterating over the list of open fds on an inode, initialize tmp_file to 0 for each iteration before fd_ctx_get, to make sure we don't carry over the history from previous iterations.
2. Remove flushing of cache in flush and fsync, as by themselves they don't modify the data.

Change-Id: Ib9959eb73702a3ebbf90badccaa16b2608050eff
Signed-off-by: Raghavendra G <rgowdapp>
Updates: bz#1512691
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
Reopening, as the (write-behind + stat-prefetch) combination is still giving errors.
[2018-06-27 16:36:23.969018] T [fuse-bridge.c:862:fuse_attr_cbk] 0-glusterfs-fuse: 2929207: FSTAT() ERR => 13592650803908775745stat = gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 ino=13592650803908775745, mode=100600, nlink=1, uid=26, gid=26, size=202088448, blocks=394704, atime=2018-06-27-16:36:23 mtime=2018-06-27-16:36:23 ctime=2018-06-27-16:36:23 atime_sec=1530117383, atime_nsec=955733539, mtime_sec=1530117383, mtime_nsec=967733546, ctime_sec=1530117383, ctime_nsec=967733546

[2018-06-27 16:36:23.969160] T [fuse-bridge.c:2503:fuse_write_resume] 0-glusterfs-fuse: 2929209: WRITE (0x7feb203abd90, size=8192, offset=202088448)
[2018-06-27 16:36:23.969211] T [fuse-bridge.c:2459:fuse_writev_cbk] 0-glusterfs-fuse: 2929209: gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 WRITE => 540720528/8192,8192/202088448

The above write extends the file by 8192, so the size should now be 202088448 + 8192.

[2018-06-27 16:36:23.969241] T [fuse-bridge.c:862:fuse_attr_cbk] 0-glusterfs-fuse: 2929206: FSTAT() ERR => 13592650803908775745stat = gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 ino=13592650803908775745, mode=100600, nlink=1, uid=26, gid=26, size=202088448, blocks=394704, atime=2018-06-27-16:36:23 mtime=2018-06-27-16:36:23 ctime=2018-06-27-16:36:23 atime_sec=1530117383, atime_nsec=955733539, mtime_sec=1530117383, mtime_nsec=967733546, ctime_sec=1530117383, ctime_nsec=967733546

[2018-06-27 16:36:23.969305] T [fuse-bridge.c:2399:fuse_readv_resume] 0-glusterfs-fuse: 2929210: READ (0x7feb20562ff0, size=8192, offset=41058304)
[2018-06-27 16:36:23.969464] T [fuse-bridge.c:970:fuse_getattr_resume] 0-glusterfs-fuse: 2929211: FGETATTR 140648101387600 ((null)/0x7feb203abd90)

Note that this getattr was issued after the write extended the file.

[2018-06-27 16:36:23.969496] T [fuse-bridge.c:862:fuse_attr_cbk] 0-glusterfs-fuse: 2929211: FSTAT() ERR => 13592650803908775745stat = gfid=cb68eef3-04f1-4719-bca2-cbe3a2e90f41 ino=13592650803908775745, mode=100600, nlink=1, uid=26, gid=26, size=202088448, blocks=394704, atime=2018-06-27-16:36:23 mtime=2018-06-27-16:36:23 ctime=2018-06-27-16:36:23 atime_sec=1530117383, atime_nsec=955733539, mtime_sec=1530117383, mtime_nsec=967733546, ctime_sec=1530117383, ctime_nsec=967733546

The size is still 202088448. The effect of the write (id 2929209) is not reflected in a stat (id 2929211) issued after the write. This is the bug.

The bug is in md-cache. Imagine the following series of operations:
* A stat s1 was issued from fuse layer when size of file was s1
* stat s1 completes on brick stack, but yet to reach md-cache
* A write w1 extends file to size s2, but write is cached in write-behind and response is unwound
* md-cache while handling write-cbk, invalidates cache of file
* md-cache receives response for s1, updates cache with stale stat

Fix is to remember whether s1 was older than w1. If yes, s1 won't update cache in write-behind.
(In reply to Raghavendra G from comment #15) > Fix is to remember whether s1 was older than w1. If yes, s1 won't > update cache in write-behind. s/write-behind/md-cache
REVIEW: https://review.gluster.org/20410 (performance/md-cache: update cache only from fops issued after previous invalidation) posted (#2) for review on master by Raghavendra G
(In reply to Raghavendra G from comment #15)
> The bug is in md-cache. Imagine following series of operations:
> * A stat s1 was issued from fuse layer when size of file was s1
> * stat s1 completes on brick stack, but yet to reach md-cache
> * A write w1 extends file to size s2, but write is cached in
> write-behind and response is unwound

write-behind has no role here. The bug can happen even without write-behind. In fact, it is necessary that two fops that fetch stat (like a write and a stat, or two stats on the same file) be invoked from two different threads. Note that caching and premature unwinding in write-behind cannot cause this class of bugs. The updated RCA is below.

Imagine the following series of operations from two application threads:
* A stat s1 was issued from application thread t1 when the size of the file was s1
* stat s1 completes on the brick stack, but is yet to reach md-cache
* A write w1 from application thread t2 extends the file to size s2
* md-cache, while handling the write-cbk, updates the size to s2
* md-cache receives the response for s1 and updates the cache with the stale stat with size s1, overwriting the valid stat s2

A similar race can occur between any fops that fetch stat, done by two application threads. The fix is to remember whether s1 was older than w1. If yes, s1 won't update the cache in md-cache.
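As an illustration only (a toy model, not the actual md-cache code), the "remember when the stat was issued" fix can be sketched with a generation counter: a stat response may update the cache only if the request left the cache layer at or after the most recent invalidation.

```python
import itertools

class MdCacheModel:
    """Toy model of the fix: stale stat responses must not repopulate
    the cache after an intervening write has invalidated it."""
    def __init__(self):
        self._counter = itertools.count()
        self.generation = next(self._counter)   # current cache generation
        self.cached_size = None

    def issue_stat(self):
        # remember the generation at which the stat request was issued
        return self.generation

    def invalidate(self):
        # a write (w1) landed: bump generation, drop cached metadata
        self.generation = next(self._counter)
        self.cached_size = None

    def stat_response(self, issued_gen, size):
        # only update the cache from stats issued after the invalidation
        if issued_gen >= self.generation:
            self.cached_size = size

cache = MdCacheModel()
s1_gen = cache.issue_stat()        # s1 leaves towards the brick (size 100)
cache.invalidate()                 # w1 extends the file meanwhile
cache.stat_response(s1_gen, 100)   # stale response arrives last: discarded
print(cache.cached_size)           # None - the race no longer poisons the cache

cache.stat_response(cache.issue_stat(), 200)   # a fresh stat is accepted
print(cache.cached_size)           # 200
```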
REVIEW: https://review.gluster.org/20413 (performance/write-behind: better invalidation in readdirp) posted (#1) for review on master by Raghavendra G
With patch #20413 and patch #20410 applied I was able to successfully run around 70 iterations of the test case presented in this bz. No performance xlator was turned off in this exercise.
Created attachment 1455587 [details] strace output
While running tests on a build with patches 20413 and 20410, I observed the following:

$ pgbench -c 8 -j 2 -T 30 bench-db
starting vacuum...WARNING: corrupted statistics file "pg_stat_tmp/pgstat.stat"
end.

This is usually seen in the second iteration of a pgbench loop. The first iteration ran clean quite a few times. Disabling quick-read seems to eliminate this problem.

Relevant parts from the attached strace output (comment 21):

586436 open("pg_stat_tmp/pgstat.stat", O_RDONLY) = 12
586436 fstat(12, {st_mode=S_IFREG|0600, st_size=13479, ...}) = 0
586436 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f57c5dec000
586436 read(12, "\232\274\245\1\353\203Q\361\316\22\2\0\2\0\0\0\0\0\0\0\245\0\0\0\0\0\0\0+l\"\0"..., 8192) = 8192
586436 close(12) = 0
586436 munmap(0x7f57c5dec000, 8192) = 0
586436 sendto(10, "\1\0\0\0\20\0\0\0])W\362\316\22\2\0", 16, 0, NULL, 0) = 16
586436 select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
586436 open("pg_stat_tmp/pgstat.stat", O_RDONLY) = 12
586436 fstat(12, 0x7fff84995980) = -1 ESTALE (Stale file handle)
586436 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f57c5dec000
586436 read(12, 0x7f57c5dec000, 8192) = -1 ESTALE (Stale file handle)
586436 write(2, "\0\0>\0\304\362\10\0tWARNING: corrupted sta"..., 71) = 71
586436 sendto(11, "N\0\0\0{SWARNING\0C01000\0Mcorrupted "..., 124, 0, NULL, 0) = 124
586436 close(12)
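For reference, the pg_stat_tmp files are rotated by PostgreSQL's stats collector via a write-to-.tmp-then-rename(2) pattern. On a local POSIX filesystem this is safe in both directions: an fd opened on the old file remains readable after the rename replaces it, and a fresh open sees the new file, so neither fstat nor read should ever return ESTALE here. A minimal sketch of that expectation (hypothetical paths, not Gluster code):

```python
import os
import tempfile

d = tempfile.mkdtemp()
stat_path = os.path.join(d, "global.stat")   # the file readers poll
tmp_path = os.path.join(d, "global.tmp")     # the collector's scratch file

with open(stat_path, "wb") as f:
    f.write(b"old-stats")

fd = os.open(stat_path, os.O_RDONLY)         # a reader opens the current file

with open(tmp_path, "wb") as f:              # collector writes a new version
    f.write(b"new-stats")
os.rename(tmp_path, stat_path)               # atomically replaces global.stat

old = os.read(fd, 16)                        # old fd still refers to the
print(old.decode())                          # replaced file: "old-stats"
os.close(fd)

with open(stat_path, "rb") as f:             # a fresh open sees the new file
    print(f.read().decode())                 # "new-stats"
```

The ESTALE in the strace above means the client stack violated one side of this contract.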
ESTALE observations in the client log file:

[2018-06-30 00:13:35.907136] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 8498054: FSTAT() ERR => -1 (Stale file handle)
[2018-06-30 00:13:35.981005] E [MSGID: 109040] [dht-helper.c:1386:dht_migration_complete_check_task] 26-db-dht: 02886b52-168b-47af-8567-3192a3410cd9: failed to lookup the file on db-dht [Stale file handle]
[2018-06-30 00:13:35.981086] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 8498122: FSTAT() <gfid:02886b52-168b-47af-8567-3192a3410cd9> => -1 (Stale file handle)
More client logs:

[2018-06-30 00:39:57.083002] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-1: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.083717] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-0: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.083720] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-1: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.083834] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-2: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.085388] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-0: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.085521] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-1: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.085610] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 4-db-client-2: remote operation failed [No such file or directory]
[2018-06-30 00:39:57.086641] E [MSGID: 109040] [dht-helper.c:1386:dht_migration_complete_check_task] 4-db-dht: 3fd00188-08b6-47bb-883a-c8fe87f29b91: failed to lookup the file on db-dht [Stale file handle]
[2018-06-30 00:39:57.086715] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 97871: FSTAT() <gfid:3fd00188-08b6-47bb-883a-c8fe87f29b91> => -1 (Stale file handle)
Note the pattern of "postgres: autovacuum launcher process":

openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.002975>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000007>
read(4, "\235\274\245\1\207\366\206\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 8192) = 2457 <0.000079>
close(4) = 0 <0.000114>
sendto(10, "\1\0\0\0 \0\0\0\nN\331\371\332\22\2\0\352\254\321\371\332\22\2\0\0\0\0\0\0\0\0\0", 32, 0, NULL, 0) = 32 <0.000025>
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) <0.010212>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000870>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000008>
read(4, "\235\274\245\1\207\366\206\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 8192) = 2457 <0.000106>
close(4) = 0 <0.000091>
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) <0.010089>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000704>
fstat(4, {st_mode=S_IFREG|0600, st_size=2457, ...}) = 0 <0.000008>
read(4, "\235\274\245\1\364^\331\371\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 8192) = 2457 <0.000268>
close(4) = 0 <0.000076>
openat(AT_FDCWD, "pg_stat_tmp/global.stat", O_RDONLY) = 4 <0.000433>

Another process, "postgres: stats collector process", acts as a writer to global.stat (and the other .stat files). It:
1. creates pg_stat_tmp/global.tmp
2. writes to the file
3. renames pg_stat_tmp/global.tmp to pg_stat_tmp/global.stat

openat(AT_FDCWD, "pg_stat_tmp/global.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.004589>
fstat(3, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000011>
openat(AT_FDCWD, "pg_stat_tmp/db_0.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 <0.001744>
fstat(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000005>
write(4, "\235\274\245\1T\275\4\0\0\0\0\0\0\24\0\0\0\0\0\0\0(\0\0\0\0\0\0\0\0\0\0"..., 2033) = 2033 <0.000099>
close(4) = 0 <0.000706>
rename("pg_stat_tmp/db_0.tmp", "pg_stat_tmp/db_0.stat") = 0 <0.003799>
write(3, "\235\274\245\1\7C9\370\332\22\2\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\277H\5\0"..., 2457) = 2457 <0.000346>
close(3) = 0 <0.001612>
rename("pg_stat_tmp/global.tmp", "pg_stat_tmp/global.stat") = 0 <0.003582>

Basically, this I/O pattern expects an fd opened on a file before a rename to remain valid after the rename overwrote that file (the fd is opened on the dst of the rename before the rename is done).

While looking at open-behind's handling of rename, where it tries to open an fd on dst before the rename is wound down, I found a bug in that code. Note that:

1. Only the first ever fd opened on an inode is opened behind. The rest of the opens are unwound only after the open is done on the bricks.
2. So, ob_rename tries to open _that_ first fd, by doing this:

<ob_rename>
stub = fop_rename_stub (frame, default_rename_resume, src, dst, xdata);
if (!stub)
        goto err;

if (dst->inode)
        fd = fd_lookup (dst->inode, 0);

open_and_resume (this, fd, stub);
</ob_rename>

Note how it gets the fd to be opened - by doing fd_lookup (with pid 0). However, fd_lookup with pid = 0 returns the first fd in the list of fds opened on the inode.

<__fd_lookup>
list_for_each_entry (iter_fd, &inode->fd_list, inode_list) {
        if (iter_fd->anonymous)
                /* If someone was interested in getting an
                   anonymous fd (or was OK getting an anonymous fd),
                   they can as well call fd_anonymous() directly */
                continue;

        if (!pid || iter_fd->pid == pid) {
                fd = __fd_ref (iter_fd);
                break;
        }
}
</__fd_lookup>

But if we look at __fd_bind, which adds an fd to this list:

<__fd_bind>
list_del_init (&fd->inode_list);
list_add (&fd->inode_list, &fd->inode->fd_list);
fd->inode->fd_count++;
</__fd_bind>

So, the first member of inode->fd_list is the most recently opened fd on the inode. But the opened-behind fd is the oldest fd opened on the inode, and it is the last member of inode->fd_list. So, when more than one fd is open on a file, ob_rename actually tries to open an fd which is already opened on the bricks, missing the fd which is opened behind and not yet opened on the bricks. When the rename is done, dst is overwritten with a new gfid. When operations like fstat/readv are done on the opened-behind fd, they fail with ESTALE, as the gfid is no longer present on the bricks, having been overwritten by the rename.

A similar bug is present in ob_unlink too.

I couldn't reproduce the corruption while stracing (maybe tracing altered the race), but I could reproduce the bug without stracing (with all perf xlators turned on). When I turned off open-behind, the bug went away. Given the reasoning, and the fact that I couldn't reproduce the bug with open-behind off, I conclude the RCA for the corruption is the one explained above. I am going to send a fix for ob_rename and ob_unlink and report back the results with the fix.
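The list-ordering mismatch described above can be reduced to a toy model (illustrative names only, not the actual gluster structures): list_add() prepends, so the head of inode->fd_list is the newest fd, while the opened-behind candidate is the oldest one at the tail.

```python
from collections import deque

fd_list = deque()

def fd_bind(fd):
    """Models __fd_bind: list_add() prepends, so the list is newest-first."""
    fd_list.appendleft(fd)

def fd_lookup_pid0():
    """Models fd_lookup(inode, 0): returns the first match in the list,
    which is the *most recent* open - not the opened-behind one."""
    return fd_list[0]

fd_bind("fd1 (opened behind, not yet on bricks)")   # very first open: deferred
fd_bind("fd2 (opened on bricks)")                   # later open: already real

picked = fd_lookup_pid0()   # what ob_rename opened before resuming the rename
oldest = fd_list[-1]        # the fd that actually has a pending open

print(picked)               # fd2 (opened on bricks) - the wrong one
print(oldest)               # fd1 (opened behind, not yet on bricks)
```

After the rename replaces dst's gfid, any fstat/readv through fd1 hits the bricks with a gfid that no longer exists, hence the ESTALE.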
REVIEW: https://review.gluster.org/20428 (performance/ob: check the oldest fd(s) for pending opens in rename and unlink) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20444 (Revert "performance/quick-read: Use generation numbers to avoid updating the cache with stale data") posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20445 (performance/quick-read: Don't update cache with stale data) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20456 (performance/write-behind: adjust generation numbers while collapsing small writes) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20478 (performance/md-cache: update cache only from fops issued after previous invalidation) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20549 (performance/md-cache: update cache only from fops issued after previous invalidation) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20576 (performance/write-behind: synchronize rename with cached writes on src) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20579 (performance/quick-read: don't update with stale data after invalidation) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/20413 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/write-behind: better invalidation in readdirp

Current invalidation of stats in wb_readdirp_cbk is prone to races. As the deleted comment explains,

<snip>
We cannot guarantee the integrity of entry->d_stat as there are cached writes. The stat is most likely stale as it doesn't account for the cached writes. However, checking for a non-empty liability list here is not a fool-proof solution, as there can be races like:
1. readdirp is successful on posix
2. sync of cached write is successful on posix
3. write-behind received the sync response and removed the request from the liability queue
4. readdirp response is processed at write-behind

In the above scenario, the stat for the file is sent back in the readdirp response, but it is stale.
</snip>

Change-Id: I6ce170985cc6ce3df2382ec038dd5415beefded5
Signed-off-by: Raghavendra G <rgowdapp>
Updates: bz#1512691
COMMIT: https://review.gluster.org/20576 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/write-behind: synchronize rename with cached writes on src

The rename response contains a postbuf stat of the src inode. Since md-cache caches the stat in the rename codepath too, we have to make sure the stat accounts for any writes cached in write-behind. So, we make sure rename is resumed only after any cached writes are committed to the backend.

Change-Id: Ic9f2adf8edd0b58ebaf661f3a8d0ca086bc63111
Signed-off-by: Raghavendra G <rgowdapp>
Updates: bz#1512691
COMMIT: https://review.gluster.org/20549 committed in master by "Poornima G" <pgurusid> with a commit message- performance/md-cache: update cache only from fops issued after previous invalidation

Invalidations are triggered mainly by two codepaths - upcall, and write-behind unwinding a cached write with a zeroed-out stat.

For the case of upcall, the following race can happen:
* stat s1 is fetched from the brick
* an invalidation is detected on the brick
* the invalidation is propagated to md-cache and the cache is invalidated
* s1 updates md-cache with a stale state

For the case of write-behind, imagine the following sequence of operations:
* A stat s1 was issued from application thread t1 when the size of the file was s1
* stat s1 completes on the brick stack, but is yet to reach md-cache
* A write w1 from application thread t2 extending the file to size s2 is cached in write-behind and the response is unwound with a zeroed-out stat
* md-cache, while handling the write-cbk, invalidates the cache
* md-cache receives the response for s1 and updates the cache with a stale stat with size s1, overwriting the invalidation state

The fix is to remember when s1 was incident on md-cache and update the cache with the results of s1 only if it was incident after the invalidation of the cache.

This patch identified some bugs in regression tests, which are tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1608158. As a stop-gap measure I am marking the following tests as bad:
basic/afr/split-brain-resolution.t
bugs/bug-1368312.t
bugs/replicate/bug-1238398-split-brain-resolution.t
bugs/replicate/bug-1417522-block-split-brain-resolution.t
bugs/replicate/bug-1438255-do-not-mark-self-accusing-xattrs.t

Change-Id: Ia4bb9dd36494944e2d91e9e71a79b5a3974a8c77
Signed-off-by: Raghavendra G <rgowdapp>
Updates: bz#1512691
COMMIT: https://review.gluster.org/20428 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/ob: stringent synchronization between rename/unlink and open

Issue 1:
========
Open all pending fds before resuming rename and unlink.

Currently ob uses fd_lookup to find the fd opened behind. But fd_lookup gives the most recent fd opened on the inode, whereas the oldest fd(s) (there can be multiple fds opened behind when the very first opens on an inode are issued in parallel) are the candidates for fds with pending opens on the backend. So, this patch explicitly tracks the opened-behind fds on an inode and opens them before resuming rename or unlink. Similar code changes are also done for setattr and setxattr, to make sure pending opens are complete before a permission change.

This patch also adds a check for an open-in-progress to ob_get_wind_fd. If there is already an open-in-progress, ob_get_wind_fd won't return an anonymous fd as a result. This is done to make sure that rename/unlink/setattr/setxattr don't race with an operation like readv/fstat already in progress on an anonymous fd.

Issue 2:
========
Once renamed/unlinked, don't open-behind any future opens on the same inode.

Issue 3:
========
Don't use anonymous fds by default.

Note that rename/unlink can race with a read/fstat on anonymous fds, and these operations can fail with ESTALE. So, for better consistency in the default mode, don't use anonymous fds. If performance is needed with a tradeoff of consistency, one can switch on the option "use-anonymous-fd".

Change-Id: Iaf130db71ce61ac37269f422e348a45f6ae6e82c
Signed-off-by: Raghavendra G <rgowdapp>
Updates: bz#1512691
COMMIT: https://review.gluster.org/20579 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/quick-read: don't update with stale data after invalidation Once invalidated, make sure that only ops incident after invalidation update the cache. This makes sure that ops before invalidation don't repopulate cache with stale data. This patch also uses an internal counter instead of frame->root->unique for keeping track of generations. Change-Id: I6b38b141985283bd54b287775f3ec67b88bf6cb8 Signed-off-by: Raghavendra G <rgowdapp> Updates: bz#1512691
REVIEW: https://review.gluster.org/20705 (performance/quick-read: handle rollover of generation counter) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/20705 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/quick-read: handle rollover of generation counter Change-Id: I37a6e0efda430b70d03dd431c35bef23b3d16361 Signed-off-by: Raghavendra G <rgowdapp> Updates: bz#1512691
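The commit message doesn't spell out how the rollover is handled, but a common way to compare wrapping counters is serial-number-style arithmetic (as in RFC 1982); a sketch under that assumption, illustrative only:

```python
# Rollover-safe generation comparison: with a fixed-width counter that
# wraps, "a is at or after b" is decided by the signed distance modulo
# 2**WIDTH, not by a plain a >= b. Names are invented for illustration.
WIDTH = 32
MOD = 1 << WIDTH

def gen_newer_or_equal(a, b):
    """True if generation a is at or after b, tolerating wraparound."""
    dist = (a - b) % MOD
    return dist < MOD // 2   # within half the space => a did not wrap behind b
```

A plain `>=` breaks the moment the counter wraps from `MOD - 1` back to 0; the modular-distance form keeps working as long as two compared generations are less than half the counter space apart.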
REVIEW: https://review.gluster.org/20737 (performance/read-ahead: flush cache on fsync/flush failures) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20871 (Revert "performance/write-behind: better invalidation in readdirp") posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20872 (performance/write-behind: fix fulfill and readdirp race) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/20871 committed in master by "Raghavendra G" <rgowdapp> with a commit message- Revert "performance/write-behind: better invalidation in readdirp"

This reverts commit 4d3c62e71f3250f10aa0344085a5ec2d45458d5c. Traversing all children of a directory in wb_readdirp caused a significant performance regression, hence this patch is being reverted.

Change-Id: I6c3b6cee2dd2aca41d49fe55ecdc6262e7cc5f34
updates: bz#1512691
Signed-off-by: Raghavendra G <rgowdapp>
COMMIT: https://review.gluster.org/20872 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/write-behind: fix fulfill and readdirp race

The current invalidation of stats in wb_readdirp_cbk is prone to races. As the deleted comment explains:

<snip>
We cannot guarantee the integrity of entry->d_stat as there are cached writes. The stat is most likely stale as it doesn't account for the cached writes. However, checking for a non-empty liability list here is not a fool-proof solution, as there can be races like:
1. readdirp is successful on posix
2. sync of a cached write is successful on posix
3. write-behind receives the sync response and removes the request from the liability queue
4. the readdirp response is processed at write-behind
In the above scenario, the stat for the file is sent back in the readdirp response but it is stale.
</snip>

The fix is to mark readdirp sessions (tracked in this patch by a non-zero value of "readdirps" on the parent inode) and, if a fulfill completes while one or more readdirp sessions are in progress, mark the inode so that wb_readdirp_cbk doesn't send iatts for that inode in the readdirp response. Note that wb_readdirp_cbk already checks for a non-empty liability queue and invalidates the iatt. Since the only way a liability queue can shrink is by fulfilling the requests in it, wb_fulfill_cbk signals to wb_readdirp_cbk that a potential race could have happened between readdirp and fulfill.

Change-Id: I12d167bf450648baa64be1cbe1ca0fddf5379521
Signed-off-by: Raghavendra G <rgowdapp>
updates: bz#1512691
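The fix described in that commit can be condensed into a sketch (hypothetical names and a simplified model, not the write-behind source): the parent inode counts in-flight readdirp sessions, a fulfill completing during any session marks the child inode, and the readdirp callback drops the iatt for marked children or for children that still have cached writes.

```python
# Illustrative sketch of the fulfill/readdirp race fix. Names invented.
class InodeCtx:
    def __init__(self):
        self.readdirps = 0           # in-flight readdirp sessions (on parent)
        self.invalidate_iatt = False # set when a fulfill raced with readdirp
        self.liability = []          # cached-but-unsynced writes

class WriteBehindSketch:
    def readdirp_start(self, parent):
        parent.readdirps += 1

    def fulfill_cbk(self, parent, child):
        child.liability.clear()      # the cached write reached the brick
        if parent.readdirps > 0:     # a readdirp may already hold a stale stat
            child.invalidate_iatt = True

    def readdirp_cbk(self, parent, entries):
        parent.readdirps -= 1
        for child, iatt in entries:
            # drop the stat if writes are still cached OR a fulfill raced
            if child.liability or child.invalidate_iatt:
                child.invalidate_iatt = False
                yield (child, None)  # the caller must refetch the stat
            else:
                yield (child, iatt)
```

This reproduces the four-step race from the deleted comment: the liability queue is already empty by the time the readdirp response is processed, but the `invalidate_iatt` mark left by `fulfill_cbk` still suppresses the stale stat.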
REVIEW: https://review.gluster.org/20980 (performance/read-ahead: add more trace msgs) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/20981 (performance/read-ahead: update next expected offset before unwind) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/21006 (performance/read-ahead: flush old data only when read is unwound) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/21035 (performance/write-behind: serve stats locally) posted (#2) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/21398 (performance/io-cache: update pages with write data) posted (#1) for review on master by Raghavendra G
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report. glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution. [1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html [2] https://www.gluster.org/pipermail/gluster-users/
REVISION POSTED: https://review.gluster.org/21398 (performance/io-cache: update pages with write data) posted (#4) for review on master by Raghavendra G
Restoring the previous state, as another bug has been raised to handle the above patch.