Bug 1561393 - Arequal checksum mismatch on older mount
Summary: Arequal checksum mismatch on older mount
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: quick-read
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Raghavendra G
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On: 1591621
Blocks: 1503143
 
Reported: 2018-03-28 09:25 UTC by Rochelle
Modified: 2019-11-25 07:34 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if quick-read was enabled, target geo-replication volumes sometimes served stale data. This problem can be avoided by enabling new software-level consistent time options in Red Hat Gluster Storage 3.5.
Clone Of:
: 1591621
Environment:
Last Closed: 2019-11-25 07:34:55 UTC
Embargoed:



Description Rochelle 2018-03-28 09:25:41 UTC
Description of problem:
========================
A fan-out geo-replication session was set up, i.e. two separate geo-rep sessions from one master to two different slaves.
Arequal is the binary we use to calculate checksums and verify whether the files have synced successfully.

The checksum for the master and one of the slaves did not match as shown:

Old mount:
----------

[root@dhcp42-61 ~]# ./arequal-checksum -p /nonroot/master

Entry counts
Regular files   : 5000
Directories     : 2013
Symbolic links  : 6900
Other           : 0
Total           : 13913

Metadata checksums
Regular files   : 749c7b4
Directories     : a7fef2
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 8256a954e23d7be1f238774372bbadc7
Directories     : 232f6d0a157e2505
Symbolic links  : 54540514086a6471
Other           : 0
Total           : 715b6098d929752

[root@dhcp42-61 ~]# ./arequal-checksum -p /nonroot/slave

Entry counts
Regular files   : 5000
Directories     : 2013
Symbolic links  : 6900
Other           : 0
Total           : 13913

Metadata checksums
Regular files   : 749c7b4
Directories     : a7fef2
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : fb99819f6b3a91ee9dc18087551ad0ef
Directories     : 232f6d0a157e2505
Symbolic links  : 54540514086a6471
Other           : 0
Total           : 1123690623340075



New mount:
----------
When the volume was mounted at a new mount point, the checksums matched as expected:

[root@dhcp42-61 ~]# ./arequal-checksum -p /mnt/new_master

Entry counts
Regular files   : 5000
Directories     : 2013
Symbolic links  : 6900
Other           : 0
Total           : 13913

Metadata checksums
Regular files   : 749c7b4
Directories     : a7fef2
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 8256a954e23d7be1f238774372bbadc7
Directories     : 232f6d0a157e2505
Symbolic links  : 54540514086a6471
Other           : 0
Total           : 715b6098d929752


[root@dhcp42-61 ~]# ./arequal-checksum -p /mnt/new_slave

Entry counts
Regular files   : 5000
Directories     : 2013
Symbolic links  : 6900
Other           : 0
Total           : 13913

Metadata checksums
Regular files   : 749c7b4
Directories     : a7fef2
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 8256a954e23d7be1f238774372bbadc7
Directories     : 232f6d0a157e2505
Symbolic links  : 54540514086a6471
Other           : 0
Total           : 715b6098d929752





Checksums for a particular file (there are about 23 files with this problem):

Old mount 
----------

[root@dhcp42-61 ~]# ./arequal-checksum -p /nonroot/master/thread7/level03/5aba0284%%DXAMLOZTMR

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 2ffd05828
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 764dc94fb8666ec09d89a12d960c75a4
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : ebc468622e6a1b64

[root@dhcp42-61 ~]# ./arequal-checksum -p /nonroot/slave/thread7/level03/5aba0284%%DXAMLOZTMR

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 2ffd05828
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 977708c84a4149d26f2eac1eb53fee55
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : f859a4d6ff7ea787




New mount :
----------

[root@dhcp42-61 ~]# ./arequal-checksum -p /mnt/new_master/thread7/level03/5aba0284%%DXAMLOZTMR

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 2ffd05828
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 764dc94fb8666ec09d89a12d960c75a4
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : ebc468622e6a1b64
[root@dhcp42-61 ~]# ./arequal-checksum -p /mnt/new_slave/thread7/level03/5aba0284%%DXAMLOZTMR

Entry counts
Regular files   : 1
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : 1

Metadata checksums
Regular files   : 2ffd05828
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 764dc94fb8666ec09d89a12d960c75a4
Directories     : 0
Symbolic links  : 0
Other           : 0
Total           : ebc468622e6a1b64



Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp42-20 ~]# rpm -qa | grep gluster
glusterfs-rdma-3.8.4-54.4.el7rhgs.x86_64
glusterfs-server-3.8.4-54.4.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.4.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.4.el7rhgs.x86_64
tendrl-gluster-integration-1.5.4-14.el7rhgs.noarch
glusterfs-cli-3.8.4-54.4.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7.x86_64
python-gluster-3.8.4-54.4.el7rhgs.noarch
glusterfs-events-3.8.4-54.4.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-54.4.el7rhgs.x86_64
glusterfs-api-3.8.4-54.4.el7rhgs.x86_64
glusterfs-3.8.4-54.4.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-fuse-3.8.4-54.4.el7rhgs.x86_64
[root@dhcp42-20 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)

How reproducible:
=================
1/1


Actual results:
===============
Checksum did not match


Expected results:
================
Checksum should match

Will attach sosreports.

Comment 3 Raghavendra G 2018-03-28 15:00:40 UTC
[root@dhcp42-61 ~]# cp -fv /nonroot/slave/thread7/level03/5aba0284%%DXAMLOZTMR slave 
‘/nonroot/slave/thread7/level03/5aba0284%%DXAMLOZTMR’ -> ‘slave’


[root@dhcp42-61 ~]# cp -fv /nonroot/master/thread7/level03/5aba0284%%DXAMLOZTMR master 
‘/nonroot/master/thread7/level03/5aba0284%%DXAMLOZTMR’ -> ‘master’

[root@dhcp42-61 ~]# md5sum master
764dc94fb8666ec09d89a12d960c75a4  master

[root@dhcp42-61 ~]# md5sum slave
977708c84a4149d26f2eac1eb53fee55  slave

[root@dhcp42-61 ~]# ls -l slave
--wxr-xr-x. 1 root root 4390 Mar 28 10:46 slave

[root@dhcp42-61 ~]# ls -l master
--wxr-xr-x. 1 root root 5025 Mar 28 10:47 master

[root@dhcp42-61 ~]# truncate --size 4390 ./master

[root@dhcp42-61 ~]# ls -l ./master
--wxr-xr-x. 1 root root 4390 Mar 28 10:47 ./master

[root@dhcp42-61 ~]# md5sum ./master
977708c84a4149d26f2eac1eb53fee55  ./master

[root@dhcp42-61 ~]# md5sum slave
977708c84a4149d26f2eac1eb53fee55  slave

Note that file "slave" is a partial copy of "master". To be precise, the first 4390 bytes of "master" match "slave".

From the logs, I can see that quick-read returns only 4390 bytes when reading from the slave mount. However, stat on both master and slave returns the correct size of 5025.


From gdb,
(gdb) c
Continuing.
[Switching to Thread 0x7f52b649d700 (LWP 20804)]

Breakpoint 1, qr_readv (frame=0x7f52a8652b50, this=0x7f52b801b750, fd=0x7f52a8678810, size=8192, offset=0, flags=32768, xdata=0x0) at quick-read.c:611

(gdb) p *qr_inode
$2 = {data = 0x7f52b0999e30, size = 4390, priority = 0, ia_mtime = 1522139994, ia_mtime_nsec = 386948000, buf = {ia_ino = 13103533730534322441, ia_gfid = "\207\235&XT1O\"\265\331\032\200\236\330\361\t", 
    ia_dev = 64779, ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', 
        write = 1 '\001', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, ia_nlink = 1, ia_uid = 0, ia_gid = 29991, ia_rdev = 0, ia_size = 5025, ia_blksize = 4096, 
    ia_blocks = 10, ia_atime = 1522140312, ia_atime_nsec = 610248000, ia_mtime = 1522139994, ia_mtime_nsec = 386948000, ia_ctime = 1522140312, ia_ctime_nsec = 614232024}, last_refresh = {tv_sec = 1522248921, 
    tv_usec = 635511}, lru = {next = 0x7f52b8036d90, prev = 0x7f52b195a3a0}}

Comment 4 Raghavendra G 2018-03-28 15:04:28 UTC
There are two aspects to the problem:

* why was partial data cached in quick-read?
* Through gdb, I could see that cache-timeout is 1s. However, for quite some time quick-read kept returning stale data. What caused the cache_refresh logic to misbehave for so long? Also note that after I attached gdb, I could see that the cache was found invalid, readv was wound to the backend, and the md5sum matched that of master.

Comment 6 Raghavendra G 2018-03-29 03:42:52 UTC
(In reply to Raghavendra G from comment #4)
> There are two aspects to the problem:
> 
> * why was partial data cached in quick-read?
> * Through gdb, I could see that cache-timeout is 1s. However, for quite some
> time quick-read kept returning stale data. What caused the cache_refresh
> logic to misbehave for so long?

Note that qr_inode->ia_mtime = qr_inode->buf.ia_mtime = 1522139994.
Since the refresh logic relies on ia_mtime, it thinks the file has not changed.

Also, even though qr_inode->size has the wrong value, qr_inode->buf.ia_size has the correct value of 5025.

> Also note that after I attached gdb, I could see that the cache was found
> invalid, readv was wound to the backend, and the md5sum matched that of
> master.

The cache-timeout is 1 sec by default. Attaching gdb ensured that the time since the inode was last refreshed exceeded 1 sec. Note that most likely a lookup had refreshed qr_inode->last_refresh and was immediately followed by the readv.
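
To make the failure mode concrete, here is a minimal self-contained sketch (my own illustration, not the actual quick-read code; the field names are borrowed from the gdb dump above) of the two checks the cache effectively relies on. Within the cache-timeout the cached data is served as is; after that, the cache survives as long as the mtime from a fresh iatt matches the mtime recorded at caching time:

#include <stdint.h>
#include <sys/time.h>

/* Simplified view of the cached state seen in the gdb dump above. */
struct cache_entry {
        struct timeval last_refresh;    /* when content was last cached        */
        int64_t        ia_mtime;        /* mtime recorded at caching time      */
        int64_t        ia_mtime_nsec;
        uint64_t       size;            /* number of cached bytes (4390 here)  */
};

/* Check 1: has the cache-timeout (1 sec by default) expired? */
static int
cache_expired (struct cache_entry *e, int timeout_sec)
{
        struct timeval now;

        gettimeofday (&now, NULL);
        return (now.tv_sec - e->last_refresh.tv_sec) >= timeout_sec;
}

/* Check 2: does a freshly fetched iatt carry the same mtime we cached? */
static int
mtime_unchanged (struct cache_entry *e, int64_t mtime, int64_t mtime_nsec)
{
        return (e->ia_mtime == mtime && e->ia_mtime_nsec == mtime_nsec);
}

/* If the sync preserves mtime after extending the file, mtime_unchanged()
 * keeps returning 1 on every refresh and the stale 4390-byte copy is served
 * indefinitely, exactly as observed on the old mount. */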

Comment 7 Raghavendra G 2018-03-29 14:05:23 UTC
What makes this more complicated is that qr_inode->ia_mtime and qr_inode->size are atomically updated from the same iatt:

void
qr_content_update (xlator_t *this, qr_inode_t *qr_inode, void *data,
                   struct iatt *buf)
{
        qr_private_t      *priv = NULL;
        qr_inode_table_t  *table = NULL;

        priv = this->private;
        table = &priv->table;

        LOCK (&table->lock);
        {
                __qr_inode_prune (table, qr_inode);

                qr_inode->data = data;
                qr_inode->size = buf->ia_size;

                qr_inode->ia_mtime = buf->ia_mtime;
                qr_inode->ia_mtime_nsec = buf->ia_mtime_nsec;

                qr_inode->buf = *buf;

                gettimeofday (&qr_inode->last_refresh, NULL);

                __qr_inode_register (table, qr_inode);
        }
        UNLOCK (&table->lock);

        qr_cache_prune (this);
}

The only other place qr_inode's mtime is refreshed is in __qr_content_refresh:

void
__qr_content_refresh (xlator_t *this, qr_inode_t *qr_inode, struct iatt *buf)
{
        qr_private_t      *priv = NULL;
        qr_inode_table_t  *table = NULL;
        qr_conf_t         *conf = NULL;

        priv = this->private;
        table = &priv->table;
        conf = &priv->conf;

        if (qr_size_fits (conf, buf) && qr_mtime_equal (qr_inode, buf)) {
                qr_inode->buf = *buf;

                gettimeofday (&qr_inode->last_refresh, NULL);

                __qr_inode_register (table, qr_inode);
        } else {
                __qr_inode_prune (table, qr_inode);
        }

        return;
}
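
For reference, the helpers the refresh path above depends on boil down to a size check and an mtime comparison. This is a paraphrase for readability, not a verbatim copy from quick-read.c:

/* Paraphrased, not verbatim: keep the cache only if the file still fits the
 * configured maximum size and its mtime has not moved since caching time. */
static gf_boolean_t
qr_size_fits (qr_conf_t *conf, struct iatt *buf)
{
        return (buf->ia_size <= conf->max_file_size);
}

static gf_boolean_t
qr_mtime_equal (qr_inode_t *qr_inode, struct iatt *buf)
{
        return (qr_inode->ia_mtime == buf->ia_mtime
                && qr_inode->ia_mtime_nsec == buf->ia_mtime_nsec);
}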

So, how can we get an iatt whose mtime is latest, but file size is not correct?

Comment 8 Raghavendra G 2018-03-29 14:45:44 UTC
<man rsync>

      -t, --times
              This  tells  rsync  to  transfer modification times along with the files and update them on the remote system.  Note that if this option is not used, the optimization that excludes files that
              have not been modified cannot be effective; in other words, a missing -t or -a will cause the next transfer to behave as if it used -I,  causing  all  files  to  be  updated  (though  rsync’s
              delta-transfer algorithm will make the update fairly efficient if the files haven’t actually changed, you’re much better off using -t).

</man rsync>

My guess is that rsync preserved the mtime from the master even though the file size changed. This prevented quick-read from detecting the change, since the mtime remained the same. So something like this might have happened:

* A lookup on mount m1 fetched a partially written file whose size was 4390 bytes, but whose mtime was already set to the master's mtime.
* Another write from rsync extended the file by 635 bytes, changing the file size to 5025 bytes, but the mtime on the backend fs remained the same.

Every time quick-read got an iatt for the file, it checked the mtime and it matched the mtime recorded when the file was cached. So quick-read thinks the file has not changed and keeps serving the partial 4390-byte file.
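
The effect is easy to reproduce outside gluster. Below is a hypothetical standalone program (not taken from rsync or geo-rep; error handling omitted for brevity) that extends a file and then restores the old mtime with futimens(), the way an rsync --times style transfer would. A stat afterwards shows a new size but the original mtime, which is exactly the signature that defeats an mtime-keyed cache:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int
main (void)
{
        struct stat     st_before, st_after;
        struct timespec times[2];
        int             fd = open ("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);

        write (fd, "first chunk", 11);          /* partial content lands first     */
        fstat (fd, &st_before);                 /* a cache would record this mtime */

        sleep (1);
        write (fd, " and a second chunk", 19);  /* file grows, mtime moves forward */

        /* restore atime/mtime to their pre-write values (rsync -t behaviour) */
        times[0] = st_before.st_atim;
        times[1] = st_before.st_mtim;
        futimens (fd, times);

        fstat (fd, &st_after);
        printf ("size %lld -> %lld, mtime %ld -> %ld (unchanged)\n",
                (long long) st_before.st_size, (long long) st_after.st_size,
                (long) st_before.st_mtim.tv_sec, (long) st_after.st_mtim.tv_sec);

        close (fd);
        return 0;
}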

But there are some glaring loopholes in this argument:
* One would expect rsync to update the mtime once the entire file is written. What I described can happen only if rsync updates the file's mtime after _every_ write to the file.
* Also, the file is pretty small. One would wonder what made rsync update the file in two writes instead of a single one. Note that two writes are essential for this hypothesis to be true.

@Kotresh,

Can you verify and update the bug with rsync's actual behaviour?

regards,
Raghavendra

Comment 10 Raghavendra G 2018-04-05 12:40:41 UTC
[2018-04-05 10:31:37.842689] I [trace.c:1716:trace_lookup] 0-slave-debug: 198013: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 path=/thread2/level02/level12/level22/level32/5ac5f625%%0NCC8DMKTG
[2018-04-05 10:31:37.843311] I [trace.c:612:trace_lookup_cbk] 0-slave-debug: 198013: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 (op_ret=0 *buf {gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 ino=9488749538296052200, mode=100770, nlink=1, uid=0, gid=42539, size=4266, blocks=9, atime=2018-04-05-10:10:45 mtime=2018-04-05-10:15:28 ctime=2018-04-05-10:24:29 atime_sec=1522923045, atime_nsec=360115000, mtime_sec=1522923328, mtime_nsec=526547000, ctime_sec=1522923869, ctime_nsec=54816113}, *postparent {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=N/A mtime=N/A ctime=N/A atime_sec=0, atime_nsec=0, mtime_sec=0, mtime_nsec=0, ctime_sec=0, ctime_nsec=0}
[2018-04-05 10:31:37.846794] I [trace.c:2209:trace_open] 0-slave-debug: 198014: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 path=/thread2/level02/level12/level22/level32/5ac5f625%%0NCC8DMKTG flags=0 fd=0x7f7e3484ec00
[2018-04-05 10:31:37.846977] I [trace.c:141:trace_open_cbk] 0-slave-debug: 198014: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 op_ret=0, op_errno=0, *fd=0x7f7e3484ec00
[2018-04-05 10:31:37.848980] I [trace.c:1716:trace_lookup] 0-slave-debug: 14653: gfid=9108439f-3d67-4b02-a128-5287266f676e path=/thread8/level07/level17/level27/level37
[2018-04-05 10:31:37.849204] I [trace.c:2209:trace_open] 0-slave-debug: 198015: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 path=/thread2/level02/level12/level22/level32/5ac5f625%%0NCC8DMKTG flags=1 fd=0x7f7e3484e160
[2018-04-05 10:31:37.849375] I [trace.c:141:trace_open_cbk] 0-slave-debug: 198015: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 op_ret=0, op_errno=0, *fd=0x7f7e3484e160

[2018-04-05 10:31:37.854173] I [trace.c:1684:trace_fxattrop] 0-slave-debug: 198018: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 fd=0x7f7e3484e160, flags=0
[2018-04-05 10:31:37.854394] I [trace.c:1353:trace_fxattrop_cbk] 0-slave-debug: 198018: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 op_ret=0, op_errno=0
[2018-04-05 10:31:37.856159] I [trace.c:1684:trace_fxattrop] 0-slave-debug: 198019: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 fd=0x7f7e3484e160, flags=0
[2018-04-05 10:31:37.856350] I [trace.c:1353:trace_fxattrop_cbk] 0-slave-debug: 198019: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 op_ret=0, op_errno=0

Note the mtime of the file before the writev: mtime_sec=1522923328, mtime_nsec=526547000

[2018-04-05 10:31:37.857212] I [trace.c:2310:trace_writev] 0-slave-debug: 198020: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 fd=0x7f7e3484e160, count=1,  offset=4200 flags=08001 write_size=4704
[2018-04-05 10:31:37.857470] I [trace.c:258:trace_writev_cbk] 0-slave-debug: 198020: (op_ret=4704, *prebuf = {gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 ino=9488749538296052200, mode=100770, nlink=1, uid=0, gid=42539, size=4266, blocks=9, atime=2018-04-05-10:10:45 mtime=2018-04-05-10:15:28 ctime=2018-04-05-10:31:37 atime_sec=1522923045, atime_nsec=360115000, mtime_sec=1522923328, mtime_nsec=526547000, ctime_sec=1522924297, ctime_nsec=855060264}, *postbuf = {gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 ino=9488749538296052200, mode=100770, nlink=1, uid=0, gid=42539, size=8904, blocks=18, atime=2018-04-05-10:10:45 mtime=2018-04-05-10:31:37 ctime=2018-04-05-10:31:37 atime_sec=1522923045, atime_nsec=360115000, mtime_sec=1522924297, mtime_nsec=856060283, ctime_sec=1522924297, ctime_nsec=856060283})

After the writev, the mtime changes to mtime_sec=1522924297, mtime_nsec=856060283. Note that the mtime in prebuf matches the mtime obtained from the lookup.


[2018-04-05 10:31:37.936760] I [trace.c:2076:trace_setattr] 0-slave-debug: 198039: gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 path=/thread2/level02/level12/level22/level32/5ac5f625%%0NCC8DMKTG ia_atime=Apr 05 10:31:37, ia_mtime=Apr 05 10:15:28
[2018-04-05 10:31:37.937120] I [trace.c:390:trace_setattr_cbk] 0-slave-debug: 198039: (op_ret=0, *prebuf = {gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 ino=9488749538296052200, mode=100770, nlink=1, uid=0, gid=42539, size=8904, blocks=18, atime=2018-04-05-10:10:45 mtime=2018-04-05-10:31:37 ctime=2018-04-05-10:31:37 atime_sec=1522923045, atime_nsec=360115000, mtime_sec=1522924297, mtime_nsec=858060321, ctime_sec=1522924297, ctime_nsec=909061302}, *postbuf = {gfid=e84eae56-7ffd-4b15-83ae-cf6e78c9b9e8 ino=9488749538296052200, mode=100770, nlink=1, uid=0, gid=42539, size=8904, blocks=18, atime=2018-04-05-10:31:37 mtime=2018-04-05-10:15:28 ctime=2018-04-05-10:31:37 atime_sec=1522924297, atime_nsec=935205000, mtime_sec=1522923328, mtime_nsec=526547000, ctime_sec=1522924297, ctime_nsec=936061821})

Now the setattr changes the mtime back to its value before the writev. Note that mtime-prebuf is less than mtime-postbuf. Also note that this file was already present and populated with data before this writev, with a size of 4266, and the write came in at an offset of 4200. This matches the hypothesis in comment #8.

IOW, if a client cached the file (in quick-read) before the write (unique: 198020) with a size of 4266, it won't notice the change to the file caused by the write, since the mtime of the file after the write matches the mtime before the write. All reads on the file would return data of size 4266 instead of the new size of 8904.

This bug affects io-cache too, as it also relies on mtime to maintain cache coherency with the data on the brick.

Comment 11 Raghavendra G 2018-04-05 13:09:13 UTC
> Note that mtime-prebuf is less than mtime-postbuf.

Correction: mtime-prebuf in setattr_cbk is greater than mtime-postbuf.

Comment 13 Raghavendra G 2018-04-14 02:50:34 UTC
It looks like features.cache-invalidation is off. If it had been turned on, upcall invalidations would have flushed the quick-read cache, since quick-read also uses the upcall framework, along with mtime, to identify changes to a file.
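
For illustration only (the function name here is hypothetical, not the actual quick-read notify code): an upcall cache-invalidation event would simply drop the cached content for the affected inode, so the next read is wound to the bricks regardless of whether the mtime changed:

/* Hypothetical sketch: prune the cached entry when an upcall invalidation
 * arrives for its inode, reusing the helper shown in comment #7. */
static void
qr_upcall_invalidate (xlator_t *this, qr_inode_t *qr_inode)
{
        qr_private_t     *priv  = this->private;
        qr_inode_table_t *table = &priv->table;

        LOCK (&table->lock);
        {
                __qr_inode_prune (table, qr_inode);
        }
        UNLOCK (&table->lock);
}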

@Rochelle/Kotresh,

Can you provide the gluster volume info of the slave volume, so that we can find out whether cache-invalidation is on?

Comment 15 Raghavendra G 2018-05-02 06:24:26 UTC
(In reply to Kotresh HR from comment #14)
> Raghavendra,
> 
> Pasting volinfo from sosreports:
> 
> Volume Name: slave
> Type: Distributed-Replicate
> Volume ID: ffb16f20-f595-4951-91a8-661bfee1eb63
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.70.42.188:/rhs/brick1/b11
> Brick2: 10.70.42.8:/rhs/brick1/b12
> Brick3: 10.70.42.136:/rhs/brick1/b13
> Brick4: 10.70.42.188:/rhs/brick2/b14
> Brick5: 10.70.42.8:/rhs/brick2/b15
> Brick6: 10.70.42.136:/rhs/brick2/b16
> Options Reconfigured:
> diagnostics.client-log-level: TRACE
> transport.address-family: inet
> nfs.disable: on
> cluster.brick-multiplex: enable
> 
> Thanks,
> Kotresh HR

@Poornima,

Is upcall enabled by default? I guess the answer is no, but want your confirmation.

If upcall is not enabled by default, enabling upcall notifications on the slave volume will fix this issue, as quick-read consumes upcall notifications to invalidate its cache when a file changes.

Comment 16 Raghavendra G 2018-05-02 06:29:11 UTC
From glusterd-volume-set.c:
        
        { .key         = "features.cache-invalidation",
          .voltype     = "features/upcall",
          .value       = "off",
          .op_version  = GD_OP_VERSION_3_7_0,
        },
        { .key         = "features.cache-invalidation-timeout",
          .voltype     = "features/upcall",
          .op_version  = GD_OP_VERSION_3_7_0,
        },

So, I think upcall is turned off by default.

Comment 17 Poornima G 2018-05-02 06:50:02 UTC
Upcall is not enabled by default. Yes, enabling it can fix this issue, but wouldn't it be a perf hit to enable it on the slave?

Comment 18 Raghavendra G 2018-05-03 04:22:14 UTC
(In reply to Poornima G from comment #17)
> Upcall is not enabled by default. Yes, enabling it can fix this issue, but
> wouldn't it be a perf hit to enable it on the slave?

Do we have any data/approximation on how much of a performance drag upcall can be? I think we need to start looking at this aspect, as going forward upcall/leases will play a more prominent role in maintaining cache coherency.

Comment 22 Raghavendra G 2018-05-24 07:41:48 UTC
@Rochelle,

Can you test with setting following options?
* features.cache-invalidation on

Comment 25 Poornima G 2018-06-12 06:02:54 UTC
(In reply to Raghavendra G from comment #18)
> (In reply to Poornima G from comment #17)
> > Upcall is not enabled by default. Yes, enabling it can fix this issue, but
> > wouldn't it be a perf hit to enable it on the slave?
> 
> Do we have any data/approximation on how much of a performance drag upcall
> can be? I think we need to start looking at this aspect, as going forward
> upcall/leases will play a more prominent role in maintaining cache coherency.

Sorry, I missed replying to this. Especially in the geo-rep use case, enabling upcall on the slave can generate a lot of upcalls on the directories as and when new files get created. I think there has to be perf testing before suggesting this as a solution.

Comment 26 Raghavendra G 2018-06-15 11:44:58 UTC
Kotresh had a suggestion to use ctime to identify staleness of cached data.

Following is my response to his suggestion:

<snip>

Going by the stat information, mtime is guaranteed to change on file modification; in that sense quick-read is POSIX compliant. There is nothing wrong with quick-read relying on mtime: if an application had been running on the slave mount during the sync from the master, it could have run into the same issue. I am not sure we allow that (accessing the slave volume during sync from the master). If yes, it will be a problem for geo-rep :). Since quick-read is an internal consumer, we can modify it to suit the idiosyncrasy of mtime remaining constant across file changes, but external applications won't accept that logic.

To summarize, this problem will appear only if file mtimes are explicitly modified (as geo-rep does through rsync). If there are too many such use cases, we can proceed with the ctime-based solution.

As to the ctime suggestion, some questions:
    * Wouldn't rsync try to keep ctime in sync with the master too? I think not, as I am not aware of any system call that allows changing ctime (utime() only changes atime and mtime).
    * Using ctime to track file changes solves the problem, but it can result in false positives too (for example, changing ownership or access permissions without changing file content also changes ctime). We can selectively turn this (using ctime instead of mtime) on for geo-rep slave volumes. But again, if the slave is not consumed while the sync is happening, I see no point in turning on quick-read until the volume is consumed.

I agree that ctime can be a better solution (at least for geo-rep slaves and for cases where mtime is explicitly modified), since it changes with every file modification and the ctime of a file cannot be explicitly set, while mtime and atime can be.

</snip>

An upstream patch has been submitted to do this - https://review.gluster.org/#/c/20286/.

This patch should be followed by changes on the geo-replication slave volume to have this option set for quick-read.
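
To illustrate the direction (a sketch of the idea only, not the code in the patch linked above): the refresh check from comment #7 would additionally compare ctime, which a post-write setattr cannot put back:

/* Illustration only: with consistent ctime available, treat any ctime
 * movement as a content change, even if rsync restored the old mtime. */
static gf_boolean_t
qr_ctime_equal (qr_inode_t *qr_inode, struct iatt *buf)
{
        return (qr_inode->buf.ia_ctime == buf->ia_ctime
                && qr_inode->buf.ia_ctime_nsec == buf->ia_ctime_nsec);
}

/* __qr_content_refresh() would then keep the cache only when
 *   qr_size_fits (conf, buf) && qr_mtime_equal (qr_inode, buf)
 *       && qr_ctime_equal (qr_inode, buf)
 * gated by whatever option name the actual patch introduces. */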

Comment 44 Sahina Bose 2019-11-25 07:34:55 UTC
Closing (as per comment 43) as the fix is provided as part of RHGS 3.5.0

