Bug 1634664

Summary: Inconsistent quorum checks during open and fd based operations
Product: [Community] GlusterFS
Component: replicate
Version: mainline
Status: CLOSED NEXTRELEASE
Severity: unspecified
Priority: unspecified
Reporter: Raghavendra G <rgowdapp>
Assignee: Karthik U S <ksubrahm>
CC: bugs, pasik, ravishankar
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-03-08 04:03:51 UTC

Description Raghavendra G 2018-10-01 10:07:08 UTC
Description of problem:
Quorum enforcement by afr is inconsistent between open and fd-based fops: afr imposes quorum on fd-based operations, but not on open itself. As a result, operations on a valid fd can fail due to lack of quorum even though the open that produced the fd succeeded.


Comment 1 Raghavendra G 2018-10-01 11:27:43 UTC
I seem to be running into fstat failures on a 1x3 replicated volume while running pgbench on it. I see corruption warnings even when all perf xlators are turned off. 

pgbench has the following access pattern (a minimal reproducer sketch follows the list):

1. Application 1 creates a tmp file, writes to it, and renames it to "file".
2. Application 2 repeatedly opens "file", does reads and fstats, and closes the fd.
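
For reference, a minimal stand-alone sketch of that pattern (my own illustration, not pgbench code; the writer/reader roles and paths are assumptions). Run one copy as the writer and one as the reader against the same path on the FUSE mount:

/* Illustrative reproducer for the rename-vs-open/fstat pattern above.
 * Writer: write a temp file and rename() it over "file" in a loop.
 * Reader: open "file", fstat() and read() it, close the fd, repeat.
 * When the bug hits, the reader's fstat() fails with ENOTCONN even
 * though the preceding open() returned a valid fd.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s writer|reader <path>\n", argv[0]);
        return 1;
    }

    const char *path = argv[2];
    char tmp[4096];
    char buf[4096];
    struct stat st;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    for (;;) {
        if (strcmp(argv[1], "writer") == 0) {
            /* Application 1: temp file + rename over "file" */
            int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) {
                perror("open tmp");
                return 1;
            }
            if (write(fd, "stats\n", 6) < 0)
                perror("write");
            close(fd);
            if (rename(tmp, path) < 0)
                perror("rename");
        } else {
            /* Application 2: open, fstat, read, close */
            int fd = open(path, O_RDONLY);
            if (fd < 0) {
                perror("open");
                continue;
            }
            if (fstat(fd, &st) < 0)
                perror("fstat"); /* ENOTCONN here is the reported failure */
            if (read(fd, buf, sizeof(buf)) < 0)
                perror("read");
            close(fd);
        }
    }
}

For example: cc -o repro repro.c; ./repro writer /mnt/db/file & ./repro reader /mnt/db/file (mount point and file name are placeholders).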

From the strace of application 2, I see fstat on an fd failing with ENOTCONN. Delving deeper into the client logs:

[2018-07-12 12:55:11.241793] T [fuse-bridge.c:2015:fuse_rename_resume] 0-glusterfs-fuse: 14245087: RENAME `<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.tmp ()' -> `<gfid:ff5d66b2-7fde-45bd-b873
-d688ae604990>/pg_stat_tmp/pgstat.stat ()'
[2018-07-12 12:55:11.244548] T [fuse-bridge.c:2347:fuse_open_resume] 0-glusterfs-fuse: 14245088: OPEN <gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat
[2018-07-12 12:55:11.246429] T [MSGID: 0] [afr-lk-common.c:248:afr_unlock_common_cbk] 8-db-replicate-0: All internal locks unlocked
[2018-07-12 12:55:11.247258] D [MSGID: 114031] [client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-0: remote operation failed. Path: <gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat (02e0f34b-f9a5-46ff-83fe-abca95955e8b) [No such file or directory]
[2018-07-12 12:55:11.247414] D [MSGID: 114031] [client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-1: remote operation failed. Path: <gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat (02e0f34b-f9a5-46ff-83fe-abca95955e8b) [No such file or directory]
[2018-07-12 12:55:11.247597] T [fuse-bridge.c:1111:fuse_fd_cbk] 0-glusterfs-fuse: 14245088: OPEN() <gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat => 0x7fb32bd8ed98
[2018-07-12 12:55:11.247759] T [fuse-bridge.c:1945:fuse_rename_cbk] 0-glusterfs-fuse: 14245087: <gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.tmp -> <gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat => 0 (buf->ia_ino=10066098376112020953)

Note that open failed on two subvolumes of afr, but the open fop itself succeeded, as can be seen from the fuse_fd_cbk output.

Later, when an fstat comes in on the same fd 0x7fb32bd8ed98:

[2018-07-12 12:55:11.247963] T [fuse-bridge.c:955:fuse_getattr_resume] 0-glusterfs-fuse: 14245089: FGETATTR 140407508526648 (<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>/0x7fb32bd8ed98)
[2018-07-12 12:55:11.248224] D [MSGID: 0] [afr-open.c:332:afr_fix_open] 8-db-replicate-0: need open count: 2
[2018-07-12 12:55:11.248253] D [MSGID: 0] [afr-open.c:352:afr_fix_open] 8-db-replicate-0: opening fd for file <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0
[2018-07-12 12:55:11.248424] D [MSGID: 0] [afr-open.c:352:afr_fix_open] 8-db-replicate-0: opening fd for file <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1
[2018-07-12 12:55:11.248562] D [MSGID: 0] [afr-read-txn.c:252:afr_read_txn] 8-db-replicate-0: 02e0f34b-f9a5-46ff-83fe-abca95955e8b: generation now vs cached: 3, 3
[2018-07-12 12:55:11.248742] D [MSGID: 0] [afr-lk-common.c:694:afr_lock_blocking] 8-db-replicate-0: we're done locking
[2018-07-12 12:55:11.249756] T [MSGID: 0] [afr-lk-common.c:248:afr_unlock_common_cbk] 8-db-replicate-0: All internal locks unlocked
[2018-07-12 12:55:11.250458] D [MSGID: 114031] [client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-1: remote operation failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> (02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.250513] D [MSGID: 108009] [afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1 [Stale file handle]
[2018-07-12 12:55:11.251015] D [MSGID: 114031] [client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-0: remote operation failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> (02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.251066] D [MSGID: 108009] [afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0 [Stale file handle]
[2018-07-12 12:55:11.251462] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 8-db-client-0: remote operation failed [No such file or directory]
[2018-07-12 12:55:11.253450] D [MSGID: 0] [dht-inode-read.c:182:dht_file_attr_cbk] 8-db-dht: subvolume db-replicate-0 returned -1 [Transport endpoint is not connected]
[2018-07-12 12:55:11.253574] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 14245089: FSTAT() <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> => -1 (Transport endpoint is not connected)


[2018-07-12 12:55:11.253739] T [fuse-bridge.c:955:fuse_getattr_resume] 0-glusterfs-fuse: 14245090: FGETATTR 140407508526648 ((null)/0x7fb32bd8ed98)
[2018-07-12 12:55:11.253959] D [MSGID: 0] [afr-open.c:332:afr_fix_open] 8-db-replicate-0: need open count: 2
[2018-07-12 12:55:11.253984] D [MSGID: 0] [afr-open.c:352:afr_fix_open] 8-db-replicate-0: opening fd for file <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0
[2018-07-12 12:55:11.254178] D [MSGID: 0] [afr-open.c:352:afr_fix_open] 8-db-replicate-0: opening fd for file <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1
[2018-07-12 12:55:11.254300] D [MSGID: 0] [afr-read-txn.c:252:afr_read_txn] 8-db-replicate-0: 02e0f34b-f9a5-46ff-83fe-abca95955e8b: generation now vs cached: 3, 3
[2018-07-12 12:55:11.254544] T [fuse-bridge.c:2525:fuse_write_resume] 0-glusterfs-fuse: 14245092: WRITE (0x7fb31ed28a98, size=8192, offset=138829824)
[2018-07-12 12:55:11.254696] T [MSGID: 0] [afr-lk-common.c:60:afr_set_lk_owner] 8-db-replicate-0: Setting lk-owner=140407222969688
[2018-07-12 12:55:11.256316] D [MSGID: 114031] [client-rpc-fops_v2.c:1350:client4_0_finodelk_cbk] 8-db-client-2: remote operation failed [Resource temporarily unavailable]
[2018-07-12 12:55:11.256506] D [MSGID: 114031] [client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-1: remote operation failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> (02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.256584] D [MSGID: 108009] [afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1 [Stale file handle]
[2018-07-12 12:55:11.256685] D [MSGID: 114031] [client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-0: remote operation failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> (02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.256737] D [MSGID: 108009] [afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0 [Stale file handle]
[2018-07-12 12:55:11.256781] D [MSGID: 114031] [client-rpc-fops_v2.c:1350:client4_0_finodelk_cbk] 8-db-client-1: remote operation failed [Resource temporarily unavailable]
[2018-07-12 12:55:11.257089] W [MSGID: 114031] [client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 8-db-client-0: remote operation failed [No such file or directory]
[2018-07-12 12:55:11.257190] D [MSGID: 114031] [client-rpc-fops_v2.c:1350:client4_0_finodelk_cbk] 8-db-client-0: remote operation failed [Resource temporarily unavailable]
[2018-07-12 12:55:11.257305] T [MSGID: 0] [afr-lk-common.c:1006:afr_nonblocking_inodelk_cbk] 8-db-replicate-0: Last inode locking reply received
[2018-07-12 12:55:11.257326] T [MSGID: 0] [afr-lk-common.c:1020:afr_nonblocking_inodelk_cbk] 8-db-replicate-0: 0 servers locked. Trying again with blocking calls
[2018-07-12 12:55:11.257381] T [MSGID: 0] [afr-lk-common.c:321:afr_unlock_inodelk] 8-db-replicate-0: No internal locks unlocked
[2018-07-12 12:55:11.257400] D [MSGID: 0] [afr-transaction.c:1631:afr_post_nonblocking_inodelk_cbk] 8-db-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2018-07-12 12:55:11.259009] D [MSGID: 0] [dht-inode-read.c:182:dht_file_attr_cbk] 8-db-dht: subvolume db-replicate-0 returned -1 [Transport endpoint is not connected]
[2018-07-12 12:55:11.259079] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 14245090: FSTAT() ERR => -1 (Transport endpoint is not connected)

afr tries to "fix" the fd by opening it on client-0 and client-1, which fails because the rename had overwritten the gfid. afr then gives up and sends fstat to client-0, which fails, and the same error is propagated back to the application. Some questions:

* Why didn't afr send the fstat to client-2, which has a valid fd opened?
* Does afr require quorum on fstat? If yes, why did the open succeed even when there was no quorum (opens on client-0 and client-1 had already failed by the time the application's open completed)?
* How do we make sure that either the open from the kernel fails, or, if it succeeds, all future operations on that fd succeed?

Comment 2 Raghavendra G 2018-10-01 11:28:24 UTC
I missed something earlier. Note that fstat was sent to db-client-2 too (which has a valid fd opened on the brick). It looks like fstat succeeded on db-client-2 (note the absence of error logs from client4_0_fstat_cbk between rpc_clnt_reply_init and dht_file_attr_cbk; if there had been an error from the brick, it would have been logged there). However, afr seems to have failed the fstat with ENOTCONN. Can you confirm this?

Instance 1:

[2018-07-12 12:53:48.722722] T [rpc-clnt.c:674:rpc_clnt_reply_init] 8-db-client-2: received rpc message (RPC XID: 0x1926eb Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 25) from rpc-transport (db-client-2)
[2018-07-12 12:53:48.722829] D [MSGID: 0] [dht-inode-read.c:182:dht_file_attr_cbk] 8-db-dht: subvolume db-replicate-0 returned -1 [Transport endpoint is not connected]
[2018-07-12 12:53:48.722901] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 14199134: FSTAT() <gfid:797d2f9b-e7d7-47b9-801d-b4c3c134ad1f> => -1 (Transport endpoint is not connected)

Instance 2:

[2018-07-12 12:53:48.726265] T [rpc-clnt.c:1752:rpc_clnt_submit] 8-rpc-clnt: submitted request (XID: 0x1926ee Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 25) to rpc-transport (db-client-2)
[2018-07-12 12:53:48.727696] T [rpc-clnt.c:674:rpc_clnt_reply_init] 8-db-client-2: received rpc message (RPC XID: 0x1926ee Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 25) from rpc-transport (db-client-2)
[2018-07-12 12:53:48.727814] D [MSGID: 0] [dht-inode-read.c:182:dht_file_attr_cbk] 8-db-dht: subvolume db-replicate-0 returned -1 [Transport endpoint is not connected]
[2018-07-12 12:53:48.727886] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 14199138: FSTAT() ERR => -1 (Transport endpoint is not connected)

Summary:
* The open fop was unwound as successful to the application, even though opens failed on two bricks.
* fstat succeeded on one brick (and failed on two), but the fop was failed with ENOTCONN.

Comment 3 Raghavendra G 2018-10-01 11:29:04 UTC
Update from Ravi:

afr_open() does not have quorum checks. If the open succeeds on even 1 brick out of 3, it unwinds success to DHT.
afr_fstat() (and read transactions in general) does not require quorum. AFR first picks one of the 'readable' subvolumes (i.e. good copies) and winds the fstat to it. If the cbk fails, it winds the fop to the next readable subvolume, and so on, until it has tried all readable subvolumes. If it fails on all attempted subvolumes, it fails the fop; the error number will be the errno from the fstat_cbk of the last tried subvolume.
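
To make that routing concrete, here is a rough pseudo-C sketch of the retry loop described above (editorial illustration only; the struct and helper names are hypothetical, and the real logic lives in afr_read_txn() in afr-read-txn.c, seen in the logs above):

#include <errno.h>

/* Hypothetical model of a replica set: readable[i] marks the good copies,
 * fstat_on() winds the fop to one subvolume and returns 0 or -errno. */
struct replica_set {
    int child_count;
    int readable[16];
    int (*fstat_on)(int subvol);
};

static int
read_txn_fstat(struct replica_set *rs)
{
    int last_err = -ENOTCONN; /* nothing readable at all */
    int i;

    for (i = 0; i < rs->child_count; i++) {
        if (!rs->readable[i])
            continue;                 /* skip non-readable copies */
        int ret = rs->fstat_on(i);
        if (ret == 0)
            return 0;                 /* first success wins; no quorum check */
        last_err = ret;               /* remember errno of this failed attempt */
    }
    return last_err;                  /* all readable subvols failed */
}

This is why the application sees the errno of the last attempted subvolume rather than a quorum-style error.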

Comment 4 Worker Ant 2019-02-22 06:51:33 UTC
REVIEW: https://review.gluster.org/22251 (cluster/afr: Add quorum checks to open & opendir fops) posted (#1) for review on master by Karthik U S
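
For context, a rough sketch of the idea behind such a change, not the merged patch itself (see the gerrit link above for the real diff; the struct and helper names below are hypothetical): once all child open callbacks have been collected, unwind success only if enough bricks opened the fd to satisfy quorum, and otherwise fail the open so the application never receives an fd it cannot use.

#include <errno.h>

/* Hypothetical per-open bookkeeping: one counter per replica set. */
struct open_ctx {
    int child_count;   /* number of replicas in the set        */
    int success_count; /* children on which the open succeeded */
};

/* Majority rule for an odd replica count: e.g. 2 out of 3 bricks. */
static int
open_quorum_met(const struct open_ctx *ctx)
{
    return ctx->success_count > ctx->child_count / 2;
}

/* Called once every child callback has been collected. */
static int
open_unwind_result(const struct open_ctx *ctx)
{
    if (open_quorum_met(ctx))
        return 0; /* quorum met: unwind success to DHT */
    /* Quorum not met: fail the fop instead of handing back a half-open fd.
     * The errno choice is illustrative; ENOTCONN matches what the logs
     * above show being propagated on quorum-less fds. */
    return -ENOTCONN;
}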

Comment 5 Worker Ant 2019-03-08 04:03:51 UTC
REVIEW: https://review.gluster.org/22251 (cluster/afr: Add quorum checks to open & opendir fops) merged (#6) on master by Ravishankar N