Bug 1636218 - [SNAPSHOT]: with brick multiplexing, snapshot restore will make glusterd send wrong volfile
Summary: [SNAPSHOT]: with brick multiplexing, snapshot restore will make glusterd send wrong volfile
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: snapshot
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1635050
Blocks: 1636162 1636291 1647165
 
Reported: 2018-10-04 18:02 UTC by Raghavendra Bhat
Modified: 2018-11-29 15:25 UTC
CC List: 3 users

Fixed In Version: glusterfs-4.1.6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1635050
Clones: 1647165
Environment:
Last Closed: 2018-11-20 09:20:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Raghavendra Bhat 2018-10-04 18:02:48 UTC
+++ This bug was initially created as a clone of Bug #1635050 +++

Description of problem:

With brick multiplexing enabled, a snapshot restore results in a snapshot brick getting the client volume file for the corresponding snapshot volume instead of the correct brick volume file.


[2018-09-28 23:34:10.405953] I [glusterfsd-mgmt.c:279:glusterfs_handle_terminate] 0-glusterfs: detaching not-only child /run/gluster/snaps/1534bc2e092341d7bb7d940e9728c3ca/brick1/mirror
[2018-09-28 23:34:10.406009] I [glusterfsd-mgmt.c:279:glusterfs_handle_terminate] 0-glusterfs: detaching not-only child /run/gluster/snaps/1534bc2e092341d7bb7d940e9728c3ca/brick2/mirror

The above two lines indicate that this is a brick process. Because of the restore, the original snapshot volume's bricks are stopped, which leads to the brick detach logged above.


[2018-09-28 23:34:10.407276] E [rpcsvc.c:1542:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
[2018-09-28 23:34:10.407276] E [rpcsvc.c:1542:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
[2018-09-28 23:34:10.407299] E [rpcsvc.c:1670:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
[2018-09-28 23:34:10.407310] E [rpcsvc.c:1670:rpcsvc_program_unregister] 0-rpc-service: portmap unregistration of program failed
[2018-09-28 23:34:10.407319] E [rpcsvc.c:1720:rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: GlusterFS Changelog, Num: 1885957735, Ver: 1, Port: 0
[2018-09-28 23:34:10.407323] E [rpcsvc.c:1720:rpcsvc_program_unregister] 0-rpc-service: Program unregistration failed: GlusterFS Changelog, Num: 1885957735, Ver: 1, Port: 0
[2018-09-28 23:34:10.409365] I [barrier.c:665:fini] 0-1534bc2e092341d7bb7d940e9728c3ca-barrier: Disabling barriering and dequeuing all the queued fops
[2018-09-28 23:34:10.409382] I [barrier.c:665:fini] 0-1534bc2e092341d7bb7d940e9728c3ca-barrier: Disabling barriering and dequeuing all the queued fops
[2018-09-28 23:34:10.413063] I [io-stats.c:3937:fini] 0-1534bc2e092341d7bb7d940e9728c3ca-io-stats: io-stats translator unloaded
[2018-09-28 23:34:10.413151] I [rpcsvc.c:2054:rpcsvc_spawn_threads] 0-rpc-service: terminating 1 threads for program 'GlusterFS 4.x v1'
[2018-09-28 23:34:10.413178] I [rpcsvc.c:2054:rpcsvc_spawn_threads] 0-rpc-service: terminating 1 threads for program 'GlusterFS 3.3'
[2018-09-28 23:34:10.413182] I [rpcsvc.c:1993:rpcsvc_request_handler] 0-rpc-service: program 'GlusterFS 4.x v1' thread terminated; total count:3
[2018-09-28 23:34:10.413195] I [rpcsvc.c:1993:rpcsvc_request_handler] 0-rpc-service: program 'GlusterFS 3.3' thread terminated; total count:3
[2018-09-28 23:34:10.413700] I [io-stats.c:3937:fini] 0-1534bc2e092341d7bb7d940e9728c3ca-io-stats: io-stats translator unloaded
[2018-09-28 23:34:10.413775] I [rpcsvc.c:2054:rpcsvc_spawn_threads] 0-rpc-service: terminating 1 threads for program 'GlusterFS 4.x v1'
[2018-09-28 23:34:10.413786] I [rpcsvc.c:2054:rpcsvc_spawn_threads] 0-rpc-service: terminating 1 threads for program 'GlusterFS 3.3'
[2018-09-28 23:34:10.413805] I [rpcsvc.c:1993:rpcsvc_request_handler] 0-rpc-service: program 'GlusterFS 4.x v1' thread terminated; total count:2
[2018-09-28 23:34:10.413807] I [rpcsvc.c:1993:rpcsvc_request_handler] 0-rpc-service: program 'GlusterFS 3.3' thread terminated; total count:2
[2018-09-28 23:34:10.923188] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2018-09-28 23:34:10.923350] I [MSGID: 101191] [event-epoll.c:653:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 3
[2018-09-28 23:34:11.031917] I [MSGID: 101191] [event-epoll.c:653:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 4
[2018-09-28 23:34:57.645690] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/protocol/client.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.646288] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/cluster/replicate.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.646714] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/features/read-only.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.646929] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/write-behind.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.647103] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/read-ahead.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.647274] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/readdir-ahead.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.647470] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/io-cache.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.647817] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/open-behind.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.648021] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/debug/io-stats.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.649370] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/cluster/replicate.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.649654] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/features/read-only.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.649858] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/write-behind.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.650027] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/read-ahead.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.650192] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/readdir-ahead.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.650381] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/io-cache.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.650716] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/performance/open-behind.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.650914] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /usr/local/lib/glusterfs/6dev/xlator/debug/io-stats.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-09-28 23:34:57.652878] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-1: option 'transport.address-family' is not recognized
[2018-09-28 23:34:57.652911] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-1: option 'transport.tcp-user-timeout' is not recognized
[2018-09-28 23:34:57.652919] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-1: option 'transport.socket.keepalive-time' is not recognized
[2018-09-28 23:34:57.652927] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-1: option 'transport.socket.keepalive-interval' is not recognized
[2018-09-28 23:34:57.652934] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-1: option 'transport.socket.keepalive-count' is not recognized
[2018-09-28 23:34:57.652946] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-0: option 'transport.address-family' is not recognized
[2018-09-28 23:34:57.652954] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-0: option 'transport.tcp-user-timeout' is not recognized
[2018-09-28 23:34:57.652962] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-0: option 'transport.socket.keepalive-time' is not recognized
[2018-09-28 23:34:57.652969] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-0: option 'transport.socket.keepalive-interval' is not recognized
[2018-09-28 23:34:57.652976] W [MSGID: 101174] [graph.c:397:_log_if_unknown_option] 7-mirror-client-0: option 'transport.socket.keepalive-count' is not recognized
[2018-09-28 23:34:57.683329] I [MSGID: 114020] [client.c:2354:notify] 7-mirror-client-1: parent translators are ready, attempting connect on transport
Final graph:
+------------------------------------------------------------------------------+
  1: volume mirror-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host workspace
  5:     option remote-subvolume /run/gluster/snaps/31717a7458a543dca4bffc8e6b1017cc/brick1/mirror
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option username 8b527b11-a2e6-45e3-a12a-99b46dc636ab
  9:     option password 01f72c35-9b3a-47b3-852b-95bdb481e66a
 10:     option transport.tcp-user-timeout 0
 11:     option transport.socket.keepalive-time 20
 12:     option transport.socket.keepalive-interval 2
 13:     option transport.socket.keepalive-count 9
 14:     option send-gids true
 15: end-volume
 16:  
 17: volume mirror-client-1
 18:     type protocol/client
 19:     option ping-timeout 42
 20:     option remote-host workspace
 21:     option remote-subvolume /run/gluster/snaps/31717a7458a543dca4bffc8e6b1017cc/brick2/mirror
 22:     option transport-type socket
 23:     option transport.address-family inet
 24:     option username 8b527b11-a2e6-45e3-a12a-99b46dc636ab
 25:     option password 01f72c35-9b3a-47b3-852b-95bdb481e66a
 26:     option transport.tcp-user-timeout 0
 27:     option transport.socket.keepalive-time 20
 28:     option transport.socket.keepalive-interval 2
 29:     option transport.socket.keepalive-count 9
 30:     option send-gids true
 31: end-volume
 32:  
 33: volume 31717a7458a543dca4bffc8e6b1017cc-replicate-0
 34:     type cluster/replicate
 35:     option afr-pending-xattr mirror-client-0,mirror-client-1
 36:     option use-compound-fops off
 37:     subvolumes mirror-client-0 mirror-client-1
 38: end-volume
 39:
 40: volume 31717a7458a543dca4bffc8e6b1017cc-dht
 41:     type cluster/distribute
 42:     option lock-migration off
 43:     option force-migration off
 44:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-replicate-0
 45: end-volume
 46:  
 47: volume 31717a7458a543dca4bffc8e6b1017cc-read-only
 48:     type features/read-only
 49:     option read-only on
 50:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-dht
 51: end-volume
 52:  
 53: volume 31717a7458a543dca4bffc8e6b1017cc-write-behind
 54:     type performance/write-behind
 55:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-read-only
 56: end-volume
 57:  
 58: volume 31717a7458a543dca4bffc8e6b1017cc-read-ahead
 59:     type performance/read-ahead
 60:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-write-behind
 61: end-volume
 62:  
 63: volume 31717a7458a543dca4bffc8e6b1017cc-readdir-ahead
 64:     type performance/readdir-ahead
 65:     option parallel-readdir off
 66:     option rda-request-size 131072
 67:     option rda-cache-limit 10MB
 68:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-read-ahead
 69: end-volume
 70:  
 71: volume 31717a7458a543dca4bffc8e6b1017cc-io-cache
 72:     type performance/io-cache
 73:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-readdir-ahead
 74: end-volume
 75:  
 76: volume 31717a7458a543dca4bffc8e6b1017cc-quick-read
 77:     type performance/quick-read
 78:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-io-cache
 79: end-volume
 80:  
 81: volume 31717a7458a543dca4bffc8e6b1017cc-open-behind
 82:     type performance/open-behind
 83:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-quick-read
 84: end-volume
 85:  
 86: volume 31717a7458a543dca4bffc8e6b1017cc-md-cache
 87:     type performance/md-cache
 88:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-open-behind
 89: end-volume
 90:
 91: volume 31717a7458a543dca4bffc8e6b1017cc
 92:     type debug/io-stats
 93:     option log-level INFO
 94:     option latency-measurement off
 95:     option count-fop-hits off
 96:     subvolumes 31717a7458a543dca4bffc8e6b1017cc-md-cache
 97: end-volume
 98:  
+------------------------------------------------------------------------------+


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. With brick multiplexing enabled, create a couple of snapshots of a gluster volume
2. Then do a snapshot restore
3. Check the log file of the multiplexed snapshot brick

Actual results:
Client volfile is sent to the snapshot brick.

Expected results:
The actual brick volfile should be sent.

Additional info:
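For illustration only (this is not glusterd code; the function, variable names, and path strings below are hypothetical placeholders), a minimal C sketch of the behaviour reported here, as explained by the fix in comment 2: glusterd generated the snapshot brick volfile path without the snapshot volume name and used that path format to tell a brick request apart from a mount request, while the multiplexed snapshot brick requests its volfile with the full path that includes the snapshot volume name, so the request is answered with the snapshot volume's client volfile instead of the brick volfile.

#include <stdio.h>
#include <string.h>

/* Illustration only -- not the glusterd implementation. The path strings
 * are placeholders for the two formats discussed in comment 2:
 *   stored/short form: /snaps/<snap name>/<brick volfile>
 *   requested form:    /snaps/<snap name>/<snap volume name>/<brick volfile>
 */

/* Key glusterd generated for the snapshot brick volfile (short form). */
static const char *stored_brick_key = "/snaps/snap1/brick1.vol";

/* Key the multiplexed snapshot brick sends in its GETSPEC request. */
static const char *requested_key =
    "/snaps/snap1/31717a7458a543dca4bffc8e6b1017cc/brick1.vol";

/* The short form was a workaround to distinguish a brick request from a
 * mount request, so a request that does not match the stored brick key is
 * treated as a mount and answered with the client volfile. */
static const char *
resolve_volfile(const char *key)
{
    if (strcmp(key, stored_brick_key) == 0)
        return "brick volfile";
    return "client volfile";
}

int
main(void)
{
    /* Prints "client volfile" -- the wrong volfile for a brick. */
    printf("GETSPEC %s -> %s\n", requested_key,
           resolve_volfile(requested_key));
    return 0;
}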

--- Additional comment from Worker Ant on 2018-10-01 18:03:32 EDT ---

REVIEW: https://review.gluster.org/21314 (mgmt/glusterd: use proper path to the volfile) posted (#1) for review on master by Raghavendra Bhat

--- Additional comment from Mohammed Rafi KC on 2018-10-03 07:04:01 EDT ---

I tried to reproduce the issue, but I wasn't able to hit it with the given steps to reproduce. Am I missing anything here?

--- Additional comment from Raghavendra Bhat on 2018-10-03 08:47:30 EDT ---


Rafi, once you do the snapshot restore, please check the log file of the multiplexed snapshot brick. It prints the client volfile, which IMO should not happen. In fact, due to another memory corruption bug (the corruption is in the client stack), the brick process crashed after it got the client volfile.

--- Additional comment from Worker Ant on 2018-10-04 01:03:07 EDT ---

REVIEW: https://review.gluster.org/21314 (mgmt/glusterd: use proper path to the volfile) posted (#4) for review on master by Atin Mukherjee

Comment 1 Worker Ant 2018-10-04 18:33:01 UTC
REVIEW: https://review.gluster.org/21348 (mgmt/glusterd: use proper path to the volfile) posted (#1) for review on release-4.1 by Raghavendra Bhat

Comment 2 Worker Ant 2018-10-05 14:26:07 UTC
COMMIT: https://review.gluster.org/21348 committed in release-4.1 by "Raghavendra Bhat" <raghavendra@redhat.com> with a commit message- mgmt/glusterd: use proper path to the volfile

Until now, glusterd generated the volfile path for the snapshot
volume's bricks like this:

/snaps/<snap name>/<brick volfile>

But in reality, the path to the brick volfile for a snapshot volume is

/snaps/<snap name>/<snap volume name>/<brick volfile>

The shorter form above was a workaround used to distinguish between a
mount command used to mount the snapshot volume and a brick of the
snapshot volume, so that glusterd could return the proper volfile based
on what was actually being requested (the client volfile for the former
and the brick volfile for the latter). But this was causing problems for
snapshot restore when brick multiplexing is enabled, because with brick
multiplexing the brick process tries to find the volfile and sends a
GETSPEC rpc call to glusterd using the 2nd style of path, i.e.

/snaps/<snap name>/<snap volume name>/<brick volfile>

So, when the snapshot brick (which is multiplexed) sends a GETSPEC rpc
request to glusterd for obtaining the brick volume file, glusterd was
returning the client volume file of the snapshot volume instead of the
brick volume file.

Change-Id: I28b2dfa5d9b379fe943db92c2fdfea879a6a594e
fixes: bz#1636218
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
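
For reference, a minimal C sketch contrasting the two path formats described in the commit message above (illustration only; the helper name and the snapshot volume and volfile names are made-up placeholders, not the actual glusterd change):

#include <stdio.h>

/* Illustration only -- contrasts the old workaround path with the real
 * layout that the fix makes glusterd use. */
static void
build_snap_brick_volfile_path(char *buf, size_t len, const char *snap_name,
                              const char *snap_volname,
                              const char *brick_volfile, int fixed)
{
    if (fixed)
        /* real layout: /snaps/<snap name>/<snap volume name>/<brick volfile> */
        snprintf(buf, len, "/snaps/%s/%s/%s", snap_name, snap_volname,
                 brick_volfile);
    else
        /* old workaround: /snaps/<snap name>/<brick volfile> */
        snprintf(buf, len, "/snaps/%s/%s", snap_name, brick_volfile);
}

int
main(void)
{
    char path[4096];

    build_snap_brick_volfile_path(path, sizeof(path), "snap1",
                                  "31717a7458a543dca4bffc8e6b1017cc",
                                  "brick1.vol", 0);
    printf("old:   %s\n", path);

    build_snap_brick_volfile_path(path, sizeof(path), "snap1",
                                  "31717a7458a543dca4bffc8e6b1017cc",
                                  "brick1.vol", 1);
    printf("fixed: %s\n", path);

    return 0;
}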

Comment 3 Shyamsundar 2018-11-29 15:25:05 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-4.1.6, please open a new bug report.

glusterfs-4.1.6 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-November/000116.html
[2] https://www.gluster.org/pipermail/gluster-users/

