+++ This bug was initially created as a clone of Bug #1614472 +++

Description of problem:

glusterd is not starting on the upgraded node due to a snapshot issue. This happens after upgrading from glusterfs-3.8.4-54 to glusterfs-3.8.4-54.15. This is a RHEV installation. We tried to copy /var/lib/glusterd/snaps/<snapshot in question> from a not-yet-upgraded node to the upgraded one. glusterd ran initially, but after a while it failed with the same issue on another snapshot. When we tried to remove that snapshot, glusterd crashed.

Version-Release number of selected component (if applicable):
mainline

Updated node: storage-011
Not updated nodes: storage-001, storage-006

How reproducible:

Steps to Reproduce:
1. Upgrade gluster on one node
2. Start glusterd on the upgraded node

Actual results:

glusterd starts, then we see the following errors in the logs and glusterd stops:

[2018-08-09 10:15:01.696962] E [MSGID: 106196] [glusterd-store.c:3610:glusterd_store_retrieve_snap] 0-management: Failed to update snapshot for snap_GMT-2018.03.19-14.53.58 snap
[2018-08-09 10:15:01.696991] E [MSGID: 106043] [glusterd-store.c:3772:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap_GMT-2018.03.19-14.53.58
[2018-08-09 10:15:01.697018] E [MSGID: 101019] [xlator.c:486:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-08-09 10:15:01.697030] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2018-08-09 10:15:01.697037] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2018-08-09 10:15:01.697638] W [glusterfsd.c:1300:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55a15087db5d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x1b1) [0x55a15087da01] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55a15087ceeb] ) 0-: received signum (1), shutting down

Expected results:

glusterd running with no errors

Additional info:

On collabshell we have extracted the logs, along with the content of the 'snaps' dirs.

These are the observations from my initial analysis:

1) All 3 nodes have sufficient space in the root filesystem (/var is not a separate filesystem; it is part of root "/").

2) The snapshot whose restoration from /var/lib/glusterd/snaps/<snap name> failed (snap_GMT-2018.03.19-14.53.58) has the same contents in the "/var/lib/glusterd/snaps/snap_GMT-2018.03.19-14.53.58/info" file on all 3 nodes.

3) On node storage-011 (which was updated first), glusterd went down at 10:20:01 because it could not rebuild the snapshot from the glusterd store (i.e. it failed to restore the contents of "/var/lib/glusterd/snaps/snap_GMT-2018.03.19-14.53.58" into memory):

[2018-08-09 10:20:01.807845] E [MSGID: 106196] [glusterd-store.c:3610:glusterd_store_retrieve_snap] 0-management: Failed to update snapshot for snap_GMT-2018.03.19-14.53.58 snap
[2018-08-09 10:20:01.807869] E [MSGID: 106043] [glusterd-store.c:3772:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap_GMT-2018.03.19-14.53.58
[2018-08-09 10:20:01.807890] E [MSGID: 101019] [xlator.c:486:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-08-09 10:20:01.807900] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2018-08-09 10:20:01.807907] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2018-08-09 10:20:01.808435] W [glusterfsd.c:1300:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x5556a07ffb5d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x1b1) [0x5556a07ffa01] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x5556a07feeeb] ) 0-: received signum (1), shutting down

4) On one of the other nodes (storage-001) the following errors were found around the same time (possibly because glusterd on node storage-011, i.e. 172.16.3.3, went down after its snapshot restoration failed):

[2018-08-09 10:20:02.891276] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1019
[2018-08-09 10:20:02.891296] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.003154] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1018
[2018-08-09 10:20:03.003196] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.117127] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1017
[2018-08-09 10:20:03.117144] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.231416] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1016
[2018-08-09 10:20:03.231473] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
The message "I [MSGID: 106499] [glusterd-handler.c:4370:__glusterd_handle_status_volume] 0-management: Received status volume req for volume rhv_hosted_engine" repeated 2 times between [2018-08-09 10:19:19.131160] and [2018-08-09 10:20:03.168645]
[2018-08-09 10:20:03.331271] I [MSGID: 106499] [glusterd-handler.c:4370:__glusterd_handle_status_volume] 0-management: Received status volume req for volume rhv_ocp_computes_data
[2018-08-09 10:20:03.345208] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1015
[2018-08-09 10:20:03.345248] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.459636] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1014
[2018-08-09 10:20:03.459651] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.595680] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1013
[2018-08-09 10:20:03.595702] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

Unfortunately the glusterd log on node storage-011 does not say why the function "glusterd_store_update_snap" failed. Possibly it received an op_errno other than "GD_STORE_EOF".

Rafi, can you please analyze based on the above observations?

--- Additional comment from Raghavendra Bhat on 2018-08-10 00:14:34 EDT ---

OK. One more observation. My suspicion is the following.

A description field is present in only the following 2 snapshots:

snap_GMT-2018.03.19-14.53.58
snap_GMT-2018.03.19-14.54.14

None of the other snapshots contain a description. For the above 2 snapshots, the info file (i.e. /var/lib/glusterd/snaps/<snap name>/info) contains a bare "desc=" line on storage-001 and storage-011 (the updated node), but contains "desc=test" on node storage-006.

So, on storage-011, while glusterd is restoring the information about one of the above 2 snapshots (it can happen for either of them, depending on which snapshot glusterd picks first for restoration), the value for the key "desc" is NULL. Thus the function "gf_store_read_and_tokenize" returns op_errno "GD_STORE_VALUE_NULL".
Thus the caller, "gf_store_iter_get_next", gets an error with op_errno set to "GD_STORE_VALUE_NULL". The function "glusterd_store_update_snap", which receives that error from "gf_store_iter_get_next", treats it as fatal and returns the error to its own caller, which leads to glusterd shutting itself down.

I think the workaround is this:
===============================

For the snapshots

snap_GMT-2018.03.19-14.53.58
snap_GMT-2018.03.19-14.54.14

copy the /var/lib/glusterd/snaps/<snap>/info file from the node "storage-006" to the nodes "storage-001" and "storage-011". Validate that the info file is the same on all 3 nodes by comparing the md5sum of the info file for the above 2 snapshots. Then try restarting glusterd on "storage-011", followed by upgrading the remaining 2 nodes (i.e. storage-001 and storage-006).

Rafi, can you please validate whether the above analysis and the workaround are correct? Please do a thorough review and see if anything is missed.

Regards,
Raghavendra
REVIEW: https://review.gluster.org/20747 (snapshot/handshake: store description after strdup) posted (#1) for review on master by mohammed rafi kc
REVIEW: https://review.gluster.org/20770 (snapshot:Fail snapshot creation if an empty description provided) posted (#1) for review on master by mohammed rafi kc
REVIEW: https://review.gluster.org/20854 (snapshot:Fix wrong dictionary key in snapshot cleanup code) posted (#1) for review on master by mohammed rafi kc
COMMIT: https://review.gluster.org/20770 committed in master by "Atin Mukherjee" <amukherj> with a commit message:

snapshot:Fail snapshot creation if an empty description provided

Snapshot description should have a valid string. Creating a snapshot with a null value will cause reading from the info file to fail with a null exception.

Change-Id: I9f84154b8e3e7ffefa5438807b3bb9b4e0d964ca
updates: bz#1618004
Signed-off-by: Mohammed Rafi KC <rkavunga>
COMMIT: https://review.gluster.org/20854 committed in master by "Atin Mukherjee" <amukherj> with a commit message:

snapshot:Fix wrong dictionary key in snapshot cleanup code

Snapshot was designed to support multiple volume snapshot, hence the volume name keys are labelled with volume count. So the volume key should have a volume count appended with key.

Change-Id: I044d73fc86db0e662dc914669aecfb82a6476fb5
fixes: bz#1618004
Signed-off-by: Mohammed Rafi KC <rkavunga>
COMMIT: https://review.gluster.org/20747 committed in master by "Atin Mukherjee" <amukherj> with a commit message:

snapshot/handshake: store description after strdup

problem: During a handshake, when we import a friend's data, the snap description variable was just a reference to the dictionary value.

Solution: the snap description should have separate memory allocated through gf_strdup.

Change-Id: I94da0c57919e1228919231d1563a001362b100b8
fixes: bz#1618004
Signed-off-by: Mohammed Rafi KC <rkavunga>
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report. glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution. [1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html [2] https://www.gluster.org/pipermail/gluster-users/