Bug 1618004 - [GSS] glusterd not starting after upgrade due to snapshots error in RHEV + RHGS
Summary: [GSS] glusterd not starting after upgrade due to snapshots error in RHEV + RHGS
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: snapshot
Version: mainline
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohammed Rafi KC
QA Contact:
URL:
Whiteboard:
Depends On: 1614472
Blocks:
 
Reported: 2018-08-16 11:41 UTC by Mohammed Rafi KC
Modified: 2018-10-23 15:17 UTC (History)
CC List: 13 users

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1614472
Environment:
Last Closed: 2018-10-23 15:17:10 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Mohammed Rafi KC 2018-08-16 11:41:57 UTC
+++ This bug was initially created as a clone of Bug #1614472 +++

Description of problem:

glusterd is not starting on the upgraded node due to a snapshot issue. This happened after upgrading from glusterfs-3.8.4-54 to glusterfs-3.8.4-54.15. This is an RHEV installation.

We tried to copy /var/lib/glusterd/snaps/<snapshot in question> from a non-upgraded node to the upgraded one. glusterd initially came up, but after a while it failed with an issue with another snapshot.

When we tried to remove the snapshot, glusterd crashed.

Version-Release number of selected component (if applicable):

mainline

Updated node: 
storage-011

Not updated nodes:
storage-001
storage-006

How reproducible:

Steps to Reproduce:
1. Upgrade gluster on one node
2. Start glusterd on the upgraded node

Actual results:

glusterd starts, then errors appear in the logs and glusterd stops:

[2018-08-09 10:15:01.696962] E [MSGID: 106196] [glusterd-store.c:3610:glusterd_store_retrieve_snap] 0-management: Failed to update snapshot for snap_GMT-2018.03.19-14.53.58 snap
[2018-08-09 10:15:01.696991] E [MSGID: 106043] [glusterd-store.c:3772:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap_GMT-2018.03.19-14.53.58
[2018-08-09 10:15:01.697018] E [MSGID: 101019] [xlator.c:486:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-08-09 10:15:01.697030] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2018-08-09 10:15:01.697037] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2018-08-09 10:15:01.697638] W [glusterfsd.c:1300:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55a15087db5d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x1b1) [0x55a15087da01] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55a15087ceeb] ) 0-: received signum (1), shutting down

Expected results:

glusterd running with no errors

Additional info:

We have extracted the logs on collabshell:

Also the content of the 'snaps' dirs:



The following are the observations from my initial analysis.

1) All three nodes seem to have sufficient space in the root filesystem (/var is not a separate filesystem; it is part of root "/").
2) The snapshot whose restoration from /var/lib/glusterd/snaps/<snap name> failed (snap_GMT-2018.03.19-14.53.58) has the same contents in the "/var/lib/glusterd/snaps/snap_GMT-2018.03.19-14.53.58/info" file on all three nodes.
3) On node storage-011 (which was updated first), glusterd went down at 10:20:01 since it could not build the snapshot from the glusterd store (i.e. it failed to restore the contents of "/var/lib/glusterd/snaps/snap_GMT-2018.03.19-14.53.58" into memory):

[2018-08-09 10:20:01.807845] E [MSGID: 106196] [glusterd-store.c:3610:glusterd_store_retrieve_snap] 0-management: Failed to update snapshot for snap_GMT-2018.03.19-14.53.58 snap
[2018-08-09 10:20:01.807869] E [MSGID: 106043] [glusterd-store.c:3772:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap_GMT-2018.03.19-14.53.58
[2018-08-09 10:20:01.807890] E [MSGID: 101019] [xlator.c:486:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-08-09 10:20:01.807900] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2018-08-09 10:20:01.807907] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2018-08-09 10:20:01.808435] W [glusterfsd.c:1300:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x5556a07ffb5d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x1b1) [0x5556a07ffa01] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x5556a07feeeb] ) 0-: received signum (1), shutting down

4) On one of the other nodes (storage-001), the following errors were seen around the same time (possibly because glusterd on node storage-011, i.e. 172.16.3.3, went down since its snapshot restoration failed):

[2018-08-09 10:20:02.891276] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1019
[2018-08-09 10:20:02.891296] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.003154] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1018
[2018-08-09 10:20:03.003196] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.117127] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1017
[2018-08-09 10:20:03.117144] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.231416] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1016
[2018-08-09 10:20:03.231473] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
The message "I [MSGID: 106499] [glusterd-handler.c:4370:__glusterd_handle_status_volume] 0-management: Received status volume req for volume rhv_hosted_engine" repeated 2 times between [2018-08-09 10:19:19.13116
0] and [2018-08-09 10:20:03.168645]
[2018-08-09 10:20:03.331271] I [MSGID: 106499] [glusterd-handler.c:4370:__glusterd_handle_status_volume] 0-management: Received status volume req for volume rhv_ocp_computes_data
[2018-08-09 10:20:03.345208] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1015
[2018-08-09 10:20:03.345248] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.459636] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1014
[2018-08-09 10:20:03.459651] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2018-08-09 10:20:03.595680] E [rpcsvc.c:300:rpcsvc_program_actor] 0-rpc-service: RPC Program procedure not available for procedure 39 in GlusterD svc cli read-only for 172.16.3.3:1013
[2018-08-09 10:20:03.595702] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully


Unfortunately the glusterd logs from node storage-011 do not contain any further information as to why the function "glusterd_store_update_snap" failed. Possibly because it received an op_errno other than "GD_STORE_EOF".

Rafi, can you please analyze this based on the above observations?

--- Additional comment from Raghavendra Bhat on 2018-08-10 00:14:34 EDT ---


OK. One more observation.

My suspicion is the following.

The description field is present in only the following 2 snapshots.

snap_GMT-2018.03.19-14.53.58
snap_GMT-2018.03.19-14.54.14

None of the other snapshots contain a description. For the above 2 snapshots, the info file (i.e. /var/lib/glusterd/snaps/<snap name>/info) contains a "desc=" line (with an empty value) on storage-001 and storage-011 (which has been updated).

But the info file for the above 2 snapshots contains "desc=test" on the node storage-006.

So, for storage-011, what is happening is this: while glusterd is trying to restore the information about the snapshot (one of the above 2; it can happen for either of them, depending on which snapshot glusterd picks first for restoration), the value for the key "desc" is NULL.

Thus the function "gf_store_read_and_tokenize" returns op_errno as "GD_STORE_VALUE_NULL". Its caller "gf_store_iter_get_next" therefore gets an error with op_errno set to "GD_STORE_VALUE_NULL". And the function "glusterd_store_update_snap", which receives that error from "gf_store_iter_get_next", treats it as fatal and returns an error to its caller, which leads to glusterd shutting itself down.
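To make the failure mode concrete, here is a minimal, self-contained C sketch of that pattern. The types and function names below are simplified stand-ins, not the actual glusterd store API: a key with an empty value makes the tokenizer report a "value is NULL" condition, the restore loop treats anything other than success/EOF as fatal, and the whole snapshot restore aborts.

/* Minimal sketch of the suspected failure path; the types and helpers
 * here are simplified stand-ins, not the actual glusterd store API. */
#include <stdio.h>
#include <string.h>

enum store_err { STORE_SUCCESS, STORE_EOF, STORE_VALUE_NULL };

/* Parse one "key=value" line; an empty value yields STORE_VALUE_NULL. */
static enum store_err read_and_tokenize(const char *line,
                                        char *key, char *value)
{
    const char *eq = strchr(line, '=');
    if (!eq)
        return STORE_EOF;
    size_t klen = (size_t)(eq - line);
    memcpy(key, line, klen);
    key[klen] = '\0';
    if (*(eq + 1) == '\0')
        return STORE_VALUE_NULL;   /* e.g. a bare "desc=" with no value */
    strcpy(value, eq + 1);
    return STORE_SUCCESS;
}

/* Restore loop: anything other than SUCCESS or EOF is treated as fatal,
 * so a single "desc=" line makes the whole snapshot restore fail. */
static int restore_snap(const char **lines, int nlines)
{
    char key[256], value[256];
    for (int i = 0; i < nlines; i++) {
        enum store_err ret = read_and_tokenize(lines[i], key, value);
        if (ret == STORE_EOF)
            break;
        if (ret != STORE_SUCCESS)
            return -1;             /* bubbles up; glusterd init aborts */
        printf("restored %s=%s\n", key, value);
    }
    return 0;
}

int main(void)
{
    const char *info[] = { "snap_name=snap_GMT-2018.03.19-14.53.58", "desc=" };
    if (restore_snap(info, 2) < 0)
        fprintf(stderr, "Unable to restore snapshot\n");
    return 0;
}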

I think the workaround is this:
===============================

For the snapshots

snap_GMT-2018.03.19-14.53.58
snap_GMT-2018.03.19-14.54.14

copy the /var/lib/glusterd/snaps/<snap>/info file from node "storage-006" to the nodes "storage-001" and "storage-011". Validate that the info file is the same on all three nodes by comparing the md5sum of the info file for the above 2 snapshots. Then try restarting glusterd on "storage-011", followed by upgrading the remaining 2 nodes (i.e. storage-001 and storage-006).

Rafi, can you please validate whether the above analysis and the workaround are correct? Please do a thorough review and see if anything is missed.

Regards,
Raghavendra

Comment 1 Worker Ant 2018-08-16 11:46:21 UTC
REVIEW: https://review.gluster.org/20747 (snapshot/handshake: store description after strdup) posted (#1) for review on master by mohammed rafi  kc

Comment 2 Worker Ant 2018-08-17 20:01:03 UTC
REVIEW: https://review.gluster.org/20770 (snapshot:Fail snapshot creation if an empty description provided) posted (#1) for review on master by mohammed rafi  kc

Comment 3 Worker Ant 2018-08-18 18:50:11 UTC
REVIEW: https://review.gluster.org/20854 (snapshot:Fix wrong dictionary key in snapshot cleanup code) posted (#1) for review on master by mohammed rafi  kc

Comment 4 Worker Ant 2018-08-19 13:18:40 UTC
COMMIT: https://review.gluster.org/20770 committed in master by "Atin Mukherjee" <amukherj> with a commit message- snapshot:Fail snapshot creation if an empty description provided

A snapshot description should be a valid string. Creating a
snapshot with a null value will cause reading from the info file
to fail with a null-value error.

Change-Id: I9f84154b8e3e7ffefa5438807b3bb9b4e0d964ca
updates: bz#1618004
Signed-off-by: Mohammed Rafi KC <rkavunga>
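For readers unfamiliar with the change, a hedged sketch of the kind of validation the commit message describes follows (a simplified stand-in, not the actual glusterd patch): reject an explicitly empty description at create time so that a bare "desc=" line never lands in the info file.

/* Sketch of validating the description at snapshot-create time
 * (simplified stand-in, not the actual glusterd patch). */
#include <stdio.h>
#include <string.h>

static int validate_snap_description(const char *desc)
{
    /* No description at all is fine; an explicitly empty one is not,
     * since a "desc=" line in the info file later fails to parse on restore. */
    if (desc && strlen(desc) == 0) {
        fprintf(stderr, "snapshot description cannot be empty\n");
        return -1;
    }
    return 0;
}

int main(void)
{
    printf("%d\n", validate_snap_description(NULL));   /* 0: allowed  */
    printf("%d\n", validate_snap_description(""));     /* -1: rejected */
    printf("%d\n", validate_snap_description("test")); /* 0: allowed  */
    return 0;
}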

Comment 5 Worker Ant 2018-08-20 02:55:40 UTC
COMMIT: https://review.gluster.org/20854 committed in master by "Atin Mukherjee" <amukherj> with a commit message- snapshot:Fix wrong dictionary key in snapshot cleanup code

Snapshot was designed to support multiple-volume snapshots,
hence the volume name keys are labelled with a volume count.
So the volume key should have the volume count appended to
the key.

Change-Id: I044d73fc86db0e662dc914669aecfb82a6476fb5
fixes: bz#1618004
Signed-off-by: Mohammed Rafi KC <rkavunga>
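A rough illustration of the keying convention the commit message refers to (hypothetical key names and data; not the actual cleanup code): per-volume dictionary keys carry the volume count as a suffix, so a lookup of a bare "volname" key instead of "volname1", "volname2", ... will miss the entry.

/* Sketch of per-volume dictionary keys carrying a volume count
 * (hypothetical key names and data, not the actual glusterd cleanup code). */
#include <stdio.h>

int main(void)
{
    const char *volnames[] = { "rhv_hosted_engine", "rhv_ocp_computes_data" };
    char key[64];

    for (int volcount = 1; volcount <= 2; volcount++) {
        /* The cleanup code must look up "volname1", "volname2", ...
         * rather than a bare "volname", or the lookup silently misses. */
        snprintf(key, sizeof(key), "volname%d", volcount);
        printf("%s=%s\n", key, volnames[volcount - 1]);
    }
    return 0;
}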

Comment 6 Worker Ant 2018-08-20 02:56:09 UTC
COMMIT: https://review.gluster.org/20747 committed in master by "Atin Mukherjee" <amukherj> with a commit message- snapshot/handshake: store description after strdup

Problem:
During a handshake, when we import friend data, the
snap description variable was just a reference to the
dictionary value.

Solution:
The snap description should have separate memory allocated
through gf_strdup.

Change-Id: I94da0c57919e1228919231d1563a001362b100b8
fixes: bz#1618004
Signed-off-by: Mohammed Rafi KC <rkavunga>
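A simplified sketch of the problem/fix pattern described above, with plain strdup standing in for gf_strdup and a stand-in struct instead of the real glusterfs dict: aliasing the dictionary's buffer leaves a dangling pointer once the imported friend data is freed, whereas duplicating the string gives the snapshot its own copy.

/* Sketch of "copy, don't alias" when importing a value from a dict
 * whose memory may be freed later (plain strdup stands in for gf_strdup,
 * and the struct below is a stand-in, not the real glusterfs dict). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct snap { char *description; };

static void import_desc(struct snap *snap, const char *dict_value)
{
    /* Buggy pattern: snap->description = (char *)dict_value;
     * the pointer dangles once the peer data/dict is freed. */
    snap->description = strdup(dict_value);   /* fix: own a private copy */
}

int main(void)
{
    char *dict_value = strdup("test");        /* value owned by the "dict" */
    struct snap snap = { 0 };

    import_desc(&snap, dict_value);
    free(dict_value);                         /* dict/peer data goes away */

    printf("desc=%s\n", snap.description);    /* still valid: "test" */
    free(snap.description);
    return 0;
}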

Comment 7 Shyamsundar 2018-10-23 15:17:10 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

