1226665 – gf_store_save_value fails to check for errors, leading to emptying files in /var/lib/glusterd/

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	rhgs-3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.1.1
Assignee:	Gaurav Kumar Garg
QA Contact:	Byreddy
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1226829 1251815 1253148

Reported:	2015-05-31 15:47 UTC by Cedric Buissart
Modified:	2019-08-15 04:39 UTC
CC List:	12 users
Fixed In Version:	glusterfs-3.7.1-12
Doc Type:	Bug Fix
Doc Text:	Previously, when no space was left on the file system and a user performed an operation that changed files under /var/lib/glusterd/*, glusterd failed to write to a temporary file. With this fix, a proper error message is displayed when the file system containing /var/lib/glusterd/* is full.
Clone Of:
Clones:	1226829
Environment:
Last Closed:	2015-10-05 07:09:38 UTC
Embargoed:
Dependent Products:

Attachments
check fflush's return value (488 bytes, patch) 2015-05-31 16:39 UTC, Cedric Buissart	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	1459813	0	None	None	None	Never
Red Hat Product Errata	RHSA-2015:1845	0	normal	SHIPPED_LIVE	Moderate: Red Hat Gluster Storage 3.1 update	2015-10-05 11:06:22 UTC

Description Cedric Buissart 2015-05-31 15:47:34 UTC

Description of problem:

gf_store_save_value() might not catch write failures. This can lead to files being emptied in /var/lib/glusterd/.
Noticeably, /var/lib/glusterd/glusterd.info and /var/lib/glusterd/peers/* are prone to being emptied.

It can easily be reproduced when the file system is full.



Version-Release number of selected component (if applicable): all: 3.0 and upstream are impacted


How reproducible: easy peasey


Steps to Reproduce:
1. Put /var/lib/glusterd on a separate file system from /var/log, so that the logs can still be written
2. Fill up the file system containing /var/lib/glusterd/ (do it in a loop, to ensure there is no free space at any point in time)
3. Restart glusterd on another node, to force an update

Actual results:

glusterd will fail to write the temporary file, but will not catch the error. Thus the empty temporary file takes the place of the previous file, and the content is lost.

-> If this happens to one of the peer files, glusterd will not be able to restart until the file is synced back from another node.

-> If this happens to glusterd.info, glusterd will regenerate a new UUID upon restart, and that will lead to the node's disappearance in RHEV-M.

Expected results:

gf_store_save_value() should catch the error and return it to the caller, so that the caller does not replace the existing file and a real error is written to the logs.
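
The expected behaviour can be sketched as follows. This is only an illustration of the write-to-temporary-file-then-rename pattern described above, not the GlusterFS implementation: save_file_atomically() is a hypothetical helper, and the ".tmp" suffix is an assumption borrowed from the glusterd.info.tmp name seen in the logs.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not part of GlusterFS: write `content` to
 * "<path>.tmp" and rename it over `path` only if every step succeeded.
 * Returns 0 on success, -1 on failure (the existing file is left untouched). */
int
save_file_atomically (const char *path, const char *content)
{
        char  tmp[4096];
        FILE *fp  = NULL;
        int   len = 0;
        int   ret = -1;

        len = snprintf (tmp, sizeof (tmp), "%s.tmp", path);
        if (len < 0 || (size_t) len >= sizeof (tmp))
                return -1;      /* path does not fit in the buffer */

        fp = fopen (tmp, "w");
        if (!fp)
                return -1;

        /* A buffered write can appear to succeed on a full disk, so the
         * results of the flush and the sync are checked as well. */
        if (fputs (content, fp) == EOF)
                goto out;
        if (fflush (fp) != 0)
                goto out;
        if (fsync (fileno (fp)) != 0)
                goto out;

        ret = 0;
out:
        if (fclose (fp) != 0)
                ret = -1;

        /* Replace the real file only when the temporary file is complete. */
        if (ret == 0 && rename (tmp, path) != 0)
                ret = -1;
        if (ret != 0)
                unlink (tmp);   /* never promote or keep a partial file */

        return ret;
}
```

With this shape, a caller that gets -1 back still has the previous content of the file on disk and can log the failure instead of silently losing data.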

Additional info:

=> Normal logs (glusterd set in <DEBUG> log level) :
----8<----
[2015-05-31 12:03:36.080921] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2015-05-31 12:03:36.081000] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2015-05-31 12:03:36.081090] D [store.c:372:gf_store_save_value] 0-: returning: 0
---->8----

=> No error: gf_store_save_value() did not produce a single warning, although the data was not written.

This is because the return value of fflush() is not checked.
See the suggested patch:


in libglusterfs/src/store.c, gf_store_save_value (...)
----8<----
         ret = fflush (fp);
-        if (feof (fp)) {
+        if (feof (fp) || ret) {
                 gf_log ("", GF_LOG_WARNING,
                         "fflush failed, error: %s",
                         strerror (errno));
---->8----
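
For readers without the full source at hand, here is a minimal, self-contained sketch of the same idea. save_value() is a simplified stand-in written for this report, not the actual gf_store_save_value(); the "key=value" line format and the use of stderr instead of gf_log() are assumptions made for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for gf_store_save_value(); not the GlusterFS source.
 * Writes "key=value\n" to fp and returns 0 on success, -1 on any failure. */
int
save_value (FILE *fp, const char *key, const char *value)
{
        if (!fp || !key || !value)
                return -1;

        /* With a buffered stream this usually only fills the buffer, so a
         * full disk is often not detected here. */
        if (fprintf (fp, "%s=%s\n", key, value) < 0) {
                fprintf (stderr, "write failed, error: %s\n",
                         strerror (errno));
                return -1;
        }

        /* fflush() returns EOF and sets errno when the data cannot be
         * written out; this is the check the patch adds. */
        if (fflush (fp) != 0) {
                fprintf (stderr, "fflush failed, error: %s\n",
                         strerror (errno));
                return -1;
        }

        return 0;
}
```

On Linux, opening /dev/full for writing and passing the stream to save_value() is a quick way to exercise the failure path without filling a real file system.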

After the patch, in a similar situation, an error is correctly returned:
=> With additional logs (the "CEDRIC" logs were added to show fflush's behaviour)
----8<----
[2015-05-31 15:23:02.644197] D [store.c:361:gf_store_save_value] 0-: CEDRIC: fflush returned: -1
[2015-05-31 15:23:02.644235] W [store.c:365:gf_store_save_value] 0-: fflush failed, error: No space left on device
[2015-05-31 15:23:02.644272] D [store.c:375:gf_store_save_value] 0-: returning: -1
[2015-05-31 15:23:02.644291] C [glusterd-store.c:1914:glusterd_store_global_info] 0-management: Storing uuid failed ret = -1
[2015-05-31 15:23:02.644548] E [glusterd-store.c:1943:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-05-31 15:23:02.644593] E [glusterd-handshake.c:1121:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
---->8----

Comment 2 Cedric Buissart 2015-05-31 16:39:01 UTC

Created attachment 1032891 [details]
check fflush's return value

Checking feof() does not seem to be sufficient for catching errors.
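
A small probe can make this visible. The sketch below is illustrative only (write_and_report() and struct flush_report are hypothetical names): it records what fflush(), feof() and ferror() report after a write, so that the end-of-file indicator can be compared with the error indicator.

```c
#include <stdio.h>

/* Hypothetical probe, for illustration only. */
struct flush_report {
        int flush_ret;  /* return value of fflush() */
        int at_eof;     /* 1 if feof() is set after the flush */
        int has_error;  /* 1 if ferror() is set after the flush */
};

struct flush_report
write_and_report (FILE *fp, const char *data)
{
        struct flush_report r;

        fputs (data, fp);               /* may only fill the stdio buffer */
        r.flush_ret = fflush (fp);      /* 0 on success, EOF on failure */
        r.at_eof    = feof (fp) != 0;   /* end-of-file indicator */
        r.has_error = ferror (fp) != 0; /* error indicator */

        return r;
}
```

feof() reports the end-of-file indicator, which is meant for read operations; a failed write is signalled through the return value of fflush() and the error indicator tested by ferror(). On Linux, running the probe against a stream opened on /dev/full is expected to show a failing flush with the end-of-file indicator still clear.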

Comment 4 Atin Mukherjee 2015-08-08 13:25:58 UTC

Upstream patch http://review.gluster.org/11029 is merged now

Comment 6 Gaurav Kumar Garg 2015-08-12 09:33:47 UTC

downstream patch url: https://code.engineering.redhat.com/gerrit/54977

Comment 8 Byreddy 2015-08-27 12:53:19 UTC

Verified this bug with version "glusterfs-3.7.1-12"

Steps used:
~~~~~~~~~~~
1. Filled up /var.
2. Restarted glusterd; the restart failed.
3. Checked the glusterd log for the error message (no space left on device).

Portion of log:
~~~~~~~~~~~~~~~
[2015-08-27 16:53:08.807647] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:08.834497] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:08.834560] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-08-27 16:53:08.834584] E [MSGID: 106089] [glusterd-handshake.c:1199:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
[2015-08-27 16:53:11.816557] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:11.841544] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:11.841604] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-08-27 16:53:11.841628] E [MSGID: 106089] [glusterd-handshake.c:1199:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
[2015-08-27 16:53:14.826202] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:14.857281] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:14.857346] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-08-27 16:53:14.857370] E [MSGID: 106089] [glusterd-handshake.c:1199:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
[2015-08-27 16:53:17.835288] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:17.872672] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:17.872738] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info

Moving this bug to the verified state based on the above verification result.

Comment 9 Divya 2015-09-22 08:48:09 UTC

Gaurav,

Please review and sign off on the edited text.

Comment 11 errata-xmlrpc 2015-10-05 07:09:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html
