Description of problem:

gf_store_save_value() might not catch write failures. This can lead to files being emptied in /var/lib/glusterd/. Notably, /var/lib/glusterd/glusterd.info and /var/lib/glusterd/peers/* are prone to being emptied. It can easily be reproduced when the file system is full.

Version-Release number of selected component (if applicable):

All: 3.0 and upstream are impacted.

How reproducible:

Easily.

Steps to Reproduce:
1. Separate /var/lib/glusterd from /var/log to ensure that you don't miss logs.
2. Fill up the file system containing /var/lib/glusterd/ (do it in a loop, to ensure there is no free space at any point in time).
3. Restart glusterd on another node, to force an update.

Actual results:

glusterd fails to write the temporary file, but does not catch the error. The empty temporary file then takes the place of the previous file, and the content is lost.
-> If this happens to one of the peer files, glusterd will not be able to restart until the file is synced back from another node.
-> If this happens to glusterd.info, glusterd will generate a new UUID upon restart, and that will lead to the node's disappearance in RHEV-M.

Expected results:

gf_store_save_value() should catch the error and return it to the caller, so that the caller is warned not to replace the file, and so that a real error is written to the logs.

Additional info:

=> Normal logs (glusterd set to <DEBUG> log level):
----8<----
[2015-05-31 12:03:36.080921] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2015-05-31 12:03:36.081000] D [store.c:372:gf_store_save_value] 0-: returning: 0
[2015-05-31 12:03:36.081090] D [store.c:372:gf_store_save_value] 0-: returning: 0
---->8----
=> No error: gf_store_save_value() didn't produce a single warning, although the data was not written. This is because we did not check fflush's return code.

Suggested patch, in libglusterfs/src/store.c, gf_store_save_value (...):
----8<----
        ret = fflush (fp);
-       if (feof (fp)) {
+       if (feof (fp) || ret) {
                gf_log ("", GF_LOG_WARNING, "fflush failed, error: %s", strerror (errno));
---->8----

After the patch, in a similar situation, an error is successfully returned:

=> With additional logs (the "CEDRIC" logs have been added to show fflush's behaviour):
----8<----
[2015-05-31 15:23:02.644197] D [store.c:361:gf_store_save_value] 0-: CEDRIC: fflush returned: -1
[2015-05-31 15:23:02.644235] W [store.c:365:gf_store_save_value] 0-: fflush failed, error: No space left on device
[2015-05-31 15:23:02.644272] D [store.c:375:gf_store_save_value] 0-: returning: -1
[2015-05-31 15:23:02.644291] C [glusterd-store.c:1914:glusterd_store_global_info] 0-management: Storing uuid failed ret = -1
[2015-05-31 15:23:02.644548] E [glusterd-store.c:1943:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-05-31 15:23:02.644593] E [glusterd-handshake.c:1121:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
---->8----
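
The expected behaviour in this report (report the failed flush to the caller, so that the caller does not replace the old file) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the GlusterFS code: save_value_checked(), store_atomically() and their parameters are hypothetical names, and logging goes to stderr instead of gf_log().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the GlusterFS function): write "key=value\n"
 * to an open stream and report any failure of the buffered write. */
static int save_value_checked(FILE *fp, const char *key, const char *value)
{
    assert(fp != NULL && key != NULL && value != NULL);

    if (fprintf(fp, "%s=%s\n", key, value) < 0) {
        fprintf(stderr, "write failed, error: %s\n", strerror(errno));
        return -1;
    }
    /* fflush() returns EOF and sets errno (e.g. ENOSPC) when the
     * underlying write fails, so its return value must be checked. */
    if (fflush(fp) != 0) {
        fprintf(stderr, "fflush failed, error: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

/* Hypothetical caller: write the new content to tmp_path and rename it
 * over path only if every step succeeded. */
static int store_atomically(const char *path, const char *tmp_path,
                            const char *key, const char *value)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    int ret = save_value_checked(fp, key, value);

    /* fclose() can also report a deferred write error. */
    if (fclose(fp) != 0)
        ret = -1;

    if (ret != 0) {
        remove(tmp_path);   /* drop the incomplete temporary file */
        return -1;          /* the previous file is left untouched */
    }
    return rename(tmp_path, path);
}
```

With this shape, a full file system makes store_atomically() return -1 before rename() is reached, so the previous content of the target file survives.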
Created attachment 1032891
check fflush's return value

Checking feof() does not seem to be sufficient for catching errors.
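
To illustrate why feof() alone is not enough, here is a small probe that records what fflush(), feof() and ferror() each report after a buffered write. It is a sketch, not part of the attached patch: probe_flush() and struct flush_report are hypothetical names. On Linux it can be pointed at /dev/full, a device whose writes fail with ENOSPC, to simulate a full file system.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical report structure, for illustration only. */
struct flush_report {
    int flush_ret;  /* return value of fflush(): 0 on success, EOF on failure */
    int eof_set;    /* non-zero if feof() is set after the flush */
    int error_set;  /* non-zero if ferror() is set after the flush */
};

/* Write a short record to path, flush it, and record each indicator.
 * Returns -1 if the file cannot be opened, 0 otherwise. */
static int probe_flush(const char *path, struct flush_report *out)
{
    assert(path != NULL && out != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    /* A short write to a non-terminal stream normally only fills the
     * stdio buffer, so no error is visible yet. */
    fputs("UUID=example\n", fp);

    /* The data is handed to the kernel here; this is where a full
     * file system shows up. */
    out->flush_ret = fflush(fp);
    out->eof_set   = feof(fp);    /* end-of-file indicator, meant for reads */
    out->error_set = ferror(fp);  /* error indicator, set on write failure */

    fclose(fp);
    return 0;
}
```

The end-of-file indicator is meant for input streams, so a failed write is reported through the return value of fflush() and through ferror(), which is the distinction the patch relies on.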
The upstream patch http://review.gluster.org/11029 is now merged.
downstream patch url: https://code.engineering.redhat.com/gerrit/54977
Verified this bug with version "glusterfs-3.7.1-12".

Steps used:
~~~~~~~~~~~
1. Filled up /var.
2. glusterd restart failed.
3. Checked the glusterd log for the error message (no space).

Portion of log:
~~~~~~~~~~~~~~~
[2015-08-27 16:53:08.807647] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:08.834497] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:08.834560] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-08-27 16:53:08.834584] E [MSGID: 106089] [glusterd-handshake.c:1199:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
[2015-08-27 16:53:11.816557] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:11.841544] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:11.841604] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-08-27 16:53:11.841628] E [MSGID: 106089] [glusterd-handshake.c:1199:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
[2015-08-27 16:53:14.826202] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:14.857281] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:14.857346] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info
[2015-08-27 16:53:14.857370] E [MSGID: 106089] [glusterd-handshake.c:1199:__glusterd_mgmt_hndsk_versions_ack] 0-management: Failed to store op-version
[2015-08-27 16:53:17.835288] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-08-27 16:53:17.872672] E [MSGID: 101012] [store.c:72:gf_store_mkstemp] 0-: Failed to open /var/lib/glusterd/glusterd.info.tmp. [No space left on device]
[2015-08-27 16:53:17.872738] E [MSGID: 106177] [glusterd-store.c:1898:glusterd_store_global_info] 0-management: Failed to store glusterd global-info

Moving this bug to the verified state based on the above verification result.
Gaurav, please review and sign off on the edited text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1845.html