Bug 1308487
| Field | Value |
|---|---|
| Summary | GlusterD is not starting after multiple restarts on one node and parallel volume set options from other node. |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Byreddy <bsrirama> |
| Component | glusterd |
| Assignee | Gaurav Kumar Garg <ggarg> |
| Status | CLOSED WONTFIX |
| QA Contact | storage-qa-internal <storage-qa-internal> |
| Severity | low |
| Priority | low |
| Version | rhgs-3.1 |
| CC | amukherj, rhs-bugs, smohan, songxin_1980, storage-qa-internal, vbellur |
| Target Milestone | --- |
| Keywords | ZStream |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | Bug Fix |
| Story Points | --- |
|  | 1408431 (view as bug list) |
| Last Closed | 2016-02-22 04:44:11 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1408431 |
Description
Byreddy 2016-02-15 11:16:31 UTC

This issue was hit as part of negative testing: while `gluster volume set` was being executed, glusterd on another node was brought down at the same point in time. On the faulty node we could see the /var/lib/glusterd/vols/<volname>/info file being empty, whereas the info.tmp file had the correct contents. This indicates that the rename from the .tmp file to the actual one, performed while committing to the glusterd store, failed. Since rename is a syscall and should be atomic, this shouldn't have happened. Further analysis is needed.

man 2 rename has the following notes:

> If newpath exists but the operation fails for some reason, rename() guarantees to leave an instance of newpath in place.
>
> When overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed.

Considering the above two points, there is a possible window where, while rename() was in the middle of renaming the info.tmp file to info, cleanup_and_exit() forcibly brought the glusterd process down, terminating the rename operation midway and leaving both files in place: info.tmp with all the content, and info zeroed out.

If we can ensure that in cleanup_and_exit() (in glusterfs as a whole) all threads first finish processing their tasks and only then a graceful shutdown happens, we should be able to take care of such problems. Since this is a negative test and the occurrence of glusterd going down while a transaction is in flight is rare, lowering the priority and severity.

I don't think we need to spend much time fixing this kind of issue, as it comes from a negative test and is unlikely in a production setup. I am closing this bug as WONTFIX. Feel free to reopen with proper justification.

Created attachment 1219638 [details]: the vols directory

The attachment is the vols directory from the node where glusterd can't be started.
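The commit path described above writes the new contents to info.tmp and then renames it over info. The write-to-temp-then-rename pattern can be sketched as follows; this is an illustrative Python sketch, not the actual glusterd_store C code, and the fsync-the-directory step is an assumption about what a crash-safe commit needs rather than something the bug report states glusterd does:

```python
import os

def atomic_store_write(path, data):
    """Write `data` to `path` via a temp file plus rename, so that a
    reader never observes a partially written file."""
    tmp = path + ".tmp"
    # Write the new contents to the temp file first.
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push file data to disk before the rename
    # rename(2) is atomic on POSIX filesystems: readers see either the
    # old file or the new one, never a mix.
    os.rename(tmp, path)
    # fsync the containing directory so the rename itself survives a crash.
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Even with this pattern, if the whole process is killed mid-rename (as cleanup_and_exit() does here), rename(2) only guarantees that *some* instance of the destination remains in place, which matches the empty-info symptom described above.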
Hi Atin Mukherjee,

As you know, I hit a very similar issue in glusterfs 3.7.6. When I start glusterd, some errors happen; the log follows:

```
[2016-11-08 07:58:34.989365] I [MSGID: 100030] [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2016-11-08 07:58:34.998356] I [MSGID: 106478] [glusterd.c:1350:init] 0-management: Maximum allowed open file descriptors set to 65536
[2016-11-08 07:58:35.000667] I [MSGID: 106479] [glusterd.c:1399:init] 0-management: Using /system/glusterd as working directory
[2016-11-08 07:58:35.024508] I [MSGID: 106514] [glusterd-store.c:2075:glusterd_restore_op_version] 0-management: Upgrade detected. Setting op-version to minimum : 1
[2016-11-08 07:58:35.025356] E [MSGID: 106206] [glusterd-store.c:2562:glusterd_store_update_volinfo] 0-management: Failed to get next store iter
[2016-11-08 07:58:35.025401] E [MSGID: 106207] [glusterd-store.c:2844:glusterd_store_retrieve_volume] 0-management: Failed to update volinfo for c_glusterfs volume
[2016-11-08 07:58:35.025463] E [MSGID: 106201] [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: c_glusterfs
[2016-11-08 07:58:35.025544] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-11-08 07:58:35.025582] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-11-08 07:58:35.025629] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-11-08 07:58:35.026109] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init-0x1b260) [0x1000a718] -->/usr/sbin/glusterd(glusterfs_process_volfp-0x1b3b8) [0x1000a5a8] -->/usr/sbin/glusterd(cleanup_and_exit-0x1c02c) [0x100098bc] ) 0-: received signum (0), shutting down
```

I then found that the size of vols/volume_name/info is 0, which causes glusterd to shut down, but vols/volume_name/info.tmp is not empty. I also found a brick file vols/volume_name/bricks/xxxx.brick of size 0, while vols/volume_name/bricks/xxxx.brick.tmp is not empty.

In comment 2 you said: "This issue is hit at part of the negative testing where while gluster volume set was executed at the same point of time glusterd in another instance was brought down. In the faulty node we could see /var/lib/glusterd/vols/<volname>info file been empty whereas the info.tmp file has the correct contents."

I have two questions for you:

1. Could you reproduce this issue by running `gluster volume set` while glusterd was brought down?
2. Are you certain that this issue is caused by rename() being interrupted in the kernel? In my case two files, info and 10.32.1.144.-opt-lvmdir-c2-brick, are both empty, but in my view only one rename() can be running at a time. Why are both files empty? Or are rename("info.tmp", "info") and rename("xxx-brick.tmp", "xxx-brick") running in two threads?

I have added the vols directory as an attachment.

Thanks,
Xin
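In both reports the symptom is the same: an empty final file sitting next to a non-empty .tmp twin. A hypothetical diagnostic sketch that scans a vols directory for such torn pairs is shown below; this is not an official GlusterFS tool, and the directory layout is only what the paths quoted in this bug suggest:

```python
import os

def find_torn_store_files(vols_dir):
    """Return paths of files that are empty while a non-empty .tmp
    twin exists next to them -- the symptom described in this bug."""
    torn = []
    for root, _dirs, files in os.walk(vols_dir):
        for name in files:
            if not name.endswith(".tmp"):
                continue
            tmp_path = os.path.join(root, name)
            final_path = tmp_path[:-len(".tmp")]
            # Flag the pair only when the committed file was zeroed out
            # but the temp file still holds the real contents.
            if (os.path.exists(final_path)
                    and os.path.getsize(final_path) == 0
                    and os.path.getsize(tmp_path) > 0):
                torn.append(final_path)
    return torn
```

Running this over the attached vols directory would show whether info and the brick file are the only torn pairs, which bears on question 2 above (one interrupted rename versus several).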