Bug 1562484
Field | Value
---|---
Summary | [Tracker-RHGS-BZ#1631664] device remove fails with error 'Id not found', although the device is still present
Product | [Red Hat Storage] Red Hat Gluster Storage
Reporter | krishnaram Karthick <kramdoss>
Component | heketi
Assignee | Michael Adam <madam>
Status | CLOSED DEFERRED
QA Contact | krishnaram Karthick <kramdoss>
Severity | high
Docs Contact |
Priority | unspecified
Version | cns-3.9
CC | amukherj, hchiramm, jmulligan, ndevos, rhs-bugs, rtalur, storage-qa-internal
Target Milestone | ---
Target Release | ---
Hardware | Unspecified
OS | Unspecified
Whiteboard |
Fixed In Version |
Doc Type | If docs needed, set a value
Doc Text |
Story Points | ---
Clone Of |
Environment |
Last Closed | 2019-01-24 00:16:17 UTC
Type | Bug
Regression | ---
Mount Type | ---
Documentation | ---
CRM |
Verified Versions |
Category | ---
oVirt Team | ---
RHEL 7.3 requirements from Atomic Host |
Cloudforms Team | ---
Target Upstream Version |
Embargoed |
Bug Depends On | 1631664
Bug Blocks | 1641915
Attachments | heketi_logs (attachment 1415356)
Description
krishnaram Karthick, 2018-03-31 04:00:45 UTC
Created attachment 1415356 [details]
heketi_logs
We will analyze the issue in detail soon. However, I want to make one comment here about the error message itself, i.e.:

> Error: Failed to remove device, error: Id not found

I think these 'Id not found' messages should also list the problematic ID. In most cases it will be the same ID provided in the command itself, but it is still good to display the ID in the error, and it also helps if we fail to find some other ID (not sure we have such a code path, though) in this transaction.
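As a rough illustration of that suggestion (a minimal standalone sketch only; the map, function, and IDs below are made up and this is not heketi's actual code), the lookup error could carry the offending ID:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound mirrors the generic "Id not found" error text from the report.
var ErrNotFound = errors.New("Id not found")

// removeDevice is a stand-in for a lookup-then-delete path; the map and IDs
// are invented for illustration only.
func removeDevice(devices map[string]string, id string) error {
	if _, ok := devices[id]; !ok {
		// Wrap the generic error so callers (and the CLI output)
		// can see exactly which ID was not found.
		return fmt.Errorf("%w: %s", ErrNotFound, id)
	}
	delete(devices, id)
	return nil
}

func main() {
	devices := map[string]string{"aa11bb22": "/dev/sdb"}
	if err := removeDevice(devices, "cc33dd44"); err != nil {
		fmt.Println("Error: Failed to remove device, error:", err)
		// Prints: Error: Failed to remove device, error: Id not found: cc33dd44
	}
}
```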
Volume for which replace-brick failed: vol_9480cf14bde1c564b1d67d8e364fe18f
Original brick: 10.70.47.166:/var/lib/heketi/mounts/vg_2acb109d758154a8d12de3b90e084f0c/brick_32967d19fad73bae52c45bdc3e7574a1/
New brick: 10.70.47.166:/var/lib/heketi/mounts/vg_ea53cfb78199251ba13668b8ff16a350/brick_7cad9f700851411d3ae9b687a7a1f308/

gluster replied with failure on the command:

[kubeexec] ERROR 2018/03/31 03:40:33 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [gluster --mode=script volume replace-brick vol_9480cf14bde1c564b1d67d8e364fe18f 10.70.47.166:/var/lib/heketi/mounts/vg_2acb109d758154a8d12de3b90e084f0c/brick_32967d19fad73bae52c45bdc3e7574a1/brick 10.70.47.166:/var/lib/heketi/mounts/vg_ea53cfb78199251ba13668b8ff16a350/brick_7cad9f700851411d3ae9b687a7a1f308/brick commit force] on glusterfs-storage-zs9kh: Err[command terminated with exit code 1]: Stdout []: Stderr [volume replace-brick: failed: Commit failed on 10.70.47.162. Please check log file for details.]

Original brick list from gluster:
10.70.47.162:/var/lib/heketi/mounts/vg_94138d2abb10d88c1d83ba18c881b389/brick_da6dd5bfff1415b786f8ce88446a0846/brick
10.70.46.45:/var/lib/heketi/mounts/vg_4c39fa5f5ccdd94712b471fc46b92760/brick_8140f263b15c19b5e31cd4f5632a369d/brick
10.70.47.166:/var/lib/heketi/mounts/vg_2acb109d758154a8d12de3b90e084f0c/brick_32967d19fad73bae52c45bdc3e7574a1/brick

New brick list from gluster:
10.70.47.162:/var/lib/heketi/mounts/vg_94138d2abb10d88c1d83ba18c881b389/brick_da6dd5bfff1415b786f8ce88446a0846/brick
10.70.46.45:/var/lib/heketi/mounts/vg_4c39fa5f5ccdd94712b471fc46b92760/brick_8140f263b15c19b5e31cd4f5632a369d/brick
10.70.47.166:/var/lib/heketi/mounts/vg_ea53cfb78199251ba13668b8ff16a350/brick_7cad9f700851411d3ae9b687a7a1f308/brick

ROOT CAUSE: gluster updated its brick list even though it replied with failure on the replace-brick command.

For glusterd debugging: we reproduced the issue on our setup, hence the IDs and volume names are different, but here is the log:

[2018-04-02 11:30:58.108514] W [socket.c:593:__socket_rwv] 0-management: readv on 192.168.10.103:24007 failed (No data available)
[2018-04-02 11:30:58.108550] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <192.168.10.103> (<cd9b911d-c537-4ef4-82a1-1f8fd26a8e7a>), in state <Peer in Cluster>, has disconnected from glusterd.
[2018-04-02 11:30:58.108694] W [glusterd-locks.c:854:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2322a) [0x7fb0f80de22a] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2d198) [0x7fb0f80e8198] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0xe458c) [0x7fb0f819f58c] ) 0-management: Lock owner mismatch. Lock for vol vol_1c15c1637ac1e325f03d56fdd41f75f5 held by 0eafd295-aa2d-4170-a7b8-3ce98fbff0fc
[2018-04-02 11:30:58.108704] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for vol_1c15c1637ac1e325f03d56fdd41f75f5
[2018-04-02 11:30:58.108722] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2322a) [0x7fb0f80de22a] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2d198) [0x7fb0f80e8198] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0xe4765) [0x7fb0f819f765] ) 0-management: Lock for vol vol_a12b9272cc1f4cb70c17b0ccbbc57633 not held
[2018-04-02 11:30:58.108729] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for vol_a12b9272cc1f4cb70c17b0ccbbc57633
[2018-04-02 11:30:58.108743] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2322a) [0x7fb0f80de22a] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2d198) [0x7fb0f80e8198] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0xe4765) [0x7fb0f819f765] ) 0-management: Lock for vol vol_b2b243820c1277c994178a58d465db25 not held
[2018-04-02 11:30:58.108749] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for vol_b2b243820c1277c994178a58d465db25
[2018-04-02 11:30:58.108762] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2322a) [0x7fb0f80de22a] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0x2d198) [0x7fb0f80e8198] -->/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(+0xe4765) [0x7fb0f819f765] ) 0-management: Lock for vol vol_d1a96444079258dae3e28dd4a8b6fd81 not held
[2018-04-02 11:30:58.108768] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for vol_d1a96444079258dae3e28dd4a8b6fd81
[2018-04-02 11:30:58.109188] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fb0fd46cedb] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb0fd231e6e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb0fd231f8e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7fb0fd233710] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fb0fd234200] ))))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(4)) called at 2018-04-02 11:30:58.107966 (xid=0x13)
[2018-04-02 11:30:58.109236] E [MSGID: 106116] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 192.168.10.103. Please check log file for details.
[2018-04-02 11:30:58.109372] I [MSGID: 106144] [glusterd-pmap.c:396:pmap_registry_remove] 0-pmap: removing brick /var/lib/heketi/mounts/vg_e08f107a4127b1b05d577654773be4cc/brick_168912fe5c0deb6c23772ea7acd96550/brick on port 49154
[2018-04-02 11:30:58.109405] I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-04-02 11:31:00.164657] E [MSGID: 106123] [glusterd-mgmt.c:1677:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2018-04-02 11:31:00.164697] E [MSGID: 106123] [glusterd-replace-brick.c:669:glusterd_mgmt_v3_initiate_replace_brick_cmd_phases] 0-management: Commit Op Failed
[2018-04-02 11:32:14.112970] E [socket.c:2369:socket_connect_finish] 0-management: connection to 192.168.10.103:24007 failed (No route to host); disconnecting socket

How to reproduce: execute the replace-brick command and bring down a node in the cluster that is neither the owner of the old brick nor the owner of the new brick.

Impact: High. In this case, because heketi considers the command to be a failure, it cleans up (or tries to clean up) the new bricks that were created. As gluster is already using the new brick partition, heketi will not be able to unmount it; however, it still removes the entry from fstab. This leads to the brick not coming up on the next reboot, and the state of the volume as seen by heketi and by gluster has diverged.
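Given that divergence, one possible mitigation before rolling back (a rough sketch only: the helper below is hypothetical, simply shells out to the real `gluster volume info` command, and is not how heketi's rollback path is actually structured) is to check whether gluster still references the new brick before unmounting it or touching fstab:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// brickStillInVolume is a hypothetical helper (not part of heketi) that asks
// gluster whether a brick is still referenced by a volume. "gluster volume
// info <vol>" prints one "BrickN: host:/path" line per brick.
func brickStillInVolume(volume, brick string) (bool, error) {
	out, err := exec.Command("gluster", "--mode=script", "volume", "info", volume).Output()
	if err != nil {
		return false, fmt.Errorf("volume info %s failed: %w", volume, err)
	}
	for _, line := range strings.Split(string(out), "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "Brick") && strings.Contains(trimmed, brick) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Volume and new-brick path taken from the failure described above.
	inUse, err := brickStillInVolume(
		"vol_9480cf14bde1c564b1d67d8e364fe18f",
		"10.70.47.166:/var/lib/heketi/mounts/vg_ea53cfb78199251ba13668b8ff16a350/brick_7cad9f700851411d3ae9b687a7a1f308/brick",
	)
	if err != nil {
		fmt.Println("could not check brick state:", err)
		return
	}
	if inUse {
		fmt.Println("gluster still references the new brick; skip unmount and keep the fstab entry")
	} else {
		fmt.Println("safe to unmount the new brick and drop its fstab entry")
	}
}
```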
John, I think this problem should be prevented with the changes to how the operations are structured in the current version. Could you confirm? If you agree, it will be fixed with cns-3.10.0.

Comment #10 (John Mulligan): Unfortunately, the pending operation protection in device replace is very weak. Plus there seem to be the above-discussed issues within gluster itself. We can get better here by continuing to restructure how device replace is done, but I don't know when we'll have the time to do that.

(In reply to John Mulligan from comment #10)
> Unfortunately, the pending operation protection in device replace is very weak. Plus there seem to be the above-discussed issues within gluster itself.
> We can get better here by continuing to restructure how device replace is done, but I don't know when we'll have the time to do that.

Thanks, we'll reconsider this for cns-3.11.