Bug 1234725
| Summary: | [New] - Bricks fail to restore when a new node is added to the cluster and rebooted when having management and data on two different interfaces | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> |
| Component: | core | Assignee: | Kaushal <kaushal> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.1 | CC: | amukherj, kaushal, knarra, nsathyan, rcyriac, rhs-bugs, storage-qa-internal, vagarwal |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | RHGS 3.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | GlusterD | | |
| Fixed In Version: | glusterfs-3.7.1-9 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-07-29 05:06:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1234842, 1234843, 1241882 | | |
| Bug Blocks: | 1202842 | | |
Description (RamaKasturi, 2015-06-23 06:49:47 UTC)
This is caused by glusterd not saving the peerinfo data after receiving an update on it. This is a regression introduced by the multi-network peer probe support that came in with the better peer identification changes.

Upstream patch posted for review: http://review.gluster.org/#/c/11365/

Backported change under review: https://code.engineering.redhat.com/gerrit/51547

I tried verifying this bug but I am still able to reproduce the issue. Below are the steps I performed.

Started with a two-node cluster, where each node has two interfaces: one for management and one for data.

gluster peer status from node1:

```
gluster peer status
Number of Peers: 1

Hostname: 10.70.36.57
Uuid: e31034cd-75e0-49aa-8b40-362f7e94dd3d
State: Peer in Cluster (Connected)
Other names:
10.70.33.221
```

gluster peer status from node2:

```
gluster peer status
Number of Peers: 1

Hostname: 10.70.36.48
Uuid: f823b629-4f70-4fdb-84dc-dfd900aea8b7
State: Peer in Cluster (Connected)
Other names:
10.70.33.238
```

Now added a third node, which also has two interfaces (one for management and one for data), to the cluster.

gluster peer status from node1 after adding the third node:

```
gluster peer status
Number of Peers: 2

Hostname: 10.70.36.57
Uuid: e31034cd-75e0-49aa-8b40-362f7e94dd3d
State: Peer in Cluster (Connected)
Other names:
10.70.33.221

Hostname: 10.70.37.87
Uuid: 6d64313d-dc09-4de5-8c81-35a9b01a6082
State: Peer in Cluster (Connected)
Other names:
10.70.37.200
```

gluster peer status from node2 after adding the third node:

```
gluster peer status
Number of Peers: 2

Hostname: 10.70.36.59
Uuid: f823b629-4f70-4fdb-84dc-dfd900aea8b7
State: Peer in Cluster (Connected)
Other names:
10.70.33.238

Hostname: 10.70.37.87
Uuid: 6d64313d-dc09-4de5-8c81-35a9b01a6082
State: Peer in Cluster (Connected)
Other names:
10.70.37.200
```

gluster peer status on the third node:

```
[root@dhcp37-45 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.48
Uuid: c7f020c3-f360-43e3-a0c7-655a75683021
State: Peer in Cluster (Connected)
Other names:
10.70.33.238

Hostname: rhs-client33.lab.eng.blr.redhat.com
Uuid: f4ed464c-ce35-4a4c-865c-b88c1c5852ae
State: Peer in Cluster (Connected)
Other names:
10.70.33.221
```

Now rebooted the third node. Below is the gluster peer status from all three nodes.

Node1:

```
[root@rhs-client24 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.57
Uuid: e31034cd-75e0-49aa-8b40-362f7e94dd3d
State: Peer in Cluster (Connected)
Other names:
10.70.33.221

Hostname: 10.70.37.87
Uuid: 6d64313d-dc09-4de5-8c81-35a9b01a6082
State: Peer in Cluster (Disconnected)
Other names:
```

Node2:

```
[root@rhs-client33 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.59
Uuid: f823b629-4f70-4fdb-84dc-dfd900aea8b7
State: Peer in Cluster (Connected)
Other names:
10.70.33.238

Hostname: 10.70.37.87
Uuid: 6d64313d-dc09-4de5-8c81-35a9b01a6082
State: Peer in Cluster (Disconnected)
Other names:
10.70.37.200
```

Node3:

```
gluster peer status
Connection failed. Please check if gluster daemon is operational.
peer status: failed
```

glusterd fails to start because it could not resolve the bricks.
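To make the failure concrete: at startup glusterd restores volumes from disk and must map every brick's host back to a known peer. If the peer's data-network address was never persisted, that lookup fails and initialization aborts, which is the `resolve brick failed in restore` error in the log below. A minimal standalone sketch of the idea, using local stand-in types rather than glusterd's own:

```c
#include <stdio.h>
#include <string.h>

/* Local stand-in for glusterd's peer record; illustrative only. */
struct peer {
    const char *uuid;
    const char *addresses[4];   /* every name/IP this peer is known by */
    int         naddresses;
};

/* A brick resolves if its host matches some recorded address of a peer.
 * Before the fix, the data-network address was missing after a reboot,
 * so bricks defined against that address could not be resolved. */
static int resolve_brick(const struct peer *p, const char *brick_host)
{
    for (int i = 0; i < p->naddresses; i++)
        if (strcmp(p->addresses[i], brick_host) == 0)
            return 0;
    return -1;   /* -> "resolve brick failed in restore" */
}

int main(void)
{
    /* Peer restored from disk with only its management address. */
    struct peer b = { "f823b629-...", { "10.70.36.57" }, 1 };

    /* The brick was created against the data-network address. */
    if (resolve_brick(&b, "10.70.33.221") != 0)
        printf("resolve brick failed in restore\n");
    return 0;
}
```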
Snippet from /var/log/glusterfs logs:

```
[2015-07-09 09:50:29.338782] I [MSGID: 106487] [glusterd-handler.c:1402:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2015-07-09 09:50:58.383097] E [glusterd-op-sm.c:254:glusterd_get_txn_opinfo] (--> 0-management: Unable to get transaction opinfo for transaction ID : 56ce6ad6-1c64-4019-b251-7f1b0da5038b
[2015-07-09 09:51:05.585444] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-07-09 09:51:47.856326] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.1 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2015-07-09 09:51:47.966616] I [MSGID: 106478] [glusterd.c:1377:init] 0-management: Maximum allowed open file descriptors set to 65536
[2015-07-09 09:51:47.966804] I [MSGID: 106479] [glusterd.c:1426:init] 0-management: Using /var/lib/glusterd as working directory
[2015-07-09 09:51:48.088966] W [MSGID: 103071] [rdma.c:4590:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2015-07-09 09:51:48.089028] W [MSGID: 103055] [rdma.c:4896:init] 0-rdma.management: Failed to initialize IB Device
[2015-07-09 09:51:48.089042] W [rpc-transport.c:358:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2015-07-09 09:51:48.089129] W [rpcsvc.c:1595:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2015-07-09 09:51:48.089155] E [MSGID: 106243] [glusterd.c:1656:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2015-07-09 09:51:53.439596] I [MSGID: 106513] [glusterd-store.c:2036:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30702
[2015-07-09 09:51:53.440563] I [MSGID: 106544] [glusterd.c:157:glusterd_uuid_init] 0-management: retrieved UUID: 6d64313d-dc09-4de5-8c81-35a9b01a6082
[2015-07-09 09:51:53.440791] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2015-07-09 09:51:53.442081] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2015-07-09 09:51:53.445742] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2015-07-09 09:51:53.446831] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2015-07-09 09:51:53.447506] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2015-07-09 09:51:53.457560] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2015-07-09 09:51:53.581465] I [MSGID: 106498] [glusterd-handler.c:3556:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2015-07-09 09:51:53.584863] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2015-07-09 09:51:53.597092] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2015-07-09 09:51:53.584729] I [MSGID: 106498] [glusterd-handler.c:3556:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2015-07-09 09:51:53.604496] E [MSGID: 106187] [glusterd-store.c:4253:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2015-07-09 09:51:53.604585] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2015-07-09 09:51:53.604612] E [MSGID: 101066] [graph.c:326:glusterfs_graph_init] 0-management: initializing translator failed
[2015-07-09 09:51:53.604641] E [MSGID: 101176] [graph.c:672:glusterfs_graph_activate] 0-graph: init failed
[2015-07-09 09:51:53.607544] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (0), shutting down
```

Moving this back to assigned as I am still facing the issue.

This is caused by a race which hadn't been encountered before. The analysis is as follows.

Assume A, B, and C are the peers. A and B form a cluster and have probed each other on the two networks. C is probed from A. During the probe, C is first validated by A. Once C is accepted, A sends an update to both B and C to inform them of each other. The update C gets from A doesn't carry B's second-network information; C can only get this information when B sends an update to C.

The problem faced here was that B didn't send an update to C. Whether B sends an update to C depends on the ordering of connection establishment between B and C: both try to establish connections to each other once they receive A's update and learn of each other. If B establishes the connection first, it sends an update to C. But if C establishes the connection first, B will not send an update to C. This is the first time this situation was observed; it doesn't happen always.

Downstream change posted for review at https://code.engineering.redhat.com/gerrit/52811. The correct change is https://code.engineering.redhat.com/gerrit/52812.
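A standalone sketch of the ordering dependency described above (event names and the handler are illustrative, not the actual glusterd code): B pushed its addresses to C only from the handler for its own outgoing connection, so whichever side connected first decided whether C ever learned B's second address. One way to remove the dependency, shown here for illustration, is to send the update on both paths:

```c
#include <stdio.h>

/* Local stand-in; illustrative only. */
struct peer { const char *name; };

/* Represents B pushing its full address list to a peer. */
static void send_friend_update(const struct peer *to)
{
    printf("sending friend update to %s\n", to->name);
}

enum conn_event {
    OUTGOING_CONNECTED,   /* our own connect() to the peer succeeded */
    INCOMING_CONNECTED    /* the peer connected to us first */
};

static void on_peer_event(const struct peer *remote, enum conn_event ev)
{
    switch (ev) {
    case OUTGOING_CONNECTED:
        /* Pre-fix behaviour: the update was sent only on this path. */
        send_friend_update(remote);
        break;
    case INCOMING_CONNECTED:
        /* Pre-fix behaviour: nothing was sent here, so if C's connect
         * won the race, C never learned B's second-network address.
         * Sending the update here as well removes the dependency on
         * connection ordering. */
        send_friend_update(remote);
        break;
    }
}

int main(void)
{
    struct peer c = { "C" };
    /* Either ordering now results in C receiving B's update. */
    on_peer_event(&c, INCOMING_CONNECTED);
    return 0;
}
```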
Verified with build glusterfs-3.7.1-9.el6rhs.x86_64, and it works fine. Followed the steps below to verify:

1) Added two nodes to the cluster, with separate management and data traffic.
2) Created a volume with the two nodes added to the cluster.
3) Added a third node to the cluster and rebooted that node.

The third node comes back up successfully, and no host in the cluster is in the disconnected state.

Peer status from node1:

```
[root@rhs-client24 ~]# gluster peer status
Number of Peers: 2

Hostname: rhs-client33.lab.eng.blr.redhat.com
Uuid: e31034cd-75e0-49aa-8b40-362f7e94dd3d
State: Peer in Cluster (Connected)
Other names:
10.70.33.221

Hostname: 10.70.37.87
Uuid: 6d64313d-dc09-4de5-8c81-35a9b01a6082
State: Peer in Cluster (Connected)
Other names:
10.70.37.200
```

Peer status from node2:

```
[root@rhs-client33 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.48
Uuid: f823b629-4f70-4fdb-84dc-dfd900aea8b7
State: Peer in Cluster (Connected)
Other names:
10.70.33.238

Hostname: 10.70.37.87
Uuid: 6d64313d-dc09-4de5-8c81-35a9b01a6082
State: Peer in Cluster (Connected)
Other names:
10.70.37.200
```

Peer status from node3:

```
[root@dhcp37-87 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.48
Uuid: f823b629-4f70-4fdb-84dc-dfd900aea8b7
State: Peer in Cluster (Connected)
Other names:
10.70.33.238

Hostname: rhs-client33.lab.eng.blr.redhat.com
Uuid: e31034cd-75e0-49aa-8b40-362f7e94dd3d
State: Peer in Cluster (Connected)
Other names:
10.70.33.221
```

The above tested scenario does not work if there are any snapshots in the cluster. A bug has been logged for that: https://bugzilla.redhat.com/show_bug.cgi?id=1238111
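The "Other names" entries surviving the reboot are the observable effect of the root-cause fix noted at the top of this report: glusterd now saves the peerinfo, including every known address, when it receives an update, rather than keeping new addresses only in memory. A minimal standalone sketch of that persistence step; the hostnameN keys mirror the convention of the files under /var/lib/glusterd/peers/, but the types and functions here are illustrative:

```c
#include <stdio.h>

/* Local stand-in for glusterd's peer record; illustrative only. */
struct peer {
    const char *uuid;
    const char *addresses[4];
    int         naddresses;
};

/* Persist every address the peer is known by, so both the management
 * and data network names survive a glusterd restart. The path mirrors
 * glusterd's on-disk layout. */
static int store_peerinfo(const struct peer *p)
{
    char path[256];
    snprintf(path, sizeof(path), "/var/lib/glusterd/peers/%s", p->uuid);

    FILE *fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "uuid=%s\n", p->uuid);
    for (int i = 0; i < p->naddresses; i++)
        fprintf(fp, "hostname%d=%s\n", i + 1, p->addresses[i]);
    fclose(fp);
    return 0;
}

/* The regression: the update was applied in memory but the store step
 * was never performed, so the second address was lost on reboot. */
static int handle_friend_update(struct peer *p, const char *new_address)
{
    if (p->naddresses >= 4)
        return -1;
    p->addresses[p->naddresses++] = new_address;
    return store_peerinfo(p);   /* the previously missing persistence step */
}

int main(void)
{
    struct peer b = { "6d64313d-...", { "10.70.37.87" }, 1 };
    return handle_friend_update(&b, "10.70.37.200");
}
```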
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html