Bug 1108018

Summary: peers go into `Peer Rejected' state while doing rolling upgrade
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Sachidananda Urs <surs>
Component: core
Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA
QA Contact: Sachidananda Urs <surs>
Severity: urgent
Priority: high
Version: rhgs-3.0
CC: amukherj, kaushal, kparthas, lmohanty, nsathyan, psriniva, rcyriac, rhs-bugs, ssamanta, storage-qa-internal
Target Release: RHGS 3.0.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.6.0.19-1
Doc Type: Bug Fix
Doc Text:
Previously, the glusterFS management service (glusterd) was not backward compatible with Red Hat Storage 2.1. As a result, peers entered the 'Peer Rejected' state during a rolling upgrade from Red Hat Storage 2.1. With this fix, the glusterFS management service is backward compatible and peers no longer enter the 'Peer Rejected' state.
Last Closed: 2014-09-22 19:41:02 UTC
Type: Bug
Bug Blocks: 1067342    

Description Sachidananda Urs 2014-06-11 10:07:48 UTC
Description of problem:

During a rolling upgrade, when one of the machines is on 3.0 and the others are on 2.1U2, the peers go into the 'Peer Rejected' state, causing volume restart to fail.

Version-Release number of selected component (if applicable):

glusterfs 2.1U2
glusterfs 3.6.0.15

How reproducible:

Always

Steps to Reproduce:

To perform the rolling upgrade (a rough command sketch follows the steps below):

1. Create 2x2 cluster with glusterfs 2.1U2
2. Backup /etc/glusterd on one of the machines.
3. Upgrade one of the machines to 3.0 and restore the glusterd configuration.
4. Peer detach the upgraded machine and probe it back in.
5. The peer will be in the rejected state, blocking rolling upgrades.
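
A rough command sketch of these steps (the backup path is the one mentioned in step 2; 10.70.37.47 is the upgraded node from the transcript below; the exact package upgrade commands depend on the installed channel and are omitted):

# Step 2: back up the glusterd state on the node being upgraded
cp -a /etc/glusterd /root/glusterd-backup
# Step 3: after upgrading the node's glusterfs packages to 3.0, restore the
# saved state and restart glusterd
cp -a /root/glusterd-backup/. /etc/glusterd/
service glusterd restart
# Step 4: from a node already in the cluster, detach the upgraded node and
# probe it back in
gluster peer detach 10.70.37.47 force
gluster peer probe 10.70.37.47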


Additional info:

Please note the steps below.

After restoring glusterd:

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer in Cluster (Connected)

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

But peer 10.70.37.47 is not seen in the volume status output:

[root@sulley ~]# gluster vol status
Status of volume: upgrade
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.40:/rhs/brick1/r0                        49152   Y       3511
Brick 10.70.37.184:/rhs/brick1/r1                       49152   Y       3513
Brick 10.70.37.115:/rhs/brick1/r1                       49152   Y       3337
NFS Server on localhost                                 2049    Y       3525
Self-heal Daemon on localhost                           N/A     Y       3531
NFS Server on 10.70.37.184                              2049    Y       3526
Self-heal Daemon on 10.70.37.184                        N/A     Y       3531
NFS Server on 10.70.37.115                              2049    Y       3352
Self-heal Daemon on 10.70.37.115                        N/A     Y       3357

Task Status of Volume upgrade
------------------------------------------------------------------------------
There are no active volume tasks

The peer then goes into the rejected state:

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer Rejected (Connected)

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)
[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Accepted peer request (Connected)

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Detach the peer and probe it back in:

[root@sulley ~]# gluster peer detach 10.70.37.47
peer detach: failed: Brick(s) with the peer 10.70.37.47 exist in cluster
[root@sulley ~]# gluster peer detach 10.70.37.47 force
peer detach: success
[root@sulley ~]# gluster peer probe 10.70.37.47
peer probe: success.
[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer in Cluster (Connected)
[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer in Cluster (Connected)
[root@sulley ~]# gluster volume start upgrade force
volume start: upgrade: failed: Commit failed on 10.70.37.47. Please check log file for details.

Trying to start the volume fails, and the peer goes back into the rejected state:

[root@sulley ~]# gluster volume start upgrade force
volume start: upgrade: failed: Commit failed on 10.70.37.47. Please check log file for details.
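
The commit failure points to the glusterd log on the failing peer; on these builds that log is normally /var/log/glusterfs/etc-glusterfs-glusterd.vol.log (path assumed from the default install layout), e.g.:

# On 10.70.37.47, inspect the glusterd log referenced by the commit failure
tail -n 50 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
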
[root@sulley ~]# getfattr -d -e hex -m. /rhs//brick1/r0/
getfattr: Removing leading '/' from absolute path names
# file: rhs//brick1/r0/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0xf3cf113a4c3649b8a5255a273cc65a35

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer Rejected (Connected)
[root@sulley ~]#


Logs from the machine are attached.

Comment 1 Sachidananda Urs 2014-06-11 10:11:47 UTC
Please find sosreports at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1108018/

Comment 4 Kaushal 2014-06-11 11:45:23 UTC
RHS-2.1 doesn't store a volume's calculated operating versions (volinfo.op-version and volinfo.client-op-version), whereas RHS-3.0 does. This can lead to peers entering the rejected state during rolling upgrades.

I'll be sending a fix for RHS-3.0 for this issue right away. Please mark this bug as a blocker.
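
For reference, one way to see the difference described above is to check whether a node stores the volume's op-versions on disk (assuming the default glusterd working directory /var/lib/glusterd and the volume name 'upgrade' from this report; adjust the path if your layout differs):

# RHS 3.0 stores both keys in the volume info file; RHS 2.1 does not,
# which matches the mismatch described in this comment.
grep -E '^(op-version|client-op-version)=' /var/lib/glusterd/vols/upgrade/info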

Comment 6 ssamanta 2014-06-13 06:27:29 UTC
After discussing with the Engineering Leads in the RHS 3.0 status meeting, providing the qa_ack.

Comment 7 Kaushal 2014-06-17 08:50:50 UTC
Sending an additional patch with a small fix to the earlier patch. The new patch is under review at https://code.engineering.redhat.com/gerrit/27112

Comment 8 Kaushal 2014-06-17 08:59:12 UTC
Patch has been merged. It should be available in the next build.

Comment 9 Lalatendu Mohanty 2014-06-18 09:23:45 UTC
BVT tests related to peer probe failed on build glusterfs-server-3.6.0.18-1.el6rhs.x86_64. The peer probe on a new host stayed in "Probe Sent to Peer (Connected)" and did not go to "Peer in Cluster (Connected)".

[root@rhsauto056 ~]# gluster peer status
Number of Peers: 2

Hostname: rhsauto057.lab.eng.blr.redhat.com
Uuid: 168b3f1c-2720-4cce-a1cd-d929d32aa032
State: Peer in Cluster (Connected)

Hostname: rhsauto022.lab.eng.blr.redhat.com
Uuid: 68bf9e48-b239-45d0-ae8f-56cc9abb6c4a
State: Probe Sent to Peer (Connected)

The glusterd log on the peer node reported an error about "op-version", as copied below.
After talking to Kaushal, we concluded that the patch in comment #7 fixes the issue.

[2014-06-17 20:55:13.311898] I [glusterd-handshake.c:1014:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-06-17 20:55:13.311979] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-06-17 20:55:13.312051] I [glusterd.c:176:glusterd_uuid_generate_save] 0-management: generated UUID: 68bf9e48-b239-45d0-ae8f-56cc9abb6c4a
[2014-06-17 20:55:13.318232] I [glusterd-handler.c:2603:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: dfa8e32d-e270-478e-bfa0-30de617a01db
[2014-06-17 20:55:13.320686] I [glusterd-handler.c:2631:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 10.70.40.131 (24007)
[2014-06-17 20:55:13.323241] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-06-17 20:55:13.326728] I [glusterd-handler.c:3172:glusterd_friend_add] 0-management: connect returned 0
[2014-06-17 20:55:13.326937] I [glusterd-handler.c:2655:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.40.131, op_ret: 0, op_errno: 0, ret: 0
[2014-06-17 20:55:13.329205] I [glusterd-handler.c:2307:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: dfa8e32d-e270-478e-bfa0-30de617a01db
[2014-06-17 20:55:13.346039] E [glusterd-utils.c:4159:gd_import_volume_op_versions] 0-management: volume1.op-version missing in payload for hosdu
[2014-06-17 20:55:13.346152] E [glusterd-utils.c:4449:glusterd_import_volinfo] 0-glusterd: Failed to import op-versions for volume hosdu
[2014-06-17 20:55:13.346202] E [glusterd-sm.c:1084:glusterd_friend_sm] 0-glusterd: handler returned: -2

Comment 10 Sachidananda Urs 2014-06-24 09:59:07 UTC
Verified on: glusterfs 3.6.0.22
Looks good.

Comment 11 Pavithra 2014-08-01 06:13:26 UTC
Hi Kaushal,

Please review the edited doc text and sign off on the technical accuracy.

Comment 13 errata-xmlrpc 2014-09-22 19:41:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html