Description of problem:
During a rolling upgrade, when one of the machines is on 3.0 and the others are on 2.1U2, the upgraded peer goes into the 'Peer Rejected' state, causing volume restart to fail.

Version-Release number of selected component (if applicable):
glusterfs 2.1U2
glusterfs 3.6.0.15

How reproducible:
Always

Steps to Reproduce:
To perform a rolling upgrade:
1. Create a 2x2 cluster with glusterfs 2.1U2.
2. Back up /var/lib/glusterd on one of the machines.
3. Upgrade that machine to 3.0 and restore /var/lib/glusterd.
4. Peer detach the upgraded machine and probe it back in.
5. The peer will be in the rejected state, blocking rolling upgrades.

Additional info:
Please note the steps below.

Restore glusterd:

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer in Cluster (Connected)

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

But the brick on 10.70.37.47 is not seen in the volume status:

[root@sulley ~]# gluster vol status
Status of volume: upgrade
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.40:/rhs/brick1/r0                        49152   Y       3511
Brick 10.70.37.184:/rhs/brick1/r1                       49152   Y       3513
Brick 10.70.37.115:/rhs/brick1/r1                       49152   Y       3337
NFS Server on localhost                                 2049    Y       3525
Self-heal Daemon on localhost                           N/A     Y       3531
NFS Server on 10.70.37.184                              2049    Y       3526
Self-heal Daemon on 10.70.37.184                        N/A     Y       3531
NFS Server on 10.70.37.115                              2049    Y       3352
Self-heal Daemon on 10.70.37.115                        N/A     Y       3357

Task Status of Volume upgrade
------------------------------------------------------------------------------
There are no active volume tasks

The peer goes to the rejected state:

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer Rejected (Connected)

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Accepted peer request (Connected)

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Detach and attach the peer:

[root@sulley ~]# gluster peer detach 10.70.37.47
peer detach: failed: Brick(s) with the peer 10.70.37.47 exist in cluster
[root@sulley ~]# gluster peer detach 10.70.37.47 force
peer detach: success
[root@sulley ~]# gluster peer probe 10.70.37.47
peer probe: success.
[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer in Cluster (Connected)

Trying to start the volume fails, and the peer goes back to the rejected state:

[root@sulley ~]# gluster volume start upgrade force
volume start: upgrade: failed: Commit failed on 10.70.37.47. Please check log file for details.

[root@sulley ~]# getfattr -d -e hex -m. /rhs//brick1/r0/
getfattr: Removing leading '/' from absolute path names
# file: rhs//brick1/r0/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0xf3cf113a4c3649b8a5255a273cc65a35

[root@sulley ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.184
Uuid: 2f1ac517-f97c-4abc-ac83-dd70e73a5f0a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.115
Uuid: 5315ccd4-eafb-49ed-97be-b0858803682c
State: Peer in Cluster (Connected)

Hostname: 10.70.37.47
Uuid: 649605b1-d001-471a-a7d9-3b54ba621546
State: Peer Rejected (Connected)

Logs attached from the machine.
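For diagnosis: a peer is rejected when its copy of the volume configuration disagrees with the rest of the cluster. A minimal sketch for confirming such a mismatch, assuming glusterd's default working directory and the volume name "upgrade" from this report (10.70.37.184 stands in for any healthy peer):

# On the rejected node, compare the volume's stored configuration
# checksum against a healthy peer; a difference drives the rejection.
cat /var/lib/glusterd/vols/upgrade/cksum
ssh 10.70.37.184 cat /var/lib/glusterd/vols/upgrade/cksum

# Diff the volume info files to see exactly which keys differ.
ssh 10.70.37.184 cat /var/lib/glusterd/vols/upgrade/info > /tmp/info.peer
diff /var/lib/glusterd/vols/upgrade/info /tmp/info.peer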
Please find sosreports at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1108018/
RHS-2.1 doesn't store a volume's calculated operating versions (volinfo.op-version and volinfo.client-op-version), whereas RHS-3.0 does. This can lead to peers entering the rejected state during rolling upgrades. I'll be sending a fix for RHS-3.0 for this issue right away. Please mark this bug as a blocker.
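To make the mismatch concrete, a quick check, assuming default paths and that the operating versions are persisted as keys in the volume's info file ("upgrade" is the volume from this report):

# On an RHS-3.0 node these keys are present in the volume info file;
# on an RHS-2.1 node the grep returns nothing, so the two sides
# compute different configuration checksums.
grep -E '^(op-version|client-op-version)=' /var/lib/glusterd/vols/upgrade/info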
After discussing with the Engineering Leads in the RHS 3.0 status meeting, providing the qa_ack.
Sending an additional patch with a small fix to the earlier patch. The new patch is under review at https://code.engineering.redhat.com/gerrit/27112
Patch has been merged. It should be available in the next build.
BVT tests related to peer probe failed on build glusterfs-server-3.6.0.18-1.el6rhs.x86_64. The peer probe on a new host stays in "Probe Sent to Peer (Connected)" and does not move to "Peer in Cluster (Connected)":

[root@rhsauto056 ~]# gluster peer status
Number of Peers: 2

Hostname: rhsauto057.lab.eng.blr.redhat.com
Uuid: 168b3f1c-2720-4cce-a1cd-d929d32aa032
State: Peer in Cluster (Connected)

Hostname: rhsauto022.lab.eng.blr.redhat.com
Uuid: 68bf9e48-b239-45d0-ae8f-56cc9abb6c4a
State: Probe Sent to Peer (Connected)

The glusterd log on the peer node reported errors about "op-version", as copied below. After talking to Kaushal, we concluded that the patch in comment 7 fixes this issue.

[2014-06-17 20:55:13.311898] I [glusterd-handshake.c:1014:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-06-17 20:55:13.311979] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-06-17 20:55:13.312051] I [glusterd.c:176:glusterd_uuid_generate_save] 0-management: generated UUID: 68bf9e48-b239-45d0-ae8f-56cc9abb6c4a
[2014-06-17 20:55:13.318232] I [glusterd-handler.c:2603:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: dfa8e32d-e270-478e-bfa0-30de617a01db
[2014-06-17 20:55:13.320686] I [glusterd-handler.c:2631:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 10.70.40.131 (24007)
[2014-06-17 20:55:13.323241] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-06-17 20:55:13.326728] I [glusterd-handler.c:3172:glusterd_friend_add] 0-management: connect returned 0
[2014-06-17 20:55:13.326937] I [glusterd-handler.c:2655:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.40.131, op_ret: 0, op_errno: 0, ret: 0
[2014-06-17 20:55:13.329205] I [glusterd-handler.c:2307:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: dfa8e32d-e270-478e-bfa0-30de617a01db
[2014-06-17 20:55:13.346039] E [glusterd-utils.c:4159:gd_import_volume_op_versions] 0-management: volume1.op-version missing in payload for hosdu
[2014-06-17 20:55:13.346152] E [glusterd-utils.c:4449:glusterd_import_volinfo] 0-glusterd: Failed to import op-versions for volume hosdu
[2014-06-17 20:55:13.346202] E [glusterd-sm.c:1084:glusterd_friend_sm] 0-glusterd: handler returned: -2
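To check whether a stuck probe is this same failure mode, the tell-tale errors quoted above can be grepped from the peer's glusterd log; the path below is the default for this release and may need adjusting:

# Look for the op-version import failures on the node being probed.
grep -E 'op-version missing in payload|Failed to import op-versions' \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log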
Verified on: glusterfs 3.6.0.22
Looks good.
Hi Kaushal,

Please review the edited doc text and sign off on its technical accuracy.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html