Bug 1687051
Description
Amgad
2019-03-09 03:33:13 UTC
Tried an upgrade from 3.12.15 to 5.3-2, and "gluster volume heal" failed during the online upgrade (one server on 5.3-2). Is that fixed in 5.4? This will block online upgrade to 5.4 - and it impacts availability if we have to do an offline upgrade.

Any update -- this will impact online upgrade to 5.4.

Considering (a) this happens during a rollback, which isn't something the community has tested and supported, and (b) there are other critical fixes waiting for users in 5.4, which is overdue, we shouldn't be blocking the glusterfs-5.4 release. My proposal is to not mark this bug as a blocker for 5.4. Shyam - what do you think?

So how do you do an online upgrade - keep in mind that an upgrade is not complete without a rollback path in any deployment. If online upgrade/backout is not supported, reliability drops big time, especially as the cluster is used by all applications in our case! Besides, online upgrade doesn't work between 3.12 and 5.3 - is it working from 3.12 to 5.4?

Can you please provide the following information?
- gluster volume info
- gluster volume status
- logs from all the nodes (path: /var/log/glusterfs/)

Case 1) online upgrade from 3.12.15 to 5.3
==========================================
A) I have a cluster of 3 replicas: gfs-1, gfs-2, gfs-3new, running 3.12.15. When gfs-1 was online upgraded from 3.12.15, here are the outputs (notice that the bricks on gfs-1 are offline, even though both glusterd and glusterfsd are active and running):

[root@gfs-1 ~]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

---

[root@gfs-1 ~]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            N/A       N/A        N       N/A
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       24733
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       7790
Self-heal Daemon on localhost               N/A       N/A        Y       14928
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            N/A       N/A        N       N/A
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       24742
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       7800
Self-heal Daemon on localhost               N/A       N/A        Y       14928
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            N/A       N/A        N       N/A
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       24751
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       7809
Self-heal Daemon on localhost               N/A       N/A        Y       14928
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ~]#

====== Running "gluster volume heal" ==> unsuccessful

[root@gfs-1 ~]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
[root@gfs-1 ~]#

B) After reverting gfs-1 back to 3.12.15, the bricks are online and heal is successful:

[root@gfs-1 log]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@gfs-1 log]#

[root@gfs-1 log]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       16029
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       24733
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       7790
Self-heal Daemon on localhost               N/A       N/A        Y       16019
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       16038
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       24742
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       7800
Self-heal Daemon on localhost               N/A       N/A        Y       16019
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       16047
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       24751
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       7809
Self-heal Daemon on localhost               N/A       N/A        Y       16019
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks
[root@gfs-1 log]#

[root@gfs-1 log]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-1 log]#

Uploading /var/log/glusterfs:
- when gfs-1 was upgraded to 5.3: gfs-1-logs.tgz, gfs-2-logs.tgz, and gfs-3new-logs.tgz
- when reverted back to 3.12.15: gfs-1-logs-3.12.15.tgz, gfs-2-logs-3.12.15.tgz, and gfs-3new-logs-3.12.15.tgz

The next comment will have the 2nd case, upgrade from 3.12.15 to 4.1.4 and rollback.

Created attachment 1543212 [details]
gfs-1 logs when gfs-1 online upgraded from 3.12.15 to 5.3
Created attachment 1543214 [details]
gfs-2 logs when gfs-1 online upgraded from 3.12.15 to 5.3
Created attachment 1543215 [details]
gfs-3new logs when gfs-1 online upgraded from 3.12.15 to 5.3
Created attachment 1543216 [details]
gfs-1 logs when gfs-1 reverted back to 3.12.15
Created attachment 1543217 [details]
gfs-2 logs when gfs-1 reverted back to 3.12.15
Created attachment 1543219 [details]
gfs-3new logs when gfs-1 reverted back to 3.12.15
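For anyone reproducing the failure above, the heal loop from Case 1 can be guarded so that it only fires when every brick is online, which makes the brick-down case fail fast with a clearer message. This is a minimal sketch assuming the stock "gluster volume status" output format shown above; it is illustrative and not part of the reported procedure:

#!/bin/bash
# Run heal per volume, but skip volumes that have offline bricks,
# since heal is expected to fail while any brick is down.
for vol in $(gluster volume list); do
    # In the status table, the Online column is the second-to-last field.
    offline=$(gluster volume status "$vol" | awk '/^Brick/ && $(NF-1) == "N"' | wc -l)
    if [ "$offline" -gt 0 ]; then
        echo "$vol: $offline brick(s) offline - skipping heal" >&2
        continue
    fi
    gluster volume heal "$vol"
done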
Case 2) online upgrade from 3.12.15 to 4.1.4 and rollback:
==========================================================
A) I have a cluster of 3 replicas: gfs-1 (10.76.153.206), gfs-2 (10.76.153.213), and gfs-3new (10.76.153.207), running 3.12.15. When gfs-1 was online upgraded from 3.12.15 to 4.1.4, heal succeeded. Continuing with gfs-2, then gfs-3new, online upgrade and heal succeeded as well.

1) Here are the outputs after gfs-1 was online upgraded from 3.12.15 to 4.1.4:
Logs uploaded are: gfs-1-logs-gfs-1-UpgFrom3.12.15-to-4.1.4.tgz, gfs-2-logs-gfs-1-UpgFrom3.12.15-to-4.1.4.tgz, and gfs-3new-logs-gfs-1-UpgFrom3.12.15-to-4.1.4.tgz - see the latest upgrade case.

[root@gfs-1 ansible1]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[root@gfs-1 ansible1]#

[root@gfs-1 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49155     0          Y       30270
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       12726
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       26671
Self-heal Daemon on localhost               N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       12716
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       26661

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49156     0          Y       30279
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       12735
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       26680
Self-heal Daemon on localhost               N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       12716
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       26661

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49157     0          Y       30288
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       12744
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       26689
Self-heal Daemon on localhost               N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       12716
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       26661

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible1]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-1 ansible1]#

=======================

2) Here are the outputs after all nodes were online upgraded from 3.12.15 to 4.1.4:
Logs uploaded: see the logs for B), which include this case as well.

[root@gfs-3new ansible1]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[root@gfs-3new ansible1]#

[root@gfs-3new ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49155     0          Y       30270
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       13874
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       28144
Self-heal Daemon on localhost               N/A       N/A        Y       28134
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       30260

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49156     0          Y       30279
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       13883
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       28153
Self-heal Daemon on localhost               N/A       N/A        Y       28134
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49157     0          Y       30288
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       13892
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       28162
Self-heal Daemon on localhost               N/A       N/A        Y       28134
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks
[root@gfs-3new ansible1]#

[root@gfs-3new ansible1]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-3new ansible1]#

=======

B) Here are the outputs after gfs-1 was online rolled back from 4.1.4 to 3.12.15 - the rollback succeeded, but "gluster volume heal" was unsuccessful:
Logs uploaded are: gfs-1-logs-gfs-1-RollbackFrom4.1.4-to-3.12.15.tgz, gfs-2-logs-gfs-1-RollbackFrom4.1.4-to-3.12.15.tgz, and gfs-3new-logs-gfs-1-RollbackFrom4.1.4-to-3.12.15.tgz - these include case 2) as well, right before.

[root@gfs-1 ansible1]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@gfs-1 ansible1]#

[root@gfs-1 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       32078
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       13874
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       28144
Self-heal Daemon on localhost               N/A       N/A        Y       32068
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       28134

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       32087
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       13883
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       28153
Self-heal Daemon on localhost               N/A       N/A        Y       32068
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       28134

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       32096
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       13892
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       28162
Self-heal Daemon on localhost               N/A       N/A        Y       32068
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       28134

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible1]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.213. Please check log file for details.
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
[root@gfs-1 ansible1]#

Created attachment 1543260 [details]
gfs-1 logs when gfs-1 online upgraded from 3.12.15 to 4.1
Created attachment 1543261 [details]
gfs-2 logs when gfs-1 online upgraded from 3.12.15 to 4.1
Created attachment 1543262 [details]
gfs-3new logs when gfs-1 online upgraded from 3.12.15 to 4.1
Created attachment 1543263 [details]
gfs-1 logs when gfs-1 online rolled back from 4.1.4 to 3.12.15
Created attachment 1543264 [details]
gfs-2 logs when gfs-1 online rolled back from 4.1.4 to 3.12.15
Created attachment 1543268 [details]
gfs-3new logs when gfs-1 online rolled back from 4.1.4 to 3.12.15
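The "Commit failed on <peer>" errors above occur in the commit phase of the heal management operation, which is consistent with a management-plane mismatch while the cluster runs mixed versions. A quick way to capture the relevant state on each node before attaching logs - a sketch under the assumption that op-version or peer state is involved, which this report does not yet confirm:

#!/bin/bash
# Snapshot the management-plane state that matters for mixed-version heal.
gluster --version | head -1                # installed release on this node
gluster volume get all cluster.op-version  # operating version of the cluster
gluster peer status                        # peer connectivity as seen from here
grep -H state= /var/lib/glusterd/peers/* 2>/dev/null  # raw peer states on disk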
Any update?

Hi,
Sorry for the delay. In the first case, of conversion from 3.12.15 to 5.3, the bricks on the upgraded nodes failed to come up. The heal command will fail if any of the bricks are not available or down. In the second case, of conversion from 4.1.4 to 3.12.15, even though we have all the bricks and shd up and running, I can see some errors in the glusterd logs during the commit phase of the heal command. We need to check from the glusterd side why this is happening. Sanju, are you aware of any such cases? Can you debug this further to see why the brick is failing to come up and why the heal commit fails?
Regards, Karthik

Thanks Karthik. I can see that the first case is because of the "failed to dispatch handler" issue (Bug 1671556), which should be addressed in 5.4. The second case is definitely an issue when rolling back from a newer release to an older one. Is there a "heal" incompatibility between 3.12 and later releases? Because this will impact 5.4 as well. Appreciate your support!

Any update, feedback, or investigation going on? Any idea about the root cause/fix? Will it be in 5.4?

I did more testing and realized that "gluster volume status" doesn't provide the right status when the 1st server, "gfs-1", is rolled back to 3.12.15 after the full upgrade (the other two replicas still on 4.1.4). When gfs-1 was rolled back, I got:

[root@gfs-1 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            N/A       N/A        N       N/A

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            N/A       N/A        N       N/A

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            N/A       N/A        N       N/A

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

Then when I rolled back gfs-2, I got:
====================================
[root@gfs-2 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       23400
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       14481
Self-heal Daemon on localhost               N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       23409
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       14490
Self-heal Daemon on localhost               N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       23418
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       14499
Self-heal Daemon on localhost               N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

Then when I rolled back the third replica, I got the full status:
=================================================================
[root@gfs-3new ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       23400
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       14481
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       13184
Self-heal Daemon on localhost               N/A       N/A        Y       13174
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       23409
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       14490
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       13193
Self-heal Daemon on localhost               N/A       N/A        Y       13174
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       14472

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       23418
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       14499
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       13202
Self-heal Daemon on localhost               N/A       N/A        Y       13174
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       14472

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

Amgad,
Thanks for sharing your test results. I will provide an update on this by the end of this week.

Thanks Sanju. Per the release notes at https://gluster.readthedocs.io/en/latest/release-notes/5.5/, it seems like there won't be a 5.4 because of a rolling upgrade issue. I assume this is what is being addressed here. Let me know if I can help to accelerate the fix.
Amgad

Is the issue addressed by the following fixes in R5.5?
#1684385: [ovirt-gluster] Rolling gluster upgrade from 3.12.5 to 5.3 led to shard on-disk xattrs disappearing
#1684569: Upgrade from 4.1 and 5 is broken
Regards, Amgad

Amgad,
Yes, there won't be a 5.4, as we hit the upgrade blocker https://bugzilla.redhat.com/show_bug.cgi?id=1684029. The issue you are facing is not the same as https://bugzilla.redhat.com/show_bug.cgi?id=1684029 or https://bugzilla.redhat.com/show_bug.cgi?id=1684569. And I don't think you are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1684385, as that issue is seen while upgrading from 3.12 to 5. I suspect your issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1676812. Please let me know whether it is the same or not.
Thanks, Sanju
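Side note on the truncated status output above: "gluster volume status" only lists bricks of nodes the local glusterd currently considers connected, so a table that shrinks to the local brick usually means the peers are in a disconnected or rejected state rather than the bricks being gone. A quick cross-node check - an illustrative sketch using this report's hostnames, not a command sequence taken from it:

for host in gfs-1 gfs-2 gfs-3new; do
    echo "== $host =="
    # "gluster pool list" prints one line per peer with its Connected state.
    ssh "$host" gluster pool list
done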
Thanks Sanju:
I'm trying to locally build 5.5 RPMs now to test with. BTW, do you know when the CentOS 5.5 RPMs (mainly OS release 7) will be available?
Regards, Amgad

Amgad, I'm not sure, but you can always write to the users/devel mailing lists so that the appropriate people can respond.
Thanks, Sanju

(In reply to Amgad from comment #30)
> I'm trying to locally build 5.5 RPMs now to test with. BTW, do you know when
> the Centos 5.5 RPMs will be available?

@Shyam, can you please answer this?

(In reply to Sanju from comment #33)
> @Shyam, can you please answer this?

5.5 CentOS storage SIG packages have landed on the test repository as of a day or 2 back, and I am smoke testing them now. Test packages can be found and installed like so:

# yum install centos-release-gluster
# yum install --enablerepo=centos-gluster5-test glusterfs-server

If my "smoke" testing does not break anything, then packages will be forthcoming later this week or by Monday next week.

Thanks Sanju and Shyam.
I went ahead and built the 5.5 RPMs and re-did the online upgrade/rollback tests from 3.12.15 to 5.5, and back. I got the same issue with the online rollback. Here is the data (logs are attached as well):

Case 1) online upgrade from 3.12.15 to 5.5 - upgrade started right after: Thu Mar 21 14:01:06 UTC 2019
==========================================
A) I have the same cluster of 3 replicas: gfs-1 (10.76.153.206), gfs-2 (10.76.153.213), and gfs-3new (10.76.153.207), running 3.12.15. When gfs-1 was online upgraded from 3.12.15 to 5.5, all bricks were online and heal succeeded. Continuing with gfs-2, then gfs-3new, online upgrade and heal succeeded as well.

1) Here's the output after gfs-1 was online upgraded from 3.12.15 to 5.5:
Logs uploaded are: gfs-1_gfs1_upg_log.tgz, gfs-2_gfs1_upg_log.tgz, and gfs-3new_gfs1_upg_log.tgz. All volumes/bricks are online and heal succeeded.

[root@gfs-1 ansible2]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49155     0          Y       19559
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       11171
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       25740
Self-heal Daemon on localhost               N/A       N/A        Y       19587
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11161
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       25730

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49156     0          Y       19568
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       11180
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       25749
Self-heal Daemon on localhost               N/A       N/A        Y       19587
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11161
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       25730

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49157     0          Y       19578
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       11189
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       25758
Self-heal Daemon on localhost               N/A       N/A        Y       19587
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       25730
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11161

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible2]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.

Case 2) online rollback from 5.5 to 3.12.15 - rollback started right after: Thu Mar 21 14:20:01 UTC 2019
===========================================
A) Here are the outputs after gfs-1 was online rolled back from 5.5 to 3.12.15 - the rollback succeeded. All bricks were online, but "gluster volume heal" was unsuccessful:
Logs uploaded are: gfs-1_gfs1_rollbk_log.tgz, gfs-2_gfs1_rollbk_log.tgz, and gfs-3new_gfs1_rollbk_log.tgz

[root@gfs-1 glusterfs]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       21586
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       9772
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       12139
Self-heal Daemon on localhost               N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       9799
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       21595
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       9781
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       12148
Self-heal Daemon on localhost               N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       9799
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       21604
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       9790
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       12157
Self-heal Daemon on localhost               N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       9799
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
[root@gfs-1 glusterfs]#

B) Same "heal" failure after rolling back gfs-2 from 5.5 to 3.12.15
===================================================================
[root@gfs-2 glusterfs]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       21586
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       11313
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       12139
Self-heal Daemon on localhost               N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       21595
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       11322
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       12148
Self-heal Daemon on localhost               N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       21604
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       11331
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       12157
Self-heal Daemon on localhost               N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-2 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
[root@gfs-2 glusterfs]#

C) After rolling back gfs-3new from 5.5 to 3.12.15 (all are on 3.12.15 now), heal succeeded.
Logs uploaded are: gfs-1_all_rollbk_log.tgz, gfs-2_all_rollbk_log.tgz, and gfs-3new_all_rollbk_log.tgz

[root@gfs-3new glusterfs]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       21586
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       11313
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       13724
Self-heal Daemon on localhost               N/A       N/A        Y       13714
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11303

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       21595
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       11322
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       13733
Self-heal Daemon on localhost               N/A       N/A        Y       13714
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11303

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       21604
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       11331
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       13742
Self-heal Daemon on localhost               N/A       N/A        Y       13714
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-3new glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-3new glusterfs]#

Regards, Amgad
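The pattern across both rollback runs is consistent: the heal commit fails against peers on a different release and succeeds again once all three nodes are back on the same version. A pre-heal guard along those lines - purely an illustrative sketch built on that observation, using this report's hostnames:

#!/bin/bash
# Defer heal while nodes report different glusterfs releases, since the
# commit phase fails against mixed-version peers in the tests above.
nodes="gfs-1 gfs-2 gfs-3new"
versions=$(for n in $nodes; do ssh "$n" 'gluster --version | head -1'; done | sort -u)
if [ "$(printf '%s\n' "$versions" | wc -l)" -ne 1 ]; then
    echo "mixed glusterfs versions detected - deferring heal:" >&2
    printf '%s\n' "$versions" >&2
    exit 1
fi
for vol in $(gluster volume list); do
    gluster volume heal "$vol"
done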
[root@gfs-1 ansible2]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49155 0 Y 19559 Brick 10.76.153.213:/mnt/data1/1 49152 0 Y 11171 Brick 10.76.153.207:/mnt/data1/1 49152 0 Y 25740 Self-heal Daemon on localhost N/A N/A Y 19587 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11161 Self-heal Daemon on 10.76.153.207 N/A N/A Y 25730 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49156 0 Y 19568 Brick 10.76.153.213:/mnt/data2/2 49153 0 Y 11180 Brick 10.76.153.207:/mnt/data2/2 49153 0 Y 25749 Self-heal Daemon on localhost N/A N/A Y 19587 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11161 Self-heal Daemon on 10.76.153.207 N/A N/A Y 25730 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49157 0 Y 19578 Brick 10.76.153.213:/mnt/data3/3 49154 0 Y 11189 Brick 10.76.153.207:/mnt/data3/3 49154 0 Y 25758 Self-heal Daemon on localhost N/A N/A Y 19587 Self-heal Daemon on 10.76.153.207 N/A N/A Y 25730 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11161 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-1 ansible2]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol2 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol3 has been successful Use heal info commands to check status. Case 2) online rollback from 5.5 to 3.12.15 - upgrades stared right after: Thu Mar 21 14:20:01 UTC 2019 =========================================== A) Here're the outputs after gfs-1 was online rolled back from 5.5 to 3.12.15 - rollback succeeded. 
All bricks were online, but "gluster volume heal" was unsuccessful: Logs uploaded are: gfs-1_gfs1_rollbk_log.tgz, gfs-2_gfs1_rollbk_log.tgz, and gfs-3new_gfs1_rollbk_log.tgz [root@gfs-1 glusterfs]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49152 0 Y 21586 Brick 10.76.153.213:/mnt/data1/1 49155 0 Y 9772 Brick 10.76.153.207:/mnt/data1/1 49155 0 Y 12139 Self-heal Daemon on localhost N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 9799 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49153 0 Y 21595 Brick 10.76.153.213:/mnt/data2/2 49156 0 Y 9781 Brick 10.76.153.207:/mnt/data2/2 49156 0 Y 12148 Self-heal Daemon on localhost N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 9799 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49154 0 Y 21604 Brick 10.76.153.213:/mnt/data3/3 49157 0 Y 9790 Brick 10.76.153.207:/mnt/data3/3 49157 0 Y 12157 Self-heal Daemon on localhost N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 9799 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-1 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Commit failed on 10.76.153.213. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Commit failed on 10.76.153.213. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Commit failed on 10.76.153.213. Please check log file for details. 
[root@gfs-1 glusterfs]# B) Same "heal" failure after rolling back gfs-2 from 5.5 to 3.12.15 =================================================================== [root@gfs-2 glusterfs]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49152 0 Y 21586 Brick 10.76.153.213:/mnt/data1/1 49152 0 Y 11313 Brick 10.76.153.207:/mnt/data1/1 49155 0 Y 12139 Self-heal Daemon on localhost N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49153 0 Y 21595 Brick 10.76.153.213:/mnt/data2/2 49153 0 Y 11322 Brick 10.76.153.207:/mnt/data2/2 49156 0 Y 12148 Self-heal Daemon on localhost N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49154 0 Y 21604 Brick 10.76.153.213:/mnt/data3/3 49154 0 Y 11331 Brick 10.76.153.207:/mnt/data3/3 49157 0 Y 12157 Self-heal Daemon on localhost N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-2 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. 
[root@gfs-2 glusterfs]# C) After rolling back gfs-3new from 5.5 to 3.12.15 (all are on 3.12.15 now) heal succeeded Logs uploaded are: gfs-1_all_rollbk_log.tgz, gfs-2_all_rollbk_log.tgz, and gfs-3new_all_rollbk_log.tgz [root@gfs-3new glusterfs]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49152 0 Y 21586 Brick 10.76.153.213:/mnt/data1/1 49152 0 Y 11313 Brick 10.76.153.207:/mnt/data1/1 49152 0 Y 13724 Self-heal Daemon on localhost N/A N/A Y 13714 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11303 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49153 0 Y 21595 Brick 10.76.153.213:/mnt/data2/2 49153 0 Y 11322 Brick 10.76.153.207:/mnt/data2/2 49153 0 Y 13733 Self-heal Daemon on localhost N/A N/A Y 13714 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11303 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49154 0 Y 21604 Brick 10.76.153.213:/mnt/data3/3 49154 0 Y 11331 Brick 10.76.153.207:/mnt/data3/3 49154 0 Y 13742 Self-heal Daemon on localhost N/A N/A Y 13714 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-3new glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol2 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol3 has been successful Use heal info commands to check status. [root@gfs-3new glusterfs]# Regards, Amgad (In reply to Amgad from comment #36) > Thanks Sanju and Shyam. > > I went ahead and built the 5.5 RPMS and re-did the online upgrade/rollback > tests from 3.12.15 to 5.5, and back. I got the same issue with online > rollback. > Here is the data (logs are attached as well): > > Case 1) online upgrade from 3.12.15 to 5.5 - upgrades stared right after: > Thu Mar 21 14:01:06 UTC 2019 > ========================================== > A) I have same cluster of 3 replicas: gfs-1 (10.76.153.206), gfs-2 > (10.76.153.213), and gfs-3new (10.76.153.207), running 3.12.15. > When online upgraded gfs-1 from 3.12.15 to 5.5, all bricks were online and > heal succeeded. Continuing with gfs-2, then gfs-3new, online upgrade, heal > succeeded as well. > > 1) Here's the output after gfs-1 was online upgraded from 3.12.15 to 5.5: > Logs uploaded are: gfs-1_gfs1_upg_log.tgz, gfs-2_gfs1_upg_log.tgz, and > gfs-3new_gfs1_upg_log.tgz. 
comment seems to be duplicated

Created attachment 1546575 [details]
gfs-1 logs when gfs-1 online upgraded from 3.12.15 to 5.5
Created attachment 1546576 [details]
gfs-2 logs when gfs-1 online upgraded from 3.12.15 to 5.5
Created attachment 1546577 [details]
gfs-3new logs when gfs-1 online upgraded from 3.12.15 to 5.5
Created attachment 1546578 [details]
gfs-1 logs when gfs-1 online rolled-back from 5.5 to 3.12.15
Created attachment 1546579 [details]
gfs-2 logs when gfs-1 online rolled-back from 5.5 to 3.12.15
Created attachment 1546580 [details]
gfs-3new logs when gfs-1 online rolled-back from 5.5 to 3.12.15
Created attachment 1546588 [details]
gfs-1 logs when all servers online rolled-back from 5.5 to 3.12.15
Created attachment 1546589 [details]
gfs-2 logs when all servers online rolled-back from 5.5 to 3.12.15
Created attachment 1546591 [details]
gfs-3new logs when all servers online rolled-back from 5.5 to 3.12.15
(In reply to Amgad from comment #36)

Amgad,

Did you check whether you are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1676812? I believe that you are facing the same issue.

Thanks,
Sanju

That's not the case here. In my scenario, heal is performed after the rollback (from 5.5 to 3.12.15) is done on gfs-1 (gfs-2 and gfs-3new are still on 5.5), and all volumes/bricks were up.

I also ran another test: during the rollback of gfs-1, a client generated 128 files. All of the files existed on nodes gfs-2 and gfs-3new, but not on gfs-1. Heal kept failing despite all bricks being online. Here are the outputs:
==================

1) On gfs-1, the one rolled back to 3.12.15

[root@gfs-1 ansible2]# gluster --version
glusterfs 3.12.15
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

[root@gfs-1 ansible2]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       10712
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       20297
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       21395
Self-heal Daemon on localhost               N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       10721
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       20312
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       21404
Self-heal Daemon on localhost               N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       10731
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       20327
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       21413
Self-heal Daemon on localhost               N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       20336

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible2]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.213. Please check log file for details.
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.213. Please check log file for details.
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
[root@gfs-1 ansible2]#

[root@gfs-1 ansible2]# gluster volume heal glustervol3 info
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
/
/test_file.1
/test_file.2
/test_file.3
/test_file.4
..
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
/
/test_file.1
/test_file.2
/test_file.3
/test_file.4
...
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

[root@gfs-1 ansible2]# ls -ltr /mnt/data3/3/    ====> None of the test_file.? exists
total 8
-rw-------. 2 root root  0 Mar 11 15:52 c2file3
-rw-------. 2 root root 66 Mar 11 16:37 c1file3
-rw-------. 2 root root 91 Mar 22 16:36 c1file2
[root@gfs-1 ansible2]#

2) On gfs-2, on 5.5

[root@gfs-2 ansible2]# gluster --version
glusterfs 5.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.
[root@gfs-2 ansible2]#

[root@gfs-2 ansible2]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       10712
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       20297
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       21395
Self-heal Daemon on localhost               N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       10721
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       20312
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       21404
Self-heal Daemon on localhost               N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       10731
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       20327
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       21413
Self-heal Daemon on localhost               N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

** gluster volume heal glustervol3 info has the same output as gfs-1

[root@gfs-2 ansible2]# ls -ltr /mnt/data3/3/    =====> all test_file.? are there
total 131080
-rw-------. 2 root root       0 Mar 11 15:52 c2file3
-rw-------. 2 root root      66 Mar 11 16:37 c1file3
-rw-------. 2 root root      91 Mar 22 16:36 c1file2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.0
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.1
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.3
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.4
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.5
........
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.123
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.124
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.125
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.126
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.127
[root@gfs-2 ansible2]#

3) On gfs-3new, same as gfs-2

[root@gfs-3new ansible2]# ls -ltr /mnt/data3/3/
total 131080
-rw-------. 2 root root       0 Mar 11 15:52 c2file3
-rw-------. 2 root root      66 Mar 11 16:37 c1file3
-rw-------. 2 root root      91 Mar 22 16:36 c1file2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.0
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.1
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.3
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.4
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.5
.....
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.122
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.123
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.124
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.125
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.126
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.127
[root@gfs-3new ansible2]#

I'm attaching the logs for this case as well.

Regards,
Amgad
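As an aside, for anyone reproducing this test: a minimal sketch of the kind of client-side write load described above (128 files of 1048576 bytes each, matching the test_file.* listings). The mount point is a hypothetical placeholder and the use of dd is an assumption for illustration -- not necessarily the actual commands used:

#!/bin/bash
# Sketch: write 128 x 1 MiB files to a FUSE-mounted gluster volume during
# the rollback window, then count how many landed on each brick backend.
MOUNT=/mnt/client/glustervol3    # hypothetical client mount point

for i in $(seq 0 127); do
    dd if=/dev/zero of="$MOUNT/test_file.$i" bs=1M count=1 2>/dev/null
done

# On each server, a healthy replica-3 volume should then show 128:
#   ls /mnt/data3/3/ | grep -c 'test_file\.'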
Created attachment 1547013 [details]
gfs-1 logs when gfs-1 online rolled-back from 5.5 to 3.12.15 with 128 files generated
Created attachment 1547014 [details]
gfs-2 logs when gfs-1 online rolled-back from 5.5 to 3.12.15 with 128 files generated
Created attachment 1547015 [details]
gfs-3new logs when gfs-1 online rolled-back from 5.5 to 3.12.15 with 128 files generated
Hi Sanju:

I did more testing to take a closer look, and here is a finer-grained description of the behavior:

0) Starting with a 3-replica cluster: gfs-1, gfs-2, and gfs-3new, all on 3.12.15.

1) Replication was always successful, and the "gluster volume heal <vol>" command always succeeded, during the online upgrade from 3.12.15 to 5.5 on all three nodes, at every step.

2) While rolling back one node (gfs-1) to 3.12.15, I added files (128 files) to one volume; the files were replicated between the gfs-2 and gfs-3new servers.

3) When the rollback of gfs-1 to 3.12.15 was complete (while gfs-2 and gfs-3new were still on 5.5), the files didn't replicate to gfs-1 and the "gluster volume heal <vol>" command failed (NO bricks were offline). "gluster volume heal <vol> info" showed "Number of entries: 129" (128 files and a directory) on the bricks on gfs-2 and gfs-3new.
** Heal never succeeded, even when gfs-1 was rebooted.

[root@gfs-1 ~]# gluster volume heal glustervol3 info
Brick 10.76.153.206:/mnt/data3/3    ==> gfs-1
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3    ==> gfs-2
/test_file.0
/
/test_file.1
/test_file.2
.......
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

Brick 10.76.153.207:/mnt/data3/3    ==> gfs-3new
/test_file.0
/
/test_file.1
/test_file.2
/test_file.3
/test_file.4
.....
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

[root@gfs-1 ~]#

4) When gfs-2 was rolled back to 3.12.15 (now gfs-1 on 3.12.15 and gfs-3new on 5.5), the moment "glusterd" started on gfs-2, replication and heal started, and the "Number of entries:" went down to "0" within "8" seconds.

Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
/ - Possibly undergoing heal
/test_file.1
/test_file.2
/test_file.3
..
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
/test_file.4
/test_file.5
/test_file.6
/test_file.7
/test_file.8
..
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 125
==============
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
/test_file.68
/test_file.69
..
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 61

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
/test_file.76
/test_file.77
/test_file.78
..
/test_file.122
/test_file.123
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 53
==============
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
Status: Connected
Number of entries: 1

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
Status: Connected
Number of entries: 1
==============
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.207:/mnt/data3/3
Status: Connected
Number of entries: 0

5) Although heal started when gfs-2 was rolled back to 3.12.15 (two nodes now on 3.12.15), the command "gluster volume heal <vol>" was continuously unsuccessful. No bricks were offline.

[root@gfs-1 ~]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
[root@gfs-1 ~]#

6) When gfs-3new was rolled back (all three servers on 3.12.15), the command "gluster volume heal <vol>" was successful.

Conclusions:
- "Heal" is not successful when one server is rolled back to 3.12.15 while the other two are on 5.5; the command "gluster volume heal <vol>" is not successful either.
- Heal starts once two servers are rolled back to 3.12.15.
- The command "gluster volume heal <vol>" is not successful until all servers are rolled back to 3.12.15.

Hi Sanju:

I just saw the 5.5 CentOS RPMs posted this morning! Any change? If not, would you kindly update the status of the rollback issue here?

Regards,
Amgad

Downloaded the 5.5 CentOS RPMs -- same behavior, except that "gluster volume heal <vol> info" is slower compared to my private build from GitHub: it is taking 10 seconds to respond.

[root@gfs-1 ansible1]# time gluster volume heal glustervol3 info
Brick 10.75.147.39:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.75.147.46:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.75.147.41:/mnt/data3/3
Status: Connected
Number of entries: 0

real    0m10.548s
user    0m0.031s
sys     0m0.028s
[root@gfs-1 ansible1]#

Amgad,

Allow me some time, I will get back to you soon.

Thanks,
Sanju

Thanks for your support!

(In reply to Sanju from comment #55)
> Amgad,
>
> Allow me some time, I will get back to you soon.
>
> Thanks,
> Sanju

Sanju / Shyam:

It has been two weeks now. What's the update on this? We're blocked, unable to deploy 5.x because of the online rollback.

Appreciate your timely update!

Regards,
Amgad

is it fixed in 5.6?

Sanju / Shyam:

It has been three weeks now. What's the update on this? We're blocked, unable to deploy 5.x because of the online rollback.

Appreciate your timely update!

Regards,
Amgad

Amgad,

Sorry for the delay in response. According to https://bugzilla.redhat.com/show_bug.cgi?id=1676812, the heal command says "Launching heal operation to perform index self heal on volume <volname> has been unsuccessful: Commit failed on <ip_addr>. Please check log file for details" when any of the bricks in the volume is down. But in the background, the heal operation will continue to happen. Here, the error message is misleading. I request you to take a look at https://review.gluster.org/22209, where we tried to change this message but refrained from doing so based on the discussions over the patch.

I believe in your setup also, if you check the files in the bricks, they will be healing. Also, we never tested rollback scenarios in our testing, but everything should be fine after rollback.

Thanks,
Sanju

Thanks Sanju:

We do automate the procedure, so we'll need a reliable success check. What command do you recommend, then, to check that the heal is successful during our automated rollback? We can't just ignore the "unsuccessful" message, because it can be real as well.

Appreciate your prompt answer.

Regards,
Amgad

Please go through my data in the comment of 2019-03-24 03:55:36 UTC, which shows heal is not happening until the 2nd node is rolled back to 3.12.15 as well -- so until 2 nodes are at 3.12.15, heal doesn't start.

(In reply to Amgad from comment #61)
> Thanks Sanju:
>
> We do automate the procedure, so we'll need a reliable success check. What
> command do you recommend, then, to check that the heal is successful during
> our automated rollback?

You can check whether the "Number of entries:" are reducing in the "gluster volume heal <vol> info" output.

Karthik, can you please confirm the above statement?

(In reply to Sanju from comment #63)
> You can check whether the "Number of entries:" are reducing in the "gluster
> volume heal <vol> info" output.
>
> Karthik, can you please confirm the above statement?

Yes, if the heal is progressing, the number of entries should decrease in the heal info output.

I confirm that the "Number of entries:" was not decreasing and stayed stuck at the original number (129) until a second node was completely rolled back to 3.12.15. If I don't roll back the second node, it stays there forever! It is clear that there is some mismatch between the versions!

Amgad,

Did you change your op-version after downgrading the node? If you're performing a downgrade, you need to manually edit the op-version to a lesser op-version in the glusterd.info file on all machines and restart glusterd, so that glusterd will run with the lower op-version. You can't set a lower op-version using the volume set operation.

And I would like to mention that we can't promise anything about downgrades, as we don't test/support them. If you are going forward and performing a downgrade, I suggest you perform an offline downgrade. After the downgrade, you should manually edit the op-version in the glusterd.info file and restart glusterd. Even after doing this, things might go wrong, as it is not something tested and supported.

Thanks,
Sanju

The op-version doesn't change with upgrade. So if I upgrade from 3.12.15 to 5.5, it stays the same:

[root@gfs2 ansible]# gluster volume get all cluster.op-version
Option                                  Value
------                                  -----
cluster.op-version                      31202
[root@gfs2 ansible]# gluster --version
glusterfs 5.5
......

So when I roll back, it's at the lower op-version. I don't change the op-version after upgrade until everything is fine (soak); then I change it to the higher value.

BTW -- I tested the scenario with 6.1-1 and it's still the same!

Regards,
Amgad

Amgad,

I would like to highlight that we don't support rollback. You might face issues with downgrades, as they are not tested and supported. If you have any concerns with upgrade, please highlight them; otherwise I would like to close this bug as NOT A BUG.

Thanks,
Sanju

Amgad,

I'm closing this bug. If you face any issues with the upgrade to release-5, please feel free to re-open.

Thanks,
Sanju

Thx Sanju!

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
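For anyone automating the check Sanju and Karthik describe above (watch "Number of entries:" fall to zero rather than trusting the exit message of "gluster volume heal <vol>"), a minimal sketch of such a probe. The volume names, retry count, and polling interval are assumptions for illustration, not part of any supported procedure:

#!/bin/bash
# Sketch: treat heal as complete only when every brick of every volume
# reports "Number of entries: 0" in the heal info output.
VOLUMES="glustervol1 glustervol2 glustervol3"    # assumed volume names
RETRIES=60
INTERVAL=10

for vol in $VOLUMES; do
    for ((i = 0; i < RETRIES; i++)); do
        # Sum the per-brick "Number of entries:" counters for this volume.
        pending=$(gluster volume heal "$vol" info \
                  | awk '/Number of entries:/ {sum += $NF} END {print sum + 0}')
        if [ "$pending" -eq 0 ]; then
            echo "$vol: heal complete"
            break
        fi
        echo "$vol: $pending entries pending, retrying in ${INTERVAL}s..."
        sleep "$INTERVAL"
    done
done

And the manual op-version edit Sanju mentions would presumably look something like the following on each node. /var/lib/glusterd/glusterd.info is the standard glusterd state file; the target value here reuses the 31202 the cluster reports above, and the whole downgrade path remains untested and unsupported per the comments above:

# After downgrading the packages on a node (unsupported; illustration only):
sed -i 's/^operating-version=.*/operating-version=31202/' /var/lib/glusterd/glusterd.info
systemctl restart glusterd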