Bug 1687051
Description
Amgad
2019-03-09 03:33:13 UTC
Tried an upgrade from 3.12.15 to 5.3-2, and "gluster volume heal" failed during the online upgrade (one server on 5.3-2). Is that fixed in 5.4? This will block online upgrade to 5.4 - and it impacts availability if we have to do an offline upgrade.

Any update -- this will impact online upgrade to 5.4.

Considering (a) this happens during a rollback, which isn't something the community has tested and supported, and (b) there are other critical fixes waiting for users in 5.4, which is overdue, we shouldn't be blocking the glusterfs-5.4 release. My proposal is to not mark this bug as a blocker for 5.4. Shyam - what do you think?

So how do you do an online upgrade - keep in mind that an upgrade is not complete without a rollback path in any deployment. If online upgrade/backout is not supported, reliability drops big time, especially as the cluster is used by all applications in our case! Besides, online upgrade doesn't work between 3.12 and 5.3 - is it working from 3.12 to 5.4?

Can you please provide the following information?
- gluster volume info
- gluster volume status
- logs from all the nodes (path: /var/log/glusterfs/)

Case 1) online upgrade from 3.12.15 to 5.3
==========================================
A) I have a cluster of 3 replicas: gfs-1, gfs-2, gfs-3new, running 3.12.15. When gfs-1 was online upgraded from 3.12.15, here are the outputs (notice that the bricks on gfs-1 are offline, even though both glusterd and glusterfsd are active and running):

[root@gfs-1 ~]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

---

[root@gfs-1 ~]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            N/A       N/A        N       N/A
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       24733
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       7790
Self-heal Daemon on localhost               N/A       N/A        Y       14928
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            N/A       N/A        N       N/A
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       24742
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       7800
Self-heal Daemon on localhost               N/A       N/A        Y       14928
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            N/A       N/A        N       N/A
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       24751
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       7809
Self-heal Daemon on localhost               N/A       N/A        Y       14928
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ~]#

====== Running "gluster volume heal" ==> unsuccessful

[root@gfs-1 ~]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
[root@gfs-1 ~]#

B) After reverting gfs-1 back to 3.12.15, the bricks are online and heal is successful:

[root@gfs-1 log]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@gfs-1 log]#

[root@gfs-1 log]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       16029
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       24733
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       7790
Self-heal Daemon on localhost               N/A       N/A        Y       16019
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       16038
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       24742
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       7800
Self-heal Daemon on localhost               N/A       N/A        Y       16019
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       16047
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       24751
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       7809
Self-heal Daemon on localhost               N/A       N/A        Y       16019
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       24723
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       7780

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks
[root@gfs-1 log]#

[root@gfs-1 log]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-1 log]#

Uploading /var/log/glusterfs:
- when gfs-1 was upgraded to 5.3: gfs-1-logs.tgz, gfs-2-logs.tgz, and gfs-3new-logs.tgz
- when reverted back to 3.12.15: gfs-1-logs-3.12.15.tgz, gfs-2-logs-3.12.15.tgz, and gfs-3new-logs-3.12.15.tgz

The next comment will have the 2nd case, upgrade from 3.12.15 to 4.1.4 and rollback.

Created attachment 1543212 [details]
gfs-1 logs when gfs-1 online upgraded from 3.12.15 to 5.3
Created attachment 1543214 [details]
gfs-2 logs when gfs-1 online upgraded from 3.12.15 to 5.3
Created attachment 1543215 [details]
gfs-3new logs when gfs-1 online upgraded from 3.12.15 to 5.3
Created attachment 1543216 [details]
gfs-1 logs when gfs-1 reverted back to 3.12.15
Created attachment 1543217 [details]
gfs-2 logs when gfs-1 reverted back to 3.12.15
Created attachment 1543219 [details]
gfs-3new logs when gfs-1 reverted back to 3.12.15
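For anyone reproducing the failure above, the heal loop from Case 1 can be guarded so that it only fires when every brick is online, which makes the brick-down case fail fast with a clearer message. This is a minimal sketch assuming the stock "gluster volume status" output format shown above; it is illustrative and not part of the reported procedure:

#!/bin/bash
# Run heal per volume, but skip volumes that have offline bricks,
# since heal is expected to fail while any brick is down.
for vol in $(gluster volume list); do
    # In the status table, the Online column is the second-to-last field.
    offline=$(gluster volume status "$vol" | awk '/^Brick/ && $(NF-1) == "N"' | wc -l)
    if [ "$offline" -gt 0 ]; then
        echo "$vol: $offline brick(s) offline - skipping heal" >&2
        continue
    fi
    gluster volume heal "$vol"
done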
Case 2) online upgrade from 3.12.15 to 4.1.4 and rollback:
==========================================================
A) I have a cluster of 3 replicas: gfs-1 (10.76.153.206), gfs-2 (10.76.153.213), and gfs-3new (10.76.153.207), running 3.12.15. When gfs-1 was online upgraded from 3.12.15 to 4.1.4, heal succeeded. Continuing with gfs-2, then gfs-3new, online upgrade and heal succeeded as well.

1) Here are the outputs after gfs-1 was online upgraded from 3.12.15 to 4.1.4:
Logs uploaded are: gfs-1-logs-gfs-1-UpgFrom3.12.15-to-4.1.4.tgz, gfs-2-logs-gfs-1-UpgFrom3.12.15-to-4.1.4.tgz, and gfs-3new-logs-gfs-1-UpgFrom3.12.15-to-4.1.4.tgz - see the latest upgrade case.

[root@gfs-1 ansible1]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[root@gfs-1 ansible1]#

[root@gfs-1 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49155     0          Y       30270
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       12726
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       26671
Self-heal Daemon on localhost               N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       12716
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       26661

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49156     0          Y       30279
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       12735
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       26680
Self-heal Daemon on localhost               N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       12716
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       26661

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49157     0          Y       30288
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       12744
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       26689
Self-heal Daemon on localhost               N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       12716
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       26661

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible1]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-1 ansible1]#

=======================

2) Here are the outputs after all nodes were online upgraded from 3.12.15 to 4.1.4:
Logs uploaded: see the logs for B), which include this case as well.

[root@gfs-3new ansible1]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[root@gfs-3new ansible1]#

[root@gfs-3new ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49155     0          Y       30270
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       13874
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       28144
Self-heal Daemon on localhost               N/A       N/A        Y       28134
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       30260

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49156     0          Y       30279
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       13883
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       28153
Self-heal Daemon on localhost               N/A       N/A        Y       28134
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49157     0          Y       30288
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       13892
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       28162
Self-heal Daemon on localhost               N/A       N/A        Y       28134
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       30260
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks
[root@gfs-3new ansible1]#

[root@gfs-3new ansible1]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-3new ansible1]#

=======

B) Here are the outputs after gfs-1 was online rolled back from 4.1.4 to 3.12.15 - the rollback succeeded, but "gluster volume heal" was unsuccessful:
Logs uploaded are: gfs-1-logs-gfs-1-RollbackFrom4.1.4-to-3.12.15.tgz, gfs-2-logs-gfs-1-RollbackFrom4.1.4-to-3.12.15.tgz, and gfs-3new-logs-gfs-1-RollbackFrom4.1.4-to-3.12.15.tgz - these include case 2) as well, right before.

[root@gfs-1 ansible1]# gluster volume info

Volume Name: glustervol1
Type: Replicate
Volume ID: 28b16639-7c58-4f28-975b-5ea17274e87b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data1/1
Brick2: 10.76.153.213:/mnt/data1/1
Brick3: 10.76.153.207:/mnt/data1/1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol2
Type: Replicate
Volume ID: 8637eee7-20b7-4a88-b497-192b4626093d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data2/2
Brick2: 10.76.153.213:/mnt/data2/2
Brick3: 10.76.153.207:/mnt/data2/2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol3
Type: Replicate
Volume ID: f8c21e8c-0a9a-40ba-b098-931a4219de0f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.153.206:/mnt/data3/3
Brick2: 10.76.153.213:/mnt/data3/3
Brick3: 10.76.153.207:/mnt/data3/3
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@gfs-1 ansible1]#

[root@gfs-1 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       32078
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       13874
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       28144
Self-heal Daemon on localhost               N/A       N/A        Y       32068
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       28134

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       32087
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       13883
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       28153
Self-heal Daemon on localhost               N/A       N/A        Y       32068
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       28134

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       32096
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       13892
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       28162
Self-heal Daemon on localhost               N/A       N/A        Y       32068
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       13864
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       28134

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible1]# for i in `gluster volume list`; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.213. Please check log file for details.
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
[root@gfs-1 ansible1]#

Created attachment 1543260 [details]
gfs-1 logs when gfs-1 online upgraded from 3.12.15 to 4.1
Created attachment 1543261 [details]
gfs-2 logs when gfs-1 online upgraded from 3.12.15 to 4.1
Created attachment 1543262 [details]
gfs-3new logs when gfs-1 online upgraded from 3.12.15 to 4.1
Created attachment 1543263 [details]
gfs-1 logs when gfs-1 online rolled back from 4.1.4 to 3.12.15
Created attachment 1543264 [details]
gfs-2 logs when gfs-1 online rolled back from 4.1.4 to 3.12.15
Created attachment 1543268 [details]
gfs-3new logs when gfs-1 online rolled back from 4.1.4 to 3.12.15
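The "Commit failed on <peer>" errors above occur in the commit phase of the heal management operation, which is consistent with a management-plane mismatch while the cluster runs mixed versions. A quick way to capture the relevant state on each node before attaching logs - a sketch under the assumption that op-version or peer state is involved, which this report does not yet confirm:

#!/bin/bash
# Snapshot the management-plane state that matters for mixed-version heal.
gluster --version | head -1                # installed release on this node
gluster volume get all cluster.op-version  # operating version of the cluster
gluster peer status                        # peer connectivity as seen from here
grep -H state= /var/lib/glusterd/peers/* 2>/dev/null  # raw peer states on disk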
Any update?

Hi,
Sorry for the delay. In the first case, of conversion from 3.12.15 to 5.3, the bricks on the upgraded nodes failed to come up. The heal command will fail if any of the bricks are not available or down. In the second case, of conversion from 4.1.4 to 3.12.15, even though we have all the bricks and shd up and running, I can see some errors in the glusterd logs during the commit phase of the heal command. We need to check from the glusterd side why this is happening. Sanju, are you aware of any such cases? Can you debug this further to see why the brick is failing to come up and why the heal commit fails?
Regards, Karthik

Thanks Karthik. I can see that the first case is because of the "failed to dispatch handler" issue (Bug 1671556), which should be addressed in 5.4. The second case is definitely an issue when rolling back from a newer release to an older one. Is there a "heal" incompatibility between 3.12 and later releases? Because this will impact 5.4 as well. Appreciate your support!

Any update, feedback, or investigation going on? Any idea about the root cause/fix? Will it be in 5.4?

I did more testing and realized that "gluster volume status" doesn't provide the right status when the 1st server, "gfs-1", is rolled back to 3.12.15 after the full upgrade (the other two replicas still on 4.1.4). When gfs-1 was rolled back, I got:

[root@gfs-1 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            N/A       N/A        N       N/A

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            N/A       N/A        N       N/A

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            N/A       N/A        N       N/A

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

Then when I rolled back gfs-2, I got:
====================================
[root@gfs-2 ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       23400
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       14481
Self-heal Daemon on localhost               N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       23409
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       14490
Self-heal Daemon on localhost               N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       23418
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       14499
Self-heal Daemon on localhost               N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

Then when I rolled back the third replica, I got the full status:
=================================================================
[root@gfs-3new ansible1]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       23400
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       14481
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       13184
Self-heal Daemon on localhost               N/A       N/A        Y       13174
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       14472
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       23409
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       14490
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       13193
Self-heal Daemon on localhost               N/A       N/A        Y       13174
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       14472

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       23418
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       14499
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       13202
Self-heal Daemon on localhost               N/A       N/A        Y       13174
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       23390
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       14472

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

Amgad,
Thanks for sharing your test results. I will provide an update on this by the end of this week.

Thanks Sanju. Per the release notes at https://gluster.readthedocs.io/en/latest/release-notes/5.5/, it seems like there won't be a 5.4 because of a rolling upgrade issue. I assume this is what is being addressed here. Let me know if I can help to accelerate the fix.
Amgad

Is the issue addressed by the following fixes in R5.5?
#1684385: [ovirt-gluster] Rolling gluster upgrade from 3.12.5 to 5.3 led to shard on-disk xattrs disappearing
#1684569: Upgrade from 4.1 and 5 is broken
Regards, Amgad

Amgad,
Yes, there won't be a 5.4, as we hit the upgrade blocker https://bugzilla.redhat.com/show_bug.cgi?id=1684029. The issue you are facing is not the same as https://bugzilla.redhat.com/show_bug.cgi?id=1684029 or https://bugzilla.redhat.com/show_bug.cgi?id=1684569. And I don't think you are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1684385, as that issue is seen while upgrading from 3.12 to 5. I suspect your issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1676812. Please let me know whether it is the same or not.
Thanks, Sanju
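Side note on the truncated status output above: "gluster volume status" only lists bricks of nodes the local glusterd currently considers connected, so a table that shrinks to the local brick usually means the peers are in a disconnected or rejected state rather than the bricks being gone. A quick cross-node check - an illustrative sketch using this report's hostnames, not a command sequence taken from it:

for host in gfs-1 gfs-2 gfs-3new; do
    echo "== $host =="
    # "gluster pool list" prints one line per peer with its Connected state.
    ssh "$host" gluster pool list
done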
Thanks Sanju:
I'm trying to locally build 5.5 RPMs now to test with. BTW, do you know when the CentOS 5.5 RPMs (mainly OS release 7) will be available?
Regards, Amgad

Amgad, I'm not sure, but you can always write to the users/devel mailing lists so that the appropriate people can respond.
Thanks, Sanju

(In reply to Amgad from comment #30)
> I'm trying to locally build 5.5 RPMs now to test with. BTW, do you know when
> the Centos 5.5 RPMs will be available?

@Shyam, can you please answer this?

(In reply to Sanju from comment #33)
> @Shyam, can you please answer this?

5.5 CentOS storage SIG packages have landed on the test repository as of a day or 2 back, and I am smoke testing them now. Test packages can be found and installed like so:

# yum install centos-release-gluster
# yum install --enablerepo=centos-gluster5-test glusterfs-server

If my "smoke" testing does not break anything, then packages will be forthcoming later this week or by Monday next week.

Thanks Sanju and Shyam.
I went ahead and built the 5.5 RPMs and re-did the online upgrade/rollback tests from 3.12.15 to 5.5, and back. I got the same issue with the online rollback. Here is the data (logs are attached as well):

Case 1) online upgrade from 3.12.15 to 5.5 - upgrade started right after: Thu Mar 21 14:01:06 UTC 2019
==========================================
A) I have the same cluster of 3 replicas: gfs-1 (10.76.153.206), gfs-2 (10.76.153.213), and gfs-3new (10.76.153.207), running 3.12.15. When gfs-1 was online upgraded from 3.12.15 to 5.5, all bricks were online and heal succeeded. Continuing with gfs-2, then gfs-3new, online upgrade and heal succeeded as well.

1) Here's the output after gfs-1 was online upgraded from 3.12.15 to 5.5:
Logs uploaded are: gfs-1_gfs1_upg_log.tgz, gfs-2_gfs1_upg_log.tgz, and gfs-3new_gfs1_upg_log.tgz. All volumes/bricks are online and heal succeeded.

[root@gfs-1 ansible2]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49155     0          Y       19559
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       11171
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       25740
Self-heal Daemon on localhost               N/A       N/A        Y       19587
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11161
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       25730

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49156     0          Y       19568
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       11180
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       25749
Self-heal Daemon on localhost               N/A       N/A        Y       19587
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11161
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       25730

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49157     0          Y       19578
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       11189
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       25758
Self-heal Daemon on localhost               N/A       N/A        Y       19587
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       25730
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11161

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible2]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.

Case 2) online rollback from 5.5 to 3.12.15 - rollback started right after: Thu Mar 21 14:20:01 UTC 2019
===========================================
A) Here are the outputs after gfs-1 was online rolled back from 5.5 to 3.12.15 - the rollback succeeded. All bricks were online, but "gluster volume heal" was unsuccessful:
Logs uploaded are: gfs-1_gfs1_rollbk_log.tgz, gfs-2_gfs1_rollbk_log.tgz, and gfs-3new_gfs1_rollbk_log.tgz

[root@gfs-1 glusterfs]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       21586
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       9772
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       12139
Self-heal Daemon on localhost               N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       9799
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       21595
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       9781
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       12148
Self-heal Daemon on localhost               N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       9799
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       21604
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       9790
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       12157
Self-heal Daemon on localhost               N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       9799
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
[root@gfs-1 glusterfs]#

B) Same "heal" failure after rolling back gfs-2 from 5.5 to 3.12.15
===================================================================
[root@gfs-2 glusterfs]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       21586
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       11313
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       12139
Self-heal Daemon on localhost               N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       21595
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       11322
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       12148
Self-heal Daemon on localhost               N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       21604
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       11331
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       12157
Self-heal Daemon on localhost               N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       12166

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-2 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
[root@gfs-2 glusterfs]#

C) After rolling back gfs-3new from 5.5 to 3.12.15 (all are on 3.12.15 now), heal succeeded.
Logs uploaded are: gfs-1_all_rollbk_log.tgz, gfs-2_all_rollbk_log.tgz, and gfs-3new_all_rollbk_log.tgz

[root@gfs-3new glusterfs]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       21586
Brick 10.76.153.213:/mnt/data1/1            49152     0          Y       11313
Brick 10.76.153.207:/mnt/data1/1            49152     0          Y       13724
Self-heal Daemon on localhost               N/A       N/A        Y       13714
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11303

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       21595
Brick 10.76.153.213:/mnt/data2/2            49153     0          Y       11322
Brick 10.76.153.207:/mnt/data2/2            49153     0          Y       13733
Self-heal Daemon on localhost               N/A       N/A        Y       13714
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11303

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       21604
Brick 10.76.153.213:/mnt/data3/3            49154     0          Y       11331
Brick 10.76.153.207:/mnt/data3/3            49154     0          Y       13742
Self-heal Daemon on localhost               N/A       N/A        Y       13714
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       11303
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       21576

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-3new glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol2 has been successful
Use heal info commands to check status.
Launching heal operation to perform index self heal on volume glustervol3 has been successful
Use heal info commands to check status.
[root@gfs-3new glusterfs]#

Regards, Amgad
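The pattern across both rollback runs is consistent: the heal commit fails against peers on a different release and succeeds again once all three nodes are back on the same version. A pre-heal guard along those lines - purely an illustrative sketch built on that observation, using this report's hostnames:

#!/bin/bash
# Defer heal while nodes report different glusterfs releases, since the
# commit phase fails against mixed-version peers in the tests above.
nodes="gfs-1 gfs-2 gfs-3new"
versions=$(for n in $nodes; do ssh "$n" 'gluster --version | head -1'; done | sort -u)
if [ "$(printf '%s\n' "$versions" | wc -l)" -ne 1 ]; then
    echo "mixed glusterfs versions detected - deferring heal:" >&2
    printf '%s\n' "$versions" >&2
    exit 1
fi
for vol in $(gluster volume list); do
    gluster volume heal "$vol"
done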
[root@gfs-1 ansible2]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49155 0 Y 19559 Brick 10.76.153.213:/mnt/data1/1 49152 0 Y 11171 Brick 10.76.153.207:/mnt/data1/1 49152 0 Y 25740 Self-heal Daemon on localhost N/A N/A Y 19587 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11161 Self-heal Daemon on 10.76.153.207 N/A N/A Y 25730 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49156 0 Y 19568 Brick 10.76.153.213:/mnt/data2/2 49153 0 Y 11180 Brick 10.76.153.207:/mnt/data2/2 49153 0 Y 25749 Self-heal Daemon on localhost N/A N/A Y 19587 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11161 Self-heal Daemon on 10.76.153.207 N/A N/A Y 25730 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49157 0 Y 19578 Brick 10.76.153.213:/mnt/data3/3 49154 0 Y 11189 Brick 10.76.153.207:/mnt/data3/3 49154 0 Y 25758 Self-heal Daemon on localhost N/A N/A Y 19587 Self-heal Daemon on 10.76.153.207 N/A N/A Y 25730 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11161 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-1 ansible2]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol2 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol3 has been successful Use heal info commands to check status. Case 2) online rollback from 5.5 to 3.12.15 - upgrades stared right after: Thu Mar 21 14:20:01 UTC 2019 =========================================== A) Here're the outputs after gfs-1 was online rolled back from 5.5 to 3.12.15 - rollback succeeded. 
All bricks were online, but "gluster volume heal" was unsuccessful: Logs uploaded are: gfs-1_gfs1_rollbk_log.tgz, gfs-2_gfs1_rollbk_log.tgz, and gfs-3new_gfs1_rollbk_log.tgz [root@gfs-1 glusterfs]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49152 0 Y 21586 Brick 10.76.153.213:/mnt/data1/1 49155 0 Y 9772 Brick 10.76.153.207:/mnt/data1/1 49155 0 Y 12139 Self-heal Daemon on localhost N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 9799 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49153 0 Y 21595 Brick 10.76.153.213:/mnt/data2/2 49156 0 Y 9781 Brick 10.76.153.207:/mnt/data2/2 49156 0 Y 12148 Self-heal Daemon on localhost N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 9799 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49154 0 Y 21604 Brick 10.76.153.213:/mnt/data3/3 49157 0 Y 9790 Brick 10.76.153.207:/mnt/data3/3 49157 0 Y 12157 Self-heal Daemon on localhost N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 9799 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-1 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Commit failed on 10.76.153.213. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Commit failed on 10.76.153.213. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Commit failed on 10.76.153.213. Please check log file for details. 
[root@gfs-1 glusterfs]# B) Same "heal" failure after rolling back gfs-2 from 5.5 to 3.12.15 =================================================================== [root@gfs-2 glusterfs]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49152 0 Y 21586 Brick 10.76.153.213:/mnt/data1/1 49152 0 Y 11313 Brick 10.76.153.207:/mnt/data1/1 49155 0 Y 12139 Self-heal Daemon on localhost N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49153 0 Y 21595 Brick 10.76.153.213:/mnt/data2/2 49153 0 Y 11322 Brick 10.76.153.207:/mnt/data2/2 49156 0 Y 12148 Self-heal Daemon on localhost N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49154 0 Y 21604 Brick 10.76.153.213:/mnt/data3/3 49154 0 Y 11331 Brick 10.76.153.207:/mnt/data3/3 49157 0 Y 12157 Self-heal Daemon on localhost N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.207 N/A N/A Y 12166 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-2 glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful: Commit failed on 10.76.153.207. Please check log file for details. 
[root@gfs-2 glusterfs]# C) After rolling back gfs-3new from 5.5 to 3.12.15 (all are on 3.12.15 now) heal succeeded Logs uploaded are: gfs-1_all_rollbk_log.tgz, gfs-2_all_rollbk_log.tgz, and gfs-3new_all_rollbk_log.tgz [root@gfs-3new glusterfs]# gluster volume status Status of volume: glustervol1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data1/1 49152 0 Y 21586 Brick 10.76.153.213:/mnt/data1/1 49152 0 Y 11313 Brick 10.76.153.207:/mnt/data1/1 49152 0 Y 13724 Self-heal Daemon on localhost N/A N/A Y 13714 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11303 Task Status of Volume glustervol1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data2/2 49153 0 Y 21595 Brick 10.76.153.213:/mnt/data2/2 49153 0 Y 11322 Brick 10.76.153.207:/mnt/data2/2 49153 0 Y 13733 Self-heal Daemon on localhost N/A N/A Y 13714 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11303 Task Status of Volume glustervol2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: glustervol3 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.76.153.206:/mnt/data3/3 49154 0 Y 21604 Brick 10.76.153.213:/mnt/data3/3 49154 0 Y 11331 Brick 10.76.153.207:/mnt/data3/3 49154 0 Y 13742 Self-heal Daemon on localhost N/A N/A Y 13714 Self-heal Daemon on 10.76.153.213 N/A N/A Y 11303 Self-heal Daemon on 10.76.153.206 N/A N/A Y 21576 Task Status of Volume glustervol3 ------------------------------------------------------------------------------ There are no active volume tasks [root@gfs-3new glusterfs]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done Launching heal operation to perform index self heal on volume glustervol1 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol2 has been successful Use heal info commands to check status. Launching heal operation to perform index self heal on volume glustervol3 has been successful Use heal info commands to check status. [root@gfs-3new glusterfs]# Regards, Amgad (In reply to Amgad from comment #36) > Thanks Sanju and Shyam. > > I went ahead and built the 5.5 RPMS and re-did the online upgrade/rollback > tests from 3.12.15 to 5.5, and back. I got the same issue with online > rollback. > Here is the data (logs are attached as well): > > Case 1) online upgrade from 3.12.15 to 5.5 - upgrades stared right after: > Thu Mar 21 14:01:06 UTC 2019 > ========================================== > A) I have same cluster of 3 replicas: gfs-1 (10.76.153.206), gfs-2 > (10.76.153.213), and gfs-3new (10.76.153.207), running 3.12.15. > When online upgraded gfs-1 from 3.12.15 to 5.5, all bricks were online and > heal succeeded. Continuing with gfs-2, then gfs-3new, online upgrade, heal > succeeded as well. > > 1) Here's the output after gfs-1 was online upgraded from 3.12.15 to 5.5: > Logs uploaded are: gfs-1_gfs1_upg_log.tgz, gfs-2_gfs1_upg_log.tgz, and > gfs-3new_gfs1_upg_log.tgz. 
comment seems to be duplicated

Created attachment 1546575 [details]
gfs-1 logs when gfs-1 online upgraded from 3.12.15 to 5.5
Created attachment 1546576 [details]
gfs-2 logs when gfs-1 online upgraded from 3.12.15 to 5.5
Created attachment 1546577 [details]
gfs-3new logs when gfs-1 online upgraded from 3.12.15 to 5.5
Created attachment 1546578 [details]
gfs-1 logs when gfs-1 online rolled-back from 5.5 to 3.12.15
Created attachment 1546579 [details]
gfs-2 logs when gfs-1 online rolled-back from 5.5 to 3.12.15
Created attachment 1546580 [details]
gfs-3new logs when gfs-1 online rolled-back from 5.5 to 3.12.15
Created attachment 1546588 [details]
gfs-1 logs when all servers online rolled-back from 5.5 to 3.12.15
Created attachment 1546589 [details]
gfs-2 logs when all servers online rolled-back from 5.5 to 3.12.15
Created attachment 1546591 [details]
gfs-3new logs when all servers online rolled-back from 5.5 to 3.12.15
(In reply to Amgad from comment #36)

Amgad,

Did you check whether you are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1676812? I believe that you are facing the same issue.

Thanks,
Sanju

That's not the case here. In my scenario, heal is performed after the rollback (from 5.5 to 3.12.15) is done on gfs-1 (gfs-2 and gfs-3new are still on 5.5), and all volumes/bricks were up.

I also ran another test: during the rollback of gfs-1, a client generated 128 files. All of the files existed on nodes gfs-2 and gfs-3new, but not on gfs-1. Heal kept failing despite all bricks being online. Here are the outputs:
==================

1) On gfs-1, the one rolled back to 3.12.15

[root@gfs-1 ansible2]# gluster --version
glusterfs 3.12.15
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

[root@gfs-1 ansible2]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       10712
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       20297
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       21395
Self-heal Daemon on localhost               N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       10721
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       20312
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       21404
Self-heal Daemon on localhost               N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       10731
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       20327
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       21413
Self-heal Daemon on localhost               N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422
Self-heal Daemon on 10.76.153.213           N/A       N/A        Y       20336

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gfs-1 ansible2]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.213. Please check log file for details.
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.213. Please check log file for details.
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Commit failed on 10.76.153.213. Please check log file for details.
[root@gfs-1 ansible2]#

[root@gfs-1 ansible2]# gluster volume heal glustervol3 info
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
/
/test_file.1
/test_file.2
/test_file.3
/test_file.4
..
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
/
/test_file.1
/test_file.2
/test_file.3
/test_file.4
...
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

[root@gfs-1 ansible2]# ls -ltr /mnt/data3/3/    ====> None of the test_file.? exists
total 8
-rw-------. 2 root root  0 Mar 11 15:52 c2file3
-rw-------. 2 root root 66 Mar 11 16:37 c1file3
-rw-------. 2 root root 91 Mar 22 16:36 c1file2
[root@gfs-1 ansible2]#

2) On gfs-2, on 5.5

[root@gfs-2 ansible2]# gluster --version
glusterfs 5.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.
[root@gfs-2 ansible2]#

[root@gfs-2 ansible2]# gluster volume status
Status of volume: glustervol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data1/1            49152     0          Y       10712
Brick 10.76.153.213:/mnt/data1/1            49155     0          Y       20297
Brick 10.76.153.207:/mnt/data1/1            49155     0          Y       21395
Self-heal Daemon on localhost               N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data2/2            49153     0          Y       10721
Brick 10.76.153.213:/mnt/data2/2            49156     0          Y       20312
Brick 10.76.153.207:/mnt/data2/2            49156     0          Y       21404
Self-heal Daemon on localhost               N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: glustervol3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.76.153.206:/mnt/data3/3            49154     0          Y       10731
Brick 10.76.153.213:/mnt/data3/3            49157     0          Y       20327
Brick 10.76.153.207:/mnt/data3/3            49157     0          Y       21413
Self-heal Daemon on localhost               N/A       N/A        Y       20336
Self-heal Daemon on 10.76.153.206           N/A       N/A        Y       10703
Self-heal Daemon on 10.76.153.207           N/A       N/A        Y       21422

Task Status of Volume glustervol3
------------------------------------------------------------------------------
There are no active volume tasks

** gluster volume heal glustervol3 info has the same output as gfs-1

[root@gfs-2 ansible2]# ls -ltr /mnt/data3/3/    =====> all test_file.? are there
total 131080
-rw-------. 2 root root       0 Mar 11 15:52 c2file3
-rw-------. 2 root root      66 Mar 11 16:37 c1file3
-rw-------. 2 root root      91 Mar 22 16:36 c1file2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.0
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.1
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.3
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.4
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.5
........
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.123
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.124
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.125
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.126
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.127
[root@gfs-2 ansible2]#

3) On gfs-3new, same as gfs-2

[root@gfs-3new ansible2]# ls -ltr /mnt/data3/3/
total 131080
-rw-------. 2 root root       0 Mar 11 15:52 c2file3
-rw-------. 2 root root      66 Mar 11 16:37 c1file3
-rw-------. 2 root root      91 Mar 22 16:36 c1file2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.0
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.1
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.2
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.3
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.4
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.5
.....
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.122
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.123
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.124
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.125
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.126
-rw-------. 2 root root 1048576 Mar 22 16:43 test_file.127
[root@gfs-3new ansible2]#

I'm attaching the logs for this case as well.

Regards,
Amgad
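As an aside, for anyone reproducing this test: a minimal sketch of the kind of client-side write load described above (128 files of 1048576 bytes each, matching the test_file.* listings). The mount point is a hypothetical placeholder and the use of dd is an assumption for illustration -- not necessarily the actual commands used:

#!/bin/bash
# Sketch: write 128 x 1 MiB files to a FUSE-mounted gluster volume during
# the rollback window, then count how many landed on each brick backend.
MOUNT=/mnt/client/glustervol3    # hypothetical client mount point

for i in $(seq 0 127); do
    dd if=/dev/zero of="$MOUNT/test_file.$i" bs=1M count=1 2>/dev/null
done

# On each server, a healthy replica-3 volume should then show 128:
#   ls /mnt/data3/3/ | grep -c 'test_file\.'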
Created attachment 1547013 [details]
gfs-1 logs when gfs-1 online rolled-back from 5.5 to 3.12.15 with 128 files generated
Created attachment 1547014 [details]
gfs-2 logs when gfs-1 online rolled-back from 5.5 to 3.12.15 with 128 files generated
Created attachment 1547015 [details]
gfs-3new logs when gfs-1 online rolled-back from 5.5 to 3.12.15 with 128 files generated
Hi Sanju:

I did more testing to take a closer look, and here is a finer-grained description of the behavior:

0) Starting with a 3-replica cluster: gfs-1, gfs-2, and gfs-3new, all on 3.12.15.

1) Replication was always successful, and the "gluster volume heal <vol>" command always succeeded, during the online upgrade from 3.12.15 to 5.5 on all three nodes, at every step.

2) While rolling back one node (gfs-1) to 3.12.15, I added files (128 files) to one volume; the files were replicated between the gfs-2 and gfs-3new servers.

3) When the rollback of gfs-1 to 3.12.15 was complete (while gfs-2 and gfs-3new were still on 5.5), the files didn't replicate to gfs-1 and the "gluster volume heal <vol>" command failed (NO bricks were offline). "gluster volume heal <vol> info" showed "Number of entries: 129" (128 files and a directory) on the bricks on gfs-2 and gfs-3new.
** Heal never succeeded, even when gfs-1 was rebooted.

[root@gfs-1 ~]# gluster volume heal glustervol3 info
Brick 10.76.153.206:/mnt/data3/3    ==> gfs-1
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3    ==> gfs-2
/test_file.0
/
/test_file.1
/test_file.2
.......
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

Brick 10.76.153.207:/mnt/data3/3    ==> gfs-3new
/test_file.0
/
/test_file.1
/test_file.2
/test_file.3
/test_file.4
.....
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

[root@gfs-1 ~]#

4) When gfs-2 was rolled back to 3.12.15 (now gfs-1 on 3.12.15 and gfs-3new on 5.5), the moment "glusterd" started on gfs-2, replication and heal started, and the "Number of entries:" went down to "0" within "8" seconds.

Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
/ - Possibly undergoing heal
/test_file.1
/test_file.2
/test_file.3
..
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 129

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
/test_file.4
/test_file.5
/test_file.6
/test_file.7
/test_file.8
..
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 125
==============
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
/test_file.68
/test_file.69
..
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 61

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
/test_file.76
/test_file.77
/test_file.78
..
/test_file.122
/test_file.123
/test_file.124
/test_file.125
/test_file.126
/test_file.127
Status: Connected
Number of entries: 53
==============
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
/test_file.0
Status: Connected
Number of entries: 1

Brick 10.76.153.207:/mnt/data3/3
/test_file.0
Status: Connected
Number of entries: 1
==============
Brick 10.76.153.206:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.213:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.76.153.207:/mnt/data3/3
Status: Connected
Number of entries: 0

5) Although heal started when gfs-2 was rolled back to 3.12.15 (two nodes now on 3.12.15), the command "gluster volume heal <vol>" was continuously unsuccessful. No bricks were offline.

[root@gfs-1 ~]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
Commit failed on 10.76.153.207. Please check log file for details.
[root@gfs-1 ~]#

6) When gfs-3new was rolled back (all three servers on 3.12.15), the command "gluster volume heal <vol>" was successful.

Conclusions:
- "Heal" is not successful when one server is rolled back to 3.12.15 while the other two are on 5.5; the command "gluster volume heal <vol>" is not successful either.
- Heal starts once two servers are rolled back to 3.12.15.
- The command "gluster volume heal <vol>" is not successful until all servers are rolled back to 3.12.15.

Hi Sanju:

I just saw the 5.5 CentOS RPMs posted this morning! Any change? If not, would you kindly update the status of the rollback issue here?

Regards,
Amgad

Downloaded the 5.5 CentOS RPMs -- same behavior, except that "gluster volume heal <vol> info" is slower compared to my private build from GitHub: it is taking 10 seconds to respond.

[root@gfs-1 ansible1]# time gluster volume heal glustervol3 info
Brick 10.75.147.39:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.75.147.46:/mnt/data3/3
Status: Connected
Number of entries: 0

Brick 10.75.147.41:/mnt/data3/3
Status: Connected
Number of entries: 0

real    0m10.548s
user    0m0.031s
sys     0m0.028s
[root@gfs-1 ansible1]#

Amgad,

Allow me some time, I will get back to you soon.

Thanks,
Sanju

Thanks for your support!

(In reply to Sanju from comment #55)
> Amgad,
>
> Allow me some time, I will get back to you soon.
>
> Thanks,
> Sanju

Sanju / Shyam:

It has been two weeks now. What's the update on this? We're blocked, unable to deploy 5.x because of the online rollback.

Appreciate your timely update!

Regards,
Amgad

is it fixed in 5.6?

Sanju / Shyam:

It has been three weeks now. What's the update on this? We're blocked, unable to deploy 5.x because of the online rollback.

Appreciate your timely update!

Regards,
Amgad

Amgad,

Sorry for the delay in response. According to https://bugzilla.redhat.com/show_bug.cgi?id=1676812, the heal command says "Launching heal operation to perform index self heal on volume <volname> has been unsuccessful: Commit failed on <ip_addr>. Please check log file for details" when any of the bricks in the volume is down. But in the background, the heal operation will continue to happen. Here, the error message is misleading. I request you to take a look at https://review.gluster.org/22209, where we tried to change this message but refrained from doing so based on the discussions over the patch.

I believe in your setup also, if you check the files in the bricks, they will be healing. Also, we never tested rollback scenarios in our testing, but everything should be fine after rollback.

Thanks,
Sanju

Thanks Sanju:

We do automate the procedure, so we'll need a reliable success check. What command do you recommend, then, to check that the heal is successful during our automated rollback? We can't just ignore the "unsuccessful" message, because it can be real as well.

Appreciate your prompt answer.

Regards,
Amgad

Please go through my data in the comment of 2019-03-24 03:55:36 UTC, which shows heal is not happening until the 2nd node is rolled back to 3.12.15 as well -- so until 2 nodes are at 3.12.15, heal doesn't start.

(In reply to Amgad from comment #61)
> Thanks Sanju:
>
> We do automate the procedure, so we'll need a reliable success check. What
> command do you recommend, then, to check that the heal is successful during
> our automated rollback?

You can check whether the "Number of entries:" are reducing in the "gluster volume heal <vol> info" output.

Karthik, can you please confirm the above statement?

(In reply to Sanju from comment #63)
> You can check whether the "Number of entries:" are reducing in the "gluster
> volume heal <vol> info" output.
>
> Karthik, can you please confirm the above statement?

Yes, if the heal is progressing, the number of entries should decrease in the heal info output.

I confirm that the "Number of entries:" was not decreasing and stayed stuck at the original number (129) until a second node was completely rolled back to 3.12.15. If I don't roll back the second node, it stays there forever! It is clear that there is some mismatch between the versions!

Amgad,

Did you change your op-version after downgrading the node? If you're performing a downgrade, you need to manually edit the op-version to a lesser op-version in the glusterd.info file on all machines and restart glusterd, so that glusterd will run with the lower op-version. You can't set a lower op-version using the volume set operation.

And I would like to mention that we can't promise anything about downgrades, as we don't test/support them. If you are going forward and performing a downgrade, I suggest you perform an offline downgrade. After the downgrade, you should manually edit the op-version in the glusterd.info file and restart glusterd. Even after doing this, things might go wrong, as it is not something tested and supported.

Thanks,
Sanju

The op-version doesn't change with upgrade. So if I upgrade from 3.12.15 to 5.5, it stays the same:

[root@gfs2 ansible]# gluster volume get all cluster.op-version
Option                                  Value
------                                  -----
cluster.op-version                      31202
[root@gfs2 ansible]# gluster --version
glusterfs 5.5
......

So when I roll back, it's at the lower op-version. I don't change the op-version after upgrade until everything is fine (soak); then I change it to the higher value.

BTW -- I tested the scenario with 6.1-1 and it's still the same!

Regards,
Amgad

Amgad,

I would like to highlight that we don't support rollback. You might face issues with downgrades, as they are not tested and supported. If you have any concerns with upgrade, please highlight them; otherwise I would like to close this bug as NOT A BUG.

Thanks,
Sanju

Amgad,

I'm closing this bug. If you face any issues with the upgrade to release-5, please feel free to re-open.

Thanks,
Sanju

Thx Sanju!

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
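For anyone automating the check Sanju and Karthik describe above (watch "Number of entries:" fall to zero rather than trusting the exit message of "gluster volume heal <vol>"), a minimal sketch of such a probe. The volume names, retry count, and polling interval are assumptions for illustration, not part of any supported procedure:

#!/bin/bash
# Sketch: treat heal as complete only when every brick of every volume
# reports "Number of entries: 0" in the heal info output.
VOLUMES="glustervol1 glustervol2 glustervol3"    # assumed volume names
RETRIES=60
INTERVAL=10

for vol in $VOLUMES; do
    for ((i = 0; i < RETRIES; i++)); do
        # Sum the per-brick "Number of entries:" counters for this volume.
        pending=$(gluster volume heal "$vol" info \
                  | awk '/Number of entries:/ {sum += $NF} END {print sum + 0}')
        if [ "$pending" -eq 0 ]; then
            echo "$vol: heal complete"
            break
        fi
        echo "$vol: $pending entries pending, retrying in ${INTERVAL}s..."
        sleep "$INTERVAL"
    done
done

And the manual op-version edit Sanju mentions would presumably look something like the following on each node. /var/lib/glusterd/glusterd.info is the standard glusterd state file; the target value here reuses the 31202 the cluster reports above, and the whole downgrade path remains untested and unsupported per the comments above:

# After downgrading the packages on a node (unsupported; illustration only):
sed -i 's/^operating-version=.*/operating-version=31202/' /var/lib/glusterd/glusterd.info
systemctl restart glusterd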