Bug 1500728

Summary: [RFE] Provide a way to upgrade RHHI cluster via Ansible
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Sahina Bose <sabose>
Component: rhhi
Assignee: Ritesh Chikatwar <rchikatw>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: medium
Docs Contact:
Priority: high
Version: rhhi-1.0
CC: dwalveka, godas, guillaume.pavese, myllynen, pasik, rchikatw, rcyriac, rhs-bugs, sasundar
Target Milestone: ---
Keywords: FutureFeature
Target Release: RHHI-V 1.8
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
The new ovirt-ansible-cluster-upgrade role is added to help simplify the upgrade process.
Story Points: ---
Clone Of:
Clones: 1683634, 1685951
Environment:
Last Closed: 2020-08-04 14:50:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1685951    
Bug Blocks: 1721366, 1779976    
Attachments:
ovirt-upgrade.yml (no flags)

Description Sahina Bose 2017-10-11 11:43:14 UTC
Description of problem:
For customers that have multiple RHHI clusters, an Ansible-based upgrade path would be easier. The requirement is to provide an Ansible role that can be used to upgrade a cluster.

Version-Release number of selected component (if applicable):


How reproducible:
NA

Comment 1 Sahina Bose 2018-11-29 07:52:58 UTC
We already have an oVirt role to upgrade a cluster; it needs to be tested. Moving to ON_QA to test this: https://github.com/oVirt/ovirt-ansible-cluster-upgrade/blob/master/README.md
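
For reference, a minimal sketch of a playbook that drives this role, based on the example in the README linked above. The variable names (engine_fqdn, engine_user, engine_password, cluster_name, reboot_after_upgrade) are taken from that README and from comment 7 below, but should be verified against the role version actually installed; the FQDN, cluster name, and vault variable are placeholders.

---
# Sketch only: upgrade one RHHI (hyperconverged) cluster through the RHV
# Manager API using the oVirt cluster-upgrade role. Variable names follow
# the role's README; verify them against the installed role version.
- name: Upgrade the RHHI cluster
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    engine_fqdn: rhvm.example.com                    # placeholder: Manager FQDN
    engine_user: admin@internal
    engine_password: "{{ vault_engine_password }}"   # placeholder: keep in an ansible-vault file
    cluster_name: Default                            # placeholder: name of the cluster to upgrade
    reboot_after_upgrade: true                       # reboot each host after it is updated

  roles:
    - oVirt.cluster-upgrade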

Comment 2 bipin 2019-02-26 09:04:28 UTC
Assigning the bug back since the verification failed. While running the playbook, I could see that the Gluster-related roles are absent.
During the upgrade, none of the gluster bricks were stopped and their PIDs remained active, even though the /rhev mounts were unmounted.
There should be a way to stop the gluster bricks before upgrading.


Filesystem                                                           Type            Size  Used Avail Use% Mounted on
/dev/mapper/rhvh_rhsqa--grafton7--nic2-rhvh--4.3.0.5--0.20190221.0+1 ext4            786G  2.6G  744G   1% /
devtmpfs                                                             devtmpfs        126G     0  126G   0% /dev
tmpfs                                                                tmpfs           126G   16K  126G   1% /dev/shm
tmpfs                                                                tmpfs           126G  566M  126G   1% /run
tmpfs                                                                tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/rhvh_rhsqa--grafton7--nic2-var                           ext4             15G  4.2G  9.8G  31% /var
/dev/mapper/rhvh_rhsqa--grafton7--nic2-tmp                           ext4            976M  3.9M  905M   1% /tmp
/dev/mapper/rhvh_rhsqa--grafton7--nic2-home                          ext4            976M  2.6M  907M   1% /home
/dev/mapper/gluster_vg_sdc-gluster_lv_engine                         xfs             100G  6.9G   94G   7% /gluster_bricks/engine
/dev/sda1                                                            ext4            976M  253M  657M  28% /boot
/dev/mapper/gluster_vg_sdb-gluster_lv_vmstore                        xfs             4.0T   11G  3.9T   1% /gluster_bricks/vmstore
/dev/mapper/gluster_vg_sdb-gluster_lv_data                           xfs              12T  1.5T   11T  13% /gluster_bricks/data
rhsqa-grafton7-nic2.lab.eng.blr.redhat.com:/engine                   fuse.glusterfs  100G  7.9G   93G   8% /rhev/data-center/mnt/glusterSD/rhsqa-grafton7-nic2.lab.eng.blr.redhat.com:_engine
tmpfs                                                                tmpfs            26G     0   26G   0% /run/user/0


[root@rhsqa-grafton7 ~]# pidof glusterfs
41191 38408 38286 38000

Comment 5 Gobinda Das 2019-03-11 06:19:07 UTC
Hi Bipin,
 As per Martin's update in https://bugzilla.redhat.com/show_bug.cgi?id=1685951, the issue is password obfuscation.
Can you please try with a password that is not a part of the FQDN?
You can change the password via the engine-config tool.
Based on this, I am moving this to ON_QA.

Comment 6 SATHEESARAN 2019-03-11 11:31:39 UTC
(In reply to Gobinda Das from comment #5)
> Hi Bipin,
>  As per Martin update in https://bugzilla.redhat.com/show_bug.cgi?id=1685951
> the issue is password obfuscation.
> Can you please try with a password that's not a part of FQDN?
> You can change the password via engine-config tool.
> Based on this I am moving this to ON_QA.

Hi Gobinda,

Initially the test was done with the password being a substring of the hostname,
but later we could get past that.

The real problem here is that there are a few prerequisites before moving
an HC node into maintenance, and these were not respected while performing
the cluster upgrade using these oVirt roles.

All I was asking for is that there should be roles that take HC-related
activities into account as well.

Comment 7 Gobinda Das 2019-03-13 09:23:17 UTC
Hi bipin/sas,
 I just tried the ovirt-ansible-cluster-upgrade role.
It works fine; my host was upgraded successfully.
All gluster processes were stopped except glustereventsd.
VDSM uses /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh to stop all gluster processes,
but this script does not stop glustereventsd.
However, since the role is run with "reboot_after_upgrade: true", this does not cause any issue, as the host is rebooted after the upgrade.
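
A hedged sketch of how the missing Gluster cleanup could be expressed as pre-upgrade Ansible tasks. These tasks are an illustration only and are not part of the ovirt-ansible-cluster-upgrade role:

# Illustrative pre-upgrade tasks (not part of the role): stop Gluster
# processes on the host that is about to be moved to maintenance.
- name: Stop all Gluster processes (bricks, self-heal daemon, etc.)
  command: /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
  ignore_errors: true    # tolerate a non-zero exit if some processes are already stopped

- name: Stop glustereventsd, which the script above does not cover
  service:
    name: glustereventsd
    state: stopped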

During upgrade:
[root@tendrl27 ~]# service glusterd status
Redirecting to /bin/systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/glusterd.service.d
           └─99-cpu.conf
   Active: inactive (dead) since Wed 2019-03-13 13:42:03 IST; 3min 15s ago
 Main PID: 1445 (code=exited, status=15)
   CGroup: /glusterfs.slice/glusterd.service

Mar 12 17:41:01 tendrl27.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
Mar 12 17:41:04 tendrl27.lab.eng.blr.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
Mar 12 17:41:07 tendrl27.lab.eng.blr.redhat.com glusterd[1445]: [2019-03-12 12:11:07.605004] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server ...ocal bricks.
Mar 12 17:41:07 tendrl27.lab.eng.blr.redhat.com glusterd[1445]: [2019-03-12 12:11:07.793049] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server ...ocal bricks.
Mar 12 17:41:08 tendrl27.lab.eng.blr.redhat.com glusterd[1445]: [2019-03-12 12:11:08.014252] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server ...ocal bricks.
Mar 13 13:42:03 tendrl27.lab.eng.blr.redhat.com systemd[1]: Stopping GlusterFS, a clustered file-system server...
Mar 13 13:42:03 tendrl27.lab.eng.blr.redhat.com systemd[1]: Stopped GlusterFS, a clustered file-system server.
Hint: Some lines were ellipsized, use -l to show in full.

[root@tendrl26 ~]# gluster peer s
Number of Peers: 2

Hostname: tendrl27.lab.eng.blr.redhat.com
Uuid: ee92badb-d199-43f0-8092-76dc6a37ba9c
State: Peer in Cluster (Disconnected)

Hostname: tendrl25.lab.eng.blr.redhat.com
Uuid: 9373b871-cfce-41ba-a815-0b330f6975c8
State: Peer in Cluster (Connected)
[root@tendrl26 ~]# gluster v status
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick tendrl26.lab.eng.blr.redhat.com:/glus
ter_bricks/data/data                        49152     0          Y       2480 
Brick tendrl25.lab.eng.blr.redhat.com:/glus
ter_bricks/data/data                        49152     0          Y       15950
Self-heal Daemon on localhost               N/A       N/A        Y       2660 
Self-heal Daemon on tendrl25.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       9529 
 
Task Status of Volume data
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick tendrl26.lab.eng.blr.redhat.com:/glus
ter_bricks/engine/engine                    49158     0          Y       2531 
Brick tendrl25.lab.eng.blr.redhat.com:/glus
ter_bricks/engine/engine                    49153     0          Y       15969
Self-heal Daemon on localhost               N/A       N/A        Y       2660 
Self-heal Daemon on tendrl25.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       9529 
 
Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick tendrl26.lab.eng.blr.redhat.com:/glus
ter_bricks/vmstore/vmstore                  49154     0          Y       2540 
Brick tendrl25.lab.eng.blr.redhat.com:/glus
ter_bricks/vmstore/vmstore                  49154     0          Y       15998
Self-heal Daemon on localhost               N/A       N/A        Y       2660 
Self-heal Daemon on tendrl25.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       9529 
 
Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks

[root@tendrl27 ~]# pidof glusterfs
13425

[root@tendrl27 ~]# ps -ef | grep gluster
root      9740     1  0 Feb26 ?        00:02:05 python /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root      9851  9740  0 Feb26 ?        00:00:01 python /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root     13425     1  0 13:42 ?        00:00:02 /usr/sbin/glusterfs --volfile-server=tendrl27.lab.eng.blr.redhat.com --volfile-server=tendrl26.lab.eng.blr.redhat.com --volfile-server=tendrl25.lab.eng.blr.redhat.com --volfile-id=/engine /rhev/data-center/mnt/glusterSD/tendrl27.lab.eng.blr.redhat.com:_engine

I am also attaching the playbook.

Comment 8 Gobinda Das 2019-03-13 09:24:15 UTC
Created attachment 1543547 [details]
ovirt-upgrade.yml

Comment 9 Gobinda Das 2019-03-13 09:42:37 UTC
Based on my discussion with bipin moving this to ON_QA for retest.

Comment 10 bipin 2019-03-19 03:55:33 UTC
Moving the bug back to ASSIGNED based on Bug 1689853 and Bug 1685951.

Comment 11 SATHEESARAN 2019-03-22 02:26:24 UTC
So there are 3 issues altogether:

1. HC prerequisites are not handled well. For instance, a geo-rep session, if in progress, is not stopped - BZ 1685951.
2. The HE VM is stuck in migration forever during the upgrade - BZ 1689853.
3. Timeouts occur even though the host is upgraded/updated successfully - this bug, BZ 1500728.

These 3 issues need to be rectified to support automated upgrade of the cluster.

Comment 12 SATHEESARAN 2019-03-22 02:27:19 UTC
(In reply to SATHEESARAN from comment #11)
> So there are 3 issues altogether
> 
> 1. HC Pre-requisites are not handled well. For instance, geo-rep session, if
> in progress is not stopped. BZ 1685951
> 2. HE VM is stuck in migration forever during upgrade - BZ 1689853
> 3. There are timeouts happening even though the host is upgraded/updated
> successfully - this bug BZ 150078.
By "this bug" I am referring to the same bug I am commenting on - BZ 1500728.
> 
> So, these 3 issues needs to rectified to support automated upgrade in the
> cluster

Comment 15 Yaniv Kaul 2019-07-02 12:11:35 UTC
Does that support fast forward upgrade? From 4.1 to 4.3?

Comment 16 Gobinda Das 2019-07-22 14:11:17 UTC
Hi Yaniv,
 Yes, it does support fast-forward upgrade.

Comment 23 SATHEESARAN 2020-07-16 09:18:34 UTC
Tested with ovirt-ansible-cluster-upgrade-1.2.3 and RHV Manager 4.4.1.

The feature works well. It updates the cluster and proceeds to upgrade all the hosts in the cluster.
As no real upgrade image is available yet, all the testing was done with interim-build RHVH images.


Note that this feature does not help with migrating from RHEL 7 based RHVH 4.3.z to RHEL 8 based RHVH 4.4.1;
this procedure should be useful for RHVH 4.4.2+ updates.

Comment 25 errata-xmlrpc 2020-08-04 14:50:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3314