Bug 1685951
| Summary: | [RFE] HC prerequisites are not carried out before cluster upgrade | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-ansible-collection | Reporter: | SATHEESARAN <sasundar> |
| Component: | cluster-upgrade | Assignee: | Ritesh Chikatwar <rchikatw> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | bugs, godas, guillaume.pavese, lleistne, lsvaty, mperina, omachace, rchikatw, rcyriac, rhs-bugs, sabose, sasundar |
| Target Milestone: | ovirt-4.4.1 | Keywords: | FutureFeature, Reopened |
| Target Release: | --- | Flags: | sasundar: ovirt-4.4? |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhv-4.4.0-29 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1500728 | Environment: | |
| Last Closed: | 2020-08-05 06:25:28 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Gluster | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1689838 | | |
| Bug Blocks: | 1500728 | | |
Description — SATHEESARAN, 2019-03-06 11:45:30 UTC
The HC pre-requisites include:

1. Stop any geo-replication session that is in progress.
2. Check for self-heal progress; if self-heal is in progress, fail the upgrade.
3. Check that brick quorum is met for the volume.
4. Stop the glusterfs processes and the glusterd service.

---

Is it easy to fix, or does it need more time?

---

Hi sas,

This looks like a problem with the HE FQDN. In the log I can see:

```
Error: Failed to read response: [(<pycurl.Curl object at 0x7fa4270569d8>, 6, 'Could not resolve host: hostedenginesm3.lab.eng.blr.********.com; Unknown error')]
```

So I think the API call failed because the HE FQDN could not be resolved. Here is the full error:

```
2019-02-26 12:52:46,070 p=29986 u=root | TASK [ovirt.cluster-upgrade : Get hosts] ***************************
2019-02-26 12:52:46,070 p=29986 u=root | task path: /usr/share/ansible/roles/ovirt.cluster-upgrade/tasks/main.yml:24
2019-02-26 12:52:46,276 p=29986 u=root | Using module file /usr/lib/python2.7/site-packages/ansible/modules/cloud/ovirt/ovirt_host_facts.py
2019-02-26 12:52:46,517 p=29986 u=root | The full traceback is:
Traceback (most recent call last):
  File "/tmp/ansible_ovirt_host_facts_payload_N4_GxY/__main__.py", line 88, in main
    all_content=module.params['all_content'],
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 11714, in list
    return self._internal_get(headers, query, wait)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 211, in _internal_get
    return future.wait() if wait else future
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 54, in wait
    response = self._connection.wait(self._context)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 496, in wait
    return self.__wait(context, failed_auth)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 510, in __wait
    raise Error("Failed to read response: {}".format(err_list))
Error: Failed to read response: [(<pycurl.Curl object at 0x7fa4270569d8>, 6, 'Could not resolve host: hostedenginesm3.lab.eng.blr.********.com; Unknown error')]
2019-02-26 12:52:46,518 p=29986 u=root | fatal: [localhost]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "all_content": false,
            "fetch_nested": false,
            "nested_attributes": [],
            "pattern": "cluster=Default name=* status=up"
        }
    },
    "msg": "Failed to read response: [(<pycurl.Curl object at 0x7fa4270569d8>, 6, 'Could not resolve host: hostedenginesm3.lab.eng.blr.********.com; Unknown error')]"
```

---

Ondra Machacek (comment #4):

Not that easy; it needs more time. Should the RHV infra team work on this, or do you (the Gluster team) work on it?

Your issue is that you are using a password which is also contained in the hostname, so it is obfuscated. For more info see: https://github.com/ansible/ansible/issues/19278

---

Martin Perina (comment #5):

(In reply to Ondra Machacek from comment #4)
> Your issue is that you are using password which is also contained in
> hostname. So it's obfuscated for more info see:
> https://github.com/ansible/ansible/issues/19278

This is an already known issue in Ansible's no_log implementation. I don't think we should do anything about it in the cluster-upgrade role; it needs to be fixed in Ansible itself:

https://github.com/ansible/ansible/issues/19278

My recommendation is to use safe passwords instead of well-known strings which can be part of FQDNs, domains, etc.

---

SATHEESARAN:

(In reply to Martin Perina from comment #5)
> My recommendation is to use safe passwords instead of well-known strings
> which can be part of FQDNS, domains, ...

Thanks Martin & Ondra. Yes, initially the password was part of the hostname used, but that is not the problem here. RHHI-V needs a set of prerequisites to be carried out, and that has been taken care of while testing with this cluster-upgrade role. Please check comment 1 for the set of prerequisites. As the Gluster team is aware of these prerequisites, this cluster-upgrade role should be updated for the HC environment.

---

While testing the upgrade, I see an exception error while the host goes for a reboot, but once the host comes up, it is updated to the latest image and all the services are running.

Error:

```
2019-03-14 14:14:56,893 p=61390 u=root | ok: [localhost]
2019-03-14 14:14:56,955 p=61390 u=root | TASK [ovirt.cluster-upgrade : Upgrade host] ***************************
2019-03-14 14:28:24,163 p=61390 u=root | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: Exception: Error while waiting on result state of the entity.
2019-03-14 14:28:24,163 p=61390 u=root | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Error while waiting on result state of the entity."}
2019-03-14 14:28:24,225 p=61390 u=root | TASK [ovirt.cluster-upgrade : Log event about cluster upgrade failed] ***************************
2019-03-14 14:28:24,654 p=61390 u=root | changed: [localhost]
2019-03-14 14:28:24,716 p=61390 u=root | TASK [ovirt.cluster-upgrade : Set original cluster policy] ***************************
2019-03-14 14:28:25,224 p=61390 u=root | changed: [localhost]
2019-03-14 14:28:25,287 p=61390 u=root | TASK [ovirt.cluster-upgrade : Start again stopped VMs] ***************************
2019-03-14 14:28:25,363 p=61390 u=root | TASK [ovirt.cluster-upgrade : Start again pin to host VMs] ***************************
2019-03-14 14:28:25,442 p=61390 u=root | TASK [ovirt.cluster-upgrade : Logout from oVirt] ***************************
2019-03-14 14:28:25,457 p=61390 u=root | skipping: [localhost]
2019-03-14 14:28:25,458 p=61390 u=root | PLAY RECAP ***************************
2019-03-14 14:28:25,459 p=61390 u=root | localhost : ok=22 changed=5 unreachable=0 failed=1
```

Ondra, could you please take a look? Attaching the required files.

---

Gobinda, assigning to you to take a look.

---

I am clearing the needinfo here, as the issue was reported in bug 1689838.

---

This bug is targeted to 4.4.3 and is in MODIFIED state. Can we re-target it to 4.4.0 and move it to QA?

---

Yes, it can be targeted. The HC pre-requisites include:

1. The engine FQDN should not contain the host password.
2. Stop any geo-replication session that is in progress.
3. Check that brick quorum is met for the volume.

---

Clearing the needinfo as it is already targeted to 4.4.0.

---

Moving to 4.4.1 since 4.4.0 has already been released.

---

SATHEESARAN:

Tested with ovirt-ansible-cluster-upgrade-1.2.3 and RHV Manager 4.4.1. The feature works well: it updates the cluster and proceeds to upgrade all the hosts in the cluster. As no real upgrade image was available, all testing was done with interim-build RHVH images. All the prerequisites are handled well.

---

This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
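The prerequisite list in the description translates naturally into a pre-upgrade gate. The following is a minimal sketch, not the cluster-upgrade role's actual implementation: the volume name `vol1` is a placeholder, and the `grep`/`awk` patterns are simplified readings of the Gluster CLI output. Step 4 is printed rather than executed so the sketch has no side effects.

```shell
#!/bin/sh
# Hypothetical HC pre-upgrade check sketch (not the real role code).
VOL=${1:-vol1}    # placeholder volume name
RESULT=failed

# 1. Any active geo-replication session must be stopped first.
if gluster volume geo-replication status 2>/dev/null | grep -q 'Active'; then
    echo "geo-rep session active; stop it before upgrading" >&2
    exit 1
fi

# 2. Fail the upgrade if self-heal is still in progress
#    (sum the per-brick "Number of entries:" counters).
pending=$(gluster volume heal "$VOL" info 2>/dev/null \
          | awk '/Number of entries:/ {sum += $NF} END {print sum + 0}')
if [ "${pending:-0}" -gt 0 ]; then
    echo "self-heal in progress ($pending entries); aborting upgrade" >&2
    exit 1
fi

# 3. Brick quorum: treat any offline brick as a quorum risk.
offline=$(gluster volume status "$VOL" detail 2>/dev/null \
          | grep -c '^Online.*N')
if [ "${offline:-0}" -gt 0 ]; then
    echo "$offline brick(s) offline; brick quorum may not be met" >&2
    exit 1
fi

RESULT=passed
# 4. Only now is it safe to stop gluster processes and glusterd.
echo "checks passed: stop glusterfs processes and glusterd, then upgrade"
```

On a host without gluster installed the probes simply find nothing to object to, so the script is safe to dry-run anywhere.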
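On the no_log point raised by Ondra and Martin: Ansible masks occurrences of a value marked secret in its output, so a password that happens to be a substring of the engine FQDN gets the hostname obfuscated in logs. A simplified, self-contained illustration of that substring masking follows; the password `blrlab` and the hostname are made up, and this is not Ansible's actual code.

```shell
# Simplified illustration of no_log-style substring masking.
secret='blrlab'                                   # hypothetical host password
fqdn="hostedengine.lab.eng.${secret}.example.com" # password is a substring
msg="Could not resolve host: ${fqdn}"

# Mask every occurrence of the secret, exactly as happened to the FQDN above.
masked=$(printf '%s\n' "$msg" | sed "s/${secret}/********/g")
echo "$masked"
# -> Could not resolve host: hostedengine.lab.eng.********.example.com
```

This is why Martin's recommendation matters: a password that is not a well-known string cannot accidentally collide with FQDNs or domain names in the output.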
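The "Error while waiting on result state of the entity" failure occurred while a host was mid-reboot and resolved itself once the host came back. A generic retry-until-up wrapper, shown here only as a sketch of how an operator might tolerate such a transient window; the `probe` function is a made-up stand-in for a real health check, not anything from the role.

```shell
# Generic retry loop: run a probe until it succeeds or attempts run out.
wait_for_host() {
    attempts=$1; shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Fake probe: succeeds on the 3rd call, simulating a host that becomes
# reachable again partway through its reboot.
TRIES_FILE=$(mktemp)
echo 0 > "$TRIES_FILE"
probe() {
    n=$(($(cat "$TRIES_FILE") + 1))
    echo "$n" > "$TRIES_FILE"
    [ "$n" -ge 3 ]
}

wait_for_host 5 probe && STATUS="host is back up"
echo "$STATUS"
# -> host is back up
```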