Description of problem:
-----------------------
For customers that have multiple RHHI clusters, an Ansible-based upgrade path would be easier. The requirement is to provide an Ansible role that can be used to upgrade a cluster.

Version-Release number of selected component (if applicable):

How reproducible:
NA

--- Additional comment from Sahina Bose on 2018-11-29 07:52:58 UTC ---

We already have an oVirt role to upgrade a cluster. This needs to be tested. Moving to ON_QA to test this:
https://github.com/oVirt/ovirt-ansible-cluster-upgrade/blob/master/README.md

--- Additional comment from bipin on 2019-02-26 09:04:28 UTC ---

Assigning the bug back since verification failed. While running the playbook, the Gluster roles were absent. During the upgrade, none of the Gluster bricks were stopped and their PIDs remained active, even though the /rhev mounts were unmounted. There should be a way to stop the Gluster brick processes before upgrading.

Filesystem                                                           Type           Size  Used Avail Use% Mounted on
/dev/mapper/rhvh_rhsqa--grafton7--nic2-rhvh--4.3.0.5--0.20190221.0+1 ext4           786G  2.6G  744G   1% /
devtmpfs                                                             devtmpfs       126G     0  126G   0% /dev
tmpfs                                                                tmpfs          126G   16K  126G   1% /dev/shm
tmpfs                                                                tmpfs          126G  566M  126G   1% /run
tmpfs                                                                tmpfs          126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/rhvh_rhsqa--grafton7--nic2-var                           ext4            15G  4.2G  9.8G  31% /var
/dev/mapper/rhvh_rhsqa--grafton7--nic2-tmp                           ext4           976M  3.9M  905M   1% /tmp
/dev/mapper/rhvh_rhsqa--grafton7--nic2-home                          ext4           976M  2.6M  907M   1% /home
/dev/mapper/gluster_vg_sdc-gluster_lv_engine                         xfs            100G  6.9G   94G   7% /gluster_bricks/engine
/dev/sda1                                                            ext4           976M  253M  657M  28% /boot
/dev/mapper/gluster_vg_sdb-gluster_lv_vmstore                        xfs            4.0T   11G  3.9T   1% /gluster_bricks/vmstore
/dev/mapper/gluster_vg_sdb-gluster_lv_data                           xfs             12T  1.5T   11T  13% /gluster_bricks/data
rhsqa-grafton7-nic2.lab.eng.blr.redhat.com:/engine                   fuse.glusterfs 100G  7.9G   93G   8% /rhev/data-center/mnt/glusterSD/rhsqa-grafton7-nic2.lab.eng.blr.redhat.com:_engine
tmpfs                                                                tmpfs           26G     0   26G   0% /run/user/0

[root@rhsqa-grafton7 ~]# pidof glusterfs
41191 38408 38286 38000
HC pre-requisites include:
1. Stop the geo-replication session if one is in progress.
2. Check self-heal progress; if self-heal is in progress, fail the upgrade.
3. Check that brick quorum is met for the volume.
4. Stop the glusterfs processes and the glusterd service.
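The self-heal check in step 2 could be sketched roughly as follows. This is a minimal Python sketch, not part of the role: it assumes the usual `gluster volume heal <vol> info` output format (a "Number of entries: N" line per brick), and the helper names `pending_heal_entries` and `check_volume_ready` are hypothetical.

```python
import re
import subprocess

def pending_heal_entries(heal_info_output):
    """Sum the 'Number of entries: N' lines reported per brick by
    `gluster volume heal <vol> info`."""
    return sum(int(n) for n in
               re.findall(r"Number of entries:\s*(\d+)", heal_info_output))

def check_volume_ready(volume):
    """Raise if self-heal is still pending on the volume (HC pre-requisite 2)."""
    out = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    ).stdout
    if pending_heal_entries(out) > 0:
        raise RuntimeError(
            "self-heal in progress on volume %s; aborting upgrade" % volume)
```

A role task implementing this check would fail the play for the affected host rather than raise, but the decision logic is the same.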
Is it easy to fix, or does it need more time?
Hi sas,

Looks like a problem with the HE FQDN. In the log I can see:

Error: Failed to read response: [(<pycurl.Curl object at 0x7fa4270569d8>, 6, 'Could not resolve host: hostedenginesm3.lab.eng.blr.********.com; Unknown error')]

So I think the API call failed because the HE FQDN could not be resolved. Here is the full error:

2019-02-26 12:52:46,070 p=29986 u=root | TASK [ovirt.cluster-upgrade : Get hosts] ********************
2019-02-26 12:52:46,070 p=29986 u=root | task path: /usr/share/ansible/roles/ovirt.cluster-upgrade/tasks/main.yml:24
2019-02-26 12:52:46,276 p=29986 u=root | Using module file /usr/lib/python2.7/site-packages/ansible/modules/cloud/ovirt/ovirt_host_facts.py
2019-02-26 12:52:46,517 p=29986 u=root | The full traceback is:
Traceback (most recent call last):
  File "/tmp/ansible_ovirt_host_facts_payload_N4_GxY/__main__.py", line 88, in main
    all_content=module.params['all_content'],
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 11714, in list
    return self._internal_get(headers, query, wait)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 211, in _internal_get
    return future.wait() if wait else future
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 54, in wait
    response = self._connection.wait(self._context)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 496, in wait
    return self.__wait(context, failed_auth)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 510, in __wait
    raise Error("Failed to read response: {}".format(err_list))
Error: Failed to read response: [(<pycurl.Curl object at 0x7fa4270569d8>, 6, 'Could not resolve host: hostedenginesm3.lab.eng.blr.********.com; Unknown error')]
2019-02-26 12:52:46,518 p=29986 u=root | fatal: [localhost]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "all_content": false,
            "fetch_nested": false,
            "nested_attributes": [],
            "pattern": "cluster=Default name=* status=up"
        }
    },
    "msg": "Failed to read response: [(<pycurl.Curl object at 0x7fa4270569d8>, 6, 'Could not resolve host: hostedenginesm3.lab.eng.blr.********.com; Unknown error')]"
}
Not that easy; it needs more time. Should the RHV infra team work on this, or do you (the Gluster team) want to?

Your issue is that you are using a password that is also contained in the hostname, so the hostname is obfuscated in the output. For more info see:
https://github.com/ansible/ansible/issues/19278
(In reply to Ondra Machacek from comment #4)
> Not that easy, need more time. Should RHV infra team work on this or do you
> (gluster team) work on this?
>
> Your issue is that you are using password which is also contained in
> hostname. So it's obfuscated for more info see:
> https://github.com/ansible/ansible/issues/19278

This is an already known issue with the Ansible no_log implementation. I don't think we should do anything about it in the cluster-upgrade role; it needs to be fixed in Ansible itself:

https://github.com/ansible/ansible/issues/19278

My recommendation is to use safe passwords instead of well-known strings which can be part of FQDNs, domains, ...
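The no_log behaviour described above can be illustrated with a few lines of Python. This is only a sketch of the idea, not Ansible's actual implementation, and the password "secret" and hostname are made-up examples: every occurrence of a no_log value is scrubbed from output, even when it legitimately appears inside an FQDN, which is what mangled the hostname in the error above.

```python
# Not Ansible's real code -- a minimal illustration of no_log-style scrubbing.
SECRET_PLACEHOLDER = "********"

def scrub(text, secrets):
    """Replace every occurrence of each secret string, wherever it appears."""
    for s in secrets:
        text = text.replace(s, SECRET_PLACEHOLDER)
    return text

# Hypothetical values: the host password "secret" is also a component of
# the engine FQDN, so the FQDN comes out mangled in the logs.
msg = "Could not resolve host: engine.lab.secret.example.com"
print(scrub(msg, ["secret"]))
# -> Could not resolve host: engine.lab.********.example.com
```

The scrubbed FQDN is then useless for diagnosing resolution failures, which is why the recommendation is to avoid passwords that are substrings of hostnames or domains.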
(In reply to Martin Perina from comment #5)
> (In reply to Ondra Machacek from comment #4)
> > Not that easy, need more time. Should RHV infra team work on this or do you
> > (gluster team) work on this?
> >
> > Your issue is that you are using password which is also contained in
> > hostname. So it's obfuscated for more info see:
> > https://github.com/ansible/ansible/issues/19278
>
> This is already known issue within Ansible no_log implementation, I don't
> think we should do anything about it with cluster-upgrade role, this needs
> to be fixed in Ansible itself:
>
> https://github.com/ansible/ansible/issues/19278
>
> My recommendation is to use safe passwords instead of well-known strings
> which can be part of FQDNS, domains, ...

Thanks Martin & Ondra. Yes, initially the password was part of the hostname used, but that's not the problem here. RHHI-V needs a set of pre-requisites to be done, and those are being taken care of while testing with this cluster-upgrade role.
Please check comment 1 for the set of pre-requisites. As the Gluster team is aware of these pre-requisites, this cluster-upgrade role should be updated for the HC environment.
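For reference, an HC-aware invocation of the role might look like the sketch below. The `engine_*` and `cluster_name` variables follow the role's README; the pre_tasks section is only an illustrative placeholder for the pre-requisites from comment 1 (geo-rep, self-heal, quorum, and brick-process handling), not an existing implementation.

```yaml
- name: oVirt cluster upgrade (HC sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    engine_url: https://engine.example.com/ovirt-engine/api   # placeholder FQDN
    engine_user: admin@internal
    engine_password: "{{ vault_engine_password }}"            # keep out of FQDNs
    cluster_name: Default
  pre_tasks:
    # Placeholders for the HC pre-requisites (comment 1): stop geo-rep
    # sessions, verify no self-heal entries are pending, verify brick
    # quorum, and stop glusterd/brick processes on the host being upgraded.
    - name: Verify self-heal is not in progress (placeholder)
      debug:
        msg: "gluster volume heal <vol> info should report 0 pending entries"
  roles:
    - ovirt.cluster-upgrade
```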
While testing the upgrade, I see an exception while the host goes for a reboot. But once the host comes up, it is updated to the latest image and all the services are running.

Error:
=====
2019-03-14 14:14:56,893 p=61390 u=root | ok: [localhost]
2019-03-14 14:14:56,955 p=61390 u=root | TASK [ovirt.cluster-upgrade : Upgrade host] ********************
2019-03-14 14:28:24,163 p=61390 u=root | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: Exception: Error while waiting on result state of the entity.
2019-03-14 14:28:24,163 p=61390 u=root | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Error while waiting on result state of the entity."}
2019-03-14 14:28:24,225 p=61390 u=root | TASK [ovirt.cluster-upgrade : Log event about cluster upgrade failed] ********************
2019-03-14 14:28:24,654 p=61390 u=root | changed: [localhost]
2019-03-14 14:28:24,716 p=61390 u=root | TASK [ovirt.cluster-upgrade : Set original cluster policy] ********************
2019-03-14 14:28:25,224 p=61390 u=root | changed: [localhost]
2019-03-14 14:28:25,287 p=61390 u=root | TASK [ovirt.cluster-upgrade : Start again stopped VMs] ********************
2019-03-14 14:28:25,363 p=61390 u=root | TASK [ovirt.cluster-upgrade : Start again pin to host VMs] ********************
2019-03-14 14:28:25,442 p=61390 u=root | TASK [ovirt.cluster-upgrade : Logout from oVirt] ********************
2019-03-14 14:28:25,457 p=61390 u=root | skipping: [localhost]
2019-03-14 14:28:25,458 p=61390 u=root | PLAY RECAP ********************
2019-03-14 14:28:25,459 p=61390 u=root | localhost : ok=22 changed=5 unreachable=0 failed=1

Ondra, could you please take a look? Attaching the required files.
Gobinda, assigning to you to take a look.
I am clearing the needinfo here as the issue was reported in bug 1689838
This bug is targeted to 4.4.3 and is in MODIFIED state. Can we re-target to 4.4.0 and move to QA?
Yes, it can be re-targeted. HC pre-requisites include:
1. The engine FQDN should not contain the host password.
2. Stop the geo-replication session if one is in progress.
3. Check that brick quorum is met for the volume.
Clearing needinfo as it's already targeted to 4.4.0
Moving to 4.4.1 since 4.4.0 has already been released.
Tested with ovirt-ansible-cluster-upgrade-1.2.3 and RHV Manager 4.4.1. The feature works well: it updates the cluster and proceeds to upgrade all the hosts in the cluster. As no real upgrade image is available, all testing was done with interim-build RHVH images. All the prerequisites are handled well.
This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.