Description of problem:
For customers that have multiple RHHI clusters, an Ansible-based upgrade path would be easier. The requirement is to provide an Ansible role that can be used to upgrade a cluster.

Version-Release number of selected component (if applicable):

How reproducible:
NA
We already have an oVirt role to upgrade a cluster. This needs to be tested. Moving to ON_QA to test this - https://github.com/oVirt/ovirt-ansible-cluster-upgrade/blob/master/README.md
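For reference, a minimal sketch of how this role could be wired into a playbook. The variable names engine_url, engine_user, engine_password and cluster_name are assumptions based on the role's README conventions and should be verified against the linked documentation; the Galaxy role name may also differ from what is shown here:

---
- name: Upgrade all hosts in a cluster with the oVirt cluster-upgrade role
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    engine_url: "https://engine.example.com/ovirt-engine/api"   # hypothetical engine API endpoint
    engine_user: "admin@internal"
    engine_password: "{{ vault_engine_password }}"              # hypothetical vaulted variable; keeps the secret out of the playbook
    cluster_name: "Default"
    reboot_after_upgrade: true    # variable referenced later in this bug; hosts reboot after upgrade
  roles:
    - oVirt.cluster-upgrade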
Assigning the bug back since verification failed. While running the playbook, I could see that gluster-specific roles are absent. During the upgrade, none of the gluster bricks were stopped and their PIDs were still active, even though the rhev mounts were unmounted. There should be a way to stop the gluster bricks before upgrading (see the pre-upgrade sketch after the command output below).

Filesystem                                                            Type            Size  Used  Avail Use% Mounted on
/dev/mapper/rhvh_rhsqa--grafton7--nic2-rhvh--4.3.0.5--0.20190221.0+1  ext4            786G  2.6G  744G    1% /
devtmpfs                                                              devtmpfs        126G     0  126G    0% /dev
tmpfs                                                                 tmpfs           126G   16K  126G    1% /dev/shm
tmpfs                                                                 tmpfs           126G  566M  126G    1% /run
tmpfs                                                                 tmpfs           126G     0  126G    0% /sys/fs/cgroup
/dev/mapper/rhvh_rhsqa--grafton7--nic2-var                            ext4             15G  4.2G  9.8G   31% /var
/dev/mapper/rhvh_rhsqa--grafton7--nic2-tmp                            ext4            976M  3.9M  905M    1% /tmp
/dev/mapper/rhvh_rhsqa--grafton7--nic2-home                           ext4            976M  2.6M  907M    1% /home
/dev/mapper/gluster_vg_sdc-gluster_lv_engine                          xfs             100G  6.9G   94G    7% /gluster_bricks/engine
/dev/sda1                                                             ext4            976M  253M  657M   28% /boot
/dev/mapper/gluster_vg_sdb-gluster_lv_vmstore                         xfs             4.0T   11G  3.9T    1% /gluster_bricks/vmstore
/dev/mapper/gluster_vg_sdb-gluster_lv_data                            xfs              12T  1.5T   11T   13% /gluster_bricks/data
rhsqa-grafton7-nic2.lab.eng.blr.redhat.com:/engine                    fuse.glusterfs  100G  7.9G   93G    8% /rhev/data-center/mnt/glusterSD/rhsqa-grafton7-nic2.lab.eng.blr.redhat.com:_engine
tmpfs                                                                 tmpfs            26G     0   26G    0% /run/user/0

[root@rhsqa-grafton7 ~]# pidof glusterfs
41191 38408 38286 38000
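One possible way to handle this (a sketch only, not something the role does today) would be a task that runs the gluster-provided shutdown script on a host immediately before that host is moved into maintenance, which is why it would need to be integrated into the role rather than run as a separate play. The script path is the one VDSM itself uses, as noted in a later comment on this bug:

- name: Run the gluster-provided stop-all script on the host about to enter maintenance (hypothetical task)
  ansible.builtin.command: /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
  become: true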
Hi Bipin,
As per Martin's update in https://bugzilla.redhat.com/show_bug.cgi?id=1685951, the issue is password obfuscation.
Can you please try with a password that is not part of the FQDN?
You can change the password via the engine-config tool.
Based on this, I am moving this to ON_QA.
(In reply to Gobinda Das from comment #5)
> Hi Bipin,
> As per Martin's update in https://bugzilla.redhat.com/show_bug.cgi?id=1685951,
> the issue is password obfuscation.
> Can you please try with a password that is not part of the FQDN?
> You can change the password via the engine-config tool.
> Based on this, I am moving this to ON_QA.

Hi Gobinda,
Initially the test was done with the password being a substring of the hostname, but we later got past that. The real problem here is that there are a few pre-requisites before moving an HC node into maintenance, and these were not respected while performing the cluster upgrade using these oVirt roles. All I am asking for is that there should be roles that take HC-related activities into account too.
Hi Bipin/Sas,
I just tried the ovirt-ansible-cluster-upgrade role and it works fine; my host got upgraded successfully. All gluster processes were stopped except glustereventsd. VDSM uses /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh to stop all gluster processes, but this script does not stop glustereventsd. However, since the role is run with "reboot_after_upgrade: true", this does not cause any issue, as the host is rebooted after the upgrade.

During upgrade:

[root@tendrl27 ~]# service glusterd status
Redirecting to /bin/systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/glusterd.service.d
           └─99-cpu.conf
   Active: inactive (dead) since Wed 2019-03-13 13:42:03 IST; 3min 15s ago
 Main PID: 1445 (code=exited, status=15)
   CGroup: /glusterfs.slice/glusterd.service

Mar 12 17:41:01 tendrl27.lab.eng.blr.redhat.com systemd[1]: Starting GlusterFS, a clustered file-system server...
Mar 12 17:41:04 tendrl27.lab.eng.blr.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
Mar 12 17:41:07 tendrl27.lab.eng.blr.redhat.com glusterd[1445]: [2019-03-12 12:11:07.605004] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server ...ocal bricks.
Mar 12 17:41:07 tendrl27.lab.eng.blr.redhat.com glusterd[1445]: [2019-03-12 12:11:07.793049] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server ...ocal bricks.
Mar 12 17:41:08 tendrl27.lab.eng.blr.redhat.com glusterd[1445]: [2019-03-12 12:11:08.014252] C [MSGID: 106003] [glusterd-server-quorum.c:354:glusterd_do_volume_quorum_action] 0-management: Server ...ocal bricks.
Mar 13 13:42:03 tendrl27.lab.eng.blr.redhat.com systemd[1]: Stopping GlusterFS, a clustered file-system server...
Mar 13 13:42:03 tendrl27.lab.eng.blr.redhat.com systemd[1]: Stopped GlusterFS, a clustered file-system server.
Hint: Some lines were ellipsized, use -l to show in full.
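If a reboot after the upgrade were not wanted, glustereventsd could be stopped explicitly as well. A minimal task sketch, assuming the standard glustereventsd systemd unit shipped with gluster:

- name: Stop glustereventsd explicitly (hypothetical; only relevant when reboot_after_upgrade is false)
  ansible.builtin.service:
    name: glustereventsd
    state: stopped
  become: true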
[root@tendrl26 ~]# gluster peer s
Number of Peers: 2

Hostname: tendrl27.lab.eng.blr.redhat.com
Uuid: ee92badb-d199-43f0-8092-76dc6a37ba9c
State: Peer in Cluster (Disconnected)

Hostname: tendrl25.lab.eng.blr.redhat.com
Uuid: 9373b871-cfce-41ba-a815-0b330f6975c8
State: Peer in Cluster (Connected)

[root@tendrl26 ~]# gluster v status
Status of volume: data
Gluster process                                                  TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------
Brick tendrl26.lab.eng.blr.redhat.com:/gluster_bricks/data/data  49152     0          Y       2480
Brick tendrl25.lab.eng.blr.redhat.com:/gluster_bricks/data/data  49152     0          Y       15950
Self-heal Daemon on localhost                                    N/A       N/A        Y       2660
Self-heal Daemon on tendrl25.lab.eng.blr.redhat.com              N/A       N/A        Y       9529

Task Status of Volume data
--------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: engine
Gluster process                                                      TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------------------------------
Brick tendrl26.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine  49158     0          Y       2531
Brick tendrl25.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine  49153     0          Y       15969
Self-heal Daemon on localhost                                        N/A       N/A        Y       2660
Self-heal Daemon on tendrl25.lab.eng.blr.redhat.com                  N/A       N/A        Y       9529

Task Status of Volume engine
------------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vmstore
Gluster process                                                        TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------------------
Brick tendrl26.lab.eng.blr.redhat.com:/gluster_bricks/vmstore/vmstore  49154     0          Y       2540
Brick tendrl25.lab.eng.blr.redhat.com:/gluster_bricks/vmstore/vmstore  49154     0          Y       15998
Self-heal Daemon on localhost                                          N/A       N/A        Y       2660
Self-heal Daemon on tendrl25.lab.eng.blr.redhat.com                    N/A       N/A        Y       9529

Task Status of Volume vmstore
--------------------------------------------------------------------------------------------------------
There are no active volume tasks

[root@tendrl27 ~]# pidof glusterfs
13425

[root@tendrl27 ~]# ps -ef | grep gluster
root      9740     1  0 Feb26 ?  00:02:05 python /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root      9851  9740  0 Feb26 ?  00:00:01 python /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root     13425     1  0 13:42 ?  00:00:02 /usr/sbin/glusterfs --volfile-server=tendrl27.lab.eng.blr.redhat.com --volfile-server=tendrl26.lab.eng.blr.redhat.com --volfile-server=tendrl25.lab.eng.blr.redhat.com --volfile-id=/engine /rhev/data-center/mnt/glusterSD/tendrl27.lab.eng.blr.redhat.com:_engine

I am also attaching the playbook.
Created attachment 1543547 [details] ovirt-upgrade.yml
Based on my discussion with Bipin, moving this to ON_QA for a retest.
Moving the bug back to ASSIGNED based on Bug 1689853 and Bug 1685951.
So there are 3 issues altogether:

1. HC pre-requisites are not handled well. For instance, a geo-rep session, if in progress, is not stopped - BZ 1685951
2. The HE VM is stuck in migration forever during the upgrade - BZ 1689853
3. Timeouts happen even though the host is upgraded/updated successfully - this bug, BZ 1500728

These 3 issues need to be rectified to support automated upgrade of the cluster.
(In reply to SATHEESARAN from comment #11)
> So there are 3 issues altogether:
>
> 1. HC pre-requisites are not handled well. For instance, a geo-rep session,
>    if in progress, is not stopped - BZ 1685951
> 2. The HE VM is stuck in migration forever during the upgrade - BZ 1689853
> 3. Timeouts happen even though the host is upgraded/updated successfully -
>    this bug, BZ 1500728

By "this bug" I am referring to the same bug I am commenting on - BZ 1500728.

> These 3 issues need to be rectified to support automated upgrade of the
> cluster.
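For issue 1, the geo-replication pre-requisite could in principle be covered by a task like the one below before any host enters maintenance. This is a sketch only: MASTERVOL, SLAVEHOST and SLAVEVOL are placeholders, and in practice the active sessions would first have to be discovered (for example from "gluster volume geo-replication status"):

- name: Stop an active geo-replication session before host maintenance (hypothetical task)
  ansible.builtin.command: gluster volume geo-replication MASTERVOL SLAVEHOST::SLAVEVOL stop
  run_once: true
  become: true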
Does that support fast forward upgrade? From 4.1 to 4.3?
Hi Yaniv,
Yes, it does support fast forward upgrade.
Tested with ovirt-ansible-cluster-upgrade-1.2.3 and RHV Manager 4.4.1. The feature works well: it updates the cluster and proceeds to upgrade all the hosts in the cluster. As no real upgrade image is available, all testing was done with interim-build RHVH images.

Note that this feature does not help with migrating from RHEL 7 based RHVH 4.3.z to RHEL 8 based RHVH 4.4.1; this procedure should be useful for RHVH 4.4.2+ updates.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3314