Red Hat Bugzilla – Bug 1466103
[RFE] Improve Hosted-Engine Upgrade Processes
Last modified: 2017-09-28 04:38:04 EDT
Description of problem:
This is not an RFE per se. I just want to list some of the challenges customers face when doing Hosted-Engine upgrades, and explain how complex it is to fix broken upgrades when things go sideways.
A) 3.5 to 3.6 with EL6 to EL7
It has 11 steps, with too many things to check and too many things that can go sideways.
This is the upgrade path:
EL6 (ha-v1.2) -> EL7 (ha-v1.2) -> EL7 (ha-v1.3)
And now with 3.6 NGN, we can add another one:
EL6 (ha-v1.2) -> EL7 (ha-v1.2) -> EL7 (ha-v1.3) -> NGN (ha-v1.3)
That's 2 full re-installs and 1 upgrade. There is no way to skip any of these steps, because we need to trigger the HE SD upgrade, and that requires adding a host with ha-v1.2 (3.5) and upgrading it (not wiping and re-installing it) to ha-v1.3 (3.6). So we can't do EL7 (ha-v1.2) -> NGN (ha-v1.3), as it requires a wipe and re-install.
Not to mention EL6 (ha-v1.2) -> NGN (ha-v1.3) would be the ideal path.
B) 3.5 to 3.6 with EL7 to EL7
This is the upgrade path:
EL7 (ha-v1.2) -> EL7 (ha-v1.3) -> NGN (ha-v1.3)
Again, to trigger the HE SD upgrade, we cannot do EL7 (ha-v1.2) -> NGN (ha-v1.3).
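The constraint behind paths A and B can be sketched as a small search over host states. This is only a simplified model of what is described above, not anything taken from the hosted-engine code: the state list, `step_kind` and `shortest_path` are hypothetical names, and the rule assumed here is that a wipe and re-install can change the platform but must keep the ha version, while an in-place upgrade can raise the ha version but must keep the platform.

```python
from collections import deque

# Host states named in this report: (platform, ovirt-hosted-engine-ha version).
# Assumption: no other platform/ha combinations are considered.
STATES = {("EL6", "1.2"), ("EL7", "1.2"), ("EL7", "1.3"), ("NGN", "1.3")}
HA_ORDER = ["1.2", "1.3"]


def step_kind(src, dst):
    """Classify one step, or return None if this simplified model disallows it."""
    (src_os, src_ha), (dst_os, dst_ha) = src, dst
    if src_os != dst_os and src_ha == dst_ha:
        # Platform change: wipe and re-install, which does not trigger the HE SD upgrade.
        return "reinstall"
    if src_os == dst_os and HA_ORDER.index(dst_ha) > HA_ORDER.index(src_ha):
        # Same platform, newer ha: in-place upgrade, which triggers the HE SD upgrade.
        return "upgrade"
    return None


def shortest_path(start, goal):
    """Breadth-first search for the shortest allowed sequence of host states."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(STATES - seen):
            if step_kind(path[-1], nxt):
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path(("EL6", "1.2"), ("NGN", "1.3"))
kinds = [step_kind(a, b) for a, b in zip(path, path[1:])]
print(path)
print(kinds)
```

Under these assumptions the only route from EL6 (ha-v1.2) to NGN (ha-v1.3) passes through both EL7 states, and the direct EL7 (ha-v1.2) -> NGN (ha-v1.3) step is rejected because it would change platform and ha version at the same time.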
C) 3.6 to 4.0 on broken environments (not correctly upgraded to 3.6)
This is more common than one would think. Just one example below:
- Engine on 3.6, DC already at 3.6 level
- Hosted-Engine SD on 3.5 format, never auto-imported (we check this now during upgrade to 4.0, good, but some people ignore warnings)
- Hosts upgraded to ha-2.0
How do we get the HE SD level upgraded if it needs a v1.2 -> v1.3 upgrade? Adding a fresh RHEV-H 3.5 (v1.2) fails as it doesn't meet the DC level requirements. Adding a RHEV-H 3.6 (v1.3) fails as it can't find the shared conf on storage. We end up mixing packages on RHEL hosts or trying to trigger the HE SD upgrade, but this does not always succeed. Sometimes we have to build a new 4.0 HE from scratch and set it up manually, restoring the 3.6 engine-backup. It is a sort of manual upgrade that builds a completely new HE SD. This is far from the documented and tested upgrade paths.
* This situation is clearly reached by not following the documentation. But note that the upgrade paths are not simple, and it's not too hard to end up in these situations. The problem is that once we are off the correct path, it's usually hard and time-consuming to recover, and these can be production environments. If things go wrong, it should be easier to recover. Perhaps we should do more tests with upgrades on environments that are not exactly 100% healthy.
As some suggestions, I think all these upgrades and troubleshooting would be much easier if:
1. we could do ovirt-hosted-engine-setup using any version against any format of HE SD
This would, for example, allow us to wipe a host with el6 (ha-v1.2, HE SD 3.5) and directly install NGN (ha-v1.3). In this specific scenario we would need to copy answers.conf from the initial host, but in the future it should already be on shared storage. (I tried this; it sort of works but doesn't trigger the HE SD upgrade.) Or, in one of those badly upgraded 3.5 to 3.6 environments that fail in the middle of an upgrade to 4.0, we could add a fresh host with ha-v2.0 to a 3.5 format HE (probably copying answers.conf as well) and trigger the necessary upgrades.
2. the upgrades were manually triggered/controlled, independently of the ovirt-ha packages. A ha-v2.1 host should be capable of adding a 3.5 format HE and triggering the necessary upgrades/modifications all the way to 4.1 format.
3. HE SD format managed and enforced by the ovirt-engine. If the DC level is changed to 3.6, the HE SD must be enforced in 3.6 format. This would save several customers from trouble.
4. Things like OVEHOSTED_STORAGE/storageDomainName should be automatic. This option is involved in far too many support cases.
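Suggestion 3 could look roughly like the check below. This is a hypothetical sketch only: `parse_level` and `check_dc_level_change` are made-up names, not ovirt-engine functions, and it assumes the DC level and the HE SD format can both be expressed as dotted version strings such as "3.5" and "3.6".

```python
def parse_level(level):
    """Turn a dotted level such as "3.6" into a comparable tuple (3, 6)."""
    return tuple(int(part) for part in level.split("."))


def check_dc_level_change(new_dc_level, he_sd_format):
    """Hypothetical guard: refuse a DC level that is newer than the HE SD format.

    Returns (ok, message). A real implementation might upgrade the HE SD
    instead of refusing; that choice is outside this sketch.
    """
    if parse_level(he_sd_format) < parse_level(new_dc_level):
        return False, (
            "HE SD is in %s format; upgrade it before raising the DC level to %s"
            % (he_sd_format, new_dc_level)
        )
    return True, "HE SD format %s satisfies DC level %s" % (he_sd_format, new_dc_level)


# The broken environment from case C: DC moving to 3.6 while the HE SD is still 3.5.
print(check_dc_level_change("3.6", "3.5"))
print(check_dc_level_change("3.6", "3.6"))
```

With a guard of this kind, the state described in case C (DC at 3.6, HE SD still in 3.5 format) would be stopped at the moment the DC level is changed, rather than being discovered later during the upgrade to 4.0.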
It's probably too late for any improvements to the current 3.6/4.0 code. But I hope some of this can be taken into consideration for future planning of hosted-engine upgrade paths, so that when the next EL bump comes we can do simpler upgrades and failures are easier to recover from.
I did quite a bit of testing, trying to do EL6 (HE 3.5) to EL (NGN - HE 3.6) in one jump by using InClusterUpgrade, removing a patch with some el6 checks and using answers.conf from the previous host. I got further than I thought I would. If this data is useful, please let me know and I can upload it.