Bug 1693816 - An engine restore from a backup can fail to start due to missing custom certificates (or similar)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.Engine
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.3.5
Target Release: ---
Assignee: Yedidyah Bar David
QA Contact: Petr Matyáš
URL:
Whiteboard:
Duplicates: 1717176
Depends On:
Blocks: 1700655
Reported: 2019-03-28 17:06 UTC by Simone Tiraboschi
Modified: 2020-08-03 15:38 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The documentation for configuring SSO instructs the user to create two new files, /etc/httpd/http.keytab and /etc/httpd/conf.d/ovirt-sso.conf. engine-backup now backs up and restores these two files if they exist. This allows the restored engine to start successfully and keep its SSO configuration.
Clone Of:
Environment:
Last Closed: 2019-07-30 14:08:37 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.3?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
lleistne: testing_ack+



Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 99066 | 0 | 'None' | MERGED | packaging: engine-backup: Handle httpd SSO | 2020-11-10 14:00:05 UTC
oVirt gerrit | 99068 | 0 | 'None' | ABANDONED | packaging: engine-backup: Exclude non-configuration files | 2020-11-10 13:59:45 UTC
oVirt gerrit | 99069 | 0 | 'None' | ABANDONED | packaging: engine-backup: Handle all of /etc/httpd | 2020-11-10 13:59:45 UTC
oVirt gerrit | 99800 | 0 | 'None' | MERGED | packaging: engine-backup: Handle httpd SSO | 2020-11-10 13:59:45 UTC
oVirt gerrit | 100746 | 0 | 'None' | ABANDONED | packaging: engine-backup: Allow including hooks in backup | 2020-11-10 14:00:05 UTC

Description Simone Tiraboschi 2019-03-28 17:06:28 UTC
Description of problem:
See https://bugzilla.redhat.com/show_bug.cgi?id=1660595#c16
According to the reports, we have cases where, after the restore, the engine does not correctly accept connections due to a missing custom certificate or a broken Kerberos configuration with an external identity provider.


Version-Release number of selected component (if applicable):
4.3

How reproducible:
?

Steps to Reproduce:
1. ?
2.
3.

Actual results:
A restored engine starts but does not accept user connections.

Expected results:
- Back up and restore everything needed (including custom certificates or Kerberos configuration) to "always" (not really sure how we can be exhaustive here) be able to start the engine correctly after the restore
- Provide a kind of "safe mode" for the engine, where we are confident that at least the core set of functionality can always come up during disaster recovery, and then let the user add/restore what is missing

Additional info:

Comment 1 Yedidyah Bar David 2019-03-31 06:01:25 UTC
(In reply to Simone Tiraboschi from comment #0)
> Description of problem:
> See https://bugzilla.redhat.com/show_bug.cgi?id=1660595#c16
> According to the reports, we have cases where, after the restore, the engine
> does not correctly accept connections due to a missing custom certificate or
> a broken Kerberos configuration with an external identity provider.
> 
> 
> Version-Release number of selected component (if applicable):
> 4.3
> 
> How reproducible:
> ?

Probably always

> 
> Steps to Reproduce:
> 1. ?

I guess at least one concrete example, provided by the referenced bug 1660595,
is following the documentation:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/administration_guide/configuring_ldap_and_kerberos_for_single_sign-on

> 2.
> 3.
> 
> Actual results:
> A restored engine starts but does not accept user connections.
> 
> Expected results:
> - Back up and restore everything needed (including custom certificates or
> Kerberos configuration) to "always" (not really sure how we can be exhaustive
> here) be able to start the engine correctly after the restore

So far, we have only backed up files written by our *code*, not files the documentation tells users to create.

Another option is to change the documentation to match the way the code
works, that is, to tell users to put custom configuration in engine-specific
locations rather than system-wide ones. We need to check the specific cases, though.

> - Provide a kind of "safe mode" for the engine, where we are confident that
> at least the core set of functionality can always come up during disaster
> recovery, and then let the user add/restore what is missing
> 

This might make sense. We need a separate bug for it, and there, too, we need
to enumerate the concrete failure modes that the fail-safe mode should prevent,
and/or make it a tracker.

Comment 2 Yedidyah Bar David 2019-03-31 06:18:48 UTC
How about this:

Back up all of /etc/pki and /etc/httpd (perhaps others as well).

On restore, skip all files that are owned by an installed RPM.

Perhaps make the skipping optional, so that a user who changed a packaged
file can still say "Please overwrite packaged files with the restored
ones".

This would also have prevented cases like bug 1452182.
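A minimal sketch of the restore-side filtering proposed above, assuming the candidate paths arrive one per line on stdin and that "rpm -qf" is used to detect files owned by an installed package. The function names and the OVERWRITE_PACKAGED switch are hypothetical; this is not how engine-backup is implemented.

```shell
#!/bin/bash
# Hypothetical restore-side filter (not part of engine-backup).
# Reads candidate paths, one per line, on stdin and prints only those
# that should be restored.

# Returns 0 if some installed RPM owns the given path.
is_packaged() {
    rpm -qf "$1" >/dev/null 2>&1
}

# Skips RPM-owned files unless OVERWRITE_PACKAGED=1, which is the
# optional "please overwrite packaged files" override.
filter_restorable() {
    local path
    while IFS= read -r path; do
        [ -z "${path}" ] && continue
        if [ "${OVERWRITE_PACKAGED:-0}" = 1 ] || ! is_packaged "${path}"; then
            printf '%s\n' "${path}"
        fi
    done
}
```

For example, the list of files extracted from the backup archive could be piped into filter_restorable on the target machine. Checking ownership per file keeps the rule simple, at the cost of one rpm call per path.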

Comment 3 Yedidyah Bar David 2019-04-01 10:18:18 UTC
It's hard to decide what to do :-)

Patch 99066 just adds two files to be backed up and restored. It is simple and easy, but covers only the specific case of single sign-on configured by following the RHV docs.

Patches 99068 and 99069 are more general, but somewhat riskier, and might still not be enough in the general case. I talked with Simone privately; he asked "Why not all of /etc?", then checked which files there change on a clean machine, and we agreed we do not want to start reasoning about each and every one of them. People who want a full machine-image backup (and restore) should use other means, not engine-backup.

I don't mind adding 99066, because I think it's very low risk, but generally speaking, engine-backup should handle what engine-setup and ovirt-engine-rename change. Nothing more, nothing less.

Regarding André's point, which is basically the last statement above _plus_ "... and whatever the official documentation instructs the user to change/add": I'm not sure about that in general. As a user, when I read documentation and it is obvious to me that what I read is mainly an example, not the only way to do something, I might not follow it strictly. Simplest example: there is nothing magical about the name "/etc/httpd/http.keytab". You can use any file name you want, as long as you set GssapiCredStore to point at it; no other component expects to find or use this specific file. But if I merge 99066, we can no longer treat it as a mere example; it becomes part of the code. If a user reads the docs and decides "well, I think I prefer to call it /etc/my-local-stuff/ovirt-engine.keytab", engine-backup will not handle it. So how should we treat this? Not sure. I currently tend to think that the best solution is to update the documentation, adding there:

mkdir -p /etc/ovirt-engine-backup/engine-backup-defaults.d

cat << __EOF__ >> /etc/ovirt-engine-backup/engine-backup-defaults.d/sso.sh
BACKUP_PATHS="\${BACKUP_PATHS}
/etc/httpd/http.keytab
/etc/httpd/conf.d/ovirt-sso.conf"
__EOF__

(This works since 3.6: https://gerrit.ovirt.org/#/q/Ice63783,n,z)
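Since the keytab name is only a convention, a drop-in like the one above could instead be generated from whatever GssapiCredStore actually points at. A sketch, assuming the directive is written unquoted as "GssapiCredStore keytab:/path" in the given conf file; keytabs_from_conf and make_sso_dropin are hypothetical helpers, not part of engine-backup.

```shell
#!/bin/bash
# Hypothetical helpers (not part of engine-backup): build a
# engine-backup-defaults.d drop-in from the keytab that httpd really uses,
# instead of assuming the file name from the documentation.
# Assumes unquoted directives of the form:
#   GssapiCredStore keytab:/path/to/file.keytab

# Prints the keytab path(s) referenced by GssapiCredStore lines;
# commented-out lines do not match because they start with '#'.
keytabs_from_conf() {
    sed -n 's/^[[:space:]]*GssapiCredStore[[:space:]]\{1,\}keytab:\([^[:space:]]*\).*/\1/p' "$1" | sort -u
}

# Prints a drop-in in the same format as the heredoc above.
# ${BACKUP_PATHS} is single-quoted so it is written literally.
make_sso_dropin() {
    local conf="$1"
    printf '%s\n' 'BACKUP_PATHS="${BACKUP_PATHS}'
    keytabs_from_conf "${conf}"
    printf '%s"\n' "${conf}"
}
```

Usage would be along the lines of: make_sso_dropin /etc/httpd/conf.d/ovirt-sso.conf > /etc/ovirt-engine-backup/engine-backup-defaults.d/sso.sh. Like the heredoc, this only helps the backup side.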

Comment 4 Yedidyah Bar David 2019-04-01 10:28:27 UTC
Sandro, what do you think?

The main drawback of the approach at the end of comment 3 concerns people who already configured SSO and backups but never tested a restore.
They would benefit from 99066, but not from a doc update.

Comment 5 Yedidyah Bar David 2019-04-02 06:43:56 UTC
Sorry, that was too quick. The end of comment 3 is enough to _back up_ these files, but not to _restore_ them. For that, we'll need either a patch to engine-backup or something considerably longer than what I wrote there, which is also quite ugly, so I haven't tested it yet: something like adding a file to /etc/ovirt-engine-backup/engine-backup-config.d that redefines dump_config_for_restore, adding BACKUP_PATHS to VARS_TO_SAVE. Otherwise, you'll have to repeat the end of comment 3 on the restored machine as well, before running restore, which would be no better than adding a custom playbook for the hosted-engine restore case.
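The idea of saving BACKUP_PATHS together with the backup, so that the restore side does not need the drop-in, can be illustrated with a small sketch. dump_vars and load_vars are hypothetical stand-ins; the real dump_config_for_restore and VARS_TO_SAVE handling in engine-backup may work differently.

```shell
#!/bin/bash
# Hypothetical sketch (not engine-backup code): persist the variables
# named in VARS_TO_SAVE into a file kept with the backup, then reload
# them on restore.

VARS_TO_SAVE="BACKUP_PATHS"

# Prints one re-sourceable assignment per saved variable. printf %q
# quotes embedded newlines, so a multi-line BACKUP_PATHS survives.
dump_vars() {
    local var
    for var in ${VARS_TO_SAVE}; do
        printf '%s=%q\n' "${var}" "${!var}"
    done
}

# Sources a file written by dump_vars. Plain assignments (no declare),
# so the variables stay global even when sourced inside this function.
load_vars() {
    . "$1"
}
```

Plain NAME=value lines are used rather than declare -p output, because declare inside a sourcing function would create local variables.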

Comment 6 Sandro Bonazzola 2019-04-03 07:24:12 UTC
Let's make sure the documentation says to use exactly the file names it gives, so that backup and restore work out of the box with patch 99066.
Let's also ship those files in the RPM, with comments, to make it clear they are part of the product.

Comment 7 Steve Goodman 2019-04-16 10:26:35 UTC
I'm looking at the page referenced in comment 1 and it's not clear to me precisely what you want the documentation to say.

Could you please indicate (just paste into this bug) the specific location and text you're proposing?

Comment 8 Steve Goodman 2019-04-16 10:29:19 UTC
Also, if the doc change you are suggesting goes beyond the scope of this particular bug, would you please create a new bug so we can track it better?

Comment 9 Yedidyah Bar David 2019-04-17 05:40:12 UTC
Created doc bug 1700655, replied there.

Comment 10 Yedidyah Bar David 2019-05-07 09:10:23 UTC
Note to QE: Please do a full verification, not just of the files. E.g.:

1. Install engine
2. Run engine-setup
3. Configure SSO following the documentation
4. Backup
5. Restore to a new machine, or to a snapshot of the same one taken after step 1.
6. Make sure the engine starts and that SSO works as it did after step 3.

I did not do this myself; I only made sure that these two files are backed up and restored.
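The verification flow above could be scripted roughly as follows. This is a sketch, not a QE procedure: the engine-backup options shown (--mode, --file, --log, --provision-db, --restore-permissions) are the usual ones for a backup/restore cycle, but check engine-backup --help on your version and add, for example, --provision-dwh-db if DWH is installed. The helper names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for the full verification in steps 4-6.

SSO_FILES="/etc/httpd/http.keytab /etc/httpd/conf.d/ovirt-sso.conf"

# Step 4, on the original engine machine.
do_backup() {
    engine-backup --mode=backup --file="$1" --log="$1.log"
}

# Step 5, on the new machine (or the snapshot taken after step 1).
do_restore() {
    engine-backup --mode=restore --file="$1" --log="$1.restore.log" \
        --provision-db --restore-permissions
}

# Part of step 6: check that both SSO files came back. The optional
# ROOT prefix (default: none) only exists so the check can be run
# against a test directory.
check_sso_files() {
    local root="${1:-}" f missing=0
    for f in ${SSO_FILES}; do
        if [ ! -e "${root}${f}" ]; then
            echo "missing: ${root}${f}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

After do_restore, run engine-setup, then check_sso_files, and finally confirm by hand that a Kerberos-authenticated login still works. The file check alone is not a full verification.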

Comment 11 Petr Matyáš 2019-06-18 11:09:08 UTC
Verified on ovirt-engine-4.3.5-0.1.el7.noarch

You obviously have to create a new keytab (or copy the old one if the hostname is the same), but otherwise all the files are there and set up correctly.

Comment 12 Yedidyah Bar David 2019-06-19 09:39:18 UTC
(In reply to Petr Matyáš from comment #11)
> Verified on ovirt-engine-4.3.5-0.1.el7.noarch
> 
> You obviously have to create a new keytab (or copy the old one if the hostname
> is the same), but otherwise all the files are there and set up correctly.

Sorry, I do not follow. With the patched engine-backup, /etc/httpd/http.keytab should be backed up and restored, you should not need to manually copy it. Please clarify.

Obviously, if you use or need a different FQDN, this won't be enough. But we do want to make sure that engine-backup restore is enough to make the engine work well, especially for hosted-engine restore, which runs unattended and therefore should be bullet-proof (hopefully).

Comment 13 Petr Matyáš 2019-06-19 09:59:58 UTC
That is probably true; however, I didn't use the name http.keytab, so it wasn't copied at all.

Comment 14 Yedidyah Bar David 2019-06-19 10:28:52 UTC
So you should try again with the exact name (see above, e.g. comment 6). I already did that when verifying my patch, but still.

Comment 15 Petr Matyáš 2019-06-19 11:11:13 UTC
I don't really see any point in that, as the ovirt-sso configuration files were copied correctly and, as you said, it should work with the name 'http.keytab'.

Also, I take this as a sort of negative test case: it does not copy something that might pose a risk (a keytab with a different name).

Comment 16 Yedidyah Bar David 2019-06-20 05:45:04 UTC
(In reply to Petr Matyáš from comment #15)
> I don't really see any point in that, as the ovirt-sso configuration files were
> copied correctly and, as you said, it should work with the name 'http.keytab'.

I did not say "bug verification is bogus", and I did not move the bug back
to QE. I am saying I already did that myself. This patch merely adds two
file names to a list, no more and no less. If I had a bug in one of the
file names, or in the syntax (as I did, e.g., in the fix for bug 1471833,
which no one noticed until recently: bug 1715519 [1]; these things do
happen), I'd expect verification to catch it.
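To illustrate what "adds two file names to a list" means for the negative case discussed in comments 13 to 15, here is a hypothetical sketch (not the actual engine-backup code) of a hard-coded path list and a membership check: a keytab with any other name is simply not matched.

```shell
#!/bin/bash
# Hypothetical illustration of a hard-coded list of extra backup paths,
# in the spirit of patch 99066. Only these exact names are handled.

SSO_BACKUP_PATHS="/etc/httpd/http.keytab
/etc/httpd/conf.d/ovirt-sso.conf"

# Returns 0 if the given path is exactly one of the listed names.
is_in_backup_list() {
    local p
    while IFS= read -r p; do
        [ "${p}" = "$1" ] && return 0
    done <<< "${SSO_BACKUP_PATHS}"
    return 1
}
```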

> 
> Also, I take this as a sort of negative test case: it does not copy
> something that might pose a risk (a keytab with a different name).

I'm not sure exactly what you mean, but if you claim that hard-coding these
names is imperfect, then I agree. See the discussion above and bug 1715767.
Still, even after testing it and moving it to QE, I decided to keep the
current bug as-is, and neither revert the patch nor tell you to ignore it.
Partly because reverting would still require some work (so I'm not sure it
would be worth it), but mainly for the sake of people who already followed
the procedure and run backups without noticing that we changed the
documentation (as I decided to do in bug 1715767).

Comment 17 Simone Tiraboschi 2019-06-26 08:53:32 UTC
*** Bug 1717176 has been marked as a duplicate of this bug. ***

Comment 18 Sandro Bonazzola 2019-07-30 14:08:37 UTC
This bugzilla is included in oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

