Bug 1700021 - [RFE] engine-setup should warn and prompt if ca.pem is missing but other generated pki files exist
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.8-2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Yedidyah Bar David
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks: 1789291
Reported: 2019-04-15 15:27 UTC by nijin ashok
Modified: 2020-08-04 13:19 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Previously, if a Certificate Authority `ca.pem` file was not present, the engine-setup tool automatically regenerated all PKI files, requiring you to reinstall or re-enroll certificates for all hosts. Now, if `ca.pem` is not present but other PKI files are, engine-setup prompts you to restore ca.pem from backup without regenerating all PKI files. If a backup is present and you select this option, then you no longer need to reinstall or re-enroll certificates for all hosts.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:17:36 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 4098801 | 0 | Upgrade | None | All RHV hypervisors go non responsive after rhvm upgrade | 2019-05-01 15:53:24 UTC
Red Hat Product Errata | RHSA-2020:3247 | 0 | None | None | None | 2020-08-04 13:19:18 UTC
oVirt gerrit | 103189 | 0 | 'None' | MERGED | packaging: setup: pki: Prompt if ca.pem is missing | 2020-12-26 13:34:21 UTC

Description nijin ashok 2019-04-15 15:27:52 UTC
Description of problem:

If "ca.pem" is missing during an upgrade and engine-setup is executed, it regenerates all the certificates on the manager, which makes all the hosts in the environment "not responding". If there is no backup, the only way to recover from this situation is to "enroll" the certificates of each host, which requires downtime of the complete environment.

Setup logs show it's creating the CA.

===
2019-04-15 10:46:00,912-0400 DEBUG otopi.transaction transaction._prepare:61 preparing 'CA Transaction'
2019-04-15 10:46:00,913-0400 INFO otopi.plugins.ovirt_engine_setup.ovirt_engine.pki.ca ca._misc:711 Creating CA
2019-04-15 10:46:00,913-0400 DEBUG otopi.transaction transaction._prepare:61 preparing 'File transaction for '/etc/pki/ovirt-engine/cacert.template''
===

Certificate before/after engine-setup.

===
# openssl x509 -noout  -in /etc/pki/ovirt-engine_bak/ca.pem   -dates -subject
notBefore=Apr  7 16:51:25 2019 GMT
notAfter=Apr  5 16:51:25 2029 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com.13006

# openssl x509 -noout  -in /etc/pki/ovirt-engine/ca.pem   -dates -subject
notBefore=Apr 14 14:46:01 2019 GMT
notAfter=Apr 12 14:46:01 2029 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com.84111

# openssl x509 -noout  -in /etc/pki/ovirt-engine/certs/engine.cer   -dates -subject
notBefore=Apr 14 14:46:02 2019 GMT
notAfter=Mar 19 14:46:02 2024 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com

# openssl x509 -noout  -in /etc/pki/ovirt-engine_bak/certs/engine.cer   -dates -subject
notBefore=Apr  7 16:51:29 2019 GMT
notAfter=Mar 12 16:51:29 2024 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com
===

All the certificates on the manager side were regenerated. 
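
The before/after comparison above can also be scripted. This is only a minimal sketch, assuming the text output of "openssl x509 -noout -dates -subject" in the format shown above; the helper names parse_cert_info and was_regenerated are made up for illustration and are not part of any oVirt tooling. It treats a certificate as regenerated when its notBefore date changed.

```python
from datetime import datetime

# Format of the notBefore/notAfter values printed by `openssl x509 -dates`.
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_cert_info(openssl_output):
    """Parse the 'key=value' lines printed by `openssl x509 -noout -dates -subject`."""
    info = {}
    for line in openssl_output.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key, value = key.strip(), value.strip()
        if key in ("notBefore", "notAfter"):
            info[key] = datetime.strptime(value, OPENSSL_DATE_FORMAT)
        else:
            info[key] = value
    return info


def was_regenerated(before, after):
    """Hypothetical criterion: a changed notBefore date means a new certificate."""
    return before["notBefore"] != after["notBefore"]


# Output copied from the two ca.pem checks above (backup copy vs. current file).
old_ca = parse_cert_info(
    "notBefore=Apr  7 16:51:25 2019 GMT\n"
    "notAfter=Apr  5 16:51:25 2029 GMT\n"
    "subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com.13006\n"
)
new_ca = parse_cert_info(
    "notBefore=Apr 14 14:46:01 2019 GMT\n"
    "notAfter=Apr 12 14:46:01 2029 GMT\n"
    "subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com.84111\n"
)
print(was_regenerated(old_ca, new_ca))
```

The same check can be repeated for engine.cer and the other files under certs/ by feeding in their openssl output.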


Version-Release number of selected component (if applicable):

RHV 4.2.

How reproducible:

100 %

Steps to Reproduce:

1. Remove the ca.pem.  

2. Run engine-setup.

3. After the setup all the hosts in the environment will go "not responding".

Actual results:

A missing ca.pem during the upgrade (engine-setup) can result in the regeneration of all the certificates.

Expected results:

Even with "/etc/pki/ovirt-engine/ca.pem" missing, the engine can communicate with the hosts and most activities in RHV continue to work without any issue, so a user may not even notice that the file is missing. However, after engine-setup, the whole environment goes down, and this is a production-down scenario.

I think instead of regenerating everything, we should exit and stop the engine-setup with a valid error message if the ca.pem is missing.


Additional info:

Comment 2 Sandro Bonazzola 2019-04-17 07:08:40 UTC
How did ca.pem get lost on the engine system?

Comment 3 Yedidyah Bar David 2019-04-17 08:00:26 UTC
Just to clarify: The behavior you observe is by design, and seems to work as expected. This is how it worked "forever" (since 3.3, at least).

A missing ca.pem will definitely cause several different common flows to fail. Admittedly, this can still go unnoticed for months, if you happen to not run into such a flow.

That said, I do not mind adding a warning/prompt about this, should be easy.

Comment 4 nijin ashok 2019-04-18 14:50:58 UTC
(In reply to Sandro Bonazzola from comment #2)
> How did ca.pem get lost on the engine system?

I am not sure and I have asked the customer the same. 

I think the _only_ case in which engine-setup regenerates everything in a working setup is when "ca.pem" is missing. Please correct me if I am wrong.

(In reply to Yedidyah Bar David from comment #3)
> Just to clarify: The behavior you observe is by design, and seems to work as
> expected. This is how it worked "forever" (since 3.3, at least).
> 

Got it.

> A missing ca.pem will definitely cause several different common flows to
> fail. Admittedly, this can still go unnoticed for months, if you happen to
> not run into such a flow.
> 
> That said, I do not mind adding a warning/prompt about this, should be easy.

If we are just missing ca.pem and all other certificates are intact, then we can easily recover it by getting a copy from any of the hypervisors, since each one contains a copy. However, if engine-setup was executed, everything is regenerated and there is almost no way to recover if you don't have a backup.

Comment 5 Yedidyah Bar David 2019-04-30 09:50:39 UTC
(In reply to nijin ashok from comment #4)
> (In reply to Sandro Bonazzola from comment #2)
> > How did ca.pem get lost on the engine system?
> 
> I am not sure and I have asked the customer the same. 
> 
> I think the _only_ case in which engine-setup regenerates everything in a
> working setup is when "ca.pem" is missing. Please correct me if I am
> wrong.

engine-setup has several different ways to decide whether it needs to do something or has already done it.

For pki, the decision is indeed solely based on the existence of ca.pem.

For an example of something else (httpd configuration) that has a different check, which broke us, see bug 1558500.

> 
> (In reply to Yedidyah Bar David from comment #3)
> > Just to clarify: The behavior you observe is by design, and seems to work as
> > expected. This is how it worked "forever" (since 3.3, at least).
> > 
> 
> Got it.
> 
> > A missing ca.pem will definitely cause several different common flows to
> > fail. Admittedly, this can still go unnoticed for months, if you happen to
> > not run into such a flow.
> > 
> > That said, I do not mind adding a warning/prompt about this, should be easy.
> 
> If we are just missing ca.pem and all other certificates are intact, then we
> can easily recover it by getting a copy from any of the hypervisors, since
> each one contains a copy. However, if engine-setup was executed, everything
> is regenerated and there is almost no way to recover if you don't have a
> backup.

engine-setup should keep backups of all config files it overwrites, including pki. If it does not, please open a bug with details. That said, pki specifically is not always handled by code directly inside engine-setup; it also uses shell scripts in /usr/share/ovirt-engine/bin. These too should keep backups. Worst case, it should usually be possible to e.g. extract private/public keys from the .p12 file (or a backup of it). Obviously, this is just a workaround - if you want to be prepared for a similar next case, you should carefully test and document what you do. But it should work.

Bottom line: I am keeping current bug open, considering it low priority, changing the subject accordingly.

Comment 6 Yedidyah Bar David 2019-04-30 09:51:46 UTC
I suggest the following specific list of files to be checked - if any of them exists and ca.pem does not, warn/prompt:

/etc/pki/ovirt-engine/keys/engine_id_rsa
/etc/pki/ovirt-engine/keys/engine.p12
/etc/pki/ovirt-engine/.truststore
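
The check suggested above could look roughly like the following. This is a minimal sketch, not the code merged in gerrit 103189: the constant names and the function ca_missing_but_pki_exists are hypothetical, and only the three file paths come from the list above. The demo at the end runs against a temporary directory, so it does not touch a real /etc/pki/ovirt-engine.

```python
import os
import tempfile

# Paths taken from the list above; the names of these constants are made up.
PKI_DIR = "/etc/pki/ovirt-engine"
CA_CERT = "ca.pem"
MARKER_FILES = (
    "keys/engine_id_rsa",
    "keys/engine.p12",
    ".truststore",
)


def ca_missing_but_pki_exists(pki_dir=PKI_DIR):
    """Return True if ca.pem is absent while any of the marker files exists."""
    if os.path.exists(os.path.join(pki_dir, CA_CERT)):
        return False
    return any(
        os.path.exists(os.path.join(pki_dir, name)) for name in MARKER_FILES
    )


# Demo in a scratch directory standing in for the real PKI directory.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "keys"))
    # Nothing generated yet: a fresh setup, no reason to warn.
    fresh = ca_missing_but_pki_exists(demo_dir)
    # Another PKI file exists but ca.pem does not: the case to warn about.
    open(os.path.join(demo_dir, "keys", "engine.p12"), "w").close()
    suspicious = ca_missing_but_pki_exists(demo_dir)
    # ca.pem is present: the normal upgrade case.
    open(os.path.join(demo_dir, CA_CERT), "w").close()
    healthy = ca_missing_but_pki_exists(demo_dir)

print(fresh, suspicious, healthy)  # False True False
```

Keeping the marker list short means a fresh installation (no ca.pem and none of these files) still proceeds without any extra question.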

I don't mind adding a few files if you want, but do not see much point in making the list much longer. To get a full list of the files you might want to consider, try this:

find /etc/pki/ovirt-engine

Excluding backups:

find /etc/pki/ovirt-engine | grep -v '\.20[0-9][0-9][0-9][0-9]'

Excluding packaged files:

find /etc/pki/ovirt-engine | grep -v '\.20[0-9][0-9][0-9][0-9]' | while read -r f; do rpm -qf "$f" > /dev/null 2>&1 || echo "$f"; done

On a tiny test machine I have, last one shows 87 files. Some of them are optional (websocket-proxy, vmconsole-proxy-helper, ovn, ...), some might be gone in the future (reports, imageio-proxy).

Comment 7 nijin ashok 2019-04-30 10:37:59 UTC
(In reply to Yedidyah Bar David from comment #5)

> engine-setup should keep backups of all config files it overwrites,
> including pki. If it does not, please open a bug with details. That said,
> pki specifically is not always handled by code directly inside engine-setup,
> but also uses shell scripts in /usr/share/ovirt-engine/bin. These too should
> keep backups. Worst case, it should usually be possible to e.g. extract
> private/public keys from the .p12 file (or a backup of it). Obviously, this
> is just a workaround - if you want to be prepared for a similar next case,
> you should carefully test and document what you do. But it should work.

Thank you for the detailed explanation. In my test, I did not find a backup being taken for the certificates when they are overwritten. However, to be sure, I will redo the test and will open a new bug if no backups are taken.

Comment 8 nijin ashok 2019-05-01 17:26:26 UTC
(In reply to nijin ashok from comment #7)
> (In reply to Yedidyah Bar David from comment #5)
> 
> > engine-setup should keep backups of all config files it overwrites,
> > including pki. If it does not, please open a bug with details. That said,
> > pki specifically is not always handled by code directly inside engine-setup,
> > but also uses shell scripts in /usr/share/ovirt-engine/bin. These too should
> > keep backups. Worst case, it should usually be possible to e.g. extract
> > private/public keys from the .p12 file (or a backup of it). Obviously, this
> > is just a workaround - if you want to be prepared for a similar next case,
> > you should carefully test and document what you do. But it should work.
> 
> > Thank you for the detailed explanation. In my test, I did not find a backup
> > being taken for the certificates when they are overwritten. However, to be
> > sure, I will redo the test and will open a new bug if no backups are taken.

The whole PKI directory is not backed up. There is indeed an individual file backup for each certificate and key that the script modifies. I think these file backups will help to get the environment back, but it will be a tedious task as there are many files :)

Comment 9 Yedidyah Bar David 2019-05-02 06:03:12 UTC
(In reply to nijin ashok from comment #8)
> The whole PKI directory is not backed up. There is indeed an individual file
> backup for each certificate and key that the script modifies. I think these
> file backups will help to get the environment back, but it will be a
> tedious task as there are many files :)

Correct.

That's why I wrote "you should carefully test and document what you do".

Personally, on my test/dev machines, I do this, right after installation:

yum install -y git
cd /etc
git init
git add .
git commit -m 'basic stuff'

And then, after each significant change (e.g. updating packages that have files in /etc, manually changing files there, or running engine-setup):

git add --all .
git commit -m '$STUFF'

(where $STUFF can be 'engine-setup', but many times it's actually simply 'stuff'. Still much much better than nothing).

Without this, it is probably much much more work to find the exact list of backups for pki files, but it's still doable. All backups should be ORIGFILE.$(date +"%Y%m%d%H%M%S"). You can find the first timestamp to check by checking the engine-setup log filename, and the last by checking that log file's timestamp. All backups between these should be the ones you want.
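
The timestamp filtering described above can be sketched as follows. This is an illustration under assumptions: it relies only on the ORIGFILE.YYYYmmddHHMMSS naming mentioned above, the function name backups_in_window is made up, and the start/end times are passed in by the caller (taking them from the engine-setup log filename and the log file's modification time is left out). The timestamps are treated as naive local times.

```python
import re
from datetime import datetime

# Backups are expected to be named ORIGFILE.<YYYYmmddHHMMSS>, as described above.
BACKUP_NAME = re.compile(r"^(?P<orig>.+)\.(?P<stamp>\d{14})$")


def backups_in_window(paths, start, end):
    """Return (original_path, backup_path) pairs stamped within [start, end]."""
    found = []
    for path in paths:
        match = BACKUP_NAME.match(path)
        if match is None:
            continue  # not a backup file name
        try:
            stamp = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
        except ValueError:
            continue  # 14 digits that do not form a valid timestamp
        if start <= stamp <= end:
            found.append((match.group("orig"), path))
    return found


# Made-up file list and time window, for illustration only.
paths = [
    "/etc/pki/ovirt-engine/cert.template",
    "/etc/pki/ovirt-engine/cert.template.20200106100400",
    "/etc/pki/ovirt-engine/cert.template.20190101000000",
]
found = backups_in_window(
    paths,
    start=datetime(2020, 1, 6, 10, 0, 0),
    end=datetime(2020, 1, 6, 10, 30, 0),
)
print(found)
```

Each returned pair tells you which backup would have to be copied back over which file; review the list before restoring anything.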

Sorry for not setting needinfo earlier, about first part of comment 6. Any files you want to add there?

Comment 10 nijin ashok 2019-05-20 04:35:37 UTC
(In reply to Yedidyah Bar David from comment #9)
> Correct.
> 
> That's why I wrote "you should carefully test and document what you do".
> 
> Personally, on my test/dev machines, I do this, right after installation:
> 
> yum install -y git
> cd /etc
> git init
> git add .
> git commit -m 'basic stuff'
> 
> And then, after each significant change (e.g. updating packages that have
> files in /etc, manually changing files there, or running engine-setup):
> 
> git add --all .
> git commit -m '$STUFF' (where $STUFF can be 'engine-setup', but many times
> it's actually simply 'stuff'. Still much much better than nothing).
> 
> Without this, it is probably much much more work to find the exact list of
> backups for pki files, but it's still doable. All backups should be
> ORIGFILE.$(date +"%Y%m%d%H%M%S"). You can find the first timestamp to check
> by checking the engine-setup log filename, and the last by checking that log
> file's timestamp. All backups between these should be the ones you want.
 
Sure. I will try to put that in a KCS.

> Sorry for not setting needinfo earlier, about first part of comment 6. Any
> files you want to add there?

I think this list is good. I don't have anything else to add.

Comment 11 Yedidyah Bar David 2019-09-25 10:13:50 UTC
Not sure why 103189 was not added automatically.

Comment 12 Yedidyah Bar David 2019-10-06 12:01:02 UTC
QE: Reproduction/verification:

1. Install and setup engine
2. rm /etc/pki/ovirt-engine/ca.pem
3. engine-setup

With a previous version, PKI will be regenerated (can be seen by checking files in /etc/pki/ovirt-engine ) and all hosts will be inaccessible or something like that.

With a fixed version, user is prompted.

Comment 13 Steve Goodman 2019-10-07 08:08:37 UTC
Didi,

I edited doc text. If it's not OK, let me know.

Comment 14 Yedidyah Bar David 2019-10-07 08:28:38 UTC
I think the main change in this bug is not about allowing restoring from backup - you could do that also before. It's in prompting the user asking what to do, and defaulting to Abort, with the assumption that many users will not read it but just press Enter.
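
To illustrate the "default to Abort" idea: a minimal sketch, assuming a plain text prompt. The function name, the prompt wording and the accepted answers are all hypothetical and are not the actual engine-setup dialog. The point is only that an empty answer (just pressing Enter) must never lead to regenerating PKI.

```python
def confirm_regenerate_pki(read_answer=input):
    """Ask whether to continue; anything but an explicit yes means abort."""
    # Hypothetical prompt text, not the wording used by engine-setup.
    answer = read_answer(
        "ca.pem is missing but other PKI files exist. "
        "Regenerate all PKI files? (Yes, Abort) [Abort]: "
    )
    return answer.strip().lower() in ("y", "yes")


# Simulated answers instead of real keyboard input.
print(confirm_regenerate_pki(lambda prompt: ""))       # False: Enter means Abort
print(confirm_regenerate_pki(lambda prompt: "Abort"))  # False
print(confirm_regenerate_pki(lambda prompt: "Yes"))    # True
```

Passing the reader as a parameter keeps the sketch testable without a terminal; a real caller would use the default input().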

Comment 15 RHV bug bot 2019-12-13 13:15:26 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 16 Steve Goodman 2019-12-15 12:10:44 UTC
(In reply to Yedidyah Bar David from comment #14)
> I think the main change in this bug is not about allowing restoring from
> backup - you could do that also before. It's in prompting the user asking
> what to do, and defaulting to Abort, with the assumption that many users
> will not read it but just press Enter.

So how's this:

Previously, engine-setup automatically regenerated all PKI files if ca.pem was not present. Now, if ca.pem is not present but other PKI files are, engine-setup prompts you to restore ca.pem from backup without regenerating all PKI files. If if a backup is present and you select this option, then you no longer need to reinstall or re-enroll certificates for all hosts.

Comment 17 Yedidyah Bar David 2019-12-16 06:36:18 UTC
I think it's much better, yes. Technically it's accurate. I am not sure I like, though, the unwritten implication that the user is supposed to guess, that regenerating PKI requires reinstalling or re-enrolling certs for all hosts (which is correct). I realize that adding that will make the text longer. That's up to you, though...

Also, the last sentence starts with a double "If if".

Comment 18 RHV bug bot 2019-12-20 17:45:05 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 19 Steve Goodman 2019-12-22 13:45:13 UTC
(In reply to Yedidyah Bar David from comment #17)
> I think it's much better, yes. Technically it's accurate. I am not sure I
> like, though, the unwritten implication that the user is supposed to guess,
> that regenerating PKI requires reinstalling or re-enrolling certs for all
> hosts (which is correct). I realize that adding that will make the text
> longer. That's up to you, though...
> 
> Also, the last sentence starts with a double "If if".

How's this?

Previously, if ca.pem was not present, engine-setup automatically regenerated all PKI files, requiring you to reinstall or re-enroll certificates for all hosts. Now, if ca.pem is not present but other PKI files are, engine-setup prompts you to restore ca.pem from backup without regenerating all PKI files. If a backup is present and you select this option, then you no longer need to reinstall or re-enroll certificates for all hosts.

Comment 20 Yedidyah Bar David 2019-12-23 06:29:17 UTC
Looks good to me. Thanks!

Comment 21 Petr Matyáš 2020-01-08 12:53:18 UTC
With ca.pem missing

[root@engine ~]# ls /etc/pki/ovirt-engine/
apache-ca.pem       cert.conf      cert.template.20200106100400  database.txt.attr.old  private        serial.txt
cacert.conf         certs          cert.template.in              database.txt.old       qemu-ca.pem    serial.txt.old
cacert.template     certs-qemu     database.txt                  keys                   requests
cacert.template.in  cert.template  database.txt.attr             openssl.conf           requests-qemu

I'm not queried about PKI at all during engine-setup

          --== STORAGE CONFIGURATION ==--


          --== PKI CONFIGURATION ==--


          --== APACHE CONFIGURATION ==--

I have ovirt-engine-4.4.0-0.13.master.el7.noarch on an engine upgraded from 4.3 (and from 4.2).

Comment 22 Yedidyah Bar David 2020-01-09 06:21:03 UTC
Please attach setup log. Thanks.

Comment 23 Petr Matyáš 2020-01-09 09:24:33 UTC
I see now, I put Cancel when asked to stop services, but the question is not part of PKI section at all, which is IMO wrong.

Verified on ovirt-engine-4.4.0-0.13.master.el7.noarch

Comment 24 Yedidyah Bar David 2020-01-09 09:30:59 UTC
(In reply to Petr Matyáš from comment #23)
> I see now, I put Cancel when asked to stop services, but the question is not
> part of PKI section at all, which is IMO wrong.

It's a very simple change to move it there, if you want.

Generally speaking, we have two relevant stages there: Customization, in which we ask questions that change the behavior ("Please input this", "Please input that") and Validation, in which we only decide if it's ok to continue ("Is this ok", "Is that ok"). So I added this question to Validation. There, we do not have titles, nor a concrete order for the questions, so their order is semi-random, mostly (unless we need order for specific things).

Comment 25 RHV bug bot 2020-01-24 19:49:11 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 28 errata-xmlrpc 2020-08-04 13:17:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

