Description of problem:

If "ca.pem" is missing during an upgrade and engine-setup is executed, it regenerates all the certificates on the manager, which makes all the hosts in the environment go "not responding". If there is no backup, the only way to recover from this situation is to "enroll" the certificates of each host, which requires downtime of the complete environment.

The setup logs show that it is creating the CA.

===
2019-04-15 10:46:00,912-0400 DEBUG otopi.transaction transaction._prepare:61 preparing 'CA Transaction'
2019-04-15 10:46:00,913-0400 INFO otopi.plugins.ovirt_engine_setup.ovirt_engine.pki.ca ca._misc:711 Creating CA
2019-04-15 10:46:00,913-0400 DEBUG otopi.transaction transaction._prepare:61 preparing 'File transaction for '/etc/pki/ovirt-engine/cacert.template''
===

Certificate before/after engine-setup.

===
# openssl x509 -noout -in /etc/pki/ovirt-engine_bak/ca.pem -dates -subject
notBefore=Apr 7 16:51:25 2019 GMT
notAfter=Apr 5 16:51:25 2029 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com.13006

# openssl x509 -noout -in /etc/pki/ovirt-engine/ca.pem -dates -subject
notBefore=Apr 14 14:46:01 2019 GMT
notAfter=Apr 12 14:46:01 2029 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com.84111

# openssl x509 -noout -in /etc/pki/ovirt-engine/certs/engine.cer -dates -subject
notBefore=Apr 14 14:46:02 2019 GMT
notAfter=Mar 19 14:46:02 2024 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com

# openssl x509 -noout -in /etc/pki/ovirt-engine_bak/certs/engine.cer -dates -subject
notBefore=Apr 7 16:51:29 2019 GMT
notAfter=Mar 12 16:51:29 2024 GMT
subject= /C=US/O=Test/CN=dhcp131-76.gsslab.pnq2.redhat.com
===

All the certificates on the manager side were regenerated.

Version-Release number of selected component (if applicable):
RHV 4.2.

How reproducible:
100 %

Steps to Reproduce:
1. Remove the ca.pem.
2. Run engine-setup.
3. After the setup, all the hosts in the environment will go "not responding".

Actual results:
A missing ca.pem during the upgrade (engine-setup) can result in the regeneration of all the certificates.

Expected results:
Even with "/etc/pki/ovirt-engine_bak/ca.pem" missing, the engine can communicate with the hosts and most of the activities in RHV will continue to work without any issue. So a user may not even notice that the file is missing. However, after engine-setup, the whole environment will go down, and this is a production-down scenario. I think that instead of regenerating everything, engine-setup should stop and exit with a clear error message if ca.pem is missing.

Additional info:
How did the ca.pem get lost on the engine system?
Just to clarify: The behavior you observe is by design, and seems to work as expected. This is how it worked "forever" (since 3.3, at least). A missing ca.pem will definitely cause several different common flows to fail. Admittedly, this can still go unnoticed for months, if you happen to not run into such a flow. That said, I do not mind adding a warning/prompt about this, should be easy.
(In reply to Sandro Bonazzola from comment #2)
> How the ca.pem got lost on the engine system?

I am not sure and I have asked the customer the same.

I think the _only_ way engine-setup regenerates everything in a working setup is if "ca.pem" is missing. Please correct me if I am wrong.

(In reply to Yedidyah Bar David from comment #3)
> Just to clarify: The behavior you observe is by design, and seems to work as
> expected. This is how it worked "forever" (since 3.3, at least).

Got it.

> A missing ca.pem will definitely cause several different common flows to
> fail. Admittedly, this can still go unnoticed for months, if you happen to
> not run into such a flow.
>
> That said, I do not mind adding a warning/prompt about this, should be easy.

If we are just missing ca.pem and all other certificates are intact, then we can easily recover it by getting one from any of the hypervisors, since each contains a copy. However, if engine-setup was executed, everything will be regenerated and there is almost no way to recover if you don't have a backup.
(In reply to nijin ashok from comment #4)
> (In reply to Sandro Bonazzola from comment #2)
> > How the ca.pem got lost on the engine system?
>
> I am not sure and I have asked the customer the same.
>
> I think the _only_ possibility that engine-setup regenerate everything in a
> working setup is only if "ca.pem" is missing. Please correct me if I am
> wrong.

engine-setup has several different ways to decide if it needs to do something or has already done it. For pki, the decision is indeed solely based on the existence of ca.pem. For an example of something else (httpd configuration) that has a different check, which broke us, see bug 1558500.

> (In reply to Yedidyah Bar David from comment #3)
> > Just to clarify: The behavior you observe is by design, and seems to work as
> > expected. This is how it worked "forever" (since 3.3, at least).
> >
>
> Got it.
>
> > A missing ca.pem will definitely cause several different common flows to
> > fail. Admittedly, this can still go unnoticed for months, if you happen to
> > not run into such a flow.
> >
> > That said, I do not mind adding a warning/prompt about this, should be easy.
>
> If we are just missing ca.pem and all other certificates are intact, then we
> can easily recover it by getting one from any of the hypervisor since it
> contains a copy. However, if the engine-setup was executed, everything will
> be regenerated and it's almost no way to recover if you don't have a backup.

engine-setup should keep backups of all config files it overwrites, including pki. If it does not, please open a bug with details. That said, pki specifically is not always handled by code directly inside engine-setup, but also uses shell scripts in /usr/share/ovirt-engine/bin. These too should keep backups. Worst case, it should usually be possible to e.g. extract private/public keys from the .p12 file (or a backup of it).
Obviously, this is just a workaround - if you want to be prepared for a similar next case, you should carefully test and document what you do. But it should work. Bottom line: I am keeping current bug open, considering it low priority, changing the subject accordingly.
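For reference, the "extract private/public keys from the .p12 file" workaround mentioned above could look roughly like the sketch below. It is only an illustration, not something engine-setup ships: the function name extract_from_p12 and the P12_PASS environment variable are made up here, and it assumes you know the keystore password.

```shell
#!/bin/sh
# Hypothetical helper (not part of ovirt-engine).
# Usage: P12_PASS=... extract_from_p12 P12FILE OUTDIR
# Writes OUTDIR/cert.pem (certificate) and OUTDIR/key.pem (private key).
# Assumption: the keystore password is exported in P12_PASS.
extract_from_p12() {
    p12="$1"
    out="$2"
    mkdir -p "$out" || return 1

    # Certificate only, without the private key.
    openssl pkcs12 -in "$p12" -nokeys -clcerts \
        -passin env:P12_PASS -out "$out/cert.pem" || return 1

    # Private key only. It is written unencrypted, so restrict the
    # permissions of the new file with a tight umask in a subshell.
    (
        umask 077
        openssl pkcs12 -in "$p12" -nocerts -nodes \
            -passin env:P12_PASS -out "$out/key.pem"
    ) || return 1
}
```

Run it against a copy of the keystore (for example a copy of keys/engine.p12) and write to a scratch directory, so that nothing under /etc/pki/ovirt-engine is touched while you test.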
I suggest the following specific list of files to be checked - if any of them exists and ca.pem does not, warn/prompt:

/etc/pki/ovirt-engine/keys/engine_id_rsa
/etc/pki/ovirt-engine/keys/engine.p12
/etc/pki/ovirt-engine/.truststore

I don't mind adding a few files if you want, but do not see much point in making the list much longer.

To get a full list of the files you might want to consider, try this:

find /etc/pki/ovirt-engine

Excluding backups:

find /etc/pki/ovirt-engine | grep -v '\.20[0-9][0-9][0-9][0-9]'

Excluding packaged files:

find /etc/pki/ovirt-engine | grep -v '\.20[0-9][0-9][0-9][0-9]' | while read -r f; do rpm -qf "$f" > /dev/null 2>&1 || echo "$f"; done

On a tiny test machine I have, the last one shows 87 files. Some of them are optional (websocket-proxy, vmconsole-proxy-helper, ovn, ...), some might be gone in the future (reports, imageio-proxy).
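To make the proposed check concrete, here is a minimal shell sketch of the logic "ca.pem is missing, but at least one of the listed files exists". The real fix would live in the engine-setup plugin code, not in shell; the function name pki_exists_without_ca is hypothetical and only the three file names come from the list above.

```shell
#!/bin/sh
# Hypothetical helper, not the actual engine-setup code.
# Returns 0 (true) when ca.pem is missing although other PKI files show
# that PKI was already set up; returns 1 otherwise.
pki_exists_without_ca() {
    pki_dir="${1:-/etc/pki/ovirt-engine}"

    # ca.pem is present: nothing suspicious.
    [ -e "$pki_dir/ca.pem" ] && return 1

    # ca.pem is missing: look for any of the files from the suggested list.
    for f in keys/engine_id_rsa keys/engine.p12 .truststore; do
        if [ -e "$pki_dir/$f" ]; then
            echo "WARNING: $pki_dir/ca.pem is missing but $pki_dir/$f exists" >&2
            return 0
        fi
    done

    # Nothing found: looks like a fresh system, creating the CA is fine.
    return 1
}
```

A caller could then do something like `pki_exists_without_ca && exit 1` before any step that would create a new CA. On a fresh install none of the three files exist, so the check stays silent.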
(In reply to Yedidyah Bar David from comment #5)

> engine-setup should keeps backups of all config files it overwrites,
> including pki. If it does not, please open a bug with details. That said,
> pki specifically is not always handled by code directly inside engine-setup,
> but also uses shell scripts in /usr/share/ovirt-engine/bin. These too should
> keep backups. Worst case, it should usually be possible to e.g. extract
> private/public keys from the .p12 file (or a backup of it). Obviously, this
> is just a workaround - if you want to be prepared for a similar next case,
> you should carefully test and document what you do. But it should work.

Thank you for the detailed explanation. In my test, I did not find a backup being taken for the certificates when they are overwritten. However, to be sure, I will redo the test and will open a new bug if no backups are taken.
(In reply to nijin ashok from comment #7)
> (In reply to Yedidyah Bar David from comment #5)
>
> > engine-setup should keeps backups of all config files it overwrites,
> > including pki. If it does not, please open a bug with details. That said,
> > pki specifically is not always handled by code directly inside engine-setup,
> > but also uses shell scripts in /usr/share/ovirt-engine/bin. These too should
> > keep backups. Worst case, it should usually be possible to e.g. extract
> > private/public keys from the .p12 file (or a backup of it). Obviously, this
> > is just a workaround - if you want to be prepared for a similar next case,
> > you should carefully test and document what you do. But it should work.
>
> Thank you for the detailed explanation. In my test, I don't find a backup
> being taken for the certificates when it is overwritten. However, to be
> sure, I will redo the test and will open a new bug if no backups are taken.

The whole PKI directory is not backed up. There is indeed an individual file backup for each certificate and key that the script modifies. I think these file backups will help to get the environment back, but it will be a tedious task as there are many files :)
(In reply to nijin ashok from comment #8)
> The whole PKI directory is not backed up. It indeed has an individual file
> backup for each certificate and Keys which the script is modifying. I think
> these file backup will help to get the environment back but will be a
> tedious task as there are many files :)

Correct.

That's why I wrote "you should carefully test and document what you do".

Personally, on my test/dev machines, I do this, right after installation:

yum install -y git
cd /etc
git init
git add .
git commit -m 'basic stuff'

And then, after each significant change (e.g. updating packages that have files in /etc, manually changing files there, or running engine-setup):

git add --all .
git commit -m '$STUFF'

(where $STUFF can be 'engine-setup', but many times it's actually simply 'stuff'. Still much much better than nothing).

Without this, it is probably much much more work to find the exact list of backups for pki files, but it's still doable. All backups should be ORIGFILE.$(date +"%Y%m%d%H%M%S"). You can find the first timestamp to check by checking the engine-setup log filename, and the last by checking that log file's timestamp. All backups between these should be the ones you want.

Sorry for not setting needinfo earlier, about first part of comment 6. Any files you want to add there?
(In reply to Yedidyah Bar David from comment #9)
> Correct.
>
> That's why I wrote "you should carefully test and document what you do".
>
> Personally, on my test/dev machines, I do this, right after installation:
>
> yum install -y git
> cd /etc
> git init
> git add .
> git commit -m 'basic stuff'
>
> And then, after each significant change (e.g. updating packages that have
> files in /etc, manually changing files there, or running engine-setup):
>
> git add --all .
> git commit -m '$STUFF' (where $STUFF can be 'engine-setup', but many times
> it's actually simply 'stuff'. Still much much better than nothing).
>
> Without this, it is probably much much more work to find the exact list of
> backups for pki files, but it's still doable. All backups should be
> ORIGFILE.$(date +"%Y%m%d%H%M%S"). You can find the first timestamp to check
> by checking the engine-setup log filename, and the last by checking that log
> file's timestamp. All backups between these should be the ones you want.

Sure. I will try to put that in a KCS.

> Sorry for not setting needinfo earlier, about first part of comment 6. Any
> files you want to add there?

I think this list is good. I don't have anything else to add.
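As an illustration of the "all backups between these timestamps" idea, the following sketch lists files named ORIGFILE.YYYYmmddHHMMSS whose suffix falls inside a given window. The function name list_backups is hypothetical, and the start/end values are expected in the same 14-digit format as the backup suffix.

```shell
#!/bin/sh
# Hypothetical helper for locating the backups written by one engine-setup run.
# Usage: list_backups DIR START END
# Prints files under DIR named ORIGFILE.YYYYmmddHHMMSS whose timestamp
# suffix lies between START and END (inclusive).
list_backups() {
    dir="$1"
    start="$2"
    end="$3"
    # Match names ending in a dot followed by exactly 14 digits.
    find "$dir" -type f \
        -name '*.[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' |
    while IFS= read -r f; do
        ts="${f##*.}"    # text after the last dot, i.e. the timestamp
        if [ "$ts" -ge "$start" ] && [ "$ts" -le "$end" ]; then
            printf '%s\n' "$f"
        fi
    done | sort
}
```

START would be taken from the timestamp in the engine-setup log file name, and END from that log file's modification time (with GNU date, something like `date -r "$log" +%Y%m%d%H%M%S`, where $log is whatever setup log you are looking at). For example, `list_backups /etc/pki/ovirt-engine 20190415104600 20190415105000` would print only the backups written during that four-minute window.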
Not sure why 103189 was not added automatically.
QE: Reproduction/verification:

1. Install and setup engine
2. rm /etc/pki/ovirt-engine/ca.pem
3. engine-setup

With a previous version, PKI will be regenerated (can be seen by checking files in /etc/pki/ovirt-engine ) and all hosts will be inaccessible or something like that.

With a fixed version, user is prompted.
Didi, I edited doc text. If it's not OK, let me know.
I think the main change in this bug is not about allowing restoring from backup - you could do that also before. It's in prompting the user asking what to do, and defaulting to Abort, with the assumption that many users will not read it but just press Enter.
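The "default to Abort so that pressing Enter is safe" behavior can be sketched in shell as below. This is not the engine-setup dialog code (that lives in the setup plugins); the function name confirm_pki_renew, the prompt text and the answer "Renew" are all hypothetical.

```shell
#!/bin/sh
# Hypothetical prompt illustrating a safe default.
# Reads one line from stdin. Only an explicit "Renew" proceeds; empty
# input (just Enter), end of input, or anything else means Abort.
confirm_pki_renew() {
    printf 'ca.pem is missing but other PKI files exist. (Abort, Renew) [Abort]: ' >&2
    IFS= read -r answer || answer=""
    case "$answer" in
        Renew|renew)
            return 0
            ;;
        *)
            echo "Aborting." >&2
            return 1
            ;;
    esac
}
```

The point of the design is that the destructive choice has to be typed out, so a user who does not read the question and just presses Enter ends up with an aborted setup and an intact PKI.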
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
(In reply to Yedidyah Bar David from comment #14)
> I think the main change in this bug is not about allowing restoring from
> backup - you could do that also before. It's in prompting the user asking
> what to do, and defaulting to Abort, with the assumption that many users
> will not read it but just press Enter.

So how's this:

Previously, engine-setup automatically regenerated all PKI files if ca.pem was not present. Now, if ca.pem is not present but other PKI files are, engine-setup prompts you to restore ca.pem from backup without regenerating all PKI files. If if a backup is present and you select this option, then you no longer need to reinstall or re-enroll certificates for all hosts.
I think it's much better, yes. Technically it's accurate. I am not sure I like, though, the unwritten implication that the user is supposed to guess, that regenerating PKI requires reinstalling or re-enrolling certs for all hosts (which is correct). I realize that adding that will make the text longer. That's up to you, though... Also, the last sentence starts with a double "If if".
(In reply to Yedidyah Bar David from comment #17)
> I think it's much better, yes. Technically it's accurate. I am not sure I
> like, though, the unwritten implication that the user is supposed to guess,
> that regenerating PKI requires reinstalling or re-enrolling certs for all
> hosts (which is correct). I realize that adding that will make the text
> longer. That's up to you, though...
>
> Also, the last sentence starts with a double "If if".

How's this?

Previously, if ca.pem was not present, engine-setup automatically regenerated all PKI files, requiring you to reinstall or re-enroll certificates for all hosts. Now, if ca.pem is not present but other PKI files are, engine-setup prompts you to restore ca.pem from backup without regenerating all PKI files. If a backup is present and you select this option, then you no longer need to reinstall or re-enroll certificates for all hosts.
Looks good to me. Thanks!
With ca.pem missing

[root@engine ~]# ls /etc/pki/ovirt-engine/
apache-ca.pem       cert.conf      cert.template.20200106100400  database.txt.attr.old  private        serial.txt
cacert.conf         certs          cert.template.in              database.txt.old       qemu-ca.pem    serial.txt.old
cacert.template     certs-qemu     database.txt                  keys                   requests
cacert.template.in  cert.template  database.txt.attr             openssl.conf           requests-qemu

I'm not queried about PKI at all during engine-setup

--== STORAGE CONFIGURATION ==--

--== PKI CONFIGURATION ==--

--== APACHE CONFIGURATION ==--

I have ovirt-engine-4.4.0-0.13.master.el7.noarch on an engine upgraded from 4.3 (and from 4.2).
Please attach setup log. Thanks.
I see now, I put Cancel when asked to stop services, but the question is not part of PKI section at all, which is IMO wrong. Verified on ovirt-engine-4.4.0-0.13.master.el7.noarch
(In reply to Petr Matyáš from comment #23)
> I see now, I put Cancel when asked to stop services, but the question is not
> part of PKI section at all, which is IMO wrong.

It's a very simple change to move it there, if you want.

Generally speaking, we have two relevant stages there: Customization, in which we ask questions that change the behavior ("Please input this", "Please input that") and Validation, in which we only decide if it's ok to continue ("Is this ok", "Is that ok"). So I added this question to Validation. There, we do not have titles, nor a concrete order for the questions, so their order is semi-random, mostly (unless we need order for specific things).
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247