Bug 1660595
| Summary: | Hosted Engine Deploy fails with SSO authentication errors | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Anitha Udgiri <audgiri> |
| Component: | ovirt-hosted-engine-setup | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED ERRATA | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.5 | CC: | andre.liebe, audgiri, didi, guillaume.pavese, lsurette, lsvaty, mtessun, sbonazzo, sborella, sgoodman, stirabos, tamay.mueller, um1 |
| Target Milestone: | ovirt-4.3.3 | Keywords: | Triaged |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-ansible-hosted-engine-setup-1.0.14 | Doc Type: | Bug Fix |
| Doc Text: | During a self-hosted engine deployment, SSO authentication errors may occur stating that a valid profile cannot be found in credentials and to check the logs for more details. The interim workaround is to retry the authentication attempt more than once. See BZ#1695523 for a specific example involving Kerberos SSO and engine-backup. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-05-08 12:32:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Anitha Udgiri
2018-12-18 18:54:10 UTC
This has been opened on 4.2.5. Does this reproduce with 4.2.7 too?

(In reply to Sandro Bonazzola from comment #6)
> This has been opened on 4.2.5. Does this reproduce with 4.2.7 too?

Sandro, here is the customer's response (translated): I was able to continue with the deployment:
1. Correct the DNS entry (the second octet was incorrect in DNS).
2. Clean up the previous hosted-engine installation.
3. Clean /var/tmp on the host where the rhv-m image was.
4. Retry the installation; everything went OK.
Everything on RHV version 4.2.7.

Moving to 4.3.2, as this has not been identified as a blocker for 4.3.1.

*** Bug 1664123 has been marked as a duplicate of this bug. ***

Hi, I'm using oVirt 4.3.1 and I'm facing the same issue when trying to deploy the self-hosted engine on my server. It blocks me from hosting the engine, at the same step and with the same error initially reported.

(In reply to Umashankar from comment #11)
> Hi, I'm using oVirt 4.3.1 and I'm facing the same issue when trying to deploy
> the self-hosted engine on my server. It blocks me from hosting the engine,
> at the same step and with the same error initially reported.

For now I can only suggest simply trying again: the issue is not systematic at all.

*** Bug 1674540 has been marked as a duplicate of this bug. ***

Okay, I tried again and failed again:
- ovirt-hosted-engine-cleanup
- rm -rf /var/tmp*
- re-run: hosted-engine --deploy --restore-from-file=/mnt/backups/engine/ovirt-engine-backup-full.tar.gz

It fails at the same step. The host is up to date on the current 4.3.2:
ovirt-hosted-engine-ha-2.3.1-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.13-1.el7.noarch

André, can you please try locally applying https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/pull/149/files on your /usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/create_target_vm/03_hosted_engine_final_tasks.yml ? Honestly, I never managed to reproduce this in a systematic way.
patch -u -p1 < /root/093f02a.patch
patching file tasks/create_target_vm/03_hosted_engine_final_tasks.yml
Hunk #1 succeeded at 321 (offset -3 lines).

But it fails again. I tried to authenticate against the temporarily reachable web GUI on https://lvh3:6900/hosted-engine, but failed (like the ansible script). While looking through engine.log I found a major problem, which may be causing the trouble:

2019-03-26 10:21:05,136+01 ERROR [org.ovirt.engine.core.sso.utils.SsoExtensionsManager] (ServerService Thread Pool -- 49) [] Could not load extension based on configuration file '/etc/ovirt-engine/extensions.d/kerberos-http-authn.properties'. Please check the configuration file is valid. Exception message is: Error loading extension 'kerberos-http-authn': The module 'org.ovirt.engine-extensions.aaa.misc' cannot be loaded: org.ovirt.engine-extensions.aaa.misc
2019-03-26 10:21:05,136+01 ERROR [org.ovirt.engine.core.sso.utils.SsoExtensionsManager] (ServerService Thread Pool -- 49) [] Could not load extension based on configuration file '/etc/ovirt-engine/extensions.d/kerberos-http-mapping.properties'. Please check the configuration file is valid. Exception message is: Error loading extension 'kerberos-http-mapping': The module 'org.ovirt.engine-extensions.aaa.misc' cannot be loaded: org.ovirt.engine-extensions.aaa.misc
...
2019-03-26 10:21:05,575+01 WARN [org.ovirt.engineextensions.aaa.ldap.Framework] (ServerService Thread Pool -- 49) [] Error while connecting to 'ucs1.lab.gematik.de': LDAPException(resultCode=82 (local error), errorMessage='The connection reader was unable to successfully complete TLS negotiation: SSLHandshakeException(sun.security.validator.ValidatorException: No trusted certificate found), ldapSDKVersion=4.0.7, revision=b28fb50058dfe2864171df2448ad2ad2b4c2ad58')
2019-03-26 10:21:05,575+01 WARN [org.ovirt.engineextensions.aaa.ldap.AuthnExtension] (ServerService Thread Pool -- 49) [] [ovirt-engine-extension-aaa-ldap.authn::lab.gematik.de-authn] Cannot initialize LDAP framework, deferring initialization. Error: The connection reader was unable to successfully complete TLS negotiation: SSLHandshakeException(sun.security.validator.ValidatorException: No trusted certificate found), ldapSDKVersion=4.0.7, revision=b28fb50058dfe2864171df2448ad2ad2b4c2ad58

Side note: the engine was previously configured with aaa to authenticate through Kerberos and LDAPS against a Domain Controller:
- The internal CA certificate was deployed manually to /etc/pki/ca-trust/source/anchors/internal-ca.pem and installed globally with update-ca-trust extract.
- Kerberos was configured (krb5.keytab was deployed to /etc/httpd/http.keytab, httpd configuration extensions, etc.) according to https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/administration_guide/configuring_ldap_and_kerberos_for_single_sign-on

(In reply to André Liebe from comment #16)
> Side note: the engine was previously configured with aaa to authenticate
> through Kerberos and LDAPS against a Domain Controller [...]

Didi, are we confident that we are also correctly covering such cases in engine-backup?

(In reply to Simone Tiraboschi from comment #17)
> Didi, are we confident that we are also correctly covering such cases in
> engine-backup?

We do not. Please open a bug, thanks. That said, I'm not sure where the border is between "engine backup" and "engine machine backup". Users can have all kinds of local modifications (backup agents, monitoring, whatever) that we do not backup/restore.

(In reply to Yedidyah Bar David from comment #18)
> We do not. Please open a bug, thanks. That said, I'm not sure where the
> border is between "engine backup" and "engine machine backup". Users can
> have all kinds of local modifications (backup agents, monitoring, whatever)
> that we do not backup/restore.

Yes, of course we cannot cover every possible user change without really taking a VM snapshot or something like that.
I think that we should instead probably focus more on a kind of safe mode for the engine, where we are sure that the engine can always start with bare minimal functionality, letting the user then fix what is still missing.

Normally I would have set up/prepared the virtual machine myself, as before the ansible setup became the one and only option. From my point of view, engine-backup should at least contain all the necessary files it was configured with, if the file/folder path was suggested by the documentation [1], [2]. Or at least a strong warning should go into every customization part of the documentation that will break the restore procedure.

[1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/administration_guide/configuring_ldap_and_kerberos_for_single_sign-on
[2] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/administration_guide/appe-red_hat_enterprise_virtualization_and_ssl

So, what's the best way to include a customization step (show SSH connection details and wait for user interaction to continue) with ansible, so I can customize the new engine VM after everything is installed but before starting the engine services within the new VM?

(In reply to André Liebe from comment #21)
> So, what's the best way to include a customization step (show SSH connection
> details and wait for user interaction to continue) with ansible [...]?

Simone and I discussed this recently, but I do not remember the conclusion. IMO you can already do that in principle, because we do ask questions after the engine is already up, e.g. about storage. So when prompted, you can find the local IP address of the engine VM (it will be in libvirt's default network), ssh there and/or connect to the web admin UI, customize stuff, then reply to the question prompt.

I agree that we should make this more user-friendly, and we also discussed allowing this to be done seamlessly using an 'ssh -w' tunnel, so that you can connect to the engine web UI right from your laptop. Simone - any more details? Do we have a bug for this?

Hmm, isn't it already too late once the web UI is available? The CA certificate needs to be deployed before the engine/WildFly is started, so that it is able to connect to LDAPS (or to a remote PostgreSQL with TLS). Simone, could you help me out with an ansible patch that waits for user interaction after setup?

(In reply to Yedidyah Bar David from comment #22)
> I agree that we should make this more user-friendly, and we also discussed
> allowing this to be done seamlessly using an 'ssh -w' tunnel, so that you
> can connect to the engine web UI right from your laptop. Simone - any more
> details? Do we have a bug for this?

Yes, and the ssh tunnel to reach the engine over the bootstrap VM is already there now. But this is a different case: here the user has to customise the engine VM after engine-backup but before engine-setup, and we already have a hook mechanism for that. Creating a custom ansible tasks file with all the missing steps and saving it under /usr/share/ansible/roles/ovirt.hosted_engine_setup/hooks/enginevm_before_engine_setup will be enough here.

(In reply to André Liebe from comment #23)
> Hmm, isn't it already too late once the web UI is available?

Yes, exactly.

(In reply to Yedidyah Bar David from comment #18)
> We do not. Please open a bug, thanks. That said, I'm not sure where the
> border is between "engine backup" and "engine machine backup". Users can
> have all kinds of local modifications (backup agents, monitoring, whatever)
> that we do not backup/restore.

Done: https://bugzilla.redhat.com/1693816

(In reply to Simone Tiraboschi from comment #24)
> But this is a different case: here the user has to customise the engine VM
> after engine-backup but before engine-setup, and we already have a hook
> mechanism for that. [...]

In theory this is enough; in practice it requires lots of testing (also routinely, on new versions) to make sure such a playbook keeps working as expected. IMO we should also (perhaps optionally) prompt between restore and setup, saying "Restore finished. Press Enter when ready to continue and run Setup."

(In reply to André Liebe from comment #23)
> Hmm, isn't it already too late once the web UI is available? The CA
> certificate needs to be deployed before the engine/WildFly is started, so
> that it is able to connect to LDAPS (or to a remote PostgreSQL with TLS).

OK, I agree. But in this specific case, if the version used to take the backup and the version used during restore are identical, engine-setup should not need to do very much, and it's probably safe to simply try manually fixing what's needed and then run it again manually. That is, if we indeed prompt at that step (instead of aborting).

(In reply to Yedidyah Bar David from comment #28)
> IMO we should also (perhaps optionally) prompt between restore and setup,
> saying "Restore finished. Press Enter when ready to continue and run Setup."

Unfortunately, we cannot easily pause Ansible execution in the middle. In theory we have two ways to freeze Ansible execution:
https://docs.ansible.com/ansible/latest/modules/pause_module.html
https://docs.ansible.com/ansible/latest/modules/wait_for_module.html

In practice, pause is not really going to work if executed via ansible-tower or ansible-runner; see "Note: Playbooks should not use the pause feature of Ansible without a timeout, as Tower does not allow for interactively cancelling a pause. If you must use pause, ensure that you set a timeout." from https://docs.ansible.com/ansible-tower/latest/html/userguide/best_practices.html

ovirt-hosted-engine-setup is currently just wrapping ansible-playbook via subprocess.Popen:
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/ovirt_hosted_engine_setup/ansible_utils.py#L198
but even in that case a pause task is going to be skipped with a:
[WARNING]: Not waiting for response to prompt as stdin is not interactive

The second option is wait_for: in that case we could, for instance, wait until a specific lock file is removed, or something like that. But exiting the paused status is not as simple as pressing a key, and we should eventually think about an "unpause" utility command (something like 'hosted-engine --unpause-deploy') to be executed in a second shell. Not really sure about that.
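The non-interactive-stdin limitation mentioned above can be reproduced outside of the deployment flow. The following is a minimal, self-contained sketch (plain Python, no oVirt code involved): a child process launched with a piped stdin, the way a wrapper launches ansible-playbook via subprocess, does not see a TTY, which is why an interactive pause prompt has to be skipped.

```python
# Demonstrate that a child launched with a piped stdin is non-interactive.
# This mirrors the situation described above, where ansible-playbook is
# spawned via subprocess and therefore cannot prompt the user.
import subprocess
import sys

child_code = "import sys; print(sys.stdin.isatty())"

# Launch a child the way a setup wrapper would: stdin is a pipe, not a TTY.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # the child reports a non-interactive stdin
```

Running this prints False: the child's stdin is a pipe, not a terminal, so anything that checks isatty() before prompting (as Ansible's pause module does) will skip the prompt.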
(In reply to Simone Tiraboschi from comment #30)
> Unfortunately, we cannot easily pause Ansible execution in the middle. [...]
> The second option is wait_for: in that case we could, for instance, wait
> until a specific lock file is removed, or something like that. But exiting
> the paused status is not as simple as pressing a key, and we should
> eventually think about an "unpause" utility command (something like
> 'hosted-engine --unpause-deploy') to be executed in a second shell.

Two other options:
1. Create some temp file, tell the user to remove it when ready, and wait until it's gone (or until some timeout, if we want).
2. Split the playbook in two and prompt in between.

(In reply to Yedidyah Bar David from comment #31)
> 1. Create some temp file, tell the user to remove it when ready, and wait
> until it's gone (or until some timeout, if we want).

- name: Wait until the lock file is removed
  wait_for:
    path: /var/lock/file.lock
    state: absent

will do exactly this.

> 2. Split the playbook in two and prompt in between.

This is more complex, since the whole logic is now packaged in a role and the playbook is just a two-line wrapper around the role.

André, is the example in comment 32 good enough for your current needs? We might want to include it in the docs. I think it will serve 95% of the cases.

I already worked around this issue by adding a customization file in /usr/share/ansible/roles/ovirt.hosted_engine_setup/hooks/enginevm_before_engine_setup (which copies the keytab and CA cert and runs a trust extract), only to run into another problem: bug 1694116.

I'd definitely favour the wait-for-lock step being included (in a sane way, like /root/DELETE-TO-CONTINUE) in `hosted-engine --deploy`, where it could be a toggle (e.g. --manual-customization).

So yes, the suggestion from comment 32 will definitely work for me. And of course one needs to be quick to use the workaround from comment 32, as it will time out after 300 seconds:

[ INFO ] TASK [ovirt.hosted_engine_setup : Wait until the lock file is removed]
[ ERROR ] fatal: [localhost -> engine.lab.gematik.de]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for /root/DELETE_TO_CONTINUE to be absent."}

Yes, sorry; of course we can simply set a longer timeout value, or remove it altogether.

Deployment over NFS on a clean environment succeeded. Works for me on these components:
ovirt-hosted-engine-setup-2.3.7-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
rhvm-appliance-4.3-20190328.1.el7.x86_64
Linux 3.10.0-957.10.1.el7.x86_64 #1 SMP Thu Feb 7 07:12:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Tested on RHEL hosts. Moving to verified.
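Building on the wait_for example from comment 32, the full pause pattern under discussion (create a marker file, tell the user how to resume, block until it is removed) might be sketched as follows. This is a hedged illustration, not the shipped role: the marker path and timeout are assumed values, chosen to avoid the 300-second default that caused the failure reported above.

```yaml
# Hypothetical sketch of the discussed pause pattern; path and timeout
# are illustrative assumptions, not taken from this bug or the role.
- name: Create the pause marker file
  file:
    path: /root/DELETE_TO_CONTINUE
    state: touch

- name: Tell the user how to resume the deployment
  debug:
    msg: >-
      Deployment is paused for manual customization. Remove
      /root/DELETE_TO_CONTINUE on this host to continue.

- name: Wait until the marker file is removed
  wait_for:
    path: /root/DELETE_TO_CONTINUE
    state: absent
    timeout: 86400  # much longer than wait_for's 300-second default
```

Using wait_for here keeps the play non-interactive (no stdin prompt), so it also works when ansible-playbook is wrapped by another process.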
(In reply to André Liebe from comment #34)
> I already worked around this issue by adding a customization file in
> /usr/share/ansible/roles/ovirt.hosted_engine_setup/hooks/
> enginevm_before_engine_setup (which copies the keytab and CA cert and runs
> a trust extract), only to run into another problem: bug 1694116.
>
> I'd definitely favour the wait-for-lock step being included (in a sane way,
> like /root/DELETE-TO-CONTINUE) in `hosted-engine --deploy`, where it could
> be a toggle (e.g. --manual-customization).
>
> So yes, the suggestion from comment 32 will definitely work for me.

Filed bug 1695523 for this.

Is there a clear action item for docs here? Looking through this, it's not clear to me. Comment 32 has what appears to me to be a workaround, and it's not clear if there is consensus on docs addressing something specific.

(In reply to Steve Goodman from comment #39)
> Comment 32 has what appears to me to be a workaround, and it's not clear if
> there is consensus on docs addressing something specific.

Since comment 14 we have been talking with André about a specific subcase: Kerberos SSO was configured on the original environment, but engine-backup is not handling it correctly. We filed https://bugzilla.redhat.com/show_bug.cgi?id=1695523 for that specific case; something on the docs side will probably be required there.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1050
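For reference, the hook mechanism discussed in this thread (comments 24 and 34) takes a plain Ansible tasks file dropped under /usr/share/ansible/roles/ovirt.hosted_engine_setup/hooks/enginevm_before_engine_setup, which runs on the new engine VM after engine-backup is restored but before engine-setup. A hedged sketch mirroring André's keytab/CA-cert/trust-extract customization follows; the file names, source paths, and ownership are illustrative assumptions, not values taken from this bug.

```yaml
# Hypothetical hooks/enginevm_before_engine_setup/90_restore_sso.yml
# Deploy the Kerberos keytab and internal CA cert on the engine VM
# before engine-setup runs, so SSO/LDAPS can initialize. Paths are
# illustrative assumptions.
- name: Copy the Kerberos keytab to the engine VM
  copy:
    src: /root/sso-backup/http.keytab
    dest: /etc/httpd/http.keytab
    owner: apache
    group: apache
    mode: '0400'

- name: Deploy the internal CA certificate
  copy:
    src: /root/sso-backup/internal-ca.pem
    dest: /etc/pki/ca-trust/source/anchors/internal-ca.pem
    mode: '0644'

- name: Rebuild the system trust store
  command: update-ca-trust extract
```

Hooks of this kind are version-sensitive, as noted in comment 28: they need to be re-tested against each new release of the role.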