Bug 1772955
| Summary: | overcloud deployment fails at "Sync cached facts" ansible step | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | David Hill <dhill> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sasha Smolyak <ssmolyak> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 13.0 (Queens) | CC: | apetrich, ariveral, aschultz, jeroen.van_bemmel, kesteven, mburns |
| Target Milestone: | --- | Keywords: | Reopened, Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-01-30 21:19:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
This happens if z8 content is used with z9 THT. Please make sure that the systems have been properly updated (packages/overcloud images/containers).

It may not technically be a bug, but it is a hard-to-catch mistake and one that could be avoided through a simple compatibility check. I just hit this same issue using RHOSP 13.0-20190827.1.el7ost. Please consider how a check on z8 content versus z9 could be added.

We can try to add some additional logic, however it won't land until at least z10. This highlights a breakdown in best practices for applying updates and ensuring environments have the correct content, which seems like an end user problem that isn't necessarily solved by more checks in director.

It also highlights an implicit dependency that end users have no way of identifying. Adding checks to Director may not be ideal, but Director code (or rather its developers) is in a better position to understand its dependencies. In my case, I pinned the RHOSP Director image version to one that was working for me at the time. I believe it is quite common for deployments to not always take the latest and greatest - perhaps some guidance on how to pick a particular version and the consequences of that choice (e.g. if you pick RHOSP overcloud image 13.0-20190827.1.el7ost, also pick xyz) could help users avoid this particular mistake. Thanks for reopening.

Pinning to an image is not a good idea. The image must match the current version or scale outs/deployments will fail. When you update the overcloud, the overcloud-full image must match the software being updated to. This is just one instance of the problem, but there is always an assumption of the correct base when performing deployment actions. You can't use an OSP10 overcloud image with an OSP13 THT; it just doesn't work. You'll hit problems all over the place, for example with minor RHEL updates, when things like pacemaker or libvirt end up mismatched.
This would be fine if you used the THT that went with that overcloud image; however, because a newer THT is used, the expectations were out of date. Container versions carry similar expectations. Specifically, this is documented as part of upgrading the undercloud: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/keeping_red_hat_openstack_platform_updated/assembly-upgrading_the_undercloud We expect that if you update the undercloud (which would imply a newer THT), you also upload the current overcloud-full images.

There were some compatibility issues with the newer overcloud image, as it was based on RHEL 7.6 versus 7.5 and our custom installed packages weren't ready for the migration. I am not saying you are wrong, but users may also have valid reasons for staying with a particular image version. And sometimes it may still work with an older disk image. In order to avoid "bug reports" like this one, I think it would benefit everyone involved if some explicit version checks were added. The link you referenced says "only use OpenStack Platform 13 images with the OpenStack Platform 13 Heat templates". But apparently (in this case) the compatibility requirements are stricter than that, i.e. only use OSP13 z9 images with OSP13 z9 Heat templates (and OSP13 z9 python-tripleoclient, etc.). In my view, the key moment to check is during a fresh install - i.e. during 'overcloud deploy', as one of the first things to check when running Ansible scripts on the individual node. I am curious as to where exactly things break down, and I wonder if it wouldn't be possible to design things to be 'self contained' - i.e. Director would run an install script that is part of the overcloud image itself, and therefore (by definition) compatible. I think version control has always been an important part of OSP deployments (and software in general), and it goes beyond simple major version equality.
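To make the stricter requirement concrete (z9 images with z9 Heat templates and a z9 client), here is a minimal Python sketch of such a check. It is not part of Director. The function name, the component names and the z-stream labels are all hypothetical, and how each label would actually be discovered is left open.

```python
def find_zstream_mismatches(expected, components):
    """Return {component: label} for every component not at the expected z-stream.

    `components` maps a component name to its z-stream label. Both the names
    and the labels are hypothetical; discovering them is out of scope here.
    """
    return {name: label for name, label in components.items() if label != expected}


if __name__ == "__main__":
    # Hypothetical inventory: templates and client at z9, image left at z8.
    inventory = {
        "heat-templates": "z9",
        "tripleoclient": "z9",
        "overcloud-image": "z8",
    }
    for name, label in sorted(find_zstream_mismatches("z9", inventory).items()):
        print("mismatch: %s is at %s, expected z9" % (name, label))
```

Run early in a deployment, a check of this shape would turn the rsync failure reported here into an explicit version message.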
Right, but if you're locking to a version, it needs to be consistent. Unfortunately the deployment framework doesn't have a concept of these minor versions, so it assumes that is handled externally via Satellite or some other method. If you have a need to lock to a specific zstream, it must be a complete lock, otherwise you will end up in this case. We try to make them as backwards compatible as possible, but OSP ends up being at the mercy of RHEL minor releases as well. When we test, we're testing a complete zstream to another complete zstream, as that is the recommended way to keep your environment. Anyway, I've got a proposed patch to make this specific functionality ignore the error, since it should be fine if it fails as it's just for speed improvements. If you hit this error, you may have speed-related deployment issues in environments with a large number of network interfaces.

Ah, the wonderful world of software package versions and their dependencies... Anyway, agreed - this specific issue can be a simple matter of ignoring the error, making the code more compatible with previous releases. We can solve world hunger next time...

Turns out the fix referenced here was also to address Bug 1772201 (same). Feel free to follow that bz for updates on when the fix will be shipped.

*** This bug has been marked as a duplicate of bug 1772201 ***
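The proposed patch itself is not shown in this report, and the real change would live in the deployment's Ansible task. As a sketch of the "ignore the error" idea only, the Python below treats the cached-facts sync as best effort: a missing source directory or a non-zero rsync exit is reported but never raised. The function name and return strings are hypothetical. The rsync flags are the ones from the failing command in this report, minus its `--out-format` option.

```python
import os
import subprocess


def sync_cached_facts(src, dst, run=subprocess.run):
    """Best-effort sync of cached facts; a failure is reported, not raised.

    `run` is injectable so the function can be exercised without rsync.
    """
    if not os.path.isdir(src):
        # Mirrors the reported failure: the source directory is absent.
        return "skipped: %s does not exist" % src
    cmd = ["/usr/bin/rsync", "--delay-updates", "-F", "--compress",
           "--archive", src, dst]
    try:
        result = run(cmd, check=False)
    except OSError as exc:
        # For example, the rsync binary itself is missing.
        return "ignored: could not run rsync (%s)" % exc
    if result.returncode != 0:
        return "ignored: rsync exited with %d" % result.returncode
    return "synced"


if __name__ == "__main__":
    print(sync_cached_facts("/opt/puppetlabs/",
                            "/var/lib/container-puppet/puppetlabs/"))
```

Because the sync exists only as a speed optimization, continuing without it trades deployment time for compatibility with older images.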
Description of problem:
overcloud deployment fails at "Sync cached facts" ansible step:

    cloud.AllNodesDeploySteps.ComputeDeployment_Step1.1:
      resource_type: OS::Heat::StructuredDeployment
      physical_resource_id: 3eddf3f5-5c1f-494f-bdf2-69af583e358a
      status: CREATE_FAILED
      status_reason: |
        Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
      deploy_stdout: |
        ...
        fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
        ...ignoring

        TASK [Sync cached facts] *******************************************************
        fatal: [localhost -> localhost]: FAILED! => {"changed": false, "cmd": "/usr/bin/rsync --delay-updates -F --compress --archive --out-format=<<CHANGED>>%i %n%L /opt/puppetlabs/ /var/lib/container-puppet/puppetlabs/", "msg": "rsync: change_dir \"/opt/puppetlabs\" failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]\n", "rc": 23}
                to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/b068b9ec-8798-4153-9b00-b7c3635505a2_playbook.retry

        PLAY RECAP *********************************************************************
        localhost : ok=30 changed=13 unreachable=0 failed=1

        (truncated, view all with --long)
      deploy_stderr: |

Version-Release number of selected component (if applicable):

    [dhill@supportshell sosreport-director-2019-11-10-imjfqtd]$ cat installed-rpms | grep tripleo
    ansible-tripleo-ipsec-8.1.1-0.20190513184007.7eb892c.el7ost.noarch Mon Oct 7 15:46:08 2019
    openstack-tripleo-common-8.7.1-2.el7ost.noarch Thu Nov 7 18:00:28 2019
    openstack-tripleo-common-containers-8.7.1-2.el7ost.noarch Thu Nov 7 18:00:28 2019
    openstack-tripleo-heat-templates-8.4.1-13.el7ost.noarch Thu Nov 7 18:00:32 2019
    openstack-tripleo-image-elements-8.0.3-1.el7ost.noarch Thu Nov 7 18:00:27 2019
    openstack-tripleo-puppet-elements-8.1.1-1.el7ost.noarch Thu Nov 7 18:00:24 2019
    openstack-tripleo-validations-8.5.0-2.el7ost.noarch Thu Nov 7 18:00:27 2019
    puppet-tripleo-8.5.1-3.el7ost.noarch Thu Nov 7 18:00:25 2019
    python-tripleoclient-9.3.1-4.el7ost.noarch Thu Nov 7 18:00:34 2019
    [supportshell.prod.useraccess-us-west-2.redhat.com] [17:17:41+0000]
    [dhill@supportshell sosreport-director-2019-11-10-imjfqtd]$ cat installed-rpms | grep rhosp-imag
    [supportshell.prod.useraccess-us-west-2.redhat.com] [17:17:59+0000]
    [dhill@supportshell sosreport-director-2019-11-10-imjfqtd]$ cat installed-rpms | grep rhosp
    rhosp-director-images-13.0-20190920.1.el7ost.noarch Tue Oct 8 10:40:04 2019
    rhosp-director-images-13.0-20191031.1.el7ost.noarch Thu Nov 7 11:39:25 2019
    rhosp-director-images-ipa-13.0-20191031.1.el7ost.noarch Thu Nov 7 18:42:15 2019
    rhosp-director-images-ipa-x86_64-13.0-20190920.1.el7ost.noarch Tue Oct 8 10:38:56 2019
    rhosp-director-images-ipa-x86_64-13.0-20191031.1.el7ost.noarch Thu Nov 7 11:37:01 2019
    rhosp-director-images-x86_64-13.0-20190920.1.el7ost.noarch Tue Oct 8 10:39:52 2019
    rhosp-director-images-x86_64-13.0-20191031.1.el7ost.noarch Thu Nov 7 11:37:59 2019
    rhosp-release-13.0.9-1.el7ost.noarch Thu Nov 7 11:36:21 2019

How reproducible:
This deployment

Steps to Reproduce:
1. Deploy an overcloud
2.
3.

Actual results:
Fails

Expected results:
Succeeds

Additional info:
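The package listing above shows two builds of several `rhosp-director-images` packages installed side by side. As a triage aid, the Python sketch below scans `installed-rpms` style lines and reports every package name present in more than one build. The helper names are hypothetical, and the parsing assumes the usual `name-version-release.arch` layout. The result shows only what is installed on the undercloud; it does not show which image was actually uploaded for the overcloud.

```python
from collections import defaultdict


def parse_nvr(package):
    """Split 'name-version-release.arch' into (name, version, release)."""
    nvr, _arch = package.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release


def packages_with_multiple_builds(lines):
    """Map each package name seen in more than one build to its sorted builds."""
    builds = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            name, version, release = parse_nvr(fields[0])
        except ValueError:
            # Not a package line (for example a shell prompt); skip it.
            continue
        builds[name].add("%s-%s" % (version, release))
    return {name: sorted(found) for name, found in builds.items() if len(found) > 1}


if __name__ == "__main__":
    sample = [
        "rhosp-director-images-13.0-20190920.1.el7ost.noarch Tue Oct 8 10:40:04 2019",
        "rhosp-director-images-13.0-20191031.1.el7ost.noarch Thu Nov 7 11:39:25 2019",
        "rhosp-release-13.0.9-1.el7ost.noarch Thu Nov 7 11:36:21 2019",
    ]
    for name, found in packages_with_multiple_builds(sample).items():
        print(name, found)
```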