Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1772955

Summary: overcloud deployment fails at "Sync cached facts" ansible step
Product: Red Hat OpenStack
Reporter: David Hill <dhill>
Component: openstack-tripleo-heat-templates
Assignee: Alex Schultz <aschultz>
Status: CLOSED DUPLICATE
QA Contact: Sasha Smolyak <ssmolyak>
Severity: low
Docs Contact:
Priority: low
Version: 13.0 (Queens)
CC: apetrich, ariveral, aschultz, jeroen.van_bemmel, kesteven, mburns
Target Milestone: ---
Keywords: Reopened, Triaged, ZStream
Target Release: ---
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-30 21:19:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Hill 2019-11-15 17:18:29 UTC
Description of problem:
overcloud deployment fails at "Sync cached facts" ansible step:

cloud.AllNodesDeploySteps.ComputeDeployment_Step1.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 3eddf3f5-5c1f-494f-bdf2-69af583e358a
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
    ...ignoring

    TASK [Sync cached facts] *******************************************************
    fatal: [localhost -> localhost]: FAILED! => {"changed": false, "cmd": "/usr/bin/rsync --delay-updates -F --compress --archive --out-format=<<CHANGED>>%i %n%L /opt/puppetlabs/ /var/lib/container-puppet/puppetlabs/", "msg": "rsync: change_dir \"/opt/puppetlabs\" failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]\n", "rc": 23}
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/b068b9ec-8798-4153-9b00-b7c3635505a2_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=30   changed=13   unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

Version-Release number of selected component (if applicable):
[dhill@supportshell sosreport-director-2019-11-10-imjfqtd]$ cat installed-rpms  | grep tripleo
ansible-tripleo-ipsec-8.1.1-0.20190513184007.7eb892c.el7ost.noarch Mon Oct  7 15:46:08 2019
openstack-tripleo-common-8.7.1-2.el7ost.noarch              Thu Nov  7 18:00:28 2019
openstack-tripleo-common-containers-8.7.1-2.el7ost.noarch   Thu Nov  7 18:00:28 2019
openstack-tripleo-heat-templates-8.4.1-13.el7ost.noarch     Thu Nov  7 18:00:32 2019
openstack-tripleo-image-elements-8.0.3-1.el7ost.noarch      Thu Nov  7 18:00:27 2019
openstack-tripleo-puppet-elements-8.1.1-1.el7ost.noarch     Thu Nov  7 18:00:24 2019
openstack-tripleo-validations-8.5.0-2.el7ost.noarch         Thu Nov  7 18:00:27 2019
puppet-tripleo-8.5.1-3.el7ost.noarch                        Thu Nov  7 18:00:25 2019
python-tripleoclient-9.3.1-4.el7ost.noarch                  Thu Nov  7 18:00:34 2019
[supportshell.prod.useraccess-us-west-2.redhat.com] [17:17:41+0000]
[dhill@supportshell sosreport-director-2019-11-10-imjfqtd]$ cat installed-rpms  | grep rhosp-imag
[supportshell.prod.useraccess-us-west-2.redhat.com] [17:17:59+0000]
[dhill@supportshell sosreport-director-2019-11-10-imjfqtd]$ cat installed-rpms  | grep rhosp
rhosp-director-images-13.0-20190920.1.el7ost.noarch         Tue Oct  8 10:40:04 2019
rhosp-director-images-13.0-20191031.1.el7ost.noarch         Thu Nov  7 11:39:25 2019
rhosp-director-images-ipa-13.0-20191031.1.el7ost.noarch     Thu Nov  7 18:42:15 2019
rhosp-director-images-ipa-x86_64-13.0-20190920.1.el7ost.noarch Tue Oct  8 10:38:56 2019
rhosp-director-images-ipa-x86_64-13.0-20191031.1.el7ost.noarch Thu Nov  7 11:37:01 2019
rhosp-director-images-x86_64-13.0-20190920.1.el7ost.noarch  Tue Oct  8 10:39:52 2019
rhosp-director-images-x86_64-13.0-20191031.1.el7ost.noarch  Thu Nov  7 11:37:59 2019
rhosp-release-13.0.9-1.el7ost.noarch                        Thu Nov  7 11:36:21 2019


How reproducible:
This deployment

Steps to Reproduce:
1. Deploy an overcloud
2.
3.

Actual results:
Fails

Expected results:
Succeeds

Additional info:

Comment 1 Alex Schultz 2019-11-21 21:15:16 UTC
This happens if z8 content is used with z9 THT. Please make sure that the systems have been properly updated (packages/overcloud images/containers).

Comment 2 Jeroen van Bemmel 2019-11-22 20:59:51 UTC
It may not technically be a bug, but it is a hard-to-catch mistake and one that could be avoided through a simple compatibility check.

I just hit this same issue using RHOSP 13.0-20190827.1.el7ost. Please consider how a check on z8 content versus z9 could be added.

Comment 3 Alex Schultz 2019-11-22 21:03:12 UTC
We can try to add some additional logic, but it won't land until at least z10. This highlights a breakdown in best practices for applying updates and ensuring environments have the correct content, which seems like an end-user problem that isn't necessarily solved by more checks in director.
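Purely as an illustration of the kind of check being discussed (this is not Director code, and the NVR parsing and function names are hypothetical), a pre-deployment check could compare the build date embedded in the installed overcloud image package against the minimum build the templates expect:

```python
import re
from datetime import date

def image_build_date(nvr: str) -> date:
    """Extract the build date from an image package NVR.

    NVRs like 'rhosp-director-images-13.0-20191031.1.el7ost' embed a
    YYYYMMDD build date in the release field. The parsing here is
    illustrative only; real NVR handling would use rpm's own tooling.
    """
    m = re.search(r"-(\d{8})\.\d+\.el7ost", nvr)
    if not m:
        raise ValueError(f"no build date found in {nvr!r}")
    s = m.group(1)
    return date(int(s[:4]), int(s[4:6]), int(s[6:8]))

def image_matches_templates(image_nvr: str, minimum: date) -> bool:
    """True if the installed overcloud image is at least as new as the
    build the installed heat templates (THT) expect."""
    return image_build_date(image_nvr) >= minimum
```

With the packages from this report, `image_matches_templates("rhosp-director-images-13.0-20190920.1.el7ost", date(2019, 10, 31))` would flag the z8 image as too old for z9 templates before the deploy starts, rather than failing mid-run at the rsync step.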

Comment 4 Jeroen van Bemmel 2019-11-22 21:21:35 UTC
It also highlights an implicit dependency that end users have no way of identifying. Adding checks to Director may not be ideal, but Director code (or rather its developers) is in a better position to understand its dependencies. 

In my case, I pinned the RHOSP Director image version to one that was working for me at the time. I believe it is quite common for deployments to not always take the latest and greatest - perhaps some guidance on how to pick a particular version and the consequences of that choice (e.g. if you pick RHOSP overcloud image 13.0-20190827.1.el7ost, also pick xyz) could help users avoid this particular mistake.

Thanks for reopening

Comment 5 Alex Schultz 2019-11-22 21:30:15 UTC
Pinning to an image is not a good idea. The image must match the current version or scale-outs/deployments will fail. When you update the overcloud, the overcloud-full image must match the software being updated to. This is just one instance of the problem: there is always an assumption of the correct base when performing deployment actions. You can't use an OSP10 overcloud image with OSP13 THT; it just doesn't work. You'll hit problems all over the place, for example with minor RHEL updates, when things like pacemaker or libvirt end up being mix-matched. This would be fine if you used the THT that went with that overcloud image, but because a newer THT is used the expectations were out of date. Container versions carry these same expectations.

Comment 6 Alex Schultz 2019-11-22 21:33:46 UTC
Specifically this is documented as part of upgrading the undercloud: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/keeping_red_hat_openstack_platform_updated/assembly-upgrading_the_undercloud

We expect that if you update the undercloud (which would imply a newer THT) that you upload the current overcloud-full images.

Comment 7 Jeroen van Bemmel 2019-11-22 22:03:10 UTC
There were some compatibility issues with the newer overcloud image, as it was based on RHEL 7.6 versus 7.5 and our custom installed packages weren't ready for the migration. I am not saying you are wrong, but users may also have valid reasons for staying with a particular image version.

And sometimes it may still work with an older disk image. In order to avoid "bug reports" like this one, I think it would benefit everyone involved if there were some explicit version checks added.

The link you referenced says "only use OpenStack Platform 13 images with the OpenStack Platform 13 Heat templates". But apparently (in this case) the compatibility requirements are more strict than that, i.e. only use OSP13 z9 images with OSP13 z9 Heat templates (and OSP13 z9 python-tripleo client, etc. )

In my view, the key moment to check is during a fresh install - i.e. during 'overcloud deploy', as one of the first things to check when running Ansible scripts on the individual node. I am curious as to where exactly things break down, and I wonder if it wouldn't be possible to design things to be 'self contained' - i.e. Director would run an install script that is part of the overcloud image itself, and therefore (by definition) compatible.

I think version control has always been an important part of OSP deployments (and software in general), and it goes beyond simple major version equality.

Comment 8 Alex Schultz 2019-11-22 22:08:49 UTC
Right, but if you're locking to a version, it needs to be consistent. Unfortunately the deployment framework doesn't have a concept of these minor versions, so it assumes that is handled externally via Satellite or some other method. If you need to lock to a specific zstream, it must be a complete lock, otherwise you will end up in this case. We try to make releases as backwards compatible as possible, but OSP ends up also being at the mercy of RHEL minor releases. When we test, we test a complete zstream against another complete zstream, as that is the recommended way to keep your environment.

Anyway, I've got a proposed patch to make this specific functionality ignore the error, since it should be fine if it fails; it exists only for speed improvements. If you hit this error, you may have speed-related deployment issues in environments with a large number of network interfaces.
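The actual patch is tracked elsewhere (ultimately via bug 1772201), but purely as a sketch of the approach: an Ansible task can be made non-fatal with `failed_when: false` (or `ignore_errors: true`), along these lines. This is an illustration, not the shipped change:

```yaml
# Sketch only -- not the shipped patch. The idea is to let the fact-cache
# rsync fail without aborting the deploy, since the cache is purely a
# speed optimization for hosts with many network interfaces.
- name: Sync cached facts
  command: >-
    /usr/bin/rsync --delay-updates -F --compress --archive
    /opt/puppetlabs/ /var/lib/container-puppet/puppetlabs/
  # Tolerate a missing /opt/puppetlabs (e.g. older overcloud image content)
  failed_when: false
```

With a change of this shape, a z8 image paired with z9 templates would log the rsync failure but continue instead of failing the ComputeDeployment step.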

Comment 9 Jeroen van Bemmel 2019-11-22 22:46:05 UTC
Ah, the wonderful world of software package versions and their dependencies...

Anyway, agreed - this specific issue can be a simple matter of ignoring the error, making the code more compatible with previous releases.
We can solve world hunger next time...

Comment 10 Alex Schultz 2020-01-30 21:19:45 UTC
Turns out the fix referenced here was also to address Bug 1772201 (same). Feel free to follow that bz for updates on when the fix will be shipped.

*** This bug has been marked as a duplicate of bug 1772201 ***