Bug 1577220 - [DOCS] Ansible-based installation reuses incorrect/unintended configuration value
Summary: [DOCS] Ansible-based installation reuses incorrect/unintended configuration v...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.10.0
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Michael Burke
QA Contact: Johnny Liu
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-11 13:41 UTC by Chris Evich
Modified: 2018-06-12 01:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-12 01:29:06 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github /openshift openshift-ansible pull 8322 0 None None None 2018-05-11 13:44:03 UTC

Description Chris Evich 2018-05-11 13:41:25 UTC
Document URL: 
https://docs.openshift.org/latest/install_config/install/advanced_install.html#running-the-advanced-installation


Section Number and Name: 
All sections that reference a "playbook" (e.g. prerequisites.yml, deploy_cluster.yml, etc.)


Describe the issue: 
The default Ansible-based installer is configured to use fact caching with a TTL of 10 minutes.  During that window, an admin may correct an installation problem by fixing a mistake somewhere.  If they then re-run any playbook within the cache TTL, the _old_ (broken or incompatible) variable value will be used instead of the new, correct value.

For example, see the description of https://github.com/openshift/openshift-ansible/pull/8322.  This would cause astonishment and confusion for any reasonable person.  However, as advised in the upstream PR linked above, this default behavior (presumably) has a positive effect for common-case, large-cluster installations.

Suggestions for improvement: 
Add a warning recommending removal of invalid cache contents should any change happen to the system, network, or inventory configuration.  The location of the cache is set by the `fact_caching_connection` value in `ansible.cfg`, by the Ansible Tower cache, or by the `$ANSIBLE_CACHE_PLUGIN_CONNECTION` environment variable.
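
As a rough illustration of the suggested cleanup, the sketch below locates the fact cache and empties it. It rests on several assumptions not stated in this report: that a file-based cache plugin such as `jsonfile` is in use, that `fact_caching_connection` is a directory path in the `[defaults]` section, and that the environment variable is spelled `ANSIBLE_CACHE_PLUGIN_CONNECTION` (the name Ansible's configuration reference uses). The function names `fact_cache_dir` and `clear_fact_cache` are hypothetical helpers, not part of Ansible or openshift-ansible.

```python
import configparser
import os
import shutil
from pathlib import Path


def fact_cache_dir(cfg_path, env=None):
    """Return the configured fact cache location, or None if none is set.

    Assumption: a file-based cache plugin (e.g. jsonfile) whose
    fact_caching_connection value is a directory path.
    """
    env = os.environ if env is None else env
    # An environment variable overrides the value in ansible.cfg.
    location = env.get("ANSIBLE_CACHE_PLUGIN_CONNECTION")
    if not location:
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(cfg_path)
        location = parser.get(
            "defaults", "fact_caching_connection", fallback=None
        )
    if not location:
        return None
    # Expand "~" and "$HOME"-style references in the configured path.
    return Path(os.path.expandvars(os.path.expanduser(location)))


def clear_fact_cache(cache_dir):
    """Delete every entry in cache_dir but keep the directory itself.

    Returns the sorted names of the removed entries.
    """
    cache_dir = Path(cache_dir)
    removed = []
    if not cache_dir.is_dir():
        return removed
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

A caller would run something like `clear_fact_cache(fact_cache_dir("ansible.cfg"))` after changing a hostname or inventory value and before re-running a playbook. Keeping the directory itself avoids having to recreate it with the right ownership.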


Additional information: 

Default OpenShift Ansible Configuration: https://github.com/openshift/openshift-ansible/blob/master/ansible.cfg

Cache configuration docs: http://docs.ansible.com/ansible/latest/plugins/cache.html

Cache options ref.: http://docs.ansible.com/ansible/latest/reference_appendices/config.html#cache-plugin

Comment 1 Michael Gugino 2018-05-11 17:50:56 UTC
Fact caching works as designed.  Perhaps you're hitting some bug in openshift-ansible.

What value are you having trouble with precisely?

Comment 2 Chris Evich 2018-05-11 18:00:35 UTC
(In reply to Michael Gugino from comment #1)
> What value are you having trouble with precisely?

The system's hostname.  Correcting it between playbook runs, and gee-wiz, it kept using the wrong value :S

Fact caching isn't only for variables, it caches _everything_.  If I can hit this and get confused, you-betcha an ansible-noob is going to pull their hair out.  Personally, I'd prefer cache was disabled by default, but a documentation "fix" was called for, so this is that.

At least add a warning:  Kill or remove cache between playbook runs, after any changes are made (to system, inventory, network, etc).

Comment 3 Michael Gugino 2018-05-11 20:15:15 UTC
(In reply to Chris Evich from comment #2)
> (In reply to Michael Gugino from comment #1)
> > What value are you having trouble with precisely?
> 
> The system's hostname.  Correcting it between playbook runs, and gee-wiz, it
> kept using the wrong value :S
> 
> Fact caching isn't only for variables, it caches _everything_.  If I can hit
> this and get confused, you-betcha an ansible-noob is going to pull their
> hair out.  Personally, I'd prefer cache was disabled by default, but a
> documentation "fix" was called for, so this is that.
> 
> At least add a warning:  Kill or remove cache between playbook runs, after
> any changes are made (to system, inventory, network, etc).

So, this was not related to a variable in particular?

Hostname and other facts (not variables) are cached by the fact cache.

I think we should consider disabling fact-caching by default; I don't believe a typical deployment has much use for it, though it might save a couple of minutes in really large environments when going from prerequisites.yml to deploy_cluster.yml.

Comment 4 Chris Evich 2018-05-14 17:52:27 UTC
> So, this was not related to a variable in particular?

I think so, though it was several test installations ago, so I may be misremembering.  I believe what happened is that I had 'openshift_hostname' set but the actual system hostname was incorrect.  I removed that setting and corrected the actual hostname.  However, upon re-running prerequisites.yml, it failed because both(?) were still cached.

I too think disabling the cache by default, or having an alternate configuration, is a better solution.  I tried that, and they sent me here for a docs fix :S  I guess the thing to do is prove whether both system facts and variables (``set_fact``) are cached (I believe they are)...

Comment 5 Chris Evich 2018-05-14 19:00:02 UTC
...okay, so with setting ``fact_caching = yaml`` and looking at the changes to the cache file:

* Cache is not checked/utilized if ``gather_facts`` is set False, even if values were previously cached.

* The ``set_fact`` task will use the cache if it finds a value set and its current ``cacheable`` attribute is set to true.  It doesn't check how a value made it into the cache or whether it's valid.

* The contents of ``ansible_env`` are cached, but appear to always be refreshed.  I would guess ``ansible_date_time`` behaves similarly.

* Variables set in static inventory do not appear to be cached.  I did not test dynamic inventory, but I'd guess it works the same (it has its own caching API).

* Local facts (/etc/ansible/facts.d) are cached, including invalid value state (does not refresh when contents corrected).

The last point could be especially problematic if the playbook sets a local fact ``foo`` based on a ``when: ansible_local.foo is undefined`` condition.

Anyway, for this bug (after testing), my guess is that it was a cached ``ansible_hostname`` or local fact that caused it, and not my ``openshift_hostname`` setting change (in inventory).  I can attempt to re-create this invalid-hostname situation if required.
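
To make the stale-value behavior described above concrete, here is a toy model of a TTL-based fact cache. It is only a sketch under stated assumptions: `FactCache`, `facts_for`, and the `gather` callback are hypothetical names invented for illustration, and the logic is a simplification, not Ansible's actual cache implementation.

```python
import time


class FactCache:
    """Toy TTL cache keyed by host name (not Ansible's implementation)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # host -> (stored_at, facts)

    def get(self, host):
        entry = self._store.get(host)
        if entry is None:
            return None
        stored_at, facts = entry
        if self.clock() - stored_at >= self.ttl:
            del self._store[host]  # entry has outlived the TTL
            return None
        return facts

    def set(self, host, facts):
        self._store[host] = (self.clock(), dict(facts))

    def flush(self):
        self._store.clear()


def facts_for(host, cache, gather):
    """Return cached facts while they are fresh; otherwise gather anew."""
    cached = cache.get(host)
    if cached is not None:
        # Nothing here checks whether the cached value is still valid.
        return cached
    facts = gather(host)
    cache.set(host, facts)
    return facts
```

In this model, correcting the hostname on the system between two calls to `facts_for` has no effect until the TTL elapses or `flush()` is called, which mirrors the confusion reported in this bug.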

Comment 6 Chris Evich 2018-05-14 19:05:07 UTC
Correction: ``ansible_env`` contents _are_ cached.  My ``fact_caching_timeout`` was too low previously (30 seconds).

Comment 7 Michael Burke 2018-06-04 18:24:37 UTC
Chris --

We have a note in Known Issues:
"On failure of the Ansible installer, you must start from a clean operating system installation. If you are using virtual machines, start from a fresh image. If you are using bare metal machines, see Uninstalling OpenShift Container Platform for instructions."
https://docs.openshift.com/container-platform/3.9/install_config/install/advanced_install.html#installer-known-issues

We link to this note from the Advanced Installer topic:
If for any reason the installation fails, before re-running the installer, see Known Issues to check for any specific instructions or workarounds.
https://docs.openshift.com/container-platform/3.9/install_config/install/advanced_install.html#running-the-advanced-installation-rpm

Does this address your concern, or do you think something less "dramatic" would be better?

Comment 8 Michael Burke 2018-06-04 18:43:01 UTC
Chris --

How does this look:
https://github.com/openshift/openshift-docs/pull/9815

The installer caches playbook configuration values for 10 minutes, by default. If for some reason you change any system, network,
or inventory configuration, then re-run the installer within that 10-minute period, the new values are not used and the
previous values are used instead. You can delete the contents of the cache, which is defined
by the `fact_caching_connection` value in the *_/etc/ansible/ansible.cfg_* file, which is
shown in xref:../../scaling_performance/install_practices.adoc#scaling-performance-install-optimization[Recommended Installation Practices].

Comment 9 Chris Evich 2018-06-05 14:36:59 UTC
Looks good.

It's a nasty time-suck problem when you hit it, where noticing a visible "caution" would be much appreciated.  OTOH, probably 99.999% of the time, it's unnecessary reading.  I'm comfortable leaving the degree of underlining/callout to your judgement, as my opinion on the matter is heavily biased.

Comment 10 Johnny Liu 2018-06-05 15:59:55 UTC
LGTM.

