Bug 1471322 - Default image tags for logging components allow new images to deploy without required configmap or deploymentconfig changes
Status: VERIFIED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.1
Platform: All Linux
Priority: unspecified, Severity: low
Target Release: 3.6.z
Assigned To: Jan Wozniak
QA Contact: Anping Li
Whiteboard: aos-scalability-36
Keywords: NeedsTestCase, OpsBlocker
Reported: 2017-07-14 21:55 EDT by Peter Portante
Modified: 2017-09-22 05:41 EDT
CC: 13 users

Doc Type: Enhancement
Type: Bug




External Tracker: Github openshift/openshift-ansible/pull/5138 (last updated 2017-08-25 16:31 EDT)

Description Peter Portante 2017-07-14 21:55:46 EDT
The default image tags for logging components allow new images to deploy without required configmap or deploymentconfig changes.

While this is filed against 3.6.1, it applies to all releases of logging in OpenShift.

The scenario is actually simple:

1. Images for a component of logging are pushed to a registry:

   logging-elasticsearch 3.6.1-15

2. Logging is deployed with a deployment config that by default specifies a floating tag such as 3.6.1, 3.6, or v3.6

3. New images for a given component are pushed to the same registry:

   logging-elasticsearch 3.6.1-16

4. The new images require an environment variable to be present when the container starts

5. The node in the cluster goes down in an unplanned fashion

6. The failed node is replaced, and the pod is recreated on the replacement node, pulling the latest image, 3.6.1-16

7. However, the required changes to the deploymentconfig that governs that pod have not been deployed

8. The newly deployed pod enters a crash-loop backoff (CrashLoopBackOff) state because it is missing an environment variable it requires

This scenario is all too common.

There are at least four potential methods to solve this situation:

1. (Operator side) The operators of an openshift environment only push new images to the registry referenced by logging when they are ready to be deployed, and make the updates in a controlled manner

2. (Operator side) The operators change the version numbers in the contents of the daemonsets and deploymentconfigs so they are pinned to a particular version and build number

3. (Product side) Changes made to an image at a particular version and build number must never also require changes to configmaps or deploymentconfigs

4. The ansible method of installing and updating images first inspects which images are the latest in the registry at the time of deployment, and deploys them with the fixed version and build number set in the deploymentconfigs and daemonsets

The above assumes we continue with the <major>.<minor>.<update>-<build> scheme.

We could consider changing to a <major>.<minor>.<release>.<update>-<build> scheme where we don't change configmaps, deploymentconfigs, or daemonsets if the build number changes, but _do_ allow changing configmaps, deploymentconfigs, or daemonsets when changing <major>.<minor>.<release>.<update>.

So for example:

  3.6.1.9-13 -> 3.6.1.9-24  :  configmaps, deploymentconfigs, daemonsets can't change

  3.6.1.9-13 -> 3.6.1.12-3  :  configmaps, deploymentconfigs, daemonsets _can_ change

In this scenario, the deployer would always set the image tag as 3.6.1.9 initially, and the tag would only be changed by an explicit deployment.
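The proposed rule can be sketched as follows. This is purely an illustration of the tagging policy described above; `parse_tag` and `config_change_allowed` are hypothetical helpers, not part of any product code.

```python
def parse_tag(tag):
    """Split a tag like '3.6.1.9-13' into (version_tuple, build_number)."""
    version, _, build = tag.partition("-")
    return tuple(int(p) for p in version.split(".")), int(build)

def config_change_allowed(old_tag, new_tag):
    """Configmaps/deploymentconfigs/daemonsets may change only when the
    <major>.<minor>.<release>.<update> part changes, never on a
    build-number-only bump."""
    old_version, _ = parse_tag(old_tag)
    new_version, _ = parse_tag(new_tag)
    return old_version != new_version

# The two examples from the proposal above:
print(config_change_allowed("3.6.1.9-13", "3.6.1.9-24"))  # False: build-only bump
print(config_change_allowed("3.6.1.9-13", "3.6.1.12-3"))  # True: version bump
```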
Comment 1 Stefanie Forrester 2017-07-17 10:00:03 EDT
As it is now, option #2 is well within Ops' control. That's what I've been using to fix the issue when I see it in prod.

Option #1 will be available to us as soon as I can migrate prod to the new registry (within the next couple weeks). I don't think any product-side changes need to happen in order to meet Ops' needs.

But I think even when we have option #1, it will be a logistical challenge for Ops to schedule the updates across all the customer clusters and get them done in time. Because, as soon as the new image is pushed, they will be racing the clock to get the clusters upgraded before the new image is accidentally pulled onto a node affected by an unplanned outage, which would break logging for that cluster. So I'm more in favor of option #2, with option #1 as extra insurance against accidental image pulls.
Comment 2 Jeff Cantrill 2017-07-17 10:25:48 EDT
Peter brought this issue to my attention, and I am of the opinion that we should be advising customers to peg the logging component versions to a specific version to avoid unintentional changes to the logging stack.  This seems to me to simply be good practice for an operational team, along with things like:

* version controlling configurations
* schema and data backups before upgrading databases.

We should, if we do not already, advise customers to version control their ansible inventory files.

I think we should also consider modifying the default pull policy.
Comment 3 Jeff Cantrill 2017-07-17 13:33:59 EDT
Lowering the severity to remove this from the blocker list.  We allow configuration of component images independently, which should allow users to manage the version they are running in a controlled fashion.  There may be some additional modifications we can make to help avoid this issue.
Comment 4 Peter Portante 2017-07-17 14:15:11 EDT
@Jeff, can we really lower this if we don't know yet whether the openshift-ansible code supports specific versions for each component?
Comment 5 Jeff Cantrill 2017-08-08 11:57:56 EDT
The openshift-ansible code supports providing component specific versions by setting inventory vars [1]. 

Does this satisfy this issue?  The changes you have proposed in our conversations should be captured as an enhancement via Trello if they are not already.  By pegging the versions, we can avoid the issue of unintentionally pulling new images on restart, and require upgrades to go through ansible.


[1] https://github.com/openshift/openshift-ansible/blob/release-3.6/roles/openshift_logging_elasticsearch/defaults/main.yml#L3-L4
Comment 6 Rich Megginson 2017-08-08 12:31:14 EDT
Are these component specific versions actually used anywhere?

> pwd
openshift-ansible
> git branch
* release-3.6
> git grep openshift_logging_elasticsearch_image_version
roles/openshift_logging_elasticsearch/defaults/main.yml:openshift_logging_elasticsearch_image_version: "{{ openshift_hosted_logging_deployer_version | default('latest') }}"

That is, there is no reference to this being used anywhere.

So, let's take a look at the template es.j2, where the image is set:

> grep image: roles/openshift_logging_elasticsearch/templates/es.j2

          image: {{image}}

> vi roles/openshift_logging_elasticsearch/tasks/main.yaml
# DC
- name: Set ES dc templates
  template:
    src: es.j2
    dest: "{{ tempdir }}/templates/logging-es-dc.yml"
...
    image: "{{ openshift_logging_image_prefix }}logging-elasticsearch:{{ openshift_logging_image_version }}"


AFAICT, the openshift_logging_image_version is used and not the component specific version.
Comment 7 Peter Portante 2017-08-08 13:23:34 EDT
(In reply to Jeff Cantrill from comment #5)
> Does this satisfy this issue?

As Rich pointed out, it does not seem to be sufficient.  We need to be able to specify a unique version for each image, since they don't all share the same version numbers.
Comment 9 Jeff Cantrill 2017-08-25 16:31:21 EDT
Specific image version is addressed by: https://github.com/openshift/openshift-ansible/pull/5138
Comment 10 openshift-github-bot 2017-08-25 19:46:40 EDT
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/00e7908e983f97ab31360aeff32824f52fd25efd
Bug 1471322: logging roles based image versions

Allowing to specify an image version for each logging component

https://bugzilla.redhat.com/show_bug.cgi?id=1471322
Comment 11 Peter Portante 2017-08-25 21:00:32 EDT
Are we declaring this BZ fixed by enabling specific image versions?  If we are, have we considered the original problem?  Do the images we provide today work in the face of DC, DS, and configmaps that don't provide what they expect?
Comment 12 Jeff Cantrill 2017-09-01 15:01:56 EDT
Peter, I consider this to resolve the issue by requiring users to explicitly identify the images they desire.  Additionally, I believe there is work underway so that our CD process will tag every OpenShift image, keeping them all in lockstep.  If we think there is additional work, let's move it to a Trello card so we can schedule it properly and discuss it.
Comment 13 Peter Portante 2017-09-01 15:27:42 EDT
Great, please move this to a trello card then, thanks!
Comment 15 Anping Li 2017-09-14 02:38:21 EDT
The fix is in openshift-ansible-3.6.173.0.33, but errata 30362 doesn't include openshift-ansible packages. Could you confirm whether we will use this errata to publish the fix?
Comment 16 Anping Li 2017-09-21 07:30:25 EDT
@scott, the fix is in openshift-ansible. Could you move this bug to the proper installer errata?
Comment 19 Anping Li 2017-09-22 05:41:52 EDT
Verified and passed.

Scenario 1:
1) Tag the images with different prefixes and versions in the docker registry

2) Set the variables in the inventory:

openshift_logging_image_prefix=registry.example.com/openshift3/
openshift_logging_image_version=v3.6
openshift_logging_curator_image_prefix=registry.example.com/test1/
openshift_logging_curator_image_version=v3.6.0.1
openshift_logging_elasticsearch_image_prefix=registry.example.com/test2/
openshift_logging_elasticsearch_image_version=v3.6.0.2
openshift_logging_fluentd_image_prefix=registry.example.com/test3/
openshift_logging_fluentd_image_version=v3.6.0.3
openshift_logging_kibana_image_prefix=registry.example.com/test4/
openshift_logging_kibana_image_version=v3.6.0.4
openshift_logging_kibana_proxy_image_prefix=registry.example.com/test5/
openshift_logging_kibana_proxy_image_version=v3.6.0.5
openshift_logging_mux_image_prefix=registry.example.com/test6/
openshift_logging_mux_image_version=v3.6.0.6

3) Deploy and check the images in the deploymentconfigs and daemonset:

# oc get dc -o yaml|grep 'image:'
          image: /test1/logging-curator:v3.6.0.1
          image: registry.example.com/test2/logging-elasticsearch:v3.6.0.2
          image: registry.example.com/test4/logging-kibana:v3.6.0.4
          image: registry.example.com/test5/logging-auth-proxy:v3.6.0.5
          image: registry.example.com/test6/logging-fluentd:v3.6.0.6
# oc get ds -o yaml|grep 'image:'
          image: registry.example.com/test3/logging-fluentd:v3.6.0.3


Scenario 2:
1) Set a different version for each image, without per-component prefixes:
openshift_logging_image_prefix=registry.example.com/openshift3/
openshift_logging_image_version=v3.6
openshift_logging_curator_image_version=v3.6.0.1
openshift_logging_elasticsearch_image_version=v3.6.0.2
openshift_logging_fluentd_image_version=v3.6.0.3
openshift_logging_kibana_image_version=v3.6.0.4
openshift_logging_kibana_proxy_image_version=v3.6.0.5
openshift_logging_mux_image_version=v3.6.0.6

2) Deploy logging

3) Check the dc and ds:

# oc get dc -o yaml |grep 'image:'
          image: registry.example.com/openshift3/logging-curator:v3.6.0.1
          image: registry.example.com/openshift3/logging-elasticsearch:v3.6.0.2
          image: registry.example.com/openshift3/logging-kibana:v3.6.0.4
          image: registry.example.com/openshift3/logging-auth-proxy:v3.6.0.5
          image: registry.example.com/openshift3/logging-fluentd:v3.6.0.6
# oc get ds -o yaml |grep 'image:'
          image: registry.example.com/openshift3/logging-fluentd:v3.6.0.3
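The fallback behavior verified in the two scenarios above can be sketched as follows. The variable names match the inventory settings shown, but `effective_image` and the resolution logic are an illustration of the observed behavior, not the actual ansible implementation.

```python
# Global defaults, as in the inventory above.
GLOBAL_PREFIX = "registry.example.com/openshift3/"
GLOBAL_VERSION = "v3.6"

def effective_image(name, prefix=None, version=None):
    """Build the image reference for a component; unset per-component
    prefix/version vars fall back to the global openshift_logging_* values."""
    return "{}{}:{}".format(prefix or GLOBAL_PREFIX, name, version or GLOBAL_VERSION)

# Scenario 2: only the version is overridden per component.
print(effective_image("logging-elasticsearch", version="v3.6.0.2"))
# -> registry.example.com/openshift3/logging-elasticsearch:v3.6.0.2

# Scenario 1: both prefix and version are overridden.
print(effective_image("logging-fluentd",
                      prefix="registry.example.com/test3/",
                      version="v3.6.0.3"))
# -> registry.example.com/test3/logging-fluentd:v3.6.0.3
```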
