Bug 1367844

Summary: Openshift metrics will ignore preloaded "latest" images on openshift nodes
Product: OpenShift Container Platform
Reporter: Elvir Kuric <ekuric>
Component: Image Registry
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED NOTABUG
QA Contact: Wei Sun <wsun>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.2.1
CC: aos-bugs, jeder, mwringe, tstclair
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-26 13:38:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Elvir Kuric 2016-08-17 15:49:15 UTC
Description of problem:

Add an option for OpenShift metrics to use preloaded "latest" images on the OpenShift nodes as the first choice at startup.

OpenShift metrics tries to download the "latest" metrics images and fails to start if they cannot be downloaded. This happens even when the "latest" metrics images were preloaded on the nodes where the metrics pods are scheduled to start. Currently it does not check whether the "latest" metrics images are present locally; it goes straight to downloading them.



Version-Release number of selected component (if applicable):
 
OpenShift v3.2

How reproducible:

Preload the "latest" images on the OpenShift nodes, then try to deploy OpenShift metrics while the network is broken.


Steps to Reproduce:
Preload the "latest" images on the OpenShift nodes, then try to deploy OpenShift metrics while the network is broken.

Actual results:

Starting openshift-metrics fails.


Expected results:
Starting openshift-metrics succeeds.

Comment 1 Matt Wringe 2016-08-17 16:12:44 UTC
This needs to be handled at the OpenShift level, so that containers can be started from local images if the remote registry is not accessible.

Comment 2 Michal Fojtik 2016-08-17 16:24:09 UTC
Elvir, what pull policy are the metrics pods using? (https://github.com/kubernetes/kubernetes/blob/master/pkg/api/v1/types.go#L1069)

Comment 3 Jeremy Eder 2016-08-19 01:27:14 UTC
I have reproduced this issue on my system.

I don't see that a policy is specified:
https://github.com/openshift/origin-metrics/pull/9/files

root@mvirt2-j: ~/origin-metrics # oc describe pod metrics-deployer-3n92o
Name:                   metrics-deployer-3n92o
Namespace:              openshift-infra
Security Policy:        anyuid
Node:                   192.2.11.8/192.2.11.8
Start Time:             Thu, 18 Aug 2016 21:15:27 -0400
Labels:                 component=deployer
                        metrics-infra=deployer
                        provider=openshift
Status:                 Pending
IP:                     10.129.6.2
Controllers:            <none>
Containers:
  deployer:
    Container ID:
    Image:              registry.qe.openshift.com/openshift3/metrics-deployer:latest
    Image ID:
    Port:
    State:              Waiting
      Reason:           ErrImagePull
    Ready:              False
    Restart Count:      0
    Volume Mounts:
      /etc/deploy from empty (rw)
      /secret from secret (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-deployer-token-icsi3 (ro)
    Environment Variables:
      PROJECT:                          openshift-infra (v1:metadata.namespace)
      POD_NAME:                         metrics-deployer-3n92o (v1:metadata.name)
      IMAGE_PREFIX:                     registry.qe.openshift.com/openshift3/
      IMAGE_VERSION:                    latest
      MASTER_URL:                       https://kubernetes.default.svc:443
      MODE:                             deploy
      REDEPLOY:                         false
      IGNORE_PREFLIGHT:                 false
      USE_PERSISTENT_STORAGE:           false
      DYNAMICALLY_PROVISION_STORAGE:    false
      HAWKULAR_METRICS_HOSTNAME:        192.2.8.32
      CASSANDRA_NODES:                  1
      CASSANDRA_PV_SIZE:                10Gi
      METRIC_DURATION:                  7
      USER_WRITE_ACCESS:                false
      HEAPSTER_NODE_ID:                 nodename
      METRIC_RESOLUTION:                10s
Conditions:
  Type          Status
  Initialized   True 
  Ready         False 
  PodScheduled  True 
Volumes:
  empty:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  secret:
    Type:       Secret (a volume populated by a Secret)
    SecretName: metrics-deployer
  metrics-deployer-token-icsi3:
    Type:       Secret (a volume populated by a Secret)
    SecretName: metrics-deployer-token-icsi3
QoS Tier:       BestEffort
Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                    -------------                   --------        ------          -------
  10m           10m             1       {default-scheduler }                                    Normal          Scheduled       Successfully assigned metrics-deployer-3n92o to 192.2.11.8
  10m           27s             43      {kubelet 192.2.11.8}    spec.containers{deployer}       Normal          BackOff         Back-off pulling image "registry.qe.openshift.com/openshift3/metrics-deployer:latest"
  10m           27s             43      {kubelet 192.2.11.8}                                    Warning         FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deployer" with ImagePullBackOff: "Back-off pulling image \"registry.qe.openshift.com/openshift3/metrics-deployer:latest\""

  10m   13s     7       {kubelet 192.2.11.8}    spec.containers{deployer}       Normal  Pulling pulling image "registry.qe.openshift.com/openshift3/metrics-deployer:latest"
  10m   13s     7       {kubelet 192.2.11.8}    spec.containers{deployer}       Warning Failed  Failed to pull image "registry.qe.openshift.com/openshift3/metrics-deployer:latest": unable to ping registry endpoint https://registry.qe.openshift.com/v0/
v2 ping attempt failed with error: Get https://registry.qe.openshift.com/v2/: dial tcp: lookup registry.qe.openshift.com on 192.2.0.2:53: server misbehaving
 v1 ping attempt failed with error: Get https://registry.qe.openshift.com/v1/_ping: dial tcp: lookup registry.qe.openshift.com on 192.2.0.2:53: server misbehaving
  10m   13s     7       {kubelet 192.2.11.8}            Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deployer" with ErrImagePull: "unable to ping registry endpoint https://registry.qe.openshift.com/v0/\nv2 ping attempt failed with error: Get https://registry.qe.openshift.com/v2/: dial tcp: lookup registry.qe.openshift.com on 192.2.0.2:53: server misbehaving\n v1 ping attempt failed with error: Get https://registry.qe.openshift.com/v1/_ping: dial tcp: lookup registry.qe.openshift.com on 192.2.0.2:53: server misbehaving"



root@mvirt2-j: ~/origin-metrics # ssh root.11.8 docker images|grep metrics
registry.qe.openshift.com/openshift3/metrics-hawkular-metrics    latest              3ab97c3f7395        2 weeks ago         1.663 GB
registry.qe.openshift.com/openshift3/metrics-cassandra           latest              f460976d4f99        2 weeks ago         837.8 MB
registry.qe.openshift.com/openshift3/metrics-deployer            latest              91a831b58627        2 weeks ago         754.4 MB
registry.qe.openshift.com/openshift3/metrics-heapster            latest              179e3ed5c3b2        2 weeks ago         288.4 MB

Comment 4 Elvir Kuric 2016-08-26 09:13:27 UTC
(In reply to Michal Fojtik from comment #2)
> Elvir, what pull policy are the metrics pods using?
> (https://github.com/kubernetes/kubernetes/blob/master/pkg/api/v1/types.go#L1069)

I was on PTO, sorry for the delay.

No pull policy is set for the metrics pods, which leads us to:

If a container’s imagePullPolicy parameter is not specified, OpenShift Origin sets it based on the image’s tag:
If the tag is latest, OpenShift Origin defaults imagePullPolicy to Always.

Source: https://docs.openshift.org/latest/dev_guide/managing_images.html#image-pull-policy
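
To make the defaulting rule concrete, here is a minimal Python sketch of how I understand it: an image whose tag is "latest" (or that has no tag at all) defaults to Always, and any other tag defaults to IfNotPresent. The function name default_pull_policy is made up for illustration and is not an OpenShift or Kubernetes API, and the sketch does not handle image references pinned by digest.

```python
def default_pull_policy(image: str) -> str:
    """Return the pull policy applied when imagePullPolicy is unset (illustrative only)."""
    # The tag, if any, follows the ":" in the final path component;
    # a ":" earlier in the reference belongs to a registry port.
    last_component = image.rsplit("/", 1)[-1]
    _, sep, tag = last_component.partition(":")
    if not sep or not tag:
        tag = "latest"  # an untagged reference is treated as "latest"
    return "Always" if tag == "latest" else "IfNotPresent"


prefix = "registry.qe.openshift.com/openshift3/"
print(default_pull_policy(prefix + "metrics-deployer:latest"))        # Always
print(default_pull_policy(prefix + "metrics-deployer:latest_local"))  # IfNotPresent
```

This is why a pod that references metrics-deployer:latest always triggers a pull attempt, while the same image under any other tag can be started from the local copy.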

By default, metrics will try to pull the "latest" images for the metrics pods: https://github.com/openshift/origin-metrics/blob/master/metrics.yaml#L104

The particular issue I see here is this: if these images are put on the machines in advance (e.g. prebuilt KVM images for OpenStack installations, or prebuilt AMIs), metrics will by default still try to pull the "latest" images, even though they are already preloaded on the machines and tagged "latest".


The intention of this bug is to change this: if an image tagged "latest" is present on the system where the pod is scheduled to start, use it as the first option and do not try to pull it again.


On the other hand, metrics supports the IMAGE_VERSION option (default "latest"): https://github.com/openshift/origin-metrics/blob/master/docs/deployer_configuration.adoc#deployer-template-parameters

In my test case, I used the latest bits with the following workaround:

1) get the latest images before starting metrics, e.g. preload them on the nodes
2) tag them with a new tag, e.g. "latest_local"
3) start metrics with --IMAGE_VERSION=latest_local
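
Step 2 of the workaround can be sketched as a small Python helper that prints the "docker tag" commands to run on each node. The registry prefix and the four image names are taken from the "docker images" output in comment 3; the helper name retag_commands is made up for illustration, and the script only prints the commands rather than running them.

```python
PREFIX = "registry.qe.openshift.com/openshift3/"
COMPONENTS = [
    "metrics-deployer",
    "metrics-hawkular-metrics",
    "metrics-cassandra",
    "metrics-heapster",
]


def retag_commands(new_tag: str, old_tag: str = "latest") -> list:
    """Build one `docker tag SOURCE TARGET` command per metrics image."""
    commands = []
    for name in COMPONENTS:
        source = f"{PREFIX}{name}:{old_tag}"
        target = f"{PREFIX}{name}:{new_tag}"
        commands.append(" ".join(["docker", "tag", source, target]))
    return commands


# Print the commands; run them on every node that may host a metrics pod.
for command in retag_commands("latest_local"):
    print(command)
```

Because the new tag is not "latest", the default pull policy for these images is no longer Always, so the preloaded copies can be used.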

Comment 5 Matt Wringe 2016-08-26 13:05:06 UTC
"The intention of this bug is to change this: if an image tagged "latest" is present on the system where the pod is scheduled to start, use it as the first option and do not try to pull it again."

No, this should not be the intention of this bug. It would break a lot of functionality, because the latest images would never be brought in as intended. If we wanted this behaviour, we would use the 'IfNotPresent' policy. But we don't want it to act this way.

The issue here is more along the lines of what to do if you cannot connect to the docker registry, or there was an error connecting to it, but the images are available locally.

It looks like @ekuric would like the system to be able to use the local images instead of failing.
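
As a rough illustration of the difference, the following Python sketch models the decision for a single image. It is a simplified model of the behaviour discussed in this bug, not the kubelet implementation. The function resolve_image and the fallback_to_local flag are hypothetical: no such option exists in OpenShift or Kubernetes as far as this report shows; the flag only stands for the behaviour being requested.

```python
def resolve_image(policy: str, present_locally: bool, registry_reachable: bool,
                  fallback_to_local: bool = False) -> str:
    """Return "pull", "use-local", or "fail" for one container image."""
    if policy == "Never":
        return "use-local" if present_locally else "fail"
    if policy == "IfNotPresent" and present_locally:
        return "use-local"
    # From here on a pull is required: the policy is Always,
    # or the image is missing on the node.
    if registry_reachable:
        return "pull"
    if fallback_to_local and present_locally:
        return "use-local"  # hypothetical behaviour requested in this bug
    return "fail"


# Situation from comment 3: policy Always, image preloaded, registry unreachable.
print(resolve_image("Always", True, False))                          # fail
print(resolve_image("Always", True, False, fallback_to_local=True))  # use-local
```

With IfNotPresent the local copy always wins, so new "latest" images would never be fetched; the fallback variant still pulls whenever the registry can be reached.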

Comment 6 Timothy St. Clair 2016-08-26 13:11:13 UTC
It is a common best practice to pre-pull your images, especially in immutable VM infrastructure or "air-gapped" deployments. Operators will ensure that the images associated with the infrastructure are available locally on the VMs, and only on the VMs, because they have curated and locked down their infrastructure.

Forcing a pull when the operator knows what they want and has pre-baked the images for security purposes is a bug. We need to support this mode of operation, as it is a best practice in large deployments.

Comment 7 Timothy St. Clair 2016-08-26 13:26:32 UTC
OK, ignore my comment.

We can't pre-pull "latest" images; they need to be versioned.

Comment 8 Elvir Kuric 2016-08-26 13:38:54 UTC
OpenShift metrics starts fine with a specific image version when it is started with the --IMAGE_VERSION option.

If the images are preloaded on the OpenShift nodes and --IMAGE_VERSION is specified at startup, metrics will use the images with the specified IMAGE_VERSION as the first choice.

Closing this BZ.