Bug 1531500 - Redundant deployment appears when deployment is triggered once
Summary: Redundant deployment appears when deployment is triggered once
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 3.9.0
Hardware: x86_64
OS: Other
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Ben Parees
QA Contact: Dongbo Yan
URL:
Whiteboard:
Depends On:
Blocks: 1541934
 
Reported: 2018-01-05 10:46 UTC by ge liu
Modified: 2018-12-13 19:26 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1541934
Environment:
Last Closed: 2018-12-13 19:26:48 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3748 0 None None None 2018-12-13 19:26:58 UTC

Description ge liu 2018-01-05 10:46:52 UTC
Description of problem:
Created an application from a template and triggered deployment once, but it triggered two deployments and created two pods.

oc v3.8.26
kubernetes v1.8.1+0d5291c
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.free-int.openshift.com:443
openshift v3.8.18
kubernetes v1.8.1+0d5291c


How reproducible:
Always

Steps to Reproduce:
1.# oc process -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/OCP-11384/application-template-stibuild.json| oc create -f -

2.# oc get pods
NAME                        READY     STATUS      RESTARTS   AGE
database-1-deploy           0/1       Error       0          1m
frontend-2-bs7n9            1/1       Running     0          25s
frontend-2-qkfbk            1/1       Running     0          25s
ruby-sample-build-1-build   0/1       Completed   0          1m

3.# oc get dc 
NAME       REVISION   DESIRED   CURRENT   TRIGGERED BY
database   1          1         0         config
frontend   2          2         2         config,image(origin-ruby-sample:latest)

# oc get dc frontend -o yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  annotations:
    template.alpha.openshift.io/wait-for-ready: "true"
  creationTimestamp: 2018-01-05T10:34:37Z
  generation: 3
  labels:
    template: application-template-stibuild
  name: frontend
  namespace: lgp3
  resourceVersion: "147125443"
  selfLink: /oapi/v1/namespaces/lgp3/deploymentconfigs/frontend
  uid: 071b5fa2-f204-11e7-98b7-0ac586c2eb16
spec:
  replicas: 2
  selector:
    name: frontend
  strategy:
    activeDeadlineSeconds: 21600
    resources: {}
    rollingParams:
      intervalSeconds: 1
      maxSurge: 25%
      maxUnavailable: 25%
      post:
        execNewPod:
          command:
          - /bin/true
          containerName: ruby-helloworld
          env:
          - name: CUSTOM_VAR2
            value: custom_value2
        failurePolicy: Ignore
      pre:
        execNewPod:
          command:
          - /bin/true
          containerName: ruby-helloworld
          env:
          - name: CUSTOM_VAR1
            value: custom_value1
        failurePolicy: Abort
      timeoutSeconds: 120
      updatePeriodSeconds: 1
    type: Rolling
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: frontend
    spec:
      containers:
      - env:
        - name: MYSQL_USER
          valueFrom:
            secretKeyRef:
              key: mysql-user
              name: dbsecret
        - name: MYSQL_PASSWORD
          valueFrom:
            secretKeyRef:
              key: mysql-password
              name: dbsecret
        - name: MYSQL_DATABASE
          value: root
        image: docker-registry.default.svc:5000/lgp3/origin-ruby-sample@sha256:e4a3e3b47961386374696299edcccd406b5920cd1b9a1663b757c44ab5d2b233
        imagePullPolicy: IfNotPresent
        name: ruby-helloworld
        ports:
        - containerPort: 8080
          protocol: TCP
        resources: {}
        securityContext:
          capabilities: {}
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  test: false
  triggers:
  - imageChangeParams:
      automatic: true
      containerNames:
      - ruby-helloworld
      from:
        kind: ImageStreamTag
        name: origin-ruby-sample:latest
        namespace: lgp3
      lastTriggeredImage: docker-registry.default.svc:5000/lgp3/origin-ruby-sample@sha256:e4a3e3b47961386374696299edcccd406b5920cd1b9a1663b757c44ab5d2b233
    type: ImageChange
  - type: ConfigChange
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: 2018-01-05T10:35:54Z
    lastUpdateTime: 2018-01-05T10:35:54Z
    message: Deployment config has minimum availability.
    status: "True"
    type: Available
  - lastTransitionTime: 2018-01-05T10:35:51Z
    lastUpdateTime: 2018-01-05T10:36:02Z
    message: replication controller "frontend-2" successfully rolled out
    reason: NewReplicationControllerAvailable
    status: "True"
    type: Progressing
  details:
    causes:
    - imageTrigger:
        from:
          kind: DockerImage
          name: docker-registry.default.svc:5000/lgp3/origin-ruby-sample@sha256:e4a3e3b47961386374696299edcccd406b5920cd1b9a1663b757c44ab5d2b233
      type: ImageChange
    message: image change
  latestVersion: 2
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  unavailableReplicas: 0
  updatedReplicas: 2

Actual results:
A redundant deployment appears when deployment is triggered once.

Expected results:
Triggering deployment once should not trigger a redundant deployment.

Comment 1 Justin Pierce 2018-01-05 13:26:58 UTC
Doesn't "replicas: 2" in you dc indicate that there should be two pods?

Comment 2 Wang Haoran 2018-01-05 15:39:57 UTC
(In reply to Justin Pierce from comment #1)
> Doesn't "replicas: 2" in your dc indicate that there should be two pods?

Yes, there should be two pods, but the image change should only trigger one deployment, which creates a new RC with replicas=2; it should not trigger two deployments, which creates two deployer pods.
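
For reference, a minimal sketch of how the number of rollouts can be counted (assuming the dc name "frontend" from this report): each deployment creates one replication controller (frontend-1, frontend-2, ...) and one deployer pod (frontend-1-deploy, ...), so a single triggered deployment should leave exactly one of each:

# oc get rc | grep frontend
# oc get pods -a | grep 'frontend-.*-deploy'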

Comment 3 Michal Fojtik 2018-01-09 09:57:45 UTC
(In reply to Wang Haoran from comment #2)
> (In reply to Justin Pierce from comment #1)
> > Doesn't "replicas: 2" in your dc indicate that there should be two pods?
> 
> Yes, there should be two pods, but the image change should only trigger one
> deployment, which creates a new RC with replicas=2; it should not trigger
> two deployments, which creates two deployer pods.

I tried to reproduce based on the steps. The first time I created the template, the database errored (quota) and the build started. When the build finished, I got:

~ → oc get dc
NAME       REVISION   DESIRED   CURRENT   TRIGGERED BY
database   1          1         0         config
frontend   1          2         2         config,image(origin-ruby-sample:latest)

Meaning just 1 deployment was triggered. I retried this 3 times and was not able to hit the double deployment.

How did you trigger the second deployment? Was it triggered manually via 'oc rollout latest', or did it trigger automatically for an unknown reason?
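
If it reproduces again, a sketch of how the trigger cause of each revision can be inspected (assuming the dc name from the report):

# oc rollout history dc/frontend
# oc describe dc frontend

The status.details.causes field (visible in the dc yaml above) records whether a revision came from a config change or an image change.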

Comment 5 Michal Fojtik 2018-01-09 10:25:58 UTC
Lowering priority since QA can't reproduce this at present either. When they successfully reproduce it, logs will be provided and we can raise the priority again.

Comment 6 ge liu 2018-01-12 07:30:59 UTC
Closing this bug since it could not be reproduced.

Comment 9 Michal Fojtik 2018-01-17 11:17:39 UTC
The problem is in the docker-registry configuration. When you use DNS for the docker-registry, the registry has to know which hostname should be used when new images are created. By default, the registry will use the docker-registry IP address.

This causes the initial "create" call to create the image with the wrong DockerImageReference (172.30.88.55:5000/haowang). However, in OpenShift we have a decorator field for DockerImageReference, so when the informer cache is updated (1s later), the image field value is changed to the DNS-based format.

As a result, we can see 2 deployments (if the cache resync is fast, you see just one).

To fix this, you have to set the 'REGISTRY_OPENSHIFT_SERVER_ADDR' environment variable on the dc/docker-registry to 'docker-registry.default.svc:5000'.
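
For example (a sketch; the dc name and namespace assume the default registry deployment in the "default" project):

# oc set env dc/docker-registry REGISTRY_OPENSHIFT_SERVER_ADDR=docker-registry.default.svc:5000 -n default

Updating the env on the dc will roll out a new registry pod with the variable set.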

This is likely an installer bug; the installer should do this automatically.

Ben: I guess your team owns the ansible part of the registry; we need to make sure that environment variable is set when DNS is configured for the docker-registry.

Comment 10 Justin Pierce 2018-01-17 14:14:25 UTC
The docker registries on the free/starter clusters are very old, so the installer has not touched them in some time. We could manually apply this change if necessary.

Comment 11 Ben Parees 2018-01-17 14:45:03 UTC
Ansible appears to be setting OPENSHIFT_DEFAULT_REGISTRY, which, as best I can tell, was later deprecated in favor of REGISTRY_OPENSHIFT_SERVER_ADDR but should accomplish the same goal.

Comment 13 ge liu 2018-01-23 06:15:22 UTC
Can't recreate this issue in an OCP env (3.9.0-0.22.0); I think it appears randomly, per the analysis in comment 9.

Comment 15 Ben Parees 2018-01-23 14:49:33 UTC
> @Ben, as you said, OPENSHIFT_DEFAULT_REGISTRY is deprecated; would you mind updating to use REGISTRY_OPENSHIFT_SERVER_ADDR?

We can do it, but setting either value should currently work, so it's not the problem here. Can you open a separate bug to track making that change?

Was either value set on the registry in this case?  I suspect neither value was set.

Comment 16 Wang Haoran 2018-01-23 14:51:37 UTC
@Ben, OPENSHIFT_DEFAULT_REGISTRY was set in the DC of the registry.

Comment 17 Wang Haoran 2018-01-23 14:58:21 UTC
@Ben bug opened https://bugzilla.redhat.com/show_bug.cgi?id=1537593

Comment 18 Ben Parees 2018-01-23 20:26:13 UTC
If this cannot be recreated anymore because the environment was taken down, I suggest closing it.

If we have an environment where it can be recreated, we need to see:

the registry pod yaml
the master config
the imagestream yaml the deploymentconfig is referencing
the deploymentconfig yaml

Note that in my environment, where I was not able to recreate it, the only variable I set was OPENSHIFT_DEFAULT_REGISTRY=docker-registry.default.svc.local, and I set it for the master process; I did not explicitly set it on the registry (the registry will default to this value anyway).

Regarding comments 9 and 13, I do not think this is random. Getting two deployments might be random (it's a bit of a race), but it's fundamentally caused by the imagestream being defined in terms of the registry IP address (imagestream.status.dockerImageRepository points to an IP address) while the master has an OPENSHIFT_DEFAULT_REGISTRY value set to something else. (Michal can clarify the behavior, but it sounds like there is a controller his team owns that, upon seeing an image reference to an IP that corresponds to the registry, rewrites that image reference to match the OPENSHIFT_DEFAULT_REGISTRY value.)

How you get an imagestream that's defined in terms of an IP address when OPENSHIFT_DEFAULT_REGISTRY is defined is not obvious to me (again, I could not recreate it, even when I did not explicitly configure the registry deploymentconfig).
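
A quick way to check which form an imagestream is using (a sketch assuming the imagestream and namespace from this report):

# oc get is origin-ruby-sample -n lgp3 -o jsonpath='{.status.dockerImageRepository}'

If this prints an IP-based reference (172.30.x.x:5000/...) while the master is configured with a DNS name, the rewrite race described above can occur.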

Comment 31 Ben Parees 2018-01-24 15:13:26 UTC
Discussed w/ Michal. This definitely appears to be what he initially suspected: if you set OPENSHIFT_DEFAULT_REGISTRY on the master, you must also set the registry address env variable on the registry DC (there are several variables that will accomplish this, but REGISTRY_OPENSHIFT_SERVER_ADDR is the current preferred variable name).

I've also confirmed that the ansible installer does set this on the registry for new installs, and I've updated the ansible installer to set REGISTRY_OPENSHIFT_SERVER_ADDR (previously it was setting OPENSHIFT_DEFAULT_REGISTRY, which also works but is deprecated for the registry).

So basically this is also working as expected. However, I'm going to update the documentation to make it clear to users that they must keep the registry URL setting on the master in sync w/ the registry URL setting on the registry, so I'm leaving this bug open to do that.
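
A verification sketch (the dc name assumes the default registry; the sysconfig path is an assumption and varies by install):

# oc set env dc/docker-registry --list -n default | grep -E 'REGISTRY_OPENSHIFT_SERVER_ADDR|OPENSHIFT_DEFAULT_REGISTRY'
# grep OPENSHIFT_DEFAULT_REGISTRY /etc/sysconfig/atomic-openshift-master*

Both should resolve to the same registry address (e.g. docker-registry.default.svc:5000).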

Comment 32 Ben Parees 2018-01-24 15:51:00 UTC
doc fixes here: 
https://github.com/openshift/openshift-docs/pull/7296

Comment 34 Wang Haoran 2018-01-25 02:30:00 UTC
Moved to VERIFIED.

Comment 37 Ben Parees 2018-01-30 14:12:59 UTC
I'm guessing you ran into this:
https://github.com/openshift/image-registry/issues/58

Comment 38 Ben Parees 2018-01-30 15:58:14 UTC
registry logs would allow us to confirm.

(since that issue is now fixed, a new install might also resolve it for you).

Comment 39 ge liu 2018-01-31 06:49:32 UTC
I tried it on a newly installed env (3.9.0-0.34.0); this problem is fixed. And regarding comment 37, I tried it and there is no problem now.

#  oc logs docker-registry-1-4p66z -n default | grep URL
time="2018-01-31T05:55:54.960130259Z" level=info msg="Using \"docker-registry.default.svc:5000\" as Docker Registry URL" go.version=go1.9.2 instance.id=e505ee2c-5dc9-46b3-aeed-36263726f317

Comment 42 errata-xmlrpc 2018-12-13 19:26:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748

