Bug 1395081 - docker-registry did not update to specified version after OCP upgrade
Summary: docker-registry did not update to specified version after OCP upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Importance: medium / medium
Target Milestone: ---
Assignee: Scott Dodson
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-15 05:38 UTC by liujia
Modified: 2017-07-24 14:11 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In some situations node upgrade could terminate a running pod that was upgrading the router/registry. Consequence: Router/registry would fail to upgrade. Fix: Router/registry upgrade re-ordered to follow node upgrade when performing a full cluster in-place upgrade. Result: Nodes no longer taken offline for upgrade while the router/registry is still running.
Clone Of:
Environment:
Last Closed: 2017-04-12 18:48:19 UTC
Target Upstream Version:
Embargoed:


Attachments
upgrade.log (273.54 KB, text/plain), 2016-11-16 06:58 UTC, liujia


Links
Red Hat Product Errata RHBA-2017:0903 (normal, SHIPPED_LIVE): OpenShift Container Platform atomic-openshift-utils bug fix and enhancement, last updated 2017-04-12 22:45:42 UTC

Description liujia 2016-11-15 05:38:15 UTC
Description of problem:
Triggered an upgrade from OCP 3.3 to OCP 3.4; the upgrade completes with no failure, but we found that the docker-registry pod still uses the old docker-registry image even though the dc had been upgraded to the specified version (v3.4.0.25).
It seems the docker-registry can't be deployed because the node is tagged as unschedulable by the node upgrade playbook that follows.

# oc get po docker-registry-2-3xu3t -o json | grep image
                "image": "openshift3/ose-docker-registry:v3.3.1.3",
                "imagePullPolicy": "IfNotPresent",
        "imagePullSecrets": [
                "image": "openshift3/ose-docker-registry:v3.3.1.3",
                "imageID": "docker-pullable://virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-docker-registry@sha256:e81bd5ba741ad86f90ea4b03b64ea408e6416914309140508a6130317fd9d8bd",
 

# oc get dc docker-registry -o json | grep image
                        "image": "openshift3/ose-docker-registry:v3.4.0.25",
                        "imagePullPolicy": "IfNotPresent",
 
# oc status

https://docker-registry-default.router.default.svc.cluster.local (passthrough) to pod port 5000-tcp (svc/docker-registry)
  dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.0.25 
    deployment #3 failed 17 hours ago: config change
    deployment #2 deployed 19 hours ago - 1 pod
    deployment #1 failed 19 hours ago: newer deployment was found running

svc/kubernetes - 172.30.0.1 ports 443, 53->8053, 53->8053

https://registry-console-default.router.default.svc.cluster.local (passthrough) to pod port registry-console (svc/registry-console)
  dc/registry-console deploys registry.access.redhat.com/openshift3/registry-console:3.3 
    deployment #1 deployed 19 hours ago - 1 pod

svc/router - 172.30.189.178 ports 80, 443, 1936
  dc/router deploys docker.io/openshift3/ose-haproxy-router:v3.4.0.25 
    deployment #2 deployed 17 hours ago - 1 pod
    deployment #1 deployed 19 hours ago


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.23-1.git.0.317b2cd.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Containerized install of HA OCP 3.3 on RHEL hosts (3 masters + 1 node).
2. Upgrade OCP 3.3 to OCP 3.4.


Actual results:
The upgrade completes with no failure and the dc is updated, but the docker-registry deployment fails.

Expected results:
docker-registry should use the specified version of the image.

Comment 2 liujia 2016-11-15 09:29:54 UTC
Sometimes the router hits the same issue.

Comment 3 Devan Goodwin 2016-11-15 13:41:05 UTC
Could you provide the upgrade logs, demonstrate the node status (unschedulable), and include relevant logs from the failing deployment?

Comment 5 Devan Goodwin 2016-11-15 17:57:37 UTC
Additional question: did the upgrade fail once and then get run successfully afterwards?

Comment 6 liujia 2016-11-16 06:57:37 UTC
(In reply to Devan Goodwin from comment #3)
> Could you provide the upgrade logs, demonstrate the node status
> (unschedulable), and include relevant logs from the failing deployment.

upgrade.log is attached.

From the log, the upgrade looks successful. I guess the docker-registry can't be deployed because the node is tagged as unschedulable: "oc status" shows that the docker-registry deployment failed, as in comment #1 above, and I found that the playbooks run "Update registry image to current version" in post_control_plane.yml and then "Mark unschedulable if host is a node" in the following upgrade_nodes.yml.
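
The deployment failure itself can also be inspected directly, for example (illustrative commands; the deployer pod name and the default namespace are assumptions based on a stock install and the failed deployment #3 above):

# oc get pods -n default | grep deploy
# oc logs docker-registry-3-deploy -n default
# oc get events -n default | grep -i schedul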

Comment 7 liujia 2016-11-16 06:58:21 UTC
Created attachment 1221039 [details]
upgrade.log

Comment 9 liujia 2016-11-16 07:03:51 UTC
(In reply to Devan Goodwin from comment #4)
> Tried searching the master hosts above but it looks like the environment was
> torn down, rpms are now 3.3.

Yes, the environment had been rebuilt.

Comment 10 liujia 2016-11-16 07:06:21 UTC
(In reply to Devan Goodwin from comment #5)
> Additional question, did the upgrade fail once, then get run successfully
> afterwards?

No. The upgrade completed successfully without any failure or re-run.

Comment 11 Devan Goodwin 2016-11-16 13:15:44 UTC
Thanks for the additional info. If the system with IP 118 is the dedicated node, it does appear it was set to be schedulable again following the upgrade:

TASK [Set node schedulability] *************************************************
changed: [openshift-118.lab.eng.nay.redhat.com -> openshift-128.lab.eng.nay.redhat.com]

However, if there is only one node where pods can run, at some point during the upgrade there will be nowhere for the registry to run; this probably explains the events indicating there are no schedulable nodes.

I am curious what "oc get nodes" shows after this has happened. 

I am also curious what happens if you manually trigger another deployment with: "oc deploy docker-registry --latest"

If it were possible to create an environment where this has happened and leave it up for us to log in to, that would be very helpful.

I will be trying to reproduce today, but at this point I don't think this is a 3.4 blocker; rather, I suspect something is going wrong because there's only one node in an HA environment.

Moving to upcoming release for now.

Comment 12 Devan Goodwin 2016-11-16 18:59:08 UTC
I am unable to reproduce this. I got an HA cluster set up with just one active node. Because that node is down at some point, you will naturally see "no nodes available to schedule pods" during the upgrade. However, once the upgrade was complete, the node was back online:

[root@ip-172-18-5-36 yum.repos.d]# oc get nodes
NAME                          STATUS                     AGE
ip-172-18-5-35.ec2.internal   Ready                      2h
ip-172-18-5-36.ec2.internal   Ready,SchedulingDisabled   2h
ip-172-18-5-38.ec2.internal   Ready,SchedulingDisabled   2h
ip-172-18-5-39.ec2.internal   Ready,SchedulingDisabled   2h

And the pods re-ran:

[root@ip-172-18-5-36 yum.repos.d]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-7-awcvb    1/1       Running   0          4m
registry-console-2-8zvz7   1/1       Running   0          4m
router-3-gmx5v             1/1       Running   0          4m

"image": "openshift3/ose-docker-registry:v3.4.0.26"
"image": "openshift3/ose-haproxy-router:v3.4.0.26"


I think we might need an environment left up after this is reproduced; it will involve a lot of digging through logs and timestamps to figure this one out.

Comment 13 Mike Barrett 2016-11-17 06:47:34 UTC
This is a good bug in that it might explain why we have a high number of complaints that the registry and router fail to upgrade. If this does turn out to be what you think it is, can you add the ability for the upgrade to tell the user that they do not have enough active nodes to upgrade the registry?

Comment 14 Devan Goodwin 2016-11-17 12:20:54 UTC
Will definitely try to convey the issue if we identify what's going on and can't fix it outright.

Comment 15 liujia 2016-11-17 13:57:20 UTC
@Devan
I just failed to reproduce it in my environment. I'll try again later, but I am not sure if it can be reproduced tonight. If I reproduce it again, I will keep the environment up for you and notify you ASAP.

Comment 16 Devan Goodwin 2016-11-17 13:58:40 UTC
Thanks, I will be trying again today to reproduce.

Comment 18 Devan Goodwin 2016-11-21 14:44:10 UTC
Thanks liujia, I think this explains what is happening.

Recently I reordered some steps so that the registry/router upgrade occurs just after the control plane is upgraded and right before the nodes are upgraded. I believe we're mid-way through deploying the new router/registry when we get to the node upgrade, and if that deploy is still underway on the node being upgraded, the deployment fails.

I can reproduce this locally with a single-system installation; the interesting part shows up in oc status:

https://docker-registry-default.router.default.svc.cluster.local (passthrough) to pod port 5000-tcp (svc/docker-registry)
  dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.0.26 
    deployment #3 failed 5 hours ago: deployer pod no longer exists
    deployment #2 deployed 6 hours ago - 1 pod

This appears for the router as well.

Unfortunately, your earlier oc status showed a different error message, "config change". We will have to keep an eye out for that, but it's possible that fixing this will take care of both issues.

I am going to re-order the router/registry upgrade to occur *after* node upgrade in the case of a full cluster upgrade. If you're just running upgrade_control_plane.yml, I will leave it where it is, occurring at the end. Presumably the user won't run upgrade_nodes.yml so quickly that the deploy is still underway.

If this does surface, the workaround is simply to re-deploy the latest:

oc deploy docker-registry --latest
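
After re-deploying, the rollout can be watched and the resulting image verified, for example (illustrative commands; the deploymentconfig label selector and the availability of "oc rollout" in this release are assumptions):

# oc rollout status dc/docker-registry -n default
# oc get pods -n default -l deploymentconfig=docker-registry -o wide
# oc get pod <registry-pod> -n default -o json | grep '"image"'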

Comment 21 liujia 2017-02-10 11:29:08 UTC
Version:
atomic-openshift-utils-3.5.6-1.git.0.5e6099d.el7.noarch

Steps:
1. Containerized install of HA OCP 3.4 on RHEL hosts (3 masters + 1 node).
2. Upgrade OCP 3.4 to OCP 3.5 with the specified openshift_image_tag (v3.5.0.16).

Result:
The upgrade succeeded. docker-registry was upgraded to the specified version along with the other images.

# docker ps
CONTAINER ID        IMAGE                                                        COMMAND                  CREATED             STATUS              PORTS               NAMES
e392eb138b71        openshift3/ose-docker-registry:v3.5.0.16                     "/bin/sh -c 'DOCKER_R"   11 minutes ago      Up 11 minutes                           k8s_registry.5f7b798b_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_e54f4ead
4d9abd7b62f6        openshift3/ose-haproxy-router:v3.5.0.16                      "/usr/bin/openshift-r"   11 minutes ago      Up 11 minutes                           k8s_router.64fb396e_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb914471
53df900d5205        openshift3/ose-pod:v3.5.0.16                                 "/pod"                   11 minutes ago      Up 11 minutes                           k8s_POD.6745dc2a_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_cdd04987
735bf6399aa1        openshift3/ose-pod:v3.5.0.16                                 "/pod"                   11 minutes ago      Up 11 minutes                           k8s_POD.6745dc2a_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb8f8ca5
84a5c77e7719        registry.access.redhat.com/openshift3/registry-console:3.3   "/usr/libexec/cockpit"   12 minutes ago      Up 12 minutes                           k8s_registry-console.a8f9b97c_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_b24c4fc7
b33e30024939        openshift3/ose-pod:v3.5.0.16                                 "/pod"                   12 minutes ago      Up 12 minutes                           k8s_POD.6745dc2a_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_2f1884d5
91f20804c695        openshift3/node:v3.5.0.16                                    "/usr/local/bin/origi"   12 minutes ago      Up 12 minutes                           atomic-openshift-node
bd723e846ace        openshift3/openvswitch:v3.5.0.16                             "/usr/local/bin/ovs-r"   13 minutes ago      Up 13 minutes                           openvswitch


Changing bug status.

Comment 23 errata-xmlrpc 2017-04-12 18:48:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0903

