Description of problem:
Triggered an upgrade from OCP 3.3 to OCP 3.4; the upgrade completed with no failure, but we found the docker-registry pod still using the old docker-registry image even though the dc had been upgraded to the specified version (v3.4.0.25). It seems the docker-registry can't be deployed because the node is marked unschedulable by the following upgrade_nodes playbook.

# oc get po docker-registry-2-3xu3t -o json | grep image
    "image": "openshift3/ose-docker-registry:v3.3.1.3",
    "imagePullPolicy": "IfNotPresent",
    "imagePullSecrets": [
    "image": "openshift3/ose-docker-registry:v3.3.1.3",
    "imageID": "docker-pullable://virt-openshift-05.lab.eng.nay.redhat.com:5000/openshift3/ose-docker-registry@sha256:e81bd5ba741ad86f90ea4b03b64ea408e6416914309140508a6130317fd9d8bd",

# oc get dc docker-registry -o json | grep image
    "image": "openshift3/ose-docker-registry:v3.4.0.25",
    "imagePullPolicy": "IfNotPresent",

# oc status
https://docker-registry-default.router.default.svc.cluster.local (passthrough) to pod port 5000-tcp (svc/docker-registry)
  dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.0.25
    deployment #3 failed 17 hours ago: config change
    deployment #2 deployed 19 hours ago - 1 pod
    deployment #1 failed 19 hours ago: newer deployment was found running

svc/kubernetes - 172.30.0.1 ports 443, 53->8053, 53->8053

https://registry-console-default.router.default.svc.cluster.local (passthrough) to pod port registry-console (svc/registry-console)
  dc/registry-console deploys registry.access.redhat.com/openshift3/registry-console:3.3
    deployment #1 deployed 19 hours ago - 1 pod

svc/router - 172.30.189.178 ports 80, 443, 1936
  dc/router deploys docker.io/openshift3/ose-haproxy-router:v3.4.0.25
    deployment #2 deployed 17 hours ago - 1 pod
    deployment #1 deployed 19 hours ago

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.23-1.git.0.317b2cd.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Containerized install of HA OCP 3.3 on RHEL hosts (3 masters + 1 node).
2. Upgrade OCP 3.3 to OCP 3.4.

Actual results:
The upgrade completes with no failure and the dc is updated, but the docker-registry deployment fails.

Expected results:
docker-registry should use the image of the specified version.
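For anyone hitting this, the mismatch can be confirmed directly with jsonpath instead of grepping JSON; a minimal sketch, assuming a single container per spec and using the pod name from the output above:

# Desired image on the dc vs. the image the pod is actually running
# (assumes one container in each spec, as in the output above):
oc get dc docker-registry -o jsonpath='{.spec.template.spec.containers[0].image}'
oc get po docker-registry-2-3xu3t -o jsonpath='{.spec.containers[0].image}'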
Sometimes the router hits the same issue.
Could you provide the upgrade logs, demonstrate the node status (unschedulable), and include relevant logs from the failing deployment?
Additional question: did the upgrade fail once and then get run successfully afterwards?
(In reply to Devan Goodwin from comment #3)
> Could you provide the upgrade logs, demonstrate the node status
> (unschedulable), and include relevant logs from the failing deployment?

upgrade.log is in the attachment. From the log, the upgrade looks successful. I guess the docker-registry can't be deployed because the node is tagged as unschedulable: "oc status" shows that the docker-registry deployment failed, as in comment #1, and I found that the upgrade runs "Update registry image to current version" in post_control_plane.yml and then "Mark unschedulable if host is a node" in the following upgrade_nodes.yml.
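For context, a node task like the one named above typically wraps the oadm manage-node command; the following is only an illustrative sketch of what such a task looks like, not the actual openshift-ansible source (the variable and group names are assumptions):

# Illustrative sketch only, not the actual openshift-ansible task.
# openshift_node_name and the oo_first_master group are assumed names.
- name: Mark unschedulable if host is a node
  command: >
    oadm manage-node {{ openshift_node_name }} --schedulable=false
  delegate_to: "{{ groups.oo_first_master.0 }}"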
Created attachment 1221039
upgrade.log
(In reply to Devan Goodwin from comment #4)
> Tried searching the master hosts above but it looks like the environment was
> torn down, rpms are now 3.3.

Yes, the env had been rebuilt.
(In reply to Devan Goodwin from comment #5)
> Additional question: did the upgrade fail once and then get run successfully
> afterwards?

No. The upgrade completed successfully, without any re-run or failure.
Thanks for the additional info. If the system with IP 118 is the dedicated node, it does appear it was set to be schedulable again following the upgrade:

TASK [Set node schedulability] *************************************************
changed: [openshift-118.lab.eng.nay.redhat.com -> openshift-128.lab.eng.nay.redhat.com]

However, if there is only one node where pods can run, at some point during the upgrade there will be nowhere for the registry to run; this probably explains the events indicating there are no schedulable nodes.

I am curious what "oc get nodes" shows after this has happened. I am also curious what happens if you manually trigger another deployment with:

oc deploy docker-registry --latest

If it were possible to create an environment where this has happened and leave it up for us to log in to, that would be very helpful. I will be trying to reproduce today, but at this point I don't think this is a 3.4 blocker; rather, I suspect something is going wrong because there's only one node in an HA environment. Moving to upcoming release for now.
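If the node is still reported as SchedulingDisabled at that point, scheduling can be re-enabled by hand before retrying the deployment; a minimal sketch, using the node name from the task output above:

# Check node schedulability, re-enable it if needed, then retry the deploy:
oc get nodes
oadm manage-node openshift-118.lab.eng.nay.redhat.com --schedulable=true
oc deploy docker-registry --latest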
I am unable to reproduce. I got an HA cluster set up with just one active node. Because the node is down at some point, you will naturally see "no nodes available to schedule pods" during the upgrade. However, once it was complete, the node was back online:

[root@ip-172-18-5-36 yum.repos.d]# oc get nodes
NAME                          STATUS                     AGE
ip-172-18-5-35.ec2.internal   Ready                      2h
ip-172-18-5-36.ec2.internal   Ready,SchedulingDisabled   2h
ip-172-18-5-38.ec2.internal   Ready,SchedulingDisabled   2h
ip-172-18-5-39.ec2.internal   Ready,SchedulingDisabled   2h

And the pods re-ran:

[root@ip-172-18-5-36 yum.repos.d]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-7-awcvb    1/1       Running   0          4m
registry-console-2-8zvz7   1/1       Running   0          4m
router-3-gmx5v             1/1       Running   0          4m

"image": "openshift3/ose-docker-registry:v3.4.0.26"
"image": "openshift3/ose-haproxy-router:v3.4.0.26"

I think we might need an environment left up after this is reproduced; it will involve a lot of digging through logs and timestamps to figure this one out.
This is a good bug in that it might explain why we have a high number of complaints that the registry and router fail to upgrade. If this does turn out to be what you think it is, can you add the ability for the upgrade to tell the user that they do not have enough active nodes to upgrade the registry?
Will definitely try to convey the issue if we identify what's going on and can't fix it outright.
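For the record, one way to surface that condition would be a pre-flight check before the registry/router redeploy; this is a hypothetical sketch, not the shipped playbook (the task placement, the grep-based count, and the oo_first_master group name are assumptions):

# Hypothetical pre-flight check, illustration only, not the actual playbook.
- name: Count nodes that are Ready and schedulable
  shell: oc get nodes | grep -w Ready | grep -cv SchedulingDisabled
  register: schedulable_nodes
  failed_when: false
  delegate_to: "{{ groups.oo_first_master.0 }}"

- name: Tell the user there are not enough active nodes for the registry/router
  fail:
    msg: "No schedulable nodes available; the registry and router cannot be redeployed."
  when: schedulable_nodes.stdout | int == 0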
@Devan I just failed to reproduce it in my env. I'll try again later, but I am not sure if it can be reproduced tonight. If I reproduce it again, I will keep the environment for you and notify you ASAP.
Thanks, I will be trying again today to reproduce.
Thanks liujia, I think this explains what is happening.

Recently I reordered some steps so that the registry/router upgrade occurs just after the control plane is upgraded, right before the nodes are upgraded. I believe we're mid-way through deploying the new router/registry when we get to the node upgrade, and if that deploy is still underway on the node being upgraded, the deployment fails.

I can reproduce locally with a single-system installation; the interesting part shows up in oc status:

https://docker-registry-default.router.default.svc.cluster.local (passthrough) to pod port 5000-tcp (svc/docker-registry)
  dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.0.26
    deployment #3 failed 5 hours ago: deployer pod no longer exists
    deployment #2 deployed 6 hours ago - 1 pod

This appears for the router as well. Unfortunately, your earlier oc status showed a different error message, "config change". We will have to keep an eye out for that, but it's possible fixing this will take care of both issues.

I am going to re-order the router/registry upgrade to occur *after* node upgrade in the case of a full cluster upgrade. If you're just running upgrade_control_plane.yml, I will leave it where it is, occurring at the end; presumably the user won't run upgrade_nodes.yml so quickly that the deploy is still underway.

If this does surface, the workaround is simply to re-deploy the latest:

oc deploy docker-registry --latest
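In playbook terms the fix amounts to moving one include; an illustrative sketch of the resulting full-upgrade ordering, with simplified include names rather than the exact openshift-ansible file layout:

# Illustrative ordering only, not the exact openshift-ansible source.
# Previously the registry/router redeploy ran between these two phases.
- include: upgrade_control_plane.yml
- include: upgrade_nodes.yml
# After the fix, the redeploy runs once nodes are schedulable again:
- include: post_control_plane.yml   # "Update registry image to current version"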
https://github.com/openshift/openshift-ansible/pull/2831
Version:
atomic-openshift-utils-3.5.6-1.git.0.5e6099d.el7.noarch

Steps:
1. Containerized install of HA OCP 3.4 on RHEL hosts (3 masters + 1 node).
2. Upgrade OCP 3.4 to OCP 3.5 with the specified openshift_image_tag (v3.5.0.16).

Result:
Upgrade succeeded. docker-registry was upgraded to the specified version together with the other images.

# docker ps
CONTAINER ID   IMAGE                                                        COMMAND                  CREATED          STATUS          PORTS   NAMES
e392eb138b71   openshift3/ose-docker-registry:v3.5.0.16                     "/bin/sh -c 'DOCKER_R"   11 minutes ago   Up 11 minutes           k8s_registry.5f7b798b_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_e54f4ead
4d9abd7b62f6   openshift3/ose-haproxy-router:v3.5.0.16                      "/usr/bin/openshift-r"   11 minutes ago   Up 11 minutes           k8s_router.64fb396e_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb914471
53df900d5205   openshift3/ose-pod:v3.5.0.16                                 "/pod"                   11 minutes ago   Up 11 minutes           k8s_POD.6745dc2a_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_cdd04987
735bf6399aa1   openshift3/ose-pod:v3.5.0.16                                 "/pod"                   11 minutes ago   Up 11 minutes           k8s_POD.6745dc2a_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb8f8ca5
84a5c77e7719   registry.access.redhat.com/openshift3/registry-console:3.3   "/usr/libexec/cockpit"   12 minutes ago   Up 12 minutes           k8s_registry-console.a8f9b97c_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_b24c4fc7
b33e30024939   openshift3/ose-pod:v3.5.0.16                                 "/pod"                   12 minutes ago   Up 12 minutes           k8s_POD.6745dc2a_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_2f1884d5
91f20804c695   openshift3/node:v3.5.0.16                                    "/usr/local/bin/origi"   12 minutes ago   Up 12 minutes           atomic-openshift-node
bd723e846ace   openshift3/openvswitch:v3.5.0.16                             "/usr/local/bin/ovs-r"   13 minutes ago   Up 13 minutes           openvswitch

Changing bug status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0903