Bug 1395081
| Summary: | docker-registry did not update to specified version after OCP upgrade |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | liujia <jiajliu> |
| Component: | Cluster Version Operator |
| Assignee: | Scott Dodson <sdodson> |
| Status: | CLOSED ERRATA |
| QA Contact: | liujia <jiajliu> |
| Severity: | medium |
| Docs Contact: | |
| Priority: | medium |
| Version: | 3.4.0 |
| CC: | aos-bugs, dgoodwin, jiajliu, jokerman, mbarrett, mmccomas |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2017-04-12 18:48:19 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Attachments: | |

Doc Text:
Cause: In some situations node upgrade could terminate a running pod that was upgrading the router/registry.
Consequence: Router/registry would fail to upgrade.
Fix: Router/registry upgrade re-ordered to follow node upgrade when performing a full cluster in-place upgrade.
Result: Nodes no longer taken offline for upgrade while the router/registry is still running.
Description
liujia
2016-11-15 05:38:15 UTC
Sometimes, the router hits the same issue.

Could you provide the upgrade logs, demonstrate the node status (unschedulable), and include relevant logs from the failing deployment?

Additional question: did the upgrade fail once, then get run successfully afterwards?

(In reply to Devan Goodwin from comment #3)
> Could you provide the upgrade logs, demonstrate the node status
> (unschedulable), and include relevant logs from the failing deployment.

upgrade.log is in the attachment. From the log, the upgrade looks successful. I guess the docker-registry can't be deployed because the node is tagged as unschedulable: "oc status" shows that the docker-registry deployment failed, as in comment #1 above. I also found that "Update registry image to current version" runs in post_control_plane.yml and "Mark unschedulable if host is a node" runs in the following upgrade_nodes.yml.

Created attachment 1221039 [details]
upgrade.log
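
For anyone hitting the symptom described above (the registry/router deployment failing because the only schedulable node was taken offline mid-deploy), a minimal sketch of the checks and the manual re-deploy, assuming the components live in the default namespace; the node name is a placeholder, and oc deploy / oadm manage-node are the OCP 3.x-era commands:

# Confirm all nodes came back schedulable after the upgrade
oc get nodes

# Inspect the default project; a failed registry/router rollout shows up here
oc status -n default

# Manually re-trigger the latest deployment of the registry (and router, if needed)
oc deploy docker-registry --latest -n default
oc deploy router --latest -n default

# If a node was left unschedulable, re-enable scheduling (node name is a placeholder)
oadm manage-node <node-name> --schedulable=true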
(In reply to Devan Goodwin from comment #4)
> Tried searching the master hosts above but it looks like the environment was
> torn down, rpms are now 3.3.

Yes, the environment had been rebuilt.

(In reply to Devan Goodwin from comment #5)
> Additional question, did the upgrade fail once, then get run successfully
> afterwards?

No. The upgrade completed successfully, without any failure or re-run.

Thanks for the additional info. If the system with IP 118 is the dedicated node, it does appear it was set to be schedulable again following the upgrade:

TASK [Set node schedulability] *************************************************
changed: [openshift-118.lab.eng.nay.redhat.com -> openshift-128.lab.eng.nay.redhat.com]

However, if there is only one node where pods can run, then at some point during the upgrade there will be nowhere for the registry to run, which probably explains the events indicating there are no schedulable nodes. I am curious what "oc get nodes" shows after this has happened. I am also curious what happens if you manually trigger another deployment with "oc deploy docker-registry --latest".

If it were possible to create an environment where this has happened and leave it up for us to log in to, that would be very helpful. I will be trying to reproduce today, but at this point I don't think this is a 3.4 blocker; rather, I suspect something is going wrong because there's only one node in an HA environment. Moving to the upcoming release for now.

I am unable to reproduce. I got an HA cluster set up with just one active node. Because the node is down at some point, you will naturally see "no nodes available to schedule pods" at some point during the upgrade. However, once it was complete, the node was back online:

[root@ip-172-18-5-36 yum.repos.d]# oc get nodes
NAME                          STATUS                     AGE
ip-172-18-5-35.ec2.internal   Ready                      2h
ip-172-18-5-36.ec2.internal   Ready,SchedulingDisabled   2h
ip-172-18-5-38.ec2.internal   Ready,SchedulingDisabled   2h
ip-172-18-5-39.ec2.internal   Ready,SchedulingDisabled   2h

And the pods re-ran:

[root@ip-172-18-5-36 yum.repos.d]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-7-awcvb    1/1       Running   0          4m
registry-console-2-8zvz7   1/1       Running   0          4m
router-3-gmx5v             1/1       Running   0          4m

"image": "openshift3/ose-docker-registry:v3.4.0.26"
"image": "openshift3/ose-haproxy-router:v3.4.0.26"

I think we might need an environment left up after this is reproduced; it will involve a lot of digging through logs and timestamps to figure this one out.

This is a good bug in that it might explain why we have a high number of complaints that the registry and router fail to upgrade. If this does turn out to be what you think it is, can you add an ability for the upgrade to tell the user that they do not have enough active nodes to upgrade the registry?

Will definitely try to convey the issue if we identify what's going on and can't fix it outright.

@Devan I just failed to reproduce it in my env. I'll try again later, but I am not sure whether it can be reproduced tonight. If I reproduce it again, I will keep the environment for you and notify you ASAP.

Thanks, I will be trying again today to reproduce.

Thanks liujia, I think this explains what is happening. Recently I reordered some steps so that the registry/router upgrade occurs just after the control plane is upgraded and right before the nodes are upgraded. I believe we're mid-way through deploying the new router/registry when we get to the node upgrade, and if that deploy is still underway on the node being upgraded, the deployment fails.

I can reproduce this locally with a single-system installation; the interesting part shows up in oc status:

https://docker-registry-default.router.default.svc.cluster.local (passthrough) to pod port 5000-tcp (svc/docker-registry)
  dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.0.26
    deployment #3 failed 5 hours ago: deployer pod no longer exists
    deployment #2 deployed 6 hours ago - 1 pod

This appears for the router as well. Unfortunately, your earlier oc status showed a different error message, "config change". We will have to keep an eye out for that, but it's possible that fixing this will take care of both issues.

I am going to re-order the router/registry upgrade to occur *after* the node upgrade in the case of a full cluster upgrade. If you're just running upgrade_control_plane.yml, I will leave it where it is, occurring at the end; presumably the user won't run upgrade_nodes.yml so quickly that the deploy is still underway.

If this does surface, the workaround is simply to re-deploy the latest:

oc deploy docker-registry --latest

Version:
atomic-openshift-utils-3.5.6-1.git.0.5e6099d.el7.noarch

Steps:
1. Containerized install of HA OCP 3.4 on RHEL hosts (3 masters + 1 node).
2. Upgrade OCP 3.4 to OCP 3.5 with the specified openshift_image_tag (v3.5.0.16).

Result:
Upgrade succeeded. docker-registry was upgraded to the specified version together with the other images.

# docker ps
CONTAINER ID   IMAGE                                                        COMMAND                  CREATED          STATUS          NAMES
e392eb138b71   openshift3/ose-docker-registry:v3.5.0.16                     "/bin/sh -c 'DOCKER_R"   11 minutes ago   Up 11 minutes   k8s_registry.5f7b798b_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_e54f4ead
4d9abd7b62f6   openshift3/ose-haproxy-router:v3.5.0.16                      "/usr/bin/openshift-r"   11 minutes ago   Up 11 minutes   k8s_router.64fb396e_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb914471
53df900d5205   openshift3/ose-pod:v3.5.0.16                                 "/pod"                   11 minutes ago   Up 11 minutes   k8s_POD.6745dc2a_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_cdd04987
735bf6399aa1   openshift3/ose-pod:v3.5.0.16                                 "/pod"                   11 minutes ago   Up 11 minutes   k8s_POD.6745dc2a_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb8f8ca5
84a5c77e7719   registry.access.redhat.com/openshift3/registry-console:3.3   "/usr/libexec/cockpit"   12 minutes ago   Up 12 minutes   k8s_registry-console.a8f9b97c_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_b24c4fc7
b33e30024939   openshift3/ose-pod:v3.5.0.16                                 "/pod"                   12 minutes ago   Up 12 minutes   k8s_POD.6745dc2a_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_2f1884d5
91f20804c695   openshift3/node:v3.5.0.16                                    "/usr/local/bin/origi"   12 minutes ago   Up 12 minutes   atomic-openshift-node
bd723e846ace   openshift3/openvswitch:v3.5.0.16                             "/usr/local/bin/ovs-r"   13 minutes ago   Up 13 minutes   openvswitch

Change bug status.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0903
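
For future verification of this kind of upgrade, a quick way to confirm that the registry and router deployment configs picked up the expected image tag; the jsonpath expressions below are an illustrative sketch and were not part of the original report:

# Print the image currently configured in each deployment config
oc get dc/docker-registry -n default -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
oc get dc/router -n default -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

With the verified build above, both should report the v3.5.0.16 images (openshift3/ose-docker-registry:v3.5.0.16 and openshift3/ose-haproxy-router:v3.5.0.16).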
I can reproduce locally with a single system installation, the interesting part shows up in oc status: https://docker-registry-default.router.default.svc.cluster.local (passthrough) to pod port 5000-tcp (svc/docker-registry) dc/docker-registry deploys docker.io/openshift3/ose-docker-registry:v3.4.0.26 deployment #3 failed 5 hours ago: deployer pod no longer exists deployment #2 deployed 6 hours ago - 1 pod This appears for the router as well. Unfortunately your earlier oc status showed a different error message, "config change". We will have to keep an eye out for that but it's possible fixing this will take care of both issues. I am going to re-order the router/registry upgrade to occur *after* node upgrade in the case of a full cluster upgrade. If you're just running upgrade_control_plane.yml, I will leave it where it is, occurring at the end. Presumably the user won't run upgrade_nodes.yml so quickly that the deploy is still underway. If this does surface, the workaround is simply to re-deploy the latest: oc deploy docker-registry --latest Version: atomic-openshift-utils-3.5.6-1.git.0.5e6099d.el7.noarch Steps: 1.Containerized install HA OCP3.4 on rhel hosts.(3masters+1node) 2.Upgrade OCP3.4 to OCP3.5 with sppecified openshift_image_tag(v3.5.0.16) Result: Upgrade successfully. Docker-registry was upgraded to specified version together with other images. # docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES e392eb138b71 openshift3/ose-docker-registry:v3.5.0.16 "/bin/sh -c 'DOCKER_R" 11 minutes ago Up 11 minutes k8s_registry.5f7b798b_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_e54f4ead 4d9abd7b62f6 openshift3/ose-haproxy-router:v3.5.0.16 "/usr/bin/openshift-r" 11 minutes ago Up 11 minutes k8s_router.64fb396e_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb914471 53df900d5205 openshift3/ose-pod:v3.5.0.16 "/pod" 11 minutes ago Up 11 minutes k8s_POD.6745dc2a_docker-registry-7-lqrcl_default_53f23697-ef82-11e6-a554-42010af0004c_cdd04987 735bf6399aa1 openshift3/ose-pod:v3.5.0.16 "/pod" 11 minutes ago Up 11 minutes k8s_POD.6745dc2a_router-2-brz4n_default_508b4f94-ef82-11e6-a554-42010af0004c_cb8f8ca5 84a5c77e7719 registry.access.redhat.com/openshift3/registry-console:3.3 "/usr/libexec/cockpit" 12 minutes ago Up 12 minutes k8s_registry-console.a8f9b97c_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_b24c4fc7 b33e30024939 openshift3/ose-pod:v3.5.0.16 "/pod" 12 minutes ago Up 12 minutes k8s_POD.6745dc2a_registry-console-1-c03xh_default_6211f88c-ef81-11e6-a554-42010af0004c_2f1884d5 91f20804c695 openshift3/node:v3.5.0.16 "/usr/local/bin/origi" 12 minutes ago Up 12 minutes atomic-openshift-node bd723e846ace openshift3/openvswitch:v3.5.0.16 "/usr/local/bin/ovs-r" 13 minutes ago Up 13 minutes openvswitch change bug status. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0903 |