Bug 1624495
| Summary: | Upgrade fails: (api down) Ensure openshift-web-console project exists | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino> |
| Component: | Installer | Assignee: | Michael Gugino <mgugino> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | liujia <jiajliu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | aos-bugs, ccoleman, deads, gpei, jiajliu, jialiu, jokerman, mfojtik, mgugino, mmccomas, sdodson, shlao, wmeng, wsun, xxia |
| Target Milestone: | --- | | |
| Target Release: | 3.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-21 15:23:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
So, this seems to be due to my testing: the API is all the way down and unable to start. The real issue is that we're not actually waiting for the API to come back online.

*** Bug 1624657 has been marked as a duplicate of this bug. ***

Log of command: journalctl -u atomic-openshift-node.service

I also hit this upgrade failure:
```
PLAY [Upgrade web console] *****************************************************

TASK [Gathering Facts] *********************************************************
ok: [qe-jialiu3101-master-etcd-1.0904-hn2.qe.rhcloud.com]

TASK [openshift_web_console : include_tasks] ***********************************
included: /home/slave6/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/openshift_web_console/tasks/install.yml for qe-jialiu3101-master-etcd-1.0904-hn2.qe.rhcloud.com

TASK [openshift_web_console : Ensure openshift-web-console project exists] *****
fatal: [qe-jialiu3101-master-etcd-1.0904-hn2.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": {"cmd": "/usr/bin/oc adm new-project openshift-web-console --admin-role=admin --node-selector=", "results": {}, "returncode": 1, "stderr": "The connection to the server qe-jialiu3101-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
        to retry, use: --limit @/home/slave6/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry
```
After the failure, logging into the master and running the same command manually shows the API is reachable again (the project already exists):

```
[root@qe-jialiu3101-master-etcd-1 ~]# /usr/bin/oc adm new-project openshift-web-console --admin-role=admin --node-selector=
error: project openshift-web-console already exists
```
From the attached logs the only thing I found failing is:

poststarthook/authorization.openshift.io-bootstrapclusterroles failed: reason withheld

I can't find anything useful in the logs, though. If you can reliably reproduce the problem, please invoke the following command:

oc get --raw /healthz/poststarthook/authorization.openshift.io-bootstrapclusterroles

and report the output here, so that I can further investigate what failed while bootstrapping the cluster roles.

Still hit it on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch

Liujia, can you provide the information I requested in comment 10? Eventually, can you give me access to the instance where you have this reproduced?

(In reply to Maciej Szulik from comment #13)
> Eventually, can you give me access to the instance where you have this
> reproduced?

Sorry, I had only just read comment 10. I re-ran the upgrade when I hit the issue, so the current info from the cluster is useless. I will add more info when I hit it again. It reproduces at a high rate in recent testing.

Looking at the instance you've pointed me to, everything seems to be just fine. Are you sure you saw the exact same problem?

Next time you hit it, please gather all master logs (if possible with verbosity increased to 5 or higher), and invoke:

oc get --raw /healthz

which should tell you what failed (i.e. which poststarthook), and then invoke:

oc get --raw /healthz/poststarthook/<name of the failed poststarthook>

(In reply to Maciej Szulik from comment #17)
> Looking at the instance you've pointed me to everything seems to be just
> fine. Are you sure you saw the exact same problem?
>
> Next time you hit it, please gather all master logs (if possible with
> increased verbosity 5 or higher), and invoke:
> oc get --raw /healthz
> which should give you information what failed (iow. which poststarthook),
> and then invoke oc get --raw /healthz/poststarthook/<name of the failed
> poststarthook>

The master is failing to come up; openshift-ansible is failing to detect that condition.
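The detection gap described above boils down to running oc commands before the API server is actually serving again. A minimal polling helper (a sketch only, not openshift-ansible's actual implementation; the retry counts are arbitrary) could look like:

```shell
#!/bin/sh
# Retry an arbitrary health probe until it succeeds or we give up.
# In a real cluster the probe would be something like:
#   oc get --raw /healthz
wait_for_api() {
    retries=$1; delay=$2; shift 2
    i=0
    while [ "$i" -lt "$retries" ]; do
        # Probe succeeded: the API is answering again.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1   # API never came back within retries * delay seconds
}

# Assumed invocation: poll for up to 5 minutes.
# wait_for_api 60 5 oc get --raw /healthz
```

Only once such a check returns success would tasks like `oc adm new-project` be safe to run.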
The master restart shim needs to be rewritten to account for this. I know why the master is not coming up (it is unable to pull an image). What I need is for the restart shim to be fixed.

> Master restart shim needs to be rewritten to account for this. I know why the
> master is not coming up (unable to pull an image). What I need is the restart
> shim to be fixed.

Scott, do you think you could provide such a fix on the ansible side?
(In reply to Maciej Szulik from comment #19)
> > Master restart shim needs to be rewritten to account for this. I know why the
> > master is not coming up (unable to pull an image). What I need is the restart
> > shim to be fixed.
>
> Scott do you think you could provide such a fix for the ansible?

No, I don't think so. Clayton, David, and I have discussed this in the past. It seemed that Clayton is 100% against making the restart script responsible for blocking until the service is ready.

David and I have discussed various heuristics that could be improved to ensure we wait not only for core API services to become available but also for aggregated API endpoints, in bug https://bugzilla.redhat.com/show_bug.cgi?id=1623571

I'd really like the master team to review the overall upgrade process and ideally provide a blocking method for ensuring that the API has definitively been restarted. The methods we've devised in ansible are clearly insufficient.

It's unfortunate that no one in this bug has provided logs that make it easy to walk through the current behavior with a clear indication of the timing and order of operations.

Created attachment 1481674 [details]
docker ps upgrade output

Shows changes of ose-pod images for pods and static pods:

1) immediately after master pods are restarted
2) after nodes are updated and restarted
(In reply to Michael Gugino from comment #21)
> Created attachment 1481674 [details]
> docker ps upgrade output
>
> Shows changes of ose-pod images for pods and static pods
>
> 1) immediately after master pods are restarted
> 2) After nodes are updated and restarted.

This attachment shows that updating the node version and restarting the node service causes all static pods to be killed and recreated, due to the ose-pod image being updated. This behavior was previously unknown; we will have to patch openshift-ansible.

Following some testing, I believe I have a testable condition for this issue: `oc get pod -o json <api pod>` output shows a unique metadata.uid value for each pod.

For static pods, metadata.uid is updated under the following conditions:

1) the ose-control-plane image is updated in the static pod definition (e.g., during an upgrade of master components)
2) the atomic-openshift-node binary is updated between major versions (e.g., 3.10.x to 3.11.x). It does not appear to be updated between minor upgrades (e.g., 3.10.0 to 3.10.1), but I need to test more to confirm.

metadata.uid is not updated for static pods under the following conditions:

1) restarting the static pod via master-restart <service>, e.g., master-restart api
2) restarting the atomic-openshift-node service
3) fully stopping the atomic-openshift-node service AND fully stopping the docker service, waiting 30 seconds, and restarting both services

I believe we can monitor this value to ensure that the static pods are actually restarted with the new ose-pod images. I'm unsure how we'll handle minor upgrades or re-running of the playbooks in case of failure (we might have to brute-force docker ps and grep for the pod image name).

Hey Scott, I think this is the one we discussed earlier today; the solution was going to be reworking the node upgrade flow to avoid having to use restart, and to ensure that every newly created pod has the correct pod images from the beginning. My memory says it was going to be:

0. assume that the control plane static pod definitions have already been updated
1. move all static pod files to a safe location
2. wait for the kubelet to stop all static pod containers
3. shut down the kubelet
4. upgrade the kubelet
5. move all static pod files back into the static pod manifest directory

@sdodson, can you make sure I've remembered those steps properly?

Yes, that's accurate; we'd do this as part of the normal node upgrade playbooks. There's no need to special-case this for the control plane, we just need to make sure that we don't abort abnormally when there are no static pods on the host. My only outstanding question is how exactly we know that we've stopped all static pod containers.

My plan was to move and stop instead of waiting. My plan was to only do this for control plane hosts. We don't put static pods on non-control-plane hosts, so there is no need to run skipped tasks there.

PR submitted in master: https://github.com/openshift/openshift-ansible/pull/9784

Based on my findings, 'mv /etc/origin/node/pods /etc/origin/node/stopped-pods' does not result in any static pods stopping. What does stop static pods: stopping the node service and stopping docker.

Proposed workflow in the patch: pre-pull the ose-pod image prior to stopping the nodes; stop the node, stop docker, upgrade the node, restart the services. The static pods immediately come back with the right ose-pod image, we poll for master-api on the node we're working on just in case, and the install completes as expected.

That's a terrifying bug if true. We can't ship with that bug in place.

Moving the directory doesn't have any effect with notify, because the kernel still knows about the folder. You can't move the directory; you have to move its contents (without a code change to the kubelet).

Yeah, that's what I thought. When we discussed this morning we decided just to restart docker to achieve the same outcome; is that fine? In any case, this all happens after the node has been drained of pods.
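The proposed workflow above (pre-pull the pod infrastructure image, stop the node and docker services, upgrade, restart) can be sketched as a dry-run script. This is an illustration only, not the playbook code from PR 9784; the image name/tag and package name are assumptions:

```shell
#!/bin/sh
# RUN defaults to echo so the sequence can be inspected without touching
# a real host; set RUN='' to actually execute the commands.
RUN="${RUN:-echo}"

upgrade_control_plane_node() {
    # 1. Pre-pull the new pod infrastructure image while docker is still up.
    $RUN docker pull registry.access.redhat.com/openshift3/ose-pod:v3.11
    # 2. Stop the kubelet first, then docker, so static pods fully stop.
    $RUN systemctl stop atomic-openshift-node
    $RUN systemctl stop docker
    # 3. Upgrade the node components.
    $RUN yum -y update atomic-openshift-node
    # 4. Restart; static pods come back with the new ose-pod image.
    $RUN systemctl start docker
    $RUN systemctl start atomic-openshift-node
}
```

Stopping docker here is the crucial step, since per the findings above moving the static pod manifests alone does not stop the pods.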
Wouldn't it be better to just create the "-disabled" directory and move the files (static pods) there instead of restarting docker/etc.? When you delete the files, the kubelet should delete the pods successfully.

The offline discussion is that we'll rely on stopping docker or removing cri-o pods as implemented in #9784, which is merged in master; the cherry-pick is in the merge queue for release-3.11.

https://github.com/openshift/openshift-ansible/pull/10030 release-3.11 pick

PR 10030 has been merged into openshift-ansible-3.11.2-1; please check the bug.

Verification blocked by bz1628730

Verified on openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

(In reply to liujia from comment #36)
> Verified on openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

Sorry, that was wrongly pasted from another bug. Changing back.

Verified on openshift-ansible-3.11.9-1.git.0.63f7970.el7_5.noarch

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.
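Whichever stop mechanism is used, the metadata.uid condition noted earlier in this bug gives a way to verify that the static pods were genuinely recreated rather than merely restarted. A sketch under that assumption (the namespace and pod name in the usage comment are hypothetical):

```shell
#!/bin/sh
# Fetch a pod's uid: a recreated pod gets a fresh uid, while restarting
# the containers of the same pod keeps the old one.
uid_of() {
    # $1 = namespace, $2 = pod name
    oc get pod -n "$1" "$2" -o jsonpath='{.metadata.uid}'
}

# True only when both uids are non-empty and differ.
pod_recreated() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Assumed usage:
# before=$(uid_of kube-system "master-api-$(hostname)")
# ...run the upgrade steps...
# after=$(uid_of kube-system "master-api-$(hostname)")
# pod_recreated "$before" "$after" && echo "static pod was recreated"
```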
Description of problem:

Upgrading from 3.10 to 3.11, the upgrade fails on the task 'Ensure openshift-web-console project exists' with the following error:

```
"msg": {
    "cmd": "/bin/oc adm new-project openshift-web-console --admin-role=admin --node-selector=",
    "results": {},
    "returncode": 1,
    "stderr": "The connection to the server ip-172-18-13-240.ec2.internal:8443 was refused - did you specify the right host or port?\n",
    "stdout": ""
}
```

How reproducible:
Unknown.

Steps to Reproduce:
1. Start the upgrade.
2. The upgrade fails at the task above.

Actual results:
Failure.

Expected results:
No failure.

Additional info:
We should make every task that runs an oc command retry. We also need to figure out why the API dies randomly.
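The "make oc tasks retry" idea could look roughly like the helper below: re-run a command a few times, but only when its output matches the transient connection-refused failure, and surface real errors (such as "project ... already exists") immediately. The match pattern and retry limit are guesses, not openshift-ansible behavior:

```shell
#!/bin/sh
# Run a command up to $1 times, retrying only on connection-refused.
retry_oc() {
    tries=$1; shift
    n=0
    while :; do
        # Capture combined output; success ends the loop immediately.
        out=$("$@" 2>&1) && { printf '%s\n' "$out"; return 0; }
        n=$((n + 1))
        case $out in
            *"was refused"*)
                [ "$n" -lt "$tries" ] || break ;;  # transient: retry
            *)
                break ;;                           # real error: give up
        esac
        sleep 1
    done
    printf '%s\n' "$out" >&2
    return 1
}

# Assumed usage:
# retry_oc 10 oc adm new-project openshift-web-console --admin-role=admin --node-selector=
```

A retry wrapper papers over the transient window, but as noted above the underlying question of why the API goes down at all still needs an answer.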