Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1624495

Summary: Upgrade fails: (api down) Ensure openshift-web-console project exists
Product: OpenShift Container Platform
Reporter: Michael Gugino <mgugino>
Component: Installer
Assignee: Michael Gugino <mgugino>
Status: CLOSED CURRENTRELEASE
QA Contact: liujia <jiajliu>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, ccoleman, deads, gpei, jiajliu, jialiu, jokerman, mfojtik, mgugino, mmccomas, sdodson, shlao, wmeng, wsun, xxia
Target Milestone: ---   
Target Release: 3.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-21 15:23:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
docker ps upgrade output (flags: none)

Description Michael Gugino 2018-08-31 19:56:45 UTC
Description of problem:
Upgrading from 3.10 to 3.11, upgrade fails on task 'Ensure openshift-web-console project exists'

with the following error:
    "msg": {
        "cmd": "/bin/oc adm new-project openshift-web-console --admin-role=admin --node-selector=",
        "results": {},
        "returncode": 1,
        "stderr": "The connection to the server ip-172-18-13-240.ec2.internal:8443 was refused - did you specify the right host or port?\n",
        "stdout": ""
    }


How reproducible: Unknown

Steps to Reproduce:
1. Start Upgrade
2. Upgrade fails at task.

Actual results:
Failure

Expected results:
Upgrade completes successfully.

Additional info:
We should make every task that runs an oc command retryable. We also need to figure out why the API dies randomly.

Comment 1 Michael Gugino 2018-08-31 20:15:22 UTC
So, this seems to be due to my testing: the API is all the way down and unable to start. The real issue is that we're not actually waiting for the API to come back online.
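A minimal sketch of the kind of wait loop openshift-ansible could use here (hypothetical helper, not the actual playbook code; `wait_for_api`, `WAIT_RETRIES`, and `WAIT_DELAY` are names invented for illustration):

```shell
#!/bin/sh
# Retry a health probe until it succeeds or we give up.
# "$@" is the probe command, e.g.: oc get --raw /healthz
wait_for_api() {
  retries=${WAIT_RETRIES:-60}   # total attempts (assumed default)
  delay=${WAIT_DELAY:-5}        # seconds between attempts
  i=0
  while [ "$i" -lt "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                  # API answered; safe to continue the play
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                      # API never came back; fail loudly
}
```

A task like "Ensure openshift-web-console project exists" would only run after something like `wait_for_api oc get --raw /healthz` returns success.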

Comment 2 Weihua Meng 2018-09-03 02:22:03 UTC
*** Bug 1624657 has been marked as a duplicate of this bug. ***

Comment 5 sheng.lao 2018-09-03 10:40:12 UTC
log of command: journalctl -u atomic-openshift-node.service

Comment 6 Johnny Liu 2018-09-04 03:22:51 UTC
I also hit such upgrade failure.

PLAY [Upgrade web console] *****************************************************

TASK [Gathering Facts] *********************************************************

ok: [qe-jialiu3101-master-etcd-1.0904-hn2.qe.rhcloud.com]

TASK [openshift_web_console : include_tasks] ***********************************
included: /home/slave6/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/openshift_web_console/tasks/install.yml for qe-jialiu3101-master-etcd-1.0904-hn2.qe.rhcloud.com

TASK [openshift_web_console : Ensure openshift-web-console project exists] *****
fatal: [qe-jialiu3101-master-etcd-1.0904-hn2.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": {"cmd": "/usr/bin/oc adm new-project openshift-web-console --admin-role=admin --node-selector=", "results": {}, "returncode": 1, "stderr": "The connection to the server qe-jialiu3101-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}
	to retry, use: --limit @/home/slave6/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry

After the failure, log into master, run the same command manually, it is working well.

[root@qe-jialiu3101-master-etcd-1 ~]# /usr/bin/oc adm new-project openshift-web-console --admin-role=admin --node-selector=
error: project openshift-web-console already exists

Comment 10 Maciej Szulik 2018-09-05 13:11:57 UTC
From the attached logs the only thing I found failing is:

poststarthook/authorization.openshift.io-bootstrapclusterroles failed: reason withheld

I can't find anything useful in the logs, though. If you can reliably reproduce the problem, please invoke the following command:

oc get --raw /healthz/poststarthook/authorization.openshift.io-bootstrapclusterroles

and report the output here, so that I can further investigate what failed while bootstrapping clusterroles.

Comment 11 liujia 2018-09-06 07:20:06 UTC
Still hit it on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch

Comment 12 Maciej Szulik 2018-09-06 10:23:15 UTC
Liujia, can you provide the information I requested in comment 10?

Comment 13 Maciej Szulik 2018-09-06 10:24:17 UTC
Eventually, can you give me access to the instance where you have this reproduced?

Comment 14 liujia 2018-09-06 10:29:03 UTC
(In reply to Maciej Szulik from comment #13)
> Eventually, can you give me access to the instance where you have this
> reproduced?

Sorry, I just read comment 10. I re-ran the upgrade when I hit the issue, so the current info from the cluster is useless. I will add more info when I hit it again. It happens at a high rate in recent testing.

Comment 17 Maciej Szulik 2018-09-07 13:45:30 UTC
Looking at the instance you've pointed me to, everything seems to be just fine. Are you sure you saw the exact same problem?

Next time you hit it, please gather all master logs (if possible with increased verbosity 5 or higher), and invoke:
oc get --raw /healthz
which should tell you what failed (i.e. which poststarthook), and then invoke oc get --raw /healthz/poststarthook/<name of the failed poststarthook>
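The two probes above can be chained. A small sketch, assuming the verbose /healthz output uses the usual apiserver check format seen earlier in this bug (`[+]check ok` / `[-]check failed: reason withheld`); `failed_checks` is a hypothetical helper name:

```shell
#!/bin/sh
# Given verbose /healthz output on stdin, print the names of failed checks
# (lines marked "[-]"), so each can then be probed individually with:
#   oc get --raw /healthz/<name>
failed_checks() {
  sed -n 's/^\[-\]\([^ ]*\) failed.*/\1/p'
}
```

Usage would look like `oc get --raw '/healthz?verbose' | failed_checks`, and each printed name fed back into `oc get --raw /healthz/<name>` for the detailed failure reason.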

Comment 18 Michael Gugino 2018-09-07 14:40:32 UTC
(In reply to Maciej Szulik from comment #17)
> Looking at the instance you've pointed me to everything seems to be just
> fine. Are you sure you saw the exact same problem?
> 
> Next time you hit it, please gather all master logs (if possible with
> increased verbosity 5 or higher), and invoke:
> oc get --raw /healthz
> which should give you information what failed (iow. which poststarthook),
> and then invoke oc get --raw /healthz/poststarthook/<name of the failed
> poststarthook>

The master is failing to come up, and openshift-ansible is failing to detect that condition.

Master restart shim needs to be rewritten to account for this.  I know why the master is not coming up (unable to pull an image).  What I need is the restart shim to be fixed.

Comment 19 Maciej Szulik 2018-09-07 15:38:37 UTC
> Master restart shim needs to be rewritten to account for this.  I know why the 
> master is not coming up (unable to pull an image).  What I need is the restart 
> shim to be fixed.

Scott do you think you could provide such a fix for the ansible?

Comment 20 Scott Dodson 2018-09-07 16:21:53 UTC
(In reply to Maciej Szulik from comment #19)
> > Master restart shim needs to be rewritten to account for this.  I know why the 
> > master is not coming up (unable to pull an image).  What I need is the restart 
> > shim to be fixed.
> 
> Scott do you think you could provide such a fix for the ansible?

No, I don't think so.

Clayton, David, and I have discussed this in the past. It seemed that Clayton was 100% against making the restart script responsible for blocking until the service is ready.

David and I have discussed various heuristics that could be improved to ensure we wait not only for core API services to become available but also for aggregated API endpoints, in this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1623571

I'd really like for the master team to review the overall upgrade process and ideally provide a blocking method for ensuring that the API has been definitively restarted. The methods we've devised in ansible are clearly insufficient.


It's unfortunate that no one in this bug has provided logs that make it easy to walk through the current behavior with a clear indication of the timing and order of operations.

Comment 21 Michael Gugino 2018-09-07 20:25:48 UTC
Created attachment 1481674 [details]
docker ps upgrade output

Shows changes of ose-pod images for pods and static pods

1) immediately after master pods are restarted

2) After nodes are updated and restarted.

Comment 22 Michael Gugino 2018-09-07 20:27:08 UTC
(In reply to Michael Gugino from comment #21)
> Created attachment 1481674 [details]
> docker ps upgrade output
> 
> Shows changes of ose-pod images for pods and static pods
> 
> 1) immediately after master pods are restarted
> 
> 2) After nodes are updated and restarted.

This attachment shows that updating the node version and restarting the node service causes all static pods to be killed and recreated due to the ose-pod image being updated. This behavior was previously unknown; we will have to patch openshift-ansible.

Comment 23 Michael Gugino 2018-09-07 23:02:24 UTC
Following some testing, I believe I have a testable condition for this issue:

oc get pod -o json <api pod>
The output shows a unique metadata.uid value for each pod.

For static pods, metadata.uid is updated under the following conditions:
1) the ose-control-plane image is updated in the static pod definition (e.g., during an upgrade of master components)
2) the atomic-openshift-node binary is updated between major versions (e.g., 3.10.x to 3.11.x). It does not appear to be updated between minor upgrades (e.g., 3.10.0 to 3.10.1), but I need to test more to confirm.

metadata.uid is not updated for static pods during the following conditions:
1) restarting the static pod via master-restart <service>, eg master-restart api
2) restarting atomic-openshift-node service
3) Fully stopping atomic-openshift-node service AND fully stopping docker service, waiting 30 seconds, and restarting both services.

I believe we can monitor this value to ensure that the static pods are actually restarted with the new ose-pod images.  I'm unsure how we'll handle minor upgrades or re-running of the playbooks in case of failure (might have to brute-force docker ps and grep for the pod image name).
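The uid check could be sketched like this (hypothetical helper names; `pod_uid` assumes the 3.11 master static pods live in a namespace such as kube-system, and jsonpath is used to pull just the uid field):

```shell
#!/bin/sh
# Capture a static pod's metadata.uid; a changed uid after the upgrade means
# the kubelet actually recreated the pod (with the new ose-pod image).
pod_uid() {
  # $1 = pod name, $2 = namespace (assumed, e.g. kube-system)
  oc get pod "$1" -n "$2" -o jsonpath='{.metadata.uid}'
}

# $1 = uid captured before the restart, $2 = uid captured after.
# Succeeds only when both uids are present and differ.
pod_was_recreated() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

The playbook would record the uid before restarting, poll afterwards, and only proceed once `pod_was_recreated` succeeds, with `docker ps` plus a grep for the pod image name as the brute-force fallback mentioned above.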

Comment 24 David Eads 2018-09-10 19:48:55 UTC
Hey Scott,

I think this is the one we discussed earlier today. The solution was going to be reworking the node upgrade flow to avoid having to use restart, and to ensure that every newly created pod has the correct pod images from the beginning. My memory says it was going to be:

 0. assume that the control plane static pod definitions have already been updated.
 1. move all static pod files to a safe location
 2. wait for the kubelet to stop all static pod containers
 3. shut down kubelet
 4. upgrade kubelet
 5. move all static pod files back into the static pod manifest directory

@sdodson, can you make sure I've remembered those steps properly?

Comment 25 Scott Dodson 2018-09-10 20:14:50 UTC
Yes, that's accurate, we'd do this as part of the normal node upgrade playbooks. There's no need to special case this for control plane, we just need to make sure that we don't abort abnormally when there are no static pods on the host.

My only outstanding question is how exactly we know that we've stopped all static pod containers.

Comment 26 Michael Gugino 2018-09-10 20:33:31 UTC
My plan was to move and stop instead of waiting.  My plan was to only do this for control plane hosts.  We don't put static pods on non-control plane, no need to run skipped tasks there.

Comment 27 Michael Gugino 2018-09-10 22:15:32 UTC
PR Submitted in master: https://github.com/openshift/openshift-ansible/pull/9784

Based on my findings, 'mv /etc/origin/node/pods /etc/origin/node/stopped-pods' does not result in any static pods stopping.

What does stop static pods:  Stop node service, stop docker.

Proposed workflow in patch:

pre-pull ose-pod image prior to stopping nodes.  Stop node, stop docker, upgrade node, restart services.  Static pods immediately come back with the right ose-pod image, we poll for master-api on the node we're working on just in case, install completes as expected.
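The proposed workflow can be written out as a dry-run sketch. All commands below are assumptions echoed via a `run` wrapper rather than executed (the image name and tag are illustrative; the real tasks are in PR 9784):

```shell
#!/bin/sh
# Dry-run wrapper: print each planned command instead of executing it,
# so the sequence can be inspected without touching a host.
run() { echo "+ $*"; }

upgrade_control_plane_node() {
  # Pre-pull the pod infra image before anything stops (image name assumed).
  run docker pull registry.redhat.io/openshift3/ose-pod:v3.11
  # Stop node, then docker, so static pod containers actually go away.
  run systemctl stop atomic-openshift-node
  run systemctl stop docker
  # Upgrade the node package while everything is down.
  run yum -y update atomic-openshift-node
  # Restart services; static pods come back with the new ose-pod image.
  run systemctl start docker
  run systemctl start atomic-openshift-node
  # Poll until the master API static pod answers again before continuing.
  run oc get --raw /healthz
}
```

Replacing `run() { echo "+ $*"; }` with `run() { "$@"; }` would turn the dry run into the real sequence.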

Comment 28 Clayton Coleman 2018-09-12 00:40:07 UTC
That's a terrifying bug if true.  We can't ship with that bug in place.

Comment 29 Clayton Coleman 2018-09-12 00:54:38 UTC
Moving the directory doesn't have any effect with inotify, because the kernel still knows about the folder. You can't move the directory; you have to move its contents (without a code change to the kubelet).

Comment 30 Scott Dodson 2018-09-12 02:25:38 UTC
Yeah, that's what I thought. When we discussed this morning we decided just to restart docker to achieve the same outcome, is that fine? This all happens when the node has been drained of pods otherwise.

Comment 31 Michal Fojtik 2018-09-12 14:27:44 UTC
Wouldn't it be better to just create the "-disabled" directory and move the files (static pods) there instead of restarting docker, etc.? When you delete the files, the kubelet should delete the pods successfully.

Comment 32 Scott Dodson 2018-09-12 15:19:06 UTC
Offline discussion is that we'll rely on stopping docker or removing cri-o pods as implemented in #9784 which is merged in master, cherrypick is in the merge queue for release-3.11.

Comment 33 Scott Dodson 2018-09-12 16:14:43 UTC
https://github.com/openshift/openshift-ansible/pull/10030 release-3.11 pick

Comment 34 Wei Sun 2018-09-13 02:07:21 UTC
PR 10030 has been merged into openshift-ansible-3.11.2-1; please check the bug.

Comment 35 liujia 2018-09-14 09:05:33 UTC
Verification blocked by bz1628730.

Comment 36 liujia 2018-09-17 05:58:06 UTC
Verified on openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

Comment 37 liujia 2018-09-17 05:59:09 UTC
(In reply to liujia from comment #36)
> Verified on openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

(In reply to liujia from comment #36)
> Verified on openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

Sorry, that was a wrong paste intended for another bug. Changing the status back.

Comment 38 liujia 2018-09-19 06:47:40 UTC
Verified on openshift-ansible-3.11.9-1.git.0.63f7970.el7_5.noarch

Comment 39 Luke Meyer 2018-12-21 15:23:38 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.