Bug 1547923 - Upgrade failed at task [Verify that the web console is running] during minor version upgrade
Summary: Upgrade failed at task [Verify that the web console is running] during minor ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Samuel Padgett
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-22 10:00 UTC by liujia
Modified: 2018-03-27 09:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-28 13:04:04 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0489 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.9 RPM Release Advisory 2018-03-28 18:06:38 UTC

Description liujia 2018-02-22 10:00:04 UTC
Description of problem:
Upgrading an older v3.9 build to the latest v3.9 build, with openshift_release set in the hosts file, fails at task [openshift_web_console : Verify that the web console is running] while installing the web console.

TASK [openshift_web_console : Verify that the web console is running] **********
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:158
FAILED - RETRYING: Verify that the web console is running (60 retries left).
....
FAILED - RETRYING: Verify that the web console is running (1 retries left).
fatal: [x.x.x.x]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://webconsole.openshift-web-console.svc/healthz"], "delta": "0:00:01.013142", "end": "2018-02-22 03:51:26.238385", "msg": "non-zero return code", "rc": 7, "start": "2018-02-22 03:51:25.225243", "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused", "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", "                                 Dload  Upload   Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}
...ignoring

...
...

TASK [openshift_web_console : Report console errors] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:211
fatal: [x.x.x.x]: FAILED! => {"changed": false, "msg": "Console install failed."}

==debug info
After the upgrade failed, running "curl -k https://webconsole.openshift-web-console.svc/healthz" manually on the master hosts returned the expected reply.

# curl -k https://webconsole.openshift-web-console.svc/healthz
ok


Version-Release number of the following components:
openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install an older v3.9 build of OCP.
2. Run the upgrade against that cluster with openshift_release specified in the hosts file (workaround for bz1547898):
openshift_release=v3.9

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
TASK [openshift_web_console : Verify that the web console is running] **********
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:158
FAILED - RETRYING: Verify that the web console is running (60 retries left).
....
FAILED - RETRYING: Verify that the web console is running (1 retries left).
fatal: [x.x.x.x]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://webconsole.openshift-web-console.svc/healthz"], "delta": "0:00:01.013142", "end": "2018-02-22 03:51:26.238385", "msg": "non-zero return code", "rc": 7, "start": "2018-02-22 03:51:25.225243", "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused", "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", "                                 Dload  Upload   Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}
...ignoring

TASK [openshift_web_console : Check status in the openshift-web-console namespace] ***
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:176
changed: [x.x.x.x] => {"changed": true, "cmd": ["oc", "status", "--config=/tmp/console-ansible-ShiWVE/admin.kubeconfig", "-n", "openshift-web-console"], "delta": "0:00:00.773984", "end": "2018-02-22 03:51:28.249952", "rc": 0, "start": "2018-02-22 03:51:27.475968", "stderr": "", "stderr_lines": [], "stdout": "In project openshift-web-console on server https://qe-jliu-39-master-etcd-1:8443\n\nsvc/webconsole - 172.30.148.25:443 -> 8443\n  deployment/webconsole deploys registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.0\n\nView details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.", "stdout_lines": ["In project openshift-web-console on server https://qe-jliu-39-master-etcd-1:8443", "", "svc/webconsole - 172.30.148.25:443 -> 8443", "  deployment/webconsole deploys registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.0", "", "View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'."]}

TASK [openshift_web_console : debug] *******************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:181
ok: [x.x.x.x] => {
    "msg": [
        "In project openshift-web-console on server https://qe-jliu-39-master-etcd-1:8443", 
        "", 
        "svc/webconsole - 172.30.148.25:443 -> 8443", 
        "  deployment/webconsole deploys registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.0", 
        "", 
        "View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'."
    ]
}

TASK [openshift_web_console : Get pods in the openshift-web-console namespace] ***
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:183
changed: [x.x.x.x] => {"changed": true, "cmd": ["oc", "get", "pods", "--config=/tmp/console-ansible-ShiWVE/admin.kubeconfig", "-n", "openshift-web-console", "-o", "wide"], "delta": "0:05:00.638511", "end": "2018-02-22 03:56:30.187713", "rc": 0, "start": "2018-02-22 03:51:29.549202", "stderr": "", "stderr_lines": [], "stdout": "NAME                          READY     STATUS    RESTARTS   AGE       IP            NODE\nwebconsole-6fcb5b98f6-5hgkl   1/1       Running   0          15m       10.128.0.16   qe-jliu-39-master-etcd-1", "stdout_lines": ["NAME                          READY     STATUS    RESTARTS   AGE       IP            NODE", "webconsole-6fcb5b98f6-5hgkl   1/1       Running   0          15m       10.128.0.16   qe-jliu-39-master-etcd-1"]}

TASK [openshift_web_console : debug] *******************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:188
ok: [x.x.x.x] => {
    "msg": [
        "NAME                          READY     STATUS    RESTARTS   AGE       IP            NODE", 
        "webconsole-6fcb5b98f6-5hgkl   1/1       Running   0          15m       10.128.0.16   qe-jliu-39-master-etcd-1"
    ]
}

TASK [openshift_web_console : Report console errors] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/install.yml:211
fatal: [x.x.x.x]: FAILED! => {"changed": false, "msg": "Console install failed."}

Comment 1 Samuel Padgett 2018-02-22 14:48:29 UTC
It looks like you might not have included all of the output after the "Verify that the web console is running" task. Is there output with the pod logs?

If not, can you include the log from...?

oc logs webconsole-6fcb5b98f6-5hgkl -n openshift-web-console

I don't see anything obviously wrong. It's odd that the curl failed when the pod is running and ready.

Comment 2 Samuel Padgett 2018-02-22 14:57:03 UTC
I just noticed that you ran the curl manually, and it worked. I wonder if the pod simply took more than 5 minutes to become ready.

cc Scott

Comment 3 Samuel Padgett 2018-02-22 15:06:08 UTC
The oc status output normally has a line like

deployment #1 running for 16 hours - 3 pods

that's missing here. If I'm reading it right, the call to `oc get pods` took 5 minutes (!), so it's possible that the pod became ready between the curl check and that command's output, particularly since running curl manually later worked.
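
The timing argument above can be checked against the timestamps in the task output. The following is a minimal sketch in Python, not part of openshift-ansible; the variable names are illustrative, and the timestamps are copied from the "Verify that the web console is running" and "Get pods in the openshift-web-console namespace" results in the description.

```python
from datetime import datetime

# Timestamp layout used in the Ansible "start"/"end" fields.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# "end" of the last failed curl attempt.
curl_end = datetime.strptime("2018-02-22 03:51:26.238385", FMT)
# "start" and "end" of the `oc get pods` task.
get_pods_start = datetime.strptime("2018-02-22 03:51:29.549202", FMT)
get_pods_end = datetime.strptime("2018-02-22 03:56:30.187713", FMT)

# How long `oc get pods` itself ran.
get_pods_duration = get_pods_end - get_pods_start
# How much time passed between the last curl failure and the pod listing.
gap_after_last_curl = get_pods_end - curl_end

print("oc get pods took:", get_pods_duration)
print("last curl failure to pod listing:", gap_after_last_curl)
```

The computed duration agrees with the "delta" field reported for the `oc get pods` task, so the pod had roughly five more minutes to become ready after the final curl attempt and before it was listed as 1/1 Running.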

Comment 4 Samuel Padgett 2018-02-22 16:42:10 UTC
jiajliu - If you still have the ansible output, please include the events that were printed as well. Since the events will have expired at this point, you will only be able to find those in the ansible output.

Comment 6 liujia 2018-02-23 01:58:31 UTC
@Samuel Padgett

I've attached all output from PLAY [Upgrade web console] to the end. Hope it helps.

Comment 7 Samuel Padgett 2018-02-23 15:40:13 UTC
Thanks for the log.

It looks like we simply aren't waiting long enough for the console to be ready. From the events, the masters were not schedulable for a few minutes after the console was deployed. Since the masters were just recently upgraded and the console pod runs on the masters, it's possible the container will take more than 5 minutes to be ready, in particular if it needs to pull an image (not the case here, but possible). Note that the same curl command succeeded when run manually later.

Given that deployment configs default to 10 minutes before failing, it seems reasonable to use the same for openshift-ansible.

https://github.com/openshift/openshift-ansible/pull/7266

Please try again when these changes go in and reopen if you still see the failure.
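
To illustrate the effect of the longer wait, here is a minimal Python sketch of a poll-until-ready loop with a 10-minute budget. It is not the code from the pull request, which changes the Ansible task. The names `wait_until_ready` and `healthz_ok` are hypothetical, and the 5-second interval is an assumption (60 attempts in about 5 minutes suggests it, but the report does not state the delay).

```python
import ssl
import time
import urllib.request
from typing import Callable

# URL taken from the failing task's curl command.
HEALTHZ_URL = "https://webconsole.openshift-web-console.svc/healthz"


def healthz_ok(url: str = HEALTHZ_URL) -> bool:
    """Return True if the endpoint answers 'ok'. Skips TLS verification like `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, context=ctx, timeout=5) as resp:
            return resp.read().decode().strip() == "ok"
    except OSError:
        # Covers connection refused, DNS failures, timeouts, and HTTP errors.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 600.0,   # 10 minutes, matching the budget discussed above
    interval_s: float = 5.0,    # assumed delay between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it succeeds or the time budget is used up."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    print("console ready:", wait_until_ready(healthz_ok))
```

The clock and sleep functions are injectable so the loop can be exercised without waiting: a console that becomes ready after 400 seconds is missed with a 300-second budget and caught with a 600-second one.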

Comment 12 liujia 2018-02-27 06:46:01 UTC
Verified on openshift-ansible-3.9.0-0.53.0.git.0.f8f01ef.el7.noarch

