Bug 1565363

Summary: [DOCS] Upgrading from 3.7 to 3.9 fails when inventory file specifies the master(s) as being unschedulable prior to upgrade, does not automatically make master(s) schedulable
Product: OpenShift Container Platform
Component: Documentation
Version: 3.9.0
Target Release: 3.11.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: urgent
Status: CLOSED CURRENTRELEASE
Type: Bug
Last Closed: 2018-08-23 17:26:29 UTC
Reporter: Candace Sheremeta <cshereme>
Assignee: Kathryn Alexander <kalexand>
QA Contact: liujia <jiajliu>
Docs Contact: Vikram Goyal <vigoyal>
CC: adellape, aos-bugs, jokerman, kalexand, mmccomas, rhowe, scott.c.worthington, wmeng, yapei
Doc Type: If docs needed, set a value
Bug Blocks: 1565702 (view as bug list)

Description Candace Sheremeta 2018-04-09 22:23:04 UTC
Description of problem:

When the masters are marked as unschedulable in the inventory file prior to a 3.9 upgrade attempt by setting the following variable:

openshift_schedulable=False

the upgrade process fails.
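
For reference, the variable is set per host in the [nodes] section of the inventory file; a minimal sketch with a hypothetical host name:

[nodes]
master.example.com openshift_schedulable=False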


Version-Release number of the following components:

# rpm -q openshift-ansible
openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch

# rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch

# ansible --version
ansible 2.4.3.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

How reproducible: 100%

Steps to Reproduce:
1. Set up a 3.7 cluster
2. Mark the master node as unschedulable by setting openshift_schedulable=False in the inventory file
3. Attempt to upgrade to 3.9

Actual results:

TASK [openshift_web_console : Verify that the web console is running] *****************************
FAILED - RETRYING: Verify that the web console is running (60 retries left).
FAILED - RETRYING: Verify that the web console is running (59 retries left).
FAILED - RETRYING: Verify that the web console is running (58 retries left).
FAILED - RETRYING: Verify that the web console is running (57 retries left).
FAILED - RETRYING: Verify that the web console is running (56 retries left).
FAILED - RETRYING: Verify that the web console is running (55 retries left).
FAILED - RETRYING: Verify that the web console is running (54 retries left).
FAILED - RETRYING: Verify that the web console is running (53 retries left).
FAILED - RETRYING: Verify that the web console is running (52 retries left).
FAILED - RETRYING: Verify that the web console is running (51 retries left).
FAILED - RETRYING: Verify that the web console is running (50 retries left).
FAILED - RETRYING: Verify that the web console is running (49 retries left).
FAILED - RETRYING: Verify that the web console is running (48 retries left).
FAILED - RETRYING: Verify that the web console is running (47 retries left).
FAILED - RETRYING: Verify that the web console is running (46 retries left).
FAILED - RETRYING: Verify that the web console is running (45 retries left).
FAILED - RETRYING: Verify that the web console is running (44 retries left).
FAILED - RETRYING: Verify that the web console is running (43 retries left).
FAILED - RETRYING: Verify that the web console is running (42 retries left).
FAILED - RETRYING: Verify that the web console is running (41 retries left).
FAILED - RETRYING: Verify that the web console is running (40 retries left).
FAILED - RETRYING: Verify that the web console is running (39 retries left).
FAILED - RETRYING: Verify that the web console is running (38 retries left).
FAILED - RETRYING: Verify that the web console is running (37 retries left).
FAILED - RETRYING: Verify that the web console is running (36 retries left).
FAILED - RETRYING: Verify that the web console is running (35 retries left).
FAILED - RETRYING: Verify that the web console is running (34 retries left).
FAILED - RETRYING: Verify that the web console is running (33 retries left).
FAILED - RETRYING: Verify that the web console is running (32 retries left).
FAILED - RETRYING: Verify that the web console is running (31 retries left).
FAILED - RETRYING: Verify that the web console is running (30 retries left).
FAILED - RETRYING: Verify that the web console is running (29 retries left).
FAILED - RETRYING: Verify that the web console is running (28 retries left).
FAILED - RETRYING: Verify that the web console is running (27 retries left).
FAILED - RETRYING: Verify that the web console is running (26 retries left).
FAILED - RETRYING: Verify that the web console is running (25 retries left).
FAILED - RETRYING: Verify that the web console is running (24 retries left).
FAILED - RETRYING: Verify that the web console is running (23 retries left).
FAILED - RETRYING: Verify that the web console is running (22 retries left).
FAILED - RETRYING: Verify that the web console is running (21 retries left).
FAILED - RETRYING: Verify that the web console is running (20 retries left).
FAILED - RETRYING: Verify that the web console is running (19 retries left).
FAILED - RETRYING: Verify that the web console is running (18 retries left).
FAILED - RETRYING: Verify that the web console is running (17 retries left).
FAILED - RETRYING: Verify that the web console is running (16 retries left).
FAILED - RETRYING: Verify that the web console is running (15 retries left).
FAILED - RETRYING: Verify that the web console is running (14 retries left).
FAILED - RETRYING: Verify that the web console is running (13 retries left).
FAILED - RETRYING: Verify that the web console is running (12 retries left).
FAILED - RETRYING: Verify that the web console is running (11 retries left).
FAILED - RETRYING: Verify that the web console is running (10 retries left).
FAILED - RETRYING: Verify that the web console is running (9 retries left).
FAILED - RETRYING: Verify that the web console is running (8 retries left).
FAILED - RETRYING: Verify that the web console is running (7 retries left).
FAILED - RETRYING: Verify that the web console is running (6 retries left).
FAILED - RETRYING: Verify that the web console is running (5 retries left).
FAILED - RETRYING: Verify that the web console is running (4 retries left).
FAILED - RETRYING: Verify that the web console is running (3 retries left).
FAILED - RETRYING: Verify that the web console is running (2 retries left).
FAILED - RETRYING: Verify that the web console is running (1 retries left).
fatal: [marvolo.cluster.lan]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://webconsole.openshift-web-console.svc/healthz"], "delta": "0:00:01.013555", "end": "2018-04-09 17:47:24.207603", "msg": "non-zero return code", "rc": 7, "start": "2018-04-09 17:47:23.194048", "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused", "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", "                                 Dload  Upload   Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0", "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}
...ignoring

TASK [openshift_web_console : Check status in the openshift-web-console namespace] ****************
changed: [marvolo.cluster.lan]

TASK [openshift_web_console : debug] **************************************************************
ok: [marvolo.cluster.lan] => {
    "msg": [
        "In project openshift-web-console on server https://marvolo.cluster.lan:8443", 
        "", 
        "svc/webconsole - 172.30.10.18:443 -> 8443", 
        "  deployment/webconsole deploys registry.access.redhat.com/openshift3/ose-web-console:v3.9.14", 
        "    deployment #1 running for about an hour - 0/1 pods", 
        "", 
        "View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'."
    ]
}

TASK [openshift_web_console : Get pods in the openshift-web-console namespace] ********************
changed: [marvolo.cluster.lan]

TASK [openshift_web_console : debug] **************************************************************
ok: [marvolo.cluster.lan] => {
    "msg": [
        "NAME                          READY     STATUS    RESTARTS   AGE       IP        NODE", 
        "webconsole-56c6745c85-tt8dp   0/1       Pending   0          54m       <none>    <none>"
    ]
}

TASK [openshift_web_console : Get events in the openshift-web-console namespace] ******************
changed: [marvolo.cluster.lan]

TASK [openshift_web_console : debug] **************************************************************
ok: [marvolo.cluster.lan] => {
    "msg": [
        "LAST SEEN   FIRST SEEN   COUNT     NAME                                           KIND         SUBOBJECT   TYPE      REASON              SOURCE                  MESSAGE", 
        "14m         54m          141       webconsole-56c6745c85-tt8dp.1523dfd0cb7b26af   Pod                      Warning   FailedScheduling    default-scheduler       0/3 nodes are available: 1 NodeUnschedulable, 2 MatchNodeSelector.", 
        "12m         13m          7         webconsole-56c6745c85-tt8dp.1523e20fdb005ba1   Pod                      Warning   FailedScheduling    default-scheduler       0/3 nodes are available: 1 NodeUnschedulable, 2 MatchNodeSelector.", 
        "1m          11m          36        webconsole-56c6745c85-tt8dp.1523e224c847702c   Pod                      Warning   FailedScheduling    default-scheduler       0/3 nodes are available: 1 NodeUnschedulable, 2 MatchNodeSelector.", 
        "54m         54m          1         webconsole-56c6745c85.1523dfd0cb68adb0         ReplicaSet               Normal    SuccessfulCreate    replicaset-controller   Created pod: webconsole-56c6745c85-tt8dp", 
        "54m         54m          1         webconsole.1523dfd0c2c1bff1                    Deployment               Normal    ScalingReplicaSet   deployment-controller   Scaled up replica set webconsole-56c6745c85 to 1"
    ]
}

TASK [openshift_web_console : Get console pod logs] ***********************************************
changed: [marvolo.cluster.lan]

TASK [openshift_web_console : debug] **************************************************************
ok: [marvolo.cluster.lan] => {
    "msg": []
}

TASK [openshift_web_console : Remove temp directory] **********************************************
ok: [marvolo.cluster.lan]

TASK [openshift_web_console : Report console errors] **********************************************
fatal: [marvolo.cluster.lan]: FAILED! => {"changed": false, "msg": "Console install failed."}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.retry

PLAY RECAP ****************************************************************************************
localhost                  : ok=25   changed=0    unreachable=0    failed=0   
marvolo.cluster.lan        : ok=311  changed=40   unreachable=0    failed=1   



Failure summary:


  1. Hosts:    marvolo.cluster.lan
     Play:     Upgrade web console
     Task:     Report console errors
     Message:  Console install failed.



Expected results: 
Master is automatically marked as schedulable per the following quote from the docs at https://docs.openshift.com/container-platform/3.9/upgrading/automated_upgrades.html#preparing-for-an-automated-upgrade :

"In previous versions of OpenShift Container Platform, master hosts were marked unschedulable by default by the installer, meaning that no pods could be placed on the hosts. Starting with OpenShift Container Platform 3.9, however, masters must be marked schedulable; this is done automatically during upgrade."

Upgrade completes successfully.

Comment 1 Ryan Howe 2018-04-09 22:35:40 UTC
This should be a documentation bug, as we do not want to override the behaviour of explicitly setting the openshift_schedulable node variable to false.

Comment 3 liujia 2018-04-10 02:29:39 UTC
QE can reproduce the issue according to the description.

Steps:
1. Install OCP v3.7.
2. Edit the inventory file to set "openshift_schedulable=False" for the master node.
3. Run the upgrade against the above OCP cluster.

But AFAIK, this should be the expected result, because the user wants the master node to be unschedulable when setting "openshift_schedulable=False" in the inventory file.

> "In previous versions of OpenShift Container Platform, master hosts were marked unschedulable by default by the installer, meaning that no pods could be placed on the hosts. Starting with OpenShift Container Platform 3.9, however, masters must be marked schedulable; this is done automatically during upgrade."

"Master node was marked unschedulable by default" refers to the case where openshift_schedulable is not set explicitly. When no "openshift_schedulable" is specified, the installer applies its default behavior.

To resolve it, the user can simply remove "openshift_schedulable=False" if they accept the master node becoming schedulable.

Comment 4 Scott Dodson 2018-04-10 14:39:57 UTC
The documentation note was intended to mean that if you had previously not set the schedulability of the masters, we would automatically make them schedulable. We should clarify that note so it's clear that if all masters have openshift_schedulable=false, the upgrade will fail to install the console and potentially other components.

The workaround here is to remove `openshift_schedulable=false` from the masters and re-run the playbooks which will mark them schedulable.
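
For illustration, a sketch of that workaround, assuming a hypothetical host name; the upgrade playbook path is inferred from the retry file shown in the output above:

# In the [nodes] section of the inventory, drop the variable from each master entry:
#   before: master.example.com openshift_schedulable=false openshift_node_labels="{'region': 'master', 'zone': 'default'}"
#   after:  master.example.com openshift_node_labels="{'region': 'master', 'zone': 'default'}"
# Then re-run the upgrade playbook:
ansible-playbook -i /path/to/inventory /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml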

Comment 5 Scott Worthington 2018-04-13 12:08:20 UTC
(In reply to Scott Dodson from comment #4)
> The documentation note was intended to mean that if you'd previously not set
> schedulability of the masters we'd automatically make them schedulable. We
> should clarify that note to make sure that if they have all masters
> openshift_schedulable=false the upgrade will fail to install the console and
> potentially other components.
> 
> The workaround here is to remove `openshift_schedulable=false` from the
> masters and re-run the playbooks which will mark them schedulable.

Note that the RH customer's BYO inventory for the upgrade had the following attributes:

[masters]
master-nodes-0[1:3].corp.local

[nodes]
master-nodes-0[1:3].corp.local openshift_schedulable=false openshift_node_labels="{'region': 'master', 'zone': 'default'}"
infra-nodes-0[1:3].corp.local openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
app-nodes-0[1:3].corp.local openshift_node_labels="{'region': 'apps', 'zone': 'default'}"

###

Perhaps the documentation should be clearer when upgrading from 3.7.x to 3.9.x by stating _explicitly_ that the variable "openshift_schedulable=false" should be removed from all master inventory entries if it was previously set.
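
For example, a sketch of the customer's [nodes] section above after making that change (the variable removed from the master entries, labels unchanged):

[nodes]
master-nodes-0[1:3].corp.local openshift_node_labels="{'region': 'master', 'zone': 'default'}"
infra-nodes-0[1:3].corp.local openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
app-nodes-0[1:3].corp.local openshift_node_labels="{'region': 'apps', 'zone': 'default'}"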

Comment 7 Kathryn Alexander 2018-08-17 18:27:44 UTC
I have a draft of this change here: https://github.com/openshift/openshift-docs/pull/11628

@Johnny, will you PTAL? I think I understand the change for both the automatic and manual upgrade, but please let me know if they need more changes.

Comment 8 liujia 2018-08-20 07:02:49 UTC
The automatic upgrade part LGTM. For the manual upgrade part:

> "During upgrade, if a value is not provided for the `openshift_schedulable`
 parameter, it is set to *true*. If a value is provided for
 `openshift_schedulable`, it is not changed."

This doesn't seem relevant to the manual upgrade steps because the inventory file is not needed for a manual upgrade. It's better to remove it, or else to update the beginning of the above section to "During an automatic upgrade, ...".

> "Remove the `openshift_schedulable` parameter and its value from the entry for
 each master host in the Ansible inventory file."

It would be better to add "if a value is provided for `openshift_schedulable`" to the above section.

Comment 9 Kathryn Alexander 2018-08-20 15:43:15 UTC
Thanks Jia Liu! I removed those sections per your feedback and Scott's feedback. Will you please take another look?

https://github.com/openshift/openshift-docs/pull/11628

Comment 10 liujia 2018-08-21 08:44:51 UTC
(In reply to Kathryn Alexander from comment #9)
> Thanks Jia Liu! I removed those sections per your feedback and Scott's
> feedback. Will you please take another look?
> 
> https://github.com/openshift/openshift-docs/pull/11628

It seems the update is not in PR 11628. Could you double-check?

Comment 11 Kathryn Alexander 2018-08-21 13:47:32 UTC
I apologize for the user error - I've fixed the PR.

@Jia Liu, will you PTAL?

Comment 12 liujia 2018-08-22 03:26:17 UTC
LGTM now.

Comment 13 Kathryn Alexander 2018-08-22 12:25:49 UTC
Thanks! I've merged the change and am waiting for it to go live.