1614650 – Upgrade fails at "Set node schedulability"

Summary: Upgrade fails at "Set node schedulability"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Michael Gugino
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1614625
Depends On:
Blocks:	1638521

Reported:	2018-08-10 07:10 UTC by Jaspreet Kaur
Modified:	2022-03-13 15:22 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1638521
Environment:
Last Closed:	2019-01-10 09:03:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0024	0	None	None	None	2019-01-10 09:04:05 UTC

Description Jaspreet Kaur 2018-08-10 07:10:10 UTC

Description of problem: Upgrade fails at a very early stage. The failure is inconsistent, and the task looks up the node by its short name instead of the FQDN provided in the inventory:

2018-08-10 02:08:55,534 p=17663 u=root |  TASK [openshift_manage_node : Set node schedulability] ************************************************************************************************************************************************************
2018-08-10 02:08:56,810 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (10 retries left).
2018-08-10 02:08:56,811 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (10 retries left).
2018-08-10 02:08:56,845 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (10 retries left).
2018-08-10 02:09:02,822 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (9 retries left).
2018-08-10 02:09:02,893 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (9 retries left).
2018-08-10 02:09:02,898 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (9 retries left).
2018-08-10 02:09:08,810 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (8 retries left).
2018-08-10 02:09:08,914 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (8 retries left).
2018-08-10 02:09:08,927 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (8 retries left).
2018-08-10 02:09:14,830 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (7 retries left).
2018-08-10 02:09:14,947 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (7 retries left).
2018-08-10 02:09:14,950 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (7 retries left).
2018-08-10 02:09:20,849 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (6 retries left).
2018-08-10 02:09:20,984 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (6 retries left).
2018-08-10 02:09:21,016 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (6 retries left).
2018-08-10 02:09:26,843 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (5 retries left).
2018-08-10 02:09:26,970 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (5 retries left).
2018-08-10 02:09:27,035 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (5 retries left).
2018-08-10 02:09:32,838 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (4 retries left).
2018-08-10 02:09:32,987 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (4 retries left).
2018-08-10 02:09:33,074 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (4 retries left).
2018-08-10 02:09:38,835 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (3 retries left).
2018-08-10 02:09:39,010 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (3 retries left).
2018-08-10 02:09:39,090 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (3 retries left).
2018-08-10 02:09:44,860 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (2 retries left).
2018-08-10 02:09:45,016 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (2 retries left).
2018-08-10 02:09:45,115 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (2 retries left).
2018-08-10 02:09:50,881 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (1 retries left).
2018-08-10 02:09:51,067 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (1 retries left).
2018-08-10 02:09:51,157 p=17663 u=root |  FAILED - RETRYING: Set node schedulability (1 retries left).
2018-08-10 02:09:56,901 p=17663 u=root |  fatal: [m001.example.com -> m001.example.com ]: FAILED! => {"attempts": 10, "changed": false, "failed": true, "msg": {"results": [{"cmd": "/usr/bin/oc get node m001 -o json", "results": [{}], "returncode": 1, "stderr": "Error from server (NotFound): nodes \"m001\" not found\n", "stdout": ""}], "returncode": 1}}
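
The mismatch can be read directly out of the final failure line. The following is a minimal Python sketch, not part of openshift-ansible, that parses that line (timestamp prefix dropped) and compares the inventory host with the node name passed to `oc get node`. The helper names `parse_fatal` and `queried_node` are hypothetical and exist only for this illustration.

```python
import json
import re

# Final failure line copied from the log above, without the timestamp prefix.
# Raw strings keep the JSON escapes (\" and \n) exactly as logged.
FATAL_LINE = (
    r'fatal: [m001.example.com -> m001.example.com ]: FAILED! => '
    r'{"attempts": 10, "changed": false, "failed": true, "msg": {"results": '
    r'[{"cmd": "/usr/bin/oc get node m001 -o json", "results": [{}], '
    r'"returncode": 1, "stderr": "Error from server (NotFound): nodes '
    r'\"m001\" not found\n", "stdout": ""}], "returncode": 1}}'
)


def parse_fatal(line):
    """Split an Ansible 'fatal:' line into (inventory_host, result_dict)."""
    host = re.search(r"fatal: \[(\S+)", line).group(1)
    payload = line.split("=> ", 1)[1]
    return host, json.loads(payload)


def queried_node(result):
    """Return the node name from the failed 'oc get node <name>' command."""
    cmd = result["msg"]["results"][0]["cmd"]
    return cmd.split()[3]


host, result = parse_fatal(FATAL_LINE)
node = queried_node(result)

print("inventory host:", host)
print("node queried:  ", node)
print("queried name is the short form of the inventory host:",
      node == host.split(".")[0])
```

In this log the inventory entry is the FQDN while the lookup uses only the first label, which is why the API server answers NotFound.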

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results: Fails at node schedulability in any number of ansible runs
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results: It should pass this.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 14 Scott Dodson 2018-08-27 12:28:53 UTC

*** Bug 1614625 has been marked as a duplicate of this bug. ***

Comment 25 Scott Dodson 2018-10-09 20:05:29 UTC

There are actually two scenarios here.

If they had previously set openshift_hostname, they were effectively setting a configuration item that no longer fits with the model of specifying configuration at a host group level. We will be re-introducing the ability to specify this value as an override, which will be available until they upgrade to 4.0. When upgrading to 4.0 they will need to go through a process to ensure that their nodename matches the output of `hostname`. That migration process is yet to be defined. The ability to override this value for clean 3.10+ installs will not be re-introduced.

If they have not previously set openshift_hostname, the situation is similar: in 3.9, openshift-ansible used `hostname -f` to set nodeName in the config. Since that config file value is no longer valid, we needed to align openshift-ansible with the kubelet, which uses `hostname` rather than `hostname -f`. For this scenario the easiest solution would be to set the host's hostname to the FQDN, e.g. `hostnamectl set-hostname ose3-master.example.com`. Since this may affect other items running on the host, please validate this change in a test environment to minimize risk. This workaround should work with the currently shipped version of openshift-ansible. If for some reason they cannot update their hostname value, the override from the first scenario can also be used once it becomes available.
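
To illustrate the second scenario, here is a rough Python sketch of the comparison involved. It is an assumption-laden illustration, not code from openshift-ansible: the function `needs_hostname_fix` is hypothetical, and the sample values stand in for what `hostname` and `hostname -f` would print on a host.

```python
def needs_hostname_fix(hostname_output, hostname_f_output):
    """Return True when `hostname` and `hostname -f` disagree.

    Per the comment above, 3.9 registered the node under the `hostname -f`
    value, while the kubelet now uses the plain `hostname` value. If the two
    differ, a lookup by the new name will not find the existing node.
    """
    return hostname_output.strip() != hostname_f_output.strip()


# Sample values only: the first pair mirrors the failing master in this bug,
# the second pair is a host after `hostnamectl set-hostname <FQDN>`.
samples = [
    ("m001", "m001.example.com"),
    ("ose3-master.example.com", "ose3-master.example.com"),
]

for short_name, fqdn in samples:
    state = "mismatch" if needs_hostname_fix(short_name, fqdn) else "aligned"
    print(f"hostname={short_name} hostname -f={fqdn}: {state}")
```

On a real host the two inputs would be collected by running `hostname` and `hostname -f` before the upgrade; hosts reported as a mismatch are candidates for the `hostnamectl` workaround or, once available, the override.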

We're working on validating the hostname override work now. We do not have a definitive timeline for when that will be available in a 3.10 errata.

Comment 27 Scott Dodson 2018-10-11 19:23:13 UTC

https://github.com/openshift/openshift-ansible/pull/10356 implements the changes described in comment #25 on release-3.11 branch

Comment 28 liujia 2018-10-16 08:29:21 UTC

This appears to be a regression introduced by disabling openshift_hostname in v3.10. Background is available in bz1613765, bz1572859, and bz1566455.

I went through PR 10356; it may affect install, upgrade, node scale-up, AWS scale groups, and GlusterFS.

Dev has provided some verification scenarios in https://gist.github.com/michaelgugino/c961476d8be7d160a5e53fe9a9734051

This fix should be available in both v3.10 and v3.11. The v3.10 bug is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1638521

Comment 29 Johnny Liu 2018-10-17 12:21:07 UTC

Recording the related 3.11 fix PR here:
https://github.com/openshift/openshift-ansible/pull/10356

Comment 34 Michael Gugino 2018-10-23 14:14:38 UTC

PR merged for 3.11: https://github.com/openshift/openshift-ansible/pull/10447

Comment 35 liujia 2018-10-24 06:40:45 UTC

Verified on openshift-ansible-3.11.31-1.git.0.d4b5614.el7.noarch

Comment 36 sheng.lao 2018-10-24 09:37:12 UTC

Verified on openshift-ansible-3.11.31-1.git.0.d4b5614.el7.noarch.rpm

Scenario 3 (PASSED)

1) Set openshift_kubelet_name_override on the new host and run the scaleup playbook; it fails as expected:
TASK [Fail when openshift_kubelet_name_override is defined] ********************
fatal: [host-xxxxx.redhat.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
	to retry, use: --limit @~/playbooks/openshift-node/scaleup.retry

2) Remove openshift_kubelet_name_override, then run the playbook again: scale-up succeeds.
Check with:
  a) oc new-app centos/ruby-25-centos7~https://github.com/sclorg/ruby-ex.git -l appnew=new_node 
  b) # oc get pod 
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1       Completed   0          2m
ruby-ex-1-xlbm2   1/1       Running     0          1m

  c) oc get pod ruby-ex-1-xlbm2 -o yaml |grep -i node
    appnew: new_node
  nodeName: host-172-16-122-68
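
The same check can be done without grep. Below is a minimal Python sketch, assuming the output of `oc get pod ruby-ex-1-xlbm2 -o json` is available as a string. The JSON here is a trimmed stand-in containing only the fields visible in the grep output above, and `pod_placement` is a hypothetical helper.

```python
import json

# Trimmed stand-in for `oc get pod ruby-ex-1-xlbm2 -o json`. Only the label
# and nodeName shown in the grep output above are filled in.
POD_JSON = """
{
  "metadata": {
    "name": "ruby-ex-1-xlbm2",
    "labels": {"appnew": "new_node"}
  },
  "spec": {"nodeName": "host-172-16-122-68"}
}
"""


def pod_placement(pod):
    """Return (value of the 'appnew' label, node the pod was scheduled on)."""
    label = pod["metadata"]["labels"].get("appnew")
    node_name = pod["spec"]["nodeName"]
    return label, node_name


label, node_name = pod_placement(json.loads(POD_JSON))
print("appnew label:", label)
print("scheduled on:", node_name)
print("node name has a domain part:", "." in node_name)
```

The node name carries no domain part here, which is consistent with the kubelet registering under the plain `hostname` value when no override is set.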

Comment 38 errata-xmlrpc 2019-01-10 09:03:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024
