Bug 1396350 - Install fails when Host name has capital letter assigned to it
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 3.7.z
Assignee: Vadim Rutkovsky
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1543748 1543749
 
Reported: 2016-11-18 06:57 UTC by Gan Huang
Modified: 2018-10-12 18:11 UTC
CC: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
OpenShift requires that hostnames conform to standards which preclude the use of upper case letters. Now the installer ensures that the hostnames for node objects are created with lowercase letters.
Clone Of: 1258243
Clones: 1543748 1543749
Environment:
Last Closed: 2018-08-28 12:27:34 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3293431 None None None 2017-12-22 12:22:13 UTC

Description Gan Huang 2016-11-18 06:57:10 UTC
+++ This bug was initially created as a clone of Bug #1258243 +++

Description of problem: The Ansible (advanced) and quick installs fail when the hostname is manually defined containing a capital letter.

- The install takes the environment variable with the capital letter and passes it through to the command `oc get node <openshift_hostname>`.

The Ansible installer has this as `<openshift_nodes>`.

Kubernetes converts the names of the nodes to lower case and will not recognize a node name with a capital letter. Example: 

HostName:   Node1.example.com 
Kubernetes names the node:    node1.example.com 

The Ansible install checks node registration with the following:

# oc get node Node1.example.com

Since Kubernetes only knows the node as "node1.example.com", the command finds nothing and the installer in turn fails.
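The mismatch can be demonstrated outside the installer; a minimal sketch, where `tr` stands in for the lowercasing Kubernetes applies to node names:

```shell
#!/bin/sh
# Hostname as configured in the inventory (mixed case).
configured="Node1.example.com"

# Kubernetes stores the node object under the lowercased name;
# tr stands in here for that normalization.
registered=$(printf '%s' "$configured" | tr '[:upper:]' '[:lower:]')
echo "$registered"   # node1.example.com

# The installer queried the configured name verbatim:
#   oc get node Node1.example.com
# which finds nothing, because only "node1.example.com" exists.
```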




Version-Release number of selected component (if applicable): 
openshift-ansible-3.4.27-1

How reproducible:
100%
Quick and Advanced 

Steps to Reproduce:
1.  Spin up a new VM and follow the docs to get it ready
2.  Edit your Ansible host file with the names of the hosts using capital letters. Also manually define the variables openshift_hostname and openshift_public_hostname
3.  Run the Ansible installer

Actual results:

Installer fails to register nodes and stops before the SDN is configured

Expected results:

Installer completes, converting the capital letters to lowercase when running Kubernetes commands to check nodes.

Additional info:

Ansible Code that fails:
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.27-1/roles/openshift_manage_node/tasks/main.yml#L17

Work around: 
Change all names and variables in /etc/ansible/hosts to lower case
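The workaround above can be sketched with `tr`; this assumes nothing else in the inventory is case-sensitive, so review the output before replacing the real file:

```shell
#!/bin/sh
# Sketch of the workaround: lowercase the inventory entries.
# Shown against a sample line; for the real file you would run
#   tr '[:upper:]' '[:lower:]' < /etc/ansible/hosts > hosts.lower
# and review hosts.lower before swapping it in.
line="Node1.example.com openshift_hostname=Node1.example.com"
printf '%s\n' "$line" | tr '[:upper:]' '[:lower:]'
```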

Comment 1 Gan Huang 2016-12-23 06:15:03 UTC
This issue has been fixed in https://github.com/openshift/openshift-ansible/pull/2835

 - name: Wait for Node Registration
   command: >
-    {{ openshift.common.client_binary }} get node {{ hostvars[item].openshift.common.hostname }}
+    {{ openshift.common.client_binary }} get node {{ openshift.common.hostname | lower }}
     --config={{ openshift_manage_node_kubeconfig }}
     -n default
   register: omd_get_node


Could anyone help confirm it, then mark it "ON_QA" so that I can verify it directly?

Comment 9 Scott Dodson 2017-09-21 12:36:33 UTC
https://github.com/openshift/openshift-ansible/pull/5483 proposed fix

Comment 11 Scott Dodson 2017-09-21 21:33:23 UTC
Ok, after setting my hostname to a mixed-case hostname using `hostnamectl` I was able to reproduce the problem, and the proposed patch works for both install and upgrade scenarios.

Comment 13 openshift-github-bot 2017-09-23 02:08:52 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/16be0f2eb09b2e4c2b14864ba83195866479f917
Ensure that hostname is lowercase

Fixes Bug 1396350

Comment 15 Gan Huang 2017-10-10 07:50:15 UTC
Tested with openshift-ansible-3.7.0-0.146.0.git.0.3038a60.el7.noarch.rpm

Installation failed at:
RUNNING HANDLER [openshift_node : restart openvswitch pause] *******************
skipping: [OpenShift-151.lab.sjc.Redhat.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

RUNNING HANDLER [openshift_node : restart node] ********************************

FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).

FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).

FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).

fatal: [OpenShift-151.lab.sjc.Redhat.com]: FAILED! => {"attempts": 3, "changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
fatal: [OpenShift-126.lab.sjc.Redhat.com]: FAILED! => {"attempts": 3, "changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}


Node logs:
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.304138   51318 kubelet_node_status.go:82] Attempting to register node openshift-126.lab.sjc.redhat.com
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.307431   51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasSufficientDisk
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.307953   51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasSufficientMemory
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.308366   51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasNoDiskPressure
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: E1010 03:40:32.309045   51318 kubelet_node_status.go:106] Unable to register node "openshift-126.lab.sjc.redhat.com" with API server: nodes "openshift-126.lab.sjc.redhat.com" is forbidden: node OpenShift-126.lab.sjc.Redhat.com cannot modify node openshift-126.lab.sjc.redhat.com
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: W1010 03:40:32.489249   51318 sdn_controller.go:48] Could not find an allocated subnet for node: openshift-126.lab.sjc.redhat.com, Waiting...

Comment 16 Gan Huang 2017-10-10 07:54:29 UTC
With the same configuration, I didn't encounter this issue in openshift-ansible-3.6.173.0.45-1.git.0.dc70c99.el7.noarch.rpm.

#cat inventory_host
<--snip-->
[masters]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root  openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com
[nodes]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root  openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_node_labels="{'role': 'node'}"
OpenShift-151.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root  openshift_public_hostname=OpenShift-151.lab.sjc.Redhat.com openshift_hostname=OpenShift-151.lab.sjc.Redhat.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"
[etcd]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root  openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com
[nfs]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root 
<--snip-->

Comment 17 Scott Dodson 2017-10-10 12:48:24 UTC
Gan,

What was the process followed in #15? Was that a clean install or running config.yml against an existing cluster, one where the hostname had an upper case letter?

Comment 18 Gan Huang 2017-10-10 13:02:47 UTC
(In reply to Scott Dodson from comment #17)
> Gan,
> 
> What was the process followed in #15? Was that a clean install or running
> config.yml against an existing cluster, one where the hostname had an upper
> case letter?

It was a fresh installation.

Steps:

1. Spin up instances to be installed

2. Assemble the inventory file, modifying `openshift-126.lab.sjc.redhat.com` to `OpenShift-126.lab.sjc.Redhat.com` so it contains upper case letters.

3. Trigger installation

Just a note: as in comment 16, it succeeded in OCP 3.6.

Might be related to https://github.com/kubernetes/kubernetes/issues/47695#issuecomment-315245352, which can only be reproduced in 3.7.

Comment 25 Eric Rich 2017-12-14 17:00:34 UTC
In short, due to RFC 4343 (https://tools.ietf.org/html/rfc4343) we likely need to put in some additional test cases to ensure that this does not happen again.
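Such a check might look like the following sketch (a hypothetical helper, not part of the openshift-ansible test suite), which rejects any hostname that is not already lowercase:

```shell
#!/bin/sh
# Hypothetical regression check: node object names must already be
# lowercase, since Kubernetes treats DNS names case-insensitively
# and stores node names lowercased.
check_hostname() {
    name="$1"
    lowered=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    # Succeeds only when the name is already all-lowercase.
    [ "$name" = "$lowered" ]
}

check_hostname "node1.example.com" && echo "ok: node1.example.com"
check_hostname "Node1.example.com" || echo "rejected: Node1.example.com"
```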

Comment 30 Scott Dodson 2017-12-15 21:12:31 UTC
For immediate relief, they can disable the NodeRestriction plugin by adding the following stanza to their admissionConfig and restarting their API servers.

admissionConfig:
  pluginConfig:
    NodeRestriction:
      configuration:
        kind: DefaultAdmissionConfig
        apiVersion: v1
        disable: true

Once that's done nodes should start becoming ready again.

After that, we can regenerate the certificates for the cluster. The 3.7 playbooks will correctly generate certificates using a lowercase CN; however, there's a bug where those playbooks do not correctly update the node config to point at the newly minted kubeconfig, and that will have to be corrected before re-enabling the admission plugin.

Run the playbook
# ansible-playbook -i hosts playbooks/byo/openshift-cluster/redeploy-node-certificates.yml

On each node, update the node config to point at the new kubeconfig, for example:
# grep masterKubeConfig /etc/origin/node/node-config.yaml 
masterKubeConfig: system:node:OSE3-NODE1.example.com.kubeconfig

change that to

masterKubeConfig: system:node:ose3-node1.example.com.kubeconfig


Restart the node service. Once all nodes have been updated to use the new kubeconfig, you may comment out the NodeRestriction admission plugin configuration and restart the API servers.
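The per-node edit can be scripted with sed; a sketch against a scratch copy, using the example hostname from the procedure (on a real node the target is /etc/origin/node/node-config.yaml and each node's own hostname is substituted):

```shell
#!/bin/sh
# Sketch: rewrite masterKubeConfig to the lowercase kubeconfig name.
# Demonstrated on a scratch file; on a real node, edit
# /etc/origin/node/node-config.yaml and then restart the node service.
cfg=$(mktemp)
echo 'masterKubeConfig: system:node:OSE3-NODE1.example.com.kubeconfig' > "$cfg"

# Lowercase the hostname portion of the kubeconfig filename.
sed -i 's/OSE3-NODE1\.example\.com/ose3-node1.example.com/' "$cfg"

cat "$cfg"   # masterKubeConfig: system:node:ose3-node1.example.com.kubeconfig
rm -f "$cfg"
```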

Comment 36 Scott Dodson 2018-01-19 13:44:54 UTC
The workaround in the solution is only relevant to the 3.6 to 3.7 upgrade where the node restriction admission config plugin blocks updates.

If this is happening on 3.5 to 3.6 upgrades that's something else we need to investigate.

Comment 37 Jaspreet Kaur 2018-01-22 08:56:45 UTC
Yes, the upgrade is from 3.5 to 3.6. The upgrade is not failing, but openshift_upgrade_nodes_label doesn't consider the label that we define.

Comment 40 Scott Dodson 2018-01-22 13:43:34 UTC
Jaspreet, sure, if that fixes it for your customer then let's go ahead with that change.

https://github.com/openshift/openshift-ansible/pull/6812

Comment 42 Vadim Rutkovsky 2018-01-29 09:53:49 UTC
The fix is available in openshift-ansible-3.7.27-1.git.0.ae95fc3.el7

Comment 43 Gan Huang 2018-01-30 07:38:45 UTC
Tested in openshift-ansible-3.7.27-1.git.0.ae95fc3.el7.noarch.rpm

A fresh installation with a capital letter in the hostname succeeded.

With regard to the upgrade from v3.6 to v3.7, the test result is the same as in comment 28 and comment 31, and I double-checked that the workaround (comment 31) works for QE too.

Vadim, if we don't intend to fix the issue via the upgrade playbook, I think we'd better document the workaround as a known issue. WDYT?

Comment 44 Vadim Rutkovsky 2018-01-30 11:47:15 UTC
(In reply to Gan Huang from comment #43)
> Wrt the upgrade from v3.6 to v3.7, the test result is the same as comment 28
> and comment 31, and double checked that the workaround (comment 31) works
> for QE too. 

Excellent

> Vadim, if we don't intend to fix the issue via the upgrade playbook, I think
> we'd better to document the workaround as a known issue. WDYT?

I'll cherry-pick the PRs to release-3.6 and check if they still apply. If for some reason they don't work, we'll settle for the workaround from comment 31.

Moving back to MODIFIED

Comment 45 Vadim Rutkovsky 2018-01-30 14:01:43 UTC
Created https://github.com/openshift/openshift-ansible/pull/6939 (for 3.5) and https://github.com/openshift/openshift-ansible/pull/6940 (for 3.6), but I'm a bit stuck during verification.

