+++ This bug was initially created as a clone of Bug #1258243 +++

Description of problem:
The ansible and quick install fail when the HostName is manually defined containing a capital letter. The install takes the env variable with the capital letter and passes it through to the command `oc get node <openshift_hostname>` (the ansible installer has this as <openshift_nodes>). Kubernetes converts the names of the nodes to lower case and will not recognize a node name with a capital letter.

Example:
  HostName: Node1.example.com
  Kubernetes names the node: node1.example.com
  The ansible install checks node registration with:
  # oc get node Node1.example.com
The installer in turn fails.

Version-Release number of selected component (if applicable):
openshift-ansible-3.4.27-1

How reproducible:
100%, quick and advanced installs

Steps to Reproduce:
1. Spin up a new VM and follow the docs to get it ready.
2. Edit your ansible hosts file with the names of the hosts using capital letters. Also manually define the variables openshift_hostname and openshift_public_hostname.
3. Run the ansible installer.

Actual results:
Installer fails to register nodes and stops before the SDN is configured.

Expected results:
Installer completes, converting the capital letters to lower case when running kubernetes commands to check the nodes.

Additional info:
Ansible code that fails:
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.4.27-1/roles/openshift_manage_node/tasks/main.yml#L17

Workaround:
Change all names and variables in /etc/ansible/hosts to lower case.
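To illustrate the workaround, an inventory entry would look like this (a minimal sketch only; node1.example.com is the example hostname from the description, not a real host):

  [nodes]
  # Keep both the inventory hostname and the openshift_* variables entirely lowercase
  node1.example.com openshift_hostname=node1.example.com openshift_public_hostname=node1.example.com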
This issue has been fixed in https://github.com/openshift/openshift-ansible/pull/2835:

  - name: Wait for Node Registration
    command: >
-     {{ openshift.common.client_binary }} get node {{ hostvars[item].openshift.common.hostname }}
+     {{ openshift.common.client_binary }} get node {{ openshift.common.hostname | lower }}
      --config={{ openshift_manage_node_kubeconfig }}
      -n default
    register: omd_get_node

Could anyone help confirm it, then mark it "ON_QA" so that I can verify it directly?
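The `lower` used above is the standard Jinja2 filter, so the node lookup is normalized regardless of how the fact is cased. A minimal, self-contained illustration (the hostname is just the example value from the description):

  - name: Show how the lower filter normalizes the hostname (illustrative only)
    debug:
      msg: "{{ 'Node1.example.com' | lower }}"   # prints "node1.example.com"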
Proposed fix: https://github.com/openshift/openshift-ansible/pull/5483
OK, after setting my hostname to a mixed-case hostname using `hostnamectl` I was able to reproduce the problem, and the proposed patch works for both install and upgrade scenarios.
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/16be0f2eb09b2e4c2b14864ba83195866479f917
Ensure that hostname is lowercase
Fixes Bug 1396350
Tested with openshift-ansible-3.7.0-0.146.0.git.0.3038a60.el7.noarch.rpm

Installation failed at:

RUNNING HANDLER [openshift_node : restart openvswitch pause] *******************
skipping: [OpenShift-151.lab.sjc.Redhat.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

RUNNING HANDLER [openshift_node : restart node] ********************************
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
fatal: [OpenShift-151.lab.sjc.Redhat.com]: FAILED! => {"attempts": 3, "changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
fatal: [OpenShift-126.lab.sjc.Redhat.com]: FAILED! => {"attempts": 3, "changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

Node logs:

Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.304138 51318 kubelet_node_status.go:82] Attempting to register node openshift-126.lab.sjc.redhat.com
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.307431 51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasSufficientDisk
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.307953 51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasSufficientMemory
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.308366 51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasNoDiskPressure
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: E1010 03:40:32.309045 51318 kubelet_node_status.go:106] Unable to register node "openshift-126.lab.sjc.redhat.com" with API server: nodes "openshift-126.lab.sjc.redhat.com" is forbidden: node OpenShift-126.lab.sjc.Redhat.com cannot modify node openshift-126.lab.sjc.redhat.com
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: W1010 03:40:32.489249 51318 sdn_controller.go:48] Could not find an allocated subnet for node: openshift-126.lab.sjc.redhat.com, Waiting...
With the same configuration, I didn't encounter this issue in openshift-ansible-3.6.173.0.45-1.git.0.dc70c99.el7.noarch.rpm.

# cat inventory_host
<--snip-->
[masters]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com

[nodes]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_node_labels="{'role': 'node'}"
OpenShift-151.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-151.lab.sjc.Redhat.com openshift_hostname=OpenShift-151.lab.sjc.Redhat.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"

[etcd]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com

[nfs]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root
<--snip-->
Gan,

What was the process followed in comment #15? Was that a clean install, or was config.yml run against an existing cluster where the hostname had an upper case letter?
(In reply to Scott Dodson from comment #17)
> Gan,
>
> What was the process followed in #15? Was that a clean install or running
> config.yml against an existing cluster, one where the hostname had an upper
> case letter?

It was a fresh installation. Steps:
1. Spin up the instances to be installed.
2. Assemble the inventory host file, modifying `openShift-126.lab.sjc.redhat.com` to `OpenShift-126.lab.sjc.Redhat.com` so that it contains upper case letters.
3. Trigger the installation.

Just a note, as in comment 16: it succeeded in OCP 3.6. Might be related to https://github.com/kubernetes/kubernetes/issues/47695#issuecomment-315245352, which could only be reproduced in 3.7.
In short, due to https://tools.ietf.org/html/rfc4343 (DNS names are case-insensitive), we likely need to put in some additional test cases to ensure that this does not happen again.
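One possible shape for such a test, as an Ansible assertion (a minimal sketch only, assuming the openshift.common.hostname / openshift.common.public_hostname facts referenced elsewhere in this bug; this is not an actual task from the repository):

  - name: Verify that detected hostnames are lowercase
    assert:
      that:
        - openshift.common.hostname == openshift.common.hostname | lower
        - openshift.common.public_hostname == openshift.common.public_hostname | lower
      msg: "Hostnames must be lowercase; kubernetes normalizes node names to lower case."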
For immediate relief they can disable the NodeRestriction plugin by adding the NodeRestriction stanza to their admissionConfig and restarting their api servers:

admissionConfig:
  pluginConfig:
    NodeRestriction:
      configuration:
        kind: DefaultAdmissionConfig
        apiVersion: v1
        disable: true

Once that's done nodes should start becoming ready again.

After that, we can regenerate the certificates for the cluster. The 3.7 playbooks will correctly generate certificates using a lowercase CN; however, there's a bug where those playbooks do not correctly update the node config to point at the newly minted kubeconfig, and that will have to be corrected before re-enabling the admission plugin.

Run the playbook:

# ansible-playbook -i hosts playbooks/byo/openshift-cluster/redeploy-node-certificates.yml

On each node update the node config to point at the new kubeconfig, for example:

# grep masterKubeConfig /etc/origin/node/node-config.yaml
masterKubeConfig: system:node:OSE3-NODE1.example.com.kubeconfig

change that to

masterKubeConfig: system:node:ose3-node1.example.com.kubeconfig

Restart the node service. Once all nodes have been updated to use the new kubeconfig you may comment out the NodeRestriction admission plugin configuration and restart the api servers.
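A quick way to apply that node-config edit on each node (a sketch only; the hostnames are the example values above, so substitute your own, and the service name is taken from the node logs earlier in this bug):

  # Back up the node config, then point masterKubeConfig at the lowercase kubeconfig
  cp /etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml.bak
  sed -i 's/system:node:OSE3-NODE1.example.com.kubeconfig/system:node:ose3-node1.example.com.kubeconfig/' /etc/origin/node/node-config.yaml
  # Restart the node service so it picks up the new kubeconfig
  systemctl restart atomic-openshift-node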
The workaround in the solution is only relevant to the 3.6 to 3.7 upgrade, where the NodeRestriction admission plugin blocks updates. If this is happening on 3.5 to 3.6 upgrades, that's something else we need to investigate.
Yes, the upgrade is from 3.5 to 3.6. The upgrade itself is not failing, but openshift_upgrade_nodes_label doesn't take into account the label that we define.
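For reference, the label-scoped node upgrade is normally invoked along these lines (illustrative only; the playbook path, inventory name, and label value are examples, not taken from this report):

  ansible-playbook -i hosts playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade_nodes.yml \
    -e openshift_upgrade_nodes_label="role=node" \
    -e openshift_upgrade_nodes_serial="1"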
Jaspreet, sure, if that fixes it for your customer then let's go ahead with that change.

https://github.com/openshift/openshift-ansible/pull/6812
The fix is available in openshift-ansible-3.7.27-1.git.0.ae95fc3.el7
Tested in openshift-ansible-3.7.27-1.git.0.ae95fc3.el7.noarch.rpm

A fresh installation with a hostname containing a capital letter succeeded.

Wrt the upgrade from v3.6 to v3.7, the test result is the same as comment 28 and comment 31, and I double checked that the workaround (comment 31) works for QE too.

Vadim, if we don't intend to fix the issue via the upgrade playbook, I think we'd better document the workaround as a known issue. WDYT?
(In reply to Gan Huang from comment #43)
> Wrt the upgrade from v3.6 to v3.7, the test result is the same as comment 28
> and comment 31, and double checked that the workaround (comment 31) works
> for QE too.

Excellent.

> Vadim, if we don't intend to fix the issue via the upgrade playbook, I think
> we'd better to document the workaround as a known issue. WDYT?

I'll cherry-pick the PRs to release-3.6 and check if they still apply. If for some reason that won't work, we'll settle for the workaround from comment 31.

Moving back to MODIFIED.
Created https://github.com/openshift/openshift-ansible/pull/6939 (for 3.5) and https://github.com/openshift/openshift-ansible/pull/6940 (for 3.6), but I'm a bit stuck during verification.