Bug 1396350
| Field | Value |
|---|---|
| Summary | Install fails when host name has capital letter assigned to it |
| Product | OpenShift Container Platform |
| Component | Installer |
| Status | CLOSED CURRENTRELEASE |
| Severity | low |
| Priority | high |
| Version | 3.4.0 |
| Target Release | 3.7.z |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Gan Huang <ghuang> |
| Assignee | Vadim Rutkovsky <vrutkovs> |
| QA Contact | Johnny Liu <jialiu> |
| CC | aos-bugs, clichybi, cshereme, dmoessne, erich, jkaur, jokerman, mmagnani, mmccomas, pdwyer, sdodson, simonas, vlaad, vrutkovs, wsun |
| Keywords | NeedsTestCase, Regression |
| Doc Type | Bug Fix |
| Doc Text | OpenShift requires that hostnames conform to standards which preclude the use of upper case letters. Now the installer ensures that the hostnames for node objects are created with lowercase letters. |
| Clone Of | 1258243 |
| Clones | 1543748, 1543749 (view as bug list) |
| Last Closed | 2018-08-28 12:27:34 UTC |
| Type | Bug |
| Bug Blocks | 1543748, 1543749 |
Description (Gan Huang, 2016-11-18 06:57:10 UTC)
This issue has been fixed in https://github.com/openshift/openshift-ansible/pull/2835:

```diff
  - name: Wait for Node Registration
    command: >
-     {{ openshift.common.client_binary }} get node {{ hostvars[item].openshift.common.hostname }}
+     {{ openshift.common.client_binary }} get node {{ openshift.common.hostname | lower }}
      --config={{ openshift_manage_node_kubeconfig }}
      -n default
    register: omd_get_node
```

Could anyone help confirm it, then mark it "ON_QA" so that I can verify it directly?

OK, after setting my hostname to a mixed-case hostname using `hostnamectl`, I was able to reproduce the problem, and the proposed patch works for both install and upgrade scenarios.

Commit pushed to master at https://github.com/openshift/openshift-ansible
https://github.com/openshift/openshift-ansible/commit/16be0f2eb09b2e4c2b14864ba83195866479f917
Ensure that hostname is lowercase
Fixes Bug 1396350

Tested with openshift-ansible-3.7.0-0.146.0.git.0.3038a60.el7.noarch.rpm. Installation failed at:

```
RUNNING HANDLER [openshift_node : restart openvswitch pause] *******************
skipping: [OpenShift-151.lab.sjc.Redhat.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

RUNNING HANDLER [openshift_node : restart node] ********************************
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
fatal: [OpenShift-151.lab.sjc.Redhat.com]: FAILED! => {"attempts": 3, "changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
fatal: [OpenShift-126.lab.sjc.Redhat.com]: FAILED! => {"attempts": 3, "changed": false, "failed": true, "msg": "Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
```

Node logs:

```
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.304138 51318 kubelet_node_status.go:82] Attempting to register node openshift-126.lab.sjc.redhat.com
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.307431 51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientDisk' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasSufficientDisk
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.307953 51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasSufficientMemory' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasSufficientMemory
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: I1010 03:40:32.308366 51318 server.go:348] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"openshift-126.lab.sjc.redhat.com", UID:"openshift-126.lab.sjc.redhat.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeHasNoDiskPressure' Node openshift-126.lab.sjc.redhat.com status is now: NodeHasNoDiskPressure
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: E1010 03:40:32.309045 51318 kubelet_node_status.go:106] Unable to register node "openshift-126.lab.sjc.redhat.com" with API server: nodes "openshift-126.lab.sjc.redhat.com" is forbidden: node OpenShift-126.lab.sjc.Redhat.com cannot modify node openshift-126.lab.sjc.redhat.com
Oct 10 03:40:32 openshift-126.lab.sjc.redhat.com atomic-openshift-node[51318]: W1010 03:40:32.489249 51318 sdn_controller.go:48] Could not find an allocated subnet for node: openshift-126.lab.sjc.redhat.com, Waiting...
```

With the same configuration, we didn't encounter this issue with openshift-ansible-3.6.173.0.45-1.git.0.dc70c99.el7.noarch.rpm.

```
# cat inventory_host
<--snip-->
[masters]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com

[nodes]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_node_labels="{'role': 'node'}"
OpenShift-151.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-151.lab.sjc.Redhat.com openshift_hostname=OpenShift-151.lab.sjc.Redhat.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"

[etcd]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=OpenShift-126.lab.sjc.Redhat.com openshift_hostname=OpenShift-126.lab.sjc.Redhat.com

[nfs]
OpenShift-126.lab.sjc.Redhat.com ansible_user=root ansible_ssh_user=root
<--snip-->
```

Gan,

What was the process followed in #15? Was that a clean install, or running config.yml against an existing cluster, one where the hostname had an upper case letter?

(In reply to Scott Dodson from comment #17)
> Gan,
>
> What was the process followed in #15? Was that a clean install or running
> config.yml against an existing cluster, one where the hostname had an upper
> case letter?

It was a fresh installation. Steps:

1. Spin up the instances to be installed.
2. Assemble the inventory host file, changing `openShift-126.lab.sjc.redhat.com` to `OpenShift-126.lab.sjc.Redhat.com` so that it contains upper case letters.
3. Trigger the installation.

Just a note, as in comment 16: it succeeded in OCP 3.6. Might be related to https://github.com/kubernetes/kubernetes/issues/47695#issuecomment-315245352, which could only be reproduced in 3.7.

In short, due to https://tools.ietf.org/html/rfc4343 we likely need to put in some additional test cases to ensure that this does not happen again.

For immediate relief, they can disable the NodeRestriction plugin by adding the NodeRestriction stanza to their admissionConfig and restarting their API servers:

```yaml
admissionConfig:
  pluginConfig:
    NodeRestriction:
      configuration:
        kind: DefaultAdmissionConfig
        apiVersion: v1
        disable: true
```

Once that's done, nodes should start becoming ready again. After that, we can regenerate the certificates for the cluster. The 3.7 playbooks will correctly generate certificates using a lowercase CN; however, there's a bug where those playbooks do not correctly update the node config to point at the newly minted kubeconfig, and that will have to be corrected before re-enabling the admission plugin.

Run the playbook:

```
# ansible-playbook -i hosts playbooks/byo/openshift-cluster/redeploy-node-certificates.yml
```

On each node, update the node config to point at the new kubeconfig, for example:

```
# grep masterKubeConfig /etc/origin/node/node-config.yaml
masterKubeConfig: system:node:OSE3-NODE1.example.com.kubeconfig
```

Change that to:

```
masterKubeConfig: system:node:ose3-node1.example.com.kubeconfig
```

Restart the node service. Once all nodes have been updated to use the new kubeconfig, you may comment out the NodeRestriction admission plugin configuration and restart the API servers.

The workaround in the solution is only relevant to the 3.6 to 3.7 upgrade, where the NodeRestriction admission plugin blocks updates. If this is happening on 3.5 to 3.6 upgrades, that's something else we need to investigate.
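The `cannot modify node` error in the logs above comes down to a case-sensitive comparison between the kubelet's certificate identity (`system:node:<CN>`) and the name of the node object it tries to update. A minimal Python sketch of that comparison (illustrative only, not the actual NodeRestriction admission plugin code; the function name is made up):

```python
def node_restriction_allows(cert_user: str, target_node: str) -> bool:
    """Sketch of the NodeRestriction idea: a kubelet authenticated as
    system:node:<name> may only modify the node object named exactly
    <name>; the comparison is case-sensitive."""
    prefix = "system:node:"
    if not cert_user.startswith(prefix):
        return False  # not a node identity at all
    return cert_user[len(prefix):] == target_node

# Certificate minted with the mixed-case hostname vs. the lowercase
# node name the kubelet registers: the exact string comparison fails.
print(node_restriction_allows("system:node:OpenShift-126.lab.sjc.Redhat.com",
                              "openshift-126.lab.sjc.redhat.com"))  # False
print(node_restriction_allows("system:node:openshift-126.lab.sjc.redhat.com",
                              "openshift-126.lab.sjc.redhat.com"))  # True
```

This is why redeploying certificates with a lowercase CN (and pointing the node config at the new kubeconfig) resolves the mismatch.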
Yes, the upgrade is from 3.5 to 3.6, though the upgrade is not failing; rather, `openshift_upgrade_nodes_label` doesn't consider the label that we define.

Jaspreet, sure, if that fixes it for your customer then let's go ahead with that change. https://github.com/openshift/openshift-ansible/pull/6812

The fix is available in openshift-ansible-3.7.27-1.git.0.ae95fc3.el7.

Tested in openshift-ansible-3.7.27-1.git.0.ae95fc3.el7.noarch.rpm. A fresh installation with a hostname containing a capital letter succeeded.

With regard to the upgrade from v3.6 to v3.7, the test result is the same as comment 28 and comment 31, and we double-checked that the workaround (comment 31) works for QE too.

Vadim, if we don't intend to fix the issue via the upgrade playbook, I think we'd better document the workaround as a known issue. WDYT?

(In reply to Gan Huang from comment #43)
> Wrt the upgrade from v3.6 to v3.7, the test result is the same as comment 28
> and comment 31, and double checked that the workaround (comment 31) works
> for QE too.

Excellent

> Vadim, if we don't intend to fix the issue via the upgrade playbook, I think
> we'd better to document the workaround as a known issue. WDYT?

I'll cherry-pick the PRs to release-3.6 and check whether they still apply. If for some reason that won't work, we'll settle for the workaround from comment 31. Moving back to MODIFIED.

Created https://github.com/openshift/openshift-ansible/pull/6939 (for 3.5) and https://github.com/openshift/openshift-ansible/pull/6940 (for 3.6), but a bit stuck during verification.
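The verified fix ultimately amounts to the single `| lower` normalization applied in the installer's Jinja2 templates before a hostname is used as a node name. The same normalization can be sketched in Python (the helper name is illustrative; since DNS names are case-insensitive per RFC 4343, lowercasing loses no information):

```python
def normalize_hostname(name: str) -> str:
    # DNS is case-insensitive (RFC 4343), but Kubernetes node object
    # names must be lowercase, so normalize once before using the
    # hostname as a node name or certificate CN. Also drop the
    # trailing dot of a fully-qualified DNS name, if present.
    return name.strip().rstrip(".").lower()

print(normalize_hostname("OpenShift-126.lab.sjc.Redhat.com"))
# openshift-126.lab.sjc.redhat.com
```

Had the inventory hostnames from comment 15 passed through such a normalization everywhere (node registration, certificate CNs, kubeconfig names), the mixed-case and lowercase forms could never have diverged.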