Description of problem:
The installer should validate node labels against the configured node selectors up front, otherwise installation may fail only after running for dozens of minutes. Some components, such as the template service broker, use the default node selector "region=infra"; if no node carries that label, installation fails. Running the installation for dozens of minutes just to get a failure is quite frustrating, especially for customers.

Version-Release number of the following components:

How reproducible:

Steps to Reproduce:
1.

Actual results:
Installation runs for dozens of minutes and then fails, with no indication of the node selector problem.

Expected results:
Installation fails at an early stage and reports the problem.
When the customer does not set template_service_broker_selector and openshift_web_console_nodeselector in the inventory host file, and no node in the cluster carries the "region=infra" label, the installation exits with a failure at the "Verify that TSB is running" or "Verify that the web console is running" task. This happens because the installer tries to schedule the TSB and web console pods onto "region=infra" nodes by default, but no such nodes exist in the cluster. When the installer exits, there is no message telling the user what went wrong, and a lot of time has been wasted on the installation run. As a user-experience enhancement, openshift-ansible should run a pre-check before installation to make sure there are nodes available to host the TSB and web console, and fail early if there are not.
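As a workaround until such a pre-check exists, the cluster admin can either label a node to match the default selector, or point the selectors at labels the nodes actually have. A minimal sketch (the inventory path and node name below are hypothetical placeholders, not from this report):

```shell
# Hypothetical inventory path - adjust to your environment.
INVENTORY=./hosts

# Option 1: make an existing node match the default selector
# (requires a running cluster, so it is commented out here):
# oc label node infra-node-1.example.com region=infra

# Option 2: set the selectors in the inventory to labels your nodes carry.
# Shown as an append for illustration; in a real inventory these lines
# belong in the existing [OSEv3:vars] section.
cat >> "$INVENTORY" <<'EOF'
[OSEv3:vars]
template_service_broker_selector={"region": "infra"}
openshift_web_console_nodeselector={"region": "infra"}
EOF

grep 'selector' "$INVENTORY"
```

Either way, the point is that the selector must agree with the labels on at least one schedulable node before the install starts.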
Created https://github.com/openshift/openshift-ansible/pull/7022 to fix this. The PR does not check the web console node selector, as by default it is set to the k8s masters and it might not be specified in the Ansible inventory.
A partial fix for this is available in openshift-ansible-3.9.0-0.45.0.git.0.05f6826.el7 - it only checks the labels set in the inventory, so the web console check is not enabled yet.
Not fixed.
openshift-ansible-3.9.0-0.45.0.git.0.05f6826.el7.noarch.rpm

Installation failed and exited after running 21 mins.

Steps: set a non-existent node label for the template service broker in the inventory file:
template_service_broker_selector={"label123": "cannotfind"}

TASK [template_service_broker : Ensure that Template Service Broker has nodes to run on] ***
Thursday 22 February 2018 03:09:20 +0000 (0:00:00.066) 0:09:25.007 *****
skipping: [host-xxxx.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [template_service_broker : Apply template file] ***************************
Thursday 22 February 2018 03:09:24 +0000 (0:00:00.186) 0:09:28.636 *****
changed: [host-xxxx.redhat.com] => {"changed": true, "cmd": "oc process --config=/tmp/tsb-ansible-ii4vC4/admin.kubeconfig -f \"/tmp/tsb-ansible-ii4vC4/apiserver-template.yaml\" --param API_SERVER_CONFIG=\"kind: TemplateServiceBrokerConfig\napiVersion: config.templateservicebroker.openshift.io/v1\ntemplateNamespaces:\n- openshift\n\" --param IMAGE=\"registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.0\" --param NODE_SELECTOR='{\"label123\": \"cannotfind\"}' | oc apply --config=/tmp/tsb-ansible-ii4vC4/admin.kubeconfig -f -", "delta": "0:00:00.555409", "end": "2018-02-21 22:09:25.039356", "rc": 0, "start": "2018-02-21 22:09:24.483947", "stderr": "", "stderr_lines": [], "stdout": "daemonset \"apiserver\" created\nconfigmap \"apiserver-config\" created\nserviceaccount \"apiserver\" created\nservice \"apiserver\" created\nserviceaccount \"templateservicebroker-client\" created\nsecret \"templateservicebroker-client\" created", "stdout_lines": ["daemonset \"apiserver\" created", "configmap \"apiserver-config\" created", "serviceaccount \"apiserver\" created", "service \"apiserver\" created", "serviceaccount \"templateservicebroker-client\" created", "secret \"templateservicebroker-client\" created"]}

TASK [template_service_broker : Verify that TSB is running] ********************
Thursday 22 February 2018 03:09:25 +0000 (0:00:00.809) 0:09:30.194 *****
FAILED - RETRYING: Verify that TSB is running (60 retries left).
FAILED - RETRYING: Verify that TSB is running (59 retries left).
...
FAILED - RETRYING: Verify that TSB is running (2 retries left).
FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [host-xxxx.redhat.com]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.024979", "end": "2018-02-21 22:15:35.435460", "msg": "non-zero return code", "rc": 7, "start": "2018-02-21 22:15:34.410481", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}

# oc get all -n openshift-template-service-broker
NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR         AGE
ds/apiserver   0         0         0       0            0           label123=cannotfind   17m

NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
svc/apiserver   ClusterIP   172.30.251.209   <none>        443/TCP   17m
Would you mind trying that on openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7? The fix in https://github.com/openshift/openshift-ansible/pull/7220 might have fixed it.
Failed after running 15 mins for 1 master + 1 node; it could take longer for an HA cluster. Can we have an early notice/failure for that?

openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7.noarch.rpm

TASK [template_service_broker : Ensure that Template Service Broker has nodes to run on] ***
Thursday 22 February 2018 12:33:44 +0000 (0:00:00.065) 0:10:04.752 *****
fatal: [host-xxxx.redhat.com]: FAILED! => {"changed": false, "msg": "No schedulable nodes found matching node selector for Template Service Broker - '{'label123': 'cannotfind'}'"}

PLAY RECAP *********************************************************************
host-xxxx.redhat.com : ok=626 changed=249 unreachable=0 failed=1
host-xxxx.redhat.com : ok=149 changed=52 unreachable=0 failed=0
(In reply to Weihua Meng from comment #7)
> failed after running 15 mins for 1 master + 1 node, could be longer for ha
> cluster.
> can we have an early notice/failure for that?

Unfortunately I don't think we can do this any earlier:
* component playbooks can be run independently - if we embed the check in playbooks/deploy_cluster.yml, it won't run when just playbooks/openshift-service-catalog/config.yml is being executed
* node labels are applied later in the install, so we can't fully predict which labels will end up set

This is a tradeoff, of course; we'll try to come up with better checks later on.
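For reference, the shape of the component-level check is simple: before deploying, confirm that at least one schedulable node carries the label required by the selector, and fail fast otherwise. A simplified, hypothetical shell sketch (not the actual openshift-ansible task; the node label data below is fabricated for illustration):

```shell
# Selector required by the component (default for TSB).
selector="region=infra"

# In a real cluster this list would come from:
#   oc get nodes --show-labels
# Here it is a hard-coded stand-in with one matching node.
node_labels="region=infra,zone=default
role=compute,zone=east"

# Fail fast if no node carries the required label.
if echo "$node_labels" | grep -qw "$selector"; then
  echo "selector $selector matches at least one node"
else
  echo "ERROR: no nodes match selector $selector" >&2
  exit 1
fi
```

Running the check at the start of the component playbook gives the early failure requested in comment #7, at the cost of duplicating it in every independently runnable playbook.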
I don't think we can do any better before 3.9.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489