Bug 1538445
| Summary: | Need to do some pre-check to make sure there are available nodes to deploy TSB and web console pods | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Weihua Meng <wmeng> |
| Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.9.0 | CC: | aos-bugs, dmoessne, jialiu, jokerman, mmccomas, wmeng |
| Target Milestone: | --- | ||
| Target Release: | 3.9.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-03-28 14:22:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Weihua Meng
2018-01-25 06:09:45 UTC
When the user sets neither template_service_broker_selector nor openshift_web_console_nodeselector in the inventory host file, and no node in the cluster is labeled "region=infra", the installation fails at the "Verify that TSB is running" or "Verify that the web console is running" task. This happens because the installer tries to deploy the TSB and web console pods onto "region=infra" nodes by default, but no such nodes exist in the cluster. When the installer exits, nothing tells the user what went wrong, and a lot of time has been wasted running the installation. As a user experience enhancement, openshift-ansible should run a pre-check early in the installation to make sure there are nodes available to deploy the TSB and web console.

Created https://github.com/openshift/openshift-ansible/pull/7022 to fix this. The PR won't check the web console node selector, as by default it's set to the k8s master and it might not be specified in the Ansible inventory.

A partial fix for this is available in openshift-ansible-3.9.0-0.45.0.git.0.05f6826.el7 - it only checks the labels set in the inventory, so the web console check is not enabled yet.

Not fixed.
openshift-ansible-3.9.0-0.45.0.git.0.05f6826.el7.noarch.rpm
The installation failed and exited after running for 21 minutes.
Steps:
Set a nonexistent node label for the template service broker in the inventory file:
template_service_broker_selector={"label123": "cannotfind"}
TASK [template_service_broker : Ensure that Template Service Broker has nodes to run on] ***
Thursday 22 February 2018 03:09:20 +0000 (0:00:00.066) 0:09:25.007 *****
skipping: [host-xxxx.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False"}
TASK [template_service_broker : Apply template file] ***************************
Thursday 22 February 2018 03:09:24 +0000 (0:00:00.186) 0:09:28.636 *****
changed: [host-xxxx.redhat.com] => {"changed": true, "cmd": "oc process --config=/tmp/tsb-ansible-ii4vC4/admin.kubeconfig -f \"/tmp/tsb-ansible-ii4vC4/apiserver-template.yaml\" --param API_SERVER_CONFIG=\"kind: TemplateServiceBrokerConfig\napiVersion: config.templateservicebroker.openshift.io/v1\ntemplateNamespaces:\n- openshift\n\" --param IMAGE=\"registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.0\" --param NODE_SELECTOR='{\"label123\": \"cannotfind\"}' | oc apply --config=/tmp/tsb-ansible-ii4vC4/admin.kubeconfig -f -", "delta": "0:00:00.555409", "end": "2018-02-21 22:09:25.039356", "rc": 0, "start": "2018-02-21 22:09:24.483947", "stderr": "", "stderr_lines": [], "stdout": "daemonset \"apiserver\" created\nconfigmap \"apiserver-config\" created\nserviceaccount \"apiserver\" created\nservice \"apiserver\" created\nserviceaccount \"templateservicebroker-client\" created\nsecret \"templateservicebroker-client\" created", "stdout_lines": ["daemonset \"apiserver\" created", "configmap \"apiserver-config\" created", "serviceaccount \"apiserver\" created", "service \"apiserver\" created", "serviceaccount \"templateservicebroker-client\" created", "secret \"templateservicebroker-client\" created"]}
TASK [template_service_broker : Verify that TSB is running] ********************
Thursday 22 February 2018 03:09:25 +0000 (0:00:00.809) 0:09:30.194 *****
FAILED - RETRYING: Verify that TSB is running (60 retries left).
FAILED - RETRYING: Verify that TSB is running (59 retries left).
...
FAILED - RETRYING: Verify that TSB is running (2 retries left).
FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [host-xxxx.redhat.com]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.024979", "end": "2018-02-21 22:15:35.435460", "msg": "non-zero return code", "rc": 7, "start": "2018-02-21 22:15:34.410481", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}
# oc get all -n openshift-template-service-broker
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/apiserver 0 0 0 0 0 label123=cannotfind 17m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/apiserver ClusterIP 172.30.251.209 <none> 443/TCP 17m
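The "Verify that TSB is running" task in the log above polls the health endpoint until it answers or its 60 attempts are used up, which is why a selector that can never match costs several minutes before the failure surfaces. A minimal Python sketch of that retry pattern follows. It is an illustration only, not the installer's implementation: `wait_until_healthy` and `probe` are hypothetical names, and the delay between attempts is an assumption because the log does not show it.

```python
import time


def wait_until_healthy(probe, retries=60, delay=0.0, sleep=time.sleep):
    """Call probe() up to `retries` times and return the attempt that succeeded.

    `probe` is any callable returning True when the service is healthy.
    `delay` (seconds between attempts) is an assumed parameter.
    Raises TimeoutError when every attempt fails.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(delay)  # no need to wait after the final attempt
    raise TimeoutError(f"health check failed after {retries} attempts")


# Hypothetical probe that succeeds on the third call, standing in for the curl check.
outcomes = iter([False, False, True])
print(wait_until_healthy(lambda: next(outcomes), retries=5))
```

With a selector that matches no node, the probe never succeeds, so the loop always runs to exhaustion. That is the cost a pre-check before the deployment step avoids.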
Would you mind trying that on openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7? The fix in https://github.com/openshift/openshift-ansible/pull/7220 might have fixed it.

It failed after running for 15 mins with 1 master + 1 node; it could take longer for an HA cluster.
Can we have an early notice/failure for that?
openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7.noarch.rpm
TASK [template_service_broker : Ensure that Template Service Broker has nodes to run on] ***
Thursday 22 February 2018 12:33:44 +0000 (0:00:00.065) 0:10:04.752 *****
fatal: [host-xxxx.redhat.com]: FAILED! => {"changed": false, "msg": "No schedulable nodes found matching node selector for Template Service Broker - '{'label123': 'cannotfind'}'"}
PLAY RECAP *********************************************************************
host-xxxx.redhat.com : ok=626 changed=249 unreachable=0 failed=1
host-xxxx.redhat.com : ok=149 changed=52 unreachable=0 failed=0
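The fail-fast behaviour shown in this log can be sketched in Python as a label-subset check over the schedulable nodes. This is a rough illustration under stated assumptions, not the code from the pull requests: the function names and the node dictionaries (with hypothetical keys "name", "labels" and "schedulable") are invented for the example, and only the error message wording is taken from the log.

```python
def node_matches(labels, selector):
    """Return True if every key/value pair in selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ensure_schedulable_nodes(nodes, selector, component):
    """Return names of schedulable nodes matching selector, or fail immediately.

    `nodes` is a list of dicts with hypothetical keys "name", "labels"
    and "schedulable"; a real check would read these from the cluster.
    """
    matching = [
        node["name"]
        for node in nodes
        if node.get("schedulable", True)
        and node_matches(node.get("labels", {}), selector)
    ]
    if not matching:
        # Message format mirrors the one in the log above.
        raise RuntimeError(
            "No schedulable nodes found matching node selector for "
            f"{component} - '{selector}'"
        )
    return matching


# Hypothetical cluster: one unschedulable infra node and one schedulable node.
nodes = [
    {"name": "node-a", "labels": {"region": "infra"}, "schedulable": False},
    {"name": "node-b", "labels": {"region": "primary"}, "schedulable": True},
]

print(ensure_schedulable_nodes(nodes, {"region": "primary"}, "Template Service Broker"))
```

Calling it with `{"label123": "cannotfind"}` raises the error instead of returning a list. Note that an empty selector matches every node in this sketch, so it only guards against selectors that are set but unsatisfiable.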
(In reply to Weihua Meng from comment #7)
> failed after running 15 mins for 1 master + 1 node, could be longer for ha
> cluster.
> can we have an early notice/failure for that?

Unfortunately I don't think we can do this earlier:
* component playbooks can be run independently - if we embed the check in playbooks/deploy_cluster.yml it won't show when just playbooks/openshift-service-catalog/config.yml is being run
* existing node labels are being set later on, so we can't fully predict which labels would be set

This is a tradeoff of course, we'll try to come up with better checks later on though I don't think we can do any better before 3.9.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489