Bug 1538616 - REGRESSION: Template Service Broker does no longer get installed on 3.7.23
Summary: REGRESSION: Template Service Broker does no longer get installed on 3.7.23
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.7.z
Assignee: Vadim Rutkovsky
QA Contact: sheng.lao
URL:
Whiteboard:
Depends On: 1601378 1603611 1603612
Blocks: 1599905
Reported: 2018-01-25 12:51 UTC by Wolfgang Kulhanek
Modified: 2018-08-27 20:43 UTC (History)
19 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1599905 (view as bug list)
Environment:
Last Closed: 2018-08-27 20:43:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Hosts file used to install this cluster (7.51 KB, text/plain)
2018-01-25 12:51 UTC, Wolfgang Kulhanek

Description Wolfgang Kulhanek 2018-01-25 12:51:52 UTC
Created attachment 1386052
Hosts file used to install this cluster

Description of problem:

As of 3.7.23, the installer consistently fails to install the Template Service Broker (TSB). It copies the TSB templates to the master but does not appear to actually create the resources from them.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1. Use the attached hosts file
2. Run the advanced installer
3. Boom

Actual results:
TASK [template_service_broker : copy] ******************************************************************************************************
changed: [master3.35ca.internal] => (item=apiserver-template.yaml)
changed: [master3.35ca.internal] => (item=rbac-template.yaml)
changed: [master3.35ca.internal] => (item=template-service-broker-registration.yaml)
changed: [master3.35ca.internal] => (item=apiserver-config.yaml)

TASK [template_service_broker : yedit] *****************************************************************************************************
ok: [master3.35ca.internal]

TASK [template_service_broker : slurp] *****************************************************************************************************
ok: [master3.35ca.internal]

TASK [template_service_broker : Apply template file] ***************************************************************************************
changed: [master3.35ca.internal]

TASK [template_service_broker : Reconcile with RBAC file] **********************************************************************************
changed: [master3.35ca.internal]

TASK [template_service_broker : Verify that TSB is running] ********************************************************************************
FAILED - RETRYING: Verify that TSB is running (120 retries left).
FAILED - RETRYING: Verify that TSB is running (119 retries left).
[...]
FAILED - RETRYING: Verify that TSB is running (2 retries left).
FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [master3.35ca.internal]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.011151", "end": "2018-01-25 12:06:14.166267", "failed": true, "msg": "non-zero return code", "rc": 7, "start": "2018-01-25 12:06:13.155116", "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", "                                 Dload  Upload   Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0", "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}


On further investigation the only thing that is created in the openshift-template-service-broker project is the service:

[root@master1 ~]# oc get all -n openshift-template-service-broker
NAME            CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
svc/apiserver   172.30.156.63   <none>        443/TCP   48m

Expected results:
No failure and TSB running.




Comment 1 Scott Dodson 2018-01-25 13:53:10 UTC
Wolfgang,

An error was corrected: previously the TSB was deployed to all nodes because no node selector was being set. The behavior now is to deploy to your infrastructure nodes, which by default are those labeled with `region=infra`. Do you have any nodes labeled in that manner? This is in line with how the router and registry are deployed, and the TSB will inherit the node selector you've set via the variable `openshift_hosted_infra_selector`.

Comment 2 Wolfgang Kulhanek 2018-01-25 15:00:06 UTC
Scott,

Ah - so that was a bug! I always clear the default node selector (env=app) from the TSB project (well, really I set it to empty to prevent it from picking up the default) so that it would deploy on all nodes.

So it's only supposed to deploy on Infranodes. OK. Good to know.

We don't use "region=infra" but "env=infra". Is "region=infra" the expected convention these days? I seem to remember that the concept of an "Infranode" was always more of a convention than anything that was officially documented.

I had not seen the variable openshift_hosted_infra_selector before. Is this a new catch all for router/registry/logging components/metrics components/TSB/etc?

I just ran the playbook again with
  openshift_hosted_infra_selector='env=infra'

And it failed as well. The apiserver DaemonSet still had 'region=infra' in it. So something is still off.

 

One of my colleagues meanwhile figured out that
  template_service_broker_selector={"env":"infra"}
seems to work...

I do see in the errata on docs.openshift.com now that this TSB fix is mentioned. But it doesn't mention how to set it up. So I think even if I had seen that (errata weren't live two days ago when 3.7.23 shipped) I would have completely missed it.

Comment 3 Wolfgang Kulhanek 2018-01-25 15:01:17 UTC
Running again with
  openshift_hosted_infra_selector={"env": "infra"}
to see if that makes a difference...

Comment 4 Wolfgang Kulhanek 2018-01-25 15:22:59 UTC
Nope. That didn't do it either. So openshift_hosted_infra_selector is not the answer but it appears that

template_service_broker_selector={"env":"infra"}

is working.

Comment 5 Scott Dodson 2018-01-25 18:20:29 UTC
Yeah, looks like that's correct.

I was on master branch when I looked that code up.

In the context of this bug we'll fix the defaulting to work as I suggested and we'll make sure that we document both `openshift_hosted_infra_selector` and `template_service_broker_selector`.

Comment 6 Scott Dodson 2018-01-25 19:59:30 UTC
Summarizing:

In OCP 3.7 GA the TSB incorrectly deployed to all nodes.

In 3.7.23 the code was updated to deploy to nodes that match the undocumented variable 'template_service_broker_selector', which defaults to '{"region":"infra"}'.

A workaround is to set template_service_broker_selector to a label which matches your infra nodes, e.g.: template_service_broker_selector={"env":"infra"}
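
To illustrate why the DaemonSet ends up with no pods in this situation: a Kubernetes node selector only matches nodes that carry every listed key/value pair as labels. The following is a minimal Python sketch of that rule, not installer code. The node names and the helpers `node_matches` and `eligible` are hypothetical; the labels mirror the env=infra / env=app convention mentioned in comment 2.

```python
def node_matches(node_selector, node_labels):
    """Return True if every key/value pair of the selector is among the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical cluster labeled with env=... instead of region=...
nodes = {
    "infranode1": {"env": "infra"},
    "node1": {"env": "app"},
}


def eligible(selector):
    """List the node names a pod with this node selector could be scheduled on."""
    return sorted(name for name, labels in nodes.items() if node_matches(selector, labels))


print(eligible({"region": "infra"}))  # 3.7.23 default selector: prints []
print(eligible({"env": "infra"}))     # workaround selector: prints ['infranode1']
```

With the default selector no node qualifies, so only the service exists and the health check is refused; with the workaround selector the infra node qualifies.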

Comment 11 Vadim Rutkovsky 2018-06-21 14:25:24 UTC
Created https://github.com/openshift/openshift-ansible/pull/8896 to document nodeselectors for hosted services and TSB in particular

Comment 13 sheng.lao 2018-06-29 11:42:53 UTC
1. Documentation for the two parameters, template_service_broker_selector and openshift_hosted_infra_selector, is not present and cannot be found by searching these websites:
   1) https://docs.openshift.org/3.7
   2) https://docs.openshift.com/container-platform/3.7

2. Verifying the 'openshift_hosted_infra_selector' option
   1) Versions of playbooks:
     a) openshift-ansible-playbooks.noarch 3.7.56-1.git.31.91ec9c5.el7
     b) openshift-ansible-playbooks-3.7.23-1.git.0.bc406aa.el7.noarch.rpm

   2) Values of the configurable options in the inventory:
     [OSEv3:vars]
     openshift_hosted_infra_selector="env=infra"

     [nodes]
     qe-shlao-yyyyyy.com openshift_node_labels="{'role': 'node', 'env' : 'infra'}"

   3) Result: Failed. The output messages:

TASK [template_service_broker : Verify that TSB is running] **************************************************************************************
FAILED - RETRYING: Verify that TSB is running (120 retries left).
... ...
FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [qe-shlao-yyyyyy.com]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.033899", "end": "2018-06-29 07:11:33.032692", "failed": true, "msg": "non-zero return code", "rc": 7, "start": "2018-06-29 07:11:31.998793", "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", "                                 Dload  Upload   Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0", "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

Comment 14 Vadim Rutkovsky 2018-06-29 12:00:08 UTC
(In reply to sheng.lao from comment #13)
> 1. Documents about the two parameters : template_service_broker_selector and
> openshift_hosted_infra_selector, not present and can't search on the
> websites: 
>    1)、https://docs.openshift.org/3.7
>    2)、https://docs.openshift.com/container-platform/3.7

Documentation needs to be updated in a separate bug.

> 2. Verify the 'openshift_hosted_infra_selector' option 
>    3)result: Failed, the output messages:
> 
> TASK [template_service_broker : Verify that TSB is running]
> *****************************************************************************
> *********
> FAILED - RETRYING: Verify that TSB is running (120 retries left).
> ... ...
> FAILED - RETRYING: Verify that TSB is running (1 retries left).
> fatal: [qe-shlao-yyyyyy.com]: FAILED! => {"attempts": 120, "changed": false,
> "cmd": ["curl", "-k",
> "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta":
> "0:00:01.033899", "end": "2018-06-29 07:11:33.032692", "failed": true,
> "msg": "non-zero return code", "rc": 7, "start": "2018-06-29
> 07:11:31.998793", "stderr": "  % Total    % Received % Xferd  Average Speed 
> Time    Time     Time  Current\n                                 Dload 
> Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0    
> 0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0  
> 0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to
> apiserver.openshift-template-service-broker.svc:443; Connection refused",
> "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time   
> Time     Time  Current", "                                 Dload  Upload  
> Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0  
> 0 --:--:-- --:--:-- --:--:--     0", "  0     0    0     0    0     0      0
> 0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to
> apiserver.openshift-template-service-broker.svc:443; Connection refused"],
> "stdout": "", "stdout_lines": []}
>         to retry, use: --limit
> @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

ASB may not be up for various reasons, e.g. etcd cannot mount its PV because the PVC didn't mount, and so on. Did the pods get the correct nodeselector?

Comment 27 Gan Huang 2018-07-02 07:28:09 UTC
Vadim,

Based on comment 8, we need two PRs to fix the bug: one for the 3.7 documentation, and another for openshift-ansible (so that template_service_broker_selector defaults to openshift_hosted_infra_selector).

https://github.com/openshift/openshift-ansible/pull/8896 didn't help at all because it lives in upstream openshift-ansible documentation only.

We can't find any related PRs for the bug yet. It would be extremely helpful if you could post the PRs here.


Thank you
Gan Huang

Comment 28 Vadim Rutkovsky 2018-07-09 10:29:49 UTC
Right, the default nodeselectors are not set to infra; I had assumed this was already implemented.

Created PR https://github.com/openshift/openshift-ansible/pull/9106 to fix this

Comment 29 Eric Rich 2018-07-10 21:37:07 UTC
(In reply to Gan Huang from comment #27)
> Vadim,
> 
> Based on comment 8, we need two PRs to fix the bug, one is for 3.7
> documentation, 

Created https://bugzilla.redhat.com/show_bug.cgi?id=1599905 to track this. 
> another is for openshift-ansible (template_service_broker_selector defaults to
> openshift_hosted_infra_selector).
> 
> https://github.com/openshift/openshift-ansible/pull/8896 didn't help at all
> because it lives in upstream openshift-ansible documentation only.

Comment 30 Vadim Rutkovsky 2018-07-17 08:12:37 UTC
Fix is available in openshift-ansible-3.7.58-1

This only updates the default TSB nodeselector, so if the issue is still reproducible, please attach the inventory and playbook logs (or just the link to the Jenkins job).
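
The defaulting that this fix is meant to provide (per comment 27, template_service_broker_selector falls back to openshift_hosted_infra_selector) can be modeled roughly as below. This is an illustrative Python sketch, not the openshift-ansible implementation: the function names are hypothetical, and the final fallback to region=infra when neither variable is set is an assumption based on comments 1 and 6.

```python
def parse_selector(value):
    """Accept a dict such as {"env": "infra"} or a "key=value[,key=value]" string."""
    if isinstance(value, dict):
        return dict(value)
    pairs = (pair.strip() for pair in value.split(","))
    return dict(pair.split("=", 1) for pair in pairs if pair)


def effective_tsb_selector(inventory_vars):
    """Hypothetical model: explicit TSB selector first, then the hosted infra selector."""
    for name in ("template_service_broker_selector", "openshift_hosted_infra_selector"):
        if name in inventory_vars:
            return parse_selector(inventory_vars[name])
    # Assumed fallback when neither variable is set in the inventory.
    return {"region": "infra"}


# Only the infra selector is set in the inventory
print(effective_tsb_selector({"openshift_hosted_infra_selector": "env=infra"}))
# prints {'env': 'infra'}
```

Under this model an explicitly set template_service_broker_selector still takes precedence, so inventories that already use the workaround keep behaving the same way.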

Comment 33 sheng.lao 2018-07-19 10:30:16 UTC
The TSB bug is fixed in openshift-ansible-3.7.58. I will change the status after the errata has dropped the item:
 REGRESSION: Template Service Broker does no longer get installed on 3.7.23  Regression
 (the released version, openshift-ansible-3.7.57-1.git.33.cf01e48.el7, does not fix the bug)

 Besides, I checked ASB and found that it does not seem to use the variable openshift_hosted_infra_selector.

Comment 34 Vadim Rutkovsky 2018-07-19 11:14:54 UTC
> Besides, I check ASB and find : it seems that ASB not use the variable openshift_hosted_infra_selector.

Correct, in 3.7 ASB would run on the first master and apply the label to it. This is consistent with 3.9 and 3.10, where ASB runs on masters, so the infra selector is not used.

Comment 36 sheng.lao 2018-07-23 01:43:03 UTC
1. Version verified: openshift-ansible-3.7.58-1.git.37.6db1e6f.el7.noarch.rpm

2. Excerpt of the inventory:
[OSEv3:vars]
openshift_hosted_infra_selector="env=infra"

[nodes]
XXXX  openshift_node_labels="{... , 'env':'infra'}" openshift_schedulable=true

3. Result: Passed
  1) Installation succeeded
  2) oc get ds  -n openshift-template-service-broker
  NAME        DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
  apiserver   1         1         1         1            1           env=infra       48m

