Bug 1761612

Summary: The GUI installer is unable to override the prometheus port setting of ceph-ansible
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Paul Cuzner <pcuzner>
Component: Ceph-Ansible Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA QA Contact: Ameena Suhani S H <amsyedha>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: aschoen, ceph-eng-bugs, ceph-qe-bugs, dsavinea, gabrioux, gmeno, kdreyer, nthomas, tchandra, tserlin, vashastr, ykaul
Target Milestone: rc   
Target Release: 4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-ansible-4.0.3-1.el8cp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-01-31 12:47:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1787068    
Bug Blocks:    
Attachments:
Description Flags
inventory file none
group vars - all.yml none

Description Paul Cuzner 2019-10-14 21:27:41 UTC
Description of problem:
If the GUI detects that the target for the metrics is the same as the host being used for the installation, it attempts to define prometheus_port = 9095 in all.yml, to avoid a port conflict between prometheus and the cockpit UI.

This override no longer works.
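The decision described above can be sketched roughly as follows. This is only an illustration, not the cockpit-ceph-installer code: the function name `metrics_overrides` and its arguments are hypothetical, and only the variable name `prometheus_port` and the value 9095 come from this report.

```python
def metrics_overrides(installer_host, metrics_host, override_port=9095):
    """Return extra all.yml variables for the metrics host (hypothetical helper).

    When the metrics target is the same machine that runs the installer
    (and therefore the cockpit UI), prometheus is moved to another port
    to avoid a clash. Otherwise no override is needed.
    """
    same_host = installer_host.strip().lower() == metrics_host.strip().lower()
    if same_host:
        return {"prometheus_port": override_port}
    return {}


# Metrics on the installer itself: an override is produced.
print(metrics_overrides("rhcs4-installer", "rhcs4-installer"))
# Metrics on a different host: nothing to override.
print(metrics_overrides("rhcs4-installer", "rhcs4-aio"))
```

Comparing hostnames as plain strings is a simplification; a real check would probably also need to compare resolved addresses.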

Version-Release number of selected component (if applicable):
cockpit-ceph-installer-0.9-1
ceph-ansible-4.0.0-0.1.rc16.el8cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. install rhcs, with the metrics host the same as the installation host

Actual results:
prometheus fails to start
dashboard settings are incorrect
dashboard/grafana integration fails - shows with no data or errors
grafana datasource is incorrect


Expected results:
the prometheus configuration should work correctly

Additional info:
upstream issue raised - https://github.com/ceph/ceph-ansible/issues/4601

Comment 1 RHEL Program Management 2019-10-14 21:27:48 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Paul Cuzner 2019-10-15 21:15:12 UTC
By embedding the tasks from dashboard.yml into site.yml, the override for prometheus_port works - from the CLI (ansible-playbook site.yml) and ansible-runner (GUI)

In addition, I'm also seeing the grafana_server_addr not being set as expected

In my environment I have an installer machine, where ansible is running from, which is also the target for the grafana and prometheus containers. At the end of the play the containers are running on the installer machine, but the settings applied to ceph and the datasource defined in grafana all point to the mgr, not the installer host.

Have attached my group_vars and inventory

Comment 3 Paul Cuzner 2019-10-15 21:16:03 UTC
Created attachment 1626180 [details]
inventory file

Comment 4 Paul Cuzner 2019-10-15 21:16:28 UTC
Created attachment 1626181 [details]
group vars - all.yml

Comment 5 Paul Cuzner 2019-10-15 21:34:22 UTC
In my test environment, I'm using two machines - an installer, and an all-in-one node for all ceph daemons. The installer is used for the grafana-server group.

Looking at the ceph config keys
[root@rhcs4-aio ~]# ceph config get mgr.rhcs4-aio mgr/dashboard/PROMETHEUS_API_HOST

[root@rhcs4-aio ~]# ceph config get mgr.rhcs4-aio mgr/dashboard/GRAFANA_API_URL
http://10.90.90.165:3000/

In my case 10.90.90.165 is the IP of the host running ceph mgr; 10.90.90.163 is where the prometheus & grafana containers are actually deployed.
From the rhcs4-aio box (.165), the rhcs4-installer name resolves to .163 correctly.

As a consequence, the grafana dashboard integration is also broken
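The mismatch above can be checked mechanically. The snippet below is a small sketch; the helper name `url_points_to` is hypothetical, and the URL and addresses are the ones quoted in this comment.

```python
from urllib.parse import urlparse


def url_points_to(url, expected_ip):
    """Return True if the host part of `url` equals `expected_ip`."""
    return urlparse(url).hostname == expected_ip


# Value returned by: ceph config get mgr.rhcs4-aio mgr/dashboard/GRAFANA_API_URL
grafana_api_url = "http://10.90.90.165:3000/"

# The grafana container actually runs on 10.90.90.163 (the installer).
print(url_points_to(grafana_api_url, "10.90.90.163"))  # False
print(url_points_to(grafana_api_url, "10.90.90.165"))  # True
```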

Comment 6 Paul Cuzner 2019-10-15 23:25:19 UTC
Other relevant versions

sh-4.4# rpm -q ansible
ansible-2.8.3-1.el8ae.noarch
sh-4.4# rpm -q ansible-runner
ansible-runner-1.3.4-2.el8ar.noarch

Comment 7 Dimitri Savineau 2019-10-16 13:12:59 UTC
> Looking at the ceph config keys
> [root@rhcs4-aio ~]# ceph config get mgr.rhcs4-aio mgr/dashboard/PROMETHEUS_API_HOST

The prometheus api host key was added by [1] and has been present since v4.0.0

> [root@rhcs4-aio ~]# ceph config get mgr.rhcs4-aio mgr/dashboard/GRAFANA_API_URL
> http://10.90.90.165:3000/

This was fixed by [2] and present since v4.0.0

Except for those two issues, the prometheus_port override should already work in rc16.

I'll try to do a deployment using rc16.

[1] https://github.com/ceph/ceph-ansible/commit/74ab59c4f33d534cfbca4055c1f494a670be40e2
[2] https://github.com/ceph/ceph-ansible/commit/9bb11c7b2a17db56cfcd7284d2190af36e17bba6

Comment 8 Dimitri Savineau 2019-10-16 14:47:59 UTC
The prometheus_port override works for me with rc16

1 installer node with grafana/prometheus stack and running ceph-ansible
1 aio node with mon/mgr/osd/rgw/mds

$ ansible --version
ansible 2.8.3

$ grep prometheus_port group_vars/all.yml
prometheus_port: 9095

$ sudo ss -lntup|grep prometheus
tcp    LISTEN     0      128      :::9095                 :::*                   users:(("prometheus",pid=15481,fd=7))
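For repeated checks, the `ss` output above can be parsed with a few lines of Python. This is a sketch that assumes the column layout shown in this comment (Netid, State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, process info); the function name `listening_ports` is hypothetical.

```python
def listening_ports(ss_output, process_name):
    """Extract the local ports of sockets owned by `process_name`.

    Assumes `ss -lntup` style lines where the fifth column is
    "Local Address:Port" and the process appears as ("name",pid=...).
    """
    ports = []
    for line in ss_output.splitlines():
        if '("%s"' % process_name not in line:
            continue
        local = line.split()[4]
        # Split on the last colon so IPv6 forms such as ":::9095" also work.
        ports.append(int(local.rsplit(":", 1)[1]))
    return ports


sample = (
    'tcp    LISTEN     0      128      :::9095                 :::*'
    '                   users:(("prometheus",pid=15481,fd=7))'
)
print(listening_ports(sample, "prometheus"))  # [9095]
```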

Comment 9 Paul Cuzner 2019-10-17 03:03:21 UTC
Good to know - thanks for testing. Is this rhel8 with python3? My ansible version on rhel8 is 2.8.5
[root@rhcs4-installer inventory]# ansible --version
ansible 2.8.5
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.6/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.6.8 (default, Apr  3 2019, 17:26:03) [GCC 8.2.1 20180905 (Red Hat 8.2.1-3)]
[root@rhcs4-installer inventory]# rpm -q ansible
ansible-2.8.5-2.el8ae.noarch

Comment 10 Paul Cuzner 2019-10-17 04:22:30 UTC
I tried 4.0.2 from brew, and can confirm the issues with the prometheus/grafana definitions are resolved - the grafana integration is working as expected.

The prometheus_port though didn't take my override, but given the default is now 9092 I didn't have a port clash.

Comment 11 Dimitri Savineau 2019-10-17 13:18:30 UTC
> Is this rhel8 with python3? My ansible version on rhel8 is 2.8.5

No, it was on CentOS 7 with python2, but that doesn't change anything.
I also tried with ansible 2.8.5, with success.

> The prometheus_port though didn't take my override, but given the default is now 9092 I didn't have a port clash.

Could you share how you run the ansible-playbook command? (i.e. where the inventory file, the [group|host]_vars directories, etc. are located)

Comment 12 Dimitri Savineau 2019-10-17 17:52:53 UTC
Ok, so import_playbook doesn't pick up the [group|host]_vars the same way depending on the location of the ansible inventory.

The group_vars directory is used when it's present in one of these locations (there are more scenarios, see [1]):

 - in the same directory as the inventory file
 - in the same directory as the playbook file

When the inventory file is in the ceph-ansible directory then everything works perfectly (that's what I'm always using).

-----------------
$ grep prometheus_port group_vars/all.yml 
prometheus_port: 9099
$ ansible-playbook -i hosts site.yml 

PLAY [mons] *******************************************

TASK [ceph-prometheus : prometheus_port variable] *****
ok: [rhcs4-aio] => {
    "msg": 9099
}

PLAY [grafana-server] *********************************

TASK [ceph-prometheus : prometheus_port variable] *****
ok: [rhcs4-installer] => {
    "msg": 9099
}
-----------------

But if the inventory file directory doesn't contain a group_vars directory then the overrides for the dashboard playbook will be lost (because there's no group_vars in the infrastructure-playbooks directory).

-----------------
$ ansible-playbook -i /tmp/hosts site.yml 

PLAY [mons] *******************************************

TASK [ceph-prometheus : prometheus_port variable] *****
ok: [rhcs4-aio] => {
    "msg": 9099
}

PLAY [grafana-server] *********************************

TASK [ceph-prometheus : prometheus_port variable] *****
ok: [rhcs4-installer] => {
    "msg": 9092
}
-----------------

So we need to either move the dashboard.yml file out of the infrastructure-playbooks directory, or duplicate its code in both the site and site-container playbooks.
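The lookup behaviour described in this comment can be illustrated with a short sketch. It only models the two locations listed above (next to the inventory and next to the playbook) and is not how Ansible implements it internally. The function name `group_vars_candidates` is hypothetical, and the `/usr/share/ceph-ansible` layout is assumed for illustration.

```python
import posixpath


def group_vars_candidates(inventory_path, playbook_path):
    """List the group_vars directories adjacent to the inventory and playbook.

    Models only the two cases from this comment; other precedence rules
    are ignored.
    """
    candidates = []
    for path in (inventory_path, playbook_path):
        candidate = posixpath.join(posixpath.dirname(path), "group_vars")
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


base = "/usr/share/ceph-ansible"  # assumed install location

# site.yml with an external inventory: group_vars next to site.yml is considered.
print(group_vars_candidates("/tmp/hosts", base + "/site.yml"))

# The imported dashboard playbook lives in infrastructure-playbooks, so
# neither candidate is the ceph-ansible group_vars directory.
print(group_vars_candidates(
    "/tmp/hosts", base + "/infrastructure-playbooks/dashboard.yml"))
```

With the inventory in `/tmp`, the second call yields only `/tmp/group_vars` and `infrastructure-playbooks/group_vars`, neither of which exists in the setup described here, which matches the default value 9092 being used in the grafana-server play.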

@Paul Could you confirm that there's no group_vars directory in the directory that contains the inventory file?

[1] https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#ansible-variable-precedence

Comment 13 Paul Cuzner 2019-10-18 03:51:18 UTC
In my case the inventory is in /usr/share/ansible-runner-service/inventory/hosts, and group_vars and host_vars are coming from /usr/share/ceph-ansible as normal.

From the CLI I was running the site.yml play as follows:
cd /usr/share/ceph-ansible
ansible-playbook -i /usr/share/ansible-runner-service/inventory/hosts site.yml

group_vars and host_vars are within /usr/share/ceph-ansible

Comment 14 Ken Dreyer (Red Hat) 2019-10-21 15:15:48 UTC
Would you please let us know what tagged version on the stable-4.0 branch contains the complete fixes for this BZ?

Comment 15 Dimitri Savineau 2019-10-21 15:32:12 UTC
There's no tag upstream yet. It will be present in v4.0.3

Comment 17 Guillaume Abrioux 2019-10-23 09:26:41 UTC
@Ken, shouldn't this BZ be targeted to something other than 4.*?

Comment 25 errata-xmlrpc 2020-01-31 12:47:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0312