Bug 1418083 - Defining 'cluster_network' in ceph_conf_overrides instead of under OSD options in 'group_vars/all' results in two conflicting cluster_network addresses
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Documentation
Version: 2.1
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 2.2
Assignee: Bara Ancincova
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-31 19:25 UTC by Kyle Squizzato
Modified: 2020-05-14 15:35 UTC (History)
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-21 23:50:30 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
GitHub ceph/ceph-ansible issue 1262 2017-01-31 19:27:13 UTC

Description Kyle Squizzato 2017-01-31 19:25:39 UTC
Description of problem:
If ceph_conf_overrides is used to set cluster_network instead of setting it under the OSD options in the group_vars/all.yml file, the default value cluster_network == {{ public_network }} is still applied, resulting in two conflicting cluster_network entries in the generated ceph.conf file.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-46.el7scon

How reproducible:
Always

Steps to Reproduce:
1) Using the ceph-ansible playbook, modify group_vars/all.sample to the desired settings.
2) When configuring cluster_network, *do not* use the OSD options parameter (comment it out); instead, specify it in the CONFIG OVERRIDE section:

    ceph_conf_overrides:
      global:
        cluster_network: 10.10.100.0/24

3) Run the playbook.
4) Once complete, inspect the resulting ceph.conf on one of the configured hosts; it should look like this:

 # cat /etc/ceph/ceph.conf 
[global]
cluster_network = 10.10.100.0/24 <---
max open files = 131072
fsid = 78a15451-fe2b-4627-a99d-9e060d0aecf1

[mon.mon2]
host = mon2
mon addr = 192.168.100.22

[mon.mon3]
host = mon3
mon addr = 192.168.100.23

[mon.mon1]
host = mon1
mon addr = 192.168.100.21

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor

[mon]

[osd]
osd mount options xfs = noatime,largeio,inode64,swalloc
osd mkfs options xfs = -f -i size=2048
public_network = 192.168.100.0/24
cluster_network = 192.168.100.0/24 <---
osd mkfs type = xfs
osd journal size = 1024

Actual results:
Ansible has included cluster_network settings in both the [global] and [osd] sections of ceph.conf, with two differing values. On a customer's machine, removing the duplicate cluster_network entry (and correcting the IP) fixed their attempts to import the existing cluster into Red Hat Storage Console, but we are unsure what effects (if any) the duplicate entries may have on performance or on a cluster serving data.

The problem is twofold. First, the cluster_network setting in [osd] gets set by roles/ceph-common/defaults/main.yml to:

 cluster_network: "{{ public_network }}"

since it was never uncommented in the all file. Then roles/ceph-common/tasks/main.yml comes in and dumps the config_overrides into ceph.conf:

 config_overrides: "{{ ceph_conf_overrides }}"

which results in the [global] entry, which is actually the correct one here.
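A minimal Python sketch of the two code paths just described (this simulates the behavior for illustration; it is not the actual ceph-ansible template logic, and all variable names here are illustrative):

```python
# Simulation (assumed, not the real ceph-ansible code) of why cluster_network
# ends up in both sections: the role default sets cluster_network to
# public_network for the [osd] section, while ceph_conf_overrides is dumped
# verbatim into [global]. Neither code path knows about the other.

public_network = "192.168.100.0/24"

# roles/ceph-common/defaults/main.yml: cluster_network: "{{ public_network }}"
default_cluster_network = public_network

# User-supplied overrides from group_vars/all
ceph_conf_overrides = {"global": {"cluster_network": "10.10.100.0/24"}}

# Naive render: both code paths emit the key, with differing values.
conf = {
    "global": dict(ceph_conf_overrides.get("global", {})),
    "osd": {
        "public_network": public_network,
        "cluster_network": default_cluster_network,
    },
}

for section in ("global", "osd"):
    print(f"[{section}]")
    for key, value in conf[section].items():
        print(f"{key} = {value}")
```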

---------

I think this is all fine. The cluster_network var works as expected; it's more an issue with conf overrides. The implementation isn't smart in any way: we just dump the override values into the conf. Perhaps we need syntax checks? Or perhaps it's just something we need to fix in the docs, and conf_overrides is behaving as designed.
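One possible shape such a syntax check could take (a hypothetical post-render sanity check, not part of ceph-ansible; the function name is illustrative) is to flag any option that appears in more than one section of the rendered ceph.conf with differing values. Since Ceph treats "cluster network" and "cluster_network" as the same option, keys are normalized first:

```python
def find_conflicts(conf_text):
    """Return {option: {section: value}} for options that are set in more
    than one section of an INI-style ceph.conf with non-identical values."""
    sections = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.split('#', 1)[0].strip()  # strip inline comments
        if not line:
            continue
        if line.startswith('[') and line.endswith(']'):
            current = line[1:-1]
            sections.setdefault(current, {})
        elif '=' in line and current is not None:
            key, value = line.split('=', 1)
            # Ceph accepts spaces or underscores in option names; normalize.
            key = key.strip().replace(' ', '_')
            sections[current][key] = value.strip()

    by_option = {}
    for section, options in sections.items():
        for key, value in options.items():
            by_option.setdefault(key, {})[section] = value
    return {key: locs for key, locs in by_option.items()
            if len(locs) > 1 and len(set(locs.values())) > 1}
```

Run against the ceph.conf above, this would report cluster_network as 10.10.100.0/24 in [global] and 192.168.100.0/24 in [osd].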

Expected results:
This is kind of a question mark for me.  It depends on how you wish to handle it.  See my comments upstream for more on this: https://github.com/ceph/ceph-ansible/issues/1262

Additional info:
This behavior appears to also occur upstream. 

[global]
mon initial members = mon1,mon2,mon3
cluster network = 192.168.100.0/24
mon host = 192.168.100.21,192.168.100.22,192.168.100.23
public network = 192.168.100.0/24
cluster_network = 10.10.100.0/24

I've opened an issue to track this as well: https://github.com/ceph/ceph-ansible/issues/1262

In addition, from a purely documentation standpoint, I'm not sure why we recommend using the config override option over the OSD options located in 'all': 

## OSD options
#public_network: 192.168.100.0/24
#cluster_network: 10.10.100.0/24

In Step 11 of the documentation in Section 3.2.2 of https://access.redhat.com/documentation/en/red-hat-ceph-storage/2/paged/installation-guide-for-red-hat-enterprise-linux/chapter-3-storage-cluster-installation, we instruct users to:

 "Set the public_network setting:"

but we never actually discuss setting cluster_network as an option. If users follow the documentation as written, this leads them to use ceph_conf_overrides to complete this task.

I feel there should be a Step 12 with an 'Optional' note about setting a cluster_network if desired, otherwise instructing users to leave the 'cluster_network' option commented out (which defaults it to public_network), and advising against placing this option inside the ceph_conf_overrides section.
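Such a step could, for example, point users at the existing OSD options block in group_vars/all.yml (the addresses below are illustrative, matching the example earlier in this report):

```yaml
## OSD options
public_network: 192.168.100.0/24
# Optional: uncomment cluster_network only if OSD replication traffic should
# use a separate network; when left commented out, it defaults to
# public_network.
cluster_network: 10.10.100.0/24
```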

In Section 3.2.5 of the same documentation, we provide the following ceph_conf_overrides example:

 ceph_conf_overrides:
   global:
      osd_pool_default_size: 2
      osd_pool_default_min_size: 1
      cluster_network: 10.0.0.1/24
   client.rgw.rgw1:
      log_file: /var/log/ceph/ceph-rgw-rgw1.log

which contains 'cluster_network' in the example. This results in further confusion for customers.
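A less confusing version of that documentation example would simply drop the cluster_network line from the overrides (the remaining values are unchanged from the example above):

```yaml
ceph_conf_overrides:
  global:
    osd_pool_default_size: 2
    osd_pool_default_min_size: 1
  client.rgw.rgw1:
    log_file: /var/log/ceph/ceph-rgw-rgw1.log
```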

Comment 2 seb 2017-02-01 12:42:45 UTC
Please see my reply in https://github.com/ceph/ceph-ansible/issues/1262

Comment 3 Kyle Squizzato 2017-02-02 17:58:42 UTC
(In reply to seb from comment #2)
> Please see my reply in https://github.com/ceph/ceph-ansible/issues/1262

So, after the discussion upstream, we'll move forward with making this a purely documentation bug.

