Bug 1461367

Summary: Addition of mds node to an existing cluster fails
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED WORKSFORME
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact: Erin Donnelly <edonnell>
Priority: urgent
Version: 3.0
CC: adeza, aschoen, ceph-eng-bugs, edonnell, flucifre, gmeno, hnallurv, icolle, kdreyer, nthomas, sankarshan, seb, shan, smanjara
Target Milestone: rc
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Adding an MDS to an existing cluster fails

Adding a Ceph Metadata Server (MDS) to an existing cluster fails with the error:

----
osd_pool_default_pg_num is undefined

The error appears to have been in '/usr/share/ceph-ansible/roles/ceph-mon/tasks/create_mds_filesystems.yml'
----

As a consequence, an attempt to create an MDS pool fails. To work around this issue, add the `osd_pool_default_pg_num` parameter to `ceph_conf_overrides` in the `/usr/share/ceph-ansible/group_vars/all.yml` file, for example:

----
ceph_conf_overrides:
  global:
    osd_pool_default_pg_num: 64
----
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-15 13:10:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1437916    

Description shilpa 2017-06-14 10:05:58 UTC
Description of problem:
Adding an MDS node to an existing cluster fails with the following error:

TASK [ceph-mon : create filesystem pools] **************************************
fatal: [magna096]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: 'osd_pool_default_pg_num' is undefined\n\nThe error appears to have been in '/usr/share/ceph-ansible/roles/ceph-mon/tasks/create_mds_filesystems.yml': line 6, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n# since those check are performed by the ceph-common role\n- name: create filesystem pools\n  ^ here\n"}
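
For context, the failing task creates the CephFS pools using a pg count taken from osd_pool_default_pg_num. A simplified sketch of that kind of task (not the exact upstream file; structure assumed from the error above and the pool names in comment 13) looks roughly like this:

# roles/ceph-mon/tasks/create_mds_filesystems.yml (sketch only)
- name: create filesystem pools
  command: ceph --cluster {{ cluster }} osd pool create {{ item }} {{ osd_pool_default_pg_num }}
  with_items:
    - cephfs_data
    - cephfs_metadata

If osd_pool_default_pg_num is not defined anywhere in group_vars and not already set by the ceph-common role (which the comment in the offending file says performs those checks), the Jinja2 reference cannot be resolved and the play aborts with the "undefined" error shown above.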


Version-Release number of selected component (if applicable):
ceph-ansible-2.2.11-1.el7scon.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster first with ceph-ansible.
2. Once the cluster is up, run Ansible again to add an MDS server (a minimal sketch of this flow follows below).
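
A minimal sketch of step 2, with an assumed hostname and the default playbook layout (paths and file names may differ per install):

# add the new MDS host to the [mdss] group in the Ansible inventory, e.g.:
#
#   [mdss]
#   new-mds-node     (hypothetical hostname)
#
# then re-run the playbook from the ceph-ansible directory:
$ cd /usr/share/ceph-ansible
$ ansible-playbook site.yml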


Actual results:
MDS pool creation fails.

"The error was: 'osd_pool_default_pg_num' is undefined\n\nThe error appears to have been in '/usr/share/ceph-ansible/roles/ceph-mon/tasks/create_mds_filesystems.yml"


Additional info:

As a workaround, I added 'osd_pool_default_pg_num' to group_vars/all.yml in the "CONFIG OVERRIDE" section:

ceph_conf_overrides:
  global:
    osd_pool_default_pg_num: 64

After this, Ansible successfully installed the MDS server.
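
Not part of the original report, but a quick way to sanity-check the result after re-running Ansible would be to confirm the filesystem exists and that its pools were created with the expected pg count (pool name assumes the default cephfs_data):

$ ceph fs ls
$ ceph osd pool get cephfs_data pg_num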

Comment 10 Ken Dreyer (Red Hat) 2017-07-06 11:15:18 UTC
Sebastien, what specific change upstream fixed this BZ?

Comment 12 Ken Dreyer (Red Hat) 2017-07-31 16:45:27 UTC
That commit is already in the version Shilpa was running above (ceph-ansible-2.2.11). See "git tag --contains ea68fbaaaee38b1a39b1f093e0faf5f897a466b0"

What else do we need to fix this?

Comment 13 seb 2017-08-31 08:35:51 UTC
@Ken, to be honest, I don't know; we don't see this error in the CI.
@Shilpa, could you please try again and let me know if you still see this issue?

I tried to reproduce this, without success. I first deployed an initial cluster with 3 mons and 3 osds, then added an MDS node and re-ran Ansible. Success.

As you can see I successfully passed this task:

TASK [ceph-mon : create filesystem pools] **********************************************************************************************************************************************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible/roles/ceph-mon/tasks/create_mds_filesystems.yml:6
ok: [mon2] => (item=cephfs_data) => {"changed": false, "cmd": ["ceph", "--cluster", "test", "osd", "pool", "create", "cephfs_data", "8"], "delta": "0:00:01.564608", "end": "2017-08-31 08:27:25.223125", "item": "cephfs_data", "rc": 0, "start": "2017-08-31 08:27:23.658517", "stderr": "pool 'cephfs_data' created", "stderr_lines": ["pool 'cephfs_data' created"], "stdout": "", "stdout_lines": []}
ok: [mon2] => (item=cephfs_metadata) => {"changed": false, "cmd": ["ceph", "--cluster", "test", "osd", "pool", "create", "cephfs_metadata", "8"], "delta": "0:00:01.035994", "end": "2017-08-31 08:27:29.731975", "item": "cephfs_metadata", "rc": 0, "start": "2017-08-31 08:27:28.695981", "stderr": "pool 'cephfs_metadata' created", "stderr_lines": ["pool 'cephfs_metadata' created"], "stdout": "", "stdout_lines": []}


See the final results:

jenkins-build@ceph-builders:~/build/workspace/ceph-ansible/tests/functional/centos/7/bluestore$ vagrant ssh mon0 -c "sudo ceph --cluster test -s"
  cluster:
    id:     5a51b9d9-b110-4a5f-b73c-b5dcf63552a1
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
    mgr: no daemons active
    mds: cephfs-1/1/1 up  {0=ceph-mds0=up:active}
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:

Connection to 192.168.121.133 closed.
jenkins-build@ceph-builders:~/build/workspace/ceph-ansible/tests/functional/centos/7/bluestore$ vagrant ssh mon0 -c "sudo ceph --cluster test fs ls"
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
Connection to 192.168.121.133 closed.


I'm tempted to close this bug, but I'll wait for you to report back first.
Thanks in advance.

Comment 14 seb 2017-09-15 13:10:54 UTC
Given that I haven't received any response and cannot reproduce this, I'm closing it. Feel free to re-open.

Comment 15 Red Hat Bugzilla 2023-09-14 03:59:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.