Created attachment 1328804 [details]
File contains contents of all.yml, osds.yml, the inventory file, and the ansible-playbook log

Description of problem:
During cluster initialization, the handler 'restart containerized ceph osds daemon(s)' fails while waiting for PGs to become active+clean before at least three OSD nodes have been configured. Initially the same handler failed after completing the ceph-mgr tasks; after upgrading ceph-ansible to the latest version, it fails after the ceph-mon tasks.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc10.el7cp.noarch

How reproducible:
Always (3/3)

Steps to Reproduce:
1. Configure ceph-ansible to initialize a containerized cluster.
2. Run ansible-playbook site-docker.yml

Actual results:
Handler 'ceph-defaults : restart containerized ceph osds daemon(s)' fails waiting for PGs to reach the active+clean state.

Expected results:
The handler 'ceph-defaults : restart containerized ceph osds daemon(s)' should be skipped after the ceph-mon or ceph-mgr tasks complete, or it should not expect PGs to be active+clean.

Additional info:
Cluster status while the handler was waiting for PGs to become active+clean (all three OSDs were on a single node):

$ sudo docker exec ceph-mon-magna015 ceph -s --cluster 12_3_0
  cluster:
    id:     3d632b94-abb3-45e2-8c62-ac2ddac0ed6e
    health: HEALTH_WARN
            Reduced data availability: 16 pgs inactive
            Degraded data redundancy: 16 pgs unclean, 16 pgs degraded, 16 pgs undersized
            too few PGs per OSD (5 < min 30)

  services:
    mon: 3 daemons, quorum magna012,magna015,magna027
    mgr: magna027(active), standbys: magna012, magna015
    mds: cephfs-1/1/1 up {0=magna020=up:creating}
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 0 objects, 0 bytes
    usage:   323 MB used, 2777 GB / 2778 GB avail
    pgs:     100.000% pgs not active
             16 undersized+degraded+peered
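For illustration, the wait that the handler performs can be sketched as a check on the `ceph -s` text. This is a hedged sketch, not the actual ceph-ansible handler code; the `pgs_clean` function name and the grep pattern are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: decide from `ceph -s`-style text whether PGs are active+clean.
# Not the real handler; pattern and function name are assumed.
pgs_clean() {
    # $1: captured `ceph -s` output; succeed only when nothing reports
    # inactive, undersized, or degraded PGs
    ! printf '%s\n' "$1" | grep -qE 'pgs not active|undersized|degraded'
}

# Status excerpt from the report above: 16 undersized+degraded+peered PGs
status='pgs: 100.000% pgs not active
        16 undersized+degraded+peered'

if pgs_clean "$status"; then
    echo "PGs active+clean"
else
    echo "waiting: PGs not yet active+clean"
fi
```

With only three OSDs on a single node and default CRUSH rules, replicas cannot be placed on distinct hosts, so such a wait can never succeed, which matches the behavior reported here.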
This bug is currently blocking the verification of two ON_QA bugs, 1492193 and 1488462, which in turn are blocking collocated container testing in 3.0. Please help resolve this at the earliest.
There is an issue with containers and restart on dmcrypt, so please try to avoid this scenario for now. If the setup is still up, could you please log into the machine and verify whether the OSDs are running? Or perhaps I can log into the setup? Thanks!
Vasishta, from what I can see in the log you provided, there is a mon socket present during the initial deployment:

ok: [magna027] => {"changed": false, "cmd": "docker exec ceph-mon-magna027 bash -c 'stat /var/run/ceph/12_3_0-mon*.asok > /dev/null 2>&1'", "delta": "0:00:00.079965", "end": "2017-09-21 06:48:22.252365", "failed": false, "failed_when_result": false, "rc": 0, "start": "2017-09-21 06:48:22.172400", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

That's why the handlers are triggered. Is there any leftover on the environment you used for that deployment?
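To make the trigger condition concrete: ceph-ansible notifies its restart handlers when the mon admin socket already exists. The following is a hedged simulation of that check; a real run does it via `docker exec` as in the log above, while here a temp directory stands in for /var/run/ceph inside the container (the cluster name is taken from the log, everything else is assumed).

```shell
#!/bin/sh
# Simulation of the socket check quoted above; a temp dir stands in for
# /var/run/ceph inside the mon container.
run_dir=$(mktemp -d)
cluster=12_3_0

socket_present() {
    # mirrors: stat /var/run/ceph/<cluster>-mon*.asok > /dev/null 2>&1
    stat "$run_dir/${cluster}"-mon*.asok >/dev/null 2>&1
}

socket_present && echo "socket found: restart handler will fire" \
               || echo "no socket: fresh deployment"

# A leftover socket from a previous deployment flips the result:
touch "$run_dir/${cluster}-mon.magna027.asok"
socket_present && echo "socket found: restart handler will fire" \
               || echo "no socket: fresh deployment"
```

This is why a leftover socket file on a supposedly fresh host makes the playbook behave as if it were restarting existing daemons.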
Created attachment 1329691 [details]
File contains the ansible-playbook log

(In reply to seb from comment #3)
> If the setup is still up, could you please log into the machine and verify
> if the osds are running? Or perhaps I can log into the setup?

Hi Sebastien,

Unfortunately the setup isn't there anymore, but I'm sure the OSDs were still up.

(In reply to Guillaume Abrioux from comment #4)
> Is there any leftover on the environment you used for that deployment?

Hi Guillaume,

As far as I remember, I was working on fresh machines after getting them re-imaged. On the initial run it failed after setting up the mgrs; after upgrading ceph-ansible (from *rc9* to *rc10*) and trying again, it failed after the ceph-mon tasks. I've attached the ansible log of the initial run.

Regards,
Vasishta
Vasishta,

The first issue you encountered was caused by:

2017-09-20 16:24:23,408 p=32069 u=ubuntu |  fatal: [magna019]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_interface'"}
2017-09-20 16:24:23,423 p=32069 u=ubuntu |  fatal: [magna025]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_interface'"}

This has been fixed here (not yet merged into master):
https://github.com/ceph/ceph-ansible/commit/eb3ce6c02bb5d595c2613e0ea7a3a3e854925e89

Also, since you are deploying an rgw node, you must define the variable radosgw_interface, otherwise you will get a similar error further on.

With the fix mentioned above and radosgw_interface defined in group_vars/all.yml, I could successfully deploy the same setup as you:

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
mds0                       : ok=56   changed=1    unreachable=0    failed=0
mon0                       : ok=122  changed=6    unreachable=0    failed=0
mon1                       : ok=119  changed=6    unreachable=0    failed=0
mon2                       : ok=184  changed=7    unreachable=0    failed=0
osd0                       : ok=125  changed=2    unreachable=0    failed=0
osd1                       : ok=124  changed=7    unreachable=0    failed=0

[guits@elisheba ceph-ansible] $ cat hosts
[mons]
mon0
mon1
mon2

[mgrs]
mon0
mon1
mon2

[osds]
mon2 osd_scenario='collocated' dmcrypt='true' devices="['/dev/sda', '/dev/sdb', '/dev/sdc']"
osd0 osd_scenario='non-collocated' devices="['/dev/sda', '/dev/sdb']" dedicated_devices="['/dev/sdc']"
osd1 osd_scenario='non-collocated' dmcrypt='true' devices="['/dev/sda', '/dev/sdb']" dedicated_devices="['/dev/sdc']"

[rgws]
osd0

[nfss]
osd1

[mdss]
mds0

I'll let you know as soon as the fix is merged upstream.
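For completeness, defining radosgw_interface in group_vars/all.yml might look like the fragment below. The interface name eth0 is only an assumed example; use whichever interface the rgw node actually listens on.

```yaml
# group_vars/all.yml (fragment)
# Interface the radosgw binds to; 'eth0' is an example value, not a default.
radosgw_interface: eth0
```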
merged upstream: https://github.com/ceph/ceph-ansible/commit/be757122f1efc6d30cd578e1ba4807114f4000d3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387