Bug 1493920

Summary: [ceph-ansible] [ceph-container] : osd restart handler failing waiting PGs to be active+clean
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: urgent
Docs Contact:
Priority: high
Version: 3.0
CC: adeza, aschoen, ceph-eng-bugs, gabrioux, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb, shan
Target Milestone: rc
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.0.0-0.1.rc11.el7cp Ubuntu: ceph-ansible_3.0.0~rc11-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-05 23:44:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1488462, 1492193
Attachments:
File contains contents of all.yml, osds.yml, inventory file, ansible-playbook log (flags: none)
File contains ansible-playbook log (flags: none)

Description Vasishta 2017-09-21 07:32:22 UTC
Created attachment 1328804 [details]
File contains contents of all.yml, osds.yml, inventory file, ansible-playbook log

Description of problem:
During cluster initialization, the handler 'restart containerized ceph osds daemon(s)' fails while waiting for PGs to become active+clean before at least three OSD nodes have been configured.

Initially the same handler failed after the ceph-mgr tasks completed; after upgrading ceph-ansible to the latest version, the same handler fails after the ceph-mon tasks.
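
For context, the handler serializes OSD restarts and blocks until the cluster reports all PGs active+clean. Below is a minimal sketch of that kind of wait expressed as an Ansible task; the task name and retry values are hypothetical and this is not the actual ceph-ansible handler, only an illustration of the check it performs (the cluster name 12_3_0 matches this report):

# Hypothetical sketch, not the ceph-ansible implementation: poll the
# cluster's PG summary until it reports active+clean, with a bounded
# number of retries before giving up.
- name: wait for pgs to become active+clean before restarting the next osd
  command: docker exec ceph-mon-{{ ansible_hostname }} ceph --cluster 12_3_0 pg stat
  register: pg_stat
  until: "'active+clean' in pg_stat.stdout"
  retries: 40
  delay: 15
  changed_when: false

With only one OSD host deployed so far, a wait like this can never succeed, which is the failure reported here.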

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc10.el7cp.noarch

How reproducible:
Always (3/3)

Steps to Reproduce:
1. Configure ceph-ansible to initialize a containerized cluster.
2. Run ansible-playbook site-docker.yml

Actual results:
The handler 'ceph-defaults : restart containerized ceph osds daemon(s)' fails while waiting for PGs to reach the active+clean state.

Expected results:
The handler 'ceph-defaults : restart containerized ceph osds daemon(s)' should be skipped after the ceph-mon or ceph-mgr tasks complete, or it should not expect PGs to be active+clean at that point.

Additional info:
Cluster status when the handler was waiting for PGs to become active+clean (all three OSDs were on a single node):
$ sudo docker exec ceph-mon-magna015 ceph -s --cluster 12_3_0
  cluster:
    id:     3d632b94-abb3-45e2-8c62-ac2ddac0ed6e
    health: HEALTH_WARN
            Reduced data availability: 16 pgs inactive
            Degraded data redundancy: 16 pgs unclean, 16 pgs degraded, 16 pgs undersized
            too few PGs per OSD (5 < min 30)
 
  services:
    mon: 3 daemons, quorum magna012,magna015,magna027
    mgr: magna027(active), standbys: magna012, magna015
    mds: cephfs-1/1/1 up  {0=magna020=up:creating}
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   2 pools, 16 pgs
    objects: 0 objects, 0 bytes
    usage:   323 MB used, 2777 GB / 2778 GB avail
    pgs:     100.000% pgs not active
             16 undersized+degraded+peered
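
The PGs stay undersized+degraded here because the default CRUSH rule places replicas on distinct hosts; with all three OSDs on one node and the default pool size of 3, no PG can reach active+clean. For illustration only (not a setting this deployment used), a group_vars/all.yml override via ceph-ansible's ceph_conf_overrides that would let PGs go active+clean on a single host looks like this:

# Hedged example: relax the CRUSH failure domain from host to OSD so
# replicas may land on the same node. Suitable for test clusters only.
ceph_conf_overrides:
  global:
    osd_crush_chooseleaf_type: 0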

Comment 2 Harish NV Rao 2017-09-22 07:42:11 UTC
This bug is currently blocking verification of two ON_QA bugs, 1492193 and 1488462, which in turn is blocking collocated container testing in 3.0. Please help resolve this at the earliest.

Comment 3 seb 2017-09-22 15:19:03 UTC
There is an issue with containers and restarting OSDs on dmcrypt, so please try to avoid that scenario for now.

If the setup is still up, could you please log into the machine and verify if the osds are running? Or perhaps I can log into the setup?

Thanks!

Comment 4 Guillaume Abrioux 2017-09-22 15:58:20 UTC
Vasishta,

from what I can see in the log you provided, there is a mon socket present during initial deployment:

ok: [magna027] => {"changed": false, "cmd": "docker exec ceph-mon-magna027 bash -c 'stat /var/run/ceph/12_3_0-mon*.asok > /dev/null 2>&1'", "delta": "0:00:00.079965", "end": "2017-09-21 06:48:22.252365", "failed": false, "failed_when_result": false, "rc": 0, "start": "2017-09-21 06:48:22.172400", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

That's why the handlers are triggered.

Is there any leftover on the environment you used for that deployment?

Comment 5 Vasishta 2017-09-22 17:53:17 UTC
Created attachment 1329691 [details]
File contains ansible-playbook log

(In reply to seb from comment #3)
 
> If the setup is still up, could you please log into the machine and verify
> if the osds are running? Or perhaps I can log into the setup?
Hi Sebastien,

Unfortunately the setup isn't there anymore, but I'm sure that the OSDs were still up.


(In reply to Guillaume Abrioux from comment #4)

> Is there any leftover on the environment you used for that deployment?
Hi Guillaume,

As far as I remember, I was working on fresh machines after getting them re-imaged. On the initial run it failed after setting up the mgrs; later I upgraded ceph-ansible (from *rc9* to *rc10*) and tried again, and it failed after the ceph-mon tasks.

I've attached the ansible log of the initial run.

Regards,
Vasishta

Comment 6 Guillaume Abrioux 2017-09-23 07:16:24 UTC
Vasishta,

The first issue you encountered was caused by:

2017-09-20 16:24:23,408 p=32069 u=ubuntu |  fatal: [magna019]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_interface'"}
2017-09-20 16:24:23,423 p=32069 u=ubuntu |  fatal: [magna025]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_interface'"}

This has been fixed here: https://github.com/ceph/ceph-ansible/commit/eb3ce6c02bb5d595c2613e0ea7a3a3e854925e89 (not yet merged into master)


Also, since you are deploying an rgw node, you must define the variable radosgw_interface, otherwise you will get a similar error further along.
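
For example, a one-line setting in group_vars/all.yml (the interface name eth0 is just a placeholder for whichever NIC the rgw node actually uses):

# group_vars/all.yml: interface the radosgw should bind to
radosgw_interface: eth0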

With the fix mentioned above and radosgw_interface defined in group_vars/all.yml, I could successfully deploy the same setup as you.

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
mds0                       : ok=56   changed=1    unreachable=0    failed=0
mon0                       : ok=122  changed=6    unreachable=0    failed=0
mon1                       : ok=119  changed=6    unreachable=0    failed=0
mon2                       : ok=184  changed=7    unreachable=0    failed=0
osd0                       : ok=125  changed=2    unreachable=0    failed=0
osd1                       : ok=124  changed=7    unreachable=0    failed=0


[guits@elisheba ceph-ansible] $ cat hosts
[mons]
mon0
mon1
mon2

[mgrs]
mon0
mon1
mon2

[osds]
mon2 osd_scenario='collocated' dmcrypt='true' devices="['/dev/sda', '/dev/sdb', '/dev/sdc']"
osd0 osd_scenario='non-collocated' devices="['/dev/sda', '/dev/sdb']" dedicated_devices="['/dev/sdc']"
osd1 osd_scenario='non-collocated' dmcrypt='true' devices="['/dev/sda', '/dev/sdb']" dedicated_devices="['/dev/sdc']"

[rgws]
osd0

[nfss]
osd1

[mdss]
mds0

I'll let you know as soon as the fix is merged upstream.

Comment 7 Guillaume Abrioux 2017-09-23 21:37:35 UTC
merged upstream:

https://github.com/ceph/ceph-ansible/commit/be757122f1efc6d30cd578e1ba4807114f4000d3

Comment 13 errata-xmlrpc 2017-12-05 23:44:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387