Bug 1493920 - [ceph-ansible] [ceph-container] : osd restart handler failing waiting PGs to be active+clean
Summary: [ceph-ansible] [ceph-container] : osd restart handler failing waiting PGs to ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 3.0
Assignee: Sébastien Han
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1488462 1492193
 
Reported: 2017-09-21 07:32 UTC by Vasishta
Modified: 2017-12-05 23:44 UTC (History)
CC List: 11 users

Fixed In Version: RHEL: ceph-ansible-3.0.0-0.1.rc11.el7cp Ubuntu: ceph-ansible_3.0.0~rc11-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-05 23:44:47 UTC
Embargoed:


Attachments
File contains contents of all.yml, osds.yml, inventory file, and ansible-playbook log (549.41 KB, text/plain)
2017-09-21 07:32 UTC, Vasishta
File contains ansible-playbook log (1.38 MB, text/plain)
2017-09-22 17:53 UTC, Vasishta


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 1946 0 None closed config: fix path to set `interface` in ceph.conf 2020-11-12 08:26:11 UTC
Red Hat Product Errata RHBA-2017:3387 0 normal SHIPPED_LIVE Red Hat Ceph Storage 3.0 bug fix and enhancement update 2017-12-06 03:03:45 UTC

Description Vasishta 2017-09-21 07:32:22 UTC
Created attachment 1328804 [details]
File contains contents of all.yml, osds.yml, inventory file, and ansible-playbook log

Description of problem:
During cluster initialization, the handler 'restart containerized ceph osds daemon(s)' fails while waiting for PGs to become active+clean before at least three OSD nodes have been configured.

Initially the same handler failed after completing the ceph-mgr tasks; after upgrading ceph-ansible to the latest version, it fails after the ceph-mon tasks.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc10.el7cp.noarch

How reproducible:
Always (3/3)

Steps to Reproduce:
1. Configure ceph-ansible to initialize a containerized cluster.
2. Run ansible-playbook site-docker.yml

Actual results:
The handler 'ceph-defaults : restart containerized ceph osds daemon(s)' fails while waiting for PGs to reach the active+clean state.

Expected results:
The handler 'ceph-defaults : restart containerized ceph osds daemon(s)' should either be skipped after completing the ceph-mon or ceph-mgr tasks, or it should not expect PGs to be active+clean at that point.
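
For illustration, the kind of check this handler performs can be approximated with a small wait loop like the one below. This is a rough sketch, not the actual ceph-ansible restart script; the retry count and sleep interval are arbitrary, and the container and cluster names are taken from this report.

# Rough approximation of the handler's wait-for-clean check (not the real script).
# The real handler requires all PGs to be active+clean before moving on.
for i in $(seq 1 30); do
  pgstat=$(sudo docker exec ceph-mon-magna015 ceph --cluster 12_3_0 pg stat)
  echo "$pgstat" | grep -q 'active+clean' && break
  sleep 10
done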

Additional info:
Cluster status when the handler was waiting for PGs to be active+clean (all three OSDs were on a single node):
$ sudo docker exec ceph-mon-magna015 ceph -s --cluster 12_3_0
  cluster:
    id:     3d632b94-abb3-45e2-8c62-ac2ddac0ed6e
    health: HEALTH_WARN
            Reduced data availability: 16 pgs inactive
            Degraded data redundancy: 16 pgs unclean, 16 pgs degraded, 16 pgs undersized
            too few PGs per OSD (5 < min 30)
 
  services:
    mon: 3 daemons, quorum magna012,magna015,magna027
    mgr: magna027(active), standbys: magna012, magna015
    mds: cephfs-1/1/1 up  {0=magna020=up:creating}
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   2 pools, 16 pgs
    objects: 0 objects, 0 bytes
    usage:   323 MB used, 2777 GB / 2778 GB avail
    pgs:     100.000% pgs not active
             16 undersized+degraded+peered
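
With the default replicated pool size of 3 and the default CRUSH rule that places replicas on separate hosts, these PGs can never reach active+clean while all three OSDs sit on a single node, which matches the undersized+degraded+peered state above. Assuming those defaults, this can be confirmed with the standard ceph CLI (names below match this report):

# Inspect pool size and replica placement (defaults assumed for pool size and CRUSH rule).
sudo docker exec ceph-mon-magna015 ceph osd pool ls detail --cluster 12_3_0   # replicated size 3
sudo docker exec ceph-mon-magna015 ceph osd tree --cluster 12_3_0             # all three OSDs under one host
sudo docker exec ceph-mon-magna015 ceph osd crush rule dump --cluster 12_3_0  # chooseleaf type "host"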

Comment 2 Harish NV Rao 2017-09-22 07:42:11 UTC
This bug is currently blocking the verification of two ON_QA bugs, 1492193 and 1488462, which in turn are blocking collocated container testing in 3.0. Please help resolve this at the earliest.

Comment 3 seb 2017-09-22 15:19:03 UTC
There is an issue with containers and restarts on dmcrypt, so please try to avoid this scenario for now.

If the setup is still up, could you please log into the machine and verify if the osds are running? Or perhaps I can log into the setup?

Thanks!

Comment 4 Guillaume Abrioux 2017-09-22 15:58:20 UTC
Vasishta,

from what I can see in the log you provided, there is a mon socket present during initial deployment:

ok: [magna027] => {"changed": false, "cmd": "docker exec ceph-mon-magna027 bash -c 'stat /var/run/ceph/12_3_0-mon*.asok > /dev/null 2>&1'", "delta": "0:00:00.079965", "end": "2017-09-21 06:48:22.252365", "failed": false, "failed_when_result": false, "rc": 0, "start": "2017-09-21 06:48:22.172400", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

That's why the handlers are triggered.

Is there any leftover on the environment you used for that deployment?
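
For reference, leftovers from an earlier deployment can usually be spotted with checks like these (the paths are the usual defaults; adjust for the cluster name):

# Look for leftovers of a previous deployment (default paths assumed).
sudo docker ps -a --filter 'name=ceph'      # any old ceph containers still around?
ls -l /var/run/ceph/                        # stale admin sockets, e.g. 12_3_0-mon.*.asok
ls /var/lib/ceph/mon/ /var/lib/ceph/osd/    # leftover mon/osd data directories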

Comment 5 Vasishta 2017-09-22 17:53:17 UTC
Created attachment 1329691 [details]
File contains ansible-playbook log

(In reply to seb from comment #3)
 
> If the setup is still up, could you please log into the machine and verify
> if the osds are running? Or perhaps I can log into the setup?
Hi Sebastien,

Unfortunately the setup isn't there anymore, but I'm sure the OSDs were still up.


(In reply to Guillaume Abrioux from comment #4)

> Is there any leftover on the environment you used for that deployment?
Hi Guillaume,

As far as I remember, I was working on fresh machines after getting them re-imaged. On the initial run it failed after setting up the mgrs; I later upgraded ceph-ansible (from *rc9* to *rc10*) and tried again, and then it failed after the ceph-mon tasks.

I've attached the ansible log of the initial run.

Regards,
Vasishta

Comment 6 Guillaume Abrioux 2017-09-23 07:16:24 UTC
Vasishta,

The first issue you encountered was caused by:

2017-09-20 16:24:23,408 p=32069 u=ubuntu |  fatal: [magna019]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_interface'"}
2017-09-20 16:24:23,423 p=32069 u=ubuntu |  fatal: [magna025]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_interface'"}

This has been fixed here: https://github.com/ceph/ceph-ansible/commit/eb3ce6c02bb5d595c2613e0ea7a3a3e854925e89 (not yet merged into master)


Also, since you are deploying an rgw node, you must define the variable radosgw_interface, otherwise you will get a similar error further on.

With the fix mentioned above and radosgw_interface defined in group_vars/all.yml (see the sketch after the hosts file below), I could successfully deploy the same setup as you.

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
mds0                       : ok=56   changed=1    unreachable=0    failed=0
mon0                       : ok=122  changed=6    unreachable=0    failed=0
mon1                       : ok=119  changed=6    unreachable=0    failed=0
mon2                       : ok=184  changed=7    unreachable=0    failed=0
osd0                       : ok=125  changed=2    unreachable=0    failed=0
osd1                       : ok=124  changed=7    unreachable=0    failed=0


[guits@elisheba ceph-ansible] $ cat hosts
[mons]
mon0
mon1
mon2

[mgrs]
mon0
mon1
mon2

[osds]
mon2 osd_scenario='collocated' dmcrypt='true' devices="['/dev/sda', '/dev/sdb', '/dev/sdc']"
osd0 osd_scenario='non-collocated' devices="['/dev/sda', '/dev/sdb']" dedicated_devices="['/dev/sdc']"
osd1 osd_scenario='non-collocated' dmcrypt='true' devices="['/dev/sda', '/dev/sdb']" dedicated_devices="['/dev/sdc']"

[rgws]
osd0

[nfss]
osd1

[mdss]
mds0
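
For reference, the radosgw_interface setting mentioned above can be added to group_vars/all.yml like this (eth0 is only a placeholder for the actual public-facing device):

# Append the rgw interface setting; replace eth0 with the real device name.
cat >> group_vars/all.yml <<'EOF'
radosgw_interface: eth0
EOF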

I'll let you know as soon as the fix is merged upstream.

Comment 7 Guillaume Abrioux 2017-09-23 21:37:35 UTC
merged upstream:

https://github.com/ceph/ceph-ansible/commit/be757122f1efc6d30cd578e1ba4807114f4000d3

Comment 13 errata-xmlrpc 2017-12-05 23:44:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

