Bug 1510470
| Summary: | Containerized OSDs don't start - fail to find the Journal device | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Daniel Messer <dmesser> |
| Component: | Ceph-Ansible | Assignee: | Guillaume Abrioux <gabrioux> |
| Status: | CLOSED ERRATA | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.0 | CC: | adeza, aschoen, ceph-eng-bugs, dmesser, gabrioux, gmeno, hnallurv, nthomas, sankarshan, shan, tserlin, vakulkar, vashastr |
| Target Milestone: | rc | | |
| Target Release: | 3.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-ansible-3.0.10-2.el7cp Ubuntu: ceph-ansible_3.0.10-2redhat1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-05 23:49:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Daniel Messer
2017-11-07 13:57:02 UTC
Created attachment 1348967 [details]
osds.yml
I should add that this environment went through several install attempts before and was cleaned with purge-site-docker.yml. However, the PARTUUID that the OSDs are looking for stays the same. The fetch directory is cleaned between runs. The problem persists even when changing to collocated setups.

This is not something we see in our CI. Can we access this env? Thanks!

Could you provide the full playbook run log? Thanks

I tried to reproduce your issue with ceph-ansible v3.0.9 and the ceph-3.0-rhel-7-docker-candidate-61072-20171104225422 container image; the deployment worked fine.
OSDs are UP:
[root@osd0 ~]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ff6266f26745 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-61072-20171104225422 "/entrypoint.sh" 28 minutes ago Up 28 minutes ceph-osd-osd0-sdb
cea920b57eca brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-61072-20171104225422 "/entrypoint.sh" 28 minutes ago Up 28 minutes ceph-osd-osd0-sda
299226e51fd4 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-61072-20171104225422 "/entrypoint.sh" 28 minutes ago Exited (0) 28 minutes ago ceph-osd-prepare-osd0-sdb
0a103838a516 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.0-rhel-7-docker-candidate-61072-20171104225422 "/entrypoint.sh" 28 minutes ago Exited (0) 28 minutes ago ceph-osd-prepare-osd0-sda
[root@osd0 ~]#
[root@mon0 ~]# docker exec -ti ceph-mon-mon0 ceph -s
  cluster:
    id:     915ba53a-1288-4062-aa6d-45b5db0019b2
    health: HEALTH_WARN
            too few PGs per OSD (8 < min 30)

  services:
    mon: 3 daemons, quorum mon0,mon1,mon2
    mgr: mon0(active)
    mds: cephfs-1/1/1 up {0=mds0=up:active}
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2246 bytes
    usage:   214 MB used, 102133 MB / 102347 MB avail
    pgs:     16 active+clean
[root@mon0 ~]#
I think your multiple attempts to deploy have probably broken something.
I couldn't reproduce your issue, and neither CI nor QE caught it. Could you retry deploying from scratch? As Sebastien asked, any chance we could access your env?
Thanks!
@Guillaume - this might well be the case. I will send you and leseb the credentials of the environment. It's AWS-based. I could retry deploying from scratch too, but honestly I don't see what could cause this. I suggest we work in parallel: I will try to re-deploy from scratch, and you can try to re-deploy in my environment to see where it's choking. This behavior will likely affect others who run into https://bugzilla.redhat.com/show_bug.cgi?id=1510555, which is the reason I had to re-deploy so many times.

Hi Daniel,

The issue here is in the purge-docker-cluster.yml playbook. You tried several times to deploy your cluster. The first time, the OSD disk prepare process produced some logs that ceph-ansible uses later to retrieve the journal partition UUID. These logs are supposed to be generated [1] only at initial deployment, because they come from the prepare containers' logs; if we lose these containers for any reason (reboot or anything else), we can't generate these logs again.

upstream PR: https://github.com/ceph/ceph-ansible/pull/2152

[1] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/templates/ceph-osd-run.sh.j2#L17-L35

Hi Daniel,

I tried setting up OSDs with dedicated journals (both dmcrypt and non-dmcrypt) using the latest builds:

Container image - ceph-3.0-rhel-7-docker-candidate-36461-20171114235412
Ceph-ansible - ceph-ansible-3.0.11-1.el7cp.noarch

(Because of a test environment constraint, only two journals (of 2 OSDs) were on a dedicated disk.)

Initialization and purging were tried three times back to back on the same set of nodes, with a node reboot after initializing the cluster each time. Every time (both after initialization and after reboot) the OSDs came up and were running as expected, so it looks good to me.

Can you please let me know your views on the steps followed as part of the bug fix verification?

Regards,
Vasishta

Hi,

I'm moving the BZ to VERIFIED as per the suggestions I got based on Comment 15. Please feel free to let me know if there are any concerns.

Regards,
Vasishta

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387
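
For anyone hitting the same symptom, the root cause discussed in this thread (an OSD looking for a journal PARTUUID recorded during an earlier deployment) can be checked by hand on the OSD node. The following is only a diagnostic sketch, not the fix from the upstream PR. The function name `resolve_partuuid` is hypothetical, and it assumes the node exposes the standard udev directory `/dev/disk/by-partuuid`.

```python
import os
from typing import Optional


def resolve_partuuid(partuuid: str,
                     base: str = "/dev/disk/by-partuuid") -> Optional[str]:
    """Return the device path a PARTUUID points to, or None if it is unknown.

    Hypothetical helper for illustration only. `base` is assumed to be the
    udev-managed directory of PARTUUID symlinks; it is a parameter so the
    function can be exercised against a scratch directory.
    """
    link = os.path.join(base, partuuid)
    # os.path.exists follows symlinks, so a dangling link also yields None.
    if not os.path.exists(link):
        return None
    return os.path.realpath(link)
```

Calling it with the PARTUUID reported by the failing OSD returns `None` when no such partition exists any more, which would be consistent with a stale value left over from a previous deployment.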