Description of problem:
OSD activation is failing after ansible-playbook ran successfully in the dmcrypt + collocated journal scenario.

Version-Release number of selected component (if applicable):
ceph version 10.2.7-23.el7cp
container image tag: ceph-2-rhel-7-docker-candidate-24815-20170601202916

How reproducible:
Always

Steps to Reproduce:
1. Perform preflight ops on all nodes
2. Enable the respective options in group_vars/osds.yml for the encrypted + collocated journal scenario
3. Run the playbook for container installation

Actual results:
OSDs are not getting activated

Expected results:
OSDs must get activated

Additional info:
# ceph -s --cluster 2_3
    cluster 0d4548f4-c5a7-4c37-ac06-ee15017d9518
     health HEALTH_ERR
            72 pgs are stuck inactive for more than 300 seconds
            72 pgs stuck inactive
            72 pgs stuck unclean
     monmap e2: 3 mons at {magna030=10.8.128.30:6789/0,magna033=10.8.128.33:6789/0,magna042=10.8.128.42:6789/0}
            election epoch 6, quorum 0,1,2 magna030,magna033,magna042
     osdmap e6: 6 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v7: 72 pgs, 2 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  72 creating
Created attachment 1284699 [details]
ansible-playbook log and osds.yml, all.yml file snippets

Faced this issue while working on BZ 1452316. Though the ansible-playbook ran successfully, the OSDs could not get activated. The attachment contains the ansible-playbook log and snippets of the osds.yml and all.yml files.

[ubuntu@magna088 ~]$ cat /usr/share/ceph-ansible/group_vars/osds.yml | egrep -v ^# | grep -v ^$
---
dummy:
copy_admin_key: true
devices:
  - /dev/sdb
  - /dev/sdc
osd_containerized_deployment: true
ceph_osd_docker_prepare_env: -e CLUSTER={{ cluster }} -e OSD_JOURNAL_SIZE={{ journal_size }} -e OSD_FORCE_ZAP=1 -e OSD_DMCRYPT=1
ceph_osd_docker_devices: "{{ devices }}"
ceph_osd_docker_extra_env: -e CLUSTER={{ cluster }} -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE -e OSD_JOURNAL_SIZE={{ journal_size }} -e OSD_DMCRYPT=1
Created attachment 1284700 [details]
Contains log snippets of two different OSDs

Found two issues regarding two different OSDs in the OSD logs:

1) Jun 03 16:49:11 magna003 ceph-osd-run.sh[24497]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.m1sLEc
Jun 03 16:49:11 magna003 ceph-osd-run.sh[24497]: no valid command found; 10 closest matches:

2) Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: command_check_call: Running command: /bin/mount -o noatime,inode64 -- /dev/mapper/704e3599-8ae9-40c5-b86c-24c36b5c15e8 /var/lib/ceph/osd/2_3-1
Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.fJ_tix
Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: df: '/var/lib/ceph/osd/2_3-2/': No such file or directory
Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: 2017-06-03 17:00:45.260147 7f15f1661700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/2_3-2//keyring: (2) No such file or directory

Please refer to the attachment for a larger log snippet. Sorry for not mentioning in the description that the cluster has a custom cluster name, '2_3' in this case.

Regards,
Vasishta
Created attachment 1284765 [details]
osd log snippet - dedicated journal

Hi,

OSD activation is also failing in the dedicated journal scenario, with similar log messages.

1> These lines commonly appear in all the OSD logs (i.e., on all nodes):

Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: df: '/var/lib/ceph/osd/7_3_2_3-7/': No such file or directory
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: 2017-06-04 16:27:33.883293 7f73b17ba700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/7_3_2_3-7//keyring: (2) No such file or directory
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: 2017-06-04 16:27:33.883303 7f73b17ba700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: 2017-06-04 16:27:33.883305 7f73b17ba700  0 librados: osd.7 initialization error (2) No such file or directory

Please refer to the attachment for a larger log snippet. The file also contains the conf file, and the all.yml and osds.yml files (used for ceph-ansible).

Regards,
Vasishta
Alfredo please triage this.
release note 2.3, but keep working... this is relevant to 3.0
Alfredo please disregard request to triage -- we're kicking this out of 2.3
This bug affects the osd_disk_activate.sh script when the cluster name includes numbers.

When osd_disk_activate.sh runs, it looks up the OSD ID by executing this command:

OSD_ID=$(grep "${MOUNTED_PART}" /proc/mounts | awk '{print $2}' | grep -oh '[0-9]*')

(see: https://github.com/ceph/ceph-docker/blob/master/ceph-releases/jewel/ubuntu/14.04/daemon/osd_scenarios/osd_disk_activate.sh#L43)

When the cluster name includes numbers, the last grep is wrong, because '[0-9]*' matches every run of digits, including the digits in the cluster name:

++ grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts
++ awk '{print $2}'
++ grep -oh '[0-9]*'
+ OSD_ID='23

For instance:

[root@ceph-osd0 /]# grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts
/dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /var/lib/ceph/osd/23-0 xfs rw,seclabel,noatime,attr2,inode64,noquota 0 0
[root@ceph-osd0 /]# grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts | awk '{print $2}'
/var/lib/ceph/osd/23-0
[root@ceph-osd0 /]# grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts | awk '{print $2}' | grep -oh '[0-9]*'
23
0
[root@ceph-osd0 /]#

We should only get '0'.

fix: https://github.com/ceph/ceph-docker/pull/662
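The failure can be reproduced without a live cluster. This is a minimal sketch: the mount line is taken from the transcript above, and the sed-based extraction is one possible illustrative workaround (anchoring on the part after the last '-' in the mount-point basename), not necessarily the exact change made in PR 662. Requires GNU grep.

```shell
# Mount line as it appears in /proc/mounts for cluster "23", OSD 0.
MOUNT_LINE="/dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /var/lib/ceph/osd/23-0 xfs rw,noatime 0 0"

# Buggy extraction: '[0-9]*' also matches the digits of the cluster name,
# so two numbers come back instead of one (here "23" and "0").
BUGGY_ID=$(echo "$MOUNT_LINE" | awk '{print $2}' | grep -oh '[0-9]*')
echo "buggy: $BUGGY_ID"

# Illustrative fix: reduce the mount point to its "<cluster>-<id>" basename,
# then keep only what follows the last '-', so digits in the cluster name
# no longer matter.
FIXED_ID=$(echo "$MOUNT_LINE" | awk '{print $2}' | sed 's|.*/||; s/.*-//')
echo "fixed: $FIXED_ID"
```

With a cluster named "23", the buggy pipeline yields two digit runs, while the basename-based extraction returns just the OSD ID "0".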
Guillaume, would you mind setting the "doc text" field for this BZ? It needs to go into the 2.3 Release Notes. Thanks! Erin
Thank you Guillaume for the doc text info! I have updated it a bit--would you mind taking a look and letting me know if it looks ok?
(In reply to Erin Donnelly from comment #13) > Thank you Guillaume for the doc text info! I have updated it a bit--would > you mind taking a look and letting me know if it looks ok? looks good to me
Yes, just release note the bug; no doc addition is necessary.
Merged upstream; backport in progress.
Changing the summary to a more appropriate version, as OSD activation fails whether encrypted or not, and whether the journal is collocated or on a dedicated device.

Regards,
Vasishta
Tried using ceph-3.0-rhel-7-docker-candidate-31370-20171003232256; it is working fine. Moving BZ to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3388