Bug 1458512

Summary: [ceph-ansible] [ceph-container] : osd activation failing when cluster name has numbers
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasishta <vashastr>
Component: Container Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA QA Contact: Vasishta <vashastr>
Severity: medium Docs Contact: Erin Donnelly <edonnell>
Priority: medium    
Version: 2.3 CC: adeza, anharris, dang, edonnell, flucifre, gabrioux, gmeno, hchen, hnallurv, jim.curtis, kdreyer, pprakash, seb, tserlin
Target Milestone: rc   
Target Release: 3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhceph:ceph-3.0-rhel-7-docker-candidate-71465-20170804220045 Doc Type: Bug Fix
Doc Text:
.OSD activation no longer fails when running the `osd_disk_activate.sh` script in the Ceph container when a cluster name contains numbers
Previously, in the Ceph container image, the `osd_disk_activate.sh` script treated every number included in the cluster name as part of an OSD ID. As a consequence, OSD activation failed because the script looked for a keyring at a path based on an OSD ID that did not exist. The underlying issue has been fixed, and OSD activation no longer fails when the name of a cluster in a container contains numbers.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-05 23:18:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1437916, 1494421    
Attachments:
Description Flags
ansible-playbook log and osds.yml, all.yml file snippets
none
Contains log snippets of two different OSDs
none
osd log snippet - dedicated journal none

Description Vasishta 2017-06-03 18:06:12 UTC
Description of problem:
OSD activation fails even though the ansible-playbook run completes successfully in the dmcrypt + collocated journal scenario.


Version-Release number of selected component (if applicable):
ceph version 10.2.7-23.el7cp
container image tag: ceph-2-rhel-7-docker-candidate-24815-20170601202916

How reproducible:
always

Steps to Reproduce:
1. Perform preflight ops on all nodes
2. Enable the respective options in group_vars/osds.yml for the encrypted + collocated journal scenario
3. Run the playbook for container installation

Actual results:
OSDs are not getting activated

Expected results:
OSDs must get activated

Additional info:

# ceph -s --cluster 2_3 
    cluster 0d4548f4-c5a7-4c37-ac06-ee15017d9518
     health HEALTH_ERR
            72 pgs are stuck inactive for more than 300 seconds
            72 pgs stuck inactive
            72 pgs stuck unclean
     monmap e2: 3 mons at {magna030=10.8.128.30:6789/0,magna033=10.8.128.33:6789/0,magna042=10.8.128.42:6789/0}
            election epoch 6, quorum 0,1,2 magna030,magna033,magna042
     osdmap e6: 6 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v7: 72 pgs, 2 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  72 creating

Comment 2 Vasishta 2017-06-03 18:12:14 UTC
Created attachment 1284699 [details]
ansible-playbook log and osds.yml, all.yml file snippets

Faced this issue while working on BZ 1452316.
Though the ansible-playbook run completed successfully, the OSDs could not be activated.
The attachment contains the ansible-playbook log and snippets of the osds.yml and all.yml files.

[ubuntu@magna088 ~]$ cat /usr/share/ceph-ansible/group_vars/osds.yml | egrep -v ^# | grep -v ^$
---
dummy:
copy_admin_key: true
devices:
  - /dev/sdb
  - /dev/sdc
osd_containerized_deployment: true
ceph_osd_docker_prepare_env: -e CLUSTER={{ cluster }} -e OSD_JOURNAL_SIZE={{ journal_size }} -e OSD_FORCE_ZAP=1 -e OSD_DMCRYPT=1
ceph_osd_docker_devices: "{{ devices }}"
ceph_osd_docker_extra_env: -e CLUSTER={{ cluster }} -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE -e OSD_JOURNAL_SIZE={{ journal_size }} -e OSD_DMCRYPT=1

Comment 3 Vasishta 2017-06-03 18:21:21 UTC
Created attachment 1284700 [details]
Contains log snippets of two different OSDs

Found two issues regarding two different OSDs in the OSD logs:

1) Jun 03 16:49:11 magna003 ceph-osd-run.sh[24497]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.m1sLEc
Jun 03 16:49:11 magna003 ceph-osd-run.sh[24497]: no valid command found; 10 closest matches:

2)Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: command_check_call: Running command: /bin/mount -o noatime,inode64 -- /dev/mapper/704e3599-8ae9-40c5-b86c-24c36b5c15e8 /var/lib/ceph/osd/2_3-1
Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.fJ_tix
Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: df: '/var/lib/ceph/osd/2_3-2/': No such file or directory
Jun 03 17:00:45 magna013 ceph-osd-run.sh[9768]: 2017-06-03 17:00:45.260147 7f15f1661700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/2_3-2//keyring: (2) No such file or directory

Please refer to the attachment for a larger log snippet.
Sorry for not mentioning in the description that the cluster has a custom cluster name, '2_3' in this case.

Regards,
Vasishta

Comment 4 Vasishta 2017-06-04 17:21:01 UTC
Created attachment 1284765 [details]
osd log snippet - dedicated journal

Hi,

OSD activation is also failing in the dedicated journal scenario, with similar log messages.

1> These lines appear in all the OSD logs (i.e., on all nodes):
 
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: df: '/var/lib/ceph/osd/7_3_2_3-7/': No such file or directory
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: 2017-06-04 16:27:33.883293 7f73b17ba700 -1 auth: unable to find a keyring on /var/lib/ceph/osd/7_3_2_3-7//keyring: (2) No such file or directory
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: 2017-06-04 16:27:33.883303 7f73b17ba700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Jun 04 16:27:33 magna106 ceph-osd-run.sh[15897]: 2017-06-04 16:27:33.883305 7f73b17ba700  0 librados: osd.7 initialization error (2) No such file or directory


Please refer to the attachment for a larger log snippet.
The file also contains the conf file and the all.yml and osds.yml files (used for ceph-ansible).


Regards,
Vasishta

Comment 5 Christina Meno 2017-06-05 15:17:42 UTC
Alfredo please triage this.

Comment 6 Christina Meno 2017-06-05 15:17:57 UTC
Alfredo please triage this.

Comment 8 Federico Lucifredi 2017-06-05 15:25:10 UTC
Release note for 2.3, but keep working on it... this is relevant to 3.0.

Comment 10 Christina Meno 2017-06-05 15:35:09 UTC
Alfredo please disregard request to triage -- we're kicking this out of 2.3

Comment 11 Guillaume Abrioux 2017-06-06 13:33:01 UTC
This bug affects the osd_disk_activate.sh script when the cluster name includes numbers.

When osd_disk_activate.sh is run, it looks up the OSD ID by executing this command:

OSD_ID=$(grep "${MOUNTED_PART}" /proc/mounts | awk '{print $2}' | grep -oh '[0-9]*')

(see: https://github.com/ceph/ceph-docker/blob/master/ceph-releases/jewel/ubuntu/14.04/daemon/osd_scenarios/osd_disk_activate.sh#L43)

When the cluster name includes numbers, the last grep is wrong because '[0-9]*' matches every run of digits in the mount path, including the digits in the cluster name, not just the OSD ID.

++ grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts
++ awk '{print $2}'
++ grep -oh '[0-9]*'
+ OSD_ID='23

For instance:
[root@ceph-osd0 /]# grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts
/dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /var/lib/ceph/osd/23-0 xfs rw,seclabel,noatime,attr2,inode64,noquota 0 0
[root@ceph-osd0 /]# grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts | awk '{print $2}'
/var/lib/ceph/osd/23-0
[root@ceph-osd0 /]# grep /dev/mapper/34c336ea-3594-4c1d-b5f1-9cdf18a8ba6c /proc/mounts | awk '{print $2}' | grep -oh '[0-9]*'
23
0
[root@ceph-osd0 /]#

We should get only '0'.

fix: https://github.com/ceph/ceph-docker/pull/662
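
For illustration, here is a minimal sketch (hypothetical variable names, not necessarily the exact change in the PR above) of extracting only the trailing OSD ID from the mount point so that digits in the cluster name are ignored:

# The OSD ID is always the suffix after the final '-' in the mount point,
# so strip everything up to and including that '-' instead of collecting
# every digit in the path.
osd_path="/var/lib/ceph/osd/2_3-0"   # example mount point taken from /proc/mounts
osd_id="${osd_path##*-}"             # keep only what follows the last '-'
echo "${osd_id}"                     # prints: 0

Anchoring the match on the end of the path this way means a cluster name such as '2_3' (or '7_3_2_3') no longer pollutes the extracted OSD ID.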

Comment 12 Erin Donnelly 2017-06-06 13:40:14 UTC
Guillaume, would you mind setting the "doc text" field for this BZ? It needs to go into the 2.3 Release Notes.

Thanks!
Erin

Comment 13 Erin Donnelly 2017-06-07 12:56:48 UTC
Thank you Guillaume for the doc text info! I have updated it a bit--would you mind taking a look and letting me know if it looks ok?

Comment 14 Guillaume Abrioux 2017-06-07 12:58:55 UTC
(In reply to Erin Donnelly from comment #13)
> Thank you Guillaume for the doc text info! I have updated it a bit--would
> you mind taking a look and letting me know if it looks ok?

looks good to me

Comment 16 Federico Lucifredi 2017-06-08 18:59:48 UTC
Yes, just release-note the bug; no doc addition is necessary.

Comment 17 seb 2017-06-12 10:21:49 UTC
Merged upstream, backport in progress.

Comment 19 Vasishta 2017-07-20 16:48:35 UTC
Changing the summary to a more appropriate one, as OSD activation fails whether or not encryption is used and whether the journal device is collocated or dedicated.

Regards,
Vasishta

Comment 21 Vasishta 2017-10-04 16:48:23 UTC
Tried using ceph-3.0-rhel-7-docker-candidate-31370-20171003232256; it is working fine. Moving the BZ to VERIFIED state.

Comment 26 errata-xmlrpc 2017-12-05 23:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3388