
Bug 1648010

Summary: mixed role container installation fails at random places
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasu Kulkarni <vakulkar>
Component: Container
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Vasu Kulkarni <vakulkar>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.2
CC: ceph-eng-bugs, evelu, gabrioux, hnallurv, kdreyer, seb, shan, tserlin, vakulkar
Target Milestone: rc
Keywords: Automation, AutomationBlocker
Target Release: 3.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rhceph:ceph-3.2-rhel-7-containers-candidate-70340-20181128235616
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1654011
Environment:
Last Closed: 2019-01-03 20:19:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1654011
Bug Blocks:

Description Vasu Kulkarni 2018-11-08 17:52:31 UTC
Description of problem:

Steps:

1) Latest RHCS 3.2 container build on RHEL 7.6

a) Configuration
ceph_conf_overrides:
  client:
    rgw crypt require ssl: false
    rgw crypt s3 kms encryption keys: testkey-1=YmluCmJvb3N0CmJvb3N0LWJ1aWxkCmNlcGguY29uZgo=
      testkey-2=aWIKTWFrZWZpbGUKbWFuCm91dApzcmMKVGVzdGluZwo=
  global:
    mon_max_pg_per_osd: 1024
    osd_default_pool_size: 2
    osd_pool_default_pg_num: 64
    osd_pool_default_pgp_num: 64
  mon:
    mon_allow_pool_delete: true
ceph_docker_image: rhceph
ceph_docker_image_tag: ceph-3.2-rhel-7-containers-candidate-50387-20181108090202
ceph_docker_registry: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
ceph_origin: distro
ceph_repository: rhcs
ceph_stable: true
ceph_stable_release: luminous
ceph_stable_rh_storage: true
ceph_test: true
cephfs_pools:
- name: cephfs_data
  pgs: '8'
- name: cephfs_metadata
  pgs: '8'
containerized_deployment: true
copy_admin_key: true
fetch_directory: ~/fetch/
journal_size: 1024
osd_auto_discovery: false
osd_scenario: collocated
public_network: 172.16.0.0/12

b) Hosts file:

[mons]
ceph-jenkins-build-1541691039066-node2-osdmonmgr monitor_interface=eth0
ceph-jenkins-build-1541691039066-node5-monmgr monitor_interface=eth0
ceph-jenkins-build-1541691039066-node6-monmgr monitor_interface=eth0
[mgrs]
ceph-jenkins-build-1541691039066-node2-osdmonmgr monitor_interface=eth0
ceph-jenkins-build-1541691039066-node5-monmgr monitor_interface=eth0
ceph-jenkins-build-1541691039066-node6-monmgr monitor_interface=eth0
[osds]
ceph-jenkins-build-1541691039066-node9-osd monitor_interface=eth0  devices='["/dev/vdb", "/dev/vdc", "/dev/vdd", "/dev/vde"]' 
ceph-jenkins-build-1541691039066-node3-osdrgw monitor_interface=eth0  devices='["/dev/vdb", "/dev/vdc", "/dev/vdd", "/dev/vde"]' 
ceph-jenkins-build-1541691039066-node2-osdmonmgr monitor_interface=eth0  devices='["/dev/vdb", "/dev/vdc", "/dev/vdd", "/dev/vde"]' 
ceph-jenkins-build-1541691039066-node4-osdmds monitor_interface=eth0  devices='["/dev/vdb", "/dev/vdc", "/dev/vdd", "/dev/vde"]' 
[mdss]
ceph-jenkins-build-1541691039066-node4-osdmds monitor_interface=eth0
ceph-jenkins-build-1541691039066-node7-mds monitor_interface=eth0
[rgws]
ceph-jenkins-build-1541691039066-node8-rgw radosgw_interface=eth0
ceph-jenkins-build-1541691039066-node3-osdrgw radosgw_interface=eth0
[clients]
ceph-jenkins-build-1541691039066-node10-client client_interface=eth0



Full logs:

https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/wip-cephci/job/wip-vshevche-ceph-ansible-sanity-3.x/122/consoleFull

2018-11-08 16:18:49,659 - ceph.ceph - INFO - failed: [ceph-jenkins-build-1541691039066-node4-osdmds] (item=[{'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'/dev/vde', u'script': u"unit 'MiB' print", '_ansible_no_log': False, u'changed': False, 'failed': False, 'item': u'/dev/vde', u'invocation': {u'module_args': {u'part_start': u'0%', u'part_end': u'100%', u'name': None, u'align': u'optimal', u'number': None, u'label': u'msdos', u'state': u'info', u'part_type': u'primary', u'flags': None, u'device': u'/dev/vde', u'unit': u'MiB'}}, u'disk': {u'dev': u'/dev/vde', u'physical_block': 512, u'table': u'unknown', u'logical_block': 512, u'model': u'Virtio Block Device', u'unit': u'mib', u'size': 15360.0}, '_ansible_ignore_errors': None, u'partitions': []}, u'/dev/vde']) => {"changed": true, "cmd": "docker run --net=host --pid=host --privileged=true --name=ceph-osd-prepare-ceph-jenkins-build-1541691039066-node4-osdmds-vde -v /etc/ceph:/etc/ceph:z -v /var/lib/ceph/:/var/lib/ceph/:z -v /dev:/dev -v /etc/localtime:/etc/localtime:ro 
2018-11-08 16:18:49,660 - ceph.ceph - INFO - -e DEBUG=verbose -e CLUSTER=ceph -e CEPH_DAEMON=OSD_CEPH_DISK_PREPARE -e OSD_DEVICE=/dev/vde -e OSD_BLUESTORE=1 -e OSD_FILESTORE=0 -e OSD_DMCRYPT=0 -e OSD_JOURNAL_SIZE=1024 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:ceph-3.2-rhel-7-containers-candidate-50387-20181108090202", "delta": "0:00:25.405561", "end": "2018-11-08 11:18:46.533965", "item": [{"_ansible_ignore_errors": null, "_ansible_item_label": "/dev/vde", "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "disk": {"dev": "/dev/vde", "logical_block": 512, "model": "Virtio Block Device", "physical_block": 512, "size": 15360.0, "table": "unknown", "unit": "mib"}, "failed": false, "invocation": {"module_args": {"align": "optimal", "device": "/dev/vde", "flags": null, "label": "msdos", "name": null, "number": null, "part_end": "100%", "part_start": "0%", "part_type": "primary", "state": "info", "unit": "MiB"}}, "item": "/dev/vde", "partitions": [], "script": "unit 'MiB' print"}, "/dev/vde"], 
2018-11-08 16:18:49,660 - ceph.ceph - INFO - "msg": "non-zero return code", "rc": 1, "start": "2018-11-08 11:18:21.128404", "stderr": "+/entrypoint.sh:19: case \"$KV_TYPE\" in\n+/entrypoint.sh:29: source /config.static.sh\n++/config.static.sh:2: set -e\n++/entrypoint.sh:39: to_lowercase OSD_CEPH_DISK_PREPARE\n++common_functions.sh:189: to_lowercase(): echo osd_ceph_disk_prepare\n+/entrypoint.sh:39: CEPH_DAEMON=osd_ceph_disk_prepare\n+/entrypoint.sh:41: create_mandatory_directories\n+common_functions.sh:64: create_mandatory_directories(): for keyring in '$OSD_BOOTSTRAP_KEYRING' '$MDS_BOOTSTRAP_KEYRING' '$RGW_BOOTSTRAP_KEYRING' '$RBD_MIRROR_BOOTSTRAP_KEYRING'\n++common_functions.sh:65: create_mandatory_directories(): dirname /var/lib/ceph/bootstrap-osd/ceph.keyring\n+common_functions.sh:65: create_mandatory_directories(): mkdir -p /var/lib/ceph/bootstrap-osd\n+common_functions.sh:64: create_mandatory_directories(): for keyring in '$OSD_BOOTSTRAP_KEYRING' '$MDS_BOOTSTRAP_KEYRING' '$RGW_BOOTSTRAP_KEYRING' '$RBD_MIRROR_BOOTSTRAP_KEYRING'\n++common_functions.
2018-11-08 16:18:49,661 - ceph.ceph - INFO - sh:65: create_mandatory_directories(): dirname /var/lib/ceph/bootstrap-mds/ceph.keyring\n+common_functions.sh:65: create_mandatory_directories(): mkdir -p /var/lib/ceph/bootstrap-mds\n+common_functions.sh:64: create_mandatory_directories(): for keyring in '$OSD_BOOTSTRAP_KEYRING' '$MDS_BOOTSTRAP_KEYRING' '$RGW_BOOTSTRAP_KEYRING' '$RBD_MIRROR_BOOTSTRAP_KEYRING'\n++common_functions.sh:65: create_mandatory_directories(): dirname /var/lib/ceph/bootstrap-rgw/ceph.keyring\n+common_functions.sh:65: create_mandatory_directories(): mkdir -p /var/lib/ceph/bootstrap-rgw\n+common_functions.sh:64: create_mandatory_directories(): for keyring in '$OSD_BOOTSTRAP_KEYRING' '$MDS_BOOTSTRAP_KEYRING' '$RGW_BOOTSTRAP_KEYRING' '$RBD_MIRROR_BOOTSTRAP_KEYRING'\n++common_functions.sh:65: create_mandatory_directories(): dirname /var/lib/ceph/bootstrap-rbd/ceph.keyring\n+common_functions.sh:65: create_mandatory_directories(): mkdir -p /var/lib/ceph/bootstrap-rbd\n+common_functions.sh:69: create_mandatory_directories(): for directory in 
2018-11-08 16:18:49,661 - ceph.ceph - INFO - mon osd mds radosgw tmp mgr\n+common_functions.sh:70: create_mandatory_directories(): mkdir -p /var/lib/ceph/mon\n+common_functions.sh:69: create_mandatory_directories(): for directory in mon osd mds radosgw tmp mgr\n+common_functions.sh:70: create_mandatory_directories(): mkdir -p /var/lib/ceph/osd\n+common_functions.sh:69: create_mandatory_directories(): for directory in mon osd mds radosgw tmp mgr\n+common_functions.sh:70: create_mandatory_directories(): mkdir -p /var/lib/ceph/mds\n+common_functions.sh:69: create_mandatory_directories(): for directory in mon osd mds radosgw tmp mgr\n+common_functions.sh:70: create_mandatory_directories(): mkdir -p /var/lib/ceph/radosgw\n+common_functions.sh:69: create_mandatory_directories(): for directory in mon osd mds radosgw tmp mgr\n+common_functions.sh:70: create_mandatory_directories(): mkdir -p /var/lib/ceph/tmp\n+common_functions.sh:69: create_mandatory_directories(): for directory in mon osd mds radosgw tmp mgr\n+common_functions.sh:70: create_mandatory_directori
2018-11-08 16:18:49,661 - ceph.ceph - INFO - es(): mkdir -p /var/lib/ceph/mgr\n+common_functions.sh:74: create_mandatory_directories(): mkdir -p /var/lib/ceph/mon/ceph-ceph-jenkins-build-1541691039066-node4-osdmds\n+common_functions.sh:77: create_mandatory_directories(): mkdir -p /var/run/ceph\n+common_functions.sh:80: create_mandatory_directories(): mkdir -p /var/lib/ceph/radosgw/ceph-rgw.ceph-jenkins-build-1541691039066-node4-osdmds\n+common_functions.sh:83: create_mandatory_directories(): mkdir -p /var/lib/ceph/mds/ceph-ceph-jenkins-build-1541691039066-node4-osdmds\n+common_functions.sh:86: create_mandatory_directories(): mkdir -p /var/lib/ceph/mgr/ceph-ceph-jenkins-build-1541691039066-node4-osdmds\n+common_functions.sh:89: create_mandatory_directories(): chown --verbose -R ceph. /var/run/ceph/\n+common_functions.sh:90: create_mandatory_directories(): find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown --verbose ceph. '{}' ';'\n+/entrypoint.sh:45: case \"$CEPH_DAEMON\" in\n+/entrypoint.sh:81: source start_osd.sh\n++start_osd.sh:2: set -e\n++st
2018-11-08 16:18:49,662 - ceph.ceph - INFO - art_osd.sh:4: is_redhat\n++common_functions.sh:214: is_redhat(): get_package_manager\n++common_functions.sh:207: get_package_manager(): is_available rpm\n++common_functions.sh:58: is_available(): command -v rpm\n++common_functions.sh:208: get_package_manager(): OS_VENDOR=redhat\n++common_functions.sh:215: is_redhat(): [[ redhat == \\r\\e\\d\\h\\a\\t ]]\n++start_osd.sh:5: source /etc/sysconfig/ceph\n+++/etc/sysconfig/ceph:7: TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728\n+++/etc/sysconfig/ceph:18: CEPH_AUTO_RESTART_ON_UPGRADE=no\n+/entrypoint.sh:82: OSD_TYPE=prepare\n+/entrypoint.sh:83: start_osd\n+start_osd.sh:9: start_osd(): get_config\n+/config.static.sh:114: get_config(): log 'static: does not generate config'\n+common_functions.sh:7: log(): '[' -z 'static: does not generate config' ']'\n+common_functions.sh:11: log(): local timestamp\n++common_functions.sh:12: log(): date '+%F %T'\n+common_functions.sh:12: log(): timestamp='2018-11-08 11:18:21'\n+common_functions.sh:13: log(): echo '2018-11-08 11:18:21 
2018-11-08 16:18:49,662 - ceph.ceph - INFO -  /entrypoint.sh: static: does not generate config'\n+common_functions.sh:14: log(): return 0\n+start_osd.sh:10: start_osd(): check_config\n+common_functions.sh:30: check_config(): [[ ! -e /etc/ceph/ceph.conf ]]\n+start_osd.sh:12: start_osd(): '[' 0 -eq 1 ']'\n+start_osd.sh:17: start_osd(): case \"$OSD_TYPE\" in\n+start_osd.sh:31: start_osd(): source osd_disk_prepare.sh\n++osd_disk_prepare.sh:2: source(): set -e\n+start_osd.sh:32: start_osd(): osd_disk_prepare\n+osd_disk_prepare.sh:5: osd_disk_prepare(): [[ -z /dev/vde ]]\n+osd_disk_prepare.sh:10: osd_disk_prepare(): [[ ! -e /dev/vde ]]\n+osd_disk_prepare.sh:15: osd_disk_prepare(): '[' '!' -e /var/lib/ceph/bootstrap-osd/ceph.keyring ']'\n+osd_disk_prepare.sh:20: osd_disk_prepare(): ceph_health client.bootstrap-osd /var/lib/ceph/bootstrap-osd/ceph.keyring\n+common_functions.sh:321: ceph_health(): local bootstrap_user=client.bootstrap-osd\n+common_functions.sh:322: ceph_health(): local bootstrap_key=/var/lib/ceph/bootstrap-osd/ceph.keyring\n+common_functions.sh:
2018-11-08 16:18:49,662 - ceph.ceph - INFO - 324: ceph_health(): timeout 10 ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring health\n+osd_disk_prepare.sh:23: osd_disk_prepare(): grep -qE '^ 1.*ceph data'\n+osd_disk_prepare.sh:23: osd_disk_prepare(): parted --script /dev/vde print\n+osd_disk_prepare.sh:30: osd_disk_prepare(): IFS=' '\n+osd_disk_prepare.sh:30: osd_disk_prepare(): read -r -a CEPH_DISK_CLI_OPTS\n+osd_disk_prepare.sh:31: osd_disk_prepare(): [[ 0 -eq 1 ]]\n+osd_disk_prepare.sh:38: osd_disk_prepare(): [[ 1 -eq 1 ]]\n+osd_disk_prepare.sh:39: osd_disk_prepare(): CEPH_DISK_CLI_OPTS+=(--bluestore)\n+osd_disk_prepare.sh:40: osd_disk_prepare(): ceph-disk -v prepare --cluster ceph --bluestore --block.wal /dev/vde --block.wal-uuid 420a5bff-3e73-422e-94d9-e72ddc21cdfe --block.db /dev/vde --block.db-uuid 7889f7bf-38a9-4cb1-a61d-cf40fd811361 --block-uuid 61376bcd-fc6d-4de7-947d-3ce1de34c51c /dev/vde\ncommand: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid\nget_dm_uuid: get_dm_u
2018-11-08 16:18:49,663 - ceph.ceph - INFO - uid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nset_type: Will colocate block with data on /dev/vde\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_wal_size\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_type\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_type\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
2018-11-08 16:18:49,663 - ceph.ceph - INFO -  --lookup osd_mkfs_options_xfs\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nset_data_partition: Creating osd partition on /dev/vde\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nptype_tobe_for_name: name = data\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\ncreate_partition: Creating data partition num 1 size 100 on /dev/vde\ncommand_check_call: Running command: /usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:c61fe020-db16-4764-b6ac-9007d3ed7802 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/vde\nupdate_partition: Calling partprobe on created device /
2018-11-08 16:18:49,663 - ceph.ceph - INFO - dev/vde\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\ncommand: Running command: /usr/bin/flock -s /dev/vde /usr/sbin/partprobe /dev/vde\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde1 uuid path is /sys/dev/block/253:65/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nptype_tobe_for_name: name = block.db\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\ncreate_partition: Creating block.db partition num 3 size 1024 on /dev/vde\ncommand_check_call: Running command: /usr/sbin/sgdisk --new=3:0:+1024M --change-name=3:ceph block.db --partition-guid=3:7889f7bf-38a9-4cb1-a61d-cf40fd811361 --typecode=3:30cd0809-c2b2-499c-8879-2d6b
2018-11-08 16:18:49,664 - ceph.ceph - INFO - 785292be --mbrtogpt -- /dev/vde\nupdate_partition: Calling partprobe on created device /dev/vde\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\ncommand: Running command: /usr/bin/flock -s /dev/vde /usr/sbin/partprobe /dev/vde\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde3 uuid path is /sys/dev/block/253:67/dm/uuid\nprepare_device: Block.db is GPT partition /dev/disk/by-partuuid/7889f7bf-38a9-4cb1-a61d-cf40fd811361\ncommand_check_call: Running command: /usr/sbin/sgdisk --typecode=3:30cd0809-c2b2-499c-8879-2d6b78529876 -- /dev/vde\nupdate_partition: Calling partprobe on prepared device /dev/vde\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\ncommand: Running command: /usr/bin/flock -s /dev/vde /usr/sbin/partprobe /dev/vde\ncommand_check_
2018-11-08 16:18:49,664 - ceph.ceph - INFO - call: Running command: /usr/bin/udevadm settle --timeout=600\nprepare_device: Block.db is GPT partition /dev/disk/by-partuuid/7889f7bf-38a9-4cb1-a61d-cf40fd811361\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\nptype_tobe_for_name: name = block.wal\nget_dm_uuid: get_dm_uuid /dev/vde uuid path is /sys/dev/block/253:64/dm/uuid\ncreate_partition: Creating block.wal partition num 4 size 576 on /dev/vde\ncommand_check_call: Running command: /usr/sbin/sgdisk --new=4:0:+576M --change-name=4:ceph block.wal --partition-guid=4:420a5bff-3e73-422e-94d9-e72ddc21cdfe 


2018-11-08 16:24:51,522 - ceph.ceph - INFO - 
PLAY RECAP *********************************************************************
ceph-jenkins-build-1541691039066-node10-client : ok=122  changed=10   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node2-osdmonmgr : ok=342  changed=34   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node3-osdrgw : ok=242  changed=23   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node4-osdmds : ok=143  changed=11   unreachable=0    failed=1   
ceph-jenkins-build-1541691039066-node5-monmgr : ok=227  changed=23   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node6-monmgr : ok=231  changed=24   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node7-mds : ok=137  changed=15   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node8-rgw : ok=131  changed=14   unreachable=0    failed=0   
ceph-jenkins-build-1541691039066-node9-osd : ok=165  changed=16   unreachable=0    failed=0
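
To spot which host failed in a recap like the one above without reading every line, the counters can be parsed mechanically. The following is a minimal Python sketch, not part of ceph-ansible or the test framework; the function name `failed_hosts` and the regular expression are assumptions made for illustration and only cover the recap fields shown above.

```python
import re

# Matches one PLAY RECAP host line, e.g.
# "some-host : ok=143  changed=11   unreachable=0    failed=1"
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)\s*$"
)

def failed_hosts(recap_text):
    """Return hosts whose recap line reports failed > 0 or unreachable > 0."""
    bad = []
    for line in recap_text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match and (int(match["failed"]) or int(match["unreachable"])):
            bad.append(match["host"])
    return bad

# Two lines copied from the recap above.
recap = """\
ceph-jenkins-build-1541691039066-node3-osdrgw : ok=242  changed=23   unreachable=0    failed=0
ceph-jenkins-build-1541691039066-node4-osdmds : ok=143  changed=11   unreachable=0    failed=1
"""
print(failed_hosts(recap))
```

Running this over the recap of several builds would show whether the failing host changes from run to run.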

Comment 3 Sébastien Han 2018-11-12 12:37:11 UTC
Why aren't you testing ceph-volume lvm? Using ceph-disk based provisioning is not encouraged anymore.
Also, I don't see any "mixed role container installation"; the group_vars file you shared is trying to configure collocated OSDs on all the OSD machines. However, based on your logs I have a different impression.

Comment 4 Vasu Kulkarni 2018-11-12 19:30:12 UTC
The support matrix comes from Trello: both ceph-disk and ceph-volume are supported in RHCS. In fact, many deployments are still on ceph-disk until they are migrated. The roles are collocated.

For example, in the excerpt below the same node is both osd and mds; likewise, other nodes are collocated based on the config file above.

[osds]
ceph-jenkins-build-1541691039066-node4-osdmds monitor_interface=eth0  devices='["/dev/vdb", "/dev/vdc", "/dev/vdd", "/dev/vde"]' 
[mdss]
ceph-jenkins-build-1541691039066-node4-osdmds monitor_interface=eth0
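
A host counts as "mixed role" here when it is listed under more than one inventory group. The following is a minimal Python sketch for listing such hosts from a hosts file like the one in the description. The function names are hypothetical, the host names in the sample are shortened stand-ins for the real ones, and the parser only handles plain `[group]` sections (sections containing a colon, such as `:vars`, are skipped).

```python
from collections import defaultdict

def groups_per_host(inventory_text):
    """Map each host to the list of inventory groups it is listed under."""
    groups = defaultdict(list)
    current = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
            # Skip sections such as [osds:vars] or [all:children].
            current = None if ":" in name else name
            continue
        if current is not None:
            host = line.split()[0]  # first token is the host, the rest are variables
            groups[host].append(current)
    return dict(groups)

def mixed_role_hosts(inventory_text):
    """Return only the hosts that appear in more than one group."""
    return {h: g for h, g in groups_per_host(inventory_text).items() if len(g) > 1}

# Shortened, illustrative host names.
inventory = """\
[mons]
node2-osdmonmgr monitor_interface=eth0
[osds]
node2-osdmonmgr monitor_interface=eth0
node4-osdmds monitor_interface=eth0
[mdss]
node4-osdmds monitor_interface=eth0
node7-mds monitor_interface=eth0
"""
print(mixed_role_hosts(inventory))
```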

Comment 5 Sébastien Han 2018-11-19 09:33:27 UTC
It'll be easier for us to log into an env that has this issue if possible. Thanks.

Comment 6 Vasu Kulkarni 2018-11-19 20:46:42 UTC
It is failing on random nodes, not always in the same place. We will try to recreate this, but in the meantime please check the logs and let me know if there is anything we can collect when the test fails that would help debug offline.

Full logs:

https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/wip-cephci/job/wip-vshevche-ceph-ansible-sanity-3.x/122/consoleFull

Comment 8 Sébastien Han 2018-11-21 16:37:40 UTC
Thanks Vasu, I'm presently looking into this. I'll let you know ASAP.

Comment 9 Sébastien Han 2018-11-21 17:56:48 UTC
Ok, I've looked closely at this issue. It seems that what you encountered is a well-known race condition: ceph-disk creates partitions, but udev is late in informing us that the partition has been created. Thus the node (/dev/vdb4 in this case) is not yet created when we try to chown it. As you can see, you got this error quite randomly, not always on the same host, nor on the same drive AFAIR. The container context makes the timing issue between udev and the kernel even harder to diagnose.

The udev race condition is one of the reasons why we switched to LVM and no longer use partitions.

Also, please note that the only occasion we have seen this is in VMs. We have had a single case on bare metal and we fixed the ceph-disk code for this.
Your issue is just another mutation of this race condition, that can happen anytime at different places, so yeah it's random.
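
The general mitigation for this kind of race is to wait for the device node to appear before touching it. The sketch below illustrates that idea in Python. It is not the fix referred to in this comment, nor code from ceph-disk or ceph-container; the helper name `wait_for_path` and its default timings are assumptions.

```python
import os
import time

def wait_for_path(path, timeout=10.0, interval=0.2):
    """Poll until `path` exists or `timeout` seconds elapse.

    Returns True if the path appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would check the result before running chown or mkfs on a freshly created partition, for example `if not wait_for_path("/dev/vdb4"): raise RuntimeError("partition node did not appear")`. Polling only narrows the window; it does not remove the underlying ordering problem between the kernel and udev.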


I just produced a fix for this.

Comment 10 Sébastien Han 2018-11-22 10:05:46 UTC
I just pushed a new commit in ceph-3.2-rhel-7 so a new container image should build soon. Once available please give it a try, thanks.

Comment 11 Vasu Kulkarni 2018-11-26 20:49:47 UTC
Hi Sébastien,

Is it fixed in this build? I am still seeing the issue with the container build below.

ceph_docker_image: rhceph
ceph_docker_image_tag: ceph-3.2-rhel-7-containers-candidate-38188-20181121222025
ceph_docker_registry: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888

Full logs:
https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/wip-cephci/job/wip-ceph-containerized-ansible-sanity-3.x-rhel7.5/4/consoleFull

Comment 12 seb 2018-11-27 10:08:26 UTC
The image does not contain the fix. However, the issue you're hitting is again a mutation of the race condition, see:

2018-11-26 19:45:15,341 - ceph.ceph - INFO - e00e07\ncommand_check_call: Running command: /usr/sbin/sgdisk --typecode=2:cafecafe-9b03-4f30-b4c6-b4b80ceff106 -- /dev/vdb\nupdate_partition: Calling partprobe on prepared device /dev/vdb\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\ncommand: Running command: /usr/bin/flock -s /dev/vdb /usr/sbin/partprobe /dev/vdb\ncommand_check_call: Running command: /usr/bin/udevadm settle --timeout=600\nprepare_device: Block is GPT partition /dev/disk/by-partuuid/9e858435-d484-40d5-822d-87e8e0e00e07\npopulate_data_path_device: Creating xfs fs on /dev/vdb1\ncommand_check_call: Running command: /usr/sbin/mkfs -t xfs -f -i size=2048 -- /dev/vdb1\n/dev/vdb1: No such file or directory\nUsage: mkfs.xfs\n/* blocksize */\t\t[-b log=n|size=num]\n/* metadata */\t\t[-m crc=0|1,finobt=0|1,uuid=xxx]\n/* data subvol */\t[-d agcount=n,agsize=n,file,name=xxx,size=num,\n\t\t\t    (sunit=value,swidth=value|su=num,sw=num|noalign),\n\t\t\t    sectlog=n|sectsize=num\n/* force overwrite */\t[-f]\n/* inode size */

mkfs fails to create an fs on the partition because the node is not present yet.
Unfortunately, I cannot assist you further with this since this issue is in ceph-disk itself.
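
When triaging several failed runs, it can help to pull the missing device nodes out of the captured stderr. The following is a small Python sketch under stated assumptions: the helper name `missing_device_nodes` is hypothetical, and the pattern only targets the `<device>: No such file or directory` message seen in the log above.

```python
import re

# Matches messages such as "/dev/vdb1: No such file or directory".
MISSING_NODE = re.compile(r"(/dev/[\w/-]+): No such file or directory")

def missing_device_nodes(log_text):
    """Return the device paths reported as missing, in first-seen order."""
    # Ansible prints stderr with literal "\n" sequences; turn them into newlines.
    text = log_text.replace("\\n", "\n")
    seen = []
    for dev in MISSING_NODE.findall(text):
        if dev not in seen:
            seen.append(dev)
    return seen

# Fragment shaped like the stderr excerpt above (literal backslash-n sequences).
sample = (
    "command_check_call: Running command: /usr/sbin/mkfs -t xfs -f -i size=2048 -- /dev/vdb1\\n"
    "/dev/vdb1: No such file or directory\\nUsage: mkfs.xfs\\n"
)
print(missing_device_nodes(sample))
```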

The original issue has been fixed, so I'm moving this to POST. Feel free to open a new BZ under ceph-disk with the issue you just encountered.

Thanks.

Comment 13 Vasu Kulkarni 2018-11-27 20:31:48 UTC
Sébastien,

You said the image doesn't have the fix, but the bug is in POST. Can you clarify which build has the fix?

Comment 14 seb 2018-11-28 16:00:44 UTC
Vasu, I don't know which one has it since I don't know where to find the images in the build system. Anyway, your original issue is fixed for sure, and the new bug is something we cannot solve in the ceph-container code, only in ceph-disk itself.
 
Ken, can you help with finding the container image? Thanks

Comment 16 seb 2018-11-29 09:05:04 UTC
Moving this again to POST, the fix is in the container image.

Comment 19 errata-xmlrpc 2019-01-03 20:19:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0021