Bug 1672025
Summary: | "failed to bind the UNIX domain socket" warning happens on 1 of 3 controller nodes after deployed Ceph with RHOSP13 Director | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Meiyan Zheng <mzheng> |
Component: | Ceph-Ansible | Assignee: | Sébastien Han <shan> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Vasishta <vashastr> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.1 | CC: | aschoen, ceph-eng-bugs, gabrioux, gfidente, gmeno, mzheng, nthomas, sankarshan, shan, tpetr, yrabl |
Target Milestone: | rc | Keywords: | Reopened |
Target Release: | 3.* | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-08-26 17:08:42 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1578730 |
Comment 1
Giulio Fidente
2019-02-06 16:03:00 UTC
Please use the following command

docker exec <monitor docker container> ceph -s

(In reply to Yogev Rabl from comment #2)
> Please use the following command
> docker exec <monitor docker container> ceph -s

# docker exec 2a7817fa6bba ceph -s
2019-04-11 07:48:37.734233 7f0bf064c700 -1 asok(0x7f0be8000fe0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
  cluster:
    id:     b54035fc-2469-11e9-a332-5254004fb0be
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
    mgr: overcloud-controller-1(active), standbys: overcloud-controller-0, overcloud-controller-2
    osd: 9 osds: 9 up, 9 in
    rgw: 3 daemons active

  data:
    pools:   11 pools, 1408 pgs
    objects: 1.68k objects, 2.17GiB
    usage:   7.99GiB used, 397GiB / 405GiB avail
    pgs:     1408 active+clean

The warning is still happening, and the customer confirmed that running ceph -w reproduces the issue. Do you think we need to provide a method to add the following configuration when deploying with director?

+++++
[client]
admin socket = /var/run/ceph/$name.$pid.asok
+++++

Best Regards,
Meiyan

Can you check that:

* /var/run/ceph/ceph-client.admin.asok is not in use before issuing the ceph command
* if used, then by which process
* please show the ceph.conf

Thanks

(In reply to leseb from comment #4)
> Can you check that:

Thanks for your quick response!

> * /var/run/ceph/ceph-client.admin.asok is not in use before issuing the ceph command
A: /var/run/ceph/ceph-client.admin.asok is not created before running ceph -w; after running ceph -w, it is created as root:root:

srwxr-xr-x. 1 root root 0 Apr 11 07:48 ceph-client.admin.asok

> * if used, then by which process
A: with ceph -w

> * please show the ceph.conf
A: ceph.conf is created when deploying with director.
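For the "is the socket in use, and by which process" check, a minimal sketch of how it could be done, assuming lsof and ss are available on the controller (or inside the monitor container, depending on how /var/run/ceph is mounted; neither is confirmed in this report):

++++
# Does the socket file already exist, and who owns it?
ls -l /var/run/ceph/ceph-client.admin.asok

# Which process, if any, has the socket open (run as root to see the PID)?
lsof /var/run/ceph/ceph-client.admin.asok

# Alternative: list listening UNIX sockets together with the owning process
ss -xlp | grep ceph-client.admin.asok
++++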
Here is the ceph.conf:

++++
# cat /etc/ceph/ceph.conf
[client.rgw.overcloud-controller-0]
host = overcloud-controller-0
keyring = /var/lib/ceph/radosgw/ceph-rgw.overcloud-controller-0/keyring
log file = /var/log/ceph/ceph-rgw-overcloud-controller-0.log
rgw frontends = civetweb port=172.16.1.15:8080 num_threads=100

[client.rgw.overcloud-controller-1]
host = overcloud-controller-1
keyring = /var/lib/ceph/radosgw/ceph-rgw.overcloud-controller-1/keyring
log file = /var/log/ceph/ceph-rgw-overcloud-controller-1.log
rgw frontends = civetweb port=172.16.1.4:8080 num_threads=100

[client.rgw.overcloud-controller-2]
host = overcloud-controller-2
keyring = /var/lib/ceph/radosgw/ceph-rgw.overcloud-controller-2/keyring
log file = /var/log/ceph/ceph-rgw-overcloud-controller-2.log
rgw frontends = civetweb port=172.16.1.30:8080 num_threads=100

# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
# let's force the admin socket the way it was so we can properly check for existing instances
# also the line $cluster-$name.$pid.$cctid.asok is only needed when running multiple instances
# of the same daemon, thing ceph-ansible cannot do at the time of writing
admin socket = "$run_dir/$cluster-$name.asok"
cluster network = 172.16.3.0/24
filestore_max_sync_interval = 10
fsid = b54035fc-2469-11e9-a332-5254004fb0be
log file = /dev/null
mon cluster log file = /dev/null
mon host = 172.16.1.30,172.16.1.4,172.16.1.15
mon initial members = overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
mon_max_pg_per_osd = 3072
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 3
public network = 172.16.1.0/24
rgw_keystone_accepted_roles = Member, admin
rgw_keystone_admin_domain = default
rgw_keystone_admin_password = kHpZsZf2KZWgeQEWGmN24GQun
rgw_keystone_admin_project = service
rgw_keystone_admin_user = swift
rgw_keystone_api_version = 3
rgw_keystone_implicit_tenants = true
rgw_keystone_revocation_interval = 0
rgw_keystone_url = http://172.16.2.5:5000
rgw_s3_auth_use_keystone = true

# enable rgw usage log
rgw enable usage log = true
rgw usage log tick interval = 30
rgw usage log flush threshold = 1024
rgw usage max shards = 32
rgw usage max user shards = 1
rgw_enable_ops_log = true
rgw_log_http_headers = http_x_forwarded_for, http_expect, http_content_md5
++++

I can reproduce this in my test environment. Please let me know if you need further information.

Best Regards,
Meiyan

Level setting the severity of this defect to "High" with a bulk update. Please refine it to a more accurate value, as defined by the severity definition in https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity
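On the earlier question about providing a way to add the [client] admin socket override when deploying with director: ceph-ansible does expose a ceph_conf_overrides variable for extra ceph.conf sections, and director can feed extra variables to ceph-ansible. A sketch only, assuming the CephAnsibleExtraConfig parameter in the OSP13 ceph-ansible integration is the right plumbing; the parameter name and file path are assumptions, not taken from this bug:

++++
# Illustrative environment file: hand ceph-ansible a per-PID admin socket for
# client processes so interactive "ceph" commands no longer collide on
# /var/run/ceph/ceph-client.admin.asok.
# (CephAnsibleExtraConfig and the file path are assumptions for illustration.)
cat > /home/stack/ceph-client-asok.yaml <<'EOF'
parameter_defaults:
  CephAnsibleExtraConfig:
    ceph_conf_overrides:
      client:
        admin socket: /var/run/ceph/$name.$pid.asok
EOF

# Then include it in the existing deploy command, for example:
#   openstack overcloud deploy ... -e /home/stack/ceph-client-asok.yaml
++++

The idea behind the per-PID path is that each client process binds its own socket, so the "(17) File exists" collision on the shared ceph-client.admin.asok cannot occur; the trade-off is that stale .asok files may be left behind if a client exits uncleanly.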