Bug 1672025 - "failed to bind the UNIX domain socket" warning happens on 1 of 3 controller nodes after deployed Ceph with RHOSP13 Director
Summary: "failed to bind the UNIX domain socket" warning happens on 1 of 3 controller ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 3.*
Assignee: Sébastien Han
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 1578730
 
Reported: 2019-02-03 09:25 UTC by Meiyan Zheng
Modified: 2019-11-29 08:24 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-26 17:08:42 UTC
Embargoed:


Attachments: none


Links:
Red Hat Knowledge Base (Solution) 3747511 (last updated 2019-03-15 13:02:45 UTC)

Comment 1 Giulio Fidente 2019-02-06 16:03:00 UTC
Even though admin_socket is not set in ceph.conf, we are not seeing this issue when deploying with ceph-ansible-3.2.5-1.el7cp.noarch (undercloud) and ceph-common-12.2.8-76.el7cp.x86_64 (overcloud).

It might be useful to run "ceph -s" within the ceph-mon docker container, instead of running it from the node hosting the container ... but the ceph.conf file contents should be the same.

Can you try again with the latest version of ceph-ansible and also report about what version of ceph-common is installed in the overcloud image?

Comment 2 Yogev Rabl 2019-02-26 19:10:24 UTC
Please use the following command:
docker exec <monitor docker container> ceph -s

Comment 3 Meiyan Zheng 2019-04-11 07:52:12 UTC
(In reply to Yogev Rabl from comment #2)
> Please use the following command
> docker exec <monitor docker container> ceph -s

# docker exec 2a7817fa6bba ceph -s
2019-04-11 07:48:37.734233 7f0bf064c700 -1 asok(0x7f0be8000fe0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
  cluster:
    id:     b54035fc-2469-11e9-a332-5254004fb0be
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
    mgr: overcloud-controller-1(active), standbys: overcloud-controller-0, overcloud-controller-2
    osd: 9 osds: 9 up, 9 in
    rgw: 3 daemons active
 
  data:
    pools:   11 pools, 1408 pgs
    objects: 1.68k objects, 2.17GiB
    usage:   7.99GiB used, 397GiB / 405GiB avail
    pgs:     1408 active+clean

The warning is still happening.

And the customer confirmed that running ceph -w will reproduce the issue.
Do you think we need to provide some method to add the following configuration when deploying with director?

+++++
[client]
admin socket = /var/run/ceph/$name.$pid.asok
+++++
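
For reference, a rough sketch of how that override would look on the ceph-ansible side (ceph_conf_overrides does take per-section keys; whether and how director exposes this for the [client] section is exactly what I'm asking about, so treat the exact wiring as an assumption):

+++++
# hypothetical group_vars entry for ceph-ansible; not a confirmed director interface
ceph_conf_overrides:
  client:
    admin socket: /var/run/ceph/$name.$pid.asok
+++++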


Best Regards,
Meiyan

Comment 4 Sébastien Han 2019-04-11 08:02:47 UTC
Can you check that:

* /var/run/ceph/ceph-client.admin.asok is not in use before issuing the ceph command
* if used, then by which process
* please show the ceph.conf
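
For the first two checks, something along these lines should be enough (a sketch, assuming ls and ss are available on the controller):

++++
# does the socket file already exist, and with what ownership?
ls -l /var/run/ceph/ceph-client.admin.asok
# which process, if any, is holding the UNIX socket open?
ss -xlp | grep ceph-client.admin.asok
++++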

Thanks

Comment 5 Meiyan Zheng 2019-04-11 08:07:49 UTC
(In reply to leseb from comment #4)
> Can you check that:
>

Thanks for your quick response!

> * /var/run/ceph/ceph-client.admin.asok is not in use before issuing the ceph command
A: /var/run/ceph/ceph-client.admin.asok does not exist before running ceph -w; after running
   ceph -w, it is created with root:root ownership:
 
  srwxr-xr-x. 1 root root 0 Apr 11 07:48 ceph-client.admin.asok


> * if used, then by which process
A: by the ceph -w process

> * please show the ceph.conf
A: ceph.conf is created when deploying with director. Here is the ceph.conf:

++++
# cat /etc/ceph/ceph.conf 
[client.rgw.overcloud-controller-0]
host = overcloud-controller-0
keyring = /var/lib/ceph/radosgw/ceph-rgw.overcloud-controller-0/keyring
log file = /var/log/ceph/ceph-rgw-overcloud-controller-0.log
rgw frontends = civetweb port=172.16.1.15:8080 num_threads=100

[client.rgw.overcloud-controller-1]
host = overcloud-controller-1
keyring = /var/lib/ceph/radosgw/ceph-rgw.overcloud-controller-1/keyring
log file = /var/log/ceph/ceph-rgw-overcloud-controller-1.log
rgw frontends = civetweb port=172.16.1.4:8080 num_threads=100

[client.rgw.overcloud-controller-2]
host = overcloud-controller-2
keyring = /var/lib/ceph/radosgw/ceph-rgw.overcloud-controller-2/keyring
log file = /var/log/ceph/ceph-rgw-overcloud-controller-2.log
rgw frontends = civetweb port=172.16.1.30:8080 num_threads=100

# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
# let's force the admin socket the way it was so we can properly check for existing instances
# also the line $cluster-$name.$pid.$cctid.asok is only needed when running multiple instances
# of the same daemon, thing ceph-ansible cannot do at the time of writing
admin socket = "$run_dir/$cluster-$name.asok"
cluster network = 172.16.3.0/24
filestore_max_sync_interval = 10
fsid = b54035fc-2469-11e9-a332-5254004fb0be
log file = /dev/null
mon cluster log file = /dev/null
mon host = 172.16.1.30,172.16.1.4,172.16.1.15
mon initial members = overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
mon_max_pg_per_osd = 3072
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 3
public network = 172.16.1.0/24
rgw_keystone_accepted_roles = Member, admin
rgw_keystone_admin_domain = default
rgw_keystone_admin_password = kHpZsZf2KZWgeQEWGmN24GQun
rgw_keystone_admin_project = service
rgw_keystone_admin_user = swift
rgw_keystone_api_version = 3
rgw_keystone_implicit_tenants = true
rgw_keystone_revocation_interval = 0
rgw_keystone_url = http://172.16.2.5:5000
rgw_s3_auth_use_keystone = true
# enable rgw usage log
rgw enable usage log = true
rgw usage log tick interval = 30
rgw usage log flush threshold = 1024
rgw usage max shards = 32
rgw usage max user shards = 1
rgw_enable_ops_log = true
rgw_log_http_headers = http_x_forwarded_for, http_expect, http_content_md5
++++

I can reproduce this in my test env. 
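A sketch of the reproduction, assuming the collision is simply two CLI clients competing for the single $run_dir/$cluster-$name.asok path set in [global] above:

++++
# terminal 1: keeps /var/run/ceph/ceph-client.admin.asok bound while it runs
ceph -w
# terminal 2, on the same node: presumably fails to bind the same path and prints the warning
ceph -s
++++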
Please let me know if you need further information. 


Best Regards,
Meiyan

Comment 11 Giridhar Ramaraju 2019-08-20 07:17:17 UTC
Setting the severity of this defect to "High" with a bulk update. Please
refine it to a more accurate value, as defined by the severity definitions in
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity

