Bug 1328412 - [RBD] All RBD commands hang in the latest repos of ceph-ansible and ceph
Summary: [RBD] All RBD commands hang in the latest repos of ceph-ansible and ceph
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Installer
Version: 2.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Christina Meno
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-19 10:49 UTC by Tejas
Modified: 2017-12-13 00:23 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-05 04:14:05 UTC
Target Upstream Version:



Description Tejas 2016-04-19 10:49:12 UTC
Description of problem:
All rbd commands hang indefinitely in the latest build of ceph.
The repos can be found here:

http://puddle.ceph.redhat.com/puddles/rhscon/2/latest/
http://puddle.ceph.redhat.com/puddles/ceph/2/latest/

The cluster install goes through fine, but the cluster then stays stuck in HEALTH_ERR.



Version-Release number of selected component (if applicable):
ceph 10.1.1

How reproducible:
Always

Steps to Reproduce:
1. Install a ceph cluster.
2. Run any rbd command.


Additional info:


[root@magna009 ~]# ceph osd tree
2016-04-19 10:22:07.690321 7f8c3f7dd700 -1 asok(0x7f8c38001680) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/rbd-clients/ceph-client.admin.3095.140240211679152.asok': (2) No such file or directory
ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 13.49236 root default                                        
-2  2.69847     host magna052                                   
 0  0.89949         osd.0          up  1.00000          1.00000 
 9  0.89949         osd.9          up  1.00000          1.00000 
13  0.89949         osd.13         up  1.00000          1.00000 
-3  2.69847     host magna077                                   
 2  0.89949         osd.2          up  1.00000          1.00000 
 6  0.89949         osd.6          up  1.00000          1.00000 
11  0.89949         osd.11         up  1.00000          1.00000 
-4  2.69847     host magna046                                   
 1  0.89949         osd.1          up  1.00000          1.00000 
 3  0.89949         osd.3          up  1.00000          1.00000 
 7  0.89949         osd.7          up  1.00000          1.00000 
-5  2.69847     host magna080                                   
 5  0.89949         osd.5          up  1.00000          1.00000 
12  0.89949         osd.12         up  1.00000          1.00000 
14  0.89949         osd.14         up  1.00000          1.00000 
-6  2.69847     host magna058                                   
 4  0.89949         osd.4          up  1.00000          1.00000 
 8  0.89949         osd.8          up  1.00000          1.00000 
10  0.89949         osd.10         up  1.00000          1.00000 
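The repeated AdminSocket bind failure in the output above points at a missing socket directory on the client, separate from the hang itself. A minimal check/workaround sketch (assuming the default `rbd-clients` admin socket path shown in the error message) would be:

```shell
SOCK_DIR=/var/run/ceph/rbd-clients

# The "(2) No such file or directory" bind error means this directory
# does not exist on the client node
[ -d "$SOCK_DIR" ] || echo "missing: $SOCK_DIR"

# Creating it silences the AdminSocket warning; it does not by itself
# fix the stuck-peering PGs
mkdir -p "$SOCK_DIR"
```

This only addresses the warning; the command hang is a separate problem.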


[root@magna009 ~]# ceph -s
2016-04-19 10:31:53.320634 7ffa9aa85700 -1 asok(0x7ffa94001680) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/rbd-clients/ceph-client.admin.3127.140714201585584.asok': (2) No such file or directory
    cluster aab37108-8f14-49fe-b581-d5c9bd63b5ac
     health HEALTH_ERR
            23 pgs are stuck inactive for more than 300 seconds
            23 pgs peering
            23 pgs stuck inactive
     monmap e1: 3 mons at {magna009=10.8.128.9:6789/0,magna031=10.8.128.31:6789/0,magna046=10.8.128.46:6789/0}
            election epoch 12, quorum 0,1,2 magna009,magna031,magna046
     osdmap e92: 15 osds: 15 up, 15 in; 7 remapped pgs
            flags sortbitwise
      pgmap v275: 192 pgs, 2 pools, 0 bytes data, 0 objects
            525 MB used, 13815 GB / 13815 GB avail
                 169 active+clean
                  11 peering
                   7 remapped+peering
                   5 creating+peering


[root@magna046 ~]# rbd create Tejas/img --size 5G
2016-04-19 10:40:51.530122 7f00370fad80 -1 asok(0x7f0041fe7e60) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/rbd-clients/ceph-client.admin.4070.139639083925648.asok': (2) No such file or directory
^C
[root@magna046 ~]#

[root@magna031 ~]# rbd ls -l Tejas
2016-04-19 10:34:07.905834 7f157ea7bd80 -1 asok(0x7f1589de3e30) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/rbd-clients/ceph-client.admin.2833.139730484084880.asok': (2) No such file or directory
2016-04-19 10:34:07.908651 7f1560557700  0 -- 10.8.128.31:0/3899757158 >> 10.8.128.46:6808/3521 pipe(0x7f1589e4a620 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1589e4b8e0).fault
^C
[root@magna031 ~]#
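The `pipe ... fault` line in the transcript above suggests the client cannot reach the OSD at 10.8.128.46:6808, which would be consistent with a firewall problem. A quick connectivity check from the client node (a sketch, assuming `nc` from the nmap-ncat package is installed) would be:

```shell
# Probe the monitor port (this one evidently works, since ceph -s returns)
nc -zv -w 3 10.8.128.46 6789

# Probe the OSD port named in the fault message
nc -zv -w 3 10.8.128.46 6808

# On the OSD host, list iptables rules that might be dropping OSD traffic
iptables -L INPUT -n --line-numbers
```

If the monitor port connects but the OSD port does not, firewall rules on the OSD hosts are the likely culprit.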

Comment 2 Tanay Ganguly 2016-04-19 11:09:50 UTC
I saw the same thing yesterday and thought it was something to do with my network configuration.

I checked the OSD logs; the OSDs were reporting connectivity issues.

After I flushed the iptables rules, it started working for me with the same build.

[CEPH-2]
name=CEPH-2
baseurl=http://puddle.ceph.redhat.com/puddles/ceph/2/2016-04-16.1/CEPH-2/$basearch/os
gpgcheck=0
enabled=1

[CEPH-2-debug]
name=CEPH-2 Debuginfo
baseurl=http://puddle.ceph.redhat.com/puddles/ceph/2/2016-04-16.1/CEPH-2/$basearch/debuginfo
gpgcheck=0
enabled=0

[CEPH-2-sources]
name=CEPH-2 Sources
baseurl=http://puddle.ceph.redhat.com/puddles/ceph/2/2016-04-16.1/CEPH-2/source
gpgcheck=0
enabled=0

Note: For the earlier build, I didn't need to flush iptables.
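For reference, "flushing iptables" as described above would look roughly like this (a sketch; flushing removes all rules and leaves the node wide open, so it is only appropriate on test machines):

```shell
# Inspect the current rules first, so they can be restored if needed
iptables -S

# Flush all rules in the filter table (test clusters only)
iptables -F

# Re-check cluster health afterwards
ceph -s
```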

Comment 3 Ken Dreyer (Red Hat) 2016-04-26 15:17:54 UTC
It's my understanding that you must open the ports in the firewalls on your cluster nodes prior to running the installer.

Did you do that?

Comment 4 Christina Meno 2016-05-04 21:06:24 UTC
Tejas,

Would you please answer Ken's query "It's my understanding that you must open the ports in the firewalls on your cluster nodes prior to running the installer.

Did you do that?"

Comment 5 Ken Dreyer (Red Hat) 2016-05-04 21:11:12 UTC
For example, on the monitors, I use the following (Ansible):

- firewalld:
    port: 6789/tcp
    immediate: true
    permanent: true
    state: enabled

And on the OSDs:

- firewalld:
    port: 6800-7300/tcp
    immediate: true
    permanent: true
    state: enabled
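The same ports can also be opened directly with firewall-cmd, without going through Ansible (assuming firewalld is running on the nodes):

```shell
# On the monitor nodes
firewall-cmd --permanent --add-port=6789/tcp

# On the OSD nodes
firewall-cmd --permanent --add-port=6800-7300/tcp

# Apply the permanent rules to the running configuration
firewall-cmd --reload
```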

Comment 6 Tejas 2016-05-05 04:14:05 UTC
Hi Ken,

We do have the same firewall ports open when running the installer.
However, this issue was seen when we upgraded our setup to a different ceph repo.
We are not seeing this issue now.
I will go ahead and close this bug for now.

Thanks,
Tejas

