Bug 1493192 - IPv6 ceph: librados: client.admin authentication error (110) Connection timed out
Summary: IPv6 ceph: librados: client.admin authentication error (110) Connection timed out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 12.0 (Pike)
Assignee: John Fulton
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-19 14:44 UTC by Alexander Chuzhoy
Modified: 2018-02-05 19:15 UTC
CC List: 15 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.2-0.20171007062244.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:10:20 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1715246 0 None None None 2017-10-12 20:17:54 UTC
OpenStack gerrit 511996 0 None MERGED Remove monitor_interface from ceph-ansible parameters 2020-05-16 08:53:18 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Alexander Chuzhoy 2017-09-19 14:44:32 UTC
ceph: librados: client.admin authentication error (110) Connection timed out", "Error connecting to cluster: TimedOut

Environment:
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch
puppet-ceph-2.4.1-0.20170911230204.ebea4b7.el7ost.noarch
instack-undercloud-7.4.1-0.20170912115418.el7ost.noarch
ceph-ansible-3.0.0-0.1.rc8.1.el7cp.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170913050523.0rc2.el7ost.noarch
ceph-common-10.2.7-32.el7cp.x86_64
ceph-mon-10.2.7-32.el7cp.x86_64
libcephfs1-10.2.7-32.el7cp.x86_64
python-cephfs-10.2.7-32.el7cp.x86_64
ceph-base-10.2.7-32.el7cp.x86_64
ceph-radosgw-10.2.7-32.el7cp.x86_64
puppet-ceph-2.4.1-0.20170911230204.ebea4b7.el7ost.noarch
ceph-mds-10.2.7-32.el7cp.x86_64
ceph-selinux-10.2.7-32.el7cp.x86_64


Steps to reproduce:

Deploy overcloud with vlan+ipv6 using:
openstack overcloud deploy --templates \
--libvirt-type kvm \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /home/stack/templates/nodes_data.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /home/stack/virt/network/network-environment-v6.yaml \
-e /home/stack/rhos12.yaml

The overcloud is deployed but the OSDs are down.
Apply the workarounds for these bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1491027
https://bugzilla.redhat.com/show_bug.cgi?id=1491780

Then re-run the deployment command.


Result:
The re-deployment fails with:

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: 01550cd1-340c-4430-b42c-b9f134952c18
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR



Debugging with Mistral shows this error (truncated output):
a0).fault\", \"2017-09-19 04:12:00.125276 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704008030).fault\", \"2017-09-19 04:12:03.125571 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008e70 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:06.125874 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704008030).fault\", \"2017-09-19 04:12:09.126176 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008e70 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:12.126509 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400e460).fault\", \"2017-09-19 04:12:15.126788 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008e70 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:18.127201 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400e430).fault\", \"2017-09-19 04:12:21.127597 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:24.127912 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400a0a0).fault\", \"2017-09-19 04:12:27.128102 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011ce0).fault\", \"2017-09-19 04:12:30.128500 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003d90).fault\", \"2017-09-19 04:12:33.128751 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011ce0).fault\", \"2017-09-19 04:12:36.128922 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003d90).fault\", \"2017-09-19 04:12:39.129311 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400e1f0).fault\", \"2017-09-19 04:12:42.129599 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003d90).fault\", \"2017-09-19 04:12:45.129882 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704004e70).fault\", \"2017-09-19 04:12:48.130184 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040008c0).fault\", \"2017-09-19 04:12:51.130424 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:12:54.130800 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040008c0).fault\", \"2017-09-19 04:12:57.131032 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:13:00.131390 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 
l=1 c=0x7fb704003680).fault\", \"2017-09-19 04:13:03.131818 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:13:06.132238 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003680).fault\", \"2017-09-19 04:13:09.132441 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:13:12.132841 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c8f0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003680).fault\", \"2017-09-19 04:13:15.133120 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400c080).fault\", \"2017-09-19 04:13:18.133980 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c8a0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040133c0).fault\", \"2017-09-19 04:13:21.134223 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400c080).fault\", \"2017-09-19 04:13:24.134400 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c8a0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040133c0).fault\", \"2017-09-19 04:13:27.134839 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400c080).fault\", \"2017-09-19 04:13:30.105628 7fb718274700 0 monclient(hunting): authenticate timed out after 300\", \"2017-09-19 04:13:30.105689 7fb718274700 0 librados: client.admin authentication error (110) Connection timed out\", \"Error connecting to cluster: TimedOut\"], \"stdout\":
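For reference, the output above was pulled from the failed Mistral workflow execution. A hedged sketch of the commands used to retrieve it on the undercloud (execution IDs vary per deployment, and the exact subcommands may differ slightly by release):

(undercloud) [stack@undercloud-0 ~]$ openstack workflow execution list | grep -i error
(undercloud) [stack@undercloud-0 ~]$ openstack workflow execution output show <execution-id>   # contains the ansible stdout shown above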

Comment 1 John Fulton 2017-09-19 14:55:18 UTC
To put this in ceph-ansible terms...

Configuration of the monitor node [1] failed [2] on the task "ceph-mon : add openstack key(s) to ceph" [3] with ceph-ansible-3.0.0-0.1.rc8.1.el7cp.noarch when using IPv6, reporting the following error:

 librados: client.admin authentication error (110) Connection timed out", "Error connecting to cluster: TimedOut"]

[1] 
(undercloud) [stack@undercloud-0 ~]$ nova list
+-------------------------+--------+-------------+------------------------+
| Name                    | Status | Power State | Networks               |
+-------------------------+--------+-------------+------------------------+
| overcloud-cephstorage-0 | ACTIVE | Running     | ctlplane=192.168.24.8  |
| overcloud-cephstorage-1 | ACTIVE | Running     | ctlplane=192.168.24.10 |
| overcloud-cephstorage-2 | ACTIVE | Running     | ctlplane=192.168.24.18 |
| overcloud-compute-0     | ACTIVE | Running     | ctlplane=192.168.24.9  |
| overcloud-controller-0  | ACTIVE | Running     | ctlplane=192.168.24.19 |
+-------------------------+--------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$ 

[2] 
PLAY RECAP ****************************************************************
192.168.24.10              : ok=1    changed=0    unreachable=0    failed=0   
192.168.24.18              : ok=1    changed=0    unreachable=0    failed=0   
192.168.24.19              : ok=54   changed=6    unreachable=0    failed=1   
192.168.24.8               : ok=1    changed=0    unreachable=0    failed=0   
192.168.24.9               : ok=1    changed=0    unreachable=0    failed=0

[3]
017-09-19 00:03:16,456 p=2823 u=mistral |  failed: [192.168.24.19] (item=[{u'mon_cap': u'allow r', u'osd_cap': u'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics', u'name': u'client.openstack', u'key': u'AQBXcsBZAAAAABAA0+Trl53eDg38apvvuYuA2w==', u'mode': u'0644'}, ...  "2017-09-19 04:03:17.211501 7f11cc3f7700  0 monclient(hunting): authenticate timed out after 300", "2017-09-19 04:03:17.211573 7f11cc3f7700  0 librados: client.admin authentication error (110) Connection timed out", "Error connecting to cluster: TimedOut"], "stdout": "", "stdout_lines": [] ...
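The full ceph-ansible run behind this failure can also be read directly on the undercloud instead of going through Mistral. A hedged pointer (the log path below is where TripleO writes the ceph-ansible output in this release; adjust it if your environment differs):

(undercloud) [stack@undercloud-0 ~]$ sudo grep -n 'failed:' /var/log/mistral/ceph-install-workflow.log | tail
(undercloud) [stack@undercloud-0 ~]$ sudo grep -n -A3 'authentication error' /var/log/mistral/ceph-install-workflow.log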

Comment 2 John Fulton 2017-09-19 15:40:05 UTC
Network info: 

The deployment was done with the storage network on IPv6:

[stack@undercloud-0 ~]$ cat /home/stack/virt/network/network-environment-v6.yaml
parameter_defaults:
    ControlPlaneDefaultRoute: 192.168.24.1
    DnsServers:
    - 10.35.28.1
    EC2MetadataIp: 192.168.24.1
    ExternalAllocationPools:
    -   end: 2620:52:0:13b8:5054:ff:fe3e:aa
        start: 2620:52:0:13b8:5054:ff:fe3e:1
    ExternalInterfaceDefaultRoute: 2620:52:0:13b8::fe
    ExternalNetCidr: 2620:52:0:13b8::/64
    ExternalNetworkVlanID: 10
    InternalApiAllocationPools:
    -   end: fd00:fd00:fd00:2000:ffff:ffff:ffff:fffe
        start: fd00:fd00:fd00:2000::10
    InternalApiNetCidr: fd00:fd00:fd00:2000::/64
    NeutronBridgeMappings: datacentre:br-ex,tenant:br-isolated
    NeutronExternalNetworkBridge: ''
    NeutronNetworkType: vlan
    NeutronNetworkVLANRanges: tenant:1000:2000
    NeutronTunnelTypes: ''''''
    StorageAllocationPools:
    -   end: fd00:fd00:fd00:3000:ffff:ffff:ffff:fffe
        start: fd00:fd00:fd00:3000::10
    StorageMgmtAllocationPools:
    -   end: fd00:fd00:fd00:4000:ffff:ffff:ffff:fffe
        start: fd00:fd00:fd00:4000::10
    StorageMgmtNetCidr: fd00:fd00:fd00:4000::/64
    StorageNetCidr: fd00:fd00:fd00:3000::/64
resource_registry:
    OS::TripleO::BlockStorage::Net::SoftwareConfig: three-nics-vlans/cinder-storage.yaml
    OS::TripleO::CephStorage::Net::SoftwareConfig: three-nics-vlans/ceph-storage.yaml
    OS::TripleO::Compute::Net::SoftwareConfig: three-nics-vlans/compute.yaml
    OS::TripleO::Controller::Net::SoftwareConfig: three-nics-vlans/controller-v6.yaml
    OS::TripleO::ObjectStorage::Net::SoftwareConfig: three-nics-vlans/swift-storage.yaml
[stack@undercloud-0 ~]$ 

The monitor node is using 

              type: ovs_bridge
              name: br-isolated
...
                  name: nic2
...
with the VLANs 

- InternalApiNetworkVlanID
- StorageNetworkVlanID
- StorageMgmtNetworkVlanID
- TenantNetworkVlanID


The ceph storage node is using:

              type: ovs_bridge
              name: br-isolated
...
                  name: nic2
...
with the VLANs
- StorageMgmtNetworkVlanID
- StorageNetworkVlanID

This comes down to the following: 

[root@overcloud-cephstorage-0 ~]# ovs-vsctl show
65cb1e84-c4a9-4123-9391-0abef55c4f5d
    Bridge br-isolated
        fail_mode: standalone
        Port "eth1"
            Interface "eth1"
        Port br-isolated
            Interface br-isolated
                type: internal
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
    ovs_version: "2.7.2"
[root@overcloud-cephstorage-0 ~]# 

[root@overcloud-controller-0 ~]# ovs-vsctl show
...
    Bridge br-isolated
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port "vlan50"
            tag: 50
            Interface "vlan50"
                type: internal
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
        Port "eth1"
            Interface "eth1"
        Port "vlan20"
            tag: 20
            Interface "vlan20"
                type: internal
        Port br-isolated
            Interface br-isolated
                type: internal
        Port phy-br-isolated
            Interface phy-br-isolated
                type: patch
                options: {peer=int-br-isolated}
...

[root@overcloud-controller-0 ~]# ip a s vlan30
10: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1a:11:3c:a5:d9:66 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:3000::10/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 fd00:fd00:fd00:3000::18/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::1811:3cff:fea5:d966/64 scope link 
       valid_lft forever preferred_lft forever
[root@overcloud-controller-0 ~]# ip a s vlan40
11: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether ae:77:df:8a:ba:46 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:4000::17/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 fd00:fd00:fd00:4000::13/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::ac77:dfff:fe8a:ba46/64 scope link 
       valid_lft forever preferred_lft forever
[root@overcloud-controller-0 ~]# 

[root@overcloud-cephstorage-0 ~]# ip a s vlan30
7: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:42:79:0b:3a:a6 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:3000::1b/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::5042:79ff:fe0b:3aa6/64 scope link 
       valid_lft forever preferred_lft forever
[root@overcloud-cephstorage-0 ~]# ip a s vlan40
8: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether b6:b6:9a:bf:7d:94 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:4000::11/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::b4b6:9aff:febf:7d94/64 scope link 
       valid_lft forever preferred_lft forever
[root@overcloud-cephstorage-0 ~]# 

Communication on the Storage and StorageMgmt networks between the storage and mon nodes seems OK.

[root@overcloud-controller-0 ~]# ping6 -c 2 fd00:fd00:fd00:3000::1b
PING fd00:fd00:fd00:3000::1b(fd00:fd00:fd00:3000::1b) 56 data bytes
64 bytes from fd00:fd00:fd00:3000::1b: icmp_seq=1 ttl=64 time=0.880 ms
64 bytes from fd00:fd00:fd00:3000::1b: icmp_seq=2 ttl=64 time=0.308 ms

--- fd00:fd00:fd00:3000::1b ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.308/0.594/0.880/0.286 ms
[root@overcloud-controller-0 ~]# ping6 -c 2 fd00:fd00:fd00:4000::11
PING fd00:fd00:fd00:4000::11(fd00:fd00:fd00:4000::11) 56 data bytes
64 bytes from fd00:fd00:fd00:4000::11: icmp_seq=1 ttl=64 time=0.334 ms
64 bytes from fd00:fd00:fd00:4000::11: icmp_seq=2 ttl=64 time=0.375 ms

--- fd00:fd00:fd00:4000::11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.334/0.354/0.375/0.027 ms
[root@overcloud-controller-0 ~]# 

[root@overcloud-cephstorage-0 ~]# ping6 -c 2 fd00:fd00:fd00:3000::10
PING fd00:fd00:fd00:3000::10(fd00:fd00:fd00:3000::10) 56 data bytes
64 bytes from fd00:fd00:fd00:3000::10: icmp_seq=1 ttl=64 time=1.03 ms
64 bytes from fd00:fd00:fd00:3000::10: icmp_seq=2 ttl=64 time=0.332 ms

--- fd00:fd00:fd00:3000::10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.332/0.682/1.032/0.350 ms
[root@overcloud-cephstorage-0 ~]# ping6 -c 2 fd00:fd00:fd00:4000::17
PING fd00:fd00:fd00:4000::17(fd00:fd00:fd00:4000::17) 56 data bytes
64 bytes from fd00:fd00:fd00:4000::17: icmp_seq=1 ttl=64 time=1.04 ms
64 bytes from fd00:fd00:fd00:4000::17: icmp_seq=2 ttl=64 time=0.296 ms

--- fd00:fd00:fd00:4000::17 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.296/0.672/1.048/0.376 ms
[root@overcloud-cephstorage-0 ~]#

Comment 3 John Fulton 2017-09-19 15:42:09 UTC
Ceph info:

[root@overcloud-controller-0 ~]# cat /etc/ceph/ceph.conf 
[global]
ms bind ipv6 = true
mon initial members = overcloud-controller-0
rgw_keystone_admin_password = dpjJRCtcMf7c8ND9TN6RQUDNq
cluster network = fd00:fd00:fd00:4000::/64
rgw_keystone_url = http://[fd00:fd00:fd00:2000::11]:5000
rgw_s3_auth_use_keystone = true
mon host = [fd00:fd00:fd00:3000::10]
rgw_keystone_admin_domain = default
osd_pool_default_size = 3
rgw_keystone_admin_project = service
osd_pool_default_pg_num = 32
rgw_keystone_accepted_roles = Member, _member_, admin
rgw_keystone_api_version = 3
rgw_keystone_admin_user = swift
public network = fd00:fd00:fd00:3000::/64
max open files = 131072
fsid = 9a73798c-9cd9-11e7-aa4e-525400d41121

[root@overcloud-controller-0 ~]# 

[root@overcloud-cephstorage-0 ~]# cat /etc/ceph/ceph.conf 
[global]
ms bind ipv6 = true
mon initial members = overcloud-controller-0
rgw_keystone_admin_password = dpjJRCtcMf7c8ND9TN6RQUDNq
cluster network = fd00:fd00:fd00:4000::/64
rgw_keystone_url = http://[fd00:fd00:fd00:2000::11]:5000
rgw_s3_auth_use_keystone = true
mon host = [fd00:fd00:fd00:3000::18]
rgw_keystone_admin_domain = default
osd_pool_default_size = 3
rgw_keystone_admin_project = service
osd_pool_default_pg_num = 32
rgw_keystone_accepted_roles = Member, _member_, admin
rgw_keystone_api_version = 3
rgw_keystone_admin_user = swift
public network = fd00:fd00:fd00:3000::/64
max open files = 131072
fsid = 9a73798c-9cd9-11e7-aa4e-525400d41121

[root@overcloud-cephstorage-0 ~]# 

[root@overcloud-controller-0 ~]# ceph -s
2017-09-19 14:56:43.024250 7f30a05f6700  0 -- :/411061361 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7f309c064ac0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f309c062820).fault
2017-09-19 14:56:46.023618 7f30a04f5700  0 -- :/411061361 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7f3090000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3090001f90).fault
^CTraceback (most recent call last):
  File "/bin/ceph", line 948, in <module>
    retval = main()
  File "/bin/ceph", line 852, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 1300, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.
[root@overcloud-controller-0 ~]# 

[root@overcloud-controller-0 ~]# ls -l /etc/ceph/
total 28
-rw-------. 1 root root 137 Sep 19 02:24 ceph.client.admin.keyring
-rw-r--r--. 1 root root 254 Sep 19 02:27 ceph.client.manila.keyring
-rw-r--r--. 1 root root 277 Sep 19 02:27 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 127 Sep 19 02:27 ceph.client.radosgw.keyring
-rw-r--r--. 1 root root 653 Sep 19 03:14 ceph.conf
-rw-------. 1 ceph ceph 553 Sep 19 02:24 ceph.mon.keyring
-rw-r--r--. 1 root root  92 Aug 25 22:43 rbdmap
[root@overcloud-controller-0 ~]# 

[root@overcloud-controller-0 ~]# cat /etc/ceph/ceph.mon.keyring 
[mon.]
	key = AQD3f8BZ0ag3OxAAr0mzKG+7jP4SmZrvIaNpsg==
	caps mon = "allow *"
[client.admin]
	key = AQD3f8BZ0aGwORAAHriA0sfzhYXDh98DEV0jgw==
	auid = 0
	caps mds = "allow"
	caps mon = "allow *"
	caps osd = "allow *"
[client.bootstrap-mds]
	key = AQD4f8BZDrWoAhAAV9Fe4cCg23MU/m74HmEusg==
	caps mon = "allow profile bootstrap-mds"
[client.bootstrap-osd]
	key = AQD4f8BZSIUdARAAKi/Oa81K+n4PuNJiBRjNqA==
	caps mon = "allow profile bootstrap-osd"
[client.bootstrap-rgw]
	key = AQD4f8BZgPMvBBAAgr8kOcOanyXzxPhFRp2JSw==
	caps mon = "allow profile bootstrap-rgw"
[root@overcloud-controller-0 ~]# 

[root@overcloud-controller-0 ~]# docker ps | grep ceph 
03166717a9c9        192.168.24.1:8787/ceph/rhceph-2-rhel7:latest                                      "/entrypoint.sh"         13 hours ago        Up 13 hours                                   ceph-mon-overcloud-controller-0
[root@overcloud-controller-0 ~]# 


[root@overcloud-cephstorage-0 ~]# docker ps -a
CONTAINER ID        IMAGE                                                          COMMAND                  CREATED             STATUS                    PORTS               NAMES
c3e7ac15a16c        192.168.24.1:8787/rhosp12/openstack-cron-docker:2017-09-15.1   "kolla_start"            12 hours ago        Up 12 hours                                   logrotate_crond
b91e57759c4c        192.168.24.1:8787/ceph/rhceph-2-rhel7:latest                   "/usr/bin/ceph --vers"   12 hours ago        Exited (0) 12 hours ago                       nauseous_fermat
[root@overcloud-cephstorage-0 ~]#

Comment 4 John Fulton 2017-09-19 20:21:35 UTC
I think I found the root cause: the ceph.conf had the wrong mon IP in it. 

Within the ceph monitor container [0][1], any ceph command I ran resulted in the container being unable to reach fd00:fd00:fd00:3000::10 [2]; however, the container was started to run not on ::10 but on ::18. If I update the ceph.conf to use ::10 instead of ::18 [3], then the ceph command works [4].

So, _why_ did it get the wrong IP? 


Footnotes:

[0] docker exec -ti ceph-mon-overcloud-controller-0 /bin/bash
[1] 
[root@overcloud-controller-0 ~]# ps axu | grep ceph 
root       43368  0.0  0.0 548848 12752 ?        Ssl  02:24   0:01 /usr/bin/docker-current run --rm --name ceph-mon-overcloud-controller-0 --net=host --memory=1g --cpu-quota=100000 -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph -v /etc/localtime:/etc/localtime:ro --net=host -e IP_VERSION=6 -e MON_IP=[fd00:fd00:fd00:3000::18] -e CLUSTER=ceph -e FSID=9a73798c-9cd9-11e7-aa4e-525400d41121 -e CEPH_PUBLIC_NETWORK=fd00:fd00:fd00:3000::/64 -e CEPH_DAEMON=MON 192.168.24.1:8787/ceph/rhceph-2-rhel7:latest
ceph       43437  0.0  0.0 360600 24740 ?        Ssl  02:24   0:17 /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph -d -i overcloud-controller-0 --mon-data /var/lib/ceph/mon/ceph-overcloud-controller-0 --public-addr [fd00:fd00:fd00:3000::18]:6789
root      168467  0.0  0.0 112664   968 pts/0    S+   19:17   0:00 grep --color=auto ceph
[root@overcloud-controller-0 ~]# 

[2] 
[root@overcloud-controller-0 /]# ceph -s
2017-09-19 20:06:21.156807 7fb918640700  0 -- :/202014710 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb91405cf20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb9140623a0).fault
2017-09-19 20:06:24.156876 7fb91853f700  0 -- :/202014710 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb908000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb908001f90).fault
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 852, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 1300, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.
[root@overcloud-controller-0 /]#     

[3] 
[root@overcloud-controller-0 /]# grep "mon host" /etc/ceph/ceph.conf 
#mon host = [fd00:fd00:fd00:3000::10]
mon host = [fd00:fd00:fd00:3000::18]
[root@overcloud-controller-0 /]# 

[4] 
[root@overcloud-controller-0 /]# ceph -s
    cluster 9a73798c-9cd9-11e7-aa4e-525400d41121
     health HEALTH_ERR
            224 pgs are stuck inactive for more than 300 seconds
            224 pgs stuck inactive
            224 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {overcloud-controller-0=[fd00:fd00:fd00:3000::18]:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e6: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v7: 224 pgs, 6 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 224 creating
[root@overcloud-controller-0 /]#
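To make the mismatch in [2]-[4] easier to see at a glance, a hedged sketch of the checks on the monitor node (all of the information already appears above in some form; the ss filter assumes the mon listens on the default port 6789):

[root@overcloud-controller-0 ~]# grep 'mon host' /etc/ceph/ceph.conf                        # what clients will try (::10 here)
[root@overcloud-controller-0 ~]# ps axu | grep '[c]eph-mon' | grep -o -- '--public-addr.*'  # what the mon was started with (::18 here)
[root@overcloud-controller-0 ~]# ss -6ltn | grep 6789                                       # the address actually listening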

Comment 6 John Fulton 2017-09-19 21:18:49 UTC
(In reply to John Fulton from comment #4)
> If I update the ceph.conf to use ::10 instead of ::18 [3], then the ceph 
> command works [4]. 

I meant to say ::18 instead of ::10

> [3] 
> [root@overcloud-controller-0 /]# grep "mon host" /etc/ceph/ceph.conf 
> #mon host = [fd00:fd00:fd00:3000::10]
> mon host = [fd00:fd00:fd00:3000::18]

Comment 7 John Fulton 2017-09-20 00:51:38 UTC
WORKAROUND:

Deploy with a Heat environment file containing something like the following: 

parameter_defaults:
    CephAnsibleExtraConfig:
      monitor_interface: vlan30

whenever the ceph monitor node uses a network interface other than br-ex for the storage network. In this deployment, vlan30 was the appropriate interface because the network environment (/home/stack/virt/network/network-environment-v6.yaml) specified it as the ceph storage network.
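For example, the snippet above can be saved as its own Heat environment file and appended to the original deploy command from this bug (the file name /home/stack/ceph-mon-interface.yaml is only illustrative):

cat > /home/stack/ceph-mon-interface.yaml <<'EOF'
parameter_defaults:
    CephAnsibleExtraConfig:
      monitor_interface: vlan30
EOF

openstack overcloud deploy --templates \
--libvirt-type kvm \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /home/stack/templates/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /home/stack/virt/network/network-environment-v6.yaml \
-e /home/stack/rhos12.yaml \
-e /home/stack/ceph-mon-interface.yaml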

FIX:

After the following patches land, there is no need to hard-code a monitor_interface in THT:

 https://github.com/ceph/ceph-ansible/pull/1884
 https://review.openstack.org/#/c/501121
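For context, these are the ceph-ansible variables involved (a hedged sketch; the variable names come from ceph-ansible, the address value is only illustrative for this environment):

# What THT passed by default before the fix (wrong whenever br-ex is not on the storage network):
monitor_interface: br_ex
# What the workaround in this comment sets instead:
monitor_interface: vlan30
# An alternative ceph-ansible accepts, which avoids guessing the interface entirely:
monitor_address: fd00:fd00:fd00:3000::18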

Comment 9 Alexander Chuzhoy 2017-09-20 21:47:02 UTC
Successfully deployed/populated a setup using the workaround from comment #7.

The deployment was with IPv6 + VLAN:
openstack overcloud deploy --templates \
--libvirt-type kvm \
-e /home/stack/templates/nodes_data.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /home/stack/virt/network/network-environment-v6.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-tls.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/inject-trust-anchor-hiera.yaml \
-e /home/stack/rhos12.yaml

Comment 10 John Fulton 2017-10-09 14:15:18 UTC
Tested a similar deployment without the monitor_interface parameter (i.e., without the workaround described in comment #7) using upstream ceph-ansible rc18. I confirm this worked, so I am changing the status to POST as the fix has merged upstream.

E.g. The mons reached quorum without passing a monitor_interface param:

[root@overcloud-controller-0 ~]# docker exec -ti c2e82779be3a /bin/bash
[root@overcloud-controller-0 /]# ceph -s                                                                                                                                                                          
    cluster b4136f4e-a9e7-11e7-8a8b-525400330666
     health HEALTH_ERR
            1856 pgs are stuck inactive for more than 300 seconds
            1856 pgs stuck inactive
            1856 pgs stuck unclean
            no osds
     monmap e2: 3 mons at {overcloud-controller-0=172.16.1.16:6789/0,overcloud-controller-1=172.16.1.15:6789/0,overcloud-controller-2=172.16.1.23:6789/0}
            election epoch 6, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
      fsmap e2: 0/0/1 up
     osdmap e9: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v10: 1856 pgs, 8 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                1856 creating
[root@overcloud-controller-0 /]# ip a 
bash: ip: command not found
[root@overcloud-controller-0 /]# ping 172.16.1.16
PING 172.16.1.16 (172.16.1.16) 56(84) bytes of data.
64 bytes from 172.16.1.16: icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from 172.16.1.16: icmp_seq=2 ttl=64 time=0.013 ms
^C
--- 172.16.1.16 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.013/0.015/0.017/0.002 ms
[root@overcloud-controller-0 /]# ping 172.16.1.15
PING 172.16.1.15 (172.16.1.15) 56(84) bytes of data.
64 bytes from 172.16.1.15: icmp_seq=1 ttl=64 time=0.065 ms
64 bytes from 172.16.1.15: icmp_seq=2 ttl=64 time=0.049 ms
^C
--- 172.16.1.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.049/0.057/0.065/0.008 ms
[root@overcloud-controller-0 /]# ping 172.16.1.23
PING 172.16.1.23 (172.16.1.23) 56(84) bytes of data.
64 bytes from 172.16.1.23: icmp_seq=1 ttl=64 time=0.120 ms
64 bytes from 172.16.1.23: icmp_seq=2 ttl=64 time=0.037 ms
^C
--- 172.16.1.23 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.037/0.078/0.120/0.042 ms
[root@overcloud-controller-0 /]#

Comment 11 John Fulton 2017-10-09 14:22:41 UTC
Correction: back to ON_DEV, as the following hasn't merged yet (we should not be passing this param as a default) and we're still passing br_ex, so the workaround in comment #7 is still necessary.

 https://review.openstack.org/#/c/501121

We cannot merge the above until we get CI working with rc18, and we must confirm it with the following first:

 https://review.openstack.org/#/c/501987 

Also, I'm removing the upstream link to the following, as it confused me and led to my last comment. To be clear, this is a THT bug to be fixed by 501121. The ceph-ansible bug is tracked elsewhere and its fix has already merged:

 https://github.com/ceph/ceph-ansible/pull/1884

Comment 12 Giulio Fidente 2017-10-09 20:30:54 UTC
Currently blocked by promotion of ceph-ansible > rc18 upstream

Comment 13 Giulio Fidente 2017-10-16 10:41:19 UTC
Merged into master branch, updating reference to the stable/pike port.

Comment 15 Alexander Chuzhoy 2017-10-16 13:48:07 UTC
The "Fixed in version" is there: openstack-tripleo-heat-templates-7.0.2-0.20171007062244.el7ost

Comment 16 Alexander Chuzhoy 2017-10-16 21:10:25 UTC
Verified.
Environment:
openstack-tripleo-heat-templates-7.0.2-0.20171007062244.el7ost.noarch

The reported issue didn't reproduce.

Comment 20 errata-xmlrpc 2017-12-13 22:10:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

