ceph: librados: client.admin authentication error (110) Connection timed out", "Error connecting to cluster: TimedOut

Environment:
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch
puppet-ceph-2.4.1-0.20170911230204.ebea4b7.el7ost.noarch
instack-undercloud-7.4.1-0.20170912115418.el7ost.noarch
ceph-ansible-3.0.0-0.1.rc8.1.el7cp.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170913050523.0rc2.el7ost.noarch
ceph-common-10.2.7-32.el7cp.x86_64
ceph-mon-10.2.7-32.el7cp.x86_64
libcephfs1-10.2.7-32.el7cp.x86_64
python-cephfs-10.2.7-32.el7cp.x86_64
ceph-base-10.2.7-32.el7cp.x86_64
ceph-radosgw-10.2.7-32.el7cp.x86_64
ceph-mds-10.2.7-32.el7cp.x86_64
ceph-selinux-10.2.7-32.el7cp.x86_64

Steps to reproduce:
Deploy the overcloud with vlan+ipv6 using:

openstack overcloud deploy --templates \
 --libvirt-type kvm \
 -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
 -e /home/stack/templates/nodes_data.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
 -e /home/stack/virt/network/network-environment-v6.yaml \
 -e /home/stack/rhos12.yaml

The overcloud deploys but the OSDs are down. Apply the workarounds for the following bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1491027
https://bugzilla.redhat.com/show_bug.cgi?id=1491780
then re-run the deployment command.

Result:
The re-deployment fails with:

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: 01550cd1-340c-4430-b42c-b9f134952c18
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

Debugging with mistral shows this error (truncated output):

a0).fault\", \"2017-09-19 04:12:00.125276 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704008030).fault\", \"2017-09-19 04:12:03.125571 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008e70 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:06.125874 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704008030).fault\", \"2017-09-19 04:12:09.126176 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008e70 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:12.126509 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400e460).fault\", \"2017-09-19 04:12:15.126788 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008e70 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:18.127201 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400e430).fault\", \"2017-09-19 04:12:21.127597 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704006820).fault\", \"2017-09-19 04:12:24.127912 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400a0a0).fault\", \"2017-09-19 04:12:27.128102 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20
sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011ce0).fault\", \"2017-09-19 04:12:30.128500 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003d90).fault\", \"2017-09-19 04:12:33.128751 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011ce0).fault\", \"2017-09-19 04:12:36.128922 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003d90).fault\", \"2017-09-19 04:12:39.129311 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008d20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400e1f0).fault\", \"2017-09-19 04:12:42.129599 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003d90).fault\", \"2017-09-19 04:12:45.129882 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704004e70).fault\", \"2017-09-19 04:12:48.130184 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040008c0).fault\", \"2017-09-19 04:12:51.130424 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:12:54.130800 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040008c0).fault\", \"2017-09-19 04:12:57.131032 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:13:00.131390 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003680).fault\", \"2017-09-19 04:13:03.131818 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:13:06.132238 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c9c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003680).fault\", \"2017-09-19 04:13:09.132441 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704011c90).fault\", \"2017-09-19 04:13:12.132841 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c8f0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb704003680).fault\", \"2017-09-19 04:13:15.133120 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400c080).fault\", \"2017-09-19 04:13:18.133980 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c8a0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040133c0).fault\", \"2017-09-19 04:13:21.134223 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400c080).fault\", \"2017-09-19 04:13:24.134400 7fb7142fb700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb70400c8a0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb7040133c0).fault\", \"2017-09-19 04:13:27.134839 7fb7141fa700 0 -- :/2900802587 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb704008c30 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb70400c080).fault\", \"2017-09-19 04:13:30.105628 7fb718274700 0 monclient(hunting): authenticate timed out after 300\", 
\"2017-09-19 04:13:30.105689 7fb718274700 0 librados: client.admin authentication error (110) Connection timed out\", \"Error connecting to cluster: TimedOut\"], \"stdout\":
To put this in ceph-ansible terms: configuration of the monitor node [1] failed [2] on the task "ceph-mon : add openstack key(s) to ceph" [3] with ceph-ansible-3.0.0-0.1.rc8.1.el7cp.noarch when using IPv6, with the following error:

librados: client.admin authentication error (110) Connection timed out", "Error connecting to cluster: TimedOut"]

[1]
(undercloud) [stack@undercloud-0 ~]$ nova list
+-------------------------+--------+-------------+------------------------+
| Name                    | Status | Power State | Networks               |
+-------------------------+--------+-------------+------------------------+
| overcloud-cephstorage-0 | ACTIVE | Running     | ctlplane=192.168.24.8  |
| overcloud-cephstorage-1 | ACTIVE | Running     | ctlplane=192.168.24.10 |
| overcloud-cephstorage-2 | ACTIVE | Running     | ctlplane=192.168.24.18 |
| overcloud-compute-0     | ACTIVE | Running     | ctlplane=192.168.24.9  |
| overcloud-controller-0  | ACTIVE | Running     | ctlplane=192.168.24.19 |
+-------------------------+--------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$

[2]
PLAY RECAP ****************************************************************
192.168.24.10 : ok=1  changed=0 unreachable=0 failed=0
192.168.24.18 : ok=1  changed=0 unreachable=0 failed=0
192.168.24.19 : ok=54 changed=6 unreachable=0 failed=1
192.168.24.8  : ok=1  changed=0 unreachable=0 failed=0
192.168.24.9  : ok=1  changed=0 unreachable=0 failed=0

[3]
2017-09-19 00:03:16,456 p=2823 u=mistral | failed: [192.168.24.19] (item=[{u'mon_cap': u'allow r', u'osd_cap': u'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics', u'name': u'client.openstack', u'key': u'AQBXcsBZAAAAABAA0+Trl53eDg38apvvuYuA2w==', u'mode': u'0644'}, ...
"2017-09-19 04:03:17.211501 7f11cc3f7700 0 monclient(hunting): authenticate timed out after 300",
"2017-09-19 04:03:17.211573 7f11cc3f7700 0 librados: client.admin authentication error (110) Connection timed out",
"Error connecting to cluster: TimedOut"], "stdout": "", "stdout_lines": [] ...
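For context, the failing task adds the OpenStack client key on the mon node; based on the caps shown in [3] it is roughly equivalent to something like the following run against the cluster (a sketch only, not ceph-ansible's literal task):

[root@overcloud-controller-0 ~]# ceph auth get-or-create client.openstack \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics' \
    -o /etc/ceph/ceph.client.openstack.keyring
# Any such command first authenticates as client.admin against the "mon host"
# address from ceph.conf, which is why it fails here with "authenticate timed out".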
Network info:

Deployment was done with the storage network on IPv6:

[stack@undercloud-0 ~]$ cat /home/stack/virt/network/network-environment-v6.yaml
parameter_defaults:
  ControlPlaneDefaultRoute: 192.168.24.1
  DnsServers:
  - 10.35.28.1
  EC2MetadataIp: 192.168.24.1
  ExternalAllocationPools:
  - end: 2620:52:0:13b8:5054:ff:fe3e:aa
    start: 2620:52:0:13b8:5054:ff:fe3e:1
  ExternalInterfaceDefaultRoute: 2620:52:0:13b8::fe
  ExternalNetCidr: 2620:52:0:13b8::/64
  ExternalNetworkVlanID: 10
  InternalApiAllocationPools:
  - end: fd00:fd00:fd00:2000:ffff:ffff:ffff:fffe
    start: fd00:fd00:fd00:2000::10
  InternalApiNetCidr: fd00:fd00:fd00:2000::/64
  NeutronBridgeMappings: datacentre:br-ex,tenant:br-isolated
  NeutronExternalNetworkBridge: ''
  NeutronNetworkType: vlan
  NeutronNetworkVLANRanges: tenant:1000:2000
  NeutronTunnelTypes: ''''''
  StorageAllocationPools:
  - end: fd00:fd00:fd00:3000:ffff:ffff:ffff:fffe
    start: fd00:fd00:fd00:3000::10
  StorageMgmtAllocationPools:
  - end: fd00:fd00:fd00:4000:ffff:ffff:ffff:fffe
    start: fd00:fd00:fd00:4000::10
  StorageMgmtNetCidr: fd00:fd00:fd00:4000::/64
  StorageNetCidr: fd00:fd00:fd00:3000::/64
resource_registry:
  OS::TripleO::BlockStorage::Net::SoftwareConfig: three-nics-vlans/cinder-storage.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: three-nics-vlans/ceph-storage.yaml
  OS::TripleO::Compute::Net::SoftwareConfig: three-nics-vlans/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: three-nics-vlans/controller-v6.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: three-nics-vlans/swift-storage.yaml
[stack@undercloud-0 ~]$

The monitor node is using:

  type: ovs_bridge
  name: br-isolated
  ...
  name: nic2
  ...

with the VLANs:
- InternalApiNetworkVlanID
- StorageNetworkVlanID
- StorageMgmtNetworkVlanID
- TenantNetworkVlanID

The ceph storage node is using:

  type: ovs_bridge
  name: br-isolated
  ...
  name: nic2
  ...

with the VLANs:
- StorageMgmtNetworkVlanID
- StorageNetworkVlanID

This comes down to the following:

[root@overcloud-cephstorage-0 ~]# ovs-vsctl show
65cb1e84-c4a9-4123-9391-0abef55c4f5d
    Bridge br-isolated
        fail_mode: standalone
        Port "eth1"
            Interface "eth1"
        Port br-isolated
            Interface br-isolated
                type: internal
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
    ovs_version: "2.7.2"
[root@overcloud-cephstorage-0 ~]#

[root@overcloud-controller-0 ~]# ovs-vsctl show
...
    Bridge br-isolated
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port "vlan50"
            tag: 50
            Interface "vlan50"
                type: internal
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
        Port "eth1"
            Interface "eth1"
        Port "vlan20"
            tag: 20
            Interface "vlan20"
                type: internal
        Port br-isolated
            Interface br-isolated
                type: internal
        Port phy-br-isolated
            Interface phy-br-isolated
                type: patch
                options: {peer=int-br-isolated}
...
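To confirm which interface actually carries the storage network on a node (relevant later when choosing a monitor_interface), addresses can be filtered by prefix. A sketch; the interface names are specific to this deployment:

[root@overcloud-controller-0 ~]# ip -6 addr show to fd00:fd00:fd00:3000::/64
# Only vlan30 is expected to appear here, since StorageNetCidr
# (fd00:fd00:fd00:3000::/64) is carried on VLAN 30 in this environment.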
[root@overcloud-controller-0 ~]# ip a s vlan30
10: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1a:11:3c:a5:d9:66 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:3000::10/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fd00:fd00:fd00:3000::18/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::1811:3cff:fea5:d966/64 scope link
       valid_lft forever preferred_lft forever
[root@overcloud-controller-0 ~]# ip a s vlan40
11: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether ae:77:df:8a:ba:46 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:4000::17/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fd00:fd00:fd00:4000::13/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::ac77:dfff:fe8a:ba46/64 scope link
       valid_lft forever preferred_lft forever
[root@overcloud-controller-0 ~]#

[root@overcloud-cephstorage-0 ~]# ip a s vlan30
7: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:42:79:0b:3a:a6 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:3000::1b/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::5042:79ff:fe0b:3aa6/64 scope link
       valid_lft forever preferred_lft forever
[root@overcloud-cephstorage-0 ~]# ip a s vlan40
8: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether b6:b6:9a:bf:7d:94 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:fd00:fd00:4000::11/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::b4b6:9aff:febf:7d94/64 scope link
       valid_lft forever preferred_lft forever
[root@overcloud-cephstorage-0 ~]#

Communication between the Storage and StorageMgmt networks among the storage and mon nodes seems OK:
[root@overcloud-controller-0 ~]# ping6 -c 2 fd00:fd00:fd00:3000::1b
PING fd00:fd00:fd00:3000::1b(fd00:fd00:fd00:3000::1b) 56 data bytes
64 bytes from fd00:fd00:fd00:3000::1b: icmp_seq=1 ttl=64 time=0.880 ms
64 bytes from fd00:fd00:fd00:3000::1b: icmp_seq=2 ttl=64 time=0.308 ms

--- fd00:fd00:fd00:3000::1b ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.308/0.594/0.880/0.286 ms
[root@overcloud-controller-0 ~]# ping6 -c 2 fd00:fd00:fd00:4000::11
PING fd00:fd00:fd00:4000::11(fd00:fd00:fd00:4000::11) 56 data bytes
64 bytes from fd00:fd00:fd00:4000::11: icmp_seq=1 ttl=64 time=0.334 ms
64 bytes from fd00:fd00:fd00:4000::11: icmp_seq=2 ttl=64 time=0.375 ms

--- fd00:fd00:fd00:4000::11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.334/0.354/0.375/0.027 ms
[root@overcloud-controller-0 ~]#

[root@overcloud-cephstorage-0 ~]# ping6 -c 2 fd00:fd00:fd00:3000::10
PING fd00:fd00:fd00:3000::10(fd00:fd00:fd00:3000::10) 56 data bytes
64 bytes from fd00:fd00:fd00:3000::10: icmp_seq=1 ttl=64 time=1.03 ms
64 bytes from fd00:fd00:fd00:3000::10: icmp_seq=2 ttl=64 time=0.332 ms

--- fd00:fd00:fd00:3000::10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.332/0.682/1.032/0.350 ms
[root@overcloud-cephstorage-0 ~]# ping6 -c 2 fd00:fd00:fd00:4000::17
PING fd00:fd00:fd00:4000::17(fd00:fd00:fd00:4000::17) 56 data bytes
64 bytes from fd00:fd00:fd00:4000::17: icmp_seq=1 ttl=64 time=1.04 ms
64 bytes from fd00:fd00:fd00:4000::17: icmp_seq=2 ttl=64 time=0.296 ms

--- fd00:fd00:fd00:4000::17 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.296/0.672/1.048/0.376 ms
[root@overcloud-cephstorage-0 ~]#
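ICMP reachability alone does not prove the mon is reachable on its TCP port, though. It is also worth checking which address the mon daemon is actually listening on, e.g. on the controller (a sketch):

[root@overcloud-controller-0 ~]# ss -6 -ltn | grep 6789
# If the mon is bound only to one of the vlan30 addresses (e.g. ::18), clients
# pointed at the other address (e.g. ::10) in ceph.conf will hang and time out.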
Ceph info:

[root@overcloud-controller-0 ~]# cat /etc/ceph/ceph.conf
[global]
ms bind ipv6 = true
mon initial members = overcloud-controller-0
rgw_keystone_admin_password = dpjJRCtcMf7c8ND9TN6RQUDNq
cluster network = fd00:fd00:fd00:4000::/64
rgw_keystone_url = http://[fd00:fd00:fd00:2000::11]:5000
rgw_s3_auth_use_keystone = true
mon host = [fd00:fd00:fd00:3000::10]
rgw_keystone_admin_domain = default
osd_pool_default_size = 3
rgw_keystone_admin_project = service
osd_pool_default_pg_num = 32
rgw_keystone_accepted_roles = Member, _member_, admin
rgw_keystone_api_version = 3
rgw_keystone_admin_user = swift
public network = fd00:fd00:fd00:3000::/64
max open files = 131072
fsid = 9a73798c-9cd9-11e7-aa4e-525400d41121
[root@overcloud-controller-0 ~]#

[root@overcloud-cephstorage-0 ~]# cat /etc/ceph/ceph.conf
[global]
ms bind ipv6 = true
mon initial members = overcloud-controller-0
rgw_keystone_admin_password = dpjJRCtcMf7c8ND9TN6RQUDNq
cluster network = fd00:fd00:fd00:4000::/64
rgw_keystone_url = http://[fd00:fd00:fd00:2000::11]:5000
rgw_s3_auth_use_keystone = true
mon host = [fd00:fd00:fd00:3000::18]
rgw_keystone_admin_domain = default
osd_pool_default_size = 3
rgw_keystone_admin_project = service
osd_pool_default_pg_num = 32
rgw_keystone_accepted_roles = Member, _member_, admin
rgw_keystone_api_version = 3
rgw_keystone_admin_user = swift
public network = fd00:fd00:fd00:3000::/64
max open files = 131072
fsid = 9a73798c-9cd9-11e7-aa4e-525400d41121
[root@overcloud-cephstorage-0 ~]#

[root@overcloud-controller-0 ~]# ceph -s
2017-09-19 14:56:43.024250 7f30a05f6700 0 -- :/411061361 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7f309c064ac0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f309c062820).fault
2017-09-19 14:56:46.023618 7f30a04f5700 0 -- :/411061361 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7f3090000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3090001f90).fault
^CTraceback (most recent call last):
  File "/bin/ceph", line 948, in <module>
    retval = main()
  File "/bin/ceph", line 852, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 1300, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.
[root@overcloud-controller-0 ~]#

[root@overcloud-controller-0 ~]# ls -l /etc/ceph/
total 28
-rw-------. 1 root root 137 Sep 19 02:24 ceph.client.admin.keyring
-rw-r--r--. 1 root root 254 Sep 19 02:27 ceph.client.manila.keyring
-rw-r--r--. 1 root root 277 Sep 19 02:27 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 127 Sep 19 02:27 ceph.client.radosgw.keyring
-rw-r--r--. 1 root root 653 Sep 19 03:14 ceph.conf
-rw-------. 1 ceph ceph 553 Sep 19 02:24 ceph.mon.keyring
-rw-r--r--. 1 root root  92 Aug 25 22:43 rbdmap
[root@overcloud-controller-0 ~]#

[root@overcloud-controller-0 ~]# cat /etc/ceph/ceph.mon.keyring
[mon.]
	key = AQD3f8BZ0ag3OxAAr0mzKG+7jP4SmZrvIaNpsg==
	caps mon = "allow *"
[client.admin]
	key = AQD3f8BZ0aGwORAAHriA0sfzhYXDh98DEV0jgw==
	auid = 0
	caps mds = "allow"
	caps mon = "allow *"
	caps osd = "allow *"
[client.bootstrap-mds]
	key = AQD4f8BZDrWoAhAAV9Fe4cCg23MU/m74HmEusg==
	caps mon = "allow profile bootstrap-mds"
[client.bootstrap-osd]
	key = AQD4f8BZSIUdARAAKi/Oa81K+n4PuNJiBRjNqA==
	caps mon = "allow profile bootstrap-osd"
[client.bootstrap-rgw]
	key = AQD4f8BZgPMvBBAAgr8kOcOanyXzxPhFRp2JSw==
	caps mon = "allow profile bootstrap-rgw"
[root@overcloud-controller-0 ~]#

[root@overcloud-controller-0 ~]# docker ps | grep ceph
03166717a9c9  192.168.24.1:8787/ceph/rhceph-2-rhel7:latest  "/entrypoint.sh"  13 hours ago  Up 13 hours  ceph-mon-overcloud-controller-0
[root@overcloud-controller-0 ~]#

[root@overcloud-cephstorage-0 ~]# docker ps -a
CONTAINER ID  IMAGE                                                         COMMAND                 CREATED       STATUS                   PORTS  NAMES
c3e7ac15a16c  192.168.24.1:8787/rhosp12/openstack-cron-docker:2017-09-15.1  "kolla_start"           12 hours ago  Up 12 hours                     logrotate_crond
b91e57759c4c  192.168.24.1:8787/ceph/rhceph-2-rhel7:latest                  "/usr/bin/ceph --vers"  12 hours ago  Exited (0) 12 hours ago         nauseous_fermat
[root@overcloud-cephstorage-0 ~]#
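Note that the two ceph.conf files above show different "mon host" values (::10 on the controller, ::18 on the storage node). A quick way to spot this kind of per-node inconsistency is to compare "mon host" across the overcloud nodes from the undercloud. A sketch; the ctlplane IPs are the ones from the nova list above, and heat-admin is the usual overcloud SSH user:

(undercloud) [stack@undercloud-0 ~]$ for ip in 192.168.24.8 192.168.24.9 192.168.24.10 192.168.24.18 192.168.24.19; do \
      echo "== $ip"; ssh -o StrictHostKeyChecking=no heat-admin@$ip "grep 'mon host' /etc/ceph/ceph.conf"; \
  done
# Nodes whose "mon host" differs from the address the mon actually binds to will
# time out exactly as shown above.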
I think I found the root cause: the ceph.conf had the wrong mon IP in it.

Within the ceph monitor container [0][1], any ceph command I ran resulted in the container being unable to reach fd00:fd00:fd00:3000::10 [2]; however, the container was started to run not on ::10 but on ::18. If I update the ceph.conf to use ::10 instead of ::18 [3], then the ceph command works [4].

So, _why_ did it get the wrong IP?

Footnotes:

[0] docker exec -ti ceph-mon-overcloud-controller-0 /bin/bash

[1]
[root@overcloud-controller-0 ~]# ps axu | grep ceph
root 43368 0.0 0.0 548848 12752 ? Ssl 02:24 0:01 /usr/bin/docker-current run --rm --name ceph-mon-overcloud-controller-0 --net=host --memory=1g --cpu-quota=100000 -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph -v /etc/localtime:/etc/localtime:ro --net=host -e IP_VERSION=6 -e MON_IP=[fd00:fd00:fd00:3000::18] -e CLUSTER=ceph -e FSID=9a73798c-9cd9-11e7-aa4e-525400d41121 -e CEPH_PUBLIC_NETWORK=fd00:fd00:fd00:3000::/64 -e CEPH_DAEMON=MON 192.168.24.1:8787/ceph/rhceph-2-rhel7:latest
ceph 43437 0.0 0.0 360600 24740 ? Ssl 02:24 0:17 /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph -d -i overcloud-controller-0 --mon-data /var/lib/ceph/mon/ceph-overcloud-controller-0 --public-addr [fd00:fd00:fd00:3000::18]:6789
root 168467 0.0 0.0 112664 968 pts/0 S+ 19:17 0:00 grep --color=auto ceph
[root@overcloud-controller-0 ~]#

[2]
[root@overcloud-controller-0 /]# ceph -s
2017-09-19 20:06:21.156807 7fb918640700 0 -- :/202014710 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb91405cf20 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb9140623a0).fault
2017-09-19 20:06:24.156876 7fb91853f700 0 -- :/202014710 >> [fd00:fd00:fd00:3000::10]:6789/0 pipe(0x7fb908000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb908001f90).fault
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 852, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 1300, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.
[root@overcloud-controller-0 /]#

[3]
[root@overcloud-controller-0 /]# grep "mon host" /etc/ceph/ceph.conf
#mon host = [fd00:fd00:fd00:3000::10]
mon host = [fd00:fd00:fd00:3000::18]
[root@overcloud-controller-0 /]#

[4]
[root@overcloud-controller-0 /]# ceph -s
    cluster 9a73798c-9cd9-11e7-aa4e-525400d41121
     health HEALTH_ERR
            224 pgs are stuck inactive for more than 300 seconds
            224 pgs stuck inactive
            224 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {overcloud-controller-0=[fd00:fd00:fd00:3000::18]:6789/0}
            election epoch 3, quorum 0 overcloud-controller-0
     osdmap e6: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v7: 224 pgs, 6 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 224 creating
[root@overcloud-controller-0 /]#
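A sketch of how to compare the address in ceph.conf with the MON_IP the container was started with (container name taken from the docker ps output above):

[root@overcloud-controller-0 ~]# grep "mon host" /etc/ceph/ceph.conf
[root@overcloud-controller-0 ~]# docker inspect \
      --format '{{range .Config.Env}}{{println .}}{{end}}' \
      ceph-mon-overcloud-controller-0 | grep MON_IP
# If these two addresses disagree, every client on the node (including the ceph
# CLI inside the mon container) will try the wrong monitor address and time out.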
(In reply to John Fulton from comment #4)
> If I update the ceph.conf to use ::10 instead of ::18 [3], then the ceph
> command works [4].

I meant to say ::18 instead of ::10.

> [3]
> [root@overcloud-controller-0 /]# grep "mon host" /etc/ceph/ceph.conf
> #mon host = [fd00:fd00:fd00:3000::10]
> mon host = [fd00:fd00:fd00:3000::18]
WORKAROUND:

Deploy with a Heat environment file containing something like the following:

parameter_defaults:
  CephAnsibleExtraConfig:
    monitor_interface: vlan30

whenever the ceph monitor node will use a network interface other than br-ex for the storage network. In the case of this deployment, vlan30 was the appropriate interface because the network environment file (/home/stack/virt/network/network-environment-v6.yaml) placed the ceph storage network on that VLAN.

FIX:

After the following patches land there is no need to hard-code a monitor_interface in THT:

https://github.com/ceph/ceph-ansible/pull/1884
https://review.openstack.org/#/c/501121
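For example, the workaround can be applied by saving the snippet above to an environment file (the filename below is only an example) and appending it to the existing deploy command:

(undercloud) [stack@undercloud-0 ~]$ cat > /home/stack/ceph-mon-interface.yaml <<'EOF'
parameter_defaults:
  CephAnsibleExtraConfig:
    monitor_interface: vlan30
EOF
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy --templates \
  ... \
  -e /home/stack/virt/network/network-environment-v6.yaml \
  -e /home/stack/ceph-mon-interface.yaml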
Successfully deployed/populated a setup using the w/a from comment #7; the deployment was with ipv6+vlan:

openstack overcloud deploy --templates \
 --libvirt-type kvm \
 -e /home/stack/templates/nodes_data.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
 -e /home/stack/virt/network/network-environment-v6.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-tls.yaml \
 -e /home/stack/virt/public_vip.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
 -e /home/stack/inject-trust-anchor-hiera.yaml \
 -e /home/stack/rhos12.yaml
Tested a similar deploy without the monitor_interface parameter (i.e. without the workaround described in comment #7) using upstream ceph-ansible rc18. I confirm this worked, so I am changing the status to POST as it has merged upstream. E.g. the mons reached quorum without passing a monitor_interface param:

[root@overcloud-controller-0 ~]# docker exec -ti c2e82779be3a /bin/bash
[root@overcloud-controller-0 /]# ceph -s
    cluster b4136f4e-a9e7-11e7-8a8b-525400330666
     health HEALTH_ERR
            1856 pgs are stuck inactive for more than 300 seconds
            1856 pgs stuck inactive
            1856 pgs stuck unclean
            no osds
     monmap e2: 3 mons at {overcloud-controller-0=172.16.1.16:6789/0,overcloud-controller-1=172.16.1.15:6789/0,overcloud-controller-2=172.16.1.23:6789/0}
            election epoch 6, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
      fsmap e2: 0/0/1 up
     osdmap e9: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v10: 1856 pgs, 8 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                1856 creating
[root@overcloud-controller-0 /]# ip a
bash: ip: command not found
[root@overcloud-controller-0 /]# ping 172.16.1.16
PING 172.16.1.16 (172.16.1.16) 56(84) bytes of data.
64 bytes from 172.16.1.16: icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from 172.16.1.16: icmp_seq=2 ttl=64 time=0.013 ms
^C
--- 172.16.1.16 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.013/0.015/0.017/0.002 ms
[root@overcloud-controller-0 /]# ping 172.16.1.15
PING 172.16.1.15 (172.16.1.15) 56(84) bytes of data.
64 bytes from 172.16.1.15: icmp_seq=1 ttl=64 time=0.065 ms
64 bytes from 172.16.1.15: icmp_seq=2 ttl=64 time=0.049 ms
^C
--- 172.16.1.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.049/0.057/0.065/0.008 ms
[root@overcloud-controller-0 /]# ping 172.16.1.23
PING 172.16.1.23 (172.16.1.23) 56(84) bytes of data.
64 bytes from 172.16.1.23: icmp_seq=1 ttl=64 time=0.120 ms
64 bytes from 172.16.1.23: icmp_seq=2 ttl=64 time=0.037 ms
^C
--- 172.16.1.23 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.037/0.078/0.120/0.042 ms
[root@overcloud-controller-0 /]#
Correction. Back to ON_DEV, as the following has not merged yet (we should not be passing this param as a default) and we're still passing br_ex, so the workaround in comment #7 is still necessary:

https://review.openstack.org/#/c/501121

We cannot merge the above until we get CI working with rc18, and we must confirm with the following first:

https://review.openstack.org/#/c/501987

Also, I'm removing the upstream link to the following, as it confused me and led to my last comment. To be clear, this is a THT bug to be fixed by 501121; the ceph-ansible side is tracked elsewhere and has already been addressed, as the following has merged:

https://github.com/ceph/ceph-ansible/pull/1884
Currently blocked by promotion of ceph-ansible > rc18 upstream
Merged into master branch, updating reference to the stable/pike port.
The "Fixed in version" is there: openstack-tripleo-heat-templates-7.0.2-0.20171007062244.el7ost
Verified.

Environment:
openstack-tripleo-heat-templates-7.0.2-0.20171007062244.el7ost.noarch

The reported issue didn't reproduce.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462