Description of problem:
Ceph-ansible is grabbing the VIP address and using it for the MON, MGR and RGW containers on the controller node that hosts the storage VIP when the deployment is run.

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.5-1.el7cp.noarch

How reproducible:
Seems consistent to me; I got it 3 times in a row. However, it is hard to detect since the initial deploy completes successfully. I marked this urgent because of how hard it is to notice.

Steps to Reproduce:
1. Configure the Storage and Cluster networks for IPv6
2. Deploy the cluster with OSP-d 13; this works fine
3. Re-deploy with the exact same settings; the redeploy fails

While troubleshooting you find that the RGW container on the controller with the VIP fails to start, and poking around further you find that the VIP is configured in the ceph.conf file for MON and RGW.

Please find the following VIP entry in the pcs status and ceph.conf output below: 2605:1c00:50f2:29a8::19

pcs status
---------------------------------------------------------
Cluster name: tripleo_cluster
Stack: corosync
Current DC: engcloud-controller-2 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
Last updated: Fri Feb 22 20:36:14 2019
Last change: Fri Feb 22 20:05:42 2019 by root via cibadmin on engcloud-controller-1

12 nodes configured
38 resources configured

Online: [ engcloud-controller-0 engcloud-controller-1 engcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@engcloud-controller-0 galera-bundle-1@engcloud-controller-1 galera-bundle-2@engcloud-controller-2 rabbitmq-bundle-0@engcloud-controller-0 rabbitmq-bundle-1@engcloud-controller-1 rabbitmq-bundle-2@engcloud-controller-2 redis-bundle-0@engcloud-controller-0 redis-bundle-1@engcloud-controller-1 redis-bundle-2@engcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [satellite-eng.nfv.charterlab.com:5000/nfv_charterlab_com-eng-osp13-osp13_containers-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started engcloud-controller-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Started engcloud-controller-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Started engcloud-controller-2
 Docker container set: galera-bundle [satellite-eng.nfv.charterlab.com:5000/nfv_charterlab_com-eng-osp13-osp13_containers-mariadb:pcmklatest]
   galera-bundle-0   (ocf::heartbeat:galera):   Master engcloud-controller-0
   galera-bundle-1   (ocf::heartbeat:galera):   Master engcloud-controller-1
   galera-bundle-2   (ocf::heartbeat:galera):   Master engcloud-controller-2
 Docker container set: redis-bundle [satellite-eng.nfv.charterlab.com:5000/nfv_charterlab_com-eng-osp13-osp13_containers-redis:pcmklatest]
   redis-bundle-0   (ocf::heartbeat:redis):   Master engcloud-controller-0
   redis-bundle-1   (ocf::heartbeat:redis):   Slave engcloud-controller-1
   redis-bundle-2   (ocf::heartbeat:redis):   Slave engcloud-controller-2
 ip-44.154.16.45   (ocf::heartbeat:IPaddr2):   Started engcloud-controller-0
 ip-2605.1c00.50f2.2980.44.154.0.30   (ocf::heartbeat:IPaddr2):   Started engcloud-controller-1
 ip-2605.1c00.50f2.2998..11   (ocf::heartbeat:IPaddr2):   Started engcloud-controller-2
 ip-2605.1c00.50f2.2998..1e   (ocf::heartbeat:IPaddr2):   Started engcloud-controller-0
 ip-2605.1c00.50f2.29a8..19   (ocf::heartbeat:IPaddr2):   Started engcloud-controller-1
 ip-2605.1c00.50f2.29b0..10   (ocf::heartbeat:IPaddr2):   Started engcloud-controller-2
 Docker container set: haproxy-bundle [satellite-eng.nfv.charterlab.com:5000/nfv_charterlab_com-eng-osp13-osp13_containers-haproxy:pcmklatest]
   haproxy-bundle-docker-0   (ocf::heartbeat:docker):   Started engcloud-controller-0
   haproxy-bundle-docker-1   (ocf::heartbeat:docker):   Started engcloud-controller-1
   haproxy-bundle-docker-2   (ocf::heartbeat:docker):   Started engcloud-controller-2
 Docker container: openstack-cinder-volume [satellite-eng.nfv.charterlab.com:5000/nfv_charterlab_com-eng-osp13-osp13_containers-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0   (ocf::heartbeat:docker):   Started engcloud-controller-0
 Docker container: openstack-cinder-backup [satellite-eng.nfv.charterlab.com:5000/nfv_charterlab_com-eng-osp13-osp13_containers-cinder-backup:pcmklatest]
   openstack-cinder-backup-docker-0   (ocf::heartbeat:docker):   Started engcloud-controller-1

--------------------------------------------------
ceph.conf
--------------------------------------------------
[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor

[client.rgw.engcloud-controller-0]
host = engcloud-controller-0
keyring = /var/lib/ceph/radosgw/ceph-rgw.engcloud-controller-0/keyring
log file = /var/log/ceph/ceph-rgw-engcloud-controller-0.log
rgw frontends = civetweb port=[2605:1c00:50f2:29a8::23]:8080 num_threads=100

[client.rgw.engcloud-controller-1]
host = engcloud-controller-1
keyring = /var/lib/ceph/radosgw/ceph-rgw.engcloud-controller-1/keyring
log file = /var/log/ceph/ceph-rgw-engcloud-controller-1.log
rgw frontends = civetweb port=[2605:1c00:50f2:29a8::19]:8080 num_threads=100

[client.rgw.engcloud-controller-2]
host = engcloud-controller-2
keyring = /var/lib/ceph/radosgw/ceph-rgw.engcloud-controller-2/keyring
log file = /var/log/ceph/ceph-rgw-engcloud-controller-2.log
rgw frontends = civetweb port=[2605:1c00:50f2:29a8::17]:8080 num_threads=100

# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
# let's force the admin socket the way it was so we can properly check for existing instances
# also the line $cluster-$name.$pid.$cctid.asok is only needed when running multiple instances
# of the same daemon, thing ceph-ansible cannot do at the time of writing
admin socket = "$run_dir/$cluster-$name.asok"
cluster network = 2605:1c00:50f2:29b0::/64,2605:1c00:50f2:29b1::/64
fsid = 84cf784a-2baa-11e9-bdf3-525400afa1a4
journal_size = 5120
log file = /dev/null
mon cluster log file = /dev/null
mon host = [2605:1c00:50f2:29a8::19],[2605:1c00:50f2:29a8::23],[2605:1c00:50f2:29a8::17]
mon initial members = engcloud-controller-1,engcloud-controller-0,engcloud-controller-2
mon_max_pg_per_osd = 3072
ms bind ipv6 = true
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 3
public network = 2605:1c00:50f2:29a8::/64,2605:1c00:50f2:29a9::/64
rgw_keystone_accepted_roles = Member, admin
rgw_keystone_admin_domain = default
rgw_keystone_admin_password = GNTyKv8Q6ZECvx4vBeKAvGZaD
rgw_keystone_admin_project = service
rgw_keystone_admin_user = swift
rgw_keystone_api_version = 3
rgw_keystone_implicit_tenants = true
rgw_keystone_revocation_interval = 0
rgw_keystone_url = http://[2605:1c00:50f2:2998::1e]:5000
rgw_s3_auth_use_keystone = true
----------------------------------------------------------

Actual results:
ceph.conf contains the VIP address for MON and RGW

Expected results:
ceph.conf should contain the server address, not the VIP

Additional info:
Looks like ceph-ansible is picking the first address in the list:
https://github.com/ceph/ceph-ansible/blob/0eb56e36f8ce52015aa6c343faccd589e5fd2c6c/roles/ceph-facts/tasks/set_radosgw_address.yml#L29

This may need to look at netmask info or something similar to ensure it's not a VIP: /32 for IPv4, /128 for IPv6.
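Roughly, the selection behind the task linked above appears to do something like the following (a paraphrased sketch, not the exact ceph-ansible code; it only illustrates the "filter by radosgw_address_block, take the first match" behaviour that lets a VIP win whenever it sorts first):

# Paraphrased sketch of the address selection (not the actual ceph-ansible task):
# every IPv6 address on the host is filtered against radosgw_address_block and
# the first match is used, so a VIP that happens to be listed first gets picked.
- name: set radosgw address from radosgw_address_block (simplified sketch)
  set_fact:
    _radosgw_address: "{{ ansible_all_ipv6_addresses | ipaddr(radosgw_address_block) | first }}"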
Update: this behavior is triggered by re-running the overcloud deploy, as one would do to change a config value or perform a scale-out. The initial deployment does not have this issue, but as soon as the deployment completes, re-running the exact same deploy command makes the issue crop up. Also note that it is important to check ceph.conf on the node that is hosting the storage VIP after the re-deploy to see the issue. I think that during the initial deploy ceph-ansible completes before the VIPs are created in PCS, which is why the problem does not occur then.
Thanks. Despite not affecting fresh deployments, I can see how this can be hit on subsequent stack updates; it looks pretty urgent indeed.
I observe the same behaviour, and just to add to what has already been said: the issue will probably only be visible if you use IPv6 addresses. It looks like when a new IPv4 address is added to an interface it lands at the end of the address list, but in the case of IPv6 it actually becomes the first address on the interface, and that is what Ansible grabs when the ceph-ansible playbooks are rerun.
(In reply to mskalski from comment #3)
> I observe the same behaviour, and just to add to what has already been
> said: the issue will probably only be visible if you use IPv6 addresses.
> It looks like when a new IPv4 address is added to an interface it lands at
> the end of the address list, but in the case of IPv6 it actually becomes
> the first address on the interface, and that is what Ansible grabs when
> the ceph-ansible playbooks are rerun.

I was thinking of reprising the suggestion from the BZ report ("This may need to look at netmask info or something similar to ensure it's not a VIP: /32 for IPv4, /128 for IPv6."); can you confirm the IPv6 address has a /128 subnet?
(In reply to Giulio Fidente from comment #6)
> I was thinking of reprising the suggestion from the BZ report ... can you
> confirm the IPv6 address has a /128 subnet?

Yes, the VIP address has a /128 mask:

[root@overcloud8yi-ctrl-0 ~]# pcs status
Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.213.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud8yi-ctrl-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud8yi-ctrl-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Started overcloud8yi-ctrl-2
 Docker container set: galera-bundle [192.168.213.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0   (ocf::heartbeat:galera):   Master overcloud8yi-ctrl-0
   galera-bundle-1   (ocf::heartbeat:galera):   Master overcloud8yi-ctrl-1
   galera-bundle-2   (ocf::heartbeat:galera):   Master overcloud8yi-ctrl-2
 ip-192.168.213.60   (ocf::heartbeat:IPaddr2):   Started overcloud8yi-ctrl-0
 ip-10.87.4.227   (ocf::heartbeat:IPaddr2):   Started overcloud8yi-ctrl-1
 ip-172.16.0.90   (ocf::heartbeat:IPaddr2):   Started overcloud8yi-ctrl-2
 ip-fd9e.2d4e.a32a.7777..17   (ocf::heartbeat:IPaddr2):   Started overcloud8yi-ctrl-0

[root@overcloud8yi-ctrl-0 ~]# ip a s vlan333
6: vlan333@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:d0:c6:6a brd ff:ff:ff:ff:ff:ff
    inet6 fd9e:2d4e:a32a:7777::17/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fd9e:2d4e:a32a:7777::14/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fed0:c66a/64 scope link
       valid_lft forever preferred_lft forever
Yes. From pcs:

 ip-2605.1c00.50f2.2900.44.150.0.30   (ocf::heartbeat:IPaddr2):   Started qacloud-controller-1

[root@qacloud-controller-1 ~]# ip a | grep 2900
    inet6 2605:1c00:50f2:2900:44:150:0:30/128 scope global
    inet6 2605:1c00:50f2:2900::20/64 scope global
[root@qacloud-controller-1 ~]#
Thanks for helping. For IPv4 deployments we seem to be passing the right argument to ipaddr (ip/prefix); for example, in a recent CI job [1] monitor_address_block is selecting only /24 addresses.

Can you help collect this same information from your IPv6 deployment? In OSP14 the ceph-ansible inventory and group_vars are saved in a path like "/var/lib/mistral/overcloud/ceph-ansible". In OSP13 deployments the inventory and vars are instead saved in a temporary directory (created under /tmp) if you set CephAnsiblePlaybookVerbosity to 1 in a Heat environment file before the deployment (see the snippet after this comment); alternatively it should be possible to grep these params in the mistral-executor logs.

1. http://logs.openstack.org/23/638323/6/check/tripleo-ci-centos-7-scenario004-standalone/aaddc99/logs/undercloud/home/zuul/undercloud-ansible-ScgSrj/ceph-ansible/group_vars/all.yml.txt.gz
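A minimal Heat environment sketch for the OSP13 approach mentioned above; the file name is hypothetical, and it would be passed with an additional -e on the overcloud deploy command so the generated ceph-ansible inventory and group_vars are kept under /tmp for inspection:

# ceph-ansible-debug.yaml (hypothetical file name)
# Keeps the generated ceph-ansible inventory/group_vars in a temporary
# directory under /tmp so they can be inspected after the deploy.
parameter_defaults:
  CephAnsiblePlaybookVerbosity: 1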
So this is a new deployment with this VIP for the storage network:

[root@overcloud8yi-ctrl-0 ~]# pcs resource show ip-fd9e.2d4e.a32a.7777..26
 Resource: ip-fd9e.2d4e.a32a.7777..26 (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=128 ip=fd9e:2d4e:a32a:7777::26 lvs_ipv6_addrlabel=true lvs_ipv6_addrlabel_value=99 nic=vlan333
  Meta Attrs: resource-stickiness=INFINITY
  Operations: monitor interval=10s timeout=20s (ip-fd9e.2d4e.a32a.7777..26-monitor-interval-10s)
              start interval=0s timeout=20s (ip-fd9e.2d4e.a32a.7777..26-start-interval-0s)
              stop interval=0s timeout=20s (ip-fd9e.2d4e.a32a.7777..26-stop-interval-0s)

The monitor_address_block and radosgw_address_block are passed as expected with the subnet address (from the mistral executor.log):

 u'radosgw_address_block': u'fd9e:2d4e:a32a:7777::/64',
 u'user_config': True,
 u'radosgw_keystone': True,
 u'ceph_mgr_docker_extra_env': u'-e MGR_DASHBOARD=0',
 u'ceph_docker_image_tag': u'3-23',
 u'containerized_deployment': True,
 u'public_network': u'fd9e:2d4e:a32a:7777::/64',
 u'generate_fsid': False,
 u'monitor_address_block': u'fd9e:2d4e:a32a:7777::/64'

What I think is happening on the ceph-ansible side is that in this line [1] we get the local host addresses from the given address block [2] and the first one is chosen. The local address list on the hosts looks like this:

[root@overcloud8yi-ctrl-0 ~]# ansible -m setup localhost
"ansible_facts": {
    "ansible_all_ipv4_addresses": [
        "192.168.213.52",
        "192.168.213.67",
        "10.87.4.235",
        "172.31.0.1",
        "172.16.0.103"
    ],
    "ansible_all_ipv6_addresses": [
        "fe80::5054:ff:fecc:7d59",
        "fe80::5054:ff:fee6:b6d1",
        "fe80::5054:ff:fe49:e0a1",
        "fe80::42:79ff:fe2c:d276",
        "fe80::5054:ff:fecc:7d59",
        "fd9e:2d4e:a32a:7777::26",
        "fd9e:2d4e:a32a:7777::19",
        "fe80::5054:ff:fecc:7d59"
    ],

So, as mentioned earlier, the new IPv6 address becomes first on the list (as opposed to IPv4) and the VIP address fd9e:2d4e:a32a:7777::26 will be chosen. I am not sure what operations are performed in CI, but please remember that this issue is most likely IPv6 specific due to the new IP address ordering, and it is visible when the stack is updated because at that point the VIP is present and has become the first IPv6 address. During the initial deployment the ceph configuration is prepared before the VIP is set up.

[1] https://github.com/ceph/ceph-ansible/blob/v3.2.10/roles/ceph-config/templates/ceph.conf.j2#L47
[2] https://docs.ansible.com/ansible/latest/user_guide/playbooks_filters_ipaddr.html#getting-information-about-hosts-and-networks
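To make the ordering problem concrete, here is a small ad-hoc sketch of that "filter by address block, take the first match" behaviour using the fact list above (an illustration, not ceph-ansible code); the link-local fe80:: entries fall outside the /64, so with the VIP present the first match is fd9e:2d4e:a32a:7777::26 rather than ::19:

# Ad-hoc illustration with the addresses reported above: the subnet filter drops
# the fe80:: link-local entries, leaving the VIP (::26) as the first match.
- debug:
    msg: >-
      {{ ['fe80::5054:ff:fecc:7d59',
          'fd9e:2d4e:a32a:7777::26',
          'fd9e:2d4e:a32a:7777::19']
         | ipaddr('fd9e:2d4e:a32a:7777::/64')
         | first }}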
Hi, thanks again for helping with this bug. It seems in fact that in this line [1], while ipaddr() does get the ip/prefix argument correctly and could theoretically filter out addresses outside the wanted prefix, the _addresses list generated as an Ansible fact does not include the IP prefixes :(

An alternative option could be to discard the address later in the process if it has a /128 or /32 prefix (a sketch of that idea follows this comment).

1. https://github.com/ceph/ceph-ansible/blob/v3.2.10/roles/ceph-config/templates/ceph.conf.j2#L47
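A minimal sketch of that alternative, assuming the per-interface facts (which, unlike ansible_all_ipv6_addresses, do carry a prefix) and reusing the vlan333 interface and /64 network from the outputs above; this only illustrates the idea and is not a proposed patch:

# Sketch only: build the candidate list from per-interface facts so the prefix
# is visible, drop /128 host routes (the VIPs), then apply the usual subnet
# filter. "vlan333" and the /64 below are taken from the outputs in this bug.
- name: pick a non-VIP IPv6 address from the storage network (sketch)
  set_fact:
    _rgw_address: >-
      {{ ansible_vlan333.ipv6
         | rejectattr('prefix', 'equalto', '128')
         | map(attribute='address')
         | list
         | ipaddr('fd9e:2d4e:a32a:7777::/64')
         | first }}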
Hi, do you have any updates on this issue? I see that in the 4.0 branch things changed a bit around getting IP addresses; maybe the simplest solution is to just take the last address of the interface from the given subnet in the IPv6 case on the 3.2 branch (see the sketch below), and let 4.0 have a more sophisticated way to get it?
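For what it is worth, a minimal sketch of that 3.2 stopgap (not merged code; variable names follow the group_vars seen in this report): keep the existing subnet filter but take the last match instead of the first, since the IPv6 VIP is prepended to the interface and therefore shows up first in the facts:

# IPv6 stopgap sketch: same subnet filter as today, but take the last match so
# the prepended VIP (which sorts first) is not selected.
- name: set monitor address from monitor_address_block (IPv6 stopgap sketch)
  set_fact:
    _monitor_address: "{{ ansible_all_ipv6_addresses | ipaddr(monitor_address_block) | last }}"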
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2538