Created attachment 1650389 [details]
/var/lib/mistral/dcn1/ansible.log

Description of problem:
OSP 16 DCN with multi-stack (central/dcn1/dcn2) and a spine-leaf network topology. Deployment of stack dcn1 failed with this error:

    "fatal: [dcn1-computehci1-0]: FAILED! => ",
    " msg: 'Unexpected templating type error occurred on ({{ _monitor_addresses | default([]) + [{ ''name'': item, ''addr'': hostvars[item][''ansible_all_ipv4_addresses''] | ips_in_ranges(hostvars[item][''monitor_address_block''].split('','')) | first }] }}): must be str, not list'",
    "fatal: [dcn1-computehci1-1]: FAILED! => ",
    "fatal: [dcn1-computehci1-2]: FAILED! => ",

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1

How reproducible:
See documentation -> https://docs.google.com/document/d/1QV4lYXh2tRSoxdOZgWOK3H6UeNlzS1rojznH0dM0hlc/edit#
Templates -> https://code.engineering.redhat.com/gerrit/gitweb?p=rhos-infrared.git;a=tree;f=settings/installer/ospd/deployment/edge/osp-16-spine-leaf-multistack-hci;h=100cc538d1ed00cee4c95e2caec25973ecb94588;hb=HEAD

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I am holding the environment for investigation. Please ping me on email/IRC for details.
You had the following in your parameters for ceph-ansible:

    cluster_network: 172.18.1.0/24,172.18.2.0/24
    public_network: 172.23.1.0/24,172.23.2.0/24
    monitor_address_block: 172.23.1.0/24,172.23.2.0/24

For example:

    [stack@site-undercloud-0 dcn1]$ sudo grep monitor_address_block /var/lib/mistral/config-download-latest/ceph-ansible/group_vars/all.yml
    monitor_address_block: 172.23.1.0/24,172.23.2.0/24
    [stack@site-undercloud-0 dcn1]$

As per the docs [1], these values need to be overridden with CephAnsibleExtraConfig and quoted. I added the following to your internal.yaml:

    CephAnsibleExtraConfig:
      cluster_network: '172.18.1.0/24,172.18.2.0/24'
      public_network: '172.23.1.0/24,172.23.2.0/24'
      monitor_address_block: '172.23.1.0/24,172.23.2.0/24'

You had put CephAnsibleExtraConfig in nodes_data.yaml, but you may only use this parameter once, and it was already in your internal.yaml to set 'is_hci: true', so that's where I put it. I then ran a stack update.

Your overcloud then failed with a new error message, because the error in the bug you reported was no longer happening [2]. The new error happened because your host doesn't have the desired '172.23' or '172.18' IPs on it [3]. This, however, is not a ceph-ansible bug. It's a problem you're having with assigning the correct IPs to your hosts. When you determine what the correct IP should be on your host, quote that IP and override it as I have described above.

It also looks like we need a doc bug for getting that in. Harold, who worked on bug 1740283, modified ceph-ansible during the 16 cycle so it would support these quoted values [4]; you just need to quote them once you correctly configure your deployment to assign them.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/spine_leaf_networking/index#assigning-routes-for-roles

[2]
    "ok: [dcn1-computehci1-0] => (item=dcn1-computehci1-0) => changed=false ",
    " _monitor_addresses: '[{''name'': ''dcn1-computehci1-0'', ''addr'': AnsibleUndefined}]'",
    " item: dcn1-computehci1-0",
    "ok: [dcn1-computehci1-1] => (item=dcn1-computehci1-0) => changed=false ",
    "fatal: [dcn1-computehci1-0]: FAILED! => ",
    " msg: 'Unexpected templating type error occurred on ({{ _monitor_addresses | default([]) + [{ ''name'': item, ''addr'': hostvars[item][''ansible_all_ipv4_addresses''] | ips_in_ranges(hostvars[item][''monitor_address_block''].split('','')) | first }] }}): must be str, not list'",
    "ok: [dcn1-computehci1-2] => (item=dcn1-computehci1-0) => changed=false ",
    "fatal: [dcn1-computehci1-1]: FAILED! => ",

[3]
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 52:54:00:8b:2e:71 brd ff:ff:ff:ff:ff:ff
        inet 192.168.34.89/24 brd 192.168.34.255 scope global dynamic noprefixroute ens3
           valid_lft 78942sec preferred_lft 78942sec
        inet6 fe80::5054:ff:fe8b:2e71/64 scope link
           valid_lft forever preferred_lft forever
    3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 52:54:00:94:50:e1 brd ff:ff:ff:ff:ff:ff
        inet 172.16.20.66/24 brd 172.16.20.255 scope global dynamic noprefixroute ens4
           valid_lft 2503sec preferred_lft 2503sec
        inet6 fe80::7beb:692b:fc54:fdd4/64 scope link noprefixroute
           valid_lft forever preferred_lft forever
    4: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 52:54:00:83:ef:3c brd ff:ff:ff:ff:ff:ff
        inet 10.0.20.69/24 brd 10.0.20.255 scope global dynamic noprefixroute ens5
           valid_lft 2759sec preferred_lft 2759sec
        inet6 2620:52:0:13b8::fe:63/128 scope global dynamic noprefixroute
           valid_lft 1985sec preferred_lft 1985sec
        inet6 fe80::b5b1:adc4:16af:f585/64 scope link noprefixroute
           valid_lft forever preferred_lft forever

[4] https://github.com/ceph/ceph-ansible/commit/e695efcaf79909e2237197fd473117930e8d83e5#diff-d53302523567dc01b57c06bb371f1e3d
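
To illustrate why the host in [3] yields no monitor address, here is a minimal Python sketch that checks the host's non-loopback IPv4 addresses against the monitor_address_block ranges. The function below is a simplified stand-in written for this comment, not the actual ceph-ansible `ips_in_ranges` filter, and it assumes the fact list contains only the three addresses shown on ens3, ens4 and ens5.

```python
import ipaddress


def ips_in_ranges(ips, ranges):
    """Return the IPs that fall inside at least one CIDR range.

    Simplified stand-in for illustration; not the ceph-ansible filter itself.
    """
    networks = [ipaddress.ip_network(r.strip()) for r in ranges]
    return [
        ip for ip in ips
        if any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Non-loopback IPv4 addresses taken from the `ip a` output in [3].
host_ips = ["192.168.34.89", "172.16.20.66", "10.0.20.69"]

# Value from group_vars/all.yml, split on commas as the template does.
monitor_address_block = "172.23.1.0/24,172.23.2.0/24"

matches = ips_in_ranges(host_ips, monitor_address_block.split(","))
print(matches)
```

With these inputs the list is empty, so taking the first element has nothing to return. That is consistent with the `AnsibleUndefined` value for `addr` in [2].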

New Summary after RCA:

The Storage and StorageMgmt networks passed to ceph-ansible in spine/leaf deployments are passed as a list:

    public_network: 172.23.1.0/24,172.23.2.0/24

As per the error message in #1, ceph-ansible cannot parse the above. The workaround is to determine the appropriate network ceph-ansible should use, then pass it as a quoted override:

    CephAnsibleExtraConfig:
      public_network: '172.23.1.0/24,172.23.2.0/24'

Though quoting was the recommended and documented method in the past, it should no longer be necessary in OSP16. The goal of this bug is either to modify ceph-ansible so it can manage the non-quoted value [1], or for TripleO to quote the data before it is passed to ceph-ansible.

The next step is for the ceph-ansible team to provide input on which of the above options we should pursue (hence the needinfo to gabrioux).
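
As a rough illustration of the second option (normalizing the value before it reaches ceph-ansible), the Python sketch below turns either a list of CIDRs or a comma-separated string into one comma-joined string. The helper name `as_cidr_string` is hypothetical, and nothing here reflects how TripleO actually builds the ceph-ansible variables; it only shows the shape of the normalization.

```python
def as_cidr_string(value):
    """Return a single comma-separated string of CIDRs.

    Hypothetical helper: accepts a list/tuple of CIDRs or a
    comma-separated string, and strips stray whitespace.
    """
    if isinstance(value, (list, tuple)):
        parts = [str(v) for v in value]
    else:
        parts = str(value).split(",")
    return ",".join(p.strip() for p in parts if p.strip())


# Both input shapes normalize to the same string value.
from_list = as_cidr_string(["172.23.1.0/24", "172.23.2.0/24"])
from_text = as_cidr_string("172.23.1.0/24, 172.23.2.0/24")
print(from_list)
print(from_text)
```

Whichever side does the normalization, the point is that ceph-ansible always receives one string it can split on commas.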

Resetting product, as it is ceph-ansible that requires the quotes. For now, we documented the workaround on the OpenStack side in chapter 2:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/deploying_distributed_compute_nodes_with_separate_heat_stacks/index#proc_designing-your-separate-heat-stacks-deployment