Bug 1788536 - spine/leaf DCN deployments require quoted storage network overrides
Summary: spine/leaf DCN deployments require quoted storage network overrides
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 5.*
Assignee: Guillaume Abrioux
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 1760354 1802774
 
Reported: 2020-01-07 13:04 UTC by Yuri Obshansky
Modified: 2020-03-23 16:06 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-23 16:06:37 UTC
Embargoed:


Attachments:
/var/lib/mistral/dcn1/ansible.log (215.37 KB, application/gzip)
2020-01-07 13:04 UTC, Yuri Obshansky

Description Yuri Obshansky 2020-01-07 13:04:56 UTC
Created attachment 1650389
/var/lib/mistral/dcn1/ansible.log

Description of problem:
OSP 16 DCN multi-stack deployment (central/dcn1/dcn2) with a spine-leaf network topology.
Deployment of the dcn1 stack failed with the following error:
        "fatal: [dcn1-computehci1-0]: FAILED! => ",
        "  msg: 'Unexpected templating type error occurred on ({{ _monitor_addresses | default([]) + [{ ''name'': item, ''addr'': hostvars[item][''ansible_all_ipv4_addresses''] | ips_in_ranges(hostvars[item][''monitor_address_block''].split('','')) | first }] }}): must be str, not list'",
        "fatal: [dcn1-computehci1-1]: FAILED! => ",
        "fatal: [dcn1-computehci1-2]: FAILED! => ",


Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1

How reproducible:
See documentation ->
https://docs.google.com/document/d/1QV4lYXh2tRSoxdOZgWOK3H6UeNlzS1rojznH0dM0hlc/edit#
Templates ->
https://code.engineering.redhat.com/gerrit/gitweb?p=rhos-infrared.git;a=tree;f=settings/installer/ospd/deployment/edge/osp-16-spine-leaf-multistack-hci;h=100cc538d1ed00cee4c95e2caec25973ecb94588;hb=HEAD

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I am holding the environment for investigation.
Please ping me on email/IRC for details.

Comment 1 John Fulton 2020-01-07 20:09:56 UTC
You had the following in your parameters for ceph-ansible:

cluster_network: 172.18.1.0/24,172.18.2.0/24
public_network: 172.23.1.0/24,172.23.2.0/24
monitor_address_block: 172.23.1.0/24,172.23.2.0/24

for example:

[stack@site-undercloud-0 dcn1]$ sudo grep monitor_address_block /var/lib/mistral/config-download-latest/ceph-ansible/group_vars/all.yml
monitor_address_block: 172.23.1.0/24,172.23.2.0/24
[stack@site-undercloud-0 dcn1]$

As per the docs [1], they need to be overridden with CephAnsibleExtraConfig
and quoted. I added the following to your internal.yaml:

CephAnsibleExtraConfig:
  cluster_network: '172.18.1.0/24,172.18.2.0/24'
  public_network: '172.23.1.0/24,172.23.2.0/24'
  monitor_address_block: '172.23.1.0/24,172.23.2.0/24'

You had put CephAnsibleExtraConfig in nodes_data.yaml, but you may only use
this parameter once, and it was already in your internal.yaml to set
'is_hci: true', so that's where I put it. I then ran a stack update.

Your overcloud then failed with a new error message, because the error in
the bug you reported was no longer happening [2]. The new error happened
because your host doesn't have the desired '172.23' or '172.18' IPs
on it [3].
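
The address selection in the failing expression can be approximated as follows. This is a minimal Python sketch, not ceph-ansible's code: `ips_in_ranges_sketch` is a hypothetical stand-in for the `ips_in_ranges` filter, and `host_ips` is assumed to be the non-loopback IPv4 addresses shown in [3].

```python
import ipaddress


def ips_in_ranges_sketch(addresses, ranges):
    """Return the addresses that fall inside any of the given CIDR ranges.

    Hypothetical stand-in for the ips_in_ranges filter used in the template.
    """
    networks = [ipaddress.ip_network(r.strip()) for r in ranges]
    return [
        addr for addr in addresses
        if any(ipaddress.ip_address(addr) in net for net in networks)
    ]


# Non-loopback IPv4 addresses taken from the `ip a` output in [3].
host_ips = ["192.168.34.89", "172.16.20.66", "10.0.20.69"]

# Quoted override value, split on commas as the template does.
monitor_address_block = "172.23.1.0/24,172.23.2.0/24"

matches = ips_in_ranges_sketch(host_ips, monitor_address_block.split(","))
print(matches)  # [] because no host address is inside either range
```

With no match, the `first` filter in the template has nothing to select, which is consistent with the `AnsibleUndefined` address in [2].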

This however is not a ceph-ansible bug. It's a problem you're having with
assigning the correct IPs to your hosts. When you determine what the correct
IP should be on your host, quote that IP and override it as I have described
above. It also looks like we need a doc bug for getting that in.

Harold, who worked on bug 1740283, modified ceph-ansible during the 16 cycle
so it would support these quoted values [4]; you just need to quote them once
you correctly configure your deployment to assign them.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/spine_leaf_networking/index#assigning-routes-for-roles

[2]
        "ok: [dcn1-computehci1-0] => (item=dcn1-computehci1-0) => changed=false ",
        "    _monitor_addresses: '[{''name'': ''dcn1-computehci1-0'', ''addr'': AnsibleUndefined}]'",                                                                                       
        "  item: dcn1-computehci1-0",
        "ok: [dcn1-computehci1-1] => (item=dcn1-computehci1-0) => changed=false ",
        "fatal: [dcn1-computehci1-0]: FAILED! => ",
        "  msg: 'Unexpected templating type error occurred on ({{ _monitor_addresses | default([]) + [{ ''name'': item, ''addr'': hostvars[item][''ansible_all_ipv4_addresses''] | ips_in_ranges(hostvars[item][''monitor_address_block''].split('','')) | first }] }}): must be str, not list'",
        "ok: [dcn1-computehci1-2] => (item=dcn1-computehci1-0) => changed=false ",
        "fatal: [dcn1-computehci1-1]: FAILED! => ",


[3]
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:8b:2e:71 brd ff:ff:ff:ff:ff:ff
    inet 192.168.34.89/24 brd 192.168.34.255 scope global dynamic noprefixroute ens3
       valid_lft 78942sec preferred_lft 78942sec
    inet6 fe80::5054:ff:fe8b:2e71/64 scope link 
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:94:50:e1 brd ff:ff:ff:ff:ff:ff
    inet 172.16.20.66/24 brd 172.16.20.255 scope global dynamic noprefixroute ens4
       valid_lft 2503sec preferred_lft 2503sec
    inet6 fe80::7beb:692b:fc54:fdd4/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:83:ef:3c brd ff:ff:ff:ff:ff:ff
    inet 10.0.20.69/24 brd 10.0.20.255 scope global dynamic noprefixroute ens5
       valid_lft 2759sec preferred_lft 2759sec
    inet6 2620:52:0:13b8::fe:63/128 scope global dynamic noprefixroute 
       valid_lft 1985sec preferred_lft 1985sec
    inet6 fe80::b5b1:adc4:16af:f585/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

[4] https://github.com/ceph/ceph-ansible/commit/e695efcaf79909e2237197fd473117930e8d83e5#diff-d53302523567dc01b57c06bb371f1e3d

Comment 8 John Fulton 2020-01-08 13:19:41 UTC
New Summary after RCA:

In spine/leaf deployments, the Storage and StorageMgmt networks are passed to ceph-ansible as a list:

 public_network: 172.23.1.0/24,172.23.2.0/24

As per the error message in #1, ceph-ansible cannot parse the above. The workaround is to determine the appropriate network ceph-ansible should use and then pass it as an override and use quotes.

CephAnsibleExtraConfig:
  public_network: '172.23.1.0/24,172.23.2.0/24'

Though quoting was the recommended and documented method in the past, it should no longer be necessary in OSP16.

The goal of this bug is to either modify ceph-ansible so it can manage the non-quoted value [1] or for TripleO to quote the data before it is passed to ceph-ansible.
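
The second option, normalizing the value before it reaches ceph-ansible, could look roughly like the sketch below. It is illustrative only and is not TripleO or ceph-ansible code: `to_ceph_ansible_network` is a hypothetical helper that accepts either a list of CIDRs or a comma-separated string and returns the single comma-separated string used in the quoted workaround.

```python
import ipaddress


def to_ceph_ansible_network(value):
    """Hypothetical helper: return the networks as one comma-separated string.

    Accepts a comma-separated string or an iterable of CIDR strings.
    """
    if isinstance(value, str):
        parts = value.split(",")
    else:
        parts = list(value)

    cidrs = [p.strip() for p in parts if p.strip()]
    for cidr in cidrs:
        # Raises ValueError if an entry is not a valid network.
        ipaddress.ip_network(cidr)
    return ",".join(cidrs)


as_list = ["172.23.1.0/24", "172.23.2.0/24"]
as_string = "172.23.1.0/24, 172.23.2.0/24"

print(to_ceph_ansible_network(as_list))    # 172.23.1.0/24,172.23.2.0/24
print(to_ceph_ansible_network(as_string))  # 172.23.1.0/24,172.23.2.0/24
```

Either input form yields the same string, so the consumer always receives a `str` that it can split on commas.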

The next step is for the ceph-ansible team to provide input on which of the above options we should pursue (hence the needinfo to gabrioux).

Comment 12 John Fulton 2020-02-24 13:54:33 UTC
Resetting the product, as it's ceph-ansible that requires the quotes.

We have documented the workaround on the OpenStack side for now, in chapter 2:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/deploying_distributed_compute_nodes_with_separate_heat_stacks/index#proc_designing-your-separate-heat-stacks-deployment

