Bug 1788536

Summary: spine/leaf DCN deployments require quoted storage network overrides
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Yuri Obshansky <yobshans>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED WONTFIX
QA Contact: Vasishta <vashastr>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.0
CC: aschoen, ceph-eng-bugs, emacchi, gabrioux, gfidente, gmeno, johfulto, mburns, nthomas, pasik, pgrist, slinaber, ykaul
Target Milestone: rc
Keywords: Triaged
Target Release: 5.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-23 16:06:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1760354, 1802774    
Attachments:
Description Flags
/var/lib/mistral/dcn1/ansible.log none

Description Yuri Obshansky 2020-01-07 13:04:56 UTC
Created attachment 1650389
/var/lib/mistral/dcn1/ansible.log

Description of problem:
OSP 16 DCN multi-stack deployment (central/dcn1/dcn2) with a spine-leaf network topology.
The deployment of stack dcn1 failed with the error:
        "fatal: [dcn1-computehci1-0]: FAILED! => ",
        "  msg: 'Unexpected templating type error occurred on ({{ _monitor_addresses | default([]) + [{ ''name'': item, ''addr'': hostvars[item][''ansible_all_ipv4_addresses''] | ips_in_ranges(hostvars[item][''monitor_address_block''].split('','')) | first }] }}): must be str, not list'",
        "fatal: [dcn1-computehci1-1]: FAILED! => ",
        "fatal: [dcn1-computehci1-2]: FAILED! => ",


Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1

How reproducible:
See documentation ->
https://docs.google.com/document/d/1QV4lYXh2tRSoxdOZgWOK3H6UeNlzS1rojznH0dM0hlc/edit#
Templates ->
https://code.engineering.redhat.com/gerrit/gitweb?p=rhos-infrared.git;a=tree;f=settings/installer/ospd/deployment/edge/osp-16-spine-leaf-multistack-hci;h=100cc538d1ed00cee4c95e2caec25973ecb94588;hb=HEAD

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I am holding the environment for investigation.
Please ping me on email/IRC for details.

Comment 1 John Fulton 2020-01-07 20:09:56 UTC
You had the following in your parameters for ceph-ansible:

cluster_network: 172.18.1.0/24,172.18.2.0/24
public_network: 172.23.1.0/24,172.23.2.0/24
monitor_address_block: 172.23.1.0/24,172.23.2.0/24

for example:

[stack@site-undercloud-0 dcn1]$ sudo grep monitor_address_block /var/lib/mistral/config-download-latest/ceph-ansible/group_vars/all.yml
monitor_address_block: 172.23.1.0/24,172.23.2.0/24
[stack@site-undercloud-0 dcn1]$

As per the docs [1] they need to be passed with CephAnsibleExtraConfig to be
overridden and then quoted. I added the following to your internal.yaml:

CephAnsibleExtraConfig:
  cluster_network: '172.18.1.0/24,172.18.2.0/24'
  public_network: '172.23.1.0/24,172.23.2.0/24'
  monitor_address_block: '172.23.1.0/24,172.23.2.0/24'

You had put CephAnsibleExtraConfig in nodes_data.yaml, but you may only use
this parameter once and it was already in your internal.yaml to set
'is_hci: true', so that's where I put it. I then ran a stack update.

Your overcloud then failed with a new error message because the error in the
bug you reported was no longer happening [2]. The new error happened
because your host doesn't have the desired '172.23' or '172.18' IPs
on it [3].
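
To make the missing-IP point concrete, here is a minimal Python sketch of the kind of check implied by the templated expression in the error (split the comma-separated block, then keep the host IPs that fall inside any of the ranges). It is not ceph-ansible's actual ips_in_ranges filter; the function name ips_in_cidrs is hypothetical and only the standard ipaddress module is used. The host addresses are the IPv4 addresses from the output in [3].

```python
import ipaddress


def ips_in_cidrs(host_ips, address_block):
    """Return the host IPs that fall inside any CIDR of a comma-separated block.

    Hypothetical helper for illustration; not the ceph-ansible filter itself.
    """
    networks = [ipaddress.ip_network(cidr.strip())
                for cidr in address_block.split(",")]
    return [ip for ip in host_ips
            if any(ipaddress.ip_address(ip) in net for net in networks)]


# IPv4 addresses taken from the `ip a` output in [3].
host_ips = ["127.0.0.1", "192.168.34.89", "172.16.20.66", "10.0.20.69"]
monitor_address_block = "172.23.1.0/24,172.23.2.0/24"

matches = ips_in_cidrs(host_ips, monitor_address_block)
print(matches)  # [] because no address on the host is in either range
```

With an empty result there is no first element to pick, which is consistent with the undefined 'addr' value shown in [2].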

This however is not a ceph-ansible bug. It's a problem you're having with
assigning the correct IPs to your hosts. When you determine what the correct
IP should be on your host, quote that IP and override it as I have described
above. It also looks like we need a doc bug for getting that in.

Harold, who worked on bug 1740283, modified ceph-ansible during the 16 cycle
so it would support these quoted values [4]. You just need to quote them once
you correctly configure your deployment to assign them.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/spine_leaf_networking/index#assigning-routes-for-roles

[2]
        "ok: [dcn1-computehci1-0] => (item=dcn1-computehci1-0) => changed=false ",
        "    _monitor_addresses: '[{''name'': ''dcn1-computehci1-0'', ''addr'': AnsibleUndefined}]'",                                                                                       
        "  item: dcn1-computehci1-0",
        "ok: [dcn1-computehci1-1] => (item=dcn1-computehci1-0) => changed=false ",
        "fatal: [dcn1-computehci1-0]: FAILED! => ",
        "  msg: 'Unexpected templating type error occurred on ({{ _monitor_addresses | default([]) + [{ ''name'': item, ''addr'': hostvars[item][''ansible_all_ipv4_addresses''] | ips_in_ranges(hostvars[item][''monitor_address_block''].split('','')) | first }] }}): must be str, not list'",
        "ok: [dcn1-computehci1-2] => (item=dcn1-computehci1-0) => changed=false ",
        "fatal: [dcn1-computehci1-1]: FAILED! => ",


[3]
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:8b:2e:71 brd ff:ff:ff:ff:ff:ff
    inet 192.168.34.89/24 brd 192.168.34.255 scope global dynamic noprefixroute ens3
       valid_lft 78942sec preferred_lft 78942sec
    inet6 fe80::5054:ff:fe8b:2e71/64 scope link 
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:94:50:e1 brd ff:ff:ff:ff:ff:ff
    inet 172.16.20.66/24 brd 172.16.20.255 scope global dynamic noprefixroute ens4
       valid_lft 2503sec preferred_lft 2503sec
    inet6 fe80::7beb:692b:fc54:fdd4/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:83:ef:3c brd ff:ff:ff:ff:ff:ff
    inet 10.0.20.69/24 brd 10.0.20.255 scope global dynamic noprefixroute ens5
       valid_lft 2759sec preferred_lft 2759sec
    inet6 2620:52:0:13b8::fe:63/128 scope global dynamic noprefixroute 
       valid_lft 1985sec preferred_lft 1985sec
    inet6 fe80::b5b1:adc4:16af:f585/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

[4] https://github.com/ceph/ceph-ansible/commit/e695efcaf79909e2237197fd473117930e8d83e5#diff-d53302523567dc01b57c06bb371f1e3d

Comment 8 John Fulton 2020-01-08 13:19:41 UTC
New Summary after RCA:

In spine/leaf deployments, the Storage and StorageMgmt networks are passed to ceph-ansible as a list:

 public_network: 172.23.1.0/24,172.23.2.0/24

As per the error message in #1, ceph-ansible cannot parse the above. The workaround is to determine the appropriate networks ceph-ansible should use and then pass them, quoted, as an override:

CephAnsibleExtraConfig:
  public_network: '172.23.1.0/24,172.23.2.0/24'

Though quoting was the recommended and documented method in the past, it should no longer be necessary in OSP16.

The goal of this bug is to either modify ceph-ansible so it can manage the non-quoted value [1] or for TripleO to quote the data before it is passed to ceph-ansible.
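
As a rough illustration of the second option (quoting the data before it reaches ceph-ansible), the following Python sketch collapses a list of networks into the single comma-separated, single-quoted form used in the workaround. It is not TripleO code; the function names as_quoted_network_block and render_override are hypothetical and exist only for this example.

```python
def as_quoted_network_block(networks):
    """Collapse a list (or comma-separated string) of CIDRs into one string."""
    if isinstance(networks, str):
        parts = networks.split(",")
    else:
        parts = list(networks)
    # Drop surrounding whitespace and any empty entries.
    cleaned = [p.strip() for p in parts if p.strip()]
    return ",".join(cleaned)


def render_override(key, networks):
    """Render one override line with the value wrapped in single quotes."""
    return "{}: '{}'".format(key, as_quoted_network_block(networks))


print(render_override("public_network", ["172.23.1.0/24", "172.23.2.0/24"]))
# public_network: '172.23.1.0/24,172.23.2.0/24'
```

The helper accepts either a list or an already comma-separated string, so the rendered line is the same regardless of which form arrives.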

The next step is for the ceph-ansible team to provide input on which of the above options we should pursue (hence the needinfo to gabrioux).

Comment 12 John Fulton 2020-02-24 13:54:33 UTC
Resetting product, as it is ceph-ansible that requires the quotes.

We documented the workaround on the OpenStack side for now in chapter 2:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/deploying_distributed_compute_nodes_with_separate_heat_stacks/index#proc_designing-your-separate-heat-stacks-deployment