Bug 2249693 - [FFU][DCN] Ceph adopt stage fails when upgrading Ceph of DCN multistorage backend env
Summary: [FFU][DCN] Ceph adopt stage fails when upgrading Ceph of DCN multistorage backend env
Keywords:
Status: CLOSED DUPLICATE of bug 2231469
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 5.3z6
Assignee: Manoj Katari
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1997638
 
Reported: 2023-11-14 22:00 UTC by Marian Krcmarik
Modified: 2023-12-12 19:35 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-12 19:35:50 UTC
Embargoed:




Links:
Red Hat Issue Tracker OSP-30442 (last updated 2023-11-14 22:01:59 UTC)
Red Hat Issue Tracker RHCEPH-7912 (last updated 2023-11-15 07:58:05 UTC)

Description Marian Krcmarik 2023-11-14 22:00:59 UTC
Description of problem:
The Ceph upgrade procedure on an OpenStack DCN environment includes running a Ceph adopt stage by executing the following command:
openstack overcloud external-upgrade run ${EXTERNAL_ANSWER} \
    --stack ${STACK} \
    --skip-tags "ceph_health,opendev-validation,ceph_ansible_remote_tmp" \
    --tags cephadm_adopt  2>&1

This fails on the following Ansible task (from ceph-ansible):
    - name: add ceph label for core component
      command: "{{ ceph_cmd }} orch host label add {{ ansible_facts['nodename'] }} ceph"
      changed_when: false
      delegate_to: '{{ groups[mon_group_name][0] }}'
      when: inventory_hostname in groups.get(mon_group_name, []) or
            inventory_hostname in groups.get(osd_group_name, []) or
            inventory_hostname in groups.get(mds_group_name, []) or
            inventory_hostname in groups.get(rgw_group_name, []) or
            inventory_hostname in groups.get(mgr_group_name, []) or
            inventory_hostname in groups.get(rbdmirror_group_name, [])
I do not have the exact error message at the moment, but it complained that the host where the label was supposed to be added did not exist. So I tried to list all the hosts with the "orch host ls" command and it said the cluster had no hosts. I noticed that the hosts should be added to the cluster in the previous task:
    - name: manage nodes with cephadm - ipv4
      command: "{{ ceph_cmd }} orch host add {{ ansible_facts['nodename'] }} {{ ansible_facts['all_ipv4_addresses'] | ips_in_ranges(cephadm_mgmt_network.split(',')) | first }} {{ group_names | intersect(adopt_label_group_names) | join(' ') }}"
      changed_when: false
      delegate_to: '{{ groups[mon_group_name][0] }}'
      when: cephadm_mgmt_network is ansible.utils.ipv4

But the task was skipped. The condition for the task is:
when: cephadm_mgmt_network is ansible.utils.ipv4

I tried to find out what the value of cephadm_mgmt_network is, so I grepped the config-download directory for cephadm_mgmt_network (since I assume the vars are taken from there) and got this:
cephadm_mgmt_network: 192.168.24.0/24,192.168.34.0/24,192.168.44.0/24

It includes the networks from all Ceph clusters/DCN sites, which is not a single valid IPv4 range, so the task is skipped.
I tried editing all the occurrences in config-download to keep only the network relevant to the specific Ceph cluster/DCN stack, i.e.:
cephadm_mgmt_network: 192.168.44.0/24
And then the Ceph upgrade passes.
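
For illustration, here is a minimal sketch of why the combined value fails the IPv4 test while the single network passes. It uses Python's standard ipaddress module as a rough stand-in for the ansible.utils.ipv4 test (an illustrative approximation, not the actual Ansible plugin code):

  import ipaddress

  def roughly_ipv4(value):
      # Rough stand-in for the ansible.utils.ipv4 test used in the "when:" condition.
      try:
          return ipaddress.ip_network(value, strict=False).version == 4
      except ValueError:
          return False

  # Value rendered by config-download for the whole multi-stack DCN deployment:
  print(roughly_ipv4("192.168.24.0/24,192.168.34.0/24,192.168.44.0/24"))  # False -> task skipped
  # Value after editing it down to the network of the cluster being adopted:
  print(roughly_ipv4("192.168.44.0/24"))                                  # True  -> task runs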

NOTE: The upstream ceph-ansible includes the following patch:
https://github.com/ceph/ceph-ansible/commit/db2f3e42dc23d72cfae85ec942b6b2f43e482e81

With that patch, I assume the task won't be skipped and the first network in the list will be picked. I guess that's not the right solution, because the network specific to the given Ceph cluster should be picked?
Anyway, the latest downstream ceph-ansible does not include the patch.

Version-Release number of selected component (if applicable):
ceph-ansible-6.0.28.6-1.el8cp.noarch

How reproducible:
Always

Steps to Reproduce:
1. Execute a Ceph upgrade of an OSP multi-cluster DCN environment.
2. Execute the Ceph adopt stage.

Actual results:
It fails on the task "add ceph label for core component"

Expected results:
The Ceph upgrade passes.

Comment 1 John Fulton 2023-11-15 21:24:48 UTC
(In reply to Marian Krcmarik from comment #0)
[order changed]
> Version-Release number of selected component (if applicable):
> ceph-ansible-6.0.28.6-1.el8cp.noarch

So this is what was tested:

  https://github.com/ceph/ceph-ansible/blob/v6.0.28.6/infrastructure-playbooks/cephadm-adopt.yml#L375

> NOTE: The upstream ceph-ansible  includes following patch:
> https://github.com/ceph/ceph-ansible/commit/
> db2f3e42dc23d72cfae85ec942b6b2f43e482e81

The next release v6.0.28.7 already has the above patch:

  https://github.com/ceph/ceph-ansible/blob/v6.0.28.7/infrastructure-playbooks/cephadm-adopt.yml#L375

When the above is used, all three clusters get the correct hosts added relative to their networks, as per Manoj's testing.

192.168.24.0/24 is for central site
192.168.34.0/24 is for dcn1
192.168.44.0/24 is for dcn2

> Which means I assume that the task won't be skipped but the first network
> will be picked. I guess that's not the right solution because specific
> network for specific ceph cluster should be picked?

The solution from that patch in v6.0.28.7 is correct because it will add only the right host during adoption. Manoj tested it. I can also explain why it works.

Here's the task for the IPv4 case from v6.0.28.7

1.  name: manage nodes with cephadm - ipv4
2.  command: "{{ ceph_cmd }} orch host add {{ ansible_facts['nodename'] }} {{ ansible_facts['all_ipv4_addresses'] | ips_in_ranges(cephadm_mgmt_network.split(',')) | first }} {{ group_names | intersect(adopt_label_group_names) | join(' ') }}"
3.  changed_when: false
4.  delegate_to: '{{ groups[mon_group_name][0] }}'
5.  when: cephadm_mgmt_network.split(',')[0] is ansible.utils.ipv4

The purpose of line 5 is to ensure that we pull the correct set of IPs (v4 vs v6) from ansible_facts on line 2.

We do not support dual stack with Ceph in OSP17, so there won't be a mix of IPv6 and IPv4 addresses in the cephadm_mgmt_network list. Thus, as long as we know the first entry is IPv4, we are safe to use `ansible_facts['all_ipv4_addresses']` on line 2. The same logic applies to the task "manage nodes with cephadm - ipv6", which uses all_ipv6_addresses.
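
As a minimal sketch of what that condition decides (assuming no dual stack, as stated above; Python's ipaddress module and the variable names here are illustrative stand-ins, not the actual plugin code):

  import ipaddress

  cephadm_mgmt_network = "192.168.24.0/24,192.168.34.0/24,192.168.44.0/24"

  # Line 5 only inspects the first entry of the comma-separated list...
  first_range = ipaddress.ip_network(cephadm_mgmt_network.split(",")[0], strict=False)

  # ...to decide which facts key the matching task will read.
  facts_key = "all_ipv4_addresses" if first_range.version == 4 else "all_ipv6_addresses"
  print(facts_key)  # all_ipv4_addresses -> the ipv4 task runs, the ipv6 variant is skipped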

If we were to update line 5 so that it verified that every entry in the list is valid IPv4, it wouldn't add any benefit, since the goal of line 5 is only to confirm whether we should be pulling v4 or v6 facts. In theory it could even have the unwanted side effect of preventing all clusters from being adopted if another cluster's range in the list were invalid.

Line 2 results in the following command being run:

 ceph orch host add $NAME $IP $LABEL

Where $IP comes from this expression:

  {{ ansible_facts['all_ipv4_addresses'] | ips_in_ranges(cephadm_mgmt_network.split(',')) | first }}

The above will produce the correct set of IPs relative to each cluster, and `| first` is only used as a precaution in case more than one is returned.

Because of line 4, this task will be delegated to the first mon node in the Ceph cluster being adopted (central, dcn1, dcn2) each time the playbook is run. So this expression:

  ansible_facts['all_ipv4_addresses']

Will return all IPs on the first monitor of the cluster being adopted (e.g. the IP from the internal API net, the IP from the ctlplane net, the IP from the storage net, etc.). Only the IP in the range of the cephadm_mgmt_network list will be selected. Thus, if dcn1 is being adopted and its mon has IP 192.168.34.42, then only 192.168.34.42 will come out of the expression, since that IP falls within the ranges 192.168.24.0/24,192.168.34.0/24,192.168.44.0/24. Thus, the correct IP will be added relative to the mon node where the task is delegated, per playbook run, per adoption.
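
As a sketch of that selection (the ips_in_ranges function below is a simplified approximation of ceph-ansible's filter, not its actual code, and the two non-192.168.34.x addresses are made-up examples):

  import ipaddress

  def ips_in_ranges(host_ips, ranges):
      # Simplified approximation of ceph-ansible's ips_in_ranges filter:
      # keep only the host IPs that fall inside one of the given CIDR ranges.
      nets = [ipaddress.ip_network(r) for r in ranges]
      return [ip for ip in host_ips if any(ipaddress.ip_address(ip) in n for n in nets)]

  # Hypothetical facts gathered on the dcn1 mon the task is delegated to:
  all_ipv4_addresses = ["172.17.1.15", "10.20.30.7", "192.168.34.42"]
  ranges = "192.168.24.0/24,192.168.34.0/24,192.168.44.0/24".split(",")

  selected = ips_in_ranges(all_ipv4_addresses, ranges)
  print(selected)     # ['192.168.34.42']
  print(selected[0])  # what "| first" passes to "ceph orch host add"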

Thus, my recommendation is that this bug be set to POST, since v6.0.28.7 already contains the fix and we just need it to be shipped. I.e., the current latest downstream version is 6.0.28.6, but we need our customers to be able to get v6.0.28.7.

Comment 2 Manoj Katari 2023-11-16 08:13:19 UTC
As the working fix is available upstream in ceph-ansible v6.0.28.7 and we are awaiting that version downstream, moving the BZ to POST.

Comment 7 John Fulton 2023-11-20 13:21:53 UTC

*** This bug has been marked as a duplicate of bug 2231469 ***

Comment 10 John Fulton 2023-12-12 19:35:50 UTC
The fix for bug 2249693 is:

  https://github.com/ceph/ceph-ansible/pull/7452/files

which is in:

  https://github.com/ceph/ceph-ansible/tree/v6.0.28.7

The fix for bug 2231469 is the same, and I see it is targeted at 5.3z6.

Thus, I'm closing 2249693 as a duplicate of 2231469.

*** This bug has been marked as a duplicate of bug 2231469 ***

