Description of problem:
The ceph upgrade procedure on an OpenStack DCN environment includes a cephadm adopt stage executed with the following command:

openstack overcloud external-upgrade run ${EXTERNAL_ANSWER} \
    --stack ${STACK} \
    --skip-tags "ceph_health,opendev-validation,ceph_ansible_remote_tmp" \
    --tags cephadm_adopt 2>&1

This fails on the following ansible task (from ceph-ansible):

- name: add ceph label for core component
  command: "{{ ceph_cmd }} orch host label add {{ ansible_facts['nodename'] }} ceph"
  changed_when: false
  delegate_to: '{{ groups[mon_group_name][0] }}'
  when: inventory_hostname in groups.get(mon_group_name, []) or
        inventory_hostname in groups.get(osd_group_name, []) or
        inventory_hostname in groups.get(mds_group_name, []) or
        inventory_hostname in groups.get(rgw_group_name, []) or
        inventory_hostname in groups.get(mgr_group_name, []) or
        inventory_hostname in groups.get(rbdmirror_group_name, [])

I do not have the exact error message at the moment, but it complained that the host where the label was supposed to be added does not exist. So I listed the hosts with the "orch host ls" command and it reported that the cluster has no hosts. I noticed that the hosts should have been added to the cluster by the previous task:

- name: manage nodes with cephadm - ipv4
  command: "{{ ceph_cmd }} orch host add {{ ansible_facts['nodename'] }} {{ ansible_facts['all_ipv4_addresses'] | ips_in_ranges(cephadm_mgmt_network.split(',')) | first }} {{ group_names | intersect(adopt_label_group_names) | join(' ') }}"
  changed_when: false
  delegate_to: '{{ groups[mon_group_name][0] }}'
  when: cephadm_mgmt_network is ansible.utils.ipv4

But that task was skipped. Its condition is:

when: cephadm_mgmt_network is ansible.utils.ipv4

To find the value of cephadm_mgmt_network, I grepped the config-download directory (since I assume the vars are taken from there) and got:

cephadm_mgmt_network: 192.168.24.0/24,192.168.34.0/24,192.168.44.0/24

It includes the networks of all ceph clusters/DCN sites, which is not a single valid IPv4 range, so the condition evaluates to false and the task is skipped (a minimal illustration follows this report). When I edited all the occurrences in config-download to keep only the network relevant to the specific ceph cluster/DCN stack, i.e.:

cephadm_mgmt_network: 192.168.44.0/24

the ceph upgrade passed.

NOTE: Upstream ceph-ansible includes the following patch:
https://github.com/ceph/ceph-ansible/commit/db2f3e42dc23d72cfae85ec942b6b2f43e482e81
which I assume means the task will no longer be skipped and the first network will be picked instead. I am not sure that is the right solution, because the network specific to the given ceph cluster should be picked. In any case, the latest downstream ceph-ansible does not include the patch.

Version-Release number of selected component (if applicable):
ceph-ansible-6.0.28.6-1.el8cp.noarch

How reproducible:
Always

Steps to Reproduce:
1. Execute a ceph upgrade of an OSP multi-cluster ceph DCN environment.
2. Execute the cephadm adopt stage.

Actual results:
It fails on the task "add ceph label for core component".

Expected results:
The ceph upgrade passes.
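For illustration only (not part of the original report): a minimal Python sketch, using the standard ipaddress module as a rough stand-in for the `ansible.utils.ipv4` test, showing why the comma-joined value fails the condition while a single network (or the first element after split) would pass. The helper name below is hypothetical.

import ipaddress

# Value rendered into config-download on a multi-cluster DCN deployment
cephadm_mgmt_network = "192.168.24.0/24,192.168.34.0/24,192.168.44.0/24"

def looks_like_ipv4_network(value):
    # Rough stand-in (assumption) for the 'is ansible.utils.ipv4' test,
    # which accepts IPv4 addresses and networks.
    try:
        ipaddress.IPv4Network(value, strict=False)
        return True
    except ValueError:
        return False

# The whole comma-joined string is not a valid IPv4 network -> task skipped
print(looks_like_ipv4_network(cephadm_mgmt_network))                # False

# The first element after split, or a single network, passes the test
print(looks_like_ipv4_network(cephadm_mgmt_network.split(",")[0]))  # True
print(looks_like_ipv4_network("192.168.44.0/24"))                   # True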
(In reply to Marian Krcmarik from comment #0) [order changed]

> Version-Release number of selected component (if applicable):
> ceph-ansible-6.0.28.6-1.el8cp.noarch

So this is what was tested:
https://github.com/ceph/ceph-ansible/blob/v6.0.28.6/infrastructure-playbooks/cephadm-adopt.yml#L375

> NOTE: The upstream ceph-ansible includes following patch:
> https://github.com/ceph/ceph-ansible/commit/db2f3e42dc23d72cfae85ec942b6b2f43e482e81

The next release, v6.0.28.7, already has the above patch:
https://github.com/ceph/ceph-ansible/blob/v6.0.28.7/infrastructure-playbooks/cephadm-adopt.yml#L375

When it is used, all three clusters get the correct hosts added relative to their networks, as per Manoj's testing:

192.168.24.0/24 is for the central site
192.168.34.0/24 is for dcn1
192.168.44.0/24 is for dcn2

> Which means I assume that the task won't be skipped but the first network
> will be picked. I guess that's not the right solution because specific
> network for specific ceph cluster should be picked?

The solution from that patch in v6.0.28.7 is correct because it adds only the right host during adoption. Manoj tested it, and I can also explain why it works. Here is the task for the IPv4 case from v6.0.28.7:

1. name: manage nodes with cephadm - ipv4
2. command: "{{ ceph_cmd }} orch host add {{ ansible_facts['nodename'] }} {{ ansible_facts['all_ipv4_addresses'] | ips_in_ranges(cephadm_mgmt_network.split(',')) | first }} {{ group_names | intersect(adopt_label_group_names) | join(' ') }}"
3. changed_when: false
4. delegate_to: '{{ groups[mon_group_name][0] }}'
5. when: cephadm_mgmt_network.split(',')[0] is ansible.utils.ipv4

The purpose of line 5 is to ensure that we pull the correct set of IPs (v4 vs v6) from ansible_facts on line 2. We do not support dual stack with Ceph in OSP17, so there won't be a mix of IPv6 and IPv4 addresses in the cephadm_mgmt_network list. Thus, as long as we know the first entry in the list is v4, we are safe to use `ansible_facts['all_ipv4_addresses']` on line 2. The same logic applies to the task "manage nodes with cephadm - ipv6", which uses all_ipv6_addresses.

If we were to update line 5 to verify that every entry in the list is valid IPv4, it would add no benefit, since the goal of line 5 is only to decide whether we should be pulling v4 or v6 facts. In theory it could even have the unwanted side effect of preventing all clusters from being adopted if another cluster's range in the list were not valid.

Line 2 results in the following command being run:

ceph orch host add $NAME $IP $LABEL

where $IP comes from this expression:

{{ ansible_facts['all_ipv4_addresses'] | ips_in_ranges(cephadm_mgmt_network.split(',')) | first }}

The above produces the correct IP relative to each cluster, and `| first` is only used as a precaution in case more than one is returned. Because of line 4, the task is delegated to the first mon node of the Ceph cluster being adopted (central, dcn1, dcn2) each time the playbook is run. So the expression:

ansible_facts['all_ipv4_addresses']

returns all IPs on the first monitor of the cluster being adopted (e.g. the IP from the internal API net, the IP from the ctlplane net, the IP from the storage net, etc.). Only the IP within the ranges of the cephadm_mgmt_network list will be selected. Thus, if dcn1 is being adopted and its mon has IP 192.168.34.42, then only 192.168.34.42 will come out of the expression, since that IP is in the ranges 192.168.24.0/24,192.168.34.0/24,192.168.44.0/24.
Thus, the correct IP will be added relative to the mon node to which the task is delegated, per playbook run, per adoption (a minimal emulation of this selection is sketched below).

My recommendation is that this bug can be set to POST, since v6.0.28.7 already contains the fix and we just need it to be shipped. I.e., downstream the current latest version is 6.0.28.6, but we need our customers to be able to get v6.0.28.7.
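For illustration only (not part of the original comments): a minimal Python emulation of the ips_in_ranges selection described above, using the standard ipaddress module. The monitor addresses below are hypothetical examples, not values taken from the environment.

import ipaddress

# Hypothetical facts for the first mon of dcn1 (illustrative addresses only)
all_ipv4_addresses = ["172.17.0.10",     # internal API net (made up)
                      "10.1.1.10",       # ctlplane net (made up)
                      "192.168.34.42"]   # address within the cephadm management ranges
cephadm_mgmt_network = "192.168.24.0/24,192.168.34.0/24,192.168.44.0/24"

# Emulate: all_ipv4_addresses | ips_in_ranges(cephadm_mgmt_network.split(',')) | first
ranges = [ipaddress.ip_network(r) for r in cephadm_mgmt_network.split(",")]
in_ranges = [ip for ip in all_ipv4_addresses
             if any(ipaddress.ip_address(ip) in net for net in ranges)]
print(in_ranges[0])   # 192.168.34.42 -> the $IP passed to 'ceph orch host add'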
As the working fix is available upstream in ceph-ansible v6.0.28.7 and we are waiting for that version downstream, moving the BZ to POST.
The fix for 2249693 is:
https://github.com/ceph/ceph-ansible/pull/7452/files
which is in:
https://github.com/ceph/ceph-ansible/tree/v6.0.28.7

The fix for 2231469 is the same, and I see it targeted at 5.3z6. Thus, I'm closing 2249693 as a duplicate of 2231469.

*** This bug has been marked as a duplicate of bug 2231469 ***