Description of problem:
OC reboot failing while waiting for OSDs to come back

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Configure predictable IP spine-leaf topology
2. Reboot OC
3.

Actual results:
OSDs not coming back online, causing the reboot to fail

Expected results:
Reboot succeeds

Additional info:
577 pgs: 39 stale+undersized+degraded+peered, 91 stale+undersized+peered, 174 undersized+peered, 156 active+undersized, 59 undersized+degraded+peered, 58 active+undersized+degraded; 208 MiB data, 483 MiB used, 703 GiB / 704 GiB avail; 585/1068 objects degraded (54.775%)

Inferring fsid 0f3fd140-ffd0-5584-a3b9-7055c087b761
2022-07-06 17:15:13.836 | Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:0eb00dcba8ab47ff957166e1a2018be0905c11ed30b7c71247893423353986ee
2022-07-06 17:15:13.838 | time="2022-07-06T17:15:13Z" level=warning msg=" binary not found, container dns will not be enabled"
I reproduced the problem. The testing playbook expects the OSDs to be active+clean after retries: 24 with delay: 15. Those values work for non-17 deployments, but they timed out on the 17 deployment:

https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml#L340-L350

2 of the 6 storage nodes have their OSDs up; the rest are down. The next question I seek to answer is: why didn't the OSDs come up on the other 4 nodes? I'll investigate and update this bug.

[ceph: root@controller-0 /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                           STATUS  REWEIGHT  PRI-AFF
 -1         0.93567  root default
 -7         0.15594      host overcloud-cephstorage1-0
  3    hdd  0.03119          osd.3                       down          0  1.00000
  6    hdd  0.03119          osd.6                       down          0  1.00000
  8    hdd  0.03119          osd.8                       down          0  1.00000
 17    hdd  0.03119          osd.17                      down    1.00000  1.00000
 21    hdd  0.03119          osd.21                      down    1.00000  1.00000
 -5         0.15594      host overcloud-cephstorage1-1
  2    hdd  0.03119          osd.2                       down    1.00000  1.00000
  7    hdd  0.03119          osd.7                       down    1.00000  1.00000
  9    hdd  0.03119          osd.9                       down    1.00000  1.00000
 12    hdd  0.03119          osd.12                      down    1.00000  1.00000
 13    hdd  0.03119          osd.13                      down    1.00000  1.00000
 -9         0.15594      host overcloud-cephstorage2-0
  1    hdd  0.03119          osd.1                       up      1.00000  1.00000
 15    hdd  0.03119          osd.15                      up      1.00000  1.00000
 18    hdd  0.03119          osd.18                      up      1.00000  1.00000
 22    hdd  0.03119          osd.22                      up      1.00000  1.00000
 25    hdd  0.03119          osd.25                      up      1.00000  1.00000
 -3         0.15594      host overcloud-cephstorage2-1
  0    hdd  0.03119          osd.0                       up      1.00000  1.00000
  4    hdd  0.03119          osd.4                       up      1.00000  1.00000
  5    hdd  0.03119          osd.5                       up      1.00000  1.00000
 10    hdd  0.03119          osd.10                      up      1.00000  1.00000
 11    hdd  0.03119          osd.11                      up      1.00000  1.00000
-11         0.15594      host overcloud-cephstorage3-0
 14    hdd  0.03119          osd.14                      down    1.00000  1.00000
 19    hdd  0.03119          osd.19                      down    1.00000  1.00000
 23    hdd  0.03119          osd.23                      down    1.00000  1.00000
 26    hdd  0.03119          osd.26                      down    1.00000  1.00000
 28    hdd  0.03119          osd.28                      down    1.00000  1.00000
-13         0.15594      host overcloud-cephstorage3-1
 16    hdd  0.03119          osd.16                      down          0  1.00000
 20    hdd  0.03119          osd.20                      down          0  1.00000
 24    hdd  0.03119          osd.24                      down          0  1.00000
 27    hdd  0.03119          osd.27                      down          0  1.00000
 29    hdd  0.03119          osd.29                      down          0  1.00000
[ceph: root@controller-0 /]#
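For reference, the wait the playbook performs (linked above) boils down to a bounded retry loop. Below is a minimal, self-contained sketch of that logic, not the playbook's actual code: `check_pgs` is a hypothetical stand-in for parsing `ceph -s` for active+clean, rigged to succeed on the third call so the sketch runs without a cluster, and the delay is set to 0 instead of the playbook's 15 seconds.

```shell
# Sketch of the "wait for all PGs active+clean" retry loop (retries: 24, delay: 15).
# check_pgs stands in for something like: ceph -s | grep -q 'active+clean'
# It is rigged to succeed on the third call so this runs without a Ceph cluster.
attempt=0
check_pgs() {
  attempt=$((attempt + 1))
  [ "$attempt" -ge 3 ]
}

retries=24
delay=0   # the playbook uses 15 (seconds)
i=0
result=timeout
while [ "$i" -lt "$retries" ]; do
  if check_pgs; then
    result=clean
    break
  fi
  i=$((i + 1))
  sleep "$delay"
done
echo "result=$result after $attempt checks"
```

If the cluster never converges within retries x delay (here 24 x 15s = 6 minutes), the loop falls through with `result=timeout`, which is what the playbook reported for the 17 deployment.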
The OSDs which did not come up failed with an error like this:

Jul 07 16:47:54 overcloud-cephstorage3-0 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-14[10332]: debug 2022-07-07T16:47:54.886+0000 7fd70725a200  0 starting osd.14 osd_data /var/lib/ceph/osd/ceph-14 /var/lib/ceph/osd/ceph-14/journal
Jul 07 16:47:54 overcloud-cephstorage3-0 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-14[10332]: debug 2022-07-07T16:47:54.886+0000 7fd70725a200 -1 unable to find any IPv4 address in networks '172.120.3.0/24' interfaces ''
Jul 07 16:47:54 overcloud-cephstorage3-0 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-14[10332]: debug 2022-07-07T16:47:54.886+0000 7fd70725a200 -1 Failed to pick public address.

The same error appears for all of the failed OSDs, per the following command:

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml overcloud-cephstorage3-1,overcloud-cephstorage3-0,overcloud-cephstorage1-1,overcloud-cephstorage1-0 -b -m shell -a "for OSD in \$(cephadm ls | jq ".[].systemd_unit" | grep osd | sed s/\\\"//g); do echo \$OSD; journalctl -u \$OSD | grep 'Failed to pick public address' ; done" | curl -F 'f:1=<-' ix.io
http://ix.io/41S5
(undercloud) [stack@undercloud-0 overcloud]$ pwd
/home/stack/overcloud-deploy/overcloud

The output looks like this:

overcloud-cephstorage1-1 | CHANGED | rc=0 >>
ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608
Jul 07 16:44:07 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-12[3344]: debug 2022-07-07T16:44:07.413+0000 7f958860d200 -1 Failed to pick public address.
Jul 07 16:44:26 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-12[5140]: debug 2022-07-07T16:44:26.369+0000 7fd09ce9b200 -1 Failed to pick public address.
Jul 07 16:45:20 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-12[10633]: debug 2022-07-07T16:45:20.261+0000 7f03726d4200 -1 Failed to pick public address.
ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608
Jul 07 16:44:07 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-13[3394]: debug 2022-07-07T16:44:07.476+0000 7f8c1a8f2200 -1 Failed to pick public address.
Jul 07 16:44:26 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-13[5208]: debug 2022-07-07T16:44:26.651+0000 7f6c222fb200 -1 Failed to pick public address.
...
Created attachment 1895283 [details]
Output of journalctl -u ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608
The OSDs didn't come up because they were configured to use a network that is not present on the hosts. The hosts have the following 172.x addresses, and none of them are in 172.120.3.0/24:

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml CephStorage3 -b -m shell -a "ip a | grep 172"
overcloud-cephstorage3-1 | CHANGED | rc=0 >>
    inet 172.119.3.223/24 brd 172.119.3.255 scope global vlan32
    inet 172.119.4.223/24 brd 172.119.4.255 scope global vlan42
overcloud-cephstorage3-0 | CHANGED | rc=0 >>
    inet 172.119.3.222/24 brd 172.119.3.255 scope global vlan32
    inet 172.119.4.222/24 brd 172.119.4.255 scope global vlan42
(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml CephStorage2 -b -m shell -a "ip a | grep 172"
overcloud-cephstorage2-0 | CHANGED | rc=0 >>
    inet 172.118.3.222/24 brd 172.118.3.255 scope global vlan31
    inet 172.118.4.222/24 brd 172.118.4.255 scope global vlan41
overcloud-cephstorage2-1 | CHANGED | rc=0 >>
    inet 172.118.3.223/24 brd 172.118.3.255 scope global vlan31
    inet 172.118.4.223/24 brd 172.118.4.255 scope global vlan41
(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml CephStorage1 -b -m shell -a "ip a | grep 172"
overcloud-cephstorage1-1 | CHANGED | rc=0 >>
    inet 172.117.3.223/24 brd 172.117.3.255 scope global vlan30
    inet 172.117.4.223/24 brd 172.117.4.255 scope global vlan40
overcloud-cephstorage1-0 | CHANGED | rc=0 >>
    inet 172.117.3.222/24 brd 172.117.3.255 scope global vlan30
    inet 172.117.4.222/24 brd 172.117.4.255 scope global vlan40
(undercloud) [stack@undercloud-0 overcloud]$
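The failure mode is easy to check by hand: an OSD only binds if some host address falls inside the configured public_network. Below is a sketch of that membership test; the `in_24` helper is mine (not anything Ceph ships) and handles /24 prefixes only, by comparing the first three octets, which is enough for the networks in this bug.

```shell
# in_24 ADDR NET/24 -> exit 0 if ADDR is inside the /24 network.
# Hypothetical helper for illustration; valid for /24 prefixes only.
in_24() {
  net=${2%/*}                   # strip the /24 suffix
  [ "${1%.*}" = "${net%.*}" ]   # compare the first three octets
}

# cephstorage3-0's address is not in the network Ceph was told to use...
in_24 172.119.3.222 172.120.3.0/24 && echo in || echo "not in"
# ...but it is in one of the networks from the initial ceph.conf:
in_24 172.119.3.222 172.119.3.0/24 && echo in || echo "not in"
```

That mismatch is exactly what the OSD log says: "unable to find any IPv4 address in networks '172.120.3.0/24' interfaces ''".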
This is the initial ceph.conf passed during deployment:

[global]
public_network = '172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24'
cluster_network = '172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24'
ms_bind_ipv4 = true
ms_bind_ipv6 = false

It was passed like this:

openstack overcloud ceph deploy \
    -o /home/stack/templates/overcloud-ceph-deployed.yaml \
    --container-image-prepare "/home/stack/containers-prepare-parameter.yaml" \
    --config /home/stack/initial-ceph.conf \
    --stack "overcloud" \
    --cluster ceph \
    --network-data "/home/stack/virt/network/network_data_v2.yaml" \
    --roles-data /home/stack/virt/roles/roles_data.yaml \
    /home/stack/templates/overcloud-baremetal-deployed.yaml

Yet Ceph is (mis)configured to use the following (per `ceph config dump`):

cluster_network  172.120.4.0/24
public_network   172.120.3.0/24

If that had been the configuration from the start, the OSDs would never have booted in the first place, but we had a cluster running fine until it was rebooted. During the initial deployment with `openstack overcloud ceph deploy` everything was fine (thus, the OSDs booted). However, during the overcloud deployment (during config-download) the Ceph network settings were changed!
The following lines were in /home/stack/config-download/overcloud/cephadm/cephadm_command.log:

2022-07-07 16:00:45,059 p=152615 u=stack n=ansible | 2022-07-07 16:00:45.058633 | 52540043-0b57-c97a-1873-00000000026f | TASK | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf
2022-07-07 16:00:46,126 p=152615 u=stack n=ansible | 2022-07-07 16:00:46.125533 | 52540043-0b57-c97a-1873-00000000026f | OK | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'public_network', 'value': '172.120.3.0/24'}
2022-07-07 16:00:47,079 p=152615 u=stack n=ansible | 2022-07-07 16:00:47.078370 | 52540043-0b57-c97a-1873-00000000026f | OK | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'cluster_network', 'value': '172.120.4.0/24'}
2022-07-07 16:00:47,088 p=152615 u=stack n=ansible | 2022-07-07 16:00:47.087777 | 52540043-0b57-c97a-1873-00000000026f | SKIPPED | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'ms_bind_ipv4', 'value': ''}
2022-07-07 16:00:47,091 p=152615 u=stack n=ansible | 2022-07-07 16:00:47.090674 | 52540043-0b57-c97a-1873-00000000026f | SKIPPED | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'ms_bind_ipv6', 'value': ''}

This patch (mine :/ ) introduced the Ansible task that created the misconfiguration:

https://review.opendev.org/c/openstack/tripleo-ansible/+/843265

It shouldn't have run during config-download, and it shouldn't have configured the IPs above. I'm assigning this bug to myself so I can fix it. Thank you for finding it.
I was able to bring the cluster back up manually:

[ceph: root@controller-0 /]# ceph config dump | grep network
global  advanced  cluster_network  172.120.4.0/24  *
global  advanced  public_network   172.120.3.0/24  *
mon     advanced  public_network   172.120.3.0/24  *
[ceph: root@controller-0 /]# ceph config set global public_network '172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24'
[ceph: root@controller-0 /]# ceph config set global cluster_network '172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24'
[ceph: root@controller-0 /]# ceph config set mon public_network '172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24'
[ceph: root@controller-0 /]# ceph config set mon cluster_network '172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24'
[ceph: root@controller-0 /]# ceph config dump | grep network
global  advanced  cluster_network  172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24  *
global  advanced  public_network   172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24  *
mon     advanced  cluster_network  172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24  *
mon     advanced  public_network   172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24  *
[ceph: root@controller-0 /]#

[root@overcloud-cephstorage3-0 ~]# systemctl stop ceph\*.service ceph\*.target
[root@overcloud-cephstorage3-0 ~]# systemctl start ceph\*.service ceph\*.target --all
[root@overcloud-cephstorage3-0 ~]#

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml overcloud-cephstorage3-1,overcloud-cephstorage1-1,overcloud-cephstorage1-0 -b -m shell -a "systemctl stop ceph\*.service ceph\*.target"
overcloud-cephstorage1-0 | CHANGED | rc=0 >>
overcloud-cephstorage3-1 | CHANGED | rc=0 >>
overcloud-cephstorage1-1 | CHANGED | rc=0 >>
(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml overcloud-cephstorage3-1,overcloud-cephstorage1-1,overcloud-cephstorage1-0 -b -m shell -a "systemctl start ceph\*.service ceph\*.target --all"
overcloud-cephstorage1-0 | CHANGED | rc=0 >>
overcloud-cephstorage3-1 | CHANGED | rc=0 >>
overcloud-cephstorage1-1 | CHANGED | rc=0 >>
(undercloud) [stack@undercloud-0 overcloud]$

[ceph: root@controller-0 /]# ceph -s
  cluster:
    id:     0b840fca-dc9e-50d3-b800-bcf2ccd7e608
    health: HEALTH_WARN
            15 failed cephadm daemon(s)

  services:
    mon: 3 daemons, quorum controller-0,controller-1,controller-2 (age 4h)
    mgr: controller-0.diedbm(active, since 4h), standbys: controller-1.ftfmeo, controller-2.whrmcb
    osd: 30 osds: 30 up (since 17s), 30 in (since 18s)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   10 pools, 577 pgs
    objects: 356 objects, 208 MiB
    usage:   664 MiB used, 959 GiB / 960 GiB avail
    pgs:     577 active+clean

  io:
    client: 35 KiB/s rd, 0 B/s wr, 35 op/s rd, 23 op/s wr

[ceph: root@controller-0 /]#
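For future triage, the down/up split in a `ceph osd tree` like the one earlier in this bug can be counted mechanically rather than by eye. A small sketch follows; a few sample rows from this bug's output are embedded so it runs standalone, whereas in real use you would pipe the live command output in instead.

```shell
# Count down vs. up OSDs from `ceph osd tree` output. Sample rows from this
# bug are embedded so the sketch is self-contained; only OSD rows have "hdd"
# in the CLASS column, so filtering on $2 skips root/host rows.
osd_tree='  3    hdd  0.03119  osd.3   down        0  1.00000
  1    hdd  0.03119  osd.1   up    1.00000  1.00000
 14    hdd  0.03119  osd.14  down  1.00000  1.00000
  0    hdd  0.03119  osd.0   up    1.00000  1.00000'

down=$(printf '%s\n' "$osd_tree" | awk '$2 == "hdd" && $5 == "down"' | wc -l)
up=$(printf '%s\n' "$osd_tree" | awk '$2 == "hdd" && $5 == "up"' | wc -l)
echo "down=$down up=$up"
```

Against the live cluster this would be `ceph osd tree | awk '$2 == "hdd" && $5 == "down"'`, which also lists exactly which OSDs to inspect with journalctl.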
*** Bug 2107114 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543