Bug 2104647 - OC Reboot failing while waiting for OSDs to come back
Summary: OC Reboot failing while waiting for OSDs to come back
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.0
Assignee: John Fulton
QA Contact: Jason Paroly
URL:
Whiteboard:
Duplicates: 2107114 (view as bug list)
Depends On:
Blocks:
 
Reported: 2022-07-06 19:15 UTC by Jason Paroly
Modified: 2022-09-21 12:24 UTC (History)
CC List: 7 users

Fixed In Version: tripleo-ansible-3.3.1-0.20220708201820.fa5422f.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:23:49 UTC
Target Upstream Version:
Embargoed:


Attachments
Output of journalctl -u ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608@osd.14 (625.27 KB, text/plain)
2022-07-07 18:37 UTC, John Fulton


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1980992 0 None None None 2022-07-07 21:35:58 UTC
OpenStack gerrit 849048 0 None MERGED Only run Ceph network_config_set during initial deployment 2022-07-11 06:14:26 UTC
OpenStack gerrit 849083 0 None MERGED Only run Ceph network_config_set during initial deployment 2022-07-11 06:14:27 UTC
Red Hat Issue Tracker OSP-16281 0 None None None 2022-07-06 19:26:24 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:24:21 UTC

Description Jason Paroly 2022-07-06 19:15:46 UTC
Description of problem: Overcloud (OC) reboot fails while waiting for the OSDs to come back online.


Version-Release number of selected component (if applicable):


How reproducible:
every time

Steps to Reproduce:
1. Deploy an overcloud with a predictable-IP spine-leaf network topology
2. Reboot the overcloud

Actual results:
The OSDs do not come back online, causing the overcloud reboot to fail.

Expected results:
The reboot succeeds and all OSDs come back up.

Additional info:

577 pgs: 39 stale+undersized+degraded+peered, 91 stale+undersized+peered, 174 undersized+peered, 156 active+undersized, 59 undersized+degraded+peered, 58 active+undersized+degraded; 208 MiB data, 483 MiB used, 703 GiB / 704 GiB avail; 585/1068 objects degraded (54.775%)

Inferring fsid 0f3fd140-ffd0-5584-a3b9-7055c087b761
2022-07-06 17:15:13.836 | Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:0eb00dcba8ab47ff957166e1a2018be0905c11ed30b7c71247893423353986ee
2022-07-06 17:15:13.838 | time="2022-07-06T17:15:13Z" level=warning msg=" binary not found, container dns will not be enabled"

Comment 6 John Fulton 2022-07-07 17:20:39 UTC
I was able to reproduce the problem. The testing playbook expects the cluster to return to active+clean within retries: 24 and delay: 15. Those numbers work for non-17.0 deployments, but the check timed out on this 17.0 deployment.

 https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml#L340-L350
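
For reference, here is a minimal shell sketch (mine, not the infrared playbook itself) of the kind of wait loop that check performs, using the same retries: 24 / delay: 15 budget. It assumes it runs as root on a node where `cephadm shell` works (e.g. a controller) and that the `ceph -s` JSON exposes pgmap.num_pgs and pgmap.pgs_by_state; both hold on recent Ceph releases but are worth verifying:

#!/bin/bash
# Poll until every PG is active+clean, bounded by retries: 24 / delay: 15.
RETRIES=24
DELAY=15
for attempt in $(seq 1 "$RETRIES"); do
  status=$(cephadm shell -- ceph -s -f json 2>/dev/null)
  total=$(echo "$status" | jq '.pgmap.num_pgs // 0')
  clean=$(echo "$status" | jq '[.pgmap.pgs_by_state[] | select(.state_name == "active+clean") | .count] | add // 0')
  if [ -n "$total" ] && [ "$total" -gt 0 ] && [ "$total" -eq "$clean" ]; then
    echo "All ${total} PGs are active+clean"
    exit 0
  fi
  echo "Attempt ${attempt}/${RETRIES}: ${clean}/${total} PGs active+clean; retrying in ${DELAY}s"
  sleep "$DELAY"
done
echo "Timed out waiting for PGs to become active+clean" >&2
exit 1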

Two of the storage servers have their OSDs up; the rest are down. The next question to answer is why the OSDs did not come back up on the remaining nodes. I'll investigate and update this bug.

[ceph: root@controller-0 /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                          STATUS  REWEIGHT  PRI-AFF
 -1         0.93567  root default                                                
 -7         0.15594      host overcloud-cephstorage1-0                           
  3    hdd  0.03119          osd.3                        down         0  1.00000
  6    hdd  0.03119          osd.6                        down         0  1.00000
  8    hdd  0.03119          osd.8                        down         0  1.00000
 17    hdd  0.03119          osd.17                       down   1.00000  1.00000
 21    hdd  0.03119          osd.21                       down   1.00000  1.00000
 -5         0.15594      host overcloud-cephstorage1-1                           
  2    hdd  0.03119          osd.2                        down   1.00000  1.00000
  7    hdd  0.03119          osd.7                        down   1.00000  1.00000
  9    hdd  0.03119          osd.9                        down   1.00000  1.00000
 12    hdd  0.03119          osd.12                       down   1.00000  1.00000
 13    hdd  0.03119          osd.13                       down   1.00000  1.00000
 -9         0.15594      host overcloud-cephstorage2-0                           
  1    hdd  0.03119          osd.1                          up   1.00000  1.00000
 15    hdd  0.03119          osd.15                         up   1.00000  1.00000
 18    hdd  0.03119          osd.18                         up   1.00000  1.00000
 22    hdd  0.03119          osd.22                         up   1.00000  1.00000
 25    hdd  0.03119          osd.25                         up   1.00000  1.00000
 -3         0.15594      host overcloud-cephstorage2-1                           
  0    hdd  0.03119          osd.0                          up   1.00000  1.00000
  4    hdd  0.03119          osd.4                          up   1.00000  1.00000
  5    hdd  0.03119          osd.5                          up   1.00000  1.00000
 10    hdd  0.03119          osd.10                         up   1.00000  1.00000
 11    hdd  0.03119          osd.11                         up   1.00000  1.00000
-11         0.15594      host overcloud-cephstorage3-0                           
 14    hdd  0.03119          osd.14                       down   1.00000  1.00000
 19    hdd  0.03119          osd.19                       down   1.00000  1.00000
 23    hdd  0.03119          osd.23                       down   1.00000  1.00000
 26    hdd  0.03119          osd.26                       down   1.00000  1.00000
 28    hdd  0.03119          osd.28                       down   1.00000  1.00000
-13         0.15594      host overcloud-cephstorage3-1                           
 16    hdd  0.03119          osd.16                       down         0  1.00000
 20    hdd  0.03119          osd.20                       down         0  1.00000
 24    hdd  0.03119          osd.24                       down         0  1.00000
 27    hdd  0.03119          osd.27                       down         0  1.00000
 29    hdd  0.03119          osd.29                       down         0  1.00000
[ceph: root@controller-0 /]#

Comment 7 John Fulton 2022-07-07 18:32:23 UTC
The OSDs which did not come up failed with an error like this:

Jul 07 16:47:54 overcloud-cephstorage3-0 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-14[10332]: debug 2022-07-07T16:47:54.886+0000 7fd70725a200  0 starting osd.14 osd_data /var/lib/ceph/osd/ceph-14 /var/lib/ceph/osd/ceph-14/journal
Jul 07 16:47:54 overcloud-cephstorage3-0 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-14[10332]: debug 2022-07-07T16:47:54.886+0000 7fd70725a200 -1 unable to find any IPv4 address in networks '172.120.3.0/24' interfaces ''
Jul 07 16:47:54 overcloud-cephstorage3-0 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-14[10332]: debug 2022-07-07T16:47:54.886+0000 7fd70725a200 -1 Failed to pick public address.

You can see this for all of the failed OSDs with the following command:

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml overcloud-cephstorage3-1,overcloud-cephstorage3-0,overcloud-cephstorage1-1,overcloud-cephstorage1-0  -b -m shell -a "for OSD in \$(cephadm ls | jq ".[].systemd_unit" | grep osd | sed s/\\\"//g); do echo \$OSD; journalctl -u \$OSD | grep 'Failed to pick public address' ; done" | curl -F 'f:1=<-' ix.io 
http://ix.io/41S5
(undercloud) [stack@undercloud-0 overcloud]$ pwd
/home/stack/overcloud-deploy/overcloud

The output looks like this:

overcloud-cephstorage1-1 | CHANGED | rc=0 >>
ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608
Jul 07 16:44:07 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-12[3344]: debug 2022-07-07T16:44:07.413+0000 7f958860d200 -1 Failed to pick public address.
Jul 07 16:44:26 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-12[5140]: debug 2022-07-07T16:44:26.369+0000 7fd09ce9b200 -1 Failed to pick public address.
Jul 07 16:45:20 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-12[10633]: debug 2022-07-07T16:45:20.261+0000 7f03726d4200 -1 Failed to pick public address.
ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608
Jul 07 16:44:07 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-13[3394]: debug 2022-07-07T16:44:07.476+0000 7f8c1a8f2200 -1 Failed to pick public address.
Jul 07 16:44:26 overcloud-cephstorage1-1 ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608-osd-13[5208]: debug 2022-07-07T16:44:26.651+0000 7f6c222fb200 -1 Failed to pick public address.
...
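
For a simpler, per-host look at the same data (without the ix.io upload), something like the following can be run as root directly on one storage node; it relies only on the same `cephadm ls | jq '.[].systemd_unit'` listing used in the command above:

# List each OSD systemd unit on this node and any "Failed to pick public
# address" lines in its journal.
for unit in $(cephadm ls | jq -r '.[].systemd_unit' | grep osd); do
  echo "== ${unit} =="
  journalctl -u "${unit}" | grep 'Failed to pick public address' || true
done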

Comment 8 John Fulton 2022-07-07 18:37:56 UTC
Created attachment 1895283 [details]
Output of journalctl -u ceph-0b840fca-dc9e-50d3-b800-bcf2ccd7e608@osd.14

Comment 9 John Fulton 2022-07-07 18:50:22 UTC
The OSDs didn't come up because they were configured to use a network that is not on the host. 

The hosts have the 172.x addresses shown below, and none of them are in 172.120.3.0/24:

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml CephStorage3 -b -m shell -a "ip a | grep 172"
overcloud-cephstorage3-1 | CHANGED | rc=0 >>
    inet 172.119.3.223/24 brd 172.119.3.255 scope global vlan32
    inet 172.119.4.223/24 brd 172.119.4.255 scope global vlan42
overcloud-cephstorage3-0 | CHANGED | rc=0 >>
    inet 172.119.3.222/24 brd 172.119.3.255 scope global vlan32
    inet 172.119.4.222/24 brd 172.119.4.255 scope global vlan42
(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml CephStorage2 -b -m shell -a "ip a | grep 172"
overcloud-cephstorage2-0 | CHANGED | rc=0 >>
    inet 172.118.3.222/24 brd 172.118.3.255 scope global vlan31
    inet 172.118.4.222/24 brd 172.118.4.255 scope global vlan41
overcloud-cephstorage2-1 | CHANGED | rc=0 >>
    inet 172.118.3.223/24 brd 172.118.3.255 scope global vlan31
    inet 172.118.4.223/24 brd 172.118.4.255 scope global vlan41
(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml CephStorage1 -b -m shell -a "ip a | grep 172"
overcloud-cephstorage1-1 | CHANGED | rc=0 >>
    inet 172.117.3.223/24 brd 172.117.3.255 scope global vlan30
    inet 172.117.4.223/24 brd 172.117.4.255 scope global vlan40
overcloud-cephstorage1-0 | CHANGED | rc=0 >>
    inet 172.117.3.222/24 brd 172.117.3.255 scope global vlan30
    inet 172.117.4.222/24 brd 172.117.4.255 scope global vlan40
(undercloud) [stack@undercloud-0 overcloud]$
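
A quick way to confirm the mismatch from the undercloud is to filter each storage node's addresses by the network Ceph was configured with; `ip -4 -o addr show to <prefix>` only prints addresses inside that prefix, so empty output means the node has nothing to bind in it. A rough sketch, reusing this deployment's inventory and role group names:

PUBLIC_NET=172.120.3.0/24
ansible -i tripleo-ansible-inventory.yaml CephStorage1,CephStorage2,CephStorage3 \
  -b -m shell -a "ip -4 -o addr show to ${PUBLIC_NET} || true"
# Hosts that print nothing have no address in 172.120.3.0/24, which matches the
# "Failed to pick public address" errors seen earlier.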

Comment 10 John Fulton 2022-07-07 19:47:19 UTC
This is the initial ceph.conf passed during deployment:

 [global]
 public_network = '172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24'
 cluster_network = '172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24'
 ms_bind_ipv4 = true
 ms_bind_ipv6 = false

It was passed like this:

openstack overcloud ceph deploy \
  -o /home/stack/templates/overcloud-ceph-deployed.yaml \
  --container-image-prepare "/home/stack/containers-prepare-parameter.yaml" \
  --config /home/stack/initial-ceph.conf \
  --stack "overcloud" \
  --cluster ceph \
  --network-data "/home/stack/virt/network/network_data_v2.yaml" \
  --roles-data /home/stack/virt/roles/roles_data.yaml \
  /home/stack/templates/overcloud-baremetal-deployed.yaml

Yet, ceph is (mis)configured to use the following (per `ceph config dump`):

  cluster_network                        172.120.4.0/24
  public_network                         172.120.3.0/24

If the above had been in effect from the start, the OSDs would never have booted in the first place; yet the cluster was running fine until it was rebooted.

During the initial deployment with `openstack overcloud ceph deploy` everything was
fine (hence the OSDs booted). However, during the overcloud deployment (i.e. during
config-download) the Ceph network settings were changed. The following lines are
from /home/stack/config-download/overcloud/cephadm/cephadm_command.log:

2022-07-07 16:00:45,059 p=152615 u=stack n=ansible | 2022-07-07 16:00:45.058633 | 52540043-0b57-c97a-1873-00000000026f |       TASK | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf
2022-07-07 16:00:46,126 p=152615 u=stack n=ansible | 2022-07-07 16:00:46.125533 | 52540043-0b57-c97a-1873-00000000026f |         OK | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'public_network', 'value': '172.120.3.0/24'}
2022-07-07 16:00:47,079 p=152615 u=stack n=ansible | 2022-07-07 16:00:47.078370 | 52540043-0b57-c97a-1873-00000000026f |         OK | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'cluster_network', 'value': '172.120.4.0/24'}
2022-07-07 16:00:47,088 p=152615 u=stack n=ansible | 2022-07-07 16:00:47.087777 | 52540043-0b57-c97a-1873-00000000026f |    SKIPPED | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'ms_bind_ipv4', 'value': ''}
2022-07-07 16:00:47,091 p=152615 u=stack n=ansible | 2022-07-07 16:00:47.090674 | 52540043-0b57-c97a-1873-00000000026f |    SKIPPED | Set public/cluster network and v4/v6 ms_bind unless already in ceph.conf | controller-0 | item={'key': 'ms_bind_ipv6', 'value': ''}
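
One way to spot this kind of drift is to compare the networks requested in the operator's initial ceph.conf (on the undercloud) with what the running cluster reports; a rough sketch using this deployment's paths:

# On the undercloud: the networks that were requested at deploy time.
grep -E 'public_network|cluster_network' /home/stack/initial-ceph.conf
# On a controller (as root): the networks the cluster is actually using.
cephadm shell -- ceph config dump | grep -E 'public_network|cluster_network'
# If these differ, something rewrote the networks after the initial deployment.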

This patch (mine :/ ) introduced the Ansible task that created the misconfiguration:

  https://review.opendev.org/c/openstack/tripleo-ansible/+/843265

That task shouldn't have run during config-download, and it shouldn't have set the networks to the values above.

I'm assigning this bug to myself so I can fix it. Thank you for finding it.

Comment 13 John Fulton 2022-07-07 20:12:00 UTC
I was able to bring the cluster back up manually:

[ceph: root@controller-0 /]# ceph config dump | grep network
global  advanced  cluster_network  172.120.4.0/24  *
global  advanced  public_network   172.120.3.0/24  *
  mon   advanced  public_network   172.120.3.0/24  *
[ceph: root@controller-0 /]# ceph config set global public_network '172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24'
[ceph: root@controller-0 /]# ceph config set global cluster_network '172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24'
[ceph: root@controller-0 /]# ceph config set mon public_network '172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24'
[ceph: root@controller-0 /]# ceph config set mon cluster_network '172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24'

[ceph: root@controller-0 /]# ceph config dump | grep network
global                                        advanced  cluster_network                        172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24                                                                     * 
global                                        advanced  public_network                         172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24                                                                     * 
  mon                                         advanced  cluster_network                        172.120.4.0/24,172.117.4.0/24,172.118.4.0/24,172.119.4.0/24                                                                     * 
  mon                                         advanced  public_network                         172.120.3.0/24,172.117.3.0/24,172.118.3.0/24,172.119.3.0/24                                                                     * 
[ceph: root@controller-0 /]# 


[root@overcloud-cephstorage3-0 ~]# systemctl stop ceph\*.service ceph\*.target 
[root@overcloud-cephstorage3-0 ~]# systemctl start ceph\*.service ceph\*.target --all
[root@overcloud-cephstorage3-0 ~]#

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml overcloud-cephstorage3-1,overcloud-cephstorage1-1,overcloud-cephstorage1-0  -b -m shell -a "systemctl stop ceph\*.service ceph\*.target "
overcloud-cephstorage1-0 | CHANGED | rc=0 >>

overcloud-cephstorage3-1 | CHANGED | rc=0 >>

overcloud-cephstorage1-1 | CHANGED | rc=0 >>

(undercloud) [stack@undercloud-0 overcloud]$ ansible -i tripleo-ansible-inventory.yaml overcloud-cephstorage3-1,overcloud-cephstorage1-1,overcloud-cephstorage1-0  -b -m shell -a "systemctl start ceph\*.service ceph\*.target --all"
overcloud-cephstorage1-0 | CHANGED | rc=0 >>

overcloud-cephstorage3-1 | CHANGED | rc=0 >>

overcloud-cephstorage1-1 | CHANGED | rc=0 >>

(undercloud) [stack@undercloud-0 overcloud]$ 

[ceph: root@controller-0 /]# ceph -s
  cluster:
    id:     0b840fca-dc9e-50d3-b800-bcf2ccd7e608
    health: HEALTH_WARN
            15 failed cephadm daemon(s)
 
  services:
    mon: 3 daemons, quorum controller-0,controller-1,controller-2 (age 4h)
    mgr: controller-0.diedbm(active, since 4h), standbys: controller-1.ftfmeo, controller-2.whrmcb
    osd: 30 osds: 30 up (since 17s), 30 in (since 18s)
    rgw: 3 daemons active (3 hosts, 1 zones)
 
  data:
    pools:   10 pools, 577 pgs
    objects: 356 objects, 208 MiB
    usage:   664 MiB used, 959 GiB / 960 GiB avail
    pgs:     577 active+clean
 
  io:
    client:   35 KiB/s rd, 0 B/s wr, 35 op/s rd, 23 op/s wr
 
[ceph: root@controller-0 /]#

Comment 22 Khomesh Thakre 2022-07-21 13:41:10 UTC
*** Bug 2107114 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2022-09-21 12:23:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

