Bug 1853134 - OSDs fail to come up after ceph node upgrade to 16.1 (RHEL7-RHEL8)
Summary: OSDs fail to come up after ceph node upgrade to 16.1 (RHEL7-RHEL8)
Keywords:
Status: CLOSED DUPLICATE of bug 1844591
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-02 04:38 UTC by Sadique Puthen
Modified: 2020-07-02 08:10 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-02 08:10:57 UTC
Target Upstream Version:
Embargoed:



Description Sadique Puthen 2020-07-02 04:38:47 UTC
Description of problem:

The first Ceph node is Leapp-upgraded from RHEL 7 to RHEL 8. The upgrade itself completes successfully, but the upgrade process fails to bring up the OSDs running on that node.

TASK [tripleo-podman : Clean podman images] ************************************
Wednesday 01 July 2020  12:34:05 -0400 (0:00:00.147)       0:00:21.953 ******** 
changed: [ceph-1] => {"changed": true, "cmd": ["podman", "image", "prune", "-a"], "delta": "0:00:00.752407", "end": "2020-07-01 16:34:06.782893", "rc": 0, "start": "2020-07-01 16:34:06.030486", "stderr": "", "stderr_lines": [], "stdout": "10a6f75b69a1ec7797727c6f7969d7cd0061a59302a127c3690f31b456fbdcfd", "stdout_lines": ["10a6f75b69a1ec7797727c6f7969d7cd0061a59302a127c3690f31b456fbdcfd"]}

TASK [tripleo-podman : Clean podman volumes] ***********************************
Wednesday 01 July 2020  12:34:06 -0400 (0:00:01.052)       0:00:23.006 ******** 
changed: [ceph-1] => {"changed": true, "cmd": ["podman", "volume", "prune", "-f"], "delta": "0:00:00.075415", "end": "2020-07-01 16:34:07.161528", "rc": 0, "start": "2020-07-01 16:34:07.086113", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
ceph-1                     : ok=20   changed=4    unreachable=0    failed=0    skipped=17   rescued=0    ignored=0   

Wednesday 01 July 2020  12:34:07 -0400 (0:00:00.362)       0:00:23.368 ******** 
=============================================================================== 

Updated nodes - ceph-1
Success

# podman ps
CONTAINER ID  IMAGE                                                                               COMMAND      CREATED       STATUS           PORTS  NAMES
67d03f7bfa69  satellite.redhat.local:5000/sadique_openstack-osp16_1_beta_containers-cron:16.1-40  kolla_start  12 hours ago  Up 12 hours ago         logrotate_crond

No OSD containers are running.
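
For what it's worth, a quick way to check whether the OSD containers exist but are simply stopped (the ceph-osd-<id> name is assumed from the ceph-osd-4 container referenced by the systemd unit below):

# podman ps -a --filter name=ceph-osd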

Ceph still remains degraded at 33%:

#  podman exec ceph-mon-controller-1 ceph -s
  cluster:
    id:     60a470b2-b08b-11ea-965d-525400e6befc
    health: HEALTH_WARN
            Degraded data redundancy: 407/1221 objects degraded (33.333%), 150 pgs degraded, 640 pgs undersized
 
  services:
    mon: 3 daemons, quorum controller-1,controller-2,controller-3
    mgr: controller-1(active), standbys: controller-3, controller-2
    osd: 6 osds: 4 up, 4 in
 
  data:
    pools:   5 pools, 640 pgs
    objects: 407 objects, 42.9MiB
    usage:   757MiB used, 799GiB / 800GiB avail
    pgs:     407/1221 objects degraded (33.333%)
             490 active+undersized
             150 active+undersized+degraded
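
The status above shows 6 OSDs with only 4 up/in. To confirm which OSDs are down (reusing the mon container from the command above), something like this should do:

# podman exec ceph-mon-controller-1 ceph osd tree | grep -w down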

The systemd service expected to bring up the OSD is stuck in the activating state:

[root@ceph-1 ~]# systemctl status ceph-osd
● ceph-osd - Ceph OSD
   Loaded: loaded (/etc/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2020-07-02 04:19:37 UTC; 1s ago
  Process: 188632 ExecStart=/usr/share/ceph-osd-run.sh 4 (code=exited, status=1/FAILURE)
  Process: 188590 ExecStartPre=/usr/bin/podman rm -f ceph-osd-4 (code=exited, status=1/FAILURE)
  Process: 188550 ExecStartPre=/usr/bin/podman stop ceph-osd-4 (code=exited, status=125)
 Main PID: 188632 (code=exited, status=1/FAILURE)

Running the ExecStart script manually errors out:

# /usr/share/ceph-osd-run.sh 4
2020-07-02 04:37:46  /entrypoint.sh: OSD id 4 does not exist
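
My reading (not confirmed) is that the container entrypoint cannot find OSD 4's volumes after the OS upgrade. Two generic things worth checking on the host are whether the OSD's LVM volumes and their ceph tags are still visible, and what the unit's journal says (the ceph-osd@4 instance name is assumed to match the OSD id):

# lvs -o lv_name,vg_name,lv_tags | grep ceph
# journalctl -u ceph-osd@4 --no-pager -n 50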

We should not report an upgrade as successful if we fail to bring up the OSDs on those nodes. The upgrade did fail on the initial attempt, though.
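
As a rough sketch of what I mean, the post-upgrade step could gate the "Success" message on all OSDs being back up, for example (mon container name assumed from above; a real check would belong in the upgrade playbooks):

if podman exec ceph-mon-controller-1 ceph osd tree | grep -qw down; then
    echo "Some OSDs are still down after the upgrade, do not report success"
    exit 1
fi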

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

