Bug 1853134

Summary: OSDs fail to come up after Ceph node upgrade to 16.1 (RHEL7-RHEL8)
Product: Red Hat OpenStack
Component: openstack-tripleo
Version: 16.1 (Train)
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Sadique Puthen <sputhenp>
Assignee: James Slagle <jslagle>
QA Contact: Arik Chernetsky <achernet>
CC: fpantano, mburns, ramishra
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Story Points: ---
Doc Type: If docs needed, set a value
Last Closed: 2020-07-02 08:10:57 UTC
Type: Bug

Description Sadique Puthen 2020-07-02 04:38:47 UTC
Description of problem:

The first Ceph node is leapp-upgraded from RHEL 7 to RHEL 8. The node upgrade itself completes successfully, but the upgrade process then fails to bring the OSDs running on that node back up.
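
For context, this is roughly the command used to trigger the node upgrade. A sketch only, assuming the documented 16.1 framework upgrade flow; the stack name is an assumption and the exact flags may differ per environment:

# openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit ceph-1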

TASK [tripleo-podman : Clean podman images] ************************************
Wednesday 01 July 2020  12:34:05 -0400 (0:00:00.147)       0:00:21.953 ******** 
changed: [ceph-1] => {"changed": true, "cmd": ["podman", "image", "prune", "-a"], "delta": "0:00:00.752407", "end": "2020-07-01 16:34:06.782893", "rc": 0, "start": "2020-07-01 16:34:06.030486", "stderr": "", "stderr_lines": [], "stdout": "10a6f75b69a1ec7797727c6f7969d7cd0061a59302a127c3690f31b456fbdcfd", "stdout_lines": ["10a6f75b69a1ec7797727c6f7969d7cd0061a59302a127c3690f31b456fbdcfd"]}

TASK [tripleo-podman : Clean podman volumes] ***********************************
Wednesday 01 July 2020  12:34:06 -0400 (0:00:01.052)       0:00:23.006 ******** 
changed: [ceph-1] => {"changed": true, "cmd": ["podman", "volume", "prune", "-f"], "delta": "0:00:00.075415", "end": "2020-07-01 16:34:07.161528", "rc": 0, "start": "2020-07-01 16:34:07.086113", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
ceph-1                     : ok=20   changed=4    unreachable=0    failed=0    skipped=17   rescued=0    ignored=0   

Wednesday 01 July 2020  12:34:07 -0400 (0:00:00.362)       0:00:23.368 ******** 
=============================================================================== 

Updated nodes - ceph-1
Success

# podman ps
CONTAINER ID  IMAGE                                                                               COMMAND      CREATED       STATUS           PORTS  NAMES
67d03f7bfa69  satellite.redhat.local:5000/sadique_openstack-osp16_1_beta_containers-cron:16.1-40  kolla_start  12 hours ago  Up 12 hours ago         logrotate_crond

No OSD.
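
A sketch of how the missing OSD containers and their systemd instance units can be checked on the node, assuming the ceph-ansible style containerized OSDs used here:

# podman ps -a --filter name=ceph-osd
# systemctl list-units 'ceph-osd@*' --all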

Ceph still remains degraded at 33%:

#  podman exec ceph-mon-controller-1 ceph -s
  cluster:
    id:     60a470b2-b08b-11ea-965d-525400e6befc
    health: HEALTH_WARN
            Degraded data redundancy: 407/1221 objects degraded (33.333%), 150 pgs degraded, 640 pgs undersized
 
  services:
    mon: 3 daemons, quorum controller-1,controller-2,controller-3
    mgr: controller-1(active), standbys: controller-3, controller-2
    osd: 6 osds: 4 up, 4 in
 
  data:
    pools:   5 pools, 640 pgs
    objects: 407 objects, 42.9MiB
    usage:   757MiB used, 799GiB / 800GiB avail
    pgs:     407/1221 objects degraded (33.333%)
             490 active+undersized
             150 active+undersized+degraded
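
The two down OSDs can be tied back to a host in the same way; a sketch, run from the mon container as above:

# podman exec ceph-mon-controller-1 ceph osd tree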

The systemd service that is expected to bring up the OSD is stuck in the activating state:

[root@ceph-1 ~]# systemctl status ceph-osd
● ceph-osd - Ceph OSD
   Loaded: loaded (/etc/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2020-07-02 04:19:37 UTC; 1s ago
  Process: 188632 ExecStart=/usr/share/ceph-osd-run.sh 4 (code=exited, status=1/FAILURE)
  Process: 188590 ExecStartPre=/usr/bin/podman rm -f ceph-osd-4 (code=exited, status=1/FAILURE)
  Process: 188550 ExecStartPre=/usr/bin/podman stop ceph-osd-4 (code=exited, status=125)
 Main PID: 188632 (code=exited, status=1/FAILURE)
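
To see why the ExecStart keeps failing, the instance unit's journal can be checked; a sketch, assuming the instance is ceph-osd@4 as the ExecStart line above suggests:

# journalctl -u ceph-osd@4.service --no-pager -n 50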

This command errors out:

# /usr/share/ceph-osd-run.sh 4
2020-07-02 04:37:46  /entrypoint.sh: OSD id 4 does not exist
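
The "OSD id 4 does not exist" message comes from the container entrypoint, which cannot find OSD 4 on this node. A diagnostic sketch, assuming a ceph-volume (LVM) based deployment; the image reference is a placeholder for whatever OSD image is configured on the node:

# podman run --rm --privileged --net=host \
    -v /dev:/dev -v /run/lvm:/run/lvm \
    -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph \
    --entrypoint ceph-volume <ceph-daemon-image> lvm list

If ceph-volume does not report an OSD with id 4, that would suggest the OSD's LVM volumes or metadata are not visible/activated after the leapp reboot.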

We should not report an upgrade as successful if we fail to bring up the OSDs on those nodes. The upgrade did fail on the initial attempt, though.
