Bug 1891405

Summary: [FFU OSP13 TO 16.1] Ceph osds[version-3] wiped out post ceph upgrade
Product: Red Hat OpenStack
Reporter: Ravi Singh <ravsingh>
Component: ceph-ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED DUPLICATE
QA Contact: Yogev Rabl <yrabl>
Severity: medium
Docs Contact:
Priority: medium
Version: 16.1 (Train)
CC: fpantano, gfidente, ravsingh
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-03 13:51:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ravi Singh 2020-10-26 07:01:40 UTC
Description of problem:

I have been performing an FFU (OSP13 -> 16.1), and after the Ceph upgrade (step 17.3 in [1]) I can see that all Ceph OSDs that were on RHCS 3 were wiped out, and none of the container images are available on the Ceph nodes anymore.

I think this happened during the system upgrade, when the following task was executed to stop the docker containers:

# openstack overcloud upgrade run --tags system_upgrade --limit overcloud-cephstorage-0

~~~
TASK [Stop all services by stopping all docker containers] *********************
Tuesday 13 October 2020  05:18:00 -0400 (0:00:01.647)       0:00:16.825 ******* 
~~~
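For reference, a quick way to confirm from the journal that the OSD containers went through an externally triggered SIGTERM teardown (as in the excerpt in [2] below), rather than crashing on their own, is to grep for the teardown messages. This is a minimal sketch; the log path is an assumption, pointing at a saved copy of the node's journal, not a path produced by the upgrade itself:

```shell
# Path is an example; point it at a saved journal excerpt from the Ceph node
log=./ceph-osd-journal.log

# How many OSD containers went through the SIGTERM teardown path
grep -c 'teardown: managing teardown after SIGTERM' "$log"

# Which PIDs were sent SIGTERM during the shutdown
grep 'teardown: Sending SIGTERM to PID' "$log" | awk '{print $NF}' | sort -u
```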

On the Ceph nodes, too, I can see that the ceph-osd containers died at this time [2].

Is this expected behavior? I believe it should not be: the RHCS 3 OSDs should persist until we upgrade to RHCS 4.


I moved ahead and ran the converge step, then the upgrade to RHCS 4; only after that could I see those OSDs again, and then on RHCS 4.

Please note that I was able to start the upgrade on one Ceph node, which now has new OSDs, but the other two still don't have any [3], since the upgrade failed there for unrelated reasons.

Do you need any logs, or am I missing something?

[1]https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#upgrading-the-operating-system-for-ceph-storage-nodes-upgrading-overcloud-standard 


[2]
~~~
Oct 13 09:18:02 overcloud-cephstorage-0 journal: Failed at 154: wait $child_for_exec on parent 31222 with return code 143
Oct 13 09:18:02 overcloud-cephstorage-0 journal: teardown: managing teardown after SIGTERM
Oct 13 09:18:02 overcloud-cephstorage-0 journal: teardown: Sending SIGTERM to PID 31518
Oct 13 09:18:02 overcloud-cephstorage-0 ceph-osd-run.sh: Failed at 154: wait $child_for_exec on parent 31222 with return code 143
Oct 13 09:18:02 overcloud-cephstorage-0 ceph-osd-run.sh: teardown: managing teardown after SIGTERM
Oct 13 09:18:02 overcloud-cephstorage-0 ceph-osd-run.sh: teardown: Sending SIGTERM to PID 31518
Oct 13 09:18:02 overcloud-cephstorage-0 journal: 2020-10-13 09:18:02.963794 7f84a0dd1700 -1 Fail to read '/proc/134423/cmdline' error = (3) No such process
Oct 13 09:18:02 overcloud-cephstorage-0 ceph-osd-run.sh: teardown: Waiting PID 31518 to terminate .2020-10-13 09:18:02.963794 7f84a0dd1700 -1 Fail to read '/proc/134423/cmdline' error = (3) No such process
Oct 13 09:18:02 overcloud-cephstorage-0 ceph-osd-run.sh: 2020-10-13 09:18:02.963822 7f84a0dd1700 -1 received  signal: Terminated from  PID: 134423 task name: <unknown> UID: 0
Oct 13 09:18:02 overcloud-cephstorage-0 ceph-osd-run.sh: 2020-10-13 09:18:02.963833 7f84a0dd1700 -1 osd.4 59 *** Got signal Terminated ***
Oct 13 09:18:02 overcloud-cephstorage-0 journal: 2020-10-13 09:18:02.963822 7f84a0dd1700 -1 received  signal: Terminated from  PID: 134423 task name: <unknown> UID: 0
Oct 13 09:18:02 overcloud-cephstorage-0 journal: 2020-10-13 09:18:02.963833 7f84a0dd1700 -1 osd.4 59 *** Got signal Terminated ***
Oct 13 09:18:03 overcloud-cephstorage-0 journal: 2020-10-13 09:18:03.053871 7f84a0dd1700 -1 osd.4 59 shutdown
Oct 13 09:18:03 overcloud-cephstorage-0 ceph-osd-run.sh: 2020-10-13 09:18:03.053871 7f84a0dd1700 -1 osd.4 59 shutdown
Oct 13 09:18:05 overcloud-cephstorage-0 journal: teardown: Waiting PID 31518 to terminate ......................
Oct 13 09:18:05 overcloud-cephstorage-0 ceph-osd-run.sh: .....................
Oct 13 09:18:05 overcloud-cephstorage-0 ceph-osd-run.sh: teardown: Process 31518 is terminated
Oct 13 09:18:05 overcloud-cephstorage-0 ceph-osd-run.sh: sigterm_cleanup_post
Oct 13 09:18:05 overcloud-cephstorage-0 journal: teardown: Process 31518 is terminated
Oct 13 09:18:05 overcloud-cephstorage-0 journal: sigterm_cleanup_post
Oct 13 09:18:05 overcloud-cephstorage-0 journal: 2020-10-13 09:18:05  /entrypoint.sh: osd_disk_activate: Unmounting /var/lib/ceph/osd/ceph-4
Oct 13 09:18:05 overcloud-cephstorage-0 ceph-osd-run.sh: 2020-10-13 09:18:05  /entrypoint.sh: osd_disk_activate: Unmounting /var/lib/ceph/osd/ceph-4
Oct 13 09:18:05 overcloud-cephstorage-0 kernel: XFS (vdc1): Unmounting Filesystem
Oct 13 09:18:05 overcloud-cephstorage-0 journal: teardown: Bye Bye, container will die with return code 0
Oct 13 09:18:05 overcloud-cephstorage-0 ceph-osd-run.sh: teardown: Bye Bye, container will die with return code 0
~~~

Version-Release number of selected component (if applicable):

[3]
~~~
[root@overcloud-cephstorage-2 ~]# podman ps -a
CONTAINER ID  IMAGE                                                                    COMMAND      CREATED     STATUS         PORTS  NAMES
7fafca09bfb7  undercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-cron:16.1-57  kolla_start  6 days ago  Up 6 days ago         logrotate_crond

[heat-admin@overcloud-cephstorage-0 ~]$ sudo podman ps -a
CONTAINER ID  IMAGE                                                                    COMMAND      CREATED            STATUS                PORTS  NAMES
85a37e2c1308  undercloud.ctlplane.localdomain:8787/rhceph/rhceph-4-rhel8:4-33                       About an hour ago  Up About an hour ago         ceph-osd-1
9229ea45c170  undercloud.ctlplane.localdomain:8787/rhceph/rhceph-4-rhel8:4-33                       4 days ago         Up 4 days ago                ceph-osd-4
3c218e510337  undercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-cron:16.1-57  kolla_start  12 days ago        Up 12 days ago               logrotate_crond

[heat-admin@overcloud-cephstorage-2 ~]$ sudo podman images
REPOSITORY                                                        TAG       IMAGE ID       CREATED       SIZE
undercloud.ctlplane.localdomain:8787/rhosp-rhel8/openstack-cron   16.1-57   84b32ff4015f   6 weeks ago   390 MB


[root@overcloud-controller-0 ~]# podman exec -it ceph-mon-overcloud-controller-0 ceph status
  cluster:
    id:     37d85332-f8fd-11ea-b7c6-5254004e7212
    health: HEALTH_WARN
            noout,nobackfill,norecover,nodeep-scrub flag(s) set
            2 osds down
            2 hosts (4 osds) down
            Reduced data availability: 256 pgs inactive, 256 pgs down
            256 pgs not deep-scrubbed in time
            256 pgs not scrubbed in time
            3 monitors have not enabled msgr2
 
  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2 (age 68m)
    mgr: overcloud-controller-2(active, since 3d), standbys: overcloud-controller-1, overcloud-controller-0
    osd: 6 osds: 2 up, 4 in
         flags noout,nobackfill,norecover,nodeep-scrub
 
  data:
    pools:   4 pools, 256 pgs
    objects: 901 objects, 3.8 GiB
    usage:   3.9 GiB used, 35 GiB / 39 GiB avail
    pgs:     100.000% pgs not active
             256 down

~~~
How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
Ceph OSDs on RHCS 3 are removed.

Expected results:
Ceph OSDs on RHCS 3 should not be removed until the upgrade to RHCS 4.

Additional info: