During the FFU upgrade from OSP 16.2 to OSP 17.1, openstack overcloud upgrade run got stuck for more than 7 hours.

From overcloud_upgrade_run-computehci-0,computehci-1,computehci-2,controller-0,controller-1,controller-2,database-0,database-1,database-2,messaging-0,messaging-1,messaging-2,networker-0,networker-1,undercloud.log:

2023-07-12 23:18:44 | 2023-07-12 23:18:44.799776 | 525400d7-420c-ee98-3fcf-00000001a8f7 | OK | Notify user about upcoming cephadm execution(s) | undercloud | result={
2023-07-12 23:18:44 | "changed": false,
2023-07-12 23:18:44 | "msg": "Running 1 cephadm playbook(s) (immediate log at /home/stack/overcloud-deploy/qe-Cloud-0/config-download/qe-Cloud-0/cephadm/cephadm_command.log)"
2023-07-12 23:18:44 | }
2023-07-12 23:18:44 | 2023-07-12 23:18:44.802034 | 525400d7-420c-ee98-3fcf-00000001a8f7 | TIMING | tripleo_run_cephadm : Notify user about upcoming cephadm execution(s) | undercloud | 0:29:31.578488 | 0.05s
2023-07-12 23:18:44 | 2023-07-12 23:18:44.818351 | 525400d7-420c-ee98-3fcf-00000001a8f8 | TASK | run cephadm playbook

From /home/stack/overcloud-deploy/qe-Cloud-0/config-download/qe-Cloud-0/cephadm/cephadm_command.log:

2023-07-12 23:19:26,778 p=463425 u=stack n=ansible | 2023-07-12 23:19:26.777611 | 525400d7-420c-9c3f-b529-0000000001aa | TASK | Get ceph_cli
2023-07-12 23:19:26,830 p=463425 u=stack n=ansible | 2023-07-12 23:19:26.830146 | b6978dfe-0e19-493b-b2ec-c3f5c0f2d291 | INCLUDED | /usr/share/ansible/roles/tripleo_cephadm/tasks/ceph_cli.yaml | controller-0
2023-07-12 23:19:26,853 p=463425 u=stack n=ansible | 2023-07-12 23:19:26.852367 | 525400d7-420c-9c3f-b529-00000000037a | TASK | Set ceph CLI
2023-07-12 23:19:26,914 p=463425 u=stack n=ansible | 2023-07-12 23:19:26.913226 | 525400d7-420c-9c3f-b529-00000000037a | OK | Set ceph CLI | controller-0
2023-07-12 23:19:26,936 p=463425 u=stack n=ansible | 2023-07-12 23:19:26.935348 | 525400d7-420c-9c3f-b529-0000000001ab | TASK | Get the ceph orchestrator status

Executing the following command on controller-0 also returned no result; it got stuck at the same point:

[tripleo-admin@controller-0 ~]$ sudo cephadm shell -- ceph orch status
Inferring fsid 1606aa1c-08f8-4e53-9b34-3c74181f65d5
Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e
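For anyone triaging this, here is a minimal diagnosis sketch. It is not taken from the original run; the 120-second timeout is an arbitrary value. "ceph orch status" is answered by the cephadm module inside the active mgr, so when it hangs the first things to check are which mgr is active and whether the cephadm module is enabled:

# Re-run the call issued by the "Get the ceph orchestrator status" task, but
# bound it with a timeout instead of letting it block indefinitely.
sudo timeout 120 cephadm shell -- ceph orch status || echo "ceph orch status did not return within 120s"

# The orchestrator CLI is served by the active mgr; confirm which daemon is
# active and that the cephadm module is listed as enabled.
sudo cephadm shell -- ceph mgr dump | grep -E 'active_name|active_addr'
sudo cephadm shell -- ceph mgr module ls   # "cephadm" should appear among the enabled modules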
CEPH STATUS
===========
[tripleo-admin@controller-0 ~]$ sudo cephadm shell -- ceph status
Inferring fsid 1606aa1c-08f8-4e53-9b34-3c74181f65d5
Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e
  cluster:
    id:     1606aa1c-08f8-4e53-9b34-3c74181f65d5
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum controller-2,controller-1,controller-0 (age 8h)
    mgr: controller-0(active, since 8h), standbys: controller-1, controller-2
    osd: 15 osds: 15 up (since 8h), 15 in (since 6w)

  data:
    pools:   5 pools, 513 pgs
    objects: 584 objects, 1.8 GiB
    usage:   5.3 GiB used, 475 GiB / 480 GiB avail
    pgs:     513 active+clean

ERROR LOGS
===========
[tripleo-admin@controller-0 ~]$ sudo grep -i health -r /var/log/ceph/
[tripleo-admin@controller-0 ~]$ grep -i error -r /var/log/ceph/*
grep: /var/log/ceph/1606aa1c-08f8-4e53-9b34-3c74181f65d5: Permission denied
/var/log/ceph/cephadm.log:2023-07-12 22:24:34,883 7fe016b5eb80 DEBUG /bin/podman: Error: error inspecting object: no such container ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-crash-controller-0
/var/log/ceph/cephadm.log:2023-07-12 22:24:34,888 7fe016b5eb80 INFO /bin/podman: stderr Error: error inspecting object: no such container ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-crash-controller-0
/var/log/ceph/cephadm.log:2023-07-12 22:24:35,000 7fe016b5eb80 DEBUG /bin/podman: Error: error inspecting object: no such container ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-crash.controller-0
/var/log/ceph/cephadm.log:2023-07-12 22:24:35,007 7fe016b5eb80 INFO /bin/podman: stderr Error: error inspecting object: no such container ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-crash.controller-0

CONTAINERS
===========
[tripleo-admin@controller-0 ~]$ sudo podman ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
824cbd62da39  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-horizon:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  horizon
f3d03c58ed52  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  iscsid
65971615e059  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-keystone:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  keystone
ce0f1e5ffa58  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-proxy-server:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_object_expirer
35282fd69a26  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-object:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_object_replicator
5c5cf10e91ff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-object:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_object_server
a2dddd6bccd6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-object:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_object_updater
92915cbcf7fa  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-object:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_rsync
c618e98e8a97  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  cinder_api
f655a318253d  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  cinder_api_cron
eefb74ccd83b  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-scheduler:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  cinder_scheduler
e7dd5a227e0e  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-heat-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  heat_api
d4d492f3e803  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-heat-api-cfn:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  heat_api_cfn
d66201008e35  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-heat-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  heat_api_cron
3544c63637b4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-heat-engine:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  heat_engine
8ff76e38752a  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  logrotate_crond
9dbba7d4b2f2  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-server:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  neutron_api
0d84f6f18a2f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  nova_conductor
5b6e51882fc1  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-scheduler:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  nova_scheduler
5f9258688351  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-novncproxy:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  nova_vnc_proxy
f91655ecdba7  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-account:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_account_auditor
942156362e53  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-account:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_account_reaper
84fd9898003e  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-account:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_account_replicator
1bb99f2125f7  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-account:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_account_server
599c4d4fc6e7  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-container:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_container_auditor
2e2cc698178b  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-container:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_container_replicator
9422624a67e7  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-container:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_container_server
381786cee25d  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-container:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_container_updater
36ccd0726ebc  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-object:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_object_auditor
9568a44ca416  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-placement-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  placement_api
9b1c694cf8f0  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-glance-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  glance_api
98bb550c241a  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-glance-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  glance_api_cron
f5462a1ba132  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  nova_api
50f422579832  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  nova_metadata
a7c7854b5c03  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-swift-proxy-server:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  swift_proxy
2f8ca8cc4d10  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:16.2_20230526.1  kolla_start  6 weeks ago  Up 10 hours ago  nova_api_cron
4999273b993b  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:5-404  -n mon.controller...  8 hours ago  Up 8 hours ago  ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-mon-controller-0
de6b573beadc  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:5-404  -n mgr.controller...  8 hours ago  Up 8 hours ago  ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-mgr-controller-0
985cc3701391  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e  -n client.crash.c...  8 hours ago  Up 8 hours ago  ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5-crash-controller-0
63f9724035c9  cluster.common.tag/haproxy:pcmklatest  /bin/bash /usr/lo...  8 hours ago  Up 8 hours ago  haproxy-bundle-podman-0
002ad445b88b  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-memcached:17.1_20230711.2  kolla_start  7 hours ago  Up 7 hours ago  memcached
bd20fa130fdc  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:5-404  --fsid 1606aa1c-0...  7 hours ago  Up 7 hours ago  ecstatic_napier

CEPH HEALTH
===========
[tripleo-admin@controller-0 ~]$ sudo cephadm shell -- ceph health detail
Inferring fsid 1606aa1c-08f8-4e53-9b34-3c74181f65d5
Using recent ceph image undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e
HEALTH_WARN mons are allowing insecure global_id reclaim
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.controller-2 has auth_allow_insecure_global_id_reclaim set to true
    mon.controller-1 has auth_allow_insecure_global_id_reclaim set to true
    mon.controller-0 has auth_allow_insecure_global_id_reclaim set to true

SYSTEMCTL
=========
[tripleo-admin@controller-0 ~]$ sudo systemctl status ceph\*.service
● ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5.service - Ceph mon.controller-0 for 1606aa1c-08f8-4e53-9b34-3c74181f65d5
   Loaded: loaded (/etc/systemd/system/ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 22:19:45 UTC; 8h ago
 Main PID: 353380 (conmon)
    Tasks: 30 (limit: 100744)
   Memory: 925.2M
   CGroup: /system.slice/system-ceph\x2d1606aa1c\x2d08f8\x2d4e53\x2d9b34\x2d3c74181f65d5.slice/ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5.service
           ├─container
           │ ├─353392 /dev/init -- /usr/bin/ceph-mon -n mon.controller-0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true
           │ └─353405 /usr/bin/ceph-mon -n mon.controller-0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true
           └─supervisor
             └─353380 /usr/bin/conmon --api-version 1 -c 4999273b993b6b9d663f3b4dd1076062f679932b64a549a789b5289a06fdbfcc -u 4999273b993b6b9d663f3b4dd1076062f679932b64a549a789b5289a06fdbfcc -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/4999273b993b6b9d663f3b4dd1076062f679932b64a549a789b5289a06fd>

Jul 13 06:49:02 controller-0 conmon[353380]: cluster 2023-07-13T06:49:00.696279+0000 mgr.controller-0 (mgr.64108) 15369 : cluster [DBG] pgmap v15323: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:03 controller-0 conmon[353380]: debug 2023-07-13T06:49:03.943+0000 7f8f9f84e700 0 mon.controller-0@2(peon) e2 handle_command mon_command({"prefix": "config dump", "format": "json"} v 0) v1
Jul 13 06:49:03 controller-0 conmon[353380]: debug 2023-07-13T06:49:03.943+0000 7f8f9f84e700 0 log_channel(audit) log [DBG] : from='mgr.64108 [fd00:fd00:fd00:3000::277]:0/4282114632' entity='mgr.controller-0' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
Jul 13 06:49:04 controller-0 conmon[353380]: cluster 2023-07-13T06:49:02.696968+0000 mgr.controller-0 (mgr.64108) 15370 : cluster [DBG] pgmap v15324: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:04 controller-0 conmon[353380]: audit 2023-07-13T06:49:03.945045+0000 mon.controller-0 (mon.2) 7472 : audit [DBG] from='mgr.64108 [fd00:fd00:fd00:3000::277]:0/4282114632' entity='mgr.controller-0' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
Jul 13 06:49:04 controller-0 conmon[353380]: debug 2023-07-13T06:49:04.811+0000 7f8f9f84e700 0 mon.controller-0@2(peon) e2 handle_command mon_command([{prefix=config-key set, key=mgr/cephadm/osd_remove_queue}] v 0) v1
Jul 13 06:49:05 controller-0 conmon[353380]: cluster 2023-07-13T06:49:04.697700+0000 mgr.controller-0 (mgr.64108) 15371 : cluster [DBG] pgmap v15325: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:05 controller-0 conmon[353380]: audit 2023-07-13T06:49:04.822347+0000 mon.controller-2 (mon.0) 10302 : audit [INF] from='mgr.64108 ' entity='mgr.controller-0'
Jul 13 06:49:06 controller-0 conmon[353380]: debug 2023-07-13T06:49:06.422+0000 7f8fa2053700 1 mon.controller-0@2(peon).osd e226 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 71303168 kv_alloc: 872415232
Jul 13 06:49:07 controller-0 conmon[353380]: cluster 2023-07-13T06:49:06.698819+0000 mgr.controller-0 (mgr.64108) 15372 : cluster [DBG] pgmap v15326: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail

● ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5.service - Ceph crash.controller-0 for 1606aa1c-08f8-4e53-9b34-3c74181f65d5
   Loaded: loaded (/etc/systemd/system/ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 22:24:37 UTC; 8h ago
 Main PID: 373192 (conmon)
    Tasks: 4 (limit: 100744)
   Memory: 7.9M
   CGroup: /system.slice/system-ceph\x2d1606aa1c\x2d08f8\x2d4e53\x2d9b34\x2d3c74181f65d5.slice/ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5.service
           ├─container
           │ ├─373205 /dev/init -- /usr/bin/ceph-crash -n client.crash.controller-0
           │ └─373237 /usr/libexec/platform-python -s /usr/bin/ceph-crash -n client.crash.controller-0
           └─supervisor
             └─373192 /usr/bin/conmon --api-version 1 -c 985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156f725 -u 985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156f725 -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156>

Jul 12 22:24:36 controller-0 systemd[1]: Starting Ceph crash.controller-0 for 1606aa1c-08f8-4e53-9b34-3c74181f65d5...
Jul 12 22:24:37 controller-0 podman[373122]: 2023-07-12 22:24:37.193182476 +0000 UTC m=+0.128684558 container create 985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156f725 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e, >
Jul 12 22:24:37 controller-0 podman[373122]: 2023-07-12 22:24:37.137771526 +0000 UTC m=+0.073273609 image pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e
Jul 12 22:24:37 controller-0 podman[373122]: 2023-07-12 22:24:37.382231705 +0000 UTC m=+0.317733796 container init 985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156f725 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e, na>
Jul 12 22:24:37 controller-0 podman[373122]: 2023-07-12 22:24:37.408185636 +0000 UTC m=+0.343687734 container start 985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156f725 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph@sha256:f165a015a577bc7aeebb22355f01d471722cee7873fa0f7cd23e1c210e45f30e, n>
Jul 12 22:24:37 controller-0 bash[372881]: 985cc3701391e9501ce0605ac9e0e3aea068cbfae03d398ac7b3f9b42156f725
Jul 12 22:24:37 controller-0 systemd[1]: Started Ceph crash.controller-0 for 1606aa1c-08f8-4e53-9b34-3c74181f65d5.
Jul 12 22:24:37 controller-0 conmon[373192]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s

● ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5.service - Ceph mgr.controller-0 for 1606aa1c-08f8-4e53-9b34-3c74181f65d5
   Loaded: loaded (/etc/systemd/system/ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 22:20:34 UTC; 8h ago
 Main PID: 356559 (conmon)
    Tasks: 113 (limit: 100744)
   Memory: 465.5M
   CGroup: /system.slice/system-ceph\x2d1606aa1c\x2d08f8\x2d4e53\x2d9b34\x2d3c74181f65d5.slice/ceph-1606aa1c-08f8-4e53-9b34-3c74181f65d5.service
           ├─container
           │ ├─356572 /dev/init -- /usr/bin/ceph-mgr -n mgr.controller-0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug
           │ ├─356584 /usr/bin/ceph-mgr -n mgr.controller-0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug
           │ ├─356901 ssh -C -F /tmp/cephadm-conf-etmyg6ii -i /tmp/cephadm-identity-86slzpv0 -o ServerAliveInterval=7 -o ServerAliveCountMax=3 ceph-admin.24.30 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
           │ ├─356902 ssh -C -F /tmp/cephadm-conf-etmyg6ii -i /tmp/cephadm-identity-86slzpv0 -o ServerAliveInterval=7 -o ServerAliveCountMax=3 ceph-admin.24.32 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
           │ ├─356903 ssh -C -F /tmp/cephadm-conf-etmyg6ii -i /tmp/cephadm-identity-86slzpv0 -o ServerAliveInterval=7 -o ServerAliveCountMax=3 ceph-admin.24.13 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
           │ ├─356904 ssh -C -F /tmp/cephadm-conf-etmyg6ii -i /tmp/cephadm-identity-86slzpv0 -o ServerAliveInterval=7 -o ServerAliveCountMax=3 ceph-admin.24.35 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
           │ ├─356906 ssh -C -F /tmp/cephadm-conf-etmyg6ii -i /tmp/cephadm-identity-86slzpv0 -o ServerAliveInterval=7 -o ServerAliveCountMax=3 ceph-admin.24.36 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
           │ └─356907 ssh -C -F /tmp/cephadm-conf-etmyg6ii -i /tmp/cephadm-identity-86slzpv0 -o ServerAliveInterval=7 -o ServerAliveCountMax=3 ceph-admin.24.16 sudo python3 -c "import sys;exec(eval(sys.stdin.readline()))"
           └─supervisor
             └─356559 /usr/bin/conmon --api-version 1 -c de6b573beadcf2396d6e085b316d2f5c79f41365a4fc05499f762b545a8f4117 -u de6b573beadcf2396d6e085b316d2f5c79f41365a4fc05499f762b545a8f4117 -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/de6b573beadcf2396d6e085b316d2f5c79f41365a4fc05499f762b545a8f>

Jul 13 06:49:00 controller-0 conmon[356559]: debug 2023-07-13T06:49:00.695+0000 7f8fa5ee4700 0 log_channel(cluster) log [DBG] : pgmap v15323: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:01 controller-0 conmon[356559]: debug 2023-07-13T06:49:01.837+0000 7f8f98c4a700 0 [progress INFO root] Processing OSDMap change 226..226
Jul 13 06:49:02 controller-0 conmon[356559]: debug 2023-07-13T06:49:02.695+0000 7f8fa5ee4700 0 log_channel(cluster) log [DBG] : pgmap v15324: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:04 controller-0 conmon[356559]: debug 2023-07-13T06:49:04.696+0000 7f8fa5ee4700 0 log_channel(cluster) log [DBG] : pgmap v15325: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:04 controller-0 conmon[356559]: debug 2023-07-13T06:49:04.824+0000 7f8f9ccd2700 0 [progress WARNING root] complete: ev fa1ba7ce-42cf-4a59-b20b-cd803b331e20 does not exist
Jul 13 06:49:04 controller-0 conmon[356559]: debug 2023-07-13T06:49:04.825+0000 7f8f9ccd2700 0 [progress WARNING root] complete: ev 4df6b9cd-ea4b-45e7-8504-0f3efa3534e6 does not exist
Jul 13 06:49:04 controller-0 conmon[356559]: debug 2023-07-13T06:49:04.826+0000 7f8f9ccd2700 0 [progress WARNING root] complete: ev 82594c29-a3e5-4169-878f-5e82ddbb427f does not exist
Jul 13 06:49:06 controller-0 conmon[356559]: debug 2023-07-13T06:49:06.697+0000 7f8fa5ee4700 0 log_channel(cluster) log [DBG] : pgmap v15326: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
Jul 13 06:49:06 controller-0 conmon[356559]: debug 2023-07-13T06:49:06.841+0000 7f8f98c4a700 0 [progress INFO root] Processing OSDMap change 226..226
Jul 13 06:49:08 controller-0 conmon[356559]: debug 2023-07-13T06:49:08.698+0000 7f8fa5ee4700 0 log_channel(cluster) log [DBG] : pgmap v15327: 513 pgs: 513 active+clean; 1.8 GiB data, 5.3 GiB used, 475 GiB / 480 GiB avail
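Since the mon, mgr and crash units above all look healthy, it may also help to watch what the cephadm mgr module is doing while ceph orch status hangs. A possible sketch using the standard cephadm log-channel commands (not taken from this environment; run the watch in a second terminal):

# Raise cephadm logging to the cluster log and follow it live.
sudo cephadm shell -- ceph config set mgr mgr/cephadm/log_to_cluster_level debug
sudo cephadm shell -- ceph -W cephadm --watch-debug   # Ctrl-C to stop

# Afterwards, dump the most recent cephadm messages and restore the default level.
sudo cephadm shell -- ceph log last 50 debug cephadm
sudo cephadm shell -- ceph config set mgr mgr/cephadm/log_to_cluster_level info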
It seems that there is a workaround:

@controller-0 $ sudo cephadm shell -- ceph mgr fail controller-0

controller-0 is passed because it is the active mgr shown in the ceph status output above.

BTW, moving this to DFG:Storage.
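Spelled out a bit more, a minimal sketch of that workaround (it assumes controller-0 is still the active mgr, as reported by ceph status; adjust the name otherwise):

sudo cephadm shell -- ceph status | grep 'mgr:'     # confirm which mgr is currently active
sudo cephadm shell -- ceph mgr fail controller-0    # fail it over so a standby (controller-1/2) takes over
sudo cephadm shell -- ceph orch status              # should now return instead of hanging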
Possible workaround: https://review.opendev.org/c/openstack/tripleo-ansible/+/887565
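One possible way to try that change on the undercloud before it merges (a sketch only; the patchset number in the refspec and the rsync target are assumptions to check against the review and the installed tripleo-ansible package):

git clone https://opendev.org/openstack/tripleo-ansible && cd tripleo-ansible
git fetch https://review.opendev.org/openstack/tripleo-ansible refs/changes/65/887565/1 && git checkout FETCH_HEAD
# Overwrite the packaged copy of the role that config-download executes
# (the same path shown in cephadm_command.log above).
sudo rsync -a tripleo_ansible/roles/tripleo_cephadm/ /usr/share/ansible/roles/tripleo_cephadm/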
Tested the fix in regular Ceph deployments and observed no regression. We only need to verify it in the different upgrade scenarios.
FWIW I did not encounter this issue in OSPdO initially. When I added DeployedCeph: true I hit this issue. Removing it again resolved the issue. Discussing on slack with fultonj yesterday "you should be setting DeployedCeph: false... It means you used tripleo client to deploy ceph"
(In reply to Ollie Walsh from comment #9)
> FWIW I did not encounter this issue in OSPdO initially. When I added
> DeployedCeph: true I hit this issue. Removing it again resolved the issue.
>
> Discussing on slack with fultonj yesterday "you should be setting
> DeployedCeph: false... It means you used tripleo client to deploy ceph"

But OSPdO is a special case: it uses Heat to trigger cephadm because it is not using the python tripleo client.
Here are the details on what DeployedCeph does:

https://github.com/openstack/tripleo-ansible/commit/35e455ad52751ae334a614a03c8257da2f30e038
https://github.com/openstack/tripleo-heat-templates/commit/feb93b26b42673d1958917603f2498b9456fb34c
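For context, the parameter is normally set through an environment file passed to the deploy. An illustrative sketch only (the file name deployed-ceph.yaml is made up and the deploy command is abbreviated):

cat > ~/deployed-ceph.yaml <<'EOF'
parameter_defaults:
  # true  -> Ceph was already deployed with "openstack overcloud ceph deploy"
  #          (python tripleoclient), so the overcloud deploy only runs the
  #          post-deployment cephadm steps.
  # false -> the overcloud deployment itself triggers cephadm, which is the
  #          OSPdO / Heat-driven case described in the comments above.
  DeployedCeph: true
EOF
# openstack overcloud deploy <usual arguments> -e ~/deployed-ceph.yaml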
Doc text looks good to me.