+++ This bug was initially created as a clone of Bug #2119217 +++

While testing snapshot-based rbd-mirror with a random read/write workload, Paul Cuzner noticed that at the start of every replication interval the ceph-osd CPU consumption spikes dramatically, and continues to grow over time. For example, at the beginning of a run the CPU spike was 60% of a core, but after 24 hours with the same randrw workload running it grows to 1.5-2 cores. The CPU overhead appears worse for 4KB block sizes than for IO sizes of 16KB or more. The change rate within the snapshot is only 250MB every 5 minutes - the workload is just 20 rbd images, using rate-limited fio which caps each rbd image to 50 IOPS (40 read + 10 write).

If the host is not capping the OSD, this issue will likely go unnoticed, but in environments like k8s where the OSD is capped it is more of a problem. It translates to high latency for clients during these spikes and, with the growth over time, continual performance degradation.

--- Additional comment from Josh Durgin on 2022-08-17 22:44:10 UTC ---

Paul Cuzner's experiments and analysis leading to this are described here: https://docs.google.com/document/d/13ms1bptpnra7Inyk70ZoeqtIJtx0sUPXaXtkHuxVZk4/edit?usp=sharing

--- Additional comment from Neha Ojha on 2022-11-02 16:23:16 UTC ---

Thanks to Thomas, we have a branch based on the current state of 6.0 for early performance testing:
https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.0-rhel-9-test-bz2119217-patches

Current working doc:
https://docs.google.com/document/d/16QPNZDumOOONL20E3YtKYJf-n_NXBYqRU22pLZAgeSE/edit#heading=h.mi6k5ka1jby2

--- Additional comment from Elvir Kuric on 2022-11-02 17:05:19 UTC ---

(In reply to Neha Ojha from comment #2)
> Thanks to Thomas, we have a branch based on the current state of 6.0:
> https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.0-rhel-9-test-bz2119217-patches for early performance testing.

Will the new image be at https://quay.ceph.io/ceph-ci/ceph (once it is back online), and what image tag will be used for it?

Thank you,
Elvir

--- Additional comment from Neha Ojha on 2022-11-02 18:29:04 UTC ---

(In reply to Elvir Kuric from comment #3)
> Will the new image be at https://quay.ceph.io/ceph-ci/ceph (once it is back online), and what image tag will be used for it?

Thomas, I believe Adam has pushed his changes to private-tserlin-ceph-6.0-rhel-9-test-bz2119217-patches; can you please let us know when the image is ready?

Elvir, you will receive a downstream container image because upstream https://quay.ceph.io/ceph-ci/ceph is still down.
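For reference, the workload in the description above (20 rbd images, 4 KiB randrw, each image capped at 50 IOPS split 40 read / 10 write) can be approximated per image with a single rate-limited fio invocation along these lines. This is only a sketch of the reproducer, not the exact job used in the testing: the pool, image, and client names are placeholders, and fio must be built with RBD ioengine support.

fio --name=esb-randrw --ioengine=rbd --clientname=admin --pool=rbd --rbdname=image01 \
    --rw=randrw --rwmixread=80 --bs=4k --iodepth=4 \
    --rate_iops=40,10 --time_based --runtime=24h

One such job per image (20 in total) reproduces the aggregate 1000 IOPS, 4 KiB random mix described above.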
--- Additional comment from tserlin on 2022-11-03 02:36:00 UTC ---

The testfix container is ready:

* rhceph-container-6-55.0.TEST.bz2119217
* Pull from: registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217
* Brew link: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2243733
* ceph testfix build in container: ceph-17.2.5-8.0.TEST.bz2119217.el9cp

Based on this -patches branch (5c31df0e91285684fcb133a48ac24948aa3a9785):
https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.0-rhel-9-test-bz2119217-patches

Thomas

--- Additional comment from Neha Ojha on 2022-11-04 19:02:03 UTC ---

(In reply to tserlin from comment #5)
> The testfix container is ready:
> * Pull from: registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217

Thanks Thomas.

Hi Elvir, the container images are ready; please let us know when you'll have a chance to run another round of tests.

--- Additional comment from Elvir Kuric on 2022-11-07 12:03:19 UTC ---

(In reply to Neha Ojha from comment #6)
> Hi Elvir, the container images are ready; please let us know when you'll have a chance to run another round of tests.
I am going to use the below image (based on https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2243733) to upgrade the clusters:

registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217

I can reach it with "podman pull registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217", and to upgrade the clusters I am going to use

# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217

based on https://docs.ceph.com/en/quincy/cephadm/upgrade/#starting-the-upgrade

--- Additional comment from Elvir Kuric on 2022-11-08 11:00:55 UTC ---

On a test cluster which is in a healthy state I wanted to upgrade to the test image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217.

Can someone advise me what I am doing wrong and why the upgrade process does not start, though it should per https://docs.ceph.com/en/quincy/cephadm/upgrade/#starting-the-upgrade? Comments are welcome, thank you in advance,
Elvir

--- logs ---

# cluster1 orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217
Initiating upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217

in second console

# ceph -W cephadm
2022-11-08T10:37:42.727226+0000 mgr.f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz [INF] Upgrade: Started with target registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217
2022-11-08T10:37:42.876399+0000 mgr.f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz [INF] Upgrade: First pull of registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217
2022-11-08T10:37:44.862471+0000 mgr.f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz [INF] Upgrade: Target is version 17.2.5-8.0.TEST.bz2119217.el9cp (quincy)
2022-11-08T10:37:44.862544+0000 mgr.f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz [INF] Upgrade: Target container is registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:431d7d287041b25e3fae2c920b9f040b3ee20af61ebf4d10f9d6131d767914dc, digests ['registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:431d7d287041b25e3fae2c920b9f040b3ee20af61ebf4d10f9d6131d767914dc', 'registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:899805e337013cce4143c815b7b6fb1f3d615505850a7ee85a5b9b542ea44a59']
2022-11-08T10:37:44.862631+0000 mgr.f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz [ERR] Upgrade: Paused due to UPGRADE_BAD_TARGET_VERSION: Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp

# cluster1 orch upgrade status
{
    "target_image": "registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:431d7d287041b25e3fae2c920b9f040b3ee20af61ebf4d10f9d6131d767914dc",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "0/22 daemons upgraded",
    "message": "Error: UPGRADE_BAD_TARGET_VERSION: Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp",
    "is_paused": true
}

--------------------

# cluster1 -s
  cluster:
    id:     6a296ec8-483a-11ed-9fd7-ac1f6b7abb24
    health: HEALTH_ERR
            Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp

  services:
    mon:        3 daemons, quorum f09-h01-000-1029u.rdu2.scalelab.redhat.com,f09-h02-000-1029u,f09-h03-000-1029u (age 4w)
    mgr:        f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz(active, since 4w), standbys: f09-h02-000-1029u.mzsguz
    osd:        12 osds: 12 up (since 4w), 12 in (since 4w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    pools:   14 pools, 417 pgs
    objects: 42.43k objects, 111 GiB
    usage:   353 GiB used, 4.9 TiB / 5.2 TiB avail
    pgs:     417 active+clean

  io:
    client: 148 KiB/s rd, 16 KiB/s wr, 172 op/s rd, 22 op/s wr

  progress:
    Upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217 (0s)
      [............................]

I can stop the upgrade process

# cluster1 orch upgrade stop
Stopped upgrade to registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:431d7d287041b25e3fae2c920b9f040b3ee20af61ebf4d10f9d6131d767914dc

# cluster1 -s
  cluster:
    id:     6a296ec8-483a-11ed-9fd7-ac1f6b7abb24
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum f09-h01-000-1029u.rdu2.scalelab.redhat.com,f09-h02-000-1029u,f09-h03-000-1029u (age 4w)
    mgr:        f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz(active, since 4w), standbys: f09-h02-000-1029u.mzsguz
    osd:        12 osds: 12 up (since 4w), 12 in (since 4w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    pools:   14 pools, 417 pgs
    objects: 42.43k objects, 111 GiB
    usage:   353 GiB used, 4.9 TiB / 5.2 TiB avail
    pgs:     417 active+clean

  io:
    client: 18 KiB/s rd, 1.7 KiB/s wr, 18 op/s rd, 1 op/s wr

and it is in HEALTH_OK state

---

# for d in mgr mon crash osd mds rgw rbd-mirror cephfs-mirror iscsi nfs ; do echo "daemon ---- ---- $d -----------"; cluster1 orch ps --daemon-type $d; done
daemon ---- ---- mgr -----------
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mgr.f09-h01-000-1029u.rdu2.scalelab.redhat.com.nnirqz f09-h01-000-1029u.rdu2.scalelab.redhat.com *:9283,8765 running (4w) 3m ago 4w 668M - 18.0.0-333-g29fc1bfd ddb01ae703b8 1e5d8a03bcb9
mgr.f09-h02-000-1029u.mzsguz f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 434M - 18.0.0-333-g29fc1bfd ddb01ae703b8 e6d5d5df14b3
daemon ---- ---- mon -----------
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mon.f09-h01-000-1029u.rdu2.scalelab.redhat.com f09-h01-000-1029u.rdu2.scalelab.redhat.com running (4w) 3m ago 4w 1389M 2048M 18.0.0-333-g29fc1bfd ddb01ae703b8 74b61f772cbb
mon.f09-h02-000-1029u f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 1502M 2048M 18.0.0-333-g29fc1bfd ddb01ae703b8 8375126e287e
mon.f09-h03-000-1029u f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 1714M 2048M 18.0.0-333-g29fc1bfd ddb01ae703b8 1020d028eb26
daemon ---- ---- crash -----------
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
crash.f09-h01-000-1029u f09-h01-000-1029u.rdu2.scalelab.redhat.com running (4w) 3m ago 4w 7222k - 18.0.0-333-g29fc1bfd ddb01ae703b8 f4b4e36f15ae
crash.f09-h02-000-1029u f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 7239k - 18.0.0-333-g29fc1bfd ddb01ae703b8 6a5a2c266cce
crash.f09-h03-000-1029u f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 7788k - 18.0.0-333-g29fc1bfd ddb01ae703b8 f4c73b911425
daemon ---- ---- osd -----------
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
osd.0 f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 88.2G 65.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 c5ea64f32588
osd.1 f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 96.2G 64.4G 18.0.0-333-g29fc1bfd ddb01ae703b8 52328d182657
osd.2 f09-h01-000-1029u.rdu2.scalelab.redhat.com running (4w) 3m ago 4w 72.3G 64.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 d504ff6d326d
osd.3 f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 103G 65.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 03edf538e4d2
osd.4 f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 97.3G 64.4G 18.0.0-333-g29fc1bfd ddb01ae703b8 b153acf3a82e
osd.5 f09-h01-000-1029u.rdu2.scalelab.redhat.com running (4w) 3m ago 4w 26.5G 64.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 f15133c10080
osd.6 f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 71.1G 65.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 fd96f3cc8245
osd.7 f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 122G 64.4G 18.0.0-333-g29fc1bfd ddb01ae703b8 814c4a6146d0
osd.8 f09-h01-000-1029u.rdu2.scalelab.redhat.com running (4w) 3m ago 4w 158G 64.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 afb45e38e685
osd.9 f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 120G 65.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 012d80aa4566
osd.10 f09-h02-000-1029u.rdu2.scalelab.redhat.com running (4w) 2m ago 4w 64.9G 64.4G 18.0.0-333-g29fc1bfd ddb01ae703b8 c74ad9d68d87
osd.11 f09-h01-000-1029u.rdu2.scalelab.redhat.com running (4w) 3m ago 4w 117G 64.1G 18.0.0-333-g29fc1bfd ddb01ae703b8 dac65c2599d1
daemon ---- ---- mds -----------
No daemons reported
daemon ---- ---- rgw -----------
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
rgw.objectstore.f09-h01-000-1029u.nwqpgg f09-h01-000-1029u.rdu2.scalelab.redhat.com *:80 running (3w) 3m ago 3w 1210M - 18.0.0-333-g29fc1bfd ddb01ae703b8 6d362cc0350e
daemon ---- ---- rbd-mirror -----------
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
rbd-mirror.f09-h03-000-1029u.yjnwsk f09-h03-000-1029u.rdu2.scalelab.redhat.com running (4w) 4m ago 4w 56.3M - 18.0.0-333-g29fc1bfd ddb01ae703b8 7e0cf1e3b361
daemon ---- ---- cephfs-mirror -----------
No daemons reported
daemon ---- ---- iscsi -----------
No daemons reported
daemon ---- ---- nfs -----------
No daemons reported

--- Additional comment from Josh Durgin on 2022-11-08 15:43:11 UTC ---

(In reply to Elvir Kuric from comment #8)
> On a test cluster which is in a healthy state I wanted to upgrade to the test image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-55.0.TEST.bz2119217.
>
> # cluster1 -s
>   cluster:
>     id:     6a296ec8-483a-11ed-9fd7-ac1f6b7abb24
>     health: HEALTH_ERR
>             Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp

"ceph health detail" should show why cephadm thinks it can't upgrade to this version.
@adking may be able to help further with how to get past that

--- Additional comment from Elvir Kuric on 2022-11-08 15:46:20 UTC ---

# cluster1 health detail
HEALTH_ERR Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp
[ERR] UPGRADE_BAD_TARGET_VERSION: Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp
    ceph cannot downgrade major versions (from 18.0.0-333-g29fc1bfd (29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf) reef (dev) to 17.2.5-8.0.TEST.bz2119217.el9cp)

--- Additional comment from Adam King on 2022-11-08 17:41:13 UTC ---

(In reply to Elvir Kuric from comment #10)
> [ERR] UPGRADE_BAD_TARGET_VERSION: Upgrade: cannot upgrade/downgrade to 17.2.5-8.0.TEST.bz2119217.el9cp
>     ceph cannot downgrade major versions (from 18.0.0-333-g29fc1bfd (29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf) reef (dev) to 17.2.5-8.0.TEST.bz2119217.el9cp)

It looks like the version you're currently on is considered v18 (some main branch image? v18 is for Reef) and you're trying to "upgrade" to a v17 image (which would be quincy, or perhaps an older main image before v18 was set up). Gathering here that you're testing a patch built on top of 6.0. Is there a reason for starting from a main branch image here and trying to upgrade to a quincy (RHCS 6) image from there? Cephadm is blocking it because, as far as it's concerned, you're trying to downgrade across major versions, which isn't supported.

If this is necessary, I think you can technically get around it by first manually upgrading the mgr daemons by redeploying them with a specified image. Note that this doesn't work properly if passed the active mgr (there's a fix open, but it wouldn't be in any build you're testing). So you'd have to redeploy the standby mgr(s), do a fail over, then upgrade the previously active one. E.g. with a mgr on vm-00 and vm-01, with vm-00 being the active one, I did this with

ceph orch daemon redeploy mgr.vm-01.xnmtnp --image quay.io/ceph/ceph:v17.2.5

Wait until "ceph versions" reports a mgr on v17, then

ceph mgr fail
ceph orch daemon redeploy mgr.vm-00.motjgc --image quay.io/ceph/ceph:v17.2.5

Then once both mgr daemons were on v17 I was able to "upgrade" to v17.2.5 from a cluster previously using a v18 image. I'd keep in mind that this sort of jump from main back to quincy isn't tested or supported, so I'm not sure what results you'd get with these particular images, but that's the workaround if starting from whatever main branch image is being used is necessary.

--- Additional comment from Vikhyat Umrao on 2022-11-08 17:47:49 UTC ---

+1. I think we should not downgrade this cluster, we should redeploy! Coming from v18 to v17 is not good; we do not know what kind of things it will break or what effects it will have on operations!

Josh - thoughts?

--- Additional comment from Josh Durgin on 2022-11-08 19:12:19 UTC ---

(In reply to Vikhyat Umrao from comment #12)
> +1. I think we should not downgrade this cluster, we should redeploy!
>
> Josh - thoughts?
Agreed

--- Additional comment from RHEL Program Management on 2023-03-29 17:40:55 UTC ---

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is being proposed as a blocker for this release.
Please resolve ASAP.

--- Additional comment from Adam Kupczyk on 2023-03-31 19:13:34 UTC ---

The solution for the issue will include:

a) A modification to BlueStore that significantly reduces cpu usage when processing snaps.
It is https://github.com/ceph/ceph/pull/49837; it passed all tests for Quincy and Reef and is awaiting review.

b) A PR (https://github.com/ceph/ceph/pull/50812) that introduces feature control logic.

A ceph option `bluestore_reuse_shared_blob` is introduced.
This option will exist in Quincy and Reef, and will be removed in S(quid?).
It is an OSD deploy mode option that determines usage of the new, improved blob processing logic.

In Quincy (6.1) it is OFF by default.
In Reef (7.x) it is ON by default.

Both releases will include a new admin socket command for the OSD: "bluestore enable_shared_blob_reuse".
It immediately makes the transition OFF -> ON, enabling reuse of shared blobs.
Starting from that moment shared blobs can be reused, in the long term reducing the number of shared blobs and improving performance.

It will never be possible to transition ON -> OFF.

--- Additional comment from Neha Ojha on 2023-03-31 20:27:42 UTC ---

(In reply to Adam Kupczyk from comment #41)
> A ceph option `bluestore_reuse_shared_blob` is introduced.
> It is an OSD deploy mode option that determines usage of the new, improved blob processing logic.
> In Quincy (6.1) it is OFF by default.

@tnielsen: We'd like to set bluestore_reuse_shared_blob to true for ODF 4.13 fresh installs, perhaps by means of Rook.

As far as brownfield clusters are concerned, we need to figure out a couple of things:

1. Will we support RDR in clusters upgraded to ODF 4.13?
2. If the answer to (1) is no, we don't need to handle or test upgrades for the scope of this BZ. If the answer is yes, after an upgrade to ODF 4.13, bluestore_reuse_shared_blob will be set to false. We need to use the admin socket command mentioned below to enable this config in clusters using RDR.

> In Reef (7.x) it is ON by default.
>
> Both releases will include a new admin socket command for the OSD: "bluestore enable_shared_blob_reuse".
> It immediately makes the transition OFF -> ON, enabling reuse of shared blobs.
>
> It will never be possible to transition ON -> OFF.

--- Additional comment from Travis Nielsen on 2023-04-04 19:50:43 UTC ---

(In reply to Neha Ojha from comment #42)
> @tnielsen: We'd like to set bluestore_reuse_shared_blob to true for ODF 4.13 fresh installs, perhaps by means of Rook.

Rook (via the OCS operator for downstream) allows setting Ceph values that will be loaded into the ceph.conf for each daemon. As long as the bluestore_reuse_shared_blob value would be picked up from ceph.conf, for new clusters the value could be set in the configmap here [1] by the OCS operator.

> As far as brownfield clusters are concerned, we need to figure out a couple of things:
> 1. Will we support RDR in clusters upgraded to ODF 4.13?
> 2. If the answer to (1) is no, we don't need to handle or test upgrades for the scope of this BZ. If the answer is yes, after an upgrade to ODF 4.13, bluestore_reuse_shared_blob will be set to false. We need to use the admin socket command mentioned below to enable this config in clusters using RDR.

In the upgraded clusters, the daemons are restarted and would pick up the new ceph.conf that is updated in the configmap. Malay, during an upgrade, the OCS operator will update that configmap with any new settings, correct? I am not clear about the reconcile strategy of the OCS operator for the configmap "rook-config-override".

Neha, for the brownfield case are you saying the admin socket on each OSD would need to have that setting enabled? That would take a separate approach from how Rook sets any settings in Ceph. Or is the ceph.conf or a "ceph config set" command sufficient?

[1] https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephconfig.go#L31-L40

--- Additional comment from Malay Kumar parida on 2023-04-05 05:54:37 UTC ---

> Malay, during an upgrade, the OCS operator will update that configmap with any new settings, correct?

Yes Travis, the changes will be picked up after the upgrade. So if some particular config is decided here, we can put that in that configmap. But keep in mind ocs-operator will always reconcile the configmap and will always apply those configs. So if someone wants to remove those configs for any reason, that will be impossible. So we have to be a little careful.

--- Additional comment from Neha Ojha on 2023-04-05 17:17:41 UTC ---

(In reply to Travis Nielsen from comment #43)
> Neha, for the brownfield case are you saying the admin socket on each OSD would need to have that setting enabled? That would take a separate approach from how Rook sets any settings in Ceph. Or is the ceph.conf or a "ceph config set" command sufficient?

Adam, how do you envision this to work with your current PR?

--- Additional comment from Vivek Das on 2023-04-27 07:32:29 UTC ---

Hello Adam,

Any update on this bug? This is marked as a test blocker and QE is waiting for the fix.

Regards,
Vivek Das

--- Additional comment from Neha Ojha on 2023-05-03 14:27:38 UTC ---

(In reply to Vivek Das from comment #46)
> Any update on this bug?

The main PR https://github.com/ceph/ceph/pull/49837 is going through upstream reviews and teuthology testing. After it gets merged, we'll merge the quincy backport https://github.com/ceph/ceph/pull/50549 and cherry-pick it to downstream for 6.1.

--- Additional comment from Adam Kupczyk on 2023-05-16 16:18:46 UTC ---

The issue is solved by pulling the contents of https://github.com/ceph/ceph/pull/51451 into https://gitlab.cee.redhat.com/ceph/ceph/-/commits/ceph-6.1-rhel-patches AND adding a single commit that enables the feature: "#define WITH_ESB".

--- Additional comment from Ken Dreyer (Red Hat) on 2023-05-16 22:01:43 UTC ---

This means that it's enabled in all our downstream builds now?

--- Additional comment from Ken Dreyer (Red Hat) on 2023-05-16 22:04:07 UTC ---

For the record: this went into dist-git as https://pkgs.devel.redhat.com/cgit/rpms/ceph/commit/?h=ceph-6.1-rhel-9&id=ba5fccbc3b5dbc885e7fb629f3fa815d127d113c (unfortunately lacking "Resolves: rhbz#2119217" lines).

--- Additional comment from Neha Ojha on 2023-05-16 23:26:56 UTC ---

(In reply to Ken Dreyer (Red Hat) from comment #49)
> This means that it's enabled in all our downstream builds now?

Hi Ken,

The commits have been pushed with the ESB feature turned on by default for the time being in order to
1. unblock ODF 4.13 testing
2. get additional testing on it from RHCS/IBM Ceph QE.

Moving the BZ to POST to reflect this. As discussed yesterday, given the risk involved with this feature, we don't want to ship RHCS/IBM Ceph with this feature on for 6.1. We still need a way to turn the compile-time flag (WITH_ESB) off for these builds and just keep it on for ODF where this feature is a must.
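For reference, using the option and admin socket command described in comment #41, turning the feature on would look roughly like the following. This is only a sketch: it assumes the option is exposed through the normal ceph config mechanism and that the admin socket command is present in the build being tested.

Greenfield (set before OSDs are created), for example:
# ceph config set osd bluestore_reuse_shared_blob true

Existing OSD, one-way OFF -> ON transition at runtime, per OSD, for example:
# ceph daemon osd.0 bluestore enable_shared_blob_reuse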
--- Additional comment from errata-xmlrpc on 2023-05-18 05:56:22 UTC ---

This bug has been added to advisory RHBA-2023:112314 by Thomas Serlin (tserlin)

--- Additional comment from errata-xmlrpc on 2023-05-18 05:56:23 UTC ---

Bug report changed to ON_QA status by Errata System. A QE request has been submitted for advisory RHBA-2023:112314-01 https://errata.devel.redhat.com/advisory/112314

--- Additional comment from Neha Ojha on 2023-05-22 17:58:38 UTC ---

Based on the discussion in the RHCS release and RDR dependencies meeting this morning, the commits that address this BZ have been reverted from ceph-6.1-rhel-patches. Thomas has created https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.1-rhel-9-test-bz2119217-patches to continue testing of these patches beyond the scope of 6.1. We'll cherry-pick the commits to the new branch.

--- Additional comment from Adam Kupczyk on 2023-05-22 18:16:32 UTC ---

Just pushed 38 commits, 14f4997518c234ad30f9442e2f14b961e349def4 ... 7da3e6ae59de2dacd4d7dc88c7421d9016259fea, to https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.1-rhel-9-test-bz2119217-patches. These commits form the full Elastic Shared Blob feature. They were recently reverted to remove the Elastic Shared Blob feature from ceph-6.1-rhel-patches.

--- Additional comment from Vikhyat Umrao on 2023-05-22 18:30:48 UTC ---

As discussed in today's mtg and also in comment#54 and comment#55, moving this one out of 6.1!

--- Additional comment from tserlin on 2023-05-23 04:55:03 UTC ---

(In reply to Vikhyat Umrao from comment #56)
> As discussed in today's mtg and also in comment#54 and comment#55, moving this one out of 6.1!

Will drop this from the 6.1 errata advisory as well.

Thomas

--- Additional comment from errata-xmlrpc on 2023-05-23 04:55:51 UTC ---

This bug has been dropped from advisory RHBA-2023:112314 by Thomas Serlin (tserlin)

--- Additional comment from tserlin on 2023-05-23 05:08:56 UTC ---

QE can use the following testfix build for testing this BZ:

* rhceph-container-6-164.0.TEST.bz2119217
* Pull from: registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-164.0.TEST.bz2119217
* Brew link for container: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2515859
* Ceph build in container: ceph-17.2.6-58.0.TEST.bz2119217.el9cp

Based on this -patches branch (7da3e6ae59de2dacd4d7dc88c7421d9016259fea):
https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.1-rhel-9-test-bz2119217-patches

Thomas

--- Additional comment from Vikhyat Umrao on 2023-05-24 15:22:30 UTC ---

(In reply to tserlin from comment #57)
> Will drop this from the 6.1 errata advisory as well.

Thank you, Thomas.

--- Additional comment from Neha Ojha on 2023-06-01 20:22:28 UTC ---

Plan of action as per discussion with all custodians:

We'll add a patch on top of https://bugzilla.redhat.com/show_bug.cgi?id=2119217#c55 to add bluestore-rdr as an osd_objectstore runtime option, which will invoke all the BlueStore changes needed for RDR. Since osd_objectstore is set at the time of mkfs, this implies:

1. Greenfield clusters: new clusters enabling RDR will need to set osd_objectstore=bluestore-rdr at install time. We need a separate BZ to track the work needed in the OCS operator to take user input about RDR at install time, which can then be passed to rook to set osd_objectstore appropriately.
2. Brownfield clusters: upgrades will involve migration of one OSD at a time to osd_objectstore=bluestore-rdr. We need to implement something very similar to the Filestore to BlueStore migration playbook in rook, which would do the migration one by one with all the necessary flags set. Ideally the migration should be performed during a maintenance window. We possibly need another BZ to track this work.

--- Additional comment from tserlin on 2023-06-10 04:48:56 UTC ---

The testfix in comment #59 was x86_64 only, and Boris Ranto asked for a multi-arch (x86_64, ppc64le, s390x) testfix for ODF. I rebuilt the testfix based on the RHCS 6.1 Release Candidate version (and likely GA), ceph-17.2.6-70. The previous testfix -patches branch rebased cleanly on top of the current ceph-6.1-rhel-patches. Details:

* rhceph-container-6-177.0.TEST.bz2119217
* Pull from: registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-177.0.TEST.bz2119217
* Brew link for container: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2543192
* Ceph build in container: ceph-17.2.6-70.0.TEST.bz2119217.el9cp

Based on this -patches branch (6d74fefa15d1216867d1d112b47bb83c4913d28f):
https://gitlab.cee.redhat.com/ceph/ceph/-/commits/private-tserlin-ceph-6.1-rhel-9-test-bz2119217-70-patches

Thomas
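For QE reference, pulling and upgrading to this testfix image follows the same cephadm procedure Elvir used earlier in this bug; a minimal sketch (assuming the hosts can reach registry-proxy.engineering.redhat.com):

# podman pull registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-177.0.TEST.bz2119217
# ceph orch upgrade start --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:6-177.0.TEST.bz2119217
# ceph orch upgrade status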
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7780
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days