2279928 – [rook][GSS][ODF 4.14.6] Notify/alert end user if legacy OSDs are LVM backed

Bug 2279928 - [rook][GSS][ODF 4.14.6] Notify/alert end user if legacy OSDs are LVM backed

Summary: [rook][GSS][ODF 4.14.6] Notify/alert end user if legacy OSDs are LVM backed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph-monitoring
Sub Component:
Version:	4.14
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	ODF 4.16.0
Assignee:	Divyansh Kamboj
QA Contact:	Vishakha Kathole
Docs Contact:
URL:
Whiteboard:
Depends On:	2273398 2274757 2276532 2276533
Blocks:	2260844 2274657
TreeView+	depends on / blocked

Reported:	2024-05-09 22:48 UTC by Travis Nielsen
Modified:	2024-10-07 17:28 UTC (History)
CC List:	24 users (show)
Fixed In Version:	4.16.0-118
Doc Type:	Bug Fix
Doc Text:	.Alert when there are LVM backed legacy OSDs during upgrade Previously, when OpenShift Data Foundation with legacy OSDs was upgraded from version 4.12 to 4.14, it was noticed that all the OSDs were stuck in a crash loop and down. This led to potential data unavailability and service disruption. With this fix, a check is included to detect legacy OSDs based on local volume manager (LVM) and to alert if such OSDs are present during the upgrade process. As a result, a warning is displayed during upgrade to indicate about the legacy OSDs so that appropriate actions can be taken.
Clone Of:	2274757
Environment:
Last Closed:	2024-07-17 13:22:14 UTC
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage ocs-operator pull 2612	None	open	exporter: add metric and alert if a system has OSD based on LVM	2024-05-16 09:31:35 UTC
Github	red-hat-storage ocs-operator pull 2636	None	open	Bug 2279928:[release-4.16] exporter: add metric and alert if a system has OSD based on LVM	2024-05-30 07:13:52 UTC
Github	rook rook pull 14187	None	open	osd: Add cephcluster status for legacy osds to replace	2024-05-09 22:48:53 UTC
Red Hat Product Errata	RHSA-2024:4591	None	None	None	2024-07-17 13:22:27 UTC

Description Travis Nielsen 2024-05-09 22:48:53 UTC

+++ This bug was initially created as a clone of Bug #2274757 +++

+++ This bug was initially created as a clone of Bug #2273398 +++

Description of problem (please be detailed as possible and provide log
snippests):

Customer upgraded from 4.12.47 to 4.14.16 we have noticed that all OSDs are in crahs loop with the expand-bluefs container showing errors about devices that can not be found.



Version of all relevant components (if applicable):
ODF 4.14.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, all osds are down 

Is there any workaround available to the best of your knowledge?
No


Can this issue reproducible?
yes, at customer environment

--- Additional comment from RHEL Program Management on 2024-04-04 15:14:00 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.16.0' to '?', and so is being proposed to be fixed at the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-04-04 15:14:00 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from RHEL Program Management on 2024-04-04 15:14:00 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

--- Additional comment from Manjunatha on 2024-04-04 15:15:20 UTC ---

+ Attached osd.4 pod description here 

+ ceph osd pods are in crashloop state with below messages. 
Error: ceph-username is required for osd
rook error: ceph-username is required for osd
Usage:
  rook ceph osd init [flags]

Flags:
      --cluster-id string                 the UID of the cluster CR that owns this cluster
      --cluster-name string               the name of the cluster CR that owns this cluster
      --encrypted-device                  whether to encrypt the OSD with dmcrypt
  -h, --help                              help for init
      --is-device                         whether the osd is a device
      --location string                   location of this node for CRUSH placement
      --node-name string                  the host name of the node (default "rook-ceph-osd-1-6db9cfc7c9-294jn")
      --osd-crush-device-class string     The device class for all OSDs configured on this node
      --osd-crush-initial-weight string   The initial weight of OSD in TiB units
      --osd-database-size int             default size (MB) for OSD database (bluestore)
      --osd-id int                        osd id for which to generate config (default -1)
      --osd-store-type string             the osd store type such as bluestore (default "bluestore")
      --osd-wal-size int                  default size (MB) for OSD write ahead log (WAL) (bluestore) (default 576)
      --osds-per-device int               the number of OSDs per device (default 1)

Global Flags:
      --log-level string   logging level for logging/tracing output (valid values: ERROR,WARNING,INFO,DEBUG) (default "INFO")

'/usr/local/bin/rook' -> '/rook/rook'
+ PVC_SOURCE=/ocs-deviceset-0-1-78k4w
+ PVC_DEST=/mnt/ocs-deviceset-0-1-78k4w
+ CP_ARGS=(--archive --dereference --verbose)
+ '[' -b /mnt/ocs-deviceset-0-1-78k4w ']'
++ stat --format %t%T /ocs-deviceset-0-1-78k4w
+ PVC_SOURCE_MAJ_MIN=8e0
++ stat --format %t%T /mnt/ocs-deviceset-0-1-78k4w
+ PVC_DEST_MAJ_MIN=8e0
+ [[ 8e0 == \8\e\0 ]]
+ echo 'PVC /mnt/ocs-deviceset-0-1-78k4w already exists and has the same major and minor as /ocs-deviceset-0-1-78k4w: 8e0'
PVC /mnt/ocs-deviceset-0-1-78k4w already exists and has the same major and minor as /ocs-deviceset-0-1-78k4w: 8e0
+ exit 0
inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
2024-04-04T13:22:38.461+0000 7f41cddbf900 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (2) No such file or directory

--- Additional comment from Manjunatha on 2024-04-04 15:15:49 UTC ---



--- Additional comment from Manjunatha on 2024-04-04 15:20:40 UTC ---

odf mustgather logs in below path 
path /cases/03783266/0020-must-gather-odf.tar.gz/must-gather.local.8213025456072446876/inspect.local.6887047306785235156/namespaces/openshift-storage

This issue looks similar to bz https://bugzilla.redhat.com/show_bug.cgi?id=2254378
---------------------------
Events from osd-0 deployment:

Events:
  Type     Reason                 Age                   From               Message
  ----     ------                 ----                  ----               -------
  Normal   Scheduled              23m                   default-scheduler  Successfully assigned openshift-storage/rook-ceph-osd-0-6764d4c675-f9w2m to storage-00.dev-intranet-01-wob.ocp.vwgroup.com
  Normal   SuccessfulMountVolume  23m                   kubelet            MapVolume.MapPodDevice succeeded for volume "local-pv-abfd62bb" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-abfd62bb"
  Normal   SuccessfulMountVolume  23m                   kubelet            MapVolume.MapPodDevice succeeded for volume "local-pv-abfd62bb" volumeMapPath "/var/lib/kubelet/pods/2ae21d6a-2aa5-43e8-8c0d-7ecb250656a2/volumeDevices/kubernetes.io~local-volume"
  Normal   AddedInterface         23m                   multus             Add eth0 [100.72.0.27/23] from ovn-kubernetes
  Normal   Pulling                23m                   kubelet            Pulling image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc"
  Normal   Pulled                 23m                   kubelet            Successfully pulled image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc" in 14.120104593s (14.120122047s including waiting)
  Normal   Created                23m                   kubelet            Created container config-init
  Normal   Started                23m                   kubelet            Started container config-init
  Normal   Pulled                 23m                   kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc" already present on machine
  Normal   Created                23m                   kubelet            Created container copy-bins
  Normal   Started                23m                   kubelet            Started container copy-bins
  Normal   Pulled                 23m                   kubelet            Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226" already present on machine
  Normal   Created                23m                   kubelet            Created container blkdevmapper
  Normal   Started                23m                   kubelet            Started container blkdevmapper
  Normal   Pulled                 23m (x3 over 23m)     kubelet            Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226" already present on machine
  Normal   Created                23m (x3 over 23m)     kubelet            Created container expand-bluefs
  Normal   Started                23m (x3 over 23m)     kubelet            Started container expand-bluefs
  Warning  BackOff                3m43s (x94 over 23m)  kubelet            Back-off restarting failed container expand-bluefs in pod rook-ceph-osd-0-6764d4c675-f9w2m_openshift-storage(2ae21d6a-2aa5-43e8-8c0d-7ecb250656a2)

--- Additional comment from Andreas Bleischwitz on 2024-04-04 15:29:15 UTC ---

TAM update about the business impact:

All developers on this platform are being blocked due to this issue.
VW also has a lot of VMs running on this bare-metal cluster on which they do all the testing and simulation of car components.

This is now all down and they are not able to work.

--- Additional comment from Manjunatha on 2024-04-04 15:46:36 UTC ---

Customer rebooted the osd nodes few times as suggested below solution but that dint help 
https://access.redhat.com/solutions/7015095

--- Additional comment from Andreas Bleischwitz on 2024-04-04 15:53:31 UTC ---

The customer just informed us that this is a very "old" cluster starting with 4.3.18 and ODF was installed about 3.5 years ago. So there may be a lot of tweaks/leftovers/* in this cluster.

--- Additional comment from Manjunatha on 2024-04-04 16:32:07 UTC ---

Latest mustgather and sosreport from storage node 0 in below supportshell 
path: /cases/03783266

cluster:
    id:     18c9800f-7f91-4994-ad32-2a8a330babd6
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            7 osds down
            2 hosts (10 osds) down
            2 racks (10 osds) down
            Reduced data availability: 505 pgs inactive
 
  services:
    mon: 3 daemons, quorum b,f,g (age 4h)
    mgr: a(active, since 4h)
    mds: 1/1 daemons up, 1 standby
    osd: 15 osds: 4 up (since 23h), 11 in (since 23h); 5 remapped pgs
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 505 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             505 unknown
 
ID   CLASS  WEIGHT    TYPE NAME                                                    STATUS  REWEIGHT  PRI-AFF
 -1         26.19896  root default                                                                          
 -4          8.73299      rack rack0                                                                        
 -3          8.73299          host storage-00-dev-intranet-01-wob-ocp-vwgroup-com                           
  0    ssd   1.74660              osd.0                                              down   1.00000  1.00000
  1    ssd   1.74660              osd.1                                              down   1.00000  1.00000
  2    ssd   1.74660              osd.2                                              down   1.00000  1.00000
  3    ssd   1.74660              osd.3                                              down   1.00000  1.00000
  4    ssd   1.74660              osd.4                                              down   1.00000  1.00000
-12          8.73299      rack rack1                                                                        
-11          8.73299          host storage-01-dev-intranet-01-wob-ocp-vwgroup-com                           
  6    ssd   1.74660              osd.6                                                up   1.00000  1.00000
  7    ssd   1.74660              osd.7                                                up   1.00000  1.00000
  8    ssd   1.74660              osd.8                                                up   1.00000  1.00000
  9    ssd   1.74660              osd.9                                                up   1.00000  1.00000
 10    ssd   1.74660              osd.10                                             down         0  1.00000
 -8          8.73299      rack rack2                                                                        
 -7          8.73299          host storage-02-dev-intranet-01-wob-ocp-vwgroup-com                           
  5    ssd   1.74660              osd.5                                              down   1.00000  1.00000
 11    ssd   1.74660              osd.11                                             down         0  1.00000
 12    ssd   1.74660              osd.12                                             down         0  1.00000
 13    ssd   1.74660              osd.13                                             down         0  1.00000
 14    ssd   1.74660              osd.14                                             down   1.00000  1.00000
...

--- Additional comment from  on 2024-04-04 18:29:13 UTC ---

Seems like the backing device was removed or moved?:

`````
PVC /mnt/ocs-deviceset-0-1-78k4w already exists and has the same major and minor as /ocs-deviceset-0-1-78k4w: 8e0
+ exit 0
inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
`````

// confirm the device for osd-1
$  omc get pods rook-ceph-osd-1-6db9cfc7c9-294jn -o yaml|grep device
    ceph.rook.io/DeviceSet: ocs-deviceset-0
    ceph.rook.io/pvc: ocs-deviceset-0-1-78k4w
    device-class: ssd
      name: devices
      name: ocs-deviceset-0-1-78k4w-bridge
    - "\nset -xe\n\nPVC_SOURCE=/ocs-deviceset-0-1-78k4w\nPVC_DEST=/mnt/ocs-deviceset-0-1-78k4w\nCP_ARGS=(--archive
    - devicePath: /ocs-deviceset-0-1-78k4w
      name: ocs-deviceset-0-1-78k4w
      name: ocs-deviceset-0-1-78k4w-bridge
      name: ocs-deviceset-0-1-78k4w-bridge
      name: devices
      name: ocs-deviceset-0-1-78k4w-bridge
    name: devices
  - name: ocs-deviceset-0-1-78k4w
      claimName: ocs-deviceset-0-1-78k4w
      path: /var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w
    name: ocs-deviceset-0-1-78k4w-bridge

$  omc get pvc ocs-deviceset-0-1-78k4w
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS     AGE
ocs-deviceset-0-1-78k4w   Bound    local-pv-32532e89   1788Gi     RWO            ocs-localblock   3y

$  omc get pv local-pv-32532e89 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: local-volume-provisioner-storage-00.dev-intranet-01-wob.ocp.vwgroup.com-da4c2721-f73c-4626-8c98-7ff9f07f3212
  creationTimestamp: "2020-09-09T14:52:54Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    storage.openshift.com/local-volume-owner-name: ocs-blkvol-storage-00
    storage.openshift.com/local-volume-owner-namespace: local-storage
  name: local-pv-32532e89
  resourceVersion: "194139688"
  uid: fdcb6fab-0a53-49ca-bdb1-6e807e969eb7
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1788Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: ocs-deviceset-0-1-78k4w
    namespace: openshift-storage
    resourceVersion: "194139403"
    uid: 8acac407-b475-46ed-9e49-29b377b80137
  local:
    path: /mnt/local-storage/ocs-localblock/sdr
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - storage-00.dev-intranet-01-wob.ocp.vwgroup.com
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ocs-localblock
  volumeMode: Block
status:
  phase: Bound

For this osd at least, they are using device paths and should be using devices by-id/uuid so the device names never change
~~~
  local:
    path: /mnt/local-storage/ocs-localblock/sdr
~~~

Asking for some more data in the case like:

~~~
$ ls -l /mnt/local-storage/ocs-localblock/

I'd also like to gather the following from LSO:

// namespace might be local-storage
$ oc get localvolume -o yaml -n openshift-local-storage
$ oc get localvolumeset -o yaml -n openshift-local-storage
$ oc get localvolumediscovery -o yaml -n openshift-local-storage
~~~

I have a very strong suspicion the kernel picked up the devices in another order and the osds cannot find their backing device. This was caused by the EUS upgrade of OCP, as it does a rollout of MCPs that will reboot the nodes.

--- Additional comment from  on 2024-04-04 21:45:31 UTC ---

Hi,

My suspicion was wrong

// sym links for devices on storage-00 from lso:
[acmdy78@bastion ~]$ oc debug -q node/storage-00.dev-intranet-01-wob.ocp.vwgroup.com -- chroot /host ls -l /mnt/local-storage/ocs-localblock/
total 0
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sdp -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02507
lrwxrwxrwx. 1 root root 93 Apr  4 14:34 sdq -> /dev/ceph-e936c994-328c-4f59-8f1d-3a5573a7c64b/osd-block-aaced0de-8884-4551-a5ae-dd86ee436f23
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sdr -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sds -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02497
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sdt -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02489


They have some weird LSO config where each node has its own spec section with its devices listed. Anyways, they have the proper device defined in their LSO configs for the node
~~~
  spec:
    logLevel: Normal
    managementState: Managed
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - storage-00.dev-intranet-01-wob.ocp.vwgroup.com
    storageClassDevices:
    - devicePaths:
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02489
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02497
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02504
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02507
      storageClassName: ocs-localblock
~~~

and the symlink that lso knows about for sdr (ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490) points to the correct device. Seems the device names didn't change, and they are using by-id. Taking a step back... since the failure is with the expand-bluefs container:

~~~
  - containerID: cri-o://c4163c5dbd33cab921c113b80350ec20a3af48a2865f7ea43c68f4cdd61afc19
    image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    imageID: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:3d7144d5fe515acf3bf4bbf6456ab8877a4f7cd553c933ca6fd4d891add53038
    lastState:
      terminated:
        containerID: cri-o://c4163c5dbd33cab921c113b80350ec20a3af48a2865f7ea43c68f4cdd61afc19
        exitCode: 1
        finishedAt: "2024-04-04T15:10:29Z"
        reason: Error
        startedAt: "2024-04-04T15:10:29Z"
    name: expand-bluefs
    ready: false
    restartCount: 37
    state:
      waiting:
        message: back-off 5m0s restarting failed container=expand-bluefs pod=rook-ceph-osd-1-6db9cfc7c9-294jn_openshift-storage(4cab01f5-438d-4ffc-a133-cd427bb1cda5)
        reason: CrashLoopBackOff
~~~

because it cannot find its block device:
~~~
$  omc logs rook-ceph-osd-1-6db9cfc7c9-294jn -c expand-bluefs
2024-04-04T15:10:29.457626768Z inferring bluefs devices from bluestore path
2024-04-04T15:10:29.457728034Z unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
2024-04-04T15:10:29.457728034Z 2024-04-04T15:10:29.456+0000 7fdba942e900 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (2) No such file or directory
~~~

what if we removed this expand-bluefs container? Maybe the osd will start? If not, is the only option to replace the osd(s)? We do have 1 node up, so we should still hopefully have some good copies? What if setting replica 1 on the pools?

--- Additional comment from Andreas Bleischwitz on 2024-04-05 07:58:58 UTC ---

Hi,

can we have at least an update that this issue is being investigated by engineering? The customer is now suffering from that outage which affects basically their complete development environment (they are developers, and therefore this cluster is their production environment) since about one day.
We currently do not have any idea how to re-enable the OSDs so that they would be able to work again.

Customer:
Volkswagen AG (#556879)


@muagarwa, @gsternag are you able to assist here?

Best regards,
/Andreas

--- Additional comment from Bipin Kunal on 2024-04-05 11:36:28 UTC ---

(In reply to Andreas Bleischwitz from comment #13)
> Hi,
> 
> can we have at least an update that this issue is being investigated by
> engineering? The customer is now suffering from that outage which affects
> basically their complete development environment (they are developers, and
> therefore this cluster is their production environment) since about one day.
> We currently do not have any idea how to re-enable the OSDs so that they
> would be able to work again.
> 
> Customer:
> Volkswagen AG (#556879)
> 
> 
> @muagarwa, @gsternag are you able to assist here?
> 
> Best regards,
> /Andreas

Hi Andreas, Thanks for reaching out to me. I am trying to reach out to engineering team. Meanwhile, it will good to have prio-list email if this is really urgent. Removing the needinfo on Mudit.

--- Additional comment from Radoslaw Zarzynski on 2024-04-05 11:58:00 UTC ---

On it.

--- Additional comment from  on 2024-04-05 13:52:56 UTC ---

Hello,

@bkunal found the KCS https://access.redhat.com/solutions/7026462 I'm going to confirm this is the same for this case. Will post findings when I have any.

--- Additional comment from Bipin Kunal on 2024-04-05 13:57:53 UTC ---

(In reply to kelwhite from comment #18)
> Hello,
> 
> @bkunal found the KCS https://access.redhat.com/solutions/7026462 I'm going
> to confirm this is the same for this case. Will post findings when I have
> any.

Actually Shubham from the Rook team found it and gave it to me.

--- Additional comment from  on 2024-04-05 15:41:15 UTC ---

From the customer: 


// for osd-1:
~~~~
osd-1 is not using /dev/sdr, but i figured it out see this path:

[acmdy78@bastion ~]$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-1 | grep ceph.rook.io/pvc
    ceph.rook.io/pvc: ocs-deviceset-0-1-78k4w
        ceph.rook.io/pvc: ocs-deviceset-0-1-78k4w
          - key: ceph.rook.io/pvc
[acmdy78@bastion ~]$ oc get pvc ocs-deviceset-0-1-78k4w
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS     AGE
ocs-deviceset-0-1-78k4w   Bound    local-pv-32532e89   1788Gi     RWO            ocs-localblock   3y208d
[acmdy78@bastion ~]$ oc get pv local-pv-32532e89 -o custom-columns=NAME:.metadata.name,PATH:.spec.local.path,NODE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values
NAME                PATH                                    NODE
local-pv-32532e89   /mnt/local-storage/ocs-localblock/sdr   [storage-00.dev-intranet-01-wob.ocp.vwgroup.com]
[acmdy78@bastion ~]$ oc debug -q node/storage-00.dev-intranet-01-wob.ocp.vwgroup.com                                                                                        sh-4.4# chroot /host
sh-5.1# ls -lah /mnt/local-storage/ocs-localblock/sdr
lrwxrwxrwx. 1 root root 50 Sep  9  2020 /mnt/local-storage/ocs-localblock/sdr -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
sh-5.1# ls -lah /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
lrwxrwxrwx. 1 root root 9 Apr  5 14:58 /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490 -> ../../sdo
sh-5.1# ls -lah /dev/sdo
brw-rw----. 1 root disk 8, 224 Apr  5 14:58 /dev/sdo
sh-5.1# lsblk /dev/sdo
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdo                                                                                                     8:224  0  1.7T  0 disk
`-ceph--f557e476--7bd4--41a0--9323--d6061a4318b3-osd--block--7f80e2ac--e21f--4aa6--8886--ec94d0387196 253:5    0  1.7T  0 lvm
sh-5.1# head --bytes=60 /dev/sdo
sh-5.1# ls -lah /dev/ceph-
ceph-309180b2-697b-473a-a19c-d00cec94427a/ ceph-b100dea8-0b24-4d9b-97c8-ed6dba1bd10d/ ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/
ceph-3c187f41-d43e-4bb2-9421-97f78db94d28/ ceph-e936c994-328c-4f59-8f1d-3a5573a7c64b/
sh-5.1# ls -lah /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196
lrwxrwxrwx. 1 root root 7 Apr  4 10:54 /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196 -> ../dm-5
sh-5.1# ls -lah /dev/dm-5
brw-rw----. 1 root disk 253, 5 Apr  4 10:54 /dev/dm-5
sh-5.1# head --bytes=60 /dev/dm-5
bluestore block device
7f80e2ac-e21f-4aa6-8886-ec94d0387196

the bluestore block device id ov the logical volume is the same that you get for osd-1 in the ceph osd dump command.
Could the Problem be caused by this lvm layer? On clusters we installed later ODF i don't see that lvm is used. This cluster started with OCS 4.3 or 4.4
~~~~

--- Additional comment from  on 2024-04-05 17:12:25 UTC ---

Hi,

Update... We've found the block devices dont exist on the nodes (this is from storage-00):

~~~
/var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w:
total 0
drwxr-xr-x. 2 root root      6 Apr  3 14:50 ceph-1
brw-rw-rw-. 1 root disk 8, 224 Apr  4 10:57 ocs-deviceset-0-1-78k4w

/var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w/ceph-1:
total 0

/var/lib/rook/openshift-storage/ocs-deviceset-0-2-2p2fc:
total 0
drwxr-xr-x. 2 root root     6 Apr  3 14:50 ceph-2
brw-rw-rw-. 1 root disk 65, 0 Apr  4 10:57 ocs-deviceset-0-2-2p2fc

/var/lib/rook/openshift-storage/ocs-deviceset-0-2-2p2fc/ceph-2:
total 0

/var/lib/rook/openshift-storage/ocs-deviceset-0-3-lh2tq:
total 0
drwxr-xr-x. 2 root root      6 Apr  3 14:50 ceph-3
brw-rw-rw-. 1 root disk 65, 16 Apr  4 10:57 ocs-deviceset-0-3-lh2tq

/var/lib/rook/openshift-storage/ocs-deviceset-0-3-lh2tq/ceph-3:
total 0

/var/lib/rook/openshift-storage/ocs-deviceset-0-4-wfm22:
total 0
drwxr-xr-x. 2 root root     10 Apr  3 14:50 ceph-4
brw-rw-rw-. 1 root disk 253, 4 Apr  4 14:49 ocs-deviceset-0-4-wfm22

/var/lib/rook/openshift-storage/ocs-deviceset-0-4-wfm22/ceph-4:
total 0
~~~

We need to confirm why these are gone. The current ask from engineering is why did these devices vanish. Would rook do anything with this? Can we find anything that will help? 

We're confirming the devices are gone on the other nodes and starting the osd replacement processes via a remote call.

--- Additional comment from  on 2024-04-05 19:50:49 UTC ---

Hello All,

On a remote with the customer. We've confirmed no data loss, phew. Seems the issue is with ceph-volume, it's not activating the device. We tried to do this manually via the below and got osd-9 up and running:

~~~
- Creating a backup of the osd-9 deployment, we're going to remove the liveness probe
- scaled down the rook-ceph and ocs-operators
- oc edit the osd-9 deployment and searched for the expand-bluefs section and removed the container
- oc get pods to see if osd-9 came up (still 1/2) and rshed info the container
   - ceph-volume lvm list 
   - ceph-volume lvm active --no-systemd -- 9 79021ece-c52a-46d1-8e99-69640a926822 // this is the osd fsid from ceph-volume lvm list
   - The osd was activated and when we viewed the osd data dir, the block device was listed:
      - ls -l '/var/lib/ceph/osd/ceph-{id}
~~~

We're looking to get some ceph-volume logs to determine what's going on... Might need to create another BZ for ceph-volume, but we will know more once we review the fresh odf must-gather

--- Additional comment from Travis Nielsen on 2024-04-05 20:56:28 UTC ---

Great to see the OSDs can be brought back up with the workaround and there is no data loss. 

These old LVM-based OSDs that were created (IIRC only in 4.2 and 4.3) are going to be a problem to maintain. We simply don't have tests that upgrades from OSDs created from 10+ releases ago. For this configuration that has not been supported for so long, the way to keep supporting such an old cluster will be to replace each of the OSDs. By purging each OSD one-at-a-time and bringing up a new one, the OSDs can be in a current configuration. 

It would not surprise me that in 4.14 there could have been an update to ceph-volume that caused this issue, because we just haven't tested this configuration for so long.

Guillaume, agreed that old LVM-based OSDs should be replaced?

--- Additional comment from Prashant Dhange on 2024-04-05 21:18:53 UTC ---

Additional details for the completeness :

(In reply to kelwhite from comment #22)
> Hello All,
> 
> On a remote with the customer. We've confirmed no data loss, phew. Seems the
> issue is with ceph-volume, it's not activating the device. We tried to do
> this manually via the below and got osd-9 up and running:
> 
> ~~~
> - Creating a backup of the osd-9 deployment, we're going to remove the
> liveness probe
> - scaled down the rook-ceph and ocs-operators
> - oc edit the osd-9 deployment and searched for the expand-bluefs section
> and removed the container
> - oc get pods to see if osd-9 came up (still 1/2) and rshed info the
> container
>    - ceph-volume lvm list 
All LVs associated with ceph cluster are getting listed here and lsblk/lvs recognizing these LVs.  

>    - ceph-volume lvm active --no-systemd -- 9
> 79021ece-c52a-46d1-8e99-69640a926822 // this is the osd fsid from
> ceph-volume lvm list
>    - The osd was activated and when we viewed the osd data dir, the block
> device was listed:
>       - ls -l '/var/lib/ceph/osd/ceph-{id}
     - Start osd.9 
     # ceph-osd --id 9 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6  --setuser ceph --setgroup ceph --crush-location="root=default host=storage-01-dev-intranet-01-wob-ocp-vwgroup-com rack=rack1" --log-to-stderr=true --err-to-stderr=true
 --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
    NOTE : The OSD daemon will run in background and it's safe to exist the container here.

--- Additional comment from Prashant Dhange on 2024-04-05 22:14:32 UTC ---

The latest provided must-gather logs and ceph logs does not shed any light on failure to OSD directory priming or ceph-volume activating the OSD device.

The next action plan :
- Apply the workaround for every OSD on the cluster, refer comment#24
- Get all OSDs up/in and all PGs active+clean
- Re-deploy all OSDs one-by-one.

For the other clusters which might experience similar issues, the recommendations are to re-deploy all the OSDs then only go for cluster upgrade from 4.12.47 to 4.14.16.

Let me know if you need any help on recovering this cluster.

--- Additional comment from Prashant Dhange on 2024-04-05 22:58:31 UTC ---

(In reply to Prashant Dhange from comment #24)
...
>      - Start osd.9 
>      # ceph-osd --id 9 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6 
> --setuser ceph --setgroup ceph --crush-location="root=default
> host=storage-01-dev-intranet-01-wob-ocp-vwgroup-com rack=rack1"
> --log-to-stderr=true --err-to-stderr=true
>  --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug
> --default-log-to-file=false --default-mon-cluster-log-to-file=false
> --ms-learn-addr-from-peer=false
>     NOTE : The OSD daemon will run in background and it's safe to exist the
> container here.

In ceph-osd run command, change crush-location according to `ceph osd tree` output or copy it from osd deployment config (under spec.containers section). Do not forget to add double quotes for crush-location value.

e.g
# oc get deployment rook-ceph-osd-9 -o yaml
spec:
  affinity:
...
  containers:
  - args:
    - ceph
    - osd
    - start
    - --
    - --foreground
    - --id
    - "9"
    - --fsid
    - 18c9800f-7f91-4994-ad32-2a8a330babd6
    - --setuser
    - ceph
    - --setgroup
    - ceph
    - --crush-location=root=default host=storage-01-dev-intranet-01-wob-ocp-vwgroup-com rack=rack1

--- Additional comment from Rafrojas on 2024-04-06 07:19:41 UTC ---

Hi Prashant

  I joined the call with customer and we aplied the Workaround, we edited the deployment of each OSD and removed the expand-bluefs args from that, we have a backup of all the deployments if required. After that ceph started the recovery and finished after some time, a new must-gather is collected and available on the case, there's a WARN on ceph:

    health: HEALTH_WARN
            15 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

  I also requested to collect the /var/log/ceph ceph volume logs for the RCA, Donny will collect along the day.

  We accorded to wait until new data is checked until continue with next steps, we cannot confirm that application is working fine, because developers shifts are MON-FRI but we see that the cluster looks on a better shape, with all operators running.

Regards
Rafa

--- Additional comment from Rafrojas on 2024-04-06 08:59:04 UTC ---

Hi Prashant

  Ceph logs collected and attached to the case, waiting for your instructions for next steps

Regards
Rafa

--- Additional comment from Rafrojas on 2024-04-06 12:12:03 UTC ---

Hi Prashant

  CU is waiting for some feedback, they are running this cluster in abnormal state, NA will join the shift soon I'll add the handover on the case from last call and status, please let us know next steps to share with CU ASAP.

Regards
Rafa

--- Additional comment from Prashant Dhange on 2024-04-07 02:55:59 UTC ---

Hi Rafa,

(In reply to Rafrojas from comment #27)
> Hi Prashant
> 
>   I joined the call with customer and we aplied the Workaround, we edited
> the deployment of each OSD and removed the expand-bluefs args from that, we
> have a backup of all the deployments if required.
Good to know that all OSDs are up and running after applying the workaround. 

There is a quick way to patch the OSD deployment to remove bluefs-expand init container using oc patch command :
# oc patch deployment rook-ceph-osd-<osdid> --type=json -p='[{"op": "remove", "path": "/spec/template/spec/initContainers/3"}]'

> After that ceph started
> the recovery and finished after some time, a new must-gather is collected
> and available on the case, there's a WARN on ceph:
> 
>     health: HEALTH_WARN
>             15 OSD(s) reporting legacy (not per-pool) BlueStore omap usage
> stats
This warning is because OSDs were created pre-octopus release. This warning will be addressed as we are re-deploying the OSDs. If we were not planning to re-deploy the OSDs then you need to set `ceph config rm osd bluestore_fsck_quick_fix_on_mount` and restart the OSDs. Refer KCS solution https://access.redhat.com/solutions/7041554 for more details.

> 
>   I also requested to collect the /var/log/ceph ceph volume logs for the
> RCA, Donny will collect along the day.
The latest logs have been analyzed and Guillaume able to find the RCA for the issue. The RCA has been provided in BZ-2273724#c3 comment.

--------------------------------------------------------------------------------------------------------------------------------------------------


NOTES update(attached files to BZ)

Hi,

i will upload the latest Version of my Notes and a detailed output of commnds i ran to manually remove the local-storage lvm-Volumes, for your reference.
I have seen in the diskmaker-manager-Pods (local-storage Operator) that it had Problem removing the lvm-Disks, thus making the prcedure neccasary to remove the VolumeGroups and logicalVolumes manually. I include the logs from the diskmaker-manager so you can have a look if there is a bug in local-storage-Operator, about deleting lvm ocs-localblock Volumes.


Regards Donny

--- Additional comment from Prashant Dhange on 2024-04-09 19:53:14 UTC ---

(In reply to Prashant Dhange from comment #30)
> Hi Rafa,
...
> > After that ceph started
> > the recovery and finished after some time, a new must-gather is collected
> > and available on the case, there's a WARN on ceph:
> > 
> >     health: HEALTH_WARN
> >             15 OSD(s) reporting legacy (not per-pool) BlueStore omap usage
> > stats
> This warning is because OSDs were created pre-octopus release. This warning
> will be addressed as we are re-deploying the OSDs. If we were not planning
> to re-deploy the OSDs then you need to set `ceph config rm osd
> bluestore_fsck_quick_fix_on_mount` and restart the OSDs. 

Correction. Meant to say :

This warning is because OSDs were created pre-octopus release. This warning
will be addressed as we are re-deploying the OSDs. If we were not planning
to re-deploy the OSDs then you need to set 
`ceph config set osd bluestore_fsck_quick_fix_on_mount true`, restart the OSDs
and then `ceph config rm osd bluestore_fsck_quick_fix_on_mount`. 

> Refer KCS solution
> https://access.redhat.com/solutions/7041554 for more details.

--- Additional comment from Prashant Dhange on 2024-04-09 21:11:36 UTC ---

We are still getting more details about the ODF upgrade history from the customer.

Based on available data, here are the steps to reproduce this issue :
- Deploy 4.3.18 cluster with LVM based OSDs
- Start upgrading to ODF 4.4 then to every major release till 4.13.7 e.g from 4.4 to 4.5 to 4.6 so on.
- Verify that ODF cluster is healthy and we are not observing any daemon crash (specifically OSDs)
- Upgrade from 4.13.7 to 4.14.16
- Observe the OSDs are stuck in CLBO state

--- Additional comment from Prashant Dhange on 2024-04-09 23:11:46 UTC ---

Okay. The issue is not related to ceph-volume at all. The problem was OSDs were deployed on OCS 4.3 cluster, so the deployment config has different initContainers compared to ODF versions (probably 4.9 or later).

Init containers sequence for 4.3 deployment config (refer point [2] below) :
Container-1 : ## Init Container 1 : rook ceph osd init  
Container-2 : ## Init Container 2 : Copy rook command to OSD pod
Container-3 : ## Init Container 3 : expand-bluefs
Container-4 : ## Init Container 4 : chown ceph directories

then the actual osd container starts, which executes the "ceph osd start" script which internally calls ceph-volume lvm activate then ceph-osd command. 
Container-5: ceph osd start (refer points [1] and [3] below) 

When the customer upgraded to 4.14.16, the "rook ceph osd init" container failed to mount the osd data directory. Due to this expand-bluefs container failed to start and exited with "_read_bdev_label failed to open /var/lib/ceph/osd/ceph-<osdid>/block: (2) No such file or directory" error.

When we removed expand-bluefs init container as a workaround, the ceph osd started successfully as Container-5 (ceph osd start) was able to execute the lvm activate and start ceph-osd daemon. When I was on the remote session for the first time, we were able to start (after removing expand-bluefs init container) osd.9 manually after executing the lvm activate command then the ceph-osd command.

More details :
[1] ceph osd container :
  containers:
  - args:
    - ceph
    - osd
    - start
    - --
    - --foreground
    - --id
    - "1"
    - --fsid
    - 18c9800f-7f91-4994-ad32-2a8a330babd6
    - --setuser
    - ceph
    - --setgroup
    - ceph
    - --crush-location=root=default host=storage-00-dev-intranet-01-wob-ocp-vwgroup-com
      rack=rack0
    - --osd-op-num-threads-per-shard=2
    - --osd-op-num-shards=8
    - --osd-recovery-sleep=0
    - --osd-snap-trim-sleep=0
    - --osd-delete-sleep=0
    - --bluestore-min-alloc-size=4096
    - --bluestore-prefer-deferred-size=0
    - --bluestore-compression-min-blob-size=8192
    - --bluestore-compression-max-blob-size=65536
    - --bluestore-max-blob-size=65536
    - --bluestore-cache-size=3221225472
    - --bluestore-throttle-cost-per-io=4000
    - --bluestore-deferred-batch-ops=16
    - --default-log-to-stderr=true
    - --default-err-to-stderr=true
    - --default-mon-cluster-log-to-stderr=true
    - '--default-log-stderr-prefix=debug '
    - --default-log-to-file=false
    - --default-mon-cluster-log-to-file=false
    - --ms-learn-addr-from-peer=false
    command:
    - bash
    - -x
    - -c
    - "\nset -o nounset # fail if variables are unset\nchild_pid=\"\"\nsigterm_received=false\nfunction
      sigterm() {\n\techo \"SIGTERM received\"\n\tsigterm_received=true\n\tkill -TERM
      \"$child_pid\"\n}\ntrap sigterm SIGTERM\n\"${@}\" &\n# un-fixable race condition:
      if receive sigterm here, it won't be sent to child process\nchild_pid=\"$!\"\nwait
      \"$child_pid\" # wait returns the same return code of child process when called
      with argument\nwait \"$child_pid\" # first wait returns immediately upon SIGTERM,
      so wait again for child to actually stop; this is a noop if child exited normally\nceph_osd_rc=$?\nif
      [ $ceph_osd_rc -eq 0 ] && ! $sigterm_received; then\n\ttouch /tmp/osd-sleep\n\techo
      \"OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will
      sleep for $ROOK_OSD_RESTART_INTERVAL hours. Restart the pod manually once the
      flapping issue is fixed\"\n\tsleep \"$ROOK_OSD_RESTART_INTERVAL\"h &\n\tchild_pid=\"$!\"\n\twait
      \"$child_pid\"\n\twait \"$child_pid\" # wait again for sleep to stop\nfi\nexit
      $ceph_osd_rc\n"
    - --
    - /rook/rook

[2] initContainers:
## Init Container 1 : rook ceph osd init  
  - args:
    - ceph
    - osd
    - init
    env:
    - name: ROOK_NODE_NAME
      value: storage-00.dev-intranet-01-wob.ocp.vwgroup.com
    - name: ROOK_CLUSTER_ID
      value: aaba77cf-8f28-437d-b88f-36dcafc3a865
    - name: ROOK_CLUSTER_NAME
      value: ocs-storagecluster-cephcluster
    - name: ROOK_PRIVATE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: ROOK_PUBLIC_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: POD_NAMESPACE
      value: openshift-storage
    - name: ROOK_MON_ENDPOINTS
      valueFrom:
        configMapKeyRef:
          key: data
          name: rook-ceph-mon-endpoints
    - name: ROOK_CONFIG_DIR
      value: /var/lib/rook
    - name: ROOK_CEPH_CONFIG_OVERRIDE
      value: /etc/rook/config/override.conf
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: ROOK_CRUSHMAP_ROOT
      value: default
    - name: ROOK_CRUSHMAP_HOSTNAME
    - name: CEPH_VOLUME_DEBUG
      value: "1"
    - name: CEPH_VOLUME_SKIP_RESTORECON
      value: "1"
    - name: DM_DISABLE_UDEV
      value: "1"
    - name: ROOK_OSD_ID
      value: "1"
    - name: ROOK_CEPH_VERSION
      value: ceph version 17.2.6-196 quincy
    - name: ROOK_IS_DEVICE
      value: "true"
    - name: TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
      value: "134217728"
    envFrom:
    - configMapRef:
        name: rook-ceph-osd-env-override
        optional: true
    image: registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc
    imagePullPolicy: IfNotPresent
    name: config-init
    resources: {}
    securityContext:
      capabilities:
        drop:
        - ALL
      privileged: true
      readOnlyRootFilesystem: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/rook
      name: rook-data
    - mountPath: /etc/ceph
      name: rook-config-override
      readOnly: true
    - mountPath: /run/ceph
      name: ceph-daemons-sock-dir
    - mountPath: /var/log/ceph
      name: rook-ceph-log
    - mountPath: /var/lib/ceph/crash
      name: rook-ceph-crash
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xjzbc
      readOnly: true
## Init Container 2 : Copy rook command to OSD pod  
  - args:
    - --archive
    - --force
    - --verbose
    - /usr/local/bin/rook
    - /rook
    command:
    - cp
    image: registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc
    imagePullPolicy: IfNotPresent
    name: copy-bins
    resources: {}
    securityContext:
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /rook
      name: rook-binaries
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xjzbc
      readOnly: true
  - command:
    - /bin/bash
    - -c
    - "\nset -xe\n\nPVC_SOURCE=/ocs-deviceset-0-1-78k4w\nPVC_DEST=/mnt/ocs-deviceset-0-1-78k4w\nCP_ARGS=(--archive
      --dereference --verbose)\n\nif [ -b \"$PVC_DEST\" ]; then\n\tPVC_SOURCE_MAJ_MIN=$(stat
      --format '%t%T' $PVC_SOURCE)\n\tPVC_DEST_MAJ_MIN=$(stat --format '%t%T' $PVC_DEST)\n\tif
      [[ \"$PVC_SOURCE_MAJ_MIN\" == \"$PVC_DEST_MAJ_MIN\" ]]; then\n\t\techo \"PVC
      $PVC_DEST already exists and has the same major and minor as $PVC_SOURCE: \"$PVC_SOURCE_MAJ_MIN\"\"\n\t\texit
      0\n\telse\n\t\techo \"PVC's source major/minor numbers changed\"\n\t\tCP_ARGS+=(--remove-destination)\n\tfi\nfi\n\ncp
      \"${CP_ARGS[@]}\" \"$PVC_SOURCE\" \"$PVC_DEST\"\n"
    image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    imagePullPolicy: IfNotPresent
    name: blkdevmapper
    resources:
      limits:
        cpu: "2"
        memory: 5Gi
      requests:
        cpu: "2"
        memory: 5Gi
    securityContext:
      capabilities:
        add:
        - MKNOD
        drop:
        - ALL
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeDevices:
    - devicePath: /ocs-deviceset-0-1-78k4w
      name: ocs-deviceset-0-1-78k4w
    volumeMounts:
    - mountPath: /mnt
      name: ocs-deviceset-0-1-78k4w-bridge
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xjzbc
      readOnly: true
## Init Container 3 : expand-bluefs   
  - args:
    - bluefs-bdev-expand
    - --path
    - /var/lib/ceph/osd/ceph-1
    command:
    - ceph-bluestore-tool
    image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    imagePullPolicy: IfNotPresent
    name: expand-bluefs
    resources:
      limits:
        cpu: "2"
        memory: 5Gi
      requests:
        cpu: "2"
        memory: 5Gi
    securityContext:
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/ceph/osd/ceph-1
      name: ocs-deviceset-0-1-78k4w-bridge
      subPath: ceph-1
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xjzbc
      readOnly: true
## Init Container 4 : chown ceph directories   
  - args:
    - --verbose
    - --recursive
    - ceph:ceph
    - /var/log/ceph
    - /var/lib/ceph/crash
    - /run/ceph
    command:
    - chown
    image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    imagePullPolicy: IfNotPresent
    name: chown-container-data-dir
    resources:
      limits:
        cpu: "2"
        memory: 5Gi
      requests:
        cpu: "2"
        memory: 5Gi
    securityContext:
      capabilities:
        drop:
        - ALL
      privileged: true
      readOnlyRootFilesystem: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/rook
      name: rook-data
    - mountPath: /etc/ceph
      name: rook-config-override
      readOnly: true
    - mountPath: /run/ceph
      name: ceph-daemons-sock-dir
    - mountPath: /var/log/ceph
      name: rook-ceph-log
    - mountPath: /var/lib/ceph/crash
      name: rook-ceph-crash
    - mountPath: /dev
      name: devices
    - mountPath: /run/udev
      name: run-udev
    - mountPath: /rook
      name: rook-binaries
    - mountPath: /mnt
      name: ocs-deviceset-0-1-78k4w-bridge
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xjzbc
      readOnly: true
  nodeName: storage-00.dev-intranet-01-wob.ocp.vwgroup.com
  nodeSelector:
    kubernetes.io/hostname: storage-00.dev-intranet-01-wob.ocp.vwgroup.com
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000620000
    seLinuxOptions:
      level: s0:c25,c10
  serviceAccount: rook-ceph-osd
  serviceAccountName: rook-ceph-osd
  shareProcessNamespace: true
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: ceph.rook.io/pvc
        operator: Exists
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  volumes:
  - emptyDir: {}
    name: rook-data
  - name: rook-config-override
    projected:
      defaultMode: 420
      sources:
      - configMap:
          items:
          - key: config
            mode: 292
            path: ceph.conf
          name: rook-config-override
  - hostPath:
      path: /var/lib/rook/exporter
      type: DirectoryOrCreate
    name: ceph-daemons-sock-dir
  - hostPath:
      path: /var/lib/rook/openshift-storage/log
      type: ""
    name: rook-ceph-log
  - hostPath:
      path: /var/lib/rook/openshift-storage/crash
      type: ""
    name: rook-ceph-crash
  - hostPath:
      path: /dev
      type: ""
    name: devices
  - name: ocs-deviceset-0-1-78k4w
    persistentVolumeClaim:
      claimName: ocs-deviceset-0-1-78k4w
  - hostPath:
      path: /var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w
      type: DirectoryOrCreate
    name: ocs-deviceset-0-1-78k4w-bridge
  - hostPath:
      path: /run/udev
      type: ""
    name: run-udev
  - emptyDir: {}
    name: rook-binaries
  - name: kube-api-access-xjzbc
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt

[3] ceph osd start logs 
2024-04-06T05:46:41.593349071Z + set -o nounset
2024-04-06T05:46:41.593349071Z + child_pid=
2024-04-06T05:46:41.593427396Z + sigterm_received=false
2024-04-06T05:46:41.593427396Z + trap sigterm SIGTERM
2024-04-06T05:46:41.593576845Z + child_pid=52
2024-04-06T05:46:41.593589922Z + wait 52
2024-04-06T05:46:41.593726159Z + /rook/rook ceph osd start -- --foreground --id 1 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6 --setuser ceph --setgroup ceph '--crush-location=root=default host=storage-00-dev-intranet-01-wob-ocp-vwgroup-com rack=rack0' --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true '--default-log-stderr-prefix=debug ' --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
2024-04-06T05:46:41.626980032Z 2024-04-06 05:46:41.626898 I | rookcmd: starting Rook v4.14.6-0.7522dc8ddafd09860f2314db3965ef97671cd138 with arguments '/rook/rook ceph osd start -- --foreground --id 1 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6 --setuser ceph --setgroup ceph --crush-location=root=default host=storage-00-dev-intranet-01-wob-ocp-vwgroup-com rack=rack0 --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug  --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false'
2024-04-06T05:46:41.626980032Z 2024-04-06 05:46:41.626956 I | rookcmd: flag values: --block-path=/dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196, --help=false, --log-level=INFO, --lv-backed-pv=true, --osd-id=1, --osd-store-type=, --osd-uuid=7f80e2ac-e21f-4aa6-8886-ec94d0387196, --pvc-backed-osd=true
2024-04-06T05:46:41.626980032Z 2024-04-06 05:46:41.626960 I | ceph-spec: parsing mon endpoints: g=100.69.195.205:3300,f=100.70.70.134:6789,b=100.70.78.99:6789
2024-04-06T05:46:41.628815634Z 2024-04-06 05:46:41.628788 I | cephosd: Successfully updated lvm config file "/etc/lvm/lvm.conf"
2024-04-06T05:46:41.925092800Z 2024-04-06 05:46:41.925022 I | exec: Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
2024-04-06T05:46:41.928518615Z 2024-04-06 05:46:41.928499 I | exec: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
2024-04-06T05:46:41.931919054Z 2024-04-06 05:46:41.931906 I | exec: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196 --path /var/lib/ceph/osd/ceph-1 --no-mon-config
2024-04-06T05:46:41.954830230Z 2024-04-06 05:46:41.954808 I | exec: Running command: /usr/bin/ln -snf /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196 /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:41.957864812Z 2024-04-06 05:46:41.957851 I | exec: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:41.961270909Z 2024-04-06 05:46:41.961255 I | exec: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-5
2024-04-06T05:46:41.964681164Z 2024-04-06 05:46:41.964667 I | exec: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
2024-04-06T05:46:41.967586406Z 2024-04-06 05:46:41.967574 I | exec: --> ceph-volume lvm activate successful for osd ID: 1
2024-04-06T05:46:42.029385070Z 2024-04-06 05:46:42.028473 I | exec: debug 2024-04-06T05:46:42.027+0000 7fa35830c5c0  0 set uid:gid to 167:167 (ceph:ceph)
2024-04-06T05:46:42.029462802Z 2024-04-06 05:46:42.029394 I | exec: debug 2024-04-06T05:46:42.027+0000 7fa35830c5c0  0 ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable), process ceph-osd, pid 133
2024-04-06T05:46:42.029462802Z 2024-04-06 05:46:42.029437 I | exec: debug 2024-04-06T05:46:42.027+0000 7fa35830c5c0  0 pidfile_write: ignore empty --pid-file
2024-04-06T05:46:42.029899768Z 2024-04-06 05:46:42.029860 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  1 bdev(0x55a4d1b87c00 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:42.029959756Z 2024-04-06 05:46:42.029947 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  0 bdev(0x55a4d1b87c00 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument
2024-04-06T05:46:42.030424427Z 2024-04-06 05:46:42.030409 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  1 bdev(0x55a4d1b87c00 /var/lib/ceph/osd/ceph-1/block) open size 1920378863616 (0x1bf1f800000, 1.7 TiB) block_size 4096 (4 KiB) non-rotational discard supported
2024-04-06T05:46:42.030649989Z 2024-04-06 05:46:42.030627 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  1 bluestore(/var/lib/ceph/osd/ceph-1) _set_cache_sizes cache_size 3221225472 meta 0.45 kv 0.45 data 0.06
2024-04-06T05:46:42.030665356Z 2024-04-06 05:46:42.030652 I | exec: debug 2024-04-06T05:46:42.030+0000 7fa35830c5c0  1 bdev(0x55a4d1b87400 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:42.030775141Z 2024-04-06 05:46:42.030763 I | exec: debug 2024-04-06T05:46:42.030+0000 7fa35830c5c0  0 bdev(0x55a4d1b87400 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument

So we need to find out why "rook ceph osd init" was failing to mount the OSD data dir.

@Travis Any thoughts on "rook ceph osd init" failure ?

--- Additional comment from Prashant Dhange on 2024-04-09 23:29:39 UTC ---

(In reply to Prashant Dhange from comment #34)
...
> 
> So we need to find out why "rook ceph osd init" was failing to mount the OSD
> data dir.
> 
> @Travis Any thoughts on "rook ceph osd init" failure ?
https://github.com/rook/rook/commit/33e824a323291de1a261b70e9bd255d5049ee02b commit likely caused this issue as we have removed the fsid, username configs from the env vars.

--- Additional comment from Travis Nielsen on 2024-04-10 02:13:31 UTC ---

(In reply to Prashant Dhange from comment #35)
> (In reply to Prashant Dhange from comment #34)
> ...
> > 
> > So we need to find out why "rook ceph osd init" was failing to mount the OSD
> > data dir.
> > 
> > @Travis Any thoughts on "rook ceph osd init" failure ?
> https://github.com/rook/rook/commit/33e824a323291de1a261b70e9bd255d5049ee02b
> commit likely caused this issue as we have removed the fsid, username
> configs from the env vars.

That commit was also backported all the way to 4.10 [1], so this change was not new in 4.14.
The error about the ceph-username parameter missing must be ignored despite the error in the init container.
It would be really helpful if we can repro this, first looking at the OSD spec and logs in 4.13,
and then upgrading to 4.14 to see what changed in the OSD spec. 

I suspect if the "osd init" container fails, the ceph.conf would not be present and cause the bluefs expand container to fail. 
But I am confused why the "osd init" container failure did not abort starting the OSD in the first place. 
Init containers are not supposed to continue to the next one if they fail.

I still need to dig more, but in the meantime the repro would help.

[1] https://github.com/red-hat-storage/rook/commit/673c331a072a9de41ab2aac5405600104bd44ef2

--- Additional comment from RHEL Program Management on 2024-04-12 17:41:42 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.16.0' to '?', and so is being proposed to be fixed at the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-04-12 17:41:42 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Prashant Dhange on 2024-04-12 18:02:57 UTC ---

If OSDs are backed by LVM then the root operator should prevent the ODF upgrade and also alert the admin to re-deploy these OSDs before upgrading the ODF cluster.

@Travis, I donot recall exactly but you had some ideas around it during our VW escalation (BZ#2273398) discussion.

--- Additional comment from Santosh Pillai on 2024-04-15 08:03:20 UTC ---

Should It prevent entire ODF upgrade or just the upgrade of the lvm based OSDs?

--- Additional comment from Prashant Dhange on 2024-04-15 22:13:43 UTC ---

(In reply to Santosh Pillai from comment #4)
> Should It prevent entire ODF upgrade or just the upgrade of the lvm based
> OSDs?

Prevention of upgrade for lvm based OSDs is preferred but if we alert the end user before the start of the upgrade then we could avoid the unexpected data-loss situation in advance.

--- Additional comment from Travis Nielsen on 2024-04-16 20:08:02 UTC ---

(In reply to Prashant Dhange from comment #3)
> If OSDs are backed by LVM then the root operator should prevent the ODF
> upgrade and also alert the admin to re-deploy these OSDs before upgrading
> the ODF cluster.
> 
> @Travis, I donot recall exactly but you had some ideas around it during our
> VW escalation (BZ#2273398) discussion.

Preventing the upgrade will be difficult because we won't detect it until after the upgrade is already in progress. Mons and mgr will be upgraded, then OSDs are reconciled and we would discover these LVM-based OSDs. If we fail the reconcile, then it will be difficult to recover from the situation. Instead of failing/preventing the upgrade, let's consider removing the resize init container from these OSDs. Then separately we can find a way to alert the user that they have these legacy OSDs that should be replaced. This gives the user more time to replace them.

--- Additional comment from Travis Nielsen on 2024-04-23 22:45:13 UTC ---

Acking for the fix:
- Rook will save status on the CephCluster CR that a legacy LVM-based OSD is in the cluster
- UI will need to raise an alert based on that status item

--- Additional comment from Prasad Desala on 2024-04-24 12:13:30 UTC ---

Hi Travis,

Since we are unable to deploy a 4.3 cluster to reproduce this issue, could you please provide guidance on the steps to verify this bug on the fix build? Please let us know.

--- Additional comment from Prasad Desala on 2024-04-25 05:38:28 UTC ---

(In reply to Prasad Desala from comment #8)
> Hi Travis,
> 
> Since we are unable to deploy a 4.3 cluster to reproduce this issue, could
> you please provide guidance on the steps to verify this bug on the fix
> build? Please let us know.

Providing qa_ack based on comments https://bugzilla.redhat.com/show_bug.cgi?id=2273398#c39 and https://bugzilla.redhat.com/show_bug.cgi?id=2273398#c41
We may need to verify the fix based on the 4.16 CI regression runs.

--- Additional comment from RHEL Program Management on 2024-04-25 05:38:40 UTC ---

This BZ is being approved for ODF 4.16.0 release, upon receipt of the 3 ACKs (PM,Devel,QA) for the release flag 'odf‑4.16.0

--- Additional comment from RHEL Program Management on 2024-04-25 05:38:40 UTC ---

Since this bug has been approved for ODF 4.16.0 release, through release flag 'odf-4.16.0+', the Target Release is being set to 'ODF 4.16.0

--- Additional comment from Travis Nielsen on 2024-04-29 16:19:50 UTC ---

(In reply to Prasad Desala from comment #8)
> Hi Travis,
> 
> Since we are unable to deploy a 4.3 cluster to reproduce this issue, could
> you please provide guidance on the steps to verify this bug on the fix
> build? Please let us know.

Discussion in a separate thread is that we will just have to run regression tests, as we have not been able to repro the issue.

--- Additional comment from Mudit Agarwal on 2024-05-07 05:57:34 UTC ---

Travis, do we have a PR for this?

--- Additional comment from Travis Nielsen on 2024-05-07 19:00:20 UTC ---

(In reply to Mudit Agarwal from comment #13)
> Travis, do we have a PR for this?

Not yet. And we will need two PRs:
1) Rook to update its status when it finds the legacy OSDs
2) UI to raise an alert based on the status (unless UI team already has a way to raise alert based on Rook status, I still need to sync with UI team on this)

--- Additional comment from Travis Nielsen on 2024-05-09 22:45:33 UTC ---

Rook will add status under status.storage.legacyOSDs to the CephCluster CR such as the following:

  status:
    storage:
      deviceClasses:
      - name: hdd
      legacyOSDs:
      - id: 0
        reason: LVM-based OSD on a PVC (id=0) is deprecated and should be replaced
      - id: 1
        reason: LVM-based OSD on a PVC (id=1) is deprecated and should be replaced
      - id: 2
        reason: LVM-based OSD on a PVC (id=2) is deprecated and should be replaced
      osd:
        storeType:
          bluestore: 3

I will clone this BZ to get the needed alert raised based on this status.

Comment 6 Travis Nielsen 2024-05-10 21:59:32 UTC

Based on feedback during PR review, now the output is:

    storage:
      deprecatedOSDs:
        LVM-based OSDs on a PVC are deprecated, see documentation on replacing OSDs:
        - 0
        - 1
        - 2
      deviceClasses:
      - name: hdd

Please confirm if any concern with this format for raising the alert

Comment 9 Divyansh Kamboj 2024-05-13 13:35:55 UTC

There are no concerns that i can see, i'm working on the changes based on this output. Fetching the status on OCS metrics exporter, and generating a metric to generate the alert.

Comment 10 Travis Nielsen 2024-05-13 17:10:49 UTC

(In reply to Divyansh Kamboj from comment #9)
> There are no concerns that i can see, i'm working on the changes based on
> this output. Fetching the status on OCS metrics exporter, and generating a
> metric to generate the alert.

Thanks. The Rook changes are now merged downstream in 4.16 with https://github.com/red-hat-storage/rook/pull/648

Comment 11 Mudit Agarwal 2024-05-14 13:34:13 UTC

Moving this one to MODIFIED, if we need another bug for metrics then please open one.

Comment 12 Mudit Agarwal 2024-05-14 13:35:09 UTC

Sorry, wrong bug.

Comment 14 arun kumar mohan 2024-05-30 06:04:12 UTC

Updating the RDT on behalf of Divyansh Kamboj.

Comment 16 Vishakha Kathole 2024-06-04 11:49:19 UTC

Moving it to the verified state based on the 4.16 CI regression runs.

Comment 18 errata-xmlrpc 2024-07-17 13:22:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 19 kelwhite 2024-10-06 01:34:52 UTC

Can we please backport this to 4.12, 4.13, 4.14, and 4.15? Were seeing this issue be hit in 4.14 upgrades

Note You need to log in before you can comment on or make changes to this bug.

ableisch
amohan
asriram
bkunal
bniver
dkamboj
ebenahar
gabrioux
gsternag
kbg
kelwhite
mmanjuna
muagarwa
nojha
nthomas
odf-bz-bot
pdhange
rafrojas
roemerso
skatiyar
sostapov
tdesala
tnielsen
vkathole