Bug 2276532

Summary: [GSS][ODF 4.15 backport] Legacy LVM-based OSDs are in crashloop state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Travis Nielsen <tnielsen>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA
QA Contact: Vishakha Kathole <vkathole>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.14
CC: ableisch, asriram, bkunal, bniver, ebenahar, gabrioux, gsternag, kbg, kelwhite, kramdoss, mcaldeir, mmanjuna, muagarwa, nojha, odf-bz-bot, pdhange, rafrojas, roemerso, sostapov, tnielsen
Target Milestone: ---   
Target Release: ODF 4.15.3   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the expand OSD init container did not handle legacy OSDs. As a result, legacy OSDs that had been upgraded since Red Hat OpenShift Container Storage 4.3 crashed after the upgrade to 4.14. With this fix, the expand OSD init container has been removed to avoid the crash, and the legacy OSDs start as expected. However, it is still recommended to replace the legacy OSDs.
Story Points: ---
Clone Of: 2273398
Environment:
Last Closed: 2024-06-11 16:41:22 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2273398, 2276533    
Bug Blocks: 2274657, 2274757, 2279928    

Description Travis Nielsen 2024-04-22 21:25:02 UTC
+++ This bug was initially created as a clone of Bug #2273398 +++

Description of problem (please be as detailed as possible and provide log
snippets):

The customer upgraded from 4.12.47 to 4.14.16. We have noticed that all OSDs are in a crash loop, with the expand-bluefs container showing errors about devices that cannot be found.



Version of all relevant components (if applicable):
ODF 4.14.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, all OSDs are down

Is there any workaround available to the best of your knowledge?
No


Is this issue reproducible?
Yes, in the customer environment.

--- Additional comment from RHEL Program Management on 2024-04-04 15:14:00 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.16.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-04-04 15:14:00 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from RHEL Program Management on 2024-04-04 15:14:00 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

--- Additional comment from Manjunatha on 2024-04-04 15:15:20 UTC ---

+ Attached the osd.4 pod description here.

+ Ceph OSD pods are in a crashloop state with the below messages.
Error: ceph-username is required for osd
rook error: ceph-username is required for osd
Usage:
  rook ceph osd init [flags]

Flags:
      --cluster-id string                 the UID of the cluster CR that owns this cluster
      --cluster-name string               the name of the cluster CR that owns this cluster
      --encrypted-device                  whether to encrypt the OSD with dmcrypt
  -h, --help                              help for init
      --is-device                         whether the osd is a device
      --location string                   location of this node for CRUSH placement
      --node-name string                  the host name of the node (default "rook-ceph-osd-1-6db9cfc7c9-294jn")
      --osd-crush-device-class string     The device class for all OSDs configured on this node
      --osd-crush-initial-weight string   The initial weight of OSD in TiB units
      --osd-database-size int             default size (MB) for OSD database (bluestore)
      --osd-id int                        osd id for which to generate config (default -1)
      --osd-store-type string             the osd store type such as bluestore (default "bluestore")
      --osd-wal-size int                  default size (MB) for OSD write ahead log (WAL) (bluestore) (default 576)
      --osds-per-device int               the number of OSDs per device (default 1)

Global Flags:
      --log-level string   logging level for logging/tracing output (valid values: ERROR,WARNING,INFO,DEBUG) (default "INFO")

'/usr/local/bin/rook' -> '/rook/rook'
+ PVC_SOURCE=/ocs-deviceset-0-1-78k4w
+ PVC_DEST=/mnt/ocs-deviceset-0-1-78k4w
+ CP_ARGS=(--archive --dereference --verbose)
+ '[' -b /mnt/ocs-deviceset-0-1-78k4w ']'
++ stat --format %t%T /ocs-deviceset-0-1-78k4w
+ PVC_SOURCE_MAJ_MIN=8e0
++ stat --format %t%T /mnt/ocs-deviceset-0-1-78k4w
+ PVC_DEST_MAJ_MIN=8e0
+ [[ 8e0 == \8\e\0 ]]
+ echo 'PVC /mnt/ocs-deviceset-0-1-78k4w already exists and has the same major and minor as /ocs-deviceset-0-1-78k4w: 8e0'
PVC /mnt/ocs-deviceset-0-1-78k4w already exists and has the same major and minor as /ocs-deviceset-0-1-78k4w: 8e0
+ exit 0
inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
2024-04-04T13:22:38.461+0000 7f41cddbf900 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (2) No such file or directory

--- Additional comment from Manjunatha on 2024-04-04 15:15:49 UTC ---



--- Additional comment from Manjunatha on 2024-04-04 15:20:40 UTC ---

ODF must-gather logs are in the path below:
/cases/03783266/0020-must-gather-odf.tar.gz/must-gather.local.8213025456072446876/inspect.local.6887047306785235156/namespaces/openshift-storage

This issue looks similar to bz https://bugzilla.redhat.com/show_bug.cgi?id=2254378
---------------------------
Events from osd-0 deployment:

Events:
  Type     Reason                 Age                   From               Message
  ----     ------                 ----                  ----               -------
  Normal   Scheduled              23m                   default-scheduler  Successfully assigned openshift-storage/rook-ceph-osd-0-6764d4c675-f9w2m to storage-00.dev-intranet-01-wob.ocp.vwgroup.com
  Normal   SuccessfulMountVolume  23m                   kubelet            MapVolume.MapPodDevice succeeded for volume "local-pv-abfd62bb" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-abfd62bb"
  Normal   SuccessfulMountVolume  23m                   kubelet            MapVolume.MapPodDevice succeeded for volume "local-pv-abfd62bb" volumeMapPath "/var/lib/kubelet/pods/2ae21d6a-2aa5-43e8-8c0d-7ecb250656a2/volumeDevices/kubernetes.io~local-volume"
  Normal   AddedInterface         23m                   multus             Add eth0 [100.72.0.27/23] from ovn-kubernetes
  Normal   Pulling                23m                   kubelet            Pulling image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc"
  Normal   Pulled                 23m                   kubelet            Successfully pulled image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc" in 14.120104593s (14.120122047s including waiting)
  Normal   Created                23m                   kubelet            Created container config-init
  Normal   Started                23m                   kubelet            Started container config-init
  Normal   Pulled                 23m                   kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:23041cec90d0c64f043deb9f5b589c2fe3b2e29163cf7576324341ad855affcc" already present on machine
  Normal   Created                23m                   kubelet            Created container copy-bins
  Normal   Started                23m                   kubelet            Started container copy-bins
  Normal   Pulled                 23m                   kubelet            Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226" already present on machine
  Normal   Created                23m                   kubelet            Created container blkdevmapper
  Normal   Started                23m                   kubelet            Started container blkdevmapper
  Normal   Pulled                 23m (x3 over 23m)     kubelet            Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226" already present on machine
  Normal   Created                23m (x3 over 23m)     kubelet            Created container expand-bluefs
  Normal   Started                23m (x3 over 23m)     kubelet            Started container expand-bluefs
  Warning  BackOff                3m43s (x94 over 23m)  kubelet            Back-off restarting failed container expand-bluefs in pod rook-ceph-osd-0-6764d4c675-f9w2m_openshift-storage(2ae21d6a-2aa5-43e8-8c0d-7ecb250656a2)

--- Additional comment from Andreas Bleischwitz on 2024-04-04 15:29:15 UTC ---

TAM update about the business impact:

All developers on this platform are being blocked due to this issue.
VW also has a lot of VMs running on this bare-metal cluster on which they do all the testing and simulation of car components.

This is now all down and they are not able to work.

--- Additional comment from Manjunatha on 2024-04-04 15:46:36 UTC ---

The customer rebooted the OSD nodes a few times as suggested in the solution below, but that didn't help:
https://access.redhat.com/solutions/7015095

--- Additional comment from Andreas Bleischwitz on 2024-04-04 15:53:31 UTC ---

The customer just informed us that this is a very "old" cluster, starting with 4.3.18, and ODF was installed about 3.5 years ago. So there may be a lot of tweaks/leftovers/* in this cluster.

--- Additional comment from Manjunatha on 2024-04-04 16:32:07 UTC ---

The latest must-gather and sosreport from storage node 0 are in the supportshell path below:
path: /cases/03783266

cluster:
    id:     18c9800f-7f91-4994-ad32-2a8a330babd6
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            7 osds down
            2 hosts (10 osds) down
            2 racks (10 osds) down
            Reduced data availability: 505 pgs inactive
 
  services:
    mon: 3 daemons, quorum b,f,g (age 4h)
    mgr: a(active, since 4h)
    mds: 1/1 daemons up, 1 standby
    osd: 15 osds: 4 up (since 23h), 11 in (since 23h); 5 remapped pgs
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 505 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             505 unknown
 
ID   CLASS  WEIGHT    TYPE NAME                                                    STATUS  REWEIGHT  PRI-AFF
 -1         26.19896  root default                                                                          
 -4          8.73299      rack rack0                                                                        
 -3          8.73299          host storage-00-dev-intranet-01-wob-ocp-vwgroup-com                           
  0    ssd   1.74660              osd.0                                              down   1.00000  1.00000
  1    ssd   1.74660              osd.1                                              down   1.00000  1.00000
  2    ssd   1.74660              osd.2                                              down   1.00000  1.00000
  3    ssd   1.74660              osd.3                                              down   1.00000  1.00000
  4    ssd   1.74660              osd.4                                              down   1.00000  1.00000
-12          8.73299      rack rack1                                                                        
-11          8.73299          host storage-01-dev-intranet-01-wob-ocp-vwgroup-com                           
  6    ssd   1.74660              osd.6                                                up   1.00000  1.00000
  7    ssd   1.74660              osd.7                                                up   1.00000  1.00000
  8    ssd   1.74660              osd.8                                                up   1.00000  1.00000
  9    ssd   1.74660              osd.9                                                up   1.00000  1.00000
 10    ssd   1.74660              osd.10                                             down         0  1.00000
 -8          8.73299      rack rack2                                                                        
 -7          8.73299          host storage-02-dev-intranet-01-wob-ocp-vwgroup-com                           
  5    ssd   1.74660              osd.5                                              down   1.00000  1.00000
 11    ssd   1.74660              osd.11                                             down         0  1.00000
 12    ssd   1.74660              osd.12                                             down         0  1.00000
 13    ssd   1.74660              osd.13                                             down         0  1.00000
 14    ssd   1.74660              osd.14                                             down   1.00000  1.00000

oc get pods
----------
csi-rbdplugin-22sd7                                               3/3     Running                 0                3h9m
csi-rbdplugin-4jq74                                               3/3     Running                 0                3h8m
csi-rbdplugin-5pcb2                                               3/3     Running                 0                3h10m
csi-rbdplugin-8bfc6                                               3/3     Running                 0                3h8m
csi-rbdplugin-dqk7p                                               3/3     Running                 0                3h9m
csi-rbdplugin-fdvn7                                               3/3     Running                 0                3h10m
csi-rbdplugin-plpst                                               3/3     Running                 0                3h8m
csi-rbdplugin-provisioner-5f9c6986bf-j7p87                        6/6     Running                 0                3h10m
csi-rbdplugin-provisioner-5f9c6986bf-lh28b                        6/6     Running                 0                3h10m
csi-rbdplugin-szt6r                                               3/3     Running                 0                3h9m
csi-rbdplugin-v2mbl                                               3/3     Running                 0                3h8m
csi-rbdplugin-v2sl2                                               3/3     Running                 0                3h9m
noobaa-core-0                                                     1/1     Running                 0                4h24m
noobaa-db-pg-0                                                    0/1     ContainerCreating       0                4h21m
noobaa-endpoint-6b7ffdb8c7-m4sc4                                  1/1     Running                 0                4h24m
noobaa-operator-8fbd98874-thsnf                                   2/2     Running                 0                3h10m
ocs-metrics-exporter-675445555-57s4l                              1/1     Running                 0                3h11m
ocs-operator-7f94d94cc6-t9grr                                     1/1     Running                 0                3h11m
odf-console-57f488895f-9bmn9                                      1/1     Running                 0                3h11m
odf-operator-controller-manager-5696cbdd96-9bd8s                  2/2     Running                 0                3h11m
rook-ceph-crashcollector-99efedd2c34d02d8f63821262323e8cf-g7ktw   1/1     Running                 0                4h9m
rook-ceph-crashcollector-ba2f7f929e41f5b369d230c9d1f57030-hvpx7   1/1     Running                 0                4h36m
rook-ceph-crashcollector-e268748b9d65a9160da738c1921524fc-bp2xh   1/1     Running                 0                4h24m
rook-ceph-exporter-99efedd2c34d02d8f63821262323e8cf-cf55b8cdff7   1/1     Running                 0                3h10m
rook-ceph-exporter-ba2f7f929e41f5b369d230c9d1f57030-868576jrwtl   1/1     Running                 0                3h10m
rook-ceph-exporter-e268748b9d65a9160da738c1921524fc-cfd99fznfcm   1/1     Running                 0                3h10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5555d4ddxlcbp   2/2     Running                 4 (4h10m ago)    4h24m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7c6fbf87kqwvq   2/2     Running                 7 (4h11m ago)    4h36m
rook-ceph-mgr-a-648fcd4f7-rlpk8                                   2/2     Running                 0                4h36m
rook-ceph-mon-b-f75967674-vbh6v                                   2/2     Running                 0                4h24m
rook-ceph-mon-f-85459b595-zsmtc                                   2/2     Running                 0                4h36m
rook-ceph-mon-g-55dbcc6687-p6jtf                                  2/2     Running                 0                4h44m
rook-ceph-operator-5cc55456b5-hk8db                               1/1     Running                 0                3h11m
rook-ceph-osd-0-6764d4c675-f9w2m                                  0/2     Init:CrashLoopBackOff   41 (4m53s ago)   3h9m
rook-ceph-osd-1-6db9cfc7c9-294jn                                  0/2     Init:CrashLoopBackOff   41 (4m15s ago)   3h9m
rook-ceph-osd-10-5cb968ffcc-svnlf                                 0/2     Init:CrashLoopBackOff   41 (3m52s ago)   3h9m
rook-ceph-osd-11-6c9b679dd-gdxfz                                  0/2     Init:CrashLoopBackOff   41 (4m54s ago)   3h9m
rook-ceph-osd-12-569499c577-k5ssz                                 0/2     Init:CrashLoopBackOff   41 (4m36s ago)   3h9m
rook-ceph-osd-13-f6db445dc-cgwc2                                  0/2     Init:CrashLoopBackOff   41 (3m18s ago)   3h9m
rook-ceph-osd-14-74d8c98998-mm6bk                                 0/2     Init:CrashLoopBackOff   41 (4m9s ago)    3h9m
rook-ceph-osd-2-6c5f9b84d5-njc9v                                  0/2     Init:CrashLoopBackOff   41 (4m31s ago)   3h9m
rook-ceph-osd-3-76984bf75-rtqvb                                   0/2     Init:CrashLoopBackOff   41 (3m56s ago)   3h9m
rook-ceph-osd-4-b696776bf-8z9mx                                   0/2     Init:CrashLoopBackOff   13 (4m4s ago)    45m
rook-ceph-osd-5-6684cc7f47-64pq4                                  0/2     Init:CrashLoopBackOff   14 (2m49s ago)   49m
rook-ceph-osd-6-5dcff784bc-gk7st                                  0/2     Init:CrashLoopBackOff   41 (4m9s ago)    3h8m
rook-ceph-osd-7-f66f9d586-rrkjm                                   0/2     Init:CrashLoopBackOff   41 (4m33s ago)   3h8m
rook-ceph-osd-8-7dd767d7f4-g9s6b                                  0/2     Init:CrashLoopBackOff   41 (3m45s ago)   3h8m
rook-ceph-osd-9-76f75fbc55-jq7lh                                  0/2     Init:CrashLoopBackOff   41 (3m57s ago)   3h8m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-78f98f5f6c4n   1/2     CrashLoopBackOff        59 (2m38s ago)   4h24m
rook-ceph-tools-759496b8f8-4klr9                                  1/1     Running                 0                3h10m
ux-backend-server-5fbf8b985-zpjph                                 2/2     Running                 0                3h11m


Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  4m38s (x856 over 3h9m)  kubelet  Back-off restarting failed container expand-bluefs in pod rook-ceph-osd-0-6764d4c675-f9w2m_openshift-storage(2ae21d6a-2aa5-43e8-8c0d-7ecb250656a2)

--- Additional comment from  on 2024-04-04 18:29:13 UTC ---

Seems like the backing device was removed or moved?:

~~~
PVC /mnt/ocs-deviceset-0-1-78k4w already exists and has the same major and minor as /ocs-deviceset-0-1-78k4w: 8e0
+ exit 0
inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
~~~

// confirm the device for osd-1
$  omc get pods rook-ceph-osd-1-6db9cfc7c9-294jn -o yaml|grep device
    ceph.rook.io/DeviceSet: ocs-deviceset-0
    ceph.rook.io/pvc: ocs-deviceset-0-1-78k4w
    device-class: ssd
      name: devices
      name: ocs-deviceset-0-1-78k4w-bridge
    - "\nset -xe\n\nPVC_SOURCE=/ocs-deviceset-0-1-78k4w\nPVC_DEST=/mnt/ocs-deviceset-0-1-78k4w\nCP_ARGS=(--archive
    - devicePath: /ocs-deviceset-0-1-78k4w
      name: ocs-deviceset-0-1-78k4w
      name: ocs-deviceset-0-1-78k4w-bridge
      name: ocs-deviceset-0-1-78k4w-bridge
      name: devices
      name: ocs-deviceset-0-1-78k4w-bridge
    name: devices
  - name: ocs-deviceset-0-1-78k4w
      claimName: ocs-deviceset-0-1-78k4w
      path: /var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w
    name: ocs-deviceset-0-1-78k4w-bridge

$  omc get pvc ocs-deviceset-0-1-78k4w
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS     AGE
ocs-deviceset-0-1-78k4w   Bound    local-pv-32532e89   1788Gi     RWO            ocs-localblock   3y

$  omc get pv local-pv-32532e89 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: local-volume-provisioner-storage-00.dev-intranet-01-wob.ocp.vwgroup.com-da4c2721-f73c-4626-8c98-7ff9f07f3212
  creationTimestamp: "2020-09-09T14:52:54Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    storage.openshift.com/local-volume-owner-name: ocs-blkvol-storage-00
    storage.openshift.com/local-volume-owner-namespace: local-storage
  name: local-pv-32532e89
  resourceVersion: "194139688"
  uid: fdcb6fab-0a53-49ca-bdb1-6e807e969eb7
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1788Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: ocs-deviceset-0-1-78k4w
    namespace: openshift-storage
    resourceVersion: "194139403"
    uid: 8acac407-b475-46ed-9e49-29b377b80137
  local:
    path: /mnt/local-storage/ocs-localblock/sdr
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - storage-00.dev-intranet-01-wob.ocp.vwgroup.com
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ocs-localblock
  volumeMode: Block
status:
  phase: Bound

For this OSD at least, they are using device paths and should be using devices by-id/UUID so the device names never change:
~~~
  local:
    path: /mnt/local-storage/ocs-localblock/sdr
~~~

Asking for some more data in the case like:

~~~
$ ls -l /mnt/local-storage/ocs-localblock/
~~~

I'd also like to gather the following from LSO:

~~~
// namespace might be local-storage
$ oc get localvolume -o yaml -n openshift-local-storage
$ oc get localvolumeset -o yaml -n openshift-local-storage
$ oc get localvolumediscovery -o yaml -n openshift-local-storage
~~~

I have a very strong suspicion the kernel picked up the devices in another order and the osds cannot find their backing device. This was caused by the EUS upgrade of OCP, as it does a rollout of MCPs that will reboot the nodes.

--- Additional comment from  on 2024-04-04 21:45:31 UTC ---

Hi,

My suspicion was wrong

// sym links for devices on storage-00 from lso:
[acmdy78@bastion ~]$ oc debug -q node/storage-00.dev-intranet-01-wob.ocp.vwgroup.com -- chroot /host ls -l /mnt/local-storage/ocs-localblock/
total 0
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sdp -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02507
lrwxrwxrwx. 1 root root 93 Apr  4 14:34 sdq -> /dev/ceph-e936c994-328c-4f59-8f1d-3a5573a7c64b/osd-block-aaced0de-8884-4551-a5ae-dd86ee436f23
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sdr -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sds -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02497
lrwxrwxrwx. 1 root root 50 Sep  9  2020 sdt -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02489


They have a somewhat unusual LSO config where each node has its own spec section with its devices listed. Anyway, they have the proper device defined in their LSO config for the node:
~~~
  spec:
    logLevel: Normal
    managementState: Managed
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - storage-00.dev-intranet-01-wob.ocp.vwgroup.com
    storageClassDevices:
    - devicePaths:
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02489
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02497
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02504
      - /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02507
      storageClassName: ocs-localblock
~~~

and the symlink that LSO knows about for sdr (ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490) points to the correct device. It seems the device names didn't change, and they are using by-id. Taking a step back... since the failure is with the expand-bluefs container:

~~~
  - containerID: cri-o://c4163c5dbd33cab921c113b80350ec20a3af48a2865f7ea43c68f4cdd61afc19
    image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
    imageID: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:3d7144d5fe515acf3bf4bbf6456ab8877a4f7cd553c933ca6fd4d891add53038
    lastState:
      terminated:
        containerID: cri-o://c4163c5dbd33cab921c113b80350ec20a3af48a2865f7ea43c68f4cdd61afc19
        exitCode: 1
        finishedAt: "2024-04-04T15:10:29Z"
        reason: Error
        startedAt: "2024-04-04T15:10:29Z"
    name: expand-bluefs
    ready: false
    restartCount: 37
    state:
      waiting:
        message: back-off 5m0s restarting failed container=expand-bluefs pod=rook-ceph-osd-1-6db9cfc7c9-294jn_openshift-storage(4cab01f5-438d-4ffc-a133-cd427bb1cda5)
        reason: CrashLoopBackOff
~~~

because it cannot find its block device:
~~~
$  omc logs rook-ceph-osd-1-6db9cfc7c9-294jn -c expand-bluefs
2024-04-04T15:10:29.457626768Z inferring bluefs devices from bluestore path
2024-04-04T15:10:29.457728034Z unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
2024-04-04T15:10:29.457728034Z 2024-04-04T15:10:29.456+0000 7fdba942e900 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (2) No such file or directory
~~~

What if we removed this expand-bluefs container? Maybe the OSD will start? If not, is the only option to replace the OSD(s)? We do have one node up, so we should still hopefully have some good copies. What about setting replica 1 on the pools?

--- Additional comment from Andreas Bleischwitz on 2024-04-05 07:58:58 UTC ---

Hi,

can we have at least an update that this issue is being investigated by engineering? The customer has now been suffering from this outage, which affects basically their complete development environment (they are developers, and therefore this cluster is their production environment), for about a day.
We currently do not have any idea how to re-enable the OSDs so that they would be able to work again.

Customer:
Volkswagen AG (#556879)


@muagarwa, @gsternag are you able to assist here?

Best regards,
/Andreas

--- Additional comment from Bipin Kunal on 2024-04-05 11:36:28 UTC ---

(In reply to Andreas Bleischwitz from comment #13)
> Hi,
> 
> can we have at least an update that this issue is being investigated by
> engineering? The customer is now suffering from that outage which affects
> basically their complete development environment (they are developers, and
> therefore this cluster is their production environment) since about one day.
> We currently do not have any idea how to re-enable the OSDs so that they
> would be able to work again.
> 
> Customer:
> Volkswagen AG (#556879)
> 
> 
> @muagarwa, @gsternag are you able to assist here?
> 
> Best regards,
> /Andreas

Hi Andreas, thanks for reaching out to me. I am trying to reach the engineering team. Meanwhile, it would be good to have a prio-list email if this is really urgent. Removing the needinfo on Mudit.

--- Additional comment from Radoslaw Zarzynski on 2024-04-05 11:58:00 UTC ---

On it.

--- Additional comment from Radoslaw Zarzynski on 2024-04-05 13:15:51 UTC ---

```
[supportshell-1.sush-001.prod.us-west-2.aws.redhat.com] [13:03:07+0000]
[rzarzyns@supportshell-1 03783266]$ cat ./0010-rook-ceph-osd-0-5d664bf845-mf956-expand-bluefs.log
inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2024-04-04T09:23:13.685+0000 7f9911e4c900 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
```

Let's read through Pacific's `ceph-bluestore-tool` code for
the above failure of the `bluefs-bdev-expand` command:

```
  else if (action == "bluefs-bdev-expand") {
    BlueStore bluestore(cct.get(), path);
    auto r = bluestore.expand_devices(cout);
    if (r <0) {
      cerr << "failed to expand bluestore devices: "
           << cpp_strerror(r) << std::endl;
      exit(EXIT_FAILURE);
    }
  }
```

```
int BlueStore::expand_devices(ostream& out)
{
  // ...
      if (_set_bdev_label_size(p, size) >= 0) {
        out << devid
          << " : size label updated to " << size
          << std::endl;
      }
```

```
int BlueStore::_set_bdev_label_size(const string& path, uint64_t size)
{
  bluestore_bdev_label_t label;
  int r = _read_bdev_label(cct, path, &label);
  if (r < 0) {
    derr << "unable to read label for " << path << ": "
          << cpp_strerror(r) << dendl;
  } else {
```

```
int BlueStore::_read_bdev_label(CephContext* cct, string path,
                                bluestore_bdev_label_t *label)
{
  dout(10) << __func__ << dendl;
  int fd = TEMP_FAILURE_RETRY(::open(path.c_str(), O_RDONLY|O_CLOEXEC));
  if (fd < 0) {
    fd = -errno;
    derr << __func__ << " failed to open " << path << ": " << cpp_strerror(fd)
         << dendl;
    return fd;
  }
  // ...
```

We can see that the direct underlying problem is the failure of
the `open()` syscall called on `/var/lib/ceph/osd/ceph-0/block`.
Whatever ends up inside the Rook container, COT is unable to
open it. Therefore it looks more like an orchestrator failure
than a Ceph one.

It seems a good idea to run a shell within the container's
environment and check, with standard Unix tools, the presence
of the `block` file. If it's a symlink, the target must be
possible to `open()` as well.
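
A minimal sketch of that check, assuming a shell with the OSD's data
directory mounted can be obtained (for example via `oc rsh` or `oc debug`
against the OSD pod; osd-0 is used as the example):

```
# does the block entry exist at all, and what is it?
ls -l /var/lib/ceph/osd/ceph-0/
# if it is a symlink, where does it point, and does the target resolve
# to an openable block device?
readlink -f /var/lib/ceph/osd/ceph-0/block
ls -lL /var/lib/ceph/osd/ceph-0/block
```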

Another thing that struck me is this error:

```
[supportshell-1.sush-001.prod.us-west-2.aws.redhat.com] [13:10:18+0000]
[rzarzyns@supportshell-1 03783266]$ cat ./0020-must-gather-odf.tar.gz/must-gather.local.8213025456072446876/inspect.local.6887047306785235156/namespaces/openshift-storage/pods/rook-ceph-osd-0-5d664bf845-mf956/config-init/config-init/logs/current.log
2024-04-04T09:02:08.139048570Z Error: ceph-username is required for osd
2024-04-04T09:02:08.139350828Z Usage:
2024-04-04T09:02:08.139350828Z   rook ceph osd init [flags]
2024-04-04T09:02:08.139350828Z 
2024-04-04T09:02:08.139350828Z Flags:
2024-04-04T09:02:08.139350828Z       --cluster-id string                 the UID of the cluster CR that owns this cluster
2024-04-04T09:02:08.139350828Z       --cluster-name string               the name of the cluster CR that owns this cluster
2024-04-04T09:02:08.139350828Z       --encrypted-device                  whether to encrypt the OSD with dmcrypt
2024-04-04T09:02:08.139350828Z   -h, --help                              help for init
2024-04-04T09:02:08.139350828Z       --is-device                         whether the osd is a device
2024-04-04T09:02:08.139350828Z       --location string                   location of this node for CRUSH placement
2024-04-04T09:02:08.139350828Z       --node-name string                  the host name of the node (default "rook-ceph-osd-0-5d664bf845-mf956")
2024-04-04T09:02:08.139350828Z       --osd-crush-device-class string     The device class for all OSDs configured on this node
2024-04-04T09:02:08.139350828Z       --osd-crush-initial-weight string   The initial weight of OSD in TiB units
2024-04-04T09:02:08.139350828Z       --osd-database-size int             default size (MB) for OSD database (bluestore)
2024-04-04T09:02:08.139350828Z       --osd-id int                        osd id for which to generate config (default -1)
2024-04-04T09:02:08.139350828Z       --osd-store-type string             the osd store type such as bluestore (default "bluestore")
2024-04-04T09:02:08.139350828Z       --osd-wal-size int                  default size (MB) for OSD write ahead log (WAL) (bluestore) (default 576)
2024-04-04T09:02:08.139350828Z       --osds-per-device int               the number of OSDs per device (default 1)
2024-04-04T09:02:08.139350828Z 
2024-04-04T09:02:08.139350828Z Global Flags:
2024-04-04T09:02:08.139350828Z       --log-level string   logging level for logging/tracing output (valid values: ERROR,WARNING,INFO,DEBUG) (default "INFO")
2024-04-04T09:02:08.139350828Z 
2024-04-04T09:02:08.139365263Z rook error: ceph-username is required for osd
```

I'm not a Rook expert, but this looks weird, especially taking
the upgrade into consideration. Is `rook ceph osd init` failing
early because of the unspecified user parameter, which leaves
the container's `block` uninitialized? Just speculating.

Best regards,
Radek

--- Additional comment from Bipin Kunal on 2024-04-05 13:23:30 UTC ---

Thanks Radek for checking. I will check if someone from the Rook team can have a look as well.

--- Additional comment from  on 2024-04-05 13:52:56 UTC ---

Hello,

@bkunal found the KCS https://access.redhat.com/solutions/7026462. I'm going to confirm whether this is the same issue in this case, and will post findings when I have any.

--- Additional comment from Bipin Kunal on 2024-04-05 13:57:53 UTC ---

(In reply to kelwhite from comment #18)
> Hello,
> 
> @bkunal found the KCS https://access.redhat.com/solutions/7026462 I'm going
> to confirm this is the same for this case. Will post findings when I have
> any.

Actually Shubham from the Rook team found it and gave it to me.

--- Additional comment from  on 2024-04-05 15:41:15 UTC ---

From the customer: 


// for osd-1:
~~~~
osd-1 is not using /dev/sdr, but I figured it out; see this path:

[acmdy78@bastion ~]$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-1 | grep ceph.rook.io/pvc
    ceph.rook.io/pvc: ocs-deviceset-0-1-78k4w
        ceph.rook.io/pvc: ocs-deviceset-0-1-78k4w
          - key: ceph.rook.io/pvc
[acmdy78@bastion ~]$ oc get pvc ocs-deviceset-0-1-78k4w
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS     AGE
ocs-deviceset-0-1-78k4w   Bound    local-pv-32532e89   1788Gi     RWO            ocs-localblock   3y208d
[acmdy78@bastion ~]$ oc get pv local-pv-32532e89 -o custom-columns=NAME:.metadata.name,PATH:.spec.local.path,NODE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values
NAME                PATH                                    NODE
local-pv-32532e89   /mnt/local-storage/ocs-localblock/sdr   [storage-00.dev-intranet-01-wob.ocp.vwgroup.com]
[acmdy78@bastion ~]$ oc debug -q node/storage-00.dev-intranet-01-wob.ocp.vwgroup.com
sh-4.4# chroot /host
sh-5.1# ls -lah /mnt/local-storage/ocs-localblock/sdr
lrwxrwxrwx. 1 root root 50 Sep  9  2020 /mnt/local-storage/ocs-localblock/sdr -> /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
sh-5.1# ls -lah /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490
lrwxrwxrwx. 1 root root 9 Apr  5 14:58 /dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0KA02490 -> ../../sdo
sh-5.1# ls -lah /dev/sdo
brw-rw----. 1 root disk 8, 224 Apr  5 14:58 /dev/sdo
sh-5.1# lsblk /dev/sdo
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdo                                                                                                     8:224  0  1.7T  0 disk
`-ceph--f557e476--7bd4--41a0--9323--d6061a4318b3-osd--block--7f80e2ac--e21f--4aa6--8886--ec94d0387196 253:5    0  1.7T  0 lvm
sh-5.1# head --bytes=60 /dev/sdo
sh-5.1# ls -lah /dev/ceph-
ceph-309180b2-697b-473a-a19c-d00cec94427a/ ceph-b100dea8-0b24-4d9b-97c8-ed6dba1bd10d/ ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/
ceph-3c187f41-d43e-4bb2-9421-97f78db94d28/ ceph-e936c994-328c-4f59-8f1d-3a5573a7c64b/
sh-5.1# ls -lah /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196
lrwxrwxrwx. 1 root root 7 Apr  4 10:54 /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196 -> ../dm-5
sh-5.1# ls -lah /dev/dm-5
brw-rw----. 1 root disk 253, 5 Apr  4 10:54 /dev/dm-5
sh-5.1# head --bytes=60 /dev/dm-5
bluestore block device
7f80e2ac-e21f-4aa6-8886-ec94d0387196

The bluestore block device ID of the logical volume is the same as the one you get for osd-1 in the ceph osd dump command.
Could the problem be caused by this LVM layer? On clusters where we installed ODF later, I don't see that LVM is used. This cluster started with OCS 4.3 or 4.4.
~~~~

--- Additional comment from  on 2024-04-05 17:12:25 UTC ---

Hi,

Update... We've found the block devices don't exist on the nodes (this is from storage-00):

~~~
/var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w:
total 0
drwxr-xr-x. 2 root root      6 Apr  3 14:50 ceph-1
brw-rw-rw-. 1 root disk 8, 224 Apr  4 10:57 ocs-deviceset-0-1-78k4w

/var/lib/rook/openshift-storage/ocs-deviceset-0-1-78k4w/ceph-1:
total 0

/var/lib/rook/openshift-storage/ocs-deviceset-0-2-2p2fc:
total 0
drwxr-xr-x. 2 root root     6 Apr  3 14:50 ceph-2
brw-rw-rw-. 1 root disk 65, 0 Apr  4 10:57 ocs-deviceset-0-2-2p2fc

/var/lib/rook/openshift-storage/ocs-deviceset-0-2-2p2fc/ceph-2:
total 0

/var/lib/rook/openshift-storage/ocs-deviceset-0-3-lh2tq:
total 0
drwxr-xr-x. 2 root root      6 Apr  3 14:50 ceph-3
brw-rw-rw-. 1 root disk 65, 16 Apr  4 10:57 ocs-deviceset-0-3-lh2tq

/var/lib/rook/openshift-storage/ocs-deviceset-0-3-lh2tq/ceph-3:
total 0

/var/lib/rook/openshift-storage/ocs-deviceset-0-4-wfm22:
total 0
drwxr-xr-x. 2 root root     10 Apr  3 14:50 ceph-4
brw-rw-rw-. 1 root disk 253, 4 Apr  4 14:49 ocs-deviceset-0-4-wfm22

/var/lib/rook/openshift-storage/ocs-deviceset-0-4-wfm22/ceph-4:
total 0
~~~

We need to confirm why these are gone. The current ask from engineering is to explain why these devices vanished. Would Rook do anything with this? Can we find anything that will help?

We're confirming the devices are gone on the other nodes and starting the OSD replacement process via a remote call.

--- Additional comment from  on 2024-04-05 19:50:49 UTC ---

Hello All,

On a remote session with the customer, we've confirmed there is no data loss, phew. The issue seems to be with ceph-volume: it's not activating the device. We did the activation manually via the steps below and got osd-9 up and running:

~~~
- Creating a backup of the osd-9 deployment, we're going to remove the liveness probe
- scaled down the rook-ceph and ocs-operators
- oc edit the osd-9 deployment and searched for the expand-bluefs section and removed the container
- oc get pods to see if osd-9 came up (still 1/2) and rshed into the container
   - ceph-volume lvm list 
   - ceph-volume lvm activate --no-systemd -- 9 79021ece-c52a-46d1-8e99-69640a926822 // this is the osd fsid from ceph-volume lvm list
   - The osd was activated and when we viewed the osd data dir, the block device was listed:
      - ls -l /var/lib/ceph/osd/ceph-{id}
~~~
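
A consolidated sketch of those manual steps, for illustration only (osd-9 and its OSD fsid are the values from this case; the exact init-container position and names should be checked in each deployment before editing):

~~~
# stop the operators so they do not revert the manual edit
oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
oc -n openshift-storage scale deployment ocs-operator --replicas=0

# back up the OSD deployment, then edit it and remove the expand-bluefs init container
oc -n openshift-storage get deployment rook-ceph-osd-9 -o yaml > rook-ceph-osd-9.yaml.bak
oc -n openshift-storage edit deployment rook-ceph-osd-9

# once the pod is up (1/2), activate the OSD manually inside it
oc -n openshift-storage rsh deploy/rook-ceph-osd-9
ceph-volume lvm list
ceph-volume lvm activate --no-systemd 9 79021ece-c52a-46d1-8e99-69640a926822
ls -l /var/lib/ceph/osd/ceph-9
~~~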

We're looking to get some ceph-volume logs to determine what's going on... We might need to create another BZ for ceph-volume, but we will know more once we review the fresh ODF must-gather.

--- Additional comment from Travis Nielsen on 2024-04-05 20:56:28 UTC ---

Great to see the OSDs can be brought back up with the workaround and there is no data loss. 

These old LVM-based OSDs that were created (IIRC only in 4.2 and 4.3) are going to be a problem to maintain. We simply don't have tests that cover upgrades of OSDs created 10+ releases ago. For this configuration, which has not been supported for so long, the way to keep supporting such an old cluster will be to replace each of the OSDs. By purging each OSD one at a time and bringing up a new one, the OSDs can be brought to a current configuration.

It would not surprise me if in 4.14 there was an update to ceph-volume that caused this issue, because we just haven't tested this configuration for so long.

Guillaume, agreed that old LVM-based OSDs should be replaced?
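
For reference, the usual per-OSD replacement flow would look roughly like the following. This is only a sketch based on the documented ODF device-replacement procedure; the template name, parameters, and any prerequisite steps must be verified against the installed ODF version and the official documentation before use.

~~~
# scale down the OSD being replaced (osd id 1 used purely as an example)
oc -n openshift-storage scale deployment rook-ceph-osd-1 --replicas=0

# run the OSD removal job for that id
oc -n openshift-storage process ocs-osd-removal -p FAILED_OSD_IDS=1 | oc -n openshift-storage create -f -

# after the removal job completes and the old PVC/PV are cleaned up,
# the operators re-create the OSD on the freed device in the current (raw) format
~~~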

--- Additional comment from Prashant Dhange on 2024-04-05 21:18:53 UTC ---

Additional details for completeness:

(In reply to kelwhite from comment #22)
> Hello All,
> 
> On a remote with the customer. We've confirmed no data loss, phew. Seems the
> issue is with ceph-volume, it's not activating the device. We tried to do
> this manually via the below and got osd-9 up and running:
> 
> ~~~
> - Creating a backup of the osd-9 deployment, we're going to remove the
> liveness probe
> - scaled down the rook-ceph and ocs-operators
> - oc edit the osd-9 deployment and searched for the expand-bluefs section
> and removed the container
> - oc get pods to see if osd-9 came up (still 1/2) and rshed info the
> container
>    - ceph-volume lvm list 
All LVs associated with the Ceph cluster are listed here, and lsblk/lvs recognize these LVs.

>    - ceph-volume lvm active --no-systemd -- 9
> 79021ece-c52a-46d1-8e99-69640a926822 // this is the osd fsid from
> ceph-volume lvm list
>    - The osd was activated and when we viewed the osd data dir, the block
> device was listed:
>       - ls -l '/var/lib/ceph/osd/ceph-{id}
     - Start osd.9 
     # ceph-osd --id 9 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6  --setuser ceph --setgroup ceph --crush-location="root=default host=storage-01-dev-intranet-01-wob-ocp-vwgroup-com rack=rack1" --log-to-stderr=true --err-to-stderr=true
 --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
    NOTE : The OSD daemon will run in the background and it's safe to exit the container here.

--- Additional comment from Prashant Dhange on 2024-04-05 22:14:32 UTC ---

The latest provided must-gather logs and Ceph logs do not shed any light on the failure of OSD directory priming or of ceph-volume activating the OSD device.

The next action plan :
- Apply the workaround for every OSD on the cluster, refer comment#24
- Get all OSDs up/in and all PGs active+clean
- Re-deploy all OSDs one-by-one.

For other clusters which might experience similar issues, the recommendation is to re-deploy all the OSDs first and only then go for the cluster upgrade from 4.12.47 to 4.14.16.

Let me know if you need any help on recovering this cluster.

--- Additional comment from Prashant Dhange on 2024-04-05 22:58:31 UTC ---

(In reply to Prashant Dhange from comment #24)
...
>      - Start osd.9 
>      # ceph-osd --id 9 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6 
> --setuser ceph --setgroup ceph --crush-location="root=default
> host=storage-01-dev-intranet-01-wob-ocp-vwgroup-com rack=rack1"
> --log-to-stderr=true --err-to-stderr=true
>  --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug
> --default-log-to-file=false --default-mon-cluster-log-to-file=false
> --ms-learn-addr-from-peer=false
>     NOTE : The OSD daemon will run in background and it's safe to exist the
> container here.

In the ceph-osd run command, change crush-location according to the `ceph osd tree` output or copy it from the osd deployment config (under the spec.containers section). Do not forget to add double quotes around the crush-location value.

e.g
# oc get deployment rook-ceph-osd-9 -o yaml
spec:
  affinity:
...
  containers:
  - args:
    - ceph
    - osd
    - start
    - --
    - --foreground
    - --id
    - "9"
    - --fsid
    - 18c9800f-7f91-4994-ad32-2a8a330babd6
    - --setuser
    - ceph
    - --setgroup
    - ceph
    - --crush-location=root=default host=storage-01-dev-intranet-01-wob-ocp-vwgroup-com rack=rack1

--- Additional comment from Rafrojas on 2024-04-06 07:19:41 UTC ---

Hi Prashant

  I joined the call with the customer and we applied the workaround: we edited the deployment of each OSD and removed the expand-bluefs init container from it. We have a backup of all the deployments if required. After that, Ceph started the recovery and finished after some time. A new must-gather was collected and is available on the case; there's a WARN on Ceph:

    health: HEALTH_WARN
            15 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

  I also requested collection of the /var/log/ceph ceph-volume logs for the RCA; Donny will collect them during the day.

  We agreed to wait until the new data is checked before continuing with the next steps. We cannot confirm that the applications are working fine, because the developers' shifts are Mon-Fri, but we can see that the cluster looks to be in better shape, with all operators running.

Regards
Rafa

--- Additional comment from Rafrojas on 2024-04-06 08:59:04 UTC ---

Hi Prashant

  Ceph logs collected and attached to the case; waiting for your instructions for the next steps.

Regards
Rafa

--- Additional comment from Rafrojas on 2024-04-06 12:12:03 UTC ---

Hi Prashant

  CU is waiting for some feedback; they are running this cluster in an abnormal state. NA will join the shift soon; I'll add the handover and status from the last call on the case. Please let us know the next steps to share with CU ASAP.

Regards
Rafa

--- Additional comment from Prashant Dhange on 2024-04-07 02:55:59 UTC ---

Hi Rafa,

(In reply to Rafrojas from comment #27)
> Hi Prashant
> 
>   I joined the call with customer and we aplied the Workaround, we edited
> the deployment of each OSD and removed the expand-bluefs args from that, we
> have a backup of all the deployments if required.
Good to know that all OSDs are up and running after applying the workaround. 

There is a quick way to patch the OSD deployment to remove the expand-bluefs init container using the oc patch command:
# oc patch deployment rook-ceph-osd-<osdid> --type=json -p='[{"op": "remove", "path": "/spec/template/spec/initContainers/3"}]'
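
Since the init-container index can vary between deployments, it may be worth listing the init containers first to confirm which position expand-bluefs occupies (an illustrative check, not part of the original instructions):
# oc -n openshift-storage get deployment rook-ceph-osd-<osdid> -o jsonpath='{.spec.template.spec.initContainers[*].name}{"\n"}'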

> After that ceph started
> the recovery and finished after some time, a new must-gather is collected
> and available on the case, there's a WARN on ceph:
> 
>     health: HEALTH_WARN
>             15 OSD(s) reporting legacy (not per-pool) BlueStore omap usage
> stats
This warning is because the OSDs were created pre-Octopus. The warning will be addressed as we are re-deploying the OSDs. If we were not planning to re-deploy the OSDs, then you need to set `ceph config rm osd bluestore_fsck_quick_fix_on_mount` and restart the OSDs. Refer to KCS solution https://access.redhat.com/solutions/7041554 for more details.

> 
>   I also requested to collect the /var/log/ceph ceph volume logs for the
> RCA, Donny will collect along the day.
The latest logs have been analyzed and Guillaume was able to find the RCA for the issue. The RCA has been provided in comment BZ-2273724#c3.

--- Additional comment from Bob Emerson on 2024-04-07 17:25:09 UTC ---

Customer has posted an update regarding his notes and status in case 03783266



STATUS update:

Hi,


Also, the migration of the last OSDs from LVM to raw is now completed, and I have reset the min_size of the pools back to 2.


bash-5.1$ ceph osd unset nobackfill
nobackfill is unset
bash-5.1$ ceph -s
  cluster:
    id:     18c9800f-7f91-4994-ad32-2a8a330babd6
    health: HEALTH_WARN
            Degraded data redundancy: 1358206/4099128 objects degraded (33.134%), 295 pgs degraded, 295 pgs undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,f,g (age 5m)
    mgr: a(active, since 14m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 15 up (since 2m), 15 in (since 2m); 295 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 313 pgs
    objects: 1.37M objects, 955 GiB
    usage:   1.7 TiB used, 24 TiB / 26 TiB avail
    pgs:     1358206/4099128 objects degraded (33.134%)
             14864/4099128 objects misplaced (0.363%)
             290 active+undersized+degraded+remapped+backfill_wait
             18  active+clean
             5   active+undersized+degraded+remapped+backfilling

  io:
    client:   20 KiB/s rd, 136 KiB/s wr, 3 op/s rd, 9 op/s wr
    recovery: 2.0 MiB/s, 4.79k keys/s, 817 objects/s

bash-5.1$ ceph osd df tree
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         26.19896         -   26 TiB  1.7 TiB  1.7 TiB  1.6 GiB   11 GiB   24 TiB   6.60  1.00    -          root default
 -4          8.73299         -  8.7 TiB  886 GiB  879 GiB  853 MiB  6.0 GiB  7.9 TiB   9.90  1.50    -              rack rack0
 -3          8.73299         -  8.7 TiB  886 GiB  879 GiB  853 MiB  6.0 GiB  7.9 TiB   9.90  1.50    -                  host storage-00-dev-intranet-01-wob-ocp-vwgroup-com
  0    ssd   1.74660   1.00000  1.7 TiB  177 GiB  176 GiB  165 MiB  1.0 GiB  1.6 TiB   9.91  1.50   58      up              osd.0
  1    ssd   1.74660   1.00000  1.7 TiB  199 GiB  197 GiB  261 MiB  1.4 GiB  1.6 TiB  11.13  1.69   65      up              osd.1
  2    ssd   1.74660   1.00000  1.7 TiB  172 GiB  170 GiB   94 MiB  1.4 GiB  1.6 TiB   9.62  1.46   58      up              osd.2
  3    ssd   1.74660   1.00000  1.7 TiB  171 GiB  169 GiB  144 MiB  1.0 GiB  1.6 TiB   9.54  1.45   64      up              osd.3
  4    ssd   1.74660   1.00000  1.7 TiB  167 GiB  165 GiB  189 MiB  1.1 GiB  1.6 TiB   9.32  1.41   68      up              osd.4
-12          8.73299         -  8.7 TiB  884 GiB  879 GiB  798 MiB  4.3 GiB  7.9 TiB   9.88  1.50    -              rack rack1
-11          8.73299         -  8.7 TiB  884 GiB  879 GiB  798 MiB  4.3 GiB  7.9 TiB   9.88  1.50    -                  host storage-01-dev-intranet-01-wob-ocp-vwgroup-com
  6    ssd   1.74660   1.00000  1.7 TiB  185 GiB  184 GiB  128 MiB  940 MiB  1.6 TiB  10.32  1.56   62      up              osd.6
  7    ssd   1.74660   1.00000  1.7 TiB  200 GiB  199 GiB  207 MiB  1.0 GiB  1.6 TiB  11.18  1.69   71      up              osd.7
  8    ssd   1.74660   1.00000  1.7 TiB  161 GiB  160 GiB  173 MiB  939 MiB  1.6 TiB   8.98  1.36   63      up              osd.8
  9    ssd   1.74660   1.00000  1.7 TiB  181 GiB  180 GiB  137 MiB  848 MiB  1.6 TiB  10.11  1.53   64      up              osd.9
 10    ssd   1.74660   1.00000  1.7 TiB  158 GiB  157 GiB  153 MiB  629 MiB  1.6 TiB   8.83  1.34   53      up              osd.10
 -8          8.73299         -  8.7 TiB  693 MiB  121 MiB      0 B  572 MiB  8.7 TiB   0.01  0.00    -              rack rack2
 -7          8.73299         -  8.7 TiB  693 MiB  121 MiB      0 B  572 MiB  8.7 TiB   0.01  0.00    -                  host storage-02-dev-intranet-01-wob-ocp-vwgroup-com
  5    ssd   1.74660   1.00000  1.7 TiB  117 MiB   19 MiB      0 B   97 MiB  1.7 TiB   0.01     0    3      up              osd.5
 11    ssd   1.74660   1.00000  1.7 TiB  144 MiB   27 MiB      0 B  117 MiB  1.7 TiB   0.01  0.00    5      up              osd.11
 12    ssd   1.74660   1.00000  1.7 TiB  148 MiB   27 MiB      0 B  121 MiB  1.7 TiB   0.01  0.00    7      up              osd.12
 13    ssd   1.74660   1.00000  1.7 TiB  143 MiB   23 MiB      0 B  119 MiB  1.7 TiB   0.01  0.00    2      up              osd.13
 14    ssd   1.74660   1.00000  1.7 TiB  141 MiB   23 MiB      0 B  117 MiB  1.7 TiB   0.01  0.00    6      up              osd.14
                         TOTAL   26 TiB  1.7 TiB  1.7 TiB  1.6 GiB   11 GiB   24 TiB   6.60
MIN/MAX VAR: 0/1.69  STDDEV: 4.70
bash-5.1$ for i in .rgw.root ocs-storagecluster-cephblockpool ocs-storagecluster-cephfilesystem-metadata ocs-storagecluster-cephobjectstore.rgw.control ocs-storagecluster-cephfilesystem-data0 ocs-storagecluster-cephobjectstore.rgw.meta ocs-storagecluster-cephobjectstore.rgw.log ocs-storagecluster-cephobjectstore.rgw.buckets.index ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec ocs-storagecluster-cephobjectstore.rgw.buckets.data .mgr ocs-storagecluster-cephobjectstore.rgw.otp ; do ceph osd pool set $i min_size 2 ; done
set pool 1 min_size to 2
set pool 2 min_size to 2
set pool 3 min_size to 2
set pool 4 min_size to 2
set pool 5 min_size to 2
set pool 6 min_size to 2
set pool 7 min_size to 2
set pool 8 min_size to 2
set pool 9 min_size to 2
set pool 10 min_size to 2
set pool 11 min_size to 2
set pool 12 min_size to 2
I am now waiting for the backfill to finish.


Regards Donny


--------------------------------------------------------------------------------------------------------------------------------------------------


NOTES update(attached files to BZ)

Hi,

I will upload the latest version of my notes and a detailed output of the commands I ran to manually remove the local-storage LVM volumes, for your reference.
I have seen in the diskmaker-manager pods (local-storage operator) that it had problems removing the LVM disks, which made it necessary to remove the volume groups and logical volumes manually. I include the logs from the diskmaker-manager so you can have a look at whether there is a bug in the local-storage operator regarding deleting LVM ocs-localblock volumes.


Regards Donny

--- Additional comment from Prashant Dhange on 2024-04-09 19:53:14 UTC ---

(In reply to Prashant Dhange from comment #30)
> Hi Rafa,
...
> > After that ceph started
> > the recovery and finished after some time, a new must-gather is collected
> > and available on the case, there's a WARN on ceph:
> > 
> >     health: HEALTH_WARN
> >             15 OSD(s) reporting legacy (not per-pool) BlueStore omap usage
> > stats
> This warning is because OSDs were created pre-octopus release. This warning
> will be addressed as we are re-deploying the OSDs. If we were not planning
> to re-deploy the OSDs then you need to set `ceph config rm osd
> bluestore_fsck_quick_fix_on_mount` and restart the OSDs. 

Correction. I meant to say:

This warning is because OSDs were created pre-octopus release. This warning
will be addressed as we are re-deploying the OSDs. If we were not planning
to re-deploy the OSDs then you need to set 
`ceph config set osd bluestore_fsck_quick_fix_on_mount true`, restart the OSDs
and then `ceph config rm osd bluestore_fsck_quick_fix_on_mount`. 

> Refer KCS solution
> https://access.redhat.com/solutions/7041554 for more details.
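
Put together, the corrected sequence would look roughly like the sketch below (the ceph commands run from the rook-ceph-tools pod; restarting the OSDs via a deployment rollout is just one illustrative way to do it):

~~~
# from the rook-ceph-tools pod
ceph config set osd bluestore_fsck_quick_fix_on_mount true

# restart the OSDs, e.g. one deployment at a time from outside the toolbox
oc -n openshift-storage rollout restart deployment rook-ceph-osd-<id>

# once all OSDs have restarted and converted their omap usage stats
ceph config rm osd bluestore_fsck_quick_fix_on_mount
~~~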

--- Additional comment from Prashant Dhange on 2024-04-09 21:11:36 UTC ---

We are still getting more details about the ODF upgrade history from the customer.

Based on the available data, here are the steps to reproduce this issue:
- Deploy a 4.3.18 cluster with LVM-based OSDs
- Upgrade to ODF 4.4 and then to every subsequent release up to 4.13.7, e.g. from 4.4 to 4.5 to 4.6 and so on
- Verify that the ODF cluster is healthy and that no daemon crashes are observed (specifically OSDs)
- Upgrade from 4.13.7 to 4.14.16
- Observe that the OSDs are stuck in a CrashLoopBackOff (CLBO) state

--- Additional comment from Prashant Dhange on 2024-04-09 23:11:46 UTC ---

Okay. The issue is not related to ceph-volume at all. The problem is that the OSDs were deployed on an OCS 4.3 cluster, so the deployment config has different initContainers compared to later ODF versions (probably 4.9 or later).

Init container sequence for the 4.3 deployment config (refer to point [2] below):
Container-1 : ## Init Container 1 : rook ceph osd init  
Container-2 : ## Init Container 2 : Copy rook command to OSD pod
Container-3 : ## Init Container 3 : expand-bluefs
Container-4 : ## Init Container 4 : chown ceph directories

Then the actual OSD container starts, which executes the "ceph osd start" script; this internally calls ceph-volume lvm activate and then the ceph-osd command.
Container-5: ceph osd start (refer to points [1] and [3] below)

When the customer upgraded to 4.14.16, the "rook ceph osd init" container failed to mount the OSD data directory. Due to this, the expand-bluefs container failed to start and exited with the "_read_bdev_label failed to open /var/lib/ceph/osd/ceph-<osdid>/block: (2) No such file or directory" error.

When we removed expand-bluefs init container as a workaround, the ceph osd started successfully as Container-5 (ceph osd start) was able to execute the lvm activate and start ceph-osd daemon. When I was on the remote session for the first time, we were able to start (after removing expand-bluefs init container) osd.9 manually after executing the lvm activate command then the ceph-osd command.
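
A rough sketch of that workaround (the deployment name and the init container index are assumptions; the operator has to be scaled down first, otherwise it reconciles the deployment back to its original spec):

  # stop the operator from reverting manual edits to the OSD deployment
  oc scale deployment rook-ceph-operator -n openshift-storage --replicas=0

  # drop the expand-bluefs entry from the initContainers list
  # (index 2 is an assumption; verify the actual position first, e.g. with the
  # jsonpath command shown earlier)
  oc patch deployment rook-ceph-osd-9 -n openshift-storage --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/initContainers/2"}]'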

[3] ceph osd start logs 
2024-04-06T05:46:41.593349071Z + set -o nounset
2024-04-06T05:46:41.593349071Z + child_pid=
2024-04-06T05:46:41.593427396Z + sigterm_received=false
2024-04-06T05:46:41.593427396Z + trap sigterm SIGTERM
2024-04-06T05:46:41.593576845Z + child_pid=52
2024-04-06T05:46:41.593589922Z + wait 52
2024-04-06T05:46:41.593726159Z + /rook/rook ceph osd start -- --foreground --id 1 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6 --setuser ceph --setgroup ceph '--crush-location=root=default host=storage-00-dev-intranet-01-wob-ocp-vwgroup-com rack=rack0' --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true '--default-log-stderr-prefix=debug ' --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
2024-04-06T05:46:41.626980032Z 2024-04-06 05:46:41.626898 I | rookcmd: starting Rook v4.14.6-0.7522dc8ddafd09860f2314db3965ef97671cd138 with arguments '/rook/rook ceph osd start -- --foreground --id 1 --fsid 18c9800f-7f91-4994-ad32-2a8a330babd6 --setuser ceph --setgroup ceph --crush-location=root=default host=storage-00-dev-intranet-01-wob-ocp-vwgroup-com rack=rack0 --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug  --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false'
2024-04-06T05:46:41.626980032Z 2024-04-06 05:46:41.626956 I | rookcmd: flag values: --block-path=/dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196, --help=false, --log-level=INFO, --lv-backed-pv=true, --osd-id=1, --osd-store-type=, --osd-uuid=7f80e2ac-e21f-4aa6-8886-ec94d0387196, --pvc-backed-osd=true
2024-04-06T05:46:41.626980032Z 2024-04-06 05:46:41.626960 I | ceph-spec: parsing mon endpoints: g=100.69.195.205:3300,f=100.70.70.134:6789,b=100.70.78.99:6789
2024-04-06T05:46:41.628815634Z 2024-04-06 05:46:41.628788 I | cephosd: Successfully updated lvm config file "/etc/lvm/lvm.conf"
2024-04-06T05:46:41.925092800Z 2024-04-06 05:46:41.925022 I | exec: Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
2024-04-06T05:46:41.928518615Z 2024-04-06 05:46:41.928499 I | exec: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
2024-04-06T05:46:41.931919054Z 2024-04-06 05:46:41.931906 I | exec: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196 --path /var/lib/ceph/osd/ceph-1 --no-mon-config
2024-04-06T05:46:41.954830230Z 2024-04-06 05:46:41.954808 I | exec: Running command: /usr/bin/ln -snf /dev/ceph-f557e476-7bd4-41a0-9323-d6061a4318b3/osd-block-7f80e2ac-e21f-4aa6-8886-ec94d0387196 /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:41.957864812Z 2024-04-06 05:46:41.957851 I | exec: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:41.961270909Z 2024-04-06 05:46:41.961255 I | exec: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-5
2024-04-06T05:46:41.964681164Z 2024-04-06 05:46:41.964667 I | exec: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
2024-04-06T05:46:41.967586406Z 2024-04-06 05:46:41.967574 I | exec: --> ceph-volume lvm activate successful for osd ID: 1
2024-04-06T05:46:42.029385070Z 2024-04-06 05:46:42.028473 I | exec: debug 2024-04-06T05:46:42.027+0000 7fa35830c5c0  0 set uid:gid to 167:167 (ceph:ceph)
2024-04-06T05:46:42.029462802Z 2024-04-06 05:46:42.029394 I | exec: debug 2024-04-06T05:46:42.027+0000 7fa35830c5c0  0 ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable), process ceph-osd, pid 133
2024-04-06T05:46:42.029462802Z 2024-04-06 05:46:42.029437 I | exec: debug 2024-04-06T05:46:42.027+0000 7fa35830c5c0  0 pidfile_write: ignore empty --pid-file
2024-04-06T05:46:42.029899768Z 2024-04-06 05:46:42.029860 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  1 bdev(0x55a4d1b87c00 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:42.029959756Z 2024-04-06 05:46:42.029947 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  0 bdev(0x55a4d1b87c00 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument
2024-04-06T05:46:42.030424427Z 2024-04-06 05:46:42.030409 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  1 bdev(0x55a4d1b87c00 /var/lib/ceph/osd/ceph-1/block) open size 1920378863616 (0x1bf1f800000, 1.7 TiB) block_size 4096 (4 KiB) non-rotational discard supported
2024-04-06T05:46:42.030649989Z 2024-04-06 05:46:42.030627 I | exec: debug 2024-04-06T05:46:42.029+0000 7fa35830c5c0  1 bluestore(/var/lib/ceph/osd/ceph-1) _set_cache_sizes cache_size 3221225472 meta 0.45 kv 0.45 data 0.06
2024-04-06T05:46:42.030665356Z 2024-04-06 05:46:42.030652 I | exec: debug 2024-04-06T05:46:42.030+0000 7fa35830c5c0  1 bdev(0x55a4d1b87400 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2024-04-06T05:46:42.030775141Z 2024-04-06 05:46:42.030763 I | exec: debug 2024-04-06T05:46:42.030+0000 7fa35830c5c0  0 bdev(0x55a4d1b87400 /var/lib/ceph/osd/ceph-1/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-1/block failed: (22) Invalid argument

So we need to find out why "rook ceph osd init" was failing to mount the OSD data dir.

@Travis Any thoughts on "rook ceph osd init" failure ?

--- Additional comment from Prashant Dhange on 2024-04-09 23:29:39 UTC ---

(In reply to Prashant Dhange from comment #34)
...
> 
> So we need to find out why "rook ceph osd init" was failing to mount the OSD
> data dir.
> 
> @Travis Any thoughts on "rook ceph osd init" failure ?
The commit https://github.com/rook/rook/commit/33e824a323291de1a261b70e9bd255d5049ee02b likely caused this issue, as it removed the fsid and username configs from the env vars.

--- Additional comment from Travis Nielsen on 2024-04-10 02:13:31 UTC ---

(In reply to Prashant Dhange from comment #35)
> (In reply to Prashant Dhange from comment #34)
> ...
> > 
> > So we need to find out why "rook ceph osd init" was failing to mount the OSD
> > data dir.
> > 
> > @Travis Any thoughts on "rook ceph osd init" failure ?
> https://github.com/rook/rook/commit/33e824a323291de1a261b70e9bd255d5049ee02b
> commit likely caused this issue as we have removed the fsid, username
> configs from the env vars.

That commit was also backported all the way to 4.10 [1], so this change was not new in 4.14.
The error about the missing ceph-username parameter must be getting ignored, despite it being raised in the init container.
It would be really helpful if we could repro this, first looking at the OSD spec and logs in 4.13,
and then upgrading to 4.14 to see what changed in the OSD spec. 

I suspect if the "osd init" container fails, the ceph.conf would not be present and cause the bluefs expand container to fail. 
But I am confused why the "osd init" container failure did not abort starting the OSD in the first place. 
Init containers are not supposed to continue to the next one if they fail.
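
For reference, one way to inspect the per-init-container status on an affected OSD pod (the pod name is a placeholder):

  oc get pod rook-ceph-osd-1-<hash> -n openshift-storage \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'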

I still need to dig more, but in the meantime the repro would help.

[1] https://github.com/red-hat-storage/rook/commit/673c331a072a9de41ab2aac5405600104bd44ef2

--- Additional comment from Travis Nielsen on 2024-04-19 22:21:59 UTC ---

Thus far we have not been able to repro:
1. QE is not able to install OCS 4.3 given how much the QE infrastructure has changed since that release ~4 years ago.
2. OCS/ODF only created these types of affected OSDs in 4.3. Since then, all OSDs are created in a different mode that is unaffected (raw mode).
3. Rook upstream has also not had an option to create these affected types of OSDs for a long time. 

The lowest-risk approach to prevent other customers who have also upgraded since OCS 4.3 from hitting this issue is to remove this expand init container from these types of OSDs. The only downside is that these legacy OSDs won't be resizable. However, resizing is a rare operation, and it can be remedied by wiping and replacing the legacy OSDs. 

Removing the resize container in this case is very simple. Moving to POST with this fix.
I recommend backporting this to 4.14. 

We also need to consider raising an alert for users to replace these OSDs when detected in the cluster.
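
Until such an alert exists, a rough heuristic for spotting legacy LVM-based OSDs is the --lv-backed-pv=true value visible in the rook osd start flag values in the logs above. A hypothetical check (the label selector and the exact form of the setting, CLI flag vs. env var, are assumptions, hence the loose grep):

  for d in $(oc get deployment -n openshift-storage -l app=rook-ceph-osd -o name); do
    echo "== $d =="
    # print the lv-backed-pv setting and its value, whichever form it takes
    oc get "$d" -n openshift-storage -o yaml | grep -iA1 'lv.backed.pv' || echo "no lv-backed-pv setting found"
  done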

Comment 9 Vishakha Kathole 2024-06-04 11:43:23 UTC
Moving it to the verified state based on the 4.15 CI regression runs.

Comment 15 errata-xmlrpc 2024-06-11 16:41:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.15.3 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:3806