Bug 2263023

Summary: Segmentation fault observed on the OSD when running the COT command to list omap entries for an object in an EC pool
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Pawan <pdhiran>
Component: RADOS    Assignee: Adam Kupczyk <akupczyk>
Status: CLOSED ERRATA QA Contact: Harsh Kumar <hakumar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.0    CC: akupczyk, bhubbard, ceph-eng-bugs, cephqe-warriors, dwalveka, ngangadh, nojha, rzarzyns, tserlin, vumrao
Target Milestone: ---   
Target Release: 7.1z2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-18.2.1-244.el9cp Doc Type: Bug Fix
Doc Text:
Previously, the code did not check whether the ObjectStore collection (the on-disk equivalent of a PG) existed, so accessing the resulting null collection handle caused segmentation faults. With this fix, the code checks the handle, skips the operation when it is null, and COT prints that the collection does not exist (see the sketch after this header block).
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-11-07 14:38:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
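
The following is a minimal sketch of the kind of guard the Doc Text describes, for illustration only; the helper name, includes, and exact error handling are assumptions and not the actual upstream patch. The idea is that ceph-objectstore-tool opens the collection first and bails out with a message when the handle is null, instead of handing a null handle to BlueStore::collection_list() as seen in the backtrace below.

#include <cerrno>
#include <iostream>
#include <vector>
#include "os/ObjectStore.h"   // Ceph in-tree header; this sketch builds only inside the Ceph source tree

// Hypothetical helper modelled on the tool's _action_on_all_objects_in_pg();
// the name and signature are illustrative.
static int list_objects_in_pg(ObjectStore *store, coll_t coll)
{
  // open_collection() returns a null handle when the collection (PG) does not
  // exist on this OSD; dereferencing that handle in collection_list() is what
  // previously caused the segmentation fault.
  ObjectStore::CollectionHandle ch = store->open_collection(coll);
  if (!ch) {
    std::cerr << "Collection " << coll << " does not exist" << std::endl;
    return -ENOENT;  // skip the operation instead of crashing
  }

  ghobject_t next;
  while (!next.is_max()) {
    std::vector<ghobject_t> objects;
    int r = store->collection_list(ch, next, ghobject_t::get_max(), 100,
                                   &objects, &next);
    if (r < 0)
      return r;
    for (auto &obj : objects) {
      // ... run the requested action (e.g. list-omap) on obj ...
      std::cout << obj << std::endl;
    }
  }
  return 0;
}

With a guard like this, running the COT command against a PG for which the OSD holds no collection should print an error and return instead of crashing.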

Description Pawan 2024-02-06 16:41:59 UTC
Description of problem:

A segmentation fault occurs on the OSD when we try to list the omap entries of an object with ceph-objectstore-tool.

# systemctl stop ceph-4ac55332-c500-11ee-ad37-fa163e664e45.service
[root@ceph-pdhiran-hd3aat-node9 ~]# cephadm shell --name osd.8 -- ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 11.1f benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object20  list-omap
Inferring fsid 4ac55332-c500-11ee-ad37-fa163e664e45
Inferring config /var/lib/ceph/4ac55332-c500-11ee-ad37-fa163e664e45/osd.8/config
Using ceph image with id '23f1e3d0a21b' and tag '<none>' created on 2024-01-31 00:09:32 +0000 UTC
registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4ce4fff33a42564a2f877420a1898e060d316b2c818c53258e7beb2cd57ce7f3
*** Caught signal (Segmentation fault) **
 in thread 7ff86c8e6580 thread_name:ceph-objectstor
 ceph version 18.2.0-144.el9cp (f2621d6df88c0fe16f313952d9dd897bbec5d90d) reef (stable)
 1: /lib64/libc.so.6(+0x54db0) [0x7ff86ceeedb0]
 2: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x4b) [0x55987c529c4b]
 3: (_action_on_all_objects_in_pg(ObjectStore*, coll_t, action_on_object_t&, bool)+0x4cc) [0x55987c07476c]
 4: (action_on_all_objects_in_exact_pg(ObjectStore*, coll_t, action_on_object_t&, bool)+0x64) [0x55987c075654]
 5: main()
 6: /lib64/libc.so.6(+0x3feb0) [0x7ff86ced9eb0]
 7: __libc_start_main()
 8: _start()


Version-Release number of selected component (if applicable):
# ceph version
ceph version 18.2.0-144.el9cp (f2621d6df88c0fe16f313952d9dd897bbec5d90d) reef (stable)

How reproducible:
Always

Steps to Reproduce:
1. Create an EC pool and write objects to it.
2. Identify a test object and the primary OSD for its PG.

# rados -p Inconsistent_snap_pool_ec ls
benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object37
benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object38
benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object48
benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object43

# ceph osd map Inconsistent_snap_pool_ec benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object20 -f json-pretty

{
    "epoch": 250,
    "pool": "Inconsistent_snap_pool_ec",
    "pool_id": 11,
    "objname": "benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object20",
    "raw_pgid": "11.367617bf",
    "pgid": "11.1f",
    "up": [
        8,
        17,
        11,
        5
    ],
    "up_primary": 8,
    "acting": [
        8,
        17,
        11,
        5
    ],
    "acting_primary": 8
}
3. Run the COT command to list the omap entries and observe the crash.

# systemctl stop ceph-4ac55332-c500-11ee-ad37-fa163e664e45.service
[root@ceph-pdhiran-hd3aat-node9 ~]# cephadm shell --name osd.8 -- ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 11.1f benchmark_data_ceph-pdhiran-hd3aat-node7_4171_object20  list-omap
Inferring fsid 4ac55332-c500-11ee-ad37-fa163e664e45
Inferring config /var/lib/ceph/4ac55332-c500-11ee-ad37-fa163e664e45/osd.8/config
Using ceph image with id '23f1e3d0a21b' and tag '<none>' created on 2024-01-31 00:09:32 +0000 UTC
registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:4ce4fff33a42564a2f877420a1898e060d316b2c818c53258e7beb2cd57ce7f3
*** Caught signal (Segmentation fault) **
 in thread 7ff86c8e6580 thread_name:ceph-objectstor
 ceph version 18.2.0-144.el9cp (f2621d6df88c0fe16f313952d9dd897bbec5d90d) reef (stable)
 1: /lib64/libc.so.6(+0x54db0) [0x7ff86ceeedb0]
 2: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x4b) [0x55987c529c4b]
 3: (_action_on_all_objects_in_pg(ObjectStore*, coll_t, action_on_object_t&, bool)+0x4cc) [0x55987c07476c]
 4: (action_on_all_objects_in_exact_pg(ObjectStore*, coll_t, action_on_object_t&, bool)+0x64) [0x55987c075654]
 5: main()
 6: /lib64/libc.so.6(+0x3feb0) [0x7ff86ced9eb0]
 7: __libc_start_main()
 8: _start()

Actual results:
A segmentation fault occurs when the command is run.

Expected results:
The command should complete without a segmentation fault.

Additional info:

Comment 21 errata-xmlrpc 2024-11-07 14:38:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:9010