Bug 2110797

Summary: MGR - libcephsqlite: crash caused by regex treating '-' as a range operator
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: high
Priority: unspecified
Version: 5.1
CC: akupczyk, amathuri, bhubbard, ceph-eng-bugs, cephqe-warriors, choffman, ekristov, kdreyer, ksirivad, lflores, nojha, pdhange, pdhiran, rfriedma, rzarzyns, sseshasa, vereddy, vumrao
Target Milestone: ---
Keywords: CodeChange
Target Release: 6.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-17.2.3-1.el9cp
Doc Type: No Doc Update
Last Closed: 2023-03-20 18:57:13 UTC
Type: Bug

Description Vikhyat Umrao 2022-07-25 22:47:48 UTC
Description of problem:
MGR - libcephsqlite: crash caused by regex treating '-' as a range operator

Original report - https://tracker.ceph.com/issues/55304
Rook report - https://tracker.ceph.com/issues/56700


Quincy backport - https://tracker.ceph.com/issues/56702

Version-Release number of selected component (if applicable):
Upstream quincy
RHCS 6

Comment 1 Vikhyat Umrao 2022-07-25 22:58:29 UTC
Quincy backport - https://github.com/ceph/ceph/pull/47270

Comment 2 Vikhyat Umrao 2022-07-28 22:34:41 UTC
(In reply to Vikhyat Umrao from comment #1)
> Quincy backport - https://github.com/ceph/ceph/pull/47270

This one is now merged upstream.

Ken and Thomas - this bug needs your attention. The issue was found upstream in 17.2.2, when users updated their clusters to 17.2.2 and the MGR started crashing. The root cause is not in the Ceph code; it is a platform issue. The upstream 17.2.2 packages were built on CentOS 8 Stream with GCC 8.5.0-14, which triggers the issue, while 17.2.1 was built with GCC 8.5.0-13, which does not.

Here is the RHEL bug - https://bugzilla.redhat.com/show_bug.cgi?id=2001788

For more details please check - https://tracker.ceph.com/issues/55304#note-15.
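
For reference, here is a minimal standalone sketch of the failure mode. The pattern below is illustrative only, not the exact regex from libcephsqlite (see the tracker note above for the real one): with the older libstdc++ a '-' placed right after a POSIX class inside a bracket expression is accepted as a literal dash, while the newer libstdc++ parses it as a range operator with an invalid endpoint, so std::regex construction throws std::regex_error, and in the MGR that exception was unhandled.

// sketch.cc - illustration only; the actual libcephsqlite pattern differs.
#include <iostream>
#include <regex>

int main() {
  try {
    // Older libstdc++ accepted the '-' after [[:alnum:]] as a literal dash.
    // Newer libstdc++ treats it as the start of a character range whose left
    // endpoint is a POSIX class, which is invalid, so construction throws.
    std::regex risky("[[:alnum:]-_.]+");
    std::cout << "risky pattern accepted by this libstdc++" << std::endl;
  } catch (const std::regex_error& e) {
    // In the MGR this exception was not caught, hence the crash.
    std::cerr << "regex_error: " << e.what() << std::endl;
  }

  // A portable way to keep a literal dash is to put it last (or first) in
  // the bracket expression, where it can never act as a range operator.
  std::regex safe("[[:alnum:]_.-]+");
  std::cout << "safe pattern accepted" << std::endl;
  return 0;
}

Building this sketch against the two toolchains mentioned above (GCC 8.5.0-13 vs 8.5.0-14) should show the two behaviors.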

We need to verify this downstream. If I remember correctly, we are using RHEL 9 for RHCS 6, so this issue might be present in GCC 8.5.0-14 and above, and RHCS built on top of RHEL 9 might see it.

QE team - heads up, can you please check which GCC version is installed inside the MGR pod on RHEL 9?

cephadm shell --name <mgr daemon name>

rpm -qa | grep gcc


Example from 17.2.1 upstream cluster


# cephadm shell --name mgr.f28-h28-000-r630.rdu2.scalelab.redhat.com.ffcshc
Inferring fsid f0d9ede0-0e82-11ed-b543-000af7995756
Inferring config /var/lib/ceph/f0d9ede0-0e82-11ed-b543-000af7995756/mgr.f28-h28-000-r630.rdu2.scalelab.redhat.com.ffcshc/config
Using ceph image with id 'e5af760fa1c1' and tag 'v17.2.1' created on 2022-06-23 19:49:45 +0000 UTC
quay.io/ceph/ceph@sha256:d3f3e1b59a304a280a3a81641ca730982da141dad41e942631e4c5d88711a66b

[ceph: root@f28-h28-000-r630 /]# rpm -qa | grep gcc 
libgcc-8.5.0-13.el8.x86_64

[ceph: root@f28-h28-000-r630 /]# cat /etc/redhat-release 
CentOS Stream release 8

Comment 3 Pawan 2022-07-29 16:09:12 UTC
Hey Vikhyat, here is the gcc version being used.
Collected the info by logging into the pod via podman exec.
# podman exec -it 0a2ac3ebeb5c /bin/bash
[root@ceph-pdhiran-7g0lbr-node1-installer /]#  rpm -qa | grep gcc
libgcc-11.2.1-9.4.el9.x86_64
[root@ceph-pdhiran-7g0lbr-node1-installer /]# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 (Plow)

When I tried via cephadm, I hit this error:
# cephadm shell --name ceph-pdhiran-7g0lbr-node1-installer.vjunfz
Inferring fsid a3952e4c-0eef-11ed-82e2-fa163e862102
Inferring config /var/lib/ceph/a3952e4c-0eef-11ed-82e2-fa163e862102/ceph-pdhiran-7g0lbr-node1-installer.vjunfz/config
Using ceph image with id '0cdbd0431cdd' and tag 'ceph-6.0-rhel-9-containers-candidate-12601-20220727135926' created on 2022-07-27 14:01:47 +0000 UTC
registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:11f49ce9958cf4ef4967920d12344ff047556dc2a047cf25604ef1f0d4e00bbb
Error: statfs /var/lib/ceph/a3952e4c-0eef-11ed-82e2-fa163e862102/ceph-pdhiran-7g0lbr-node1-installer.vjunfz/config: no such file or directory
[root@ceph-pdhiran-7g0lbr-node1-installer cephuser]# cephadm shell --name ceph-a3952e4c-0eef-11ed-82e2-fa163e862102-mgr-ceph-pdhiran-7g0lbr-node1-installer-vjunfz
Inferring fsid a3952e4c-0eef-11ed-82e2-fa163e862102
Traceback (most recent call last):
  File "/sbin/cephadm", line 9281, in <module>
    main()
  File "/sbin/cephadm", line 9269, in main
    r = ctx.func(ctx)
  File "/sbin/cephadm", line 2034, in _infer_config
    return func(ctx)
  File "/sbin/cephadm", line 1979, in _infer_fsid
    return func(ctx)
  File "/sbin/cephadm", line 2022, in _infer_config
    ctx.config = config_path(name.split('.', 1)[0], name.split('.', 1)[1])
IndexError: list index out of range

Comment 4 Vikhyat Umrao 2022-07-29 16:13:59 UTC
(In reply to Pawan from comment #3)
> Hey Vikhyat, here is the gcc version being used.
> Collected the info by logging into the pod via podman exec.
> # podman exec -it 0a2ac3ebeb5c /bin/bash
> [root@ceph-pdhiran-7g0lbr-node1-installer /]#  rpm -qa | grep gcc
> libgcc-11.2.1-9.4.el9.x86_64
> [root@ceph-pdhiran-7g0lbr-node1-installer /]# cat /etc/redhat-release
> Red Hat Enterprise Linux release 9.0 (Plow)
> 
> When tried via cephadm, hit the error:
> # cephadm shell --name ceph-pdhiran-7g0lbr-node1-installer.vjunfz
> Inferring fsid a3952e4c-0eef-11ed-82e2-fa163e862102
> Inferring config
> /var/lib/ceph/a3952e4c-0eef-11ed-82e2-fa163e862102/ceph-pdhiran-7g0lbr-node1-
> installer.vjunfz/config
> Using ceph image with id '0cdbd0431cdd' and tag
> 'ceph-6.0-rhel-9-containers-candidate-12601-20220727135926' created on
> 2022-07-27 14:01:47 +0000 UTC
> registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:
> 11f49ce9958cf4ef4967920d12344ff047556dc2a047cf25604ef1f0d4e00bbb
> Error: statfs
> /var/lib/ceph/a3952e4c-0eef-11ed-82e2-fa163e862102/ceph-pdhiran-7g0lbr-node1-
> installer.vjunfz/config: no such file or directory
> [root@ceph-pdhiran-7g0lbr-node1-installer cephuser]# cephadm shell --name
> ceph-a3952e4c-0eef-11ed-82e2-fa163e862102-mgr-ceph-pdhiran-7g0lbr-node1-
> installer-vjunfz
> Inferring fsid a3952e4c-0eef-11ed-82e2-fa163e862102
> Traceback (most recent call last):
>   File "/sbin/cephadm", line 9281, in <module>
>     main()
>   File "/sbin/cephadm", line 9269, in main
>     r = ctx.func(ctx)
>   File "/sbin/cephadm", line 2034, in _infer_config
>     return func(ctx)
>   File "/sbin/cephadm", line 1979, in _infer_fsid
>     return func(ctx)
>   File "/sbin/cephadm", line 2022, in _infer_config
>     ctx.config = config_path(name.split('.', 1)[0], name.split('.', 1)[1])
> IndexError: list index out of range

The command is wrong! The --name argument needs the daemon type prefix `mgr.`; the second name above contains no '.' at all, which is why name.split('.', 1)[1] raises the IndexError shown.

cephadm shell --name mgr.ceph-pdhiran-7g0lbr-node1-installer.vjunfz

Comment 5 Ken Dreyer (Red Hat) 2022-07-29 19:18:48 UTC
Since this was due to a compiler change, will this affect versions of Ceph below quincy? Like RHCS 4 (nautilus) or 5 (pacific)?

Comment 7 Vikhyat Umrao 2022-07-29 19:54:40 UTC
(In reply to Ken Dreyer (Red Hat) from comment #5)
> Since this was due to a compiler change, will this affect versions of Ceph
> below quincy? Like RHCS 4 (nautilus) or 5 (pacific)?

Ken - per Neha in the internal discussion, the issue affects pacific as well; the only difference is that the devicehealth module does not use libcephsqlite there, so pacific is less likely to hit it, and we have not seen any reports from nautilus or pacific.

But I will have Neha check on this and let us know her feedback.

Comment 8 David Galloway 2022-07-29 21:56:17 UTC
Just to be clear - the version of gcc is only important at *build* time.  We don't have any evidence indicating runtime gcc version matters.

Comment 44 errata-xmlrpc 2023-03-20 18:57:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360