Bug 1727907
| Field | Value |
|---|---|
| Summary | OSP16 gnocchi-metricd/platform-python3.6 segfaults in envs with ceph |
| Product | Red Hat OpenStack |
| Component | openstack-containers |
| Version | 16.1 (Train) |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Reporter | Pavel Sedlák <psedlak> |
| Assignee | Matthias Runge <mrunge> |
| QA Contact | Leonid Natapov <lnatapov> |
| CC | apevec, astupnik, bhubbard, cmuresan, cylopez, dhill, ealcaniz, fpantano, gfidente, jdillama, jjoyce, johfulto, jschluet, j.thadden, lhh, lmadsen, m.andre, marjones, migawa, mmagr, mrunge, msecaur, pkilambi, vkapalav, ykaul |
| Keywords | Reopened, Triaged, ZStream |
| Target Milestone | z2 |
| Target Release | 16.1 (Train on RHEL 8.2) |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-gnocchi-base-container-16.1-57 |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2020-10-28 19:02:05 UTC |
Description
Pavel Sedlák, 2019-07-08 13:40:02 UTC
Brad Hubbard (comment #10):

> warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
> warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.0" is not at the expected address (wrong library or version mismatch?)

Your ceph debuginfo does not match what was running when the segfault happened.
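One way to confirm such a mismatch is to compare the build ID of the installed library with the debuginfo gdb loaded. A minimal sketch, assuming the usual elfutils and dnf tooling on the controller; adjust the package names to whatever `rpm -qf` reports:

    # Which package owns the library gdb complained about?
    rpm -qf /usr/lib64/librados.so.2

    # Build ID embedded in the installed library (elfutils)
    eu-readelf -n /usr/lib64/librados.so.2 | grep 'Build ID'

    # Install the debuginfo matching the installed package version
    dnf debuginfo-install librados2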
Brad Hubbard:

(In reply to Brad Hubbard from comment #10)
> Your ceph debuginfo does not match what was running when the segfault happened.

I just reread the initial description. If you are having trouble matching the binaries in use to the correct debuginfo version, then I'd suggest getting a coredump, using the build IDs in the coredump to match it to the correct debuginfo packages, and creating a container for debugging based on the same image. You should then be able to get a reasonable backtrace. If you would like me to do it, please provide the coredump, details of the container in use, and where I can pull that container from.
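Extracting the build IDs directly from a coredump, as suggested above, could look roughly like this. A sketch assuming systemd-coredump and elfutils are in use; the match string and output path are placeholders:

    # List recent crashes captured by systemd-coredump
    coredumpctl list gnocchi-metricd

    # Write one core out to a file
    coredumpctl dump gnocchi-metricd -o /tmp/gnocchi.core

    # Print every module in the core with its build ID; these IDs identify
    # the exact librados2/ceph builds that were loaded at crash time
    eu-unstrip -n --core=/tmp/gnocchi.core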
Created attachment 1612039 [details]: coredump file + container and rpm info
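Debugging inside a throwaway container built from the same image, as suggested above, might look like the following sketch. The image reference and core path are placeholders, not taken from this bug, and it assumes dnf is available in the image:

    # Start a shell in the same gnocchi image that produced the crash
    podman run -it --rm \
        -v /tmp/gnocchi.core:/tmp/gnocchi.core:z \
        --entrypoint /bin/bash \
        registry.example.com/rhosp/openstack-gnocchi-metricd:16.1

    # Inside the container: install gdb and open the core against the
    # interpreter that was running when the segfault happened
    dnf install -y gdb
    gdb /usr/libexec/platform-python /tmp/gnocchi.core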
Brad, I suppose I was directing my comment at you, or really anyone who wanted to have a look at this. I am attaching [1] a sosreport and a coredump from each controller. I am assuming (big assumption!) that the gnocchi-metricd that is coredumping is not running in a container, since I see the errors in /var/log/messages and not in /var/log/containers/*. The environment is a freshly installed OSP15+HCI environment in a lab for TAM use. All nodes are virtualized on the same physical server. I am seeing the same behavior on another physical node that is also running OSP15 on VMs, but that environment is using external Ceph whose nodes are VMs running on a third machine.

The errors do not start immediately after deployment. The first error on the first controller was at Oct 17 17:30:31, about 30 minutes after the deployment finished. Each segfault appears to be preceded by a line in the logs about a healthcheck starting inside various containers (nova_conductor, neutron_api, swift_object_server, etc.). After the first error, it repeats in the log roughly every 30 seconds, though about 3-4 coredumps are created in /var/lib/systemd/coredump/ every second. I'm happy to provide additional information as required.

[1] http://file.rdu.redhat.com/~msecaur/BZ1727907_sosreports_and_dumps.tar

Perhaps unsurprisingly, disabling telemetry on the overcloud seems to work around this problem. I redeployed yesterday afternoon without telemetry and I haven't had a single core dump all night. (See the sketch after this comment thread.)

I'll take a look at this tomorrow, Matthew.

Using the file backend instead of Ceph doesn't reproduce this issue, so this is most probably related to Ceph.

RHOS-16.1-RHEL-8-20200925.n.1 (undercloud). No cradox packages:

    ()[root@controller-0 /]# rpm -qa | grep rados
    python3-rados-14.2.8-91.el8cp.x86_64
    librados2-14.2.8-91.el8cp.x86_64

No tracebacks observed.

*** Bug 1854732 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 containers bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4382
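For reference, the telemetry-disable workaround mentioned in the comments above is applied at deploy time. A minimal sketch, assuming the stock disable-telemetry environment file shipped with tripleo-heat-templates; the rest of the deployment's options and environment files still need to be passed as usual:

    # Redeploy the overcloud with telemetry services disabled
    # (other -e files and deploy options for this environment omitted)
    openstack overcloud deploy \
        --templates \
        -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml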