In an OSP15/el8 setup, /var/log/messages on the controller nodes shows segfaults of gnocchi-metricd / platform-python3.6. The messages appear in large numbers (thousands during the first ~hour):

> Jul 4 13:26:49 controller-0 kernel: gnocchi-metricd[198106]: segfault at 309cca87 ip 00000000309cca87 sp 00007fe40f7fca30 error 14 in platform-python3.6[5568377b2000+2000]
> Jul 4 13:26:49 controller-0 kernel: gnocchi-metricd[198109]: segfault at 309cca87 ip 00000000309cca87 sp 00007fe40f7fca30 error 14 in platform-python3.6[5568377b2000+2000]

The issue appears to be specific to setups with Ceph as the storage backend, not LVM, and from the few indications I can gather it may be related to "/lib64/librados.so.2" or "/usr/lib64/ceph/libceph-common.so.0".

Looking inside the gnocchi-metricd container:

> [root@controller-0 heat-admin]# podman ps |grep gnocc
> 0673dab164c7 192.168.24.1:8787/rhosp15/openstack-gnocchi-statsd:20190625.1 dumb-init --singl... 3 days ago Up 3 days ago gnocchi_statsd
> ae419997782f 192.168.24.1:8787/rhosp15/openstack-gnocchi-metricd:20190625.1 dumb-init --singl... 3 days ago Up 3 days ago gnocchi_metricd
> 81168523d0e7 192.168.24.1:8787/rhosp15/openstack-gnocchi-api:20190625.1 dumb-init --singl... 3 days ago Up 3 days ago gnocchi_api
> [root@controller-0 heat-admin]# podman exec -t -i ae419997782f bash
> ()[gnocchi@controller-0 /]$ ps -ef
> UID PID PPID C STIME TTY TIME CMD
> gnocchi 1 0 0 Jul04 ? 00:00:00 dumb-init --single-child -- kolla_start
> gnocchi 8 1 0 Jul04 ? 00:34:02 gnocchi-metricd: master process [/usr/bin/gnocchi-metricd]
> gnocchi 38 8 0 Jul04 ? 00:00:54 gnocchi-metricd: reporting worker(0)
> gnocchi 40 8 0 Jul04 ? 00:00:29 gnocchi-metricd: janitor worker(0)
> gnocchi 410485 0 0 12:05 ? 00:00:02 bash
> gnocchi 445805 0 0 12:22 pts/0 00:00:00 bash
> gnocchi 445928 8 17 12:22 ? 00:00:00 gnocchi-metricd: processing worker(3)
> gnocchi 445971 8 24 12:22 ? 00:00:00 gnocchi-metricd: processing worker(2)
> gnocchi 446039 8 49 12:22 ? 00:00:00 gnocchi-metricd: processing worker(1)
> gnocchi 446080 410485 0 12:22 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 0.5
> gnocchi 446081 445805 0 12:22 pts/0 00:00:00 ps -ef
> ()[gnocchi@controller-0 /]$ ps -ef
> UID PID PPID C STIME TTY TIME CMD
> gnocchi 1 0 0 Jul04 ? 00:00:00 dumb-init --single-child -- kolla_start
> gnocchi 8 1 0 Jul04 ? 00:34:02 gnocchi-metricd: master process [/usr/bin/gnocchi-metricd]
> gnocchi 38 8 0 Jul04 ? 00:00:54 gnocchi-metricd: reporting worker(0)
> gnocchi 40 8 0 Jul04 ? 00:00:29 gnocchi-metricd: janitor worker(0)
> gnocchi 410485 0 0 12:05 ? 00:00:02 bash
> gnocchi 445805 0 0 12:22 pts/0 00:00:00 bash
> gnocchi 445971 8 26 12:22 ? 00:00:01 [gnocchi-metricd] <defunct>
> gnocchi 446039 8 47 12:22 ? 00:00:00 [gnocchi-metricd] <defunct>
> gnocchi 446088 8 61 12:22 ? 00:00:00 gnocchi-metricd: processing worker(0)
> gnocchi 446127 410485 0 12:22 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 0.5
> gnocchi 446128 445805 0 12:22 pts/0 00:00:00 ps -ef

(The second ps -ef, taken moments later, already shows the earlier processing workers as <defunct>, with a new worker spawned.)

RPM versions inside the container are:

> puppet-gnocchi-14.4.1-0.20190420061300.b480da5.el8ost.noarch
> python3-gnocchiclient-7.0.4-0.20190312220152.64814b9.el8ost.noarch
> python3-gnocchi-4.3.3-0.20190327110329.c531da6.el8ost.noarch
> gnocchi-metricd-4.3.3-0.20190327110329.c531da6.el8ost.noarch
> gnocchi-common-4.3.3-0.20190327110329.c531da6.el8ost.noarch
>
> librados2-14.2.1-584.g4409ccf.el8cp.x86_64

From coredumpctl on the controller I get very little useful information, even after installing all the debuginfo packages recommended by gdb/coredumpctl. I suspect this is because the package versions inside the containers differ slightly from those on the host; even now slightly newer librados builds are available than a few days ago, and there is no gdb or debuginfo inside the containers. If there is a better approach for gathering details, please let me know.

> [root@controller-0 heat-admin]# coredumpctl debug 688739
> PID: 688739 (gnocchi-metricd)
> UID: 42416 (42416)
> GID: 42416 (42416)
> Signal: 11 (SEGV)
> Timestamp: Mon 2019-07-08 13:29:01 UTC (53s ago)
> Command Line: gnocchi-metricd: processing worker(1)
> Executable: /usr/libexec/platform-python3.6
> Control Group: /
> Slice: -.slice
> Boot ID: 8d4a2d9743034b3fbf2e759dd9c8d566
> Machine ID: 1955d626eaf2404ba8bc019f6e74ff62
> Hostname: controller-0
> Storage: /var/lib/systemd/coredump/core.gnocchi-metricd.42416.8d4a2d9743034b3fbf2e759dd9c8d566.688739.1562592541000000.lz4
> Message: Process 688739 (gnocchi-metricd) of user 42416 dumped core.
>
> Stack trace of thread 589539:
> #0 0x00000000309cca87 n/a (n/a)
> #1 0x000055683885f6d8 n/a (n/a)
> ...
> warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
> warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.0" is not at the expected address (wrong library or version mismatch?)
> warning: Could not load shared library symbols for 23 libraries, e.g. /usr/lib64/python3.6/site-packages/setproctitle.cpython-36m-x86_64-linux-gnu.so.
> Use the "info sharedlibrary" command to see the complete listing.
> Do you need "set solib-search-path" or "set sysroot"?
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `gnocchi-metricd: processing worker(1) '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00000000309cca87 in ?? ()
> [Current thread is 1 (Thread 0x7fe40d7fa700 (LWP 589539))]
> (gdb) bt
> #0 0x00000000309cca87 in ?? ()
> #1 0x00007fe43bcd60b0 in ?? ()
> #2 0x00007fe418399208 in ?? ()
> #3 0x0000000000000034 in ?? ()
> #4 0x00007fe418ca1b50 in ?? ()
> #5 0x00007fe418390548 in ?? ()
> #6 0x00007fe4183e7f38 in ?? ()
> #7 0x00007fe449b9e4c0 in small_ints () from /lib64/libpython3.6m.so.1.0
> #8 0xad1893ec59c4c600 in ?? ()
> #9 0x0000000000000000 in ?? ()
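As a sanity check on the library-mismatch suspicion, one comparison that might help is the Ceph library versions on the host versus inside the metricd container (a sketch; the container name gnocchi_metricd is taken from the podman ps output above, and the paths may differ on other setups):

  # on the controller host
  rpm -q librados2 ceph-common
  ls -l /lib64/librados.so.2 /usr/lib64/ceph/libceph-common.so.0

  # inside the gnocchi_metricd container
  podman exec gnocchi_metricd rpm -q librados2
  podman exec gnocchi_metricd ls -l /lib64/librados.so.2 /usr/lib64/ceph/libceph-common.so.0

If the NVRs differ, gdb on the host will be loading the wrong symbols for the container's libraries, which would explain the useless backtrace.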
> warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
> warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.0" is not at the expected address (wrong library or version mismatch?)

Your Ceph debuginfo does not match what was running when the segfault happened.
(In reply to Brad Hubbard from comment #10)
> > warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
> > warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.0" is not at the expected address (wrong library or version mismatch?)
>
> Your Ceph debuginfo does not match what was running when the segfault happened.

I just reread the initial description. If you are having trouble matching the binaries in use to the correct debuginfo versions, then I'd suggest getting a coredump, using the build IDs in the coredump to match it to the correct debuginfo packages, and creating a container for debugging based on the same image. You should then be able to get a reasonable backtrace. If you would like me to do it, please provide the coredump, details of the container in use, and where I can pull that container from.
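Roughly, a sketch of that build-id approach (the PID, paths, and image name below are just the ones from the earlier output; the exact debuginfo lookup depends on which repos you have available):

  # write the core out of the systemd coredump storage
  coredumpctl dump 688739 --output=/tmp/metricd.core

  # list every module mapped in the core together with its build ID
  eu-unstrip -n --core=/tmp/metricd.core

  # start a throwaway debug container from the same image the worker ran in,
  # then install gdb plus the debuginfo packages whose build IDs match the listing above
  podman run -it --rm 192.168.24.1:8787/rhosp15/openstack-gnocchi-metricd:20190625.1 bash

Matching debuginfo installs its data under /usr/lib/debug/.build-id/, so the build IDs printed by eu-unstrip tell you exactly which package builds you need.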
Created attachment 1612039 [details] coredump file + container and rpm info
Brad, I suppose I was directing my comment at you, or really at anyone who wanted to have a look at this. I am attaching [1] a sosreport and a coredump from each controller. I am assuming (a big assumption!) that the gnocchi-metricd that is coredumping is not running in a container, since I see the errors in /var/log/messages and not in /var/log/containers/*.

The environment is a freshly installed OSP15+HCI environment in a lab for TAM use. All nodes are virtualized on the same physical server. I am seeing the same behavior on another physical node that is also running OSP15 on VMs, but that environment uses external Ceph whose nodes are VMs running on a third machine.

The errors do not start immediately after deployment. The first error on the first controller was at Oct 17 17:30:31, about 30 minutes after the deployment finished. Each segfault appears to be preceded by a log line about a healthcheck starting inside various containers (nova_conductor, neutron_api, swift_object_server, etc.). After the first error, it repeats in the log roughly every 30 seconds, though about 3-4 coredumps are created in /var/lib/systemd/coredump/ every second. I'm happy to provide additional information as required.

[1] http://file.rdu.redhat.com/~msecaur/BZ1727907_sosreports_and_dumps.tar
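One caveat on that assumption: the kernel logs segfaults of containerized processes to the host's /var/log/messages as well, so the log location alone doesn't settle where metricd runs. A quick hedged check is to compare the PIDs in the segfault lines with the metricd workers that podman reports (container name as in the earlier comments):

  # metricd processes as seen from the host PID namespace
  pgrep -af gnocchi-metricd

  # processes podman attributes to the metricd container
  podman top gnocchi_metricd

  # recent segfault lines with their PIDs
  journalctl -k | grep 'gnocchi-metricd.*segfault' | tail

If the crashing PIDs show up in the podman top output, the workers are containerized despite where the messages land.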
Perhaps unsurprisingly, disabling telemetry on the overcloud seems to work around this problem. I redeployed yesterday afternoon without telemetry and I haven't had a single core dump all night.
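For reference, the usual way to do that with TripleO is to include the telemetry-disabling environment file on redeploy, something like the following (a sketch; the template path and filename may vary by release):

  # redeploy with telemetry disabled (keep your existing -e arguments)
  openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml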
I'll take a look at this tomorrow, Matthew.
Using the file backend instead of Ceph doesn't reproduce this issue, so this is most probably related to Ceph.
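To confirm which storage driver a given deployment is actually exercising, a hedged check against the usual gnocchi config path inside the metricd container:

  podman exec gnocchi_metricd grep -A3 '^\[storage\]' /etc/gnocchi/gnocchi.conf
  # expect "driver = ceph" on the failing setups and "driver = file" on the working ones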
RHOS-16.1-RHEL-8-20200925.n.1 (undercloud). NO cradox packages:

> ()[root@controller-0 /]# rpm -qa | grep rados
> python3-rados-14.2.8-91.el8cp.x86_64
> librados2-14.2.8-91.el8cp.x86_64

No tracebacks observed.
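As an additional runtime check (a sketch; gnocchi's Ceph driver has historically preferred cradox when available and fallen back to python-rados otherwise, and the container name here is assumed to be gnocchi_metricd as in 15), one can confirm which binding the container actually imports:

  # should succeed and point at the python3-rados module
  podman exec gnocchi_metricd python3 -c "import rados; print(rados.__file__)"
  # should fail with ModuleNotFoundError now that cradox is no longer shipped
  podman exec gnocchi_metricd python3 -c "import cradox"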
*** Bug 1854732 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 containers bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4382