Bug 1727907
| Field | Value |
|---|---|
| Summary | OSP16 gnocchi-metricd/platform-python3.6 segfaults in envs with ceph |
| Product | Red Hat OpenStack |
| Component | openstack-containers |
| Version | 16.1 (Train) |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Reporter | Pavel Sedlák <psedlak> |
| Assignee | Matthias Runge <mrunge> |
| QA Contact | Leonid Natapov <lnatapov> |
| CC | apevec, astupnik, bhubbard, cmuresan, cylopez, dhill, ealcaniz, fpantano, gfidente, jdillama, jjoyce, johfulto, jschluet, j.thadden, lhh, lmadsen, m.andre, marjones, migawa, mmagr, mrunge, msecaur, pkilambi, vkapalav, ykaul |
| Keywords | Reopened, Triaged, ZStream |
| Target Milestone | z2 |
| Target Release | 16.1 (Train on RHEL 8.2) |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-gnocchi-base-container-16.1-57 |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2020-10-28 19:02:05 UTC |
Description
Pavel Sedlák, 2019-07-08 13:40:02 UTC
Brad Hubbard (comment #10):

> warning: .dynamic section for "/lib64/librados.so.2" is not at the expected address (wrong library or version mismatch?)
> warning: .dynamic section for "/usr/lib64/ceph/libceph-common.so.0" is not at the expected address (wrong library or version mismatch?)

Your ceph debuginfo does not match what was running when the segfault happened.
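One way to confirm such a mismatch is to compare the build ID of the installed library with the debuginfo gdb loaded. A minimal sketch, assuming the usual elfutils and dnf tooling on the controller; adjust the package names to whatever `rpm -qf` reports:

    # Which package owns the library gdb complained about?
    rpm -qf /usr/lib64/librados.so.2

    # Build ID embedded in the installed library (elfutils)
    eu-readelf -n /usr/lib64/librados.so.2 | grep 'Build ID'

    # Install the debuginfo matching the installed package version
    dnf debuginfo-install librados2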
Brad Hubbard:

(In reply to Brad Hubbard from comment #10)
> Your ceph debuginfo does not match what was running when the segfault happened.

I just reread the initial description. If you are having trouble matching the binaries in use to the correct debuginfo version, then I'd suggest getting a coredump, using the build IDs in the coredump to match it to the correct debuginfo packages, and creating a container for debugging based on the same image. You should then be able to get a reasonable backtrace. If you would like me to do it, please provide the coredump, details of the container in use, and where I can pull that container from.
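Extracting the build IDs directly from a coredump, as suggested above, could look roughly like this. A sketch assuming systemd-coredump and elfutils are in use; the match string and output path are placeholders:

    # List recent crashes captured by systemd-coredump
    coredumpctl list gnocchi-metricd

    # Write one core out to a file
    coredumpctl dump gnocchi-metricd -o /tmp/gnocchi.core

    # Print every module in the core with its build ID; these IDs identify
    # the exact librados2/ceph builds that were loaded at crash time
    eu-unstrip -n --core=/tmp/gnocchi.core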
Created attachment 1612039 [details]: coredump file + container and rpm info
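Debugging inside a throwaway container built from the same image, as suggested above, might look like the following sketch. The image reference and core path are placeholders, not taken from this bug, and it assumes dnf is available in the image:

    # Start a shell in the same gnocchi image that produced the crash
    podman run -it --rm \
        -v /tmp/gnocchi.core:/tmp/gnocchi.core:z \
        --entrypoint /bin/bash \
        registry.example.com/rhosp/openstack-gnocchi-metricd:16.1

    # Inside the container: install gdb and open the core against the
    # interpreter that was running when the segfault happened
    dnf install -y gdb
    gdb /usr/libexec/platform-python /tmp/gnocchi.core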
Brad, I suppose I was directing my comment at you, or really anyone who wanted to have a look at this. I am attaching [1] a sosreport and a coredump from each controller. I am assuming (big assumption!) that the gnocchi-metricd that is coredumping is not running in a container, since I see the errors in /var/log/messages and not in /var/log/containers/*. The environment is a freshly installed OSP15+HCI environment in a lab for TAM use. All nodes are virtualized on the same physical server. I am seeing the same behavior on another physical node that is also running OSP15 on VMs, but that environment is using external Ceph whose nodes are VMs running on a third machine.

The errors do not start immediately after deployment. The first error on the first controller was at Oct 17 17:30:31, about 30 minutes after the deployment finished. Each segfault appears to be preceded by a line in the logs about a healthcheck starting inside various containers (nova_conductor, neutron_api, swift_object_server, etc.). After the first error, it repeats in the log roughly every 30 seconds, though about 3-4 coredumps are created in /var/lib/systemd/coredump/ every second. I'm happy to provide additional information as required.

[1] http://file.rdu.redhat.com/~msecaur/BZ1727907_sosreports_and_dumps.tar

Perhaps unsurprisingly, disabling telemetry on the overcloud seems to work around this problem. I redeployed yesterday afternoon without telemetry and I haven't had a single core dump all night. (See the sketch after this comment thread.)

I'll take a look at this tomorrow, Matthew.

Using the file backend instead of Ceph doesn't reproduce this issue, so this is most probably related to Ceph.

RHOS-16.1-RHEL-8-20200925.n.1 (undercloud). No cradox packages:

    ()[root@controller-0 /]# rpm -qa | grep rados
    python3-rados-14.2.8-91.el8cp.x86_64
    librados2-14.2.8-91.el8cp.x86_64

No tracebacks observed.

*** Bug 1854732 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 containers bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4382
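For reference, the telemetry-disable workaround mentioned in the comments above is applied at deploy time. A minimal sketch, assuming the stock disable-telemetry environment file shipped with tripleo-heat-templates; the rest of the deployment's options and environment files still need to be passed as usual:

    # Redeploy the overcloud with telemetry services disabled
    # (other -e files and deploy options for this environment omitted)
    openstack overcloud deploy \
        --templates \
        -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml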