Bug 1368587 - segfault error 4 in librados.so.2.0.0 when scaling OSP Director
Summary: segfault error 4 in librados.so.2.0.0 when scaling OSP Director
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 3.2
Assignee: Sébastien Han
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-19 20:50 UTC by Omri Hochman
Modified: 2018-12-13 22:09 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-13 22:09:55 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 2904 0 None None None 2016-08-25 23:24:48 UTC
Ceph Project Bug Tracker 16266 0 None None None 2016-08-25 22:14:34 UTC

Description Omri Hochman 2016-08-19 20:50:28 UTC
OSP-Director: After scaling up compute, the openstack-nova-compute service keeps
cycling and crashing with: segfault error 4 in librados.so.2.0.0.


Environment (on undercloud, 9.0 GA build):
-----------------------------------------
instack-undercloud-4.0.0-13.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-33.el7ost.noarch
puppet-3.6.2-4.el7sat.noarch
openstack-puppet-modules-8.1.8-2.el7ost.noarch
openstack-tripleo-puppet-elements-2.0.0-4.el7ost.noarch

Environment (on the newly added compute node):
----------------------------------------
ceph-mon-0.94.5-14.el7cp.x86_64
librados2-0.94.5-14.el7cp.x86_64
ceph-osd-0.94.5-14.el7cp.x86_64
ceph-0.94.5-14.el7cp.x86_64
ceph-common-0.94.5-14.el7cp.x86_64
librbd1-0.94.5-14.el7cp.x86_64



Scenario:
----------
(1) Upgrade a setup with SSL (and 2 Ceph nodes) from OSP8 to OSP9
(2) Boot an instance
(3) Scale up: add another compute node
(4) Reboot both the undercloud and the overcloud
(5) Check the openstack-nova-compute service on the new compute node.

Results: 
---------

The openstack-nova-compute service on the newly added compute node is stuck in "activating" status - the service appears to be caught in a cycle of starting and then crashing. /var/log/messages shows this error, which may indicate a Ceph problem:

overcloud-compute-1 kernel: nova-compute[25233]: segfault at 0 ip 00007fd261d8417a sp 00007fd2612533b0 error 4 in librados.so.2.0.0[7fd261a56000+504000]
   

[root@overcloud-compute-1 ~]# systemctl status openstack-nova-compute
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; enabled; vendor preset: disabled)
   Active: activating (start) since Fri 2016-08-19 20:43:07 UTC; 3s ago
 Main PID: 31464 (nova-compute)
   CGroup: /system.slice/openstack-nova-compute.service
           └─31464 /usr/bin/python2 /usr/bin/nova-compute

Aug 19 20:43:07 overcloud-compute-1.localdomain systemd[1]: Starting OpenStack Nova Compute Server...
Aug 19 20:43:10 overcloud-compute-1.localdomain nova-compute[31464]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messag...ications".
Aug 19 20:43:10 overcloud-compute-1.localdomain nova-compute[31464]: Option "notification_topics" from group "DEFAULT" is deprecated. Use option "topics" from group "oslo_messag...ications".
Hint: Some lines were ellipsized, use -l to show in full.



[root@overcloud-compute-1 ~]# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck degraded; 192 pgs stuck unclean; 192 pgs stuck undersized; 192 pgs undersized; recovery 19/57 objects degraded (33.333%)


[root@overcloud-compute-1 ~]# ceph status
    cluster 947713ba-d023-11e5-aaa7-525400c91767
     health HEALTH_WARN
            192 pgs degraded
            192 pgs stuck degraded
            192 pgs stuck unclean
            192 pgs stuck undersized
            192 pgs undersized
            recovery 19/57 objects degraded (33.333%)
     monmap e2: 3 mons at {overcloud-controller-0=10.19.105.14:6789/0,overcloud-controller-1=10.19.105.12:6789/0,overcloud-controller-2=10.19.105.15:6789/0}
            election epoch 16, quorum 0,1,2 overcloud-controller-1,overcloud-controller-0,overcloud-controller-2
     osdmap e53: 2 osds: 2 up, 2 in; 2 remapped pgs
      pgmap v5288: 192 pgs, 5 pools, 45659 kB data, 19 objects
            8468 MB used, 507 GB / 515 GB avail
            19/57 objects degraded (33.333%)
                 190 active+undersized+degraded
                   2 active+undersized+degraded+remapped



/var/log/messages (on the added compute node):
-----------------
Aug 19 20:35:16 overcloud-compute-1 kernel: nova-compute[25180]: segfault at 0 ip 00007fe969d8417a sp 00007fe9682513b0 error 4 in librados.so.2.0.0[7fe969a56000+504000]
Aug 19 20:35:16 overcloud-compute-1 journal: End of file while reading data: Input/output error
Aug 19 20:35:16 overcloud-compute-1 systemd: openstack-nova-compute.service: main process exited, code=killed, status=11/SEGV
Aug 19 20:35:16 overcloud-compute-1 systemd: Unit openstack-nova-compute.service entered failed state.
Aug 19 20:35:16 overcloud-compute-1 systemd: openstack-nova-compute.service failed.
Aug 19 20:35:16 overcloud-compute-1 systemd: openstack-nova-compute.service holdoff time over, scheduling restart.
Aug 19 20:35:16 overcloud-compute-1 systemd: Starting OpenStack Nova Compute Server...
Aug 19 20:35:19 overcloud-compute-1 nova-compute: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
Aug 19 20:35:19 overcloud-compute-1 nova-compute: Option "notification_topics" from group "DEFAULT" is deprecated. Use option "topics" from group "oslo_messaging_notifications".
Aug 19 20:35:19 overcloud-compute-1 systemd: Started OpenStack Nova Compute Server.
Aug 19 20:35:20 overcloud-compute-1 kernel: nova-compute[25233]: segfault at 0 ip 00007fd261d8417a sp 00007fd2612533b0 error 4 in librados.so.2.0.0[7fd261a56000+504000]
Aug 19 20:35:20 overcloud-compute-1 journal: End of file while reading data: Input/output error
Aug 19 20:35:20 overcloud-compute-1 systemd: openstack-nova-compute.service: main process exited, code=killed, status=11/SEGV
Aug 19 20:35:20 overcloud-compute-1 systemd: Unit openstack-nova-compute.service entered failed state.
Aug 19 20:35:20 overcloud-compute-1 systemd: openstack-nova-compute.service failed.
Aug 19 20:35:20 overcloud-compute-1 systemd: openstack-nova-compute.service holdoff time over, scheduling restart.
Aug 19 20:35:20 overcloud-compute-1 systemd: Starting OpenStack Nova Compute Server...







/var/log/nova/nova-compute.log:
------------------------------------------
2016-08-19 20:49:32.187 36550 ERROR nova.compute.manager [req-f599342e-198e-4d8a-bdf1-669063670cc8 - - - - -] No compute node record for host overcloud-compute-1.localdomain
2016-08-19 20:49:32.190 36550 WARNING nova.compute.monitors [req-f599342e-198e-4d8a-bdf1-669063670cc8 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2016-08-19 20:49:32.191 36550 INFO nova.compute.resource_tracker [req-f599342e-198e-4d8a-bdf1-669063670cc8 - - - - -] Auditing locally available compute resources for node overcloud-compute-1.localdomain
2016-08-19 20:49:36.010 36603 WARNING oslo_reports.guru_meditation_report [-] Guru mediation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.
2016-08-19 20:49:36.011 36603 WARNING oslo_config.cfg [-] Option "compute_manager" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
2016-08-19 20:49:36.125 36603 INFO oslo_service.periodic_task [-] Skipping periodic task _periodic_update_dns because its interval is negative
2016-08-19 20:49:36.130 36603 WARNING oslo_config.cfg [-] Option "security_group_api" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
2016-08-19 20:49:36.132 36603 INFO nova.virt.driver [-] Loading compute driver 'libvirt.LibvirtDriver'
2016-08-19 20:49:36.206 36603 INFO os_brick.initiator.connector [-] Init DISCO connector
2016-08-19 20:49:36.300 36603 WARNING oslo_config.cfg [req-a37fda0d-2983-4834-850a-0b1ba25f5d86 - - - - -] Option "auth_plugin" from group "neutron" is deprecated. Use option "auth_type" from group "neutron".
2016-08-19 20:49:36.308 36603 INFO nova.service [-] Starting compute node (version 13.1.1-2.el7ost)
2016-08-19 20:49:36.326 36603 INFO nova.virt.libvirt.driver [-] Connection event '1' reason 'None'
2016-08-19 20:49:36.353 36603 INFO nova.virt.libvirt.host [req-9a157607-4a33-44bc-bc7e-ad952136185b - - - - -] Libvirt host capabilities <capabilities>

  <host>
    <uuid>4c4c4544-004d-4c10-804c-cac04f563132</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>IvyBridge</model>
      <vendor>Intel</vendor>
      <topology sockets='1' cores='6' threads='2'/>
      <feature name='invtsc'/>
      <feature name='pdpe1gb'/>
      <feature name='osxsave'/>
      <feature name='dca'/>
      <feature name='pcid'/>
      <feature name='pdcm'/>
      <feature name='xtpr'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='smx'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='dtes64'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <pages unit='KiB' size='4'/>
      <pages unit='KiB' size='2048'/>
      <pages unit='KiB' size='1048576'/>
    </cpu>
    <power_management>
      <suspend_mem/>
    </power_management>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
        <uri_transport>rdma</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>25119280</memory>
          <pages unit='KiB' size='4'>6279820</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
          </distances>
          <cpus num='12'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0,6'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1,7'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2,8'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3,9'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4,10'/>
            <cpu id='5' socket_id='0' core_id='5' siblings='5,11'/>
            <cpu id='6' socket_id='0' core_id='0' siblings='0,6'/>
            <cpu id='7' socket_id='0' core_id='1' siblings='1,7'/>
            <cpu id='8' socket_id='0' core_id='2' siblings='2,8'/>
            <cpu id='9' socket_id='0' core_id='3' siblings='3,9'/>
            <cpu id='10' socket_id='0' core_id='4' siblings='4,10'/>
            <cpu id='11' socket_id='0' core_id='5' siblings='5,11'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
      <baselabel type='kvm'>system_u:system_r:svirt_t:s0</baselabel>
      <baselabel type='qemu'>system_u:system_r:svirt_tcg_t:s0</baselabel>
    </secmodel>
    <secmodel>
      <model>dac</model>
      <doi>0</doi>
      <baselabel type='kvm'>+107:+107</baselabel>
      <baselabel type='qemu'>+107:+107</baselabel>
    </secmodel>
  </host>

  <guest>
    <os_type>hvm</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <machine maxCpus='240'>pc-i440fx-rhel7.2.0</machine>
      <machine canonical='pc-i440fx-rhel7.2.0' maxCpus='240'>pc</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.0.0</machine>
      <machine maxCpus='240'>pc-q35-rhel7.1.0</machine>
      <machine maxCpus='240'>rhel6.3.0</machine>
      <machine maxCpus='240'>pc-q35-rhel7.2.0</machine>
      <machine canonical='pc-q35-rhel7.2.0' maxCpus='240'>q35</machine>
      <machine maxCpus='240'>rhel6.4.0</machine>
      <machine maxCpus='240'>rhel6.0.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.1.0</machine>
      <machine maxCpus='240'>rhel6.5.0</machine>
      <machine maxCpus='240'>rhel6.6.0</machine>
      <machine maxCpus='240'>rhel6.1.0</machine>
      <machine maxCpus='240'>pc-q35-rhel7.0.0</machine>
      <machine maxCpus='240'>rhel6.2.0</machine>
      <domain type='qemu'/>
      <domain type='kvm'>
        <emulator>/usr/libexec/qemu-kvm</emulator>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
      <pae/>
      <nonpae/>
    </features>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='x86_64'>
      <wordsize>64</wordsize>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <machine maxCpus='240'>pc-i440fx-rhel7.2.0</machine>
      <machine canonical='pc-i440fx-rhel7.2.0' maxCpus='240'>pc</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.0.0</machine>
      <machine maxCpus='240'>pc-q35-rhel7.1.0</machine>
      <machine maxCpus='240'>rhel6.3.0</machine>
      <machine maxCpus='240'>pc-q35-rhel7.2.0</machine>
      <machine canonical='pc-q35-rhel7.2.0' maxCpus='240'>q35</machine>
      <machine maxCpus='240'>rhel6.4.0</machine>
      <machine maxCpus='240'>rhel6.0.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.1.0</machine>
      <machine maxCpus='240'>rhel6.5.0</machine>
      <machine maxCpus='240'>rhel6.6.0</machine>
      <machine maxCpus='240'>rhel6.1.0</machine>
      <machine maxCpus='240'>pc-q35-rhel7.0.0</machine>
      <machine maxCpus='240'>rhel6.2.0</machine>
      <domain type='qemu'/>
      <domain type='kvm'>
        <emulator>/usr/libexec/qemu-kvm</emulator>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
    </features>
  </guest>

</capabilities>

2016-08-19 20:49:36.492 36603 ERROR nova.compute.manager [req-9a157607-4a33-44bc-bc7e-ad952136185b - - - - -] No compute node record for host overcloud-compute-1.localdomain
2016-08-19 20:49:36.496 36603 WARNING nova.compute.monitors [req-9a157607-4a33-44bc-bc7e-ad952136185b - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2016-08-19 20:49:36.497 36603 INFO nova.compute.resource_tracker [req-9a157607-4a33-44bc-bc7e-ad952136185b - - - - -] Auditing locally available compute resources for node overcloud-compute-1.localdomain

Comment 3 Omri Hochman 2016-08-19 21:20:25 UTC
Adding SOS report: 
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1368587

Comment 4 Marius Cornea 2016-08-22 18:21:31 UTC
The symptom looks pretty much the same as the one reported in BZ#1356107, where the cause was that /etc/ceph/ceph.client.openstack.keyring on the new compute node had the wrong format.
It looked like this:

[client.openstack]
        key = AAAAAAAAAAAAAAAA

Where a valid key is:

[client.openstack]
        key = AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ==
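
For reference, the "key =" value in a keyring is a base64 blob wrapping a small binary structure. As a minimal sketch - assuming the upstream CryptoKey little-endian layout (u16 type, u32 sec, u32 nsec, u16 secret length, then the secret), which is an assumption about the encoding, not something stated in this bug - a heuristic check that rejects the all-"A" key above could look like:

```python
import base64
import struct

def is_plausible_cephx_key(b64key):
    """Heuristic sanity check of a keyring 'key =' value.

    Assumes the decoded blob is: u16 type, u32 sec, u32 nsec,
    u16 length, then `length` bytes of secret (all little-endian).
    """
    try:
        buf = base64.b64decode(b64key, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False
    if len(buf) < 12:  # too short to even hold the fixed-size header
        return False
    ktype = struct.unpack_from('<H', buf, 0)[0]   # key type (0 = none)
    klen = struct.unpack_from('<H', buf, 10)[0]   # declared secret length
    # Reject type "none", empty secrets, and length/payload mismatches.
    return ktype != 0 and klen > 0 and len(buf) == 12 + klen

# The malformed key from this bug decodes to all-zero bytes
# (type 0, empty secret), so it fails the check:
print(is_plausible_cephx_key("AAAAAAAAAAAAAAAA"))  # False
print(is_plausible_cephx_key("AQBHOJdXAAAAABAAod6tL8beRSB1IasVq0FywQ=="))  # True
```

This is only a plausibility filter, not a substitute for ceph-authtool; it illustrates why sixteen "A"s decode to a structurally empty key that librados then mishandled.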

Comment 5 seb 2016-08-25 08:12:37 UTC
I'm really tempted to close this one.
As Marius mentioned, it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1356107

Omri, can we close this?

Comment 6 Ken Dreyer (Red Hat) 2016-08-25 17:11:55 UTC
It sounds like an input validation bug in librados? If users can segfault librados with a malformed key, it sounds like we should keep this BZ open to track that?

Comment 7 Ken Dreyer (Red Hat) 2016-08-25 22:14:35 UTC
Brad mentioned that this segfault is likely fixed upstream in Ceph master (https://github.com/ceph/ceph/pull/9703 prints errors instead). It will make its way into the next major RH Ceph Storage release (RHCS 3).

Is that ok for RHEL OSP, or should we consider this for backporting to RHCS 2?

Comment 9 Brad Hubbard 2016-08-25 23:24:48 UTC
Adding http://tracker.ceph.com/issues/2904, which hopefully explains how we got the "AAAAAAAAAAAAAAAA" key in the first place and how that has also been resolved. The fix was merged in v11.0.0-371-gcbc9839 upstream.

Comment 10 seb 2016-08-31 13:52:12 UTC
Currently discussing when we can land this in RHCS 2 (perhaps 2.0.1); then we will close this (after the cherry-pick).

Comment 12 seb 2016-09-22 12:26:32 UTC
Yogev, can we get this in QA?
Thanks!

Comment 14 Yogev Rabl 2017-03-07 18:14:35 UTC
This bug will be tested later on, during the OSP 11 cycle.
We don't have the necessary resources to validate it at the moment.

Comment 15 Ken Dreyer (Red Hat) 2017-11-28 19:00:02 UTC
Fixed as of RHCEPH 3.0.

