| Summary: | nova-compute segfaults in librbd1 | |||
|---|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | VIKRANT <vaggarwa> | |
| Component: | RBD | Assignee: | Jason Dillaman <jdillama> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 1.3.2 | CC: | apevec, ceph-eng-bugs, fpercoco, jeckersb, kdreyer, lhh, oblaut, plemenko, srevivo | |
| Target Milestone: | rc | |||
| Target Release: | 1.3.4 | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1378064 | Environment: | |
| Last Closed: | 2016-09-28 13:19:46 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1378064 | |||
Let me explain what was actually fixed in bug 1315842. Prior to the build python-oslo-messaging-1.8.3-5.el7ost, any AMQP error (which is actually a valid reply from RabbitMQ, please note that) would block one connection, effectively reducing the number of available FDs. At some point none would be left, and a client communicating with the cluster via oslo.messaging would stop sending and receiving messages. This was fixed: now, regardless of the type of reply from the bus, the client continues to operate, and any errors are still logged. Unfortunately, if a client goes offline (is restarted), we can't do much from the oslo.messaging or RabbitMQ side; normally the connection should be maintained from the client's side. This is the main issue.

Meanwhile, the scary message was demoted to debug level in this commit (merged upstream in version 4.0.0): https://review.openstack.org/#/c/256312/

This is likely the issue preventing the client from reconnecting: https://bugs.launchpad.net/oslo.messaging/+bug/1493890

And the proposed fix: https://review.openstack.org/#/c/253510/

Thanks for the updates Peter, these are really helpful. Should I ask the customer to manually apply the patch mentioned in the proposed fix to verify the functionality?

(In reply to VIKRANT from comment #6)
> Thanks for the updates Peter, these are really helpful.
>
> Should I ask customer to manually apply the patch mentioned in proposed fix
> to verify the functionality ?

No, please don't! We're reviewing these changes internally (whether they can be backported, whether they are safe to apply, and whether they introduce any new bugs). If everything is fine, we'll update this ticket and post the new build information.

This is the root cause of the problem, from nova-compute.log:

~~~
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain kernel: nova-compute[7031]: segfault at 0 ip 00007f08f0b5df5a sp 00007f08a77fdb90 error 4 in librbd.so.1.0.0[7f08f09f9000+562000]
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain libvirtd[32920]: End of file while reading data: Input/output error
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain systemd[1]: openstack-nova-compute.service: main process exited, code=killed, status=11/SEGV
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain systemd[1]: Unit openstack-nova-compute.service entered failed state.
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain systemd[1]: openstack-nova-compute.service failed.
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain systemd[1]: openstack-nova-compute.service holdoff time over, scheduling restart.
Sep 19 07:55:26 hpbs11-10to16-compute-0.localdomain systemd[1]: Starting OpenStack Nova Compute Server...
Sep 19 07:55:28 hpbs11-10to16-compute-0.localdomain systemd[1]: Started OpenStack Nova Compute Server.
~~~

So ultimately this is a Ceph crash in librbd that takes out nova-compute. You get the error messages in nova-conductor because it cannot contact the old compute service after it crashed. Fixing up the summary and reassigning to Ceph/RBD.

This is with librbd1-0.94.5-13.el7cp.x86_64, by the way.

I'll need a core dump from the failed qemu process, but this is most likely fixed by BZ 1358697 (librbd1-0.94.5-15.el7cp.x86_64).
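As an aside, one way to check whether other compute nodes are hitting the same crash signature, before a core dump is available, is to scan the kernel log for segfault lines whose faulting library is librbd.so. The following is a minimal illustrative sketch, not part of any fix: the log path and the regular expression are assumptions derived from the journal excerpt above.

~~~
#!/usr/bin/env python
# Illustrative sketch only: scan a saved kernel/journal log for segfaults
# that land inside librbd.so, matching the signature shown in the
# nova-compute.log excerpt above. The default log path is an assumption.
import re
import sys

SEGFAULT_RE = re.compile(
    r'(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at \S+ ip \S+ sp \S+ '
    r'error \d+ in (?P<lib>librbd\.so\S*)'
)

def find_librbd_segfaults(log_path):
    """Return (process, pid, library) tuples for librbd segfault lines."""
    hits = []
    with open(log_path) as log:
        for line in log:
            match = SEGFAULT_RE.search(line)
            if match:
                hits.append((match.group('proc'),
                             match.group('pid'),
                             match.group('lib')))
    return hits

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else '/var/log/messages'
    for proc, pid, lib in find_librbd_segfaults(path):
        print('%s[%s] crashed in %s' % (proc, pid, lib))
~~~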
Jason, upgrading the packages to the following versions, as you mentioned, fixed the issue:

librbd1-0.94.5-15.el7.x86_64.rpm
librados2-0.94.5-15.el7.x86_64.rpm
libradosstriper1-0.94.5-15.el7.x86_64.rpm
ceph-0.94.5-15.el7.x86_64.rpm
ceph-common-0.94.5-15.el7.x86_64.rpm
ceph-mon-0.94.5-15.el7.x86_64.rpm
ceph-osd-0.94.5-15.el7.x86_64.rpm
python-rados-0.94.5-15.el7.x86_64.rpm
python-rbd-0.94.5-15.el7.x86_64.rpm

We can close this bug.
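For completeness, a quick way to confirm that a node carries the fixed build is to compare the installed version-release of the Ceph client packages against the 0.94.5-15 release listed above. The sketch below is illustrative only: the package subset is taken from the list above, and the comparison is a simplified string check rather than a full RPM version comparison.

~~~
#!/usr/bin/env python
# Minimal verification sketch (not part of the fix): query the installed
# version-release of the Ceph client packages and flag any that do not
# match the 0.94.5-15 build. Plain string check only; treat as a hint.
import subprocess

FIXED_RELEASE = '0.94.5-15'
PACKAGES = ['librbd1', 'librados2', 'ceph-common', 'python-rados', 'python-rbd']

def installed_version(package):
    """Return 'VERSION-RELEASE' for an installed package, or None."""
    try:
        out = subprocess.check_output(
            ['rpm', '-q', '--queryformat', '%{VERSION}-%{RELEASE}', package])
    except subprocess.CalledProcessError:
        return None
    return out.decode().strip()

if __name__ == '__main__':
    for pkg in PACKAGES:
        version = installed_version(pkg)
        if version is None:
            print('%s: not installed' % pkg)
        elif version.startswith(FIXED_RELEASE):
            print('%s: %s (contains the fix)' % (pkg, version))
        else:
            print('%s: %s (may predate %s)' % (pkg, version, FIXED_RELEASE))
~~~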
Description of problem:
The customer is using nova affinity hints to spawn multiple instances on a single compute node. Some of the instances get stuck in BUILD or go into ERROR state. Nothing is reported in the nova-compute.log file; only an error related to RabbitMQ is found. The following message is seen in nova-conductor.log on the controller node:

~~~
2016-09-19 07:55:26.701 15757 INFO oslo_messaging._drivers.amqpdriver [req-159bf22f-ad2f-4a58-8047-2a0e2ee5cd9f 2909d7a039c242478b876add46036beb e01e128d13b540a78f37736a211c0483 - - -] The reply a105baabb7da484f9c00e4055cb4fa35 cannot be sent reply_21df10f1050d494599d6950fca2dae66 reply queue don't exist, retrying...
~~~

Version-Release number of selected component (if applicable):
RHEL OSP 7

How reproducible:
Every time for the customer, after a couple of attempts to spawn instances.

Steps to Reproduce:
1. Spawn an instance using an affinity hint. It spawns successfully.
2. Spawn 14 instances using the affinity hint again.
3. If the instances spawn successfully, delete them and repeat step 2 until the issue is hit. (A scripted sketch of these steps follows the additional info below.)

Actual results:
Instances fail because of the missing RabbitMQ queue.

Expected results:
Instances should not fail.

Additional info:
The errors below are seen in the RabbitMQ logs.

~~~
=ERROR REPORT==== 19-Sep-2016::07:55:26 ===
connection <0.2157.9>, channel 2 - soft error:
{amqp_error,not_found,
 "no exchange 'reply_21df10f1050d494599d6950fca2dae66' in vhost '/'",
 'exchange.declare'}

=ERROR REPORT==== 19-Sep-2016::07:55:26 ===
connection <0.2157.9>, channel 1 - soft error:
{amqp_error,not_found,
 "no exchange 'reply_21df10f1050d494599d6950fca2dae66' in vhost '/'",
 'exchange.declare'}
~~~

Bug [1] was found with a similar issue, but the customer is already running python-oslo-messaging-1.8.3-5.el7ost.noarch, in which that issue is fixed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1315842
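For reference, the reproduction steps above could be scripted roughly as follows with python-novaclient. This is a hypothetical sketch, not the customer's actual tooling: the credentials, image and flavor IDs, and group name are placeholders, and the exact client constructor arguments may differ between novaclient releases.

~~~
#!/usr/bin/env python
# Hypothetical reproduction sketch for the steps above: create an affinity
# server group and boot several instances with the group scheduler hint.
# Credentials, image/flavor IDs, and names are placeholders.
from novaclient import client

# Placeholder credentials -- replace with real environment values.
nova = client.Client('2', 'admin', 'PASSWORD', 'admin',
                     'http://controller:5000/v2.0')

IMAGE_ID = 'IMAGE-UUID'    # placeholder
FLAVOR_ID = 'FLAVOR-ID'    # placeholder

# Steps 1-2: create an affinity group, then boot instances with the hint.
group = nova.server_groups.create(name='repro-affinity',
                                  policies=['affinity'])

servers = []
for i in range(14):
    servers.append(nova.servers.create(
        name='repro-vm-%02d' % i,
        image=IMAGE_ID,
        flavor=FLAVOR_ID,
        scheduler_hints={'group': group.id}))

# Step 3: poll the instances and report any that are stuck or errored.
for server in servers:
    current = nova.servers.get(server.id)
    print('%s: %s' % (current.name, current.status))
~~~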