Bug 1749447
| Summary: | nova-compute does not send Basic.Publish to conductor and service times out in Red Hat OpenStack Platform 13 | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Andreas Karis <akaris> |
| Component: | openstack-nova | Assignee: | Matthew Booth <mbooth> |
| Status: | CLOSED DUPLICATE | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | broskos, dasmith, eglynn, jhakimra, kchamart, pkesavar, sbauza, sgordon, vromanso |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-06 13:35:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description Andreas Karis 2019-09-05 15:58:45 UTC
During that time period, the nova-compute.log shows:

~~~
2019-09-04 17:02:53.230 1 DEBUG oslo_service.periodic_task [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2019-09-04 17:02:53.248 1 DEBUG nova.compute.resource_tracker [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Auditing locally available compute resources for dcvbgfdellpod-overcloud-computeovsdpdk-5.localdomain (node: dcvbgfdellpod-overcloud-computeovsdpdk-5.localdomain) update_available_resource /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:689
2019-09-04 17:03:07.133 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Running cmd (subprocess): /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/3962a175-766a-4a07-a38f-6854c1e46326/disk --force-share execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
2019-09-04 17:03:07.324 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] CMD "/usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/3962a175-766a-4a07-a38f-6854c1e46326/disk --force-share" returned: 0 in 0.191s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
2019-09-04 17:03:07.325 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Running cmd (subprocess): /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/3962a175-766a-4a07-a38f-6854c1e46326/disk --force-share execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
2019-09-04 17:03:07.588 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] CMD "/usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/3962a175-766a-4a07-a38f-6854c1e46326/disk --force-share" returned: 0 in 0.263s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
2019-09-04 17:04:20.299 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Running cmd (subprocess): /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/69695e7c-e285-4ade-b592-05014234676d/disk --force-share execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
2019-09-04 17:04:20.581 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] CMD "/usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/69695e7c-e285-4ade-b592-05014234676d/disk --force-share" returned: 0 in 0.282s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
2019-09-04 17:04:20.582 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Running cmd (subprocess): /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/69695e7c-e285-4ade-b592-05014234676d/disk --force-share execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
2019-09-04 17:04:20.876 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] CMD "/usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/69695e7c-e285-4ade-b592-05014234676d/disk --force-share" returned: 0 in 0.294s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
~~~

nova-compute sends a status report to the conductor every 10 seconds via AMQP. These reports look similar to the following, but they are no longer being sent out (I masked the hypervisor name below to ovsdpdk-5.localdomain):

~~~
"{\"_context_domain\": null, \"_msg_id\": \"88ffd4f05e214cfba46899f44b50b67e\", \"_context_global_request_id\": null, \"_context_quota_class\": null, \"_context_read_only\": false, \"_context_request_id\": \"req-c6ecaa26-d99b-436e-a641-48e998492709\", \"_context_service_catalog\": [], \"args\": {\"objmethod\": \"save\", \"args\": [], \"objinst\": {\"nova_object.version\": \"1.22\", \"nova_object.changes\": [\"report_count\"], \"nova_object.name\": \"Service\", \"nova_object.data\": {\"binary\": \"nova-compute\", \"uuid\": \"051d586e-e5ca-4eba-b3f4-56101081a3df\", \"deleted\": false, \"created_at\": \"2019-08-26T22:57:15Z\", \"updated_at\": \"2019-09-04T16:59:50Z\", \"report_count\": 53574, \"topic\": \"compute\", \"host\": \"ovsdpdk-5.localdomain\", \"version\": 30, \"disabled\": false, \"forced_down\": false, \"last_seen_up\": \"2019-09-04T16:59:50Z\", \"deleted_at\": null, \"disabled_reason\": null, \"id\": 185}, \"nova_object.namespace\": \"nova\"}, \"kwargs\": {}}, \"_unique_id\": \"02e250c11d264100b92c4f33e03fc25f\", \"_context_resource_uuid\": null, \"_context_instance_lock_checked\": false, \"_context_is_admin_project\": true, \"_context_user\": null, \"_context_user_id\": null, \"_context_project_name\": null, \"_context_read_deleted\": \"no\", \"_context_user_identity\": \"- - - - -\", \"_reply_q\": \"reply_301d63b27bf147b8a3bb3ff2556486b1\", \"_context_auth_token\": null, \"_context_show_deleted\": false, \"_context_tenant\": null, \"_context_roles\": [], \"_context_is_admin\": true, \"version\": \"3.0\", \"_context_project_id\": null, \"_context_project_domain\": null, \"_context_timestamp\": \"2019-09-04T15:26:28.894135\", \"_context_user_domain\": null, \"_context_user_name\": null, \"method\": \"object_action\", \"_context_project\": null, \"_context_remote_address\": null}"
~~~

The nova log entries might hint at https://github.com/openstack/nova/blob/master/nova/virt/images.py#L74 and https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/prlimit.py#L47, but I now think this is a red herring. We can further see that the nova process goes dead silent and that, for most of the time, the only thing passing over the socket is TCP keepalives. The oslo heartbeat mechanism still works, too ... however, I suspect that the nova thread responsible for updating the service record hangs.
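For context on the qemu-img lines above: nova runs qemu-img info through oslo.concurrency's prlimit wrapper, which re-executes the command as /usr/bin/python2 -m oslo_concurrency.prlimit with address-space and CPU-time caps, which is exactly the command line visible in the log. A rough sketch of that call path follows (simplified, not the verbatim nova code; the helper name is only for illustration):

~~~
# Rough sketch (not the verbatim nova code) of how the logged qemu-img calls
# are made: oslo.concurrency re-executes the command under
# "python -m oslo_concurrency.prlimit" to cap its address space and CPU time.
from oslo_concurrency import processutils

# Matches the "--as=1073741824 --cpu=30" flags seen in the log above.
QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=30,
    address_space=1024 * 1024 * 1024)


def qemu_img_info(path):
    # Illustrative helper name; "--force-share" lets qemu-img read an image
    # that a running guest still holds open.
    out, err = processutils.execute(
        'env', 'LC_ALL=C', 'LANG=C',
        'qemu-img', 'info', path, '--force-share',
        prlimit=QEMU_IMG_LIMITS)
    return out
~~~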
cat ./sos_commands/docker/docker_ps

~~~
CONTAINER ID   IMAGE                                                                                          COMMAND         CREATED      STATUS                       PORTS   NAMES
371e235371da   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-neutron-openvswitch-agent:13.0-90   "kolla_start"   8 days ago   Up 8 days (healthy)                  neutron_ovs_agent
8a16bf665b92   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-nova-compute:13.0-98                "kolla_start"   8 days ago   Up About an hour (healthy)           nova_compute
0a5c0e25091c   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-cron:13.0-90                        "kolla_start"   8 days ago   Up 8 days                            logrotate_crond
c0daf6a5ddd0   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-ceilometer-compute:13.0-82          "kolla_start"   8 days ago   Up 8 days                            ceilometer_agent_compute
e4677ab3e60a   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-nova-compute:13.0-98                "kolla_start"   8 days ago   Up About an hour (healthy)           nova_migration_target
e3ae860918a3   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-iscsid:13.0-82                      "kolla_start"   8 days ago   Up 8 days (healthy)                  iscsid
fed9252ce94e   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-nova-libvirt:13.0-101               "kolla_start"   8 days ago   Up About an hour                     nova_libvirt
2e2b67b81a63   10.243.0.5:8787/ven-ericsson-osp13_containers_for_vbgf_pod-nova-libvirt:13.0-101               "kolla_start"   8 days ago   Up About an hour                     nova_virtlogd
~~~

At first glance this looks very much like a symptom of https://bugs.launchpad.net/nova/+bug/1840912, which I already have an upstream patch posted for. I'm still pushing this.

*** This bug has been marked as a duplicate of bug 1733930 ***

Here we can see consecutive log messages in the resource tracker separated by over a minute:

~~~
2019-09-04 17:00:36.434 1 DEBUG nova.compute.resource_tracker [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Auditing locally available compute resources for overcloud-computeovsdpdk-5.localdomain (node: overcloud-computeovsdpdk-5.localdomain) update_available_resource /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:689
2019-09-04 17:01:49.026 1 DEBUG oslo_concurrency.processutils [req-220c716c-3175-444e-86c1-b4f44497cdf5 - - - - -] Running cmd (subprocess): /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/3962a175-766a-4a07-a38f-6854c1e46326/disk --force-share execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
~~~
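The observation above (the resource audit and the service report going quiet at the same time) fits the usual eventlet failure mode in nova-compute: the periodic tasks and the 10-second service heartbeat run as greenthreads inside one process, so a single call that blocks without yielding to the event hub stalls everything else, and once the heartbeat is late by more than nova's service_down_time (60 seconds by default) the service is reported as down. Below is a minimal, self-contained illustration of that starvation using plain eventlet rather than nova code; heartbeat and resource_audit are made-up names standing in for the real periodic tasks:

~~~
# Illustration only (not nova code): one greenthread making a truly blocking
# call starves another greenthread that is supposed to run every 10 seconds.
import eventlet
eventlet.monkey_patch()

import time

# Unpatched sleep blocks the whole OS thread instead of yielding to the hub,
# standing in for a hung native call or subprocess wait in the resource audit.
blocking_sleep = eventlet.patcher.original('time').sleep


def heartbeat():
    # Stand-in for the Service.save report nova-compute sends every 10 seconds.
    while True:
        print("heartbeat sent at %s" % time.strftime("%H:%M:%S"))
        eventlet.sleep(10)


def resource_audit():
    # Stand-in for update_available_resource getting stuck in a blocking call.
    print("audit started at  %s" % time.strftime("%H:%M:%S"))
    blocking_sleep(35)  # no heartbeat can be sent during these 35 seconds
    print("audit finished at %s" % time.strftime("%H:%M:%S"))


eventlet.spawn(heartbeat)
eventlet.sleep(1)                       # let the first heartbeat go out
eventlet.spawn(resource_audit).wait()   # heartbeats resume only after this returns
eventlet.sleep(1)
~~~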