Description of problem:

After running an update from beta to the 2018-06-08.3 puddle and scaling out compute (adding 1 compute node to the overcloud), all old nodes (3 controllers, 1 compute) are in ERROR status. On 2 of the controllers, the openstack-memcached docker container is stuck in Restarting:

410b48140798 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-08.3 "/bin/bash -c 'sou..." 16 hours ago Restarting (1) 5 hours ago memcached

Trying to check the docker logs gives a permission error:

[heat-admin@controller-0 ~]$ sudo docker logs memcached
/bin/bash: /var/log/memcached.log: Permission denied

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install beta overcloud
2. Update overcloud
3. Scale out compute (add 1 compute node)

Actual results:

Expected results:

Additional info:
Hi Raviv, looking at home/stack/overcloud_scale_up.log I see that the heat process completed successfully:

2018-06-11 23:08:24Z [overcloud-AllNodesDeploySteps-54k52bq4bzns]: UPDATE_COMPLETE Stack UPDATE completed successfully
2018-06-11 23:08:24Z [AllNodesDeploySteps]: UPDATE_COMPLETE state changed
2018-06-11 23:08:31Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

Looking at the CI job I couldn't find an obvious error there either. Could you tell me what "ERROR status" refers to in this sentence?

> all old nodes (3 controllers, 1 compute) are in ERROR status

In the meantime, continuing to dig into the memcached issue.
Hi Sofer,

The ERROR status is the state the overcloud nodes show in the output of `openstack server list`.
OK, so we're waiting for the CI test to be run again to get more info on the "openstack server list" output. In particular we want the output of "openstack server show <id>" to get the exact error message, which may be related to the virt-env configuration:

<yprokule> rbartal: chem ain't ERROR state of those nodes related to issues with vbmc ? [16:28]

This needs checking (and thus the live env).

For the memcached issue, it may be unrelated to scale out. Trying to get a reproducer by updating another osp13 environment. Currently I can see that the /var/log/containers/memcache directory is missing from controller-0,1,2, but that may be a logging artifact (from the sosreport).

We can also see that the container starts to fail after the reboot of the node, which seems to happen in the job after the update:

<jistr> Jun 11 22:17:29 controller-0 systemd[1]: Stopped Docker Application Container Engine.
<jistr> -- Reboot --
<jistr> Jun 11 22:18:30 controller-0 systemd[1]: Starting Docker Application Container Engine...
<jistr> and first mention of the error is right after that: [18:27]
<jistr> Jun 11 22:18:40 controller-0 dockerd-current[2589]: /bin/bash: /var/log/memcached.log: Permission denied

In between there is a reboot. We have those logs related to the memcached container after the reboot [1], but not sure it's related.

Next step would be to run an update, then reboot, and see what happens to the containers, as it seems this is happening before scale out.

[1] http://pastebin.test.redhat.com/603667
(In reply to Sofer Athlan-Guyot from comment #4)
> Next step would be to run an update, and reboot, and see what happen
> to containers, as it seems this is happening before scale out.

I had an updated environment and rebooted the controller nodes post update, and the memcached container started OK:

[heat-admin@controller-0 ~]$ sudo docker ps | grep memcache
5cb57b77bc17 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1 "/bin/bash -c 'sou..." 21 hours ago Up 23 minutes memcached

[heat-admin@controller-1 ~]$ sudo docker ps | grep memcache
662ac02958b7 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1 "/bin/bash -c 'sou..." 21 hours ago Up 12 minutes memcached

[root@controller-2 heat-admin]# docker ps | grep memcache
9fd750ebc24c 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1 "/bin/bash -c 'sou..." 22 hours ago Up 36 seconds memcached
Created https://bugzilla.redhat.com/show_bug.cgi?id=1591278 to follow the memcached issue, focusing here on this error, which happens after the scale out:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+---------------------------------+------------+
| ID                                   | Name         | Status | Networks               | Image                           | Flavor     |
+--------------------------------------+--------------+--------+------------------------+---------------------------------+------------+
| d79ce12e-52e7-415a-86d2-953da21a0f81 | compute-1    | ACTIVE | ctlplane=192.168.24.11 | overcloud-full                  | compute    |
| 5ad98bd6-c879-4b18-84a5-69df3020a0fb | controller-2 | ERROR  | ctlplane=192.168.24.9  | overcloud-full_20180613T152505Z | controller |
| 82746f0a-d2c9-429d-bb48-648a61c156ae | compute-0    | ERROR  | ctlplane=192.168.24.13 | overcloud-full_20180613T152505Z | compute    |
| b6732932-9954-4d36-a756-b2f2e796a53e | controller-0 | ERROR  | ctlplane=192.168.24.10 | overcloud-full_20180613T152505Z | controller |
| b590302c-982a-4a14-b800-5cfd487ac52a | controller-1 | ERROR  | ctlplane=192.168.24.16 | overcloud-full_20180613T152505Z | controller |
+--------------------------------------+--------------+--------+------------------------+---------------------------------+------------+

(undercloud) [stack@undercloud-0 ~]$ openstack server show controller-2
+-------------------------------------+------------------------------------------------------------------------+
| Field                               | Value                                                                  |
+-------------------------------------+------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                 |
| OS-EXT-AZ:availability_zone         | nova                                                                   |
| OS-EXT-SRV-ATTR:host                | undercloud-0.redhat.local                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname | 4dc3f9d6-7809-4164-9f3f-ba14f3c46ca1                                   |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000001                                                      |
| OS-EXT-STS:power_state              | Running                                                                |
| OS-EXT-STS:task_state               | None                                                                   |
| OS-EXT-STS:vm_state                 | error                                                                  |
| OS-SRV-USG:launched_at              | 2018-06-13T15:55:58.000000                                             |
| OS-SRV-USG:terminated_at            | None                                                                   |
| accessIPv4                          |                                                                        |
| accessIPv6                          |                                                                        |
| addresses                           | ctlplane=192.168.24.9                                                  |
| config_drive                        | True                                                                   |
| created                             | 2018-06-13T15:51:47Z                                                   |
| flavor                              | controller (c0698e0b-5333-410d-8878-1c44490f6d32)                      |
| hostId                              | 84894593ace41fc1609184266da515ff275043cd68a4be54a84e6758               |
| id                                  | 5ad98bd6-c879-4b18-84a5-69df3020a0fb                                   |
| image                               | overcloud-full_20180613T152505Z (fc175c62-1252-48d3-a7ba-b7c5bb9f5def) |
| key_name                            | default                                                                |
| name                                | controller-2                                                           |
| project_id                          | 82315f5afedc4594bbb392c4891453c2                                       |
| properties                          |                                                                        |
| security_groups                     | name='default'                                                         |
| status                              | ERROR                                                                  |
| updated                             | 2018-06-13T17:46:38Z                                                   |
| user_id                             | 1d157ecf0818414c818bedf56d68518e                                       |
| volumes_attached                    |                                                                        |
+-------------------------------------+------------------------------------------------------------------------+
After a long debug session with Jiri, we are still not able to get to the root cause of the issue. According to the timestamp in the db, the vm got into error either at the end of "openstack undercloud upgrade" or at the beginning of "openstack overcloud image upload --update-existing".

We are trying to get a reproducer by running the CI steps manually.

Regarding the job itself, we noted that a lot of tempest runs happen before the end of the scale out:
- we would recommend only one tempest run at the end, as this could very much change the cluster in strange ways;
- the infrared scale out command is wrong:

infrared cloud-config -o scale-up.yml --tasks scale_up --splitstack no --build beta --mirror tlv --scale-nodes compute-1

but should be

infrared cloud-config -o scale-up.yml --tasks scale_up --splitstack no --build passed_phase2 --mirror tlv --scale-nodes compute-1

(build changed to passed_phase2)
Note that for making sense of times in the environment, timezones are important: node time (logs etc.) is UTC-4, CLI times (like `openstack server show` pasted in earlier comment) are UTC, and Jenkins logs are UTC+3.

In addition to the failures happening around `openstack undercloud upgrade` and `openstack overcloud image upload --update-existing`, we observed this:

Close to the 'updated' time from `openstack server show` there are errors in nova-compute.log:

2018-06-13 13:42:23.917 19257 DEBUG ironicclient.common.http [req-3a652a75-e204-4713-a384-adff28e259a2 - - - - -] Error contacting Ironic server: Unable to attach VIF because VIF 1711a06f-cc58-4552-b55b-9ce4a870f3f3 is already attached to Ironic Port cd2051a2-fc3c-42ee-8892-6d6230f90d24 (HTTP 409). Attempt 1 of 2 wrapper /usr/lib/python2.7/site-packages/ironicclient/common/http.py:258
2018-06-13 13:42:26.334 19257 ERROR ironicclient.common.http [req-3a652a75-e204-4713-a384-adff28e259a2 - - - - -] Error contacting Ironic server: Unable to attach VIF because VIF 1711a06f-cc58-4552-b55b-9ce4a870f3f3 is already attached to Ironic Port cd2051a2-fc3c-42ee-8892-6d6230f90d24 (HTTP 409). Attempt 2 of 2: Conflict: Unable to attach VIF because VIF 1711a06f-cc58-4552-b55b-9ce4a870f3f3 is already attached to Ironic Port cd2051a2-fc3c-42ee-8892-6d6230f90d24 (HTTP 409)

and in ironic app.log:

2018-06-13 13:42:23.911 17079 DEBUG wsme.api [req-c90be820-9a77-47d3-a2ad-4778e5c73b09 a1dde5ec4fff4d4889dc30059a4abe4a 5f6f06c988c24a7eb4d761360980fdc4 - default default] Client-side error: Unable to attach VIF because VIF 1711a06f-cc58-4552-b55b-9ce4a870f3f3 is already attached to Ironic Port cd2051a2-fc3c-42ee-8892-6d6230f90d24
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 2969, in vif_attach
    task.driver.network.vif_attach(task, vif_info)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 551, in vif_attach
    port_like_obj = get_free_port_like_object(task, vif_id, physnets)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 172, in get_free_port_like_object
    physnets)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 117, in _get_free_portgroups_and_ports
    if _vif_attached(p, vif_id):
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 55, in _vif_attached
    vif=vif_id, object_uuid=port_like_obj.uuid)
VifAlreadyAttached: Unable to attach VIF because VIF 1711a06f-cc58-4552-b55b-9ce4a870f3f3 is already attached to Ironic Port cd2051a2-fc3c-42ee-8892-6d6230f90d24 format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222

It's also worth noting that both Ironic and Nova-compute were restarted very shortly before these errors appeared (timestamps of the restarts fall within undercloud update, so that's expected).
So it's not a scale up issue. We are able to reproduce it upstream using this simple sequence:

- install master;
- update undercloud;
-> the "overcloud (controller,compute)" vm_state switches to error

It seems not even an undercloud reboot is needed. You have to wait a little after the undercloud update, and then the overcloud nodes switch to vm_state error (from the undercloud: . stackrc && openstack server list).

It should be noted that the overcloud is working fine and can still be heat updated/upgraded and scaled up, but it looks very bad on the display (see https://bugzilla.redhat.com/show_bug.cgi?id=1590297#c9).

Nova people are looking into it currently, and we are working on a clean reproducer on osp13.
Just adding clarification to make sure we don't assume something that's not completely sure:

> So it's not a scale up issue. We are able to reproduce it upstream using
> this simple sequence:
>
> - install master;
> - update undercloud;
> -> "overcloud (controller,compute)" vm_state switch to error
>
> It seems not even an undercloud reboot is needed.

I saw the VIF errors in logs *before* reboot, but i only noticed the Nova VM state in ERROR *after* reboot. Just waiting (or another service restart) instead of rebooting the undercloud might be enough for VMs to switch state, but i don't think we're sure of that yet.
Looks like a race in the service startup.

(undercloud) [stack@undercloud ~]$ openstack server list
+--------------------------------------+-------------------------+--------+------------------------+------------------------+--------------+
| ID                                   | Name                    | Status | Networks               | Image                  | Flavor       |
+--------------------------------------+-------------------------+--------+------------------------+------------------------+--------------+
| fcda23f8-7b98-41bf-9e45-7a4579164874 | overcloud-controller-1  | ERROR  | ctlplane=192.168.24.7  | overcloud-full_renamed | oooq_control |
| ced0e0db-ad1f-4381-a523-0a79ae0303ff | overcloud-controller-2  | ERROR  | ctlplane=192.168.24.21 | overcloud-full_renamed | oooq_control |
| 0731685f-a8fc-4126-868a-bdf9879238f3 | overcloud-controller-0  | ERROR  | ctlplane=192.168.24.12 | overcloud-full_renamed | oooq_control |
| c4fd4e28-fe44-40d2-9b40-7196b7c3b6a2 | overcloud-novacompute-2 | ERROR  | ctlplane=192.168.24.15 | overcloud-full_renamed | oooq_compute |
| 5bef3afd-4485-4cbc-b76c-f126c85bd015 | overcloud-novacompute-1 | ERROR  | ctlplane=192.168.24.8  | overcloud-full_renamed | oooq_compute |
| e4c9da33-c452-446c-82c4-bd55a6b294d8 | overcloud-novacompute-0 | ERROR  | ctlplane=192.168.24.9  | overcloud-full_renamed | oooq_compute |
+--------------------------------------+-------------------------+--------+------------------------+------------------------+--------------+

Looking at controller-1...

(undercloud) [stack@undercloud ~]$ openstack server show overcloud-controller-1
+-------------------------------------+---------------------------------------------------------------+
| Field                               | Value                                                         |
+-------------------------------------+---------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                        |
| OS-EXT-AZ:availability_zone         | nova                                                          |
| OS-EXT-SRV-ATTR:host                | undercloud                                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname | b6a32fba-b57a-4e7b-a6ce-99941b4d134d                          |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000005                                             |
| OS-EXT-STS:power_state              | Running                                                       |
| OS-EXT-STS:task_state               | None                                                          |
| OS-EXT-STS:vm_state                 | error                                                         |
| OS-SRV-USG:launched_at              | 2018-06-11T14:56:34.000000                                    |
| OS-SRV-USG:terminated_at            | None                                                          |
| accessIPv4                          |                                                               |
| accessIPv6                          |                                                               |
| addresses                           | ctlplane=192.168.24.7                                         |
| config_drive                        | True                                                          |
| created                             | 2018-06-11T14:53:09Z                                          |
| flavor                              | oooq_control (04f8ba26-e9bd-472f-b92a-919e7ec8bed1)           |
| hostId                              | da8848b6f3dc77a51235f70dcf44df197261eca0529173f670c94ce9      |
| id                                  | fcda23f8-7b98-41bf-9e45-7a4579164874                          |
| image                               | overcloud-full_renamed (4d965d80-61fc-45d0-88e1-e6365c4afd57) |
| key_name                            | default                                                       |
| name                                | overcloud-controller-1                                        |
| project_id                          | 3b778414471a47e4b0760bbecc9d4070                              |
| properties                          |                                                               |
| security_groups                     | name='default'                                                |
| status                              | ERROR                                                         |
| updated                             | 2018-06-15T09:29:17Z                                          |
| user_id                             | 356b9a5451c643bb8162b9349bc9487b                              |
| volumes_attached                    |                                                               |
+-------------------------------------+---------------------------------------------------------------+

From /var/log/ironic/ironic-conductor.log I can see that ironic-conductor started loading extensions after the 'updated' timestamp:

2018-06-15 09:29:18.576 1408 DEBUG oslo_concurrency.lockutils [req-6ad8cb26-b607-4f65-a8b7-30ac72a7a997 - - - - -] Lock "extension_manager" acquired by "ironic.common.driver_factory._init_extension_manager" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273

The errors in /var/log/ironic/app.log occurred before this, because there wasn't a conductor registered that supports ipmi:
2018-06-15 09:29:14.419 1793 DEBUG wsme.api [req-81ce866e-1c15-4499-a452-044a44efa1c0 1a609f3c25c24c45ac30ee2fcc721eac 5909cc46dbad48058daa89d48f07ba71 - default default] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222
2018-06-15 09:29:15.304 1792 DEBUG wsme.api [req-be8a2443-12ad-4115-9640-b91b39d15765 1a609f3c25c24c45ac30ee2fcc721eac 5909cc46dbad48058daa89d48f07ba71 - default default] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222
2018-06-15 09:29:16.105 1793 DEBUG wsme.api [req-0f2c2010-3bce-4600-8c8a-54a6aa0b8965 1a609f3c25c24c45ac30ee2fcc721eac 5909cc46dbad48058daa89d48f07ba71 - default default] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222
2018-06-15 09:29:16.797 1792 DEBUG wsme.api [req-e660c0cd-8a25-4b93-815f-edc26c15ddba 1a609f3c25c24c45ac30ee2fcc721eac 5909cc46dbad48058daa89d48f07ba71 - default default] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222
2018-06-15 09:29:17.596 1793 DEBUG wsme.api [req-ffa50a06-3a62-4f3e-937c-4304818dca4c 1a609f3c25c24c45ac30ee2fcc721eac 5909cc46dbad48058daa89d48f07ba71 - default default] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222
2018-06-15 09:29:18.277 1793 DEBUG wsme.api [req-c391da42-997e-49ee-b604-11ab81f875e2 1a609f3c25c24c45ac30ee2fcc721eac 5909cc46dbad48058daa89d48f07ba71 - default default] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222

And /var/log/nova/nova-compute.log:

2018-06-15 09:29:12.007 2623 INFO nova.service [req-42a10a27-6527-45d3-9b7d-0c4dc2ac2f13 - - - - -] Starting compute node (version 18.0.0-0.20180601221704.f902e0d.el7)
2018-06-15 09:29:12.135 2623 DEBUG nova.servicegroup.drivers.db [req-42a10a27-6527-45d3-9b7d-0c4dc2ac2f13 - - - - -] Seems service nova-compute on host undercloud is down. Last heartbeat was 2018-06-15 09:27:40. Elapsed time is 92.135605 is_up /usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py:79
2018-06-15 09:29:12.260 2623 DEBUG nova.compute.manager [req-42a10a27-6527-45d3-9b7d-0c4dc2ac2f13 - - - - -] [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1167
2018-06-15 09:29:13.819 2623 DEBUG nova.virt.ironic.driver [req-42a10a27-6527-45d3-9b7d-0c4dc2ac2f13 - - - - -] plug: instance_uuid=5bef3afd-4485-4cbc-b76c-f126c85bd015 vif=[{"profile": {}, "ovs_interfaceid": null, "preserve_on_delete": true, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "192.168.24.8"}], "version": 4, "meta": {"dhcp_server": "192.168.24.5"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "192.168.23.1"}], "routes": [{"interface": null, "cidr": "169.254.169.254/32", "meta": {}, "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.24.1"}}], "cidr": "192.168.24.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.24.1"}}], "meta": {"injected": false, "tenant_id": "3b778414471a47e4b0760bbecc9d4070", "mtu": 1500}, "id": "1fe54003-4d8c-4b37-9f20-f04216ac4e26", "label": "ctlplane"}, "devname": "tapcbd2c5be-32", "vnic_type": "baremetal", "qbh_params": null, "meta": {}, "details": {}, "address": "00:1a:24:52:db:70", "active": true, "type": "other", "id": "cbd2c5be-32fd-4c5c-9a14-43e323071912", "qbg_params": null}] _plug_vifs /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:1397
2018-06-15 09:29:14.422 2623 ERROR nova.virt.ironic.driver [req-42a10a27-6527-45d3-9b7d-0c4dc2ac2f13 - - - - -] Cannot attach VIF cbd2c5be-32fd-4c5c-9a14-43e323071912 to the node 059ac0f4-3a78-4f67-9f75-57a7a1f9ec5c due to error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. (HTTP 400): BadRequest: No valid host was found. Reason: No conductor service registered which supports driver ipmi. (HTTP 400)
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [req-42a10a27-6527-45d3-9b7d-0c4dc2ac2f13 - - - - -] [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015] Vifs plug failed: VirtualInterfacePlugException: Cannot attach VIF cbd2c5be-32fd-4c5c-9a14-43e323071912 to the node 059ac0f4-3a78-4f67-9f75-57a7a1f9ec5c due to error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. (HTTP 400)
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015] Traceback (most recent call last):
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 942, in _init_instance
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]     self.driver.plug_vifs(instance, net_info)
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1441, in plug_vifs
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]     self._plug_vifs(node, instance, network_info)
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1408, in _plug_vifs
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]     raise exception.VirtualInterfacePlugException(msg)
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015] VirtualInterfacePlugException: Cannot attach VIF cbd2c5be-32fd-4c5c-9a14-43e323071912 to the node 059ac0f4-3a78-4f67-9f75-57a7a1f9ec5c due to error: No valid host was found. Reason: No conductor service registered which supports driver ipmi. (HTTP 400)
2018-06-15 09:29:14.422 2623 ERROR nova.compute.manager [instance: 5bef3afd-4485-4cbc-b76c-f126c85bd015]

etc... for the other instances
I think the hypothesis of a race in service startup is correct. I'm able to reproduce the 'ERROR' state of the overcloud in `openstack server list` by simply doing:

sudo systemctl stop openstack-ironic-conductor
sudo systemctl restart openstack-nova-compute

After this ^ the nodes go to 'ERROR'. Tried to recover from it this way:

sudo systemctl start openstack-ironic-conductor
sudo systemctl restart openstack-nova-compute

but the nodes still stay in 'ERROR'.
Hi,

Reproduced on OSP13 as well:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 9f80c38a-9f33-4a18-88e0-b89776e62150 | compute-0    | ACTIVE | ctlplane=192.168.24.18 | overcloud-full | compute    |
| e87efe17-b939-4df2-af0c-8e2effd58c95 | controller-1 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 5a3ea20c-75e8-49fe-90b6-edad01fc0a48 | controller-2 | ACTIVE | ctlplane=192.168.24.13 | overcloud-full | controller |
| ba0f26e7-ec2c-4e61-be8e-05edf00ce78a | controller-0 | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

(undercloud) [stack@undercloud-0 ~]$ sudo systemctl stop openstack-ironic-conductor
(undercloud) [stack@undercloud-0 ~]$ sudo systemctl restart openstack-nova-compute
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 9f80c38a-9f33-4a18-88e0-b89776e62150 | compute-0    | ERROR  | ctlplane=192.168.24.18 | overcloud-full | compute    |
| e87efe17-b939-4df2-af0c-8e2effd58c95 | controller-1 | ERROR  | ctlplane=192.168.24.9  | overcloud-full | controller |
| 5a3ea20c-75e8-49fe-90b6-edad01fc0a48 | controller-2 | ERROR  | ctlplane=192.168.24.13 | overcloud-full | controller |
| ba0f26e7-ec2c-4e61-be8e-05edf00ce78a | controller-0 | ERROR  | ctlplane=192.168.24.8  | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

Version (osp13 beta):

(undercloud) [stack@undercloud-0 ~]$ yum list | egrep 'ironic|nova'
openstack-ironic-api.noarch              1:10.1.2-3.el7ost                        @rhelosp-13.0-puddle
openstack-ironic-common.noarch           1:10.1.2-3.el7ost                        @rhelosp-13.0-puddle
openstack-ironic-conductor.noarch
openstack-ironic-inspector.noarch
openstack-ironic-staging-drivers.noarch
openstack-nova-api.noarch                1:17.0.3-0.20180420001138.el7ost
openstack-nova-common.noarch             1:17.0.3-0.20180420001138.el7ost
openstack-nova-compute.noarch            1:17.0.3-0.20180420001138.el7ost
openstack-nova-conductor.noarch          1:17.0.3-0.20180420001138.el7ost
openstack-nova-placement-api.noarch
openstack-nova-scheduler.noarch          1:17.0.3-0.20180420001138.el7ost
puppet-ironic.noarch                     12.4.0-0.20180329034302.8285d85.el7ost
puppet-nova.noarch                       12.4.0-3.el7ost                          @rhelosp-13.0-puddle
python-ironic-inspector-client.noarch
python-ironic-lib.noarch                 2.12.1-1.el7ost                          @rhelosp-13.0-puddle
python-nova.noarch                       1:17.0.3-0.20180420001138.el7ost
python-novaclient.noarch                 1:9.1.1-1.el7ost                         @rhelosp-13.0-puddle
python2-ironic-neutron-agent.noarch
python2-ironicclient.noarch              2.2.0-1.el7ost                           @rhelosp-13.0-puddle
openstack-ironic-python-agent.noarch
openstack-ironic-ui.noarch               3.1.2-1.el7ost                           rhelosp-13.0-puddle
openstack-nova.noarch                    1:17.0.3-0.20180420001138.el7ost
openstack-nova-cells.noarch              1:17.0.3-0.20180420001138.el7ost
openstack-nova-console.noarch            1:17.0.3-0.20180420001138.el7ost
openstack-nova-migration.noarch          1:17.0.3-0.20180420001138.el7ost
openstack-nova-network.noarch            1:17.0.3-0.20180420001138.el7ost
openstack-nova-novncproxy.noarch         1:17.0.3-0.20180420001138.el7ost
openstack-nova-serialproxy.noarch
openstack-nova-spicehtml5proxy.noarch
python-nova-tests.noarch                 1:17.0.3-0.20180420001138.el7ost
python-novajoin.noarch                   1.0.18-2.el7ost                          rhelosp-13.0-puddle
python2-ironic-python-agent.noarch

Re-assigning this one to DFG:Compute. Besides the race / wrong handling of the vm_state, it would be nice to get a way to recover from the error state.

Thanks,
From comment 14, it looks like the service ordering issue is between ironic conductor and ironic api, so nova is the victim here. Guessing the simplest solution (for GA, at least) would be to ensure conductor comes up first, which sounds like a PIDONE thing? I'll let you make that call, though. Please let me know asap if compute can help.
Hi all,

We did some brainstorming on IRC, and this is what we came up with:

1. On nova-compute start, it tries to attach VIFs to ACTIVE instances. This behavior has been around since really long ago, and it seems correct.

2. A VIF attach call for Ironic requires a conductor to process it. With no ironic-conductor instances up we cannot handle it. The error message is a bit unfortunate but in fact correct: actions for the "ipmi" driver cannot be handled by a live conductor since we don't have any live conductors at all. Again, this is the correct behavior given the Ironic architecture.

3. It seems like we've had this issue for a long time; I'm not entirely sure why we've never seen it. Is it possible that something has changed in upgrade/update ordering?

4. In any case, we have two options:
4.1. Make Nova somehow retry VIF attachment. This is probably too invasive and also probably incorrect for VMs.
4.2. Make sure we never start a nova-compute instance when no ironic-conductor instances are up (this also applies to the overcloud). A rough operational sketch of such a check is below.

I think the last option is the most viable for an urgent fix. Can someone please provide their insights?
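As a purely illustrative sketch of option 4.2 (not part of any proposed patch): one could gate a nova-compute (re)start on `openstack baremetal driver list` reporting an active conductor for the ipmi driver. The loop bounds, sleep interval and grep pattern here are assumptions for illustration only:

# Hypothetical guard before (re)starting nova-compute on the undercloud:
# wait until at least one ironic-conductor advertises the ipmi driver.
source ~/stackrc

for i in $(seq 1 30); do
    # the driver only shows up once a conductor supporting it is registered
    if openstack baremetal driver list -f value | grep -q '^ipmi'; then
        echo "ipmi driver has an active conductor, safe to start nova-compute"
        break
    fi
    echo "waiting for ironic-conductor to register the ipmi driver..."
    sleep 10
done

sudo systemctl restart openstack-nova-compute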
(In reply to Sofer Athlan-Guyot from comment #16)
> Re-Assigning this one to DFG:Compute. Beside the race/wrong handling
> of the vm_state, it would be nice to get a way to recover from the
> error state.

nova reset-state?
(In reply to Jiri Stransky from comment #15)
> I think the hypothesis with race in service startup is correct. I'm able to
> reproduce the 'ERROR' state of overcloud in `openstack server list` by
> simply doing:
>
> sudo systemctl stop openstack-ironic-conductor
> sudo systemctl restart openstack-nova-compute
>
> After this ^ nodes go to 'ERROR'. Tried to recover from it this way:
>
> sudo systemctl start openstack-ironic-conductor
> sudo systemctl restart openstack-nova-compute
>
> but the nodes still stay in 'ERROR'.

Hard reboot is intended to resolve this in the general case. However, in this specific case a reset-state would probably be more appropriate, as no action should actually be required on an ironic node.
Hello,

We can reset the node state with:

(undercloud) [stack@undercloud ~]$ openstack server list
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
| ID                                   | Name                    | Status | Networks               | Image          | Flavor       |
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
| a8c447c1-7f47-42be-807a-c475dc76ea89 | overcloud-controller-0  | ERROR  | ctlplane=192.168.24.6  | overcloud-full | oooq_control |
| 8ef23886-298e-451b-8094-f10c6c6c02fb | overcloud-novacompute-0 | ERROR  | ctlplane=192.168.24.11 | overcloud-full | oooq_compute |
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+

(undercloud) [stack@undercloud ~]$ openstack server set --state active a8c447c1-7f47-42be-807a-c475dc76ea89

(undercloud) [stack@undercloud ~]$ openstack server list
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
| ID                                   | Name                    | Status | Networks               | Image          | Flavor       |
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
| a8c447c1-7f47-42be-807a-c475dc76ea89 | overcloud-controller-0  | ACTIVE | ctlplane=192.168.24.6  | overcloud-full | oooq_control |
| 8ef23886-298e-451b-8094-f10c6c6c02fb | overcloud-novacompute-0 | ERROR  | ctlplane=192.168.24.11 | overcloud-full | oooq_compute |
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+

In my case the nodes always worked, I just had to manually set the state to ACTIVE again.
The `openstack server set --state active` workaround seems to work for me (at least as far as the CLI status output is concerned, not sure if something still stays broken under the hood). However `nova reset-state` brings the node to ERROR again. And i tested on environment where the nodes never went to ERROR previously, and `nova reset-state` moves them to ERROR anyway...
After more discussion this isn't a PIDONE thing, as that's not relevant to the undercloud. We're going to tackle this for GA by ensuring that if nova-compute and ironic-conductor are both installed and running, then ironic-conductor starts first. We're going to do this in 2 ways:

1. Adding a systemd Before= dependency on nova-compute to ironic-conductor (see the sketch below).
2. Adding corresponding dependencies to the puppet modules in instack-undercloud to ensure the same happens during upgrade.
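For illustration only, a drop-in along these lines would express item 1; the file name, and whether the Before= ends up in the packaged unit or a drop-in, are assumptions here rather than the actual patch:

# Hypothetical drop-in for the ironic-conductor unit on the undercloud
# (e.g. via: sudo systemctl edit openstack-ironic-conductor)
#
# /etc/systemd/system/openstack-ironic-conductor.service.d/10-before-nova-compute.conf
[Unit]
# Start ironic-conductor before nova-compute so the ironic virt driver
# finds a registered conductor when it plugs VIFs on startup.
Before=openstack-nova-compute.service

# then reload and restart in that order:
#   sudo systemctl daemon-reload
#   sudo systemctl restart openstack-ironic-conductor openstack-nova-compute

Note this only orders unit *startup*; as discussed further down, systemd readiness does not guarantee the conductor has finished initialising.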
Hi,

so:

for i in controller-{0,1,2} compute-0 ; do
  openstack server set --state active $i ;
done

is working as expected (a variant that only touches nodes currently in ERROR is sketched below).

Message posted after bz collision ...

> 1. Adding a systemd Before= dependency on nova-compute to ironic-conductor
> 2. Adding corresponding dependencies to the puppet modules in instack-undercloud to ensure the same happens during upgrade.

So I have a review (https://review.openstack.org/576165) for 2, undercloud ordering, but that still leaves open:
a. containerized undercloud ordering;
b. overcloud container ordering (as I understand it, it may be an issue there as well, right?)

I think we should tackle a. in an upstream launchpad bug only, and b. in its own bz?
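A small variation on the loop above, for convenience; it assumes only the Nova-side state needs resetting (as discussed in the earlier comments) and that stackrc is the undercloud rc file:

# Reset state only for nodes the undercloud Nova currently shows in ERROR.
source ~/stackrc

for id in $(openstack server list --status ERROR -f value -c ID); do
    echo "resetting state of $id to active"
    openstack server set --state active "$id"
done

openstack server list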
(In reply to Sofer Athlan-Guyot from comment #26)
> So I have a review (https://review.openstack.org/576165) for 2, undercloud
> ordering, but that still leaves open:
> a. containerized undercloud ordering;
> b. overcloud container ordering (as I understand it, it may be an issue there as well, right?)
>
> I think we should tackle a. in an upstream launchpad bug only, and b. in its own bz?

I already created bug 1592466 for that. We need a separate bz to land it anyway, as it's in a different component. I've moved your gerrit review over there and assigned you to it.
I tested the proposed systemd change on an OSP13 undercloud, and although nova-compute does indeed start after ironic-conductor, I only observed a delay of approximately 1 second. I could also see from the logs that nova-compute was coming up before ironic-conductor initialisation had completed.

For the sake of posterity here are my notes on the issue:

In the systemd config for ironic-conductor we have added:

Before=openstack-nova-compute.service

Ironic conductor does:

mgr = rpc_service.RPCService(CONF.host, 'ironic.conductor.manager', 'ConductorManager')
# Part 1 below
launcher = service.launch(CONF, mgr, restart_method='mutate')
# Part 2 below
launcher.wait()

# Part 1

service.launch() does ServiceLauncher.launch_service(cond manager)
ServiceLauncher.launch_service() does self.services.add(cond manager)
Services.add() does threadgroup.add_thread(run_service, cond manager, done event)

run_service is executed in a separate thread which does:

cond manager.start()
done.wait()

Ironic RPCService.start does:

...
self.manager.init_host(admin_context)
...

So... the thread calls start() on the service then immediately waits for the event to signal that it should exit. Note that there was no systemd call in the thread.

# Part 2

ServiceLauncher.wait() does:

systemd.notify_once()
...wait for all service threads to exit...

So... notice that we called systemd.notify_once() immediately without waiting for the completion of service.start().
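A stripped-down toy (stdlib only, not oslo.service code) of why Before= ordering alone doesn't close the window: readiness is signalled before init_host finishes, so anything ordered after the conductor can start inside that gap. The sleep duration and names here are purely illustrative:

import threading
import time

init_done = threading.Event()

def conductor_service():
    # stands in for RPCService.start() -> manager.init_host(): driver loading,
    # hash ring setup, conductor DB registration, etc.
    time.sleep(3)
    init_done.set()

# the launcher runs the service in a thread group...
threading.Thread(target=conductor_service, daemon=True).start()

# ...and then the launcher's wait() notifies systemd immediately,
# without waiting for the service thread to finish its init.
print("READY=1 sent to systemd; init finished:", init_done.is_set())

# anything ordered After=openstack-ironic-conductor.service (e.g. nova-compute)
# may start right here, while the conductor is still not registered.
init_done.wait()
print("init finished:", init_done.is_set())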
We're currently investigating hacking something into nova's ironic driver instead. This would be a downstream-only change intended to get us to GA, with a clear goal to remove it in favour of a proper fix in z1. The proper fix would be either of (but preferably both):

bug 1592427
bug 1592432
To add to Matthew's comment: we have added a retry loop in the nova-compute code, but both reviews (ordering restart and retry loop) would be needed to make sure that the retry loop doesn't fail too often.
Adding the gerrit early downport of the ordering restart in case it's needed (not yet merged in master).
I don't think the systemd service ordering or notification changes will solve the problem. I also don't think a static hack of sleep(10) before starting nova-compute is reliable enough to be considered a serious fix.

I've proposed an upstream retry loop in plug_vifs() which can differentiate a normal failure from the race-at-service-startup case, and only retry (with a delay) in that situation (a rough sketch of the pattern is below). The downstream pre-backport is proposed here:

https://code.engineering.redhat.com/gerrit/#/c/141934

It would be good to get ironic people to validate that approach before we move ahead, but my suggestion would be to go with that slightly more invasive code fix because I think it will pay off in reliability.
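To make the idea concrete, a minimal, hypothetical sketch of the retry pattern being described; this is not the actual nova patch, and the function names, exception type and retry/delay values are assumptions for illustration only:

import time

class VifPlugException(Exception):
    """Stand-in for the driver's VIF-plug failure (hypothetical)."""

# Substring that identifies the startup race (no ironic-conductor registered yet),
# as seen in the logs above; anything else is treated as a genuine failure.
STARTUP_RACE_MARKER = "No conductor service registered"

def plug_vif_with_retry(attach_fn, vif_id, retries=6, delay=10):
    """Call attach_fn(vif_id), retrying only when the failure looks like the
    conductor-not-yet-registered startup race."""
    for attempt in range(1, retries + 1):
        try:
            return attach_fn(vif_id)
        except VifPlugException as exc:
            if STARTUP_RACE_MARKER not in str(exc) or attempt == retries:
                # normal failure (or out of retries): propagate immediately
                raise
            print("conductor not registered yet, retry %d/%d in %ds" % (attempt, retries, delay))
            time.sleep(delay)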
Hi Dan, FYI,

Please use the cloned BZs for tracking this issue/fix on the different components involved:

openstack-nova: https://bugzilla.redhat.com/show_bug.cgi?id=1592427
openstack-ironic: https://bugzilla.redhat.com/show_bug.cgi?id=1592432

We are using this BZ for merging and making available the fix for the component instack-undercloud. The fix we will track in this BZ will be https://review.openstack.org/#/c/576165/ with the corresponding backport/package build.

I know it's a little messier to use a different BZ for each fix in each component, but this way each PM should be able to prioritize the work involved for each DFG.

Thanks!
*** Bug 1593909 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086