Description of problem:

In several virtual setups, I've been using VMs for the undercloud and overcloud nodes. The ironic nodes use pxe_ssh with an SSH key. This worked fine on OSP8/9/10 until recently (starting around March 9th). As of a few days ago, I have not been able to get a successful deploy on OSP10.

The issue I am running into:
- Ironic nodes introspect fine. I can turn them on or off using ironic node-set-power-state.
- When I attempt a deploy, nova instances get scheduled but the nodes never get powered on for the deploy.

This used to work fine from OSP10 GA until a few days ago. I restored the OSP8 snapshot of my undercloud and was able to deploy without trouble, so the problem seems limited to OSP10 and possibly to recent updates.
The ironic nodes are defined as:

ironic node-create -n ${IRONIC_NODE} \
    -d pxe_ssh \
    -i ssh_address=${HVM_HOST_IP} \
    -i ssh_username=${HVM_USER} \
    -i ssh_virt_type=vbox \
    -i ssh_key_contents="$(cat ${IRONIC_KEY})"

ironic node-update ${IRONIC_UUID} add \
    properties/cpus=${vm_slave_cpu_default} \
    properties/memory_mb=${vm_slave_memory_default} \
    properties/local_gb=62 \
    properties/cpu_arch=x86_64 \
    driver_info/vbox_use_headless=true

And the ironic port is defined as:

ironic port-create -n ${IRONIC_UUID} -a ${IRONIC_MAC}

(Ref: https://www.rdoproject.org/install/tripleo-cli)
The deploy command is (OSP8/9/10):

time openstack overcloud deploy \
    --templates ${TOP_DIR}/templates \
    --control-scale 1 \
    --compute-scale 1 \
    --ceph-storage-scale 1 \
    --swift-storage-scale 0 \
    --control-flavor control \
    --compute-flavor compute \
    --ceph-storage-flavor ceph-storage \
    --swift-storage-flavor swift-storage \
    --ntp-server '10.20.0.1", "10.20.0.2' \
    --validation-errors-fatal \
    -e ${TOP_DIR}/templates/environments/network-isolation.yaml \
    -e ${TOP_DIR}/templates/environments/storage-environment.yaml \
    -e ${TOP_DIR}/net-bond-with-vlans-with-nic4.yaml \
    -e ${TOP_DIR}/rhel-registration-environment.yaml \
    -e ${TOP_DIR}/storage-environment.yaml \
    -e ${TOP_DIR}/krynn-environment.yaml \
    -e ${TOP_DIR}/extraconfig-environment.yaml \
    -e ${TOP_DIR}/enable-tls.yaml \
    -e ${TOP_DIR}/inject-trust-anchor.yaml \
    -e ${TOP_DIR}/local-environment.yaml
On OSP8, the nodes look like this:

+--------------------------------------+-----------------------------------------+
| Node UUID                            | Node Capabilities                       |
+--------------------------------------+-----------------------------------------+
| 4a375e67-0e97-4125-93b7-e28bd2e53cea | profile:control,boot_option:local       |
| 9929d709-268c-4a9a-8d86-66c7ff1c9230 | profile:control,boot_option:local       |
| 8abd67d0-db10-4954-9069-9a0055517f3b | profile:control,boot_option:local       |
| 3f69e60b-6c0f-4924-8aa5-09b3cd46a2f7 | profile:ceph-storage,boot_option:local  |
| 5dfa06f8-3e10-4ca5-9521-7b256f8dac68 | profile:ceph-storage,boot_option:local  |
| 9a41846c-9e81-41a2-8aad-3ab8c5f0195b | profile:ceph-storage,boot_option:local  |
| 9bbfea6b-171a-43db-ab29-3d2c5ac3360b | profile:swift-storage,boot_option:local |
| fa22a008-4472-4710-85d2-4bff5d8396cb | profile:swift-storage,boot_option:local |
| 7ba4044e-d266-4fcc-9ba5-51dc9285967f | profile:swift-storage,boot_option:local |
| bd563d93-decb-4b99-960a-5a2265a5f29d | profile:compute,boot_option:local       |
| 0dd30949-bc0d-4b1e-8f13-40992eb161f6 | profile:compute,boot_option:local       |
| 1a1cffde-c135-4951-a4b9-9070243950ec | profile:compute,boot_option:local       |
| e538092c-39ea-4339-8548-f46c9d6ff089 | profile:compute,boot_option:local       |
| 32ff444b-e920-4174-9e17-187f41b0036e | profile:compute,boot_option:local       |
| e06ba4f5-8c10-4afd-9177-9c872df9c22e | profile:compute,boot_option:local       |
| c813fbfa-09ca-446b-8c07-aa0c3c4189a4 | profile:compute,boot_option:local       |
+--------------------------------------+-----------------------------------------+

On OSP10, I am using the same type of node tagging, which can be seen using "openstack overcloud profiles list".
On OSP8, nodes spawn successfully:

[stack@instack ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks             |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| 2b073652-e7cc-4ab4-995b-d1ea67c56619 | krynn-ceph-0 | ACTIVE | -          | Running     | ctlplane=10.20.0.108 |
| 911d08b3-3da0-404d-b6ca-2cad856c565d | krynn-cmpt-0 | ACTIVE | -          | Running     | ctlplane=10.20.0.107 |
| 42d36d92-81c8-467e-8884-d266a4ef89ed | krynn-ctrl-0 | ACTIVE | -          | Running     | ctlplane=10.20.0.104 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+

On OSP10, the nodes stay in 'SPAWNING' until the deploy times out; nova instances get scheduled but the nodes never get powered on for the deploy:

[stack@instack ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks             |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| 23c40e7a-688c-440d-8b4c-9a91fe989e22 | krynn-cmpt-0 | BUILD  | spawning   | NOSTATE     | ctlplane=10.20.0.104 |
| 7ae97372-8c79-4905-a7cd-cc8bea993dc8 | krynn-ctrl-0 | BUILD  | spawning   | NOSTATE     | ctlplane=10.20.0.111 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+

[stack@instack ~]$ ironic node-list
+--------------------------------------+------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name             | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------+--------------------------------------+-------------+--------------------+-------------+
[....]
| 57b7a089-d3b6-44c9-8c1d-77eba0b1d011 | osp-baremetal-2  | 7ae97372-8c79-4905-a7cd-cc8bea993dc8 | power off   | deploying          | False       |
[....]
| aa04082a-7a58-4b81-bbf9-b8cab8891b9d | osp-baremetal-11 | 23c40e7a-688c-440d-8b4c-9a91fe989e22 | power off   | deploying          | False       |
[....]
+--------------------------------------+------------------+--------------------------------------+-------------+--------------------+-------------+
On OSP10, 'openstack overcloud deploy' stays stuck at:

[....]
2017-03-16 16:20:05Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2017-03-16 16:20:06Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed

In nova-compute.log, I see this message scrolling:

2017-03-16 12:26:57.164 1759 DEBUG nova.virt.ironic.driver [-] [instance: 23c40e7a-688c-440d-8b4c-9a91fe989e22] Still waiting for ironic node aa04082a-7a58-4b81-bbf9-b8cab8891b9d to become ACTIVE: power_state="power on", target_power_state=None, provision_state="deploying", target_provision_state="active" _log_ironic_polling /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:120

and in ironic-api.log:

2017-03-16 12:20:13.152 12872 INFO eventlet.wsgi.server [req-cba6bc11-75ca-4583-9cea-e677dd13015d ironic service - - -] 10.20.0.2 "GET /v1/nodes/?instance_uuid=23c40e7a-688c-440d-8b4c-9a91fe989e22&fields=uuid,power_state,target_power_state,provision_state,target_provision_state,last_error,maintenance,properties,instance_uuid HTTP/1.1" status: 200 len: 1007 time: 0.0328538
Please take note that all of the following ironic operations work fine in that OSP10 environment:

[stack@instack ~]$ ironic node-set-boot-device osp-baremetal-1 pxe
[stack@instack ~]$ ironic node-set-power-state osp-baremetal-1 on
[stack@instack ~]$ ironic node-set-power-state osp-baremetal-1 off
[stack@instack ~]$ ironic node-set-boot-device osp-baremetal-1 disk
[stack@instack ~]$ ironic node-set-power-state osp-baremetal-1 on
[stack@instack ~]$ openstack baremetal introspection start osp-baremetal-1
[stack@instack ~]$ ironic node-list
+--------------------------------------+------------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name             | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------+---------------+-------------+--------------------+-------------+
| c34ec3ff-fc32-410a-934d-b59c7b281bff | osp-baremetal-1  | None          | power off   | available          | False       |
| 9bc16019-e06e-48a2-bd0d-339bc142c1b3 | osp-baremetal-2  | None          | power off   | available          | False       |
| eb416481-f60a-47fe-a081-2a94b8db621b | osp-baremetal-3  | None          | power off   | available          | False       |
| ddbd07f2-a9ee-44b6-b96c-9804365c3da6 | osp-baremetal-4  | None          | power off   | available          | False       |
| 5a6f6568-a4ad-4b9d-84e6-8ac01deb7f03 | osp-baremetal-5  | None          | power off   | available          | False       |
| f0c5fd84-6c0e-499c-af3c-6d8cdcdb4e64 | osp-baremetal-6  | None          | power off   | available          | False       |
| 4dcb81d1-fc5f-407d-b081-d79dc2e3df3d | osp-baremetal-7  | None          | power off   | available          | False       |
| 3edab1d3-e5e1-431a-8a79-74405f02e614 | osp-baremetal-8  | None          | power off   | available          | False       |
| 11c00cb9-c467-486c-bacb-a8aeb84225ae | osp-baremetal-9  | None          | power off   | available          | False       |
| 7a99deb8-927a-4113-b564-f84de3e68b1b | osp-baremetal-10 | None          | power off   | available          | False       |
| ebbd5862-a404-4488-b210-df755886cb7c | osp-baremetal-11 | None          | power off   | available          | False       |
| f3d39ac9-bc5d-4ef3-974b-d74b06317aa8 | osp-baremetal-12 | None          | power off   | available          | False       |
| 2ab58646-8999-44ff-92d6-45aa0ef59e93 | osp-baremetal-13 | None          | power off   | available          | False       |
| 491e7821-74f0-4ed9-b19a-2a6bb894add1 | osp-baremetal-14 | None          | power off   | available          | False       |
| b36046a3-3beb-4dba-a556-5121fddad739 | osp-baremetal-15 | None          | power off   | available          | False       |
| fab2ca7f-f124-4b65-9704-8c7aa0bb7b31 | osp-baremetal-16 | None          | power off   | available          | False       |
+--------------------------------------+------------------+---------------+-------------+--------------------+-------------+
Deploy started with:

[stack@instack ~]$ ./OSP/osp10/bin/deploy17_micro.sh
+ openstack overcloud deploy --templates --control-scale 1 --compute-scale 1 --ceph-storage-scale 1 --swift-storage-scale 0 --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph-storage --swift-storage-flavor swift-storage --ntp-server '10.20.0.1", "10.20.0.2' --validation-errors-fatal -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /home/stack/OSP/osp10/net-bond-with-vlans-with-nic4.yaml -e /home/stack/OSP/osp10/rhel-registration-environment.yaml -e /home/stack/OSP/osp10/storage-environment.yaml -e /home/stack/OSP/osp10/krynn-environment.yaml -e /home/stack/OSP/osp10/extraconfig-environment.yaml -e /home/stack/OSP/osp10/enable-tls.yaml -e /home/stack/OSP/osp10/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /home/stack/OSP/osp10/local-environment.yaml
2 nodes with profile control won't be used for deployment now
6 nodes with profile compute won't be used for deployment now
2 nodes with profile ceph-storage won't be used for deployment now
Configuration has 3 warnings, fix them before proceeding.
Removing the current plan files
After a little while, I get this:

[....]
2017-03-16 20:41:41Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2017-03-16 20:41:42Z [overcloud.CephStorage.0.UserData]: CREATE_COMPLETE state changed
2017-03-16 20:41:42Z [overcloud.CephStorage.0.CephStorage]: CREATE_IN_PROGRESS state changed

And in nova:

[stack@instack ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks             |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+
| 9d746160-11dd-432b-99de-5ef8b05391da | krynn-ceph-0 | BUILD  | spawning   | NOSTATE     | ctlplane=10.20.0.107 |
| 1d119d19-eacb-4b09-ab90-80428dfd7a64 | krynn-cmpt-0 | BUILD  | spawning   | NOSTATE     | ctlplane=10.20.0.109 |
| b0edc22a-da3f-4beb-bcb1-02e1fb4dfd7e | krynn-ctrl-0 | BUILD  | spawning   | NOSTATE     | ctlplane=10.20.0.105 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------+

but these nodes never attempt to power on:

[stack@instack ~]$ ironic node-list | grep -v None
+--------------------------------------+------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name             | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------+--------------------------------------+-------------+--------------------+-------------+
| 9bc16019-e06e-48a2-bd0d-339bc142c1b3 | osp-baremetal-2  | b0edc22a-da3f-4beb-bcb1-02e1fb4dfd7e | power off   | deploying          | False       |
| ddbd07f2-a9ee-44b6-b96c-9804365c3da6 | osp-baremetal-4  | 9d746160-11dd-432b-99de-5ef8b05391da | power off   | deploying          | False       |
| fab2ca7f-f124-4b65-9704-8c7aa0bb7b31 | osp-baremetal-16 | 1d119d19-eacb-4b09-ab90-80428dfd7a64 | power off   | deploying          | False       |
+--------------------------------------+------------------+--------------------------------------+-------------+--------------------+-------------+
If we take the ironic node osp-baremetal-16 (fab2ca7f-f124-4b65-9704-8c7aa0bb7b31), there are messages about it in /var/log/nova/nova-scheduler.log. First we have:

2017-03-16 16:41:41.736 30962 DEBUG oslo_concurrency.lockutils [req-c2af360a-96f3-4ab6-b737-bd6bc73974d0 a1012a438d54465a8299389609938b35 c03f66100ae044c8b205585c01fae8bf - - -] Lock "(u'instack.lasthome.solace.krynn', u'fab2ca7f-f124-4b65-9704-8c7aa0bb7b31')" acquired by "nova.scheduler.host_manager._locked_update" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270

2017-03-16 16:41:41.736 30962 DEBUG nova.scheduler.host_manager [req-c2af360a-96f3-4ab6-b737-bd6bc73974d0 a1012a438d54465a8299389609938b35 c03f66100ae044c8b205585c01fae8bf - - -] Update host state from compute node: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='',created_at=2017-03-16T19:33:04Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=62,free_disk_gb=62,free_ram_mb=8256,host='instack.lasthome.solace.krynn',host_ip=10.0.128.169,hypervisor_hostname='fab2ca7f-f124-4b65-9704-8c7aa0bb7b31',hypervisor_type='ironic',hypervisor_version=1,id=62,local_gb=62,local_gb_used=0,memory_mb=8256,memory_mb_used=0,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.0,running_vms=0,service_id=None,stats={boot_option='local',cpu_arch='x86_64',profile='compute'},supported_hv_specs=[HVSpec],updated_at=2017-03-16T20:41:04Z,uuid=9bd7f87c-66d7-4835-847f-7a27f0ec979a,vcpus=4,vcpus_used=0) _locked_update /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:168

This repeats a few times, and then we have:

2017-03-16 16:41:43.315 30962 DEBUG nova.scheduler.filters.compute_capabilities_filter [req-c27cd857-5778-4487-bade-280cf153e9b4 a1012a438d54465a8299389609938b35 c03f66100ae044c8b205585c01fae8bf - - -] (instack.lasthome.solace.krynn, fab2ca7f-f124-4b65-9704-8c7aa0bb7b31) ram: 4160MB disk: 22528MB io_ops: 0 instances: 0 fails extra_spec requirements. 'control' does not match 'compute' _satisfies_extra_specs /usr/lib/python2.7/site-packages/nova/scheduler/filters/compute_capabilities_filter.py:103

2017-03-16 16:41:43.316 30962 DEBUG nova.scheduler.filters.compute_capabilities_filter [req-c27cd857-5778-4487-bade-280cf153e9b4 a1012a438d54465a8299389609938b35 c03f66100ae044c8b205585c01fae8bf - - -] (instack.lasthome.solace.krynn, fab2ca7f-f124-4b65-9704-8c7aa0bb7b31) ram: 4160MB disk: 22528MB io_ops: 0 instances: 0 fails instance_type extra_specs requirements host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/compute_capabilities_filter.py:112

(Note: these filter rejections appear to show a compute-profile node being evaluated against the control flavor, so they may just be expected scheduler noise rather than the root cause.)
I tried parsing the logs for 'ERROR' but there aren't any relevant errors, except in nova while the undercloud was booting. I'll attach the logs shortly.
There seems to be some strange stuff going on. I just checked ironic-conductor.log and noticed this (base64 blobs trimmed):

[random blob....]
bsz6syET1fOUV5rRZxsS5fOLYzEUms/FPR5M6cFL3Io85L13XGbjqV4r3B2dFefkZl+rOaZ+qKgz3JFBwX9djL3AM59czTGB1okkZjKDIujyqXeiab88r07FwsDFLhMRqnYildK/m2kOJkL66W5c6iGu3X3KswNYXF/8NcY9uH5cxeec6VTifmED7ePzqP5
[....]
AAAAPwN+m/gxomVADALAA==", "swap_mb": "0"}', 61)]
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters     context)
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters     result = self._query(query)
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters     conn.query(q)
[....]
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 112, in _check_mysql_exception
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters     raise errorclass(errno, errorvalue)
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters DataError: (1406, u"Data too long for column 'instance_info' at row 1")
2017-03-17 14:32:05.886 18764 ERROR oslo_db.sqlalchemy.exc_filters
2017-03-17 14:32:05.912 18764 ERROR oslo_db.sqlalchemy.exc_filters [req-eb000c3c-2ac0-41ec-914d-3e8aae9a2417 - - - - -] DBAPIError exception wrapped from (pymysql.err.DataError) (1406, u"Data too long for column 'last_error' at row 1") [SQL: u'UPDATE nodes SET updated_at=%s, provision_state=%s, provision_updated_at=%s, last_error=%s, instance_info=%s WHERE nodes.id = %s'] [parameters: (datetime.datetime(2017, 3, 17, 18, 32, 5, 902889), u'deploy failed', datetime.datetime(2017, 3, 17, 18, 32, 5, 902312), u'Async execution of do_node_deploy failed with error: (pymysql.err.DataError) (1406, u"Data too long for column \'instance_info\' at row 1") [SQL: u\'UPDATE nodes SET updated_at=%s, provision_state=%s, provision_updated_at=%s, instance_info=%s WHERE nodes.id = %s\'] [parameters: (datetime.datetime(2017, 3, 17, 18, 32, 5, 881427), u\'deploy failed\', datetime.datetime(2017, 3, 17, 18, 32, 5, 880807), \'{"root_gb": "40", "display_name": "krynn-cmpt-0", "image_source": "92b54d1e-99f5-4ea3-9cc9-62d496d4ff0e", "capabilities": "{\\\\"profile\\\\": \\\\"compute\\\\", \\\\"boot_option\\\\": \\\\"local\\\\"}", "memory_mb": "4096", "vcpus": "1", "local_gb": "127", "configdrive": "H4sICJIrzFgC/3RtcHNmNWNFNwDsvemSG8mWHgje29PdGd3WM9N9RzYma3V7gaybmSQCW+4s4TZzQTJB5sJKZDJJFnlhAUQAiEwgAowIAAmysk39Q+qW2Zj+ankDLW8gqWWmB5CeRDb6qX8z5xz3 [random blob....]

There seems to be some kind of binary blob (the base64-encoded config drive) being passed in the parameters, hence the failure.
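A side observation on the two DataErrors above (my reading of the log, not verified against the ironic code): the first UPDATE fails because the serialized instance_info exceeds the TEXT column, and the conductor then tries to record that exception text, which embeds the failed statement's oversized parameters verbatim, in last_error (itself a TEXT column), so that second write overflows as well. A minimal, self-contained Python sketch with made-up sizes:

```python
# Illustrative only: the sizes here are invented; 65535 bytes is the
# MySQL TEXT column limit that instance_info and last_error both share.
TEXT_MAX = 65535

# An oversized serialized instance_info value (e.g. a big config drive).
instance_info = '{"configdrive": "' + 'A' * 70000 + '"}'

# The conductor's failure message embeds the failed UPDATE's parameters,
# i.e. the oversized instance_info value, verbatim.
last_error = (
    'Async execution of do_node_deploy failed with error: '
    '(pymysql.err.DataError) (1406, "Data too long for column '
    "'instance_info' at row 1\") [parameters: (" + instance_info + ',)]'
)

print(len(instance_info) > TEXT_MAX)  # True: the first UPDATE fails
print(len(last_error) > TEXT_MAX)     # True: recording the error fails too
```

This would explain why the second error complains about last_error rather than instance_info.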
This was subsequently tracked down to this issue:

Data too long for column 'instance_info' at row 1

With multiple firstboot scripts, the config drive data grew larger than the instance_info column in the ironic DB could hold. The solution was to remove some scripts from firstboot, after which deployments started working again.

This strikes me as a technical issue for two reasons:
1) A config drive is supposed to be able to grow as large as 64 MB, but here the 'instance_info' column in the ironic DB limits it to a much smaller size.
2) The fact that instance_info has become too large for the DB column should be propagated back to the user. It really wasn't obvious from the symptoms or from the message in ironic-conductor.
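To put rough numbers on the size mismatch, here is a small Python sketch (the payload is simulated, not taken from this deployment; 65535 bytes is MySQL's TEXT column maximum). The ironic-conductor log above shows the config drive stored gzipped and base64-encoded inside instance_info ("configdrive": "H4sIC..."), so the sketch mimics that encoding:

```python
import base64
import gzip
import os

TEXT_MAX = 65535                     # max bytes in a MySQL TEXT column
CONFIGDRIVE_MAX = 64 * 1024 * 1024   # config drive ceiling (64 MB)

# Simulate a config drive carrying a few firstboot scripts: ~100 KiB of
# effectively incompressible content, so gzip gains almost nothing.
raw = os.urandom(100 * 1024)

# Encode the way it appears inside instance_info: gzip, then base64.
stored = base64.b64encode(gzip.compress(raw))

print(len(stored) > TEXT_MAX)         # True: overflows the TEXT column
print(len(stored) < CONFIGDRIVE_MAX)  # True: far below the 64 MB limit
```

So a config drive of barely 100 KiB already exceeds what a TEXT column can store, even though it is nowhere near the 64 MB config drive limit.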
Others seem to have run into this problem and attempted to improve the situation. Here are two (unfortunately abandoned) patches:

https://review.openstack.org/#/c/369617
https://review.openstack.org/#/c/334967
Upstream patch in progress - https://review.openstack.org/#/c/501761/
The upstream Launchpad trail is a bit convoluted: the upstream bug opened for this was https://bugs.launchpad.net/ironic/+bug/1673887, which was closed as a duplicate of https://bugs.launchpad.net/ironic/+bug/1596421, which was in turn closed as Invalid in favor of https://bugs.launchpad.net/ironic/+bug/1575935. Updated the referenced Launchpad bug.
The Nova part is https://blueprints.launchpad.net/openstack/?searchtext=rebuild-ironic-config-drive; we may need an RFE for them.
The fix has not landed. We have updated Nova to pass the configdrive on rebuild, but we haven't stopped storing the configdrive in the database. I'll check upstream what our next steps are.
We have another customer facing this issue in LLR OSP10. Is there a plan to backport the fix to OSP10, since this is a regression? My customer is not willing to upgrade to OSP13 in the short term, and living with this issue in OSP10 is not an option.
> We have another customer facing this issue in LLR OSP10. Is there a plan to backport
> the fix in OSP10 since it is a regression issue? My customer is not willing to upgrade
> to OSP13 in a short term and living with this issue in OSP10 is not an option.

Salim, we currently do not have a fix for this upstream; see https://bugs.launchpad.net/ironic/+bug/1596421

The potential fix is under discussion; once it lands, we'll consider the backport. It is likely this fix can be backported, but we need to review once the upstream fix lands.

Changing the priority of this to High.
Unfortunately, we cannot come up with a solution that can be backported. You may have to change the "instance_info" column of the "nodes" table from TEXT to LONGTEXT manually.
Pierre - this package will be in the OSP-13 beta release on May 9th. Is that OK?
(In reply to Dmitry Tantsur from comment #25)
> Unfortunately, we cannot come up with a solution that can be backported. You
> may have to change the "instance_info" column of the "nodes" table from TEXT
> to LONGTEXT manually.

Hi Dmitry,

As OSP10 deployments become more and more complex, what is the procedure for changing instance_info from TEXT to LONGTEXT on existing underclouds? I know I can go into the DB and change it directly, but is there a local override in the undercloud configuration that would handle this? Is there something better than:

mysql -e 'alter table ironic.nodes modify instance_info longtext;'

and restarting openstack-ironic\* ?
Verified.

Environment:
puppet-ironic-12.4.0-0.20180329034302.8285d85.el7ost.noarch
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
python-ironic-inspector-client-3.1.1-1.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
python2-ironicclient-2.2.0-1.el7ost.noarch
instack-undercloud-8.4.1-4.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
python-ironic-lib-2.12.1-1.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch

Checked the type of the instance_info field in the nodes table of the ironic DB: it is now 'longtext'. On previous releases it was 'text'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086