Bug 1430758
| Summary: | Node Deployment failure due to root_device hints using WWN | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | openstack-ironic-python-agent | Assignee: | Dmitry Tantsur <dtantsur> |
| Status: | CLOSED ERRATA | QA Contact: | mlammon |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 10.0 (Newton) | CC: | bengland, bfournie, dtantsur, jkachuck, kambiz, lmartins, mburns, mhalas, mknutson, mkovacik, racedoro, rhel-osp-director-maint, slinaber, smalleni, srevivo |
| Target Milestone: | z9 | Keywords: | Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | scale_lab | | |
| Fixed In Version: | openstack-ironic-python-agent-1.5.2-6.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-17 16:59:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sai Sindhur Malleni, 2017-03-09 14:27:23 UTC
Raised priority. This is really important for scale lab testing of OpenStack-Ceph at scale, which is a stated DFG goal for Ocata, right? Sai, why is the WWN hint needed? You told me once; I don't see it here and can't remember it. Are the system disks the same size as the OSD disks? If Ironic metadata cleaning is used, there should be no problem with booting off the wrong disk, right?

Ben, yes, as you may already know, in cases with multiple disks, and especially in the situation we have in the scale lab where device naming is inconsistent, we need to specify the root disk via WWN.

Hi Sai, could you share the introspection data? https://docs.openstack.org/developer/tripleo-docs/advanced_deployment/introspection_data.html

Milan, could you help Sai with this one?

Ramon, the upstream tracker seems to be a legitimate bug, and Lucas seems to be working on the IPA patch there; if not, we can backlog it.

Sai, could you please confirm that https://review.openstack.org/#/c/443649/ fixes the issue for you? Lucas has left our team, so I need to take it over.

Dmitry, we rebuilt the ramdisk image with the patch, but that did not help. This still seems to exist in RHOS 10 (Newton).
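For context, the workflow being discussed takes a node's introspection data and turns one stable field (`wwn` or `serial`) into a `root_device` hint, precisely because `/dev/sd*` names are not stable across boots. A minimal sketch, using the disk data from this report; the helper name and selection logic are illustrative, not taken from any Ironic code:

```python
import json

# Block-device list in the same shape as the introspection data in this report.
INTROSPECTION_DISKS = json.loads("""
[
  {"name": "/dev/sda", "size": 1000204886016, "rotational": true,
   "wwn": "0x5000c500505bdb82", "serial": "Z1W037BK"},
  {"name": "/dev/sdb", "size": 240057409536, "rotational": false,
   "wwn": "0x55cd2e414c819b40", "serial": "CVTS518100Z0240JGN"}
]
""")

def root_device_hint(disks, want_name):
    """Build a root_device hint for the disk currently named want_name.

    Hinting by WWN survives /dev/sd* renumbering across boots, which is
    why the scale lab uses it instead of the device name.
    """
    for disk in disks:
        if disk["name"] == want_name:
            return {"wwn": disk["wwn"]}
    raise LookupError("no disk named %s in introspection data" % want_name)

hint = root_device_hint(INTROSPECTION_DISKS, "/dev/sdb")
print(hint)
```

The resulting dict would then be stored on the node, e.g. with something like `openstack baremetal node set <uuid> --property root_device='{"wwn": "0x55cd2e414c819b40"}'` (exact CLI form depends on the client version in use).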
We have hit it on the following server when setting /dev/sdb as the root device.

NODE: f800fe90-3e4a-4148-816f-cd2a4f13cf4b

```
[
  {
    "size": 1000204886016,
    "rotational": true,
    "vendor": "ATA",
    "name": "/dev/sda",
    "wwn_vendor_extension": null,
    "wwn_with_extension": "0x5000c500505bdb82",
    "model": "ST1000NM0033-9ZM",
    "wwn": "0x5000c500505bdb82",
    "serial": "Z1W037BK"
  },
  {
    "size": 240057409536,
    "rotational": false,
    "vendor": "ATA",
    "name": "/dev/sdb",
    "wwn_vendor_extension": null,
    "wwn_with_extension": "0x55cd2e414c819b40",
    "model": "INTEL SSDSC2BF24",
    "wwn": "0x55cd2e414c819b40",
    "serial": "CVTS518100Z0240JGN"
  }
]
```

```
[stack@lmorlct0311dir0 ~]$ openstack baremetal node show f800fe90-3e4a-4148-816f-cd2a4f13cf4b
+------------------------+--------------------------------------------------------------------+
| Field                  | Value                                                              |
+------------------------+--------------------------------------------------------------------+
| clean_step             | {}                                                                 |
| console_enabled        | False                                                              |
| created_at             | 2017-08-28T16:37:49+00:00                                          |
| driver                 | fake_pxe                                                           |
| driver_info            | {u'deploy_ramdisk': u'64c0786d-28aa-45e4-a8bd-f3167813f10e',       |
|                        | u'deploy_kernel': u'cba60158-7c99-43ca-b7af-0a5ee080820e'}         |
| driver_internal_info   | {u'agent_url': u'http://172.20.0.10:9999',                         |
|                        | u'is_whole_disk_image': False,                                     |
|                        | u'agent_last_heartbeat': 1503945298}                               |
| extra                  | {u'hardware_swift_object':                                         |
|                        | u'extra_hardware-f800fe90-3e4a-4148-816f-cd2a4f13cf4b'}            |
| inspection_finished_at | None                                                               |
| inspection_started_at  | None                                                               |
| instance_info          | {}                                                                 |
| instance_uuid          | None                                                               |
| last_error             | None                                                               |
| maintenance            | False                                                              |
| maintenance_reason     | None                                                               |
| name                   | None                                                               |
| ports                  | [{u'href': u'http://10.240.41.179:13385/v1/nodes/f800fe90-3e4a-    |
|                        | 4148-816f-cd2a4f13cf4b/ports', u'rel': u'self'},                   |
|                        | {u'href': u'http://10.240.41.179:13385/nodes/f800fe90-3e4a-4148-   |
|                        | 816f-cd2a4f13cf4b/ports', u'rel': u'bookmark'}]                    |
| power_state            | power off                                                          |
| properties             | {u'cpu_arch': u'x86_64',                                           |
|                        | u'root_device': {u'serial': u'CVTS518100Z0240JGN'},                |
|                        | u'cpus': u'4',                                                     |
|                        | u'capabilities': u'profile:control,boot_option:local',             |
|                        | u'memory_mb': u'32768', u'local_gb': u'222'}                       |
| provision_state        | available                                                          |
| provision_updated_at   | 2017-08-28T18:35:00+00:00                                          |
| raid_config            | {}                                                                 |
| reservation            | None                                                               |
| states                 | [{u'href': u'http://10.240.41.179:13385/v1/nodes/f800fe90-3e4a-    |
|                        | 4148-816f-cd2a4f13cf4b/states', u'rel': u'self'},                  |
|                        | {u'href': u'http://10.240.41.179:13385/nodes/f800fe90-3e4a-4148-   |
|                        | 816f-cd2a4f13cf4b/states', u'rel': u'bookmark'}]                   |
| target_power_state     | None                                                               |
| target_provision_state | None                                                               |
| target_raid_config     | {}                                                                 |
| updated_at             | 2017-08-28T18:35:00+00:00                                          |
| uuid                   | f800fe90-3e4a-4148-816f-cd2a4f13cf4b                               |
+------------------------+--------------------------------------------------------------------+
```

This results in:

```
2017-08-28 14:32:30.610 7440 DEBUG oslo_messaging._drivers.amqpdriver [req-b5616222-9a6b-479a-802b-5dd76c38dfb3 2e125d9506b3484686a1d017c8e7a577 277da56f19ad4559b703b36dde26be28 - - -] CAST unique_id: 817128e32cbd4c4fa96874aa81b91368 NOTIFY exchange 'nova' topic 'notifications.error' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:479
2017-08-28 14:35:03.362 7439 ERROR nova.scheduler.utils [req-5d75bff2-885f-4dd5-a1bc-236fda9f531f 2e125d9506b3484686a1d017c8e7a577 277da56f19ad4559b703b36dde26be28 - - -] [instance: a2e43458-b8a6-4aa6-8c29-62b91ef951b9] Error from last host: lmorlct0311dir0 (node f800fe90-3e4a-4148-816f-cd2a4f13cf4b):
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1783, in _do_build_and_run_instance
    filter_properties)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1981, in _build_and_run_instance
    instance_uuid=instance.uuid, reason=six.text_type(e))
RescheduledException: Build of instance a2e43458-b8a6-4aa6-8c29-62b91ef951b9 was re-scheduled: Failed to provision instance a2e43458-b8a6-4aa6-8c29-62b91ef951b9: Failed to start the iSCSI target to deploy the node f800fe90-3e4a-4148-816f-cd2a4f13cf4b. Error: {u'message': u"Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {u'serial': u'CVTS518100Z0240JGN'}", u'code': 404, u'type': u'DeviceNotFound', u'details': u"No suitable device was found for deployment using these hints {u'serial': u'CVTS518100Z0240JGN'}"}
```

Hello Lenovo, please let me know if you would be able to attach the introspection data plus the ramdisk logs from the deployment. Thank you, Joe Kachuck

I'm confused by the device-not-found error in comment 8. It is unlikely that the device really went away. Questions:

- Could this be some sort of race condition where the serial number hasn't been made visible yet in sysfs (or wherever)? For example, is the device in JBOD mode? This is an intermittent problem, right? In other words, if it had waited an extra few seconds, would the device still not have been found? That seems unlikely: disks are typically already spun up before the OS reboots, particularly the system disk (otherwise it wouldn't boot). But perhaps in JBOD mode the storage controller doesn't do this for the non-system-disk devices, and just lets the OS hot-plug the devices as they come online. If so, the answer is to not use JBOD mode: you then have a virtual drive, and it will always be online before the OS boots, so there should be no race.
- Does this fail the same way if the WWID is used as the device hint?
- If you specify the other device's serial number, does that work?
- Is there a way to turn on extra logging to see what facts it knew about the available devices at the point when it decided there was no such device? Or is this already in the ramdisk logs?

HTH, Ben

I've posted an update to the patch; this will hopefully fix the bug. I cannot test it, however.

The fix is in master.
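The DeviceNotFound error above is produced when the agent in the ramdisk compares the configured hints against the attributes it reads from the local block devices; if even one field (such as a serial number collected from a different source than introspection used) fails to match, no device is found although the disk is physically present. A simplified, hypothetical sketch of that matching step, not IPA's actual code:

```python
def find_device(block_devices, hints):
    """Return the first device whose attributes satisfy every hint.

    Mirrors, in simplified form, the matching the deploy agent performs:
    all hint fields must match exactly, so a single differently-reported
    field means no device is returned at all.
    """
    for dev in block_devices:
        if all(dev.get(key) == value for key, value in hints.items()):
            return dev
    return None

# Devices as the agent might enumerate them on this node. The serial is the
# fragile field: if the ramdisk reads it differently (or not at all) compared
# with what introspection recorded, the serial hint stops matching.
agent_view = [
    {"name": "/dev/sda", "wwn": "0x5000c500505bdb82", "serial": "Z1W037BK"},
    {"name": "/dev/sdb", "wwn": "0x55cd2e414c819b40", "serial": None},
]

print(find_device(agent_view, {"serial": "CVTS518100Z0240JGN"}))  # no match
print(find_device(agent_view, {"wwn": "0x55cd2e414c819b40"}))     # matches sdb
```

This is why retrying with a different hint field (e.g. WWN instead of serial, as Ben suggests above) is a useful diagnostic: it distinguishes "disk missing" from "one attribute reported differently".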
If it does resolve the problem, should we backport it to OSP 10 (which this bug is flagged for)?

This patch should be backported to Newton.

Hi there, if this bug requires doc text for the errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -. Thanks, Alex

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2671

The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.