Description of problem:
In some cases numa_topology isn't populated at introspection time.

Version-Release number of selected component (if applicable): Latest

How reproducible: Always

Steps to Reproduce:
1. Introspect nodes
2. Try to deploy an SR-IOV node using director

Actual results: Fails due to numa_topology missing

Expected results: Succeeds

Additional info:
Downloading the introspection data shows numa_topology is missing, but the inspector logs show the node does have NUMA:

Jan 15 20:28:46 localhost.localdomain ironic-python-agent[3698]: 2020-01-15 20:28:44.974 3698 DEBUG ironic_lib.utils [-] Execution completed, command line is "lscpu" execute /usr/lib/python2.7/site-packages/ironic_lib/utils.py:101
Jan 15 20:28:46 localhost.localdomain ironic-python-agent[3698]: 2020-01-15 20:28:44.975 3698 DEBUG ironic_lib.utils [-] Command stdout is: "Architecture: x86_64
Jan 15 20:28:46 localhost.localdomain ironic-python-agent[3698]: CPU op-mode(s): 32-bit, 64-bit
Jan 15 20:28:46 localhost.localdomain ironic-python-agent[3698]: Byte Order: Little Endian
Jan 15 20:28:46 localhost.localdomain ironic-python-agent[3698]: CPU(s): 80
Jan 15 20:28:46 localhost.localdomain ironic-python-agent[3698]: On-line CPU(s) list: 0-79
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Thread(s) per core: 2
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Core(s) per socket: 20
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Socket(s): 2
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: NUMA node(s): 2
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Vendor ID: GenuineIntel
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: CPU family: 6
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Model: 85
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Model name: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Stepping: 7
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: CPU MHz: 2500.000
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: BogoMIPS: 5000.00
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Virtualization: VT-x
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: L1d cache: 32K
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: L1i cache: 32K
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: L2 cache: 1024K
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: L3 cache: 28160K
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: NUMA node0 CPU(s): 0-19,40-59
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: NUMA node1 CPU(s): 20-39,60-79
Jan 15 20:28:47 localhost.localdomain ironic-python-agent[3698]: Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
Created attachment 1652876 [details] ipa logs from rhosp13 vm lab
It's as if /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py was never called...
lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:def collect_numa_topology_info(data, failures):
lib/python2.7/site-packages/ironic_python_agent-3.2.3-py2.7.egg-info/entry_points.txt:numa-topology = ironic_python_agent.numa_inspector:collect_numa_topology_info
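As a sanity check that the entry point mapping resolves to the expected function, here is a standalone sketch that parses an entry_points.txt fragment. The key/value line is copied from the egg-info grep above; the section header `ironic_python_agent.inspector.collectors` is an assumption for illustration (it is not shown in the grep output).

```python
# Parse a setuptools entry_points.txt fragment and confirm that the
# numa-topology collector maps to collect_numa_topology_info.
import configparser

# Section name is an assumed group name; the mapping line is from the grep above.
ENTRY_POINTS = """\
[ironic_python_agent.inspector.collectors]
numa-topology = ironic_python_agent.numa_inspector:collect_numa_topology_info
"""

parser = configparser.ConfigParser()
parser.read_string(ENTRY_POINTS)
target = parser["ironic_python_agent.inspector.collectors"]["numa-topology"]

# entry point targets have the form "module:callable"
module_name, func_name = target.split(":")
print(module_name)  # ironic_python_agent.numa_inspector
print(func_name)    # collect_numa_topology_info
```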
Dave - I assume that the numa-topology collector is set in inspector.ipxe? For example:

$ sudo grep collector /httpboot/inspector.ipxe
kernel --timeout 60000 http://10.8.146.1:8088/agent.kernel ipa-inspection-callback-url=http://10.8.146.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,numa-topology,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 initrd=agent.ramdisk || goto retry_boot
I added some debugging code and we see it's calling only those 2 collectors:

    for name, collector in collectors:
        try:
            LOG.debug("collecting %s", name)  # <===============
            collector(data, failures)
        except Exception as exc:
            # No reraise here, try to keep going
            failures.add('collector %s failed: %s', name, exc)

[root@undercloud-0-rhosp13 ramdisk]# grep collecting journal
Jan 16 15:32:06 localhost.localdomain ironic-python-agent[784]: 2020-01-16 15:32:06.716 784 DEBUG ironic_python_agent.inspector [-] collecting default inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:87
Jan 16 15:33:09 localhost.localdomain ironic-python-agent[784]: 2020-01-16 15:33:09.510 784 DEBUG ironic_python_agent.inspector [-] collecting extra-hardware inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:87
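For context, the loop quoted above swallows collector exceptions instead of re-raising them, so one broken collector shouldn't stop the others. A minimal standalone model of that behavior (simplified: `failures` is a plain list here rather than inspector's failures object):

```python
# Simplified model of the inspector's collector loop: a collector that
# raises is recorded as a failure and the loop moves on to the next one.
def bad_collector(data, failures):
    raise RuntimeError("boom")

def good_collector(data, failures):
    data["ok"] = True

data = {}
failures = []
for name, collector in [("bad", bad_collector), ("good", good_collector)]:
    try:
        collector(data, failures)
    except Exception as exc:
        # No reraise here, try to keep going (mirrors the real loop)
        failures.append("collector %s failed: %s" % (name, exc))

print(data)      # {'ok': True}
print(failures)  # ['collector bad failed: boom']
```

This is why the missing "collecting numa-topology" line pointed at either the collector never running or the log being truncated, rather than at a crash of the agent itself.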
It's either that it's not called, or the logs are truncated and we're missing the real error here - but numa-topology is enabled:

Jan 16 15:32:06 localhost.localdomain ironic-python-agent[784]: 2020-01-16 15:32:06.711 784 INFO ironic_python_agent.inspector [-] inspection is enabled with collectors ['default', 'extra-hardware', 'numa-topology', 'logs']
I skipped the extra-hardware collector and now I see numa-topology:

[root@undercloud-0-rhosp13 ramdisk]# cat journal | grep collecting
Jan 16 15:54:17 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:54:16.932 774 DEBUG ironic_python_agent.inspector [-] collecting default inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:89
Jan 16 15:55:18 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:55:18.644 774 DEBUG ironic_python_agent.inspector [-] collecting numa-topology inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:89
Jan 16 15:55:18 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:55:18.644 774 DEBUG ironic_python_agent.inspector [-] collecting numa-topology inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:89
Jan 16 15:55:18 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:55:18.646 774 DEBUG ironic_python_agent.numa_inspector [-] Found memory available 1572472 KB in NUMA node 0 get_nodes_memory_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:91
Jan 16 15:55:18 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:55:18.647 774 DEBUG ironic_python_agent.numa_inspector [-] Found memory available 1572668 KB in NUMA node 1 get_nodes_memory_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:91
Jan 16 15:55:18 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:55:18.647 774 DEBUG ironic_python_agent.numa_inspector [-] Found a thread sibling 0 for CPU 0 in NUMA node 0 get_nodes_cores_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:163
Jan 16 15:55:18 localhost.localdomain ironic-python-agent[774]: 2020-01-16 15:55:18.647 774 DEBUG ironic_python_agent.numa_inspector [-] Found a thread sibling 1 for CPU 0 in NUMA node 0 get_nodes_cores_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:163
It's still not working though:

openstack baremetal introspection data save 079479c7-ad79-4ab9-81cd-af92d5c3d0e0 | jq .numa_topology
null

and I'm still missing the "logs" collector.
I also commented out get_nodes_nics_info in numa_inspector.py:

    try:
        numa_info['ram'] = get_nodes_memory_info(numa_node_dirs)
        numa_info['cpus'] = get_nodes_cores_info(numa_node_dirs)
        # numa_info['nics'] = get_nodes_nics_info(nic_device_path)

and now I get the numa_topology:

(undercloud) [stack@undercloud-0-rhosp13 images]$ openstack baremetal introspection data save 079479c7-ad79-4ab9-81cd-af92d5c3d0e0 | jq .numa_topology
{
  "cpus": [
    {
      "cpu": 0,
      "thread_siblings": [2, 3],
      "numa_node": 1
    },
    {
      "cpu": 0,
      "thread_siblings": [0, 1],
      "numa_node": 0
    }
  ],
  "ram": [
    {
      "size_kb": 1572472,
      "numa_node": 0
    },
    {
      "size_kb": 1572668,
      "numa_node": 1
    }
  ]
}
I suspect the error handling (the raises) is not behaving as expected here.
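The suspicion can be sketched like this (a standalone illustration: the function names follow numa_inspector.py, but the bodies are fabricated stand-ins). Because RAM, CPU, and NIC gathering all run inside one try block, a raise from the NIC step throws away the RAM and CPU data that were already collected:

```python
# Stand-in sketch of the suspected failure mode; function bodies are
# fabricated, only the names and the single-try structure mirror the source.
class IncompatibleNumaFormatError(Exception):
    pass

def get_nodes_memory_info(numa_node_dirs):
    return [{"numa_node": 0, "size_kb": 1572472}]

def get_nodes_cores_info(numa_node_dirs):
    return [{"cpu": 0, "numa_node": 0, "thread_siblings": [0, 1]}]

def get_nodes_nics_info(nic_device_path):
    # On a VM there is no /sys/class/net/eth0/device/numa_node
    raise IncompatibleNumaFormatError("Failed to gather NICs for eth0")

def collect_numa_topology_info(data, failures):
    numa_info = {}
    try:
        numa_info["ram"] = get_nodes_memory_info(None)
        numa_info["cpus"] = get_nodes_cores_info(None)
        numa_info["nics"] = get_nodes_nics_info(None)
    except IncompatibleNumaFormatError as exc:
        failures.append(str(exc))
        return  # ram and cpus were gathered fine, but get discarded too
    data["numa_topology"] = numa_info

data, failures = {}, []
collect_numa_topology_info(data, failures)
print("numa_topology" in data)  # False
```

That matches the observation above: commenting out only the nics line is enough to make the whole numa_topology appear.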
The following patch might help for NICs without a numa_node entry: https://review.opendev.org/702990
The above patch fixed the NUMA issue with the NICs:

Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.265 785 DEBUG ironic_python_agent.inspector [-] collecting numa-topology inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:89
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.267 785 DEBUG ironic_python_agent.numa_inspector [-] Found memory available 1572472 KB in NUMA node 0 get_nodes_memory_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:91
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.268 785 DEBUG ironic_python_agent.numa_inspector [-] Found memory available 1572668 KB in NUMA node 1 get_nodes_memory_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:91
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.269 785 DEBUG ironic_python_agent.numa_inspector [-] Found a thread sibling 0 for CPU 0 in NUMA node 0 get_nodes_cores_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:163
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.270 785 DEBUG ironic_python_agent.numa_inspector [-] Found a thread sibling 1 for CPU 0 in NUMA node 0 get_nodes_cores_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:163
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.271 785 DEBUG ironic_python_agent.numa_inspector [-] Found a thread sibling 2 for CPU 0 in NUMA node 1 get_nodes_cores_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:163
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.272 785 DEBUG ironic_python_agent.numa_inspector [-] Found a thread sibling 3 for CPU 0 in NUMA node 1 get_nodes_cores_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:163
Jan 16 16:42:42 localhost.localdomain ironic-python-agent[785]: 2020-01-16 16:42:42.273 785 DEBUG ironic_python_agent.inspector [-] collecting logs inspect /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:89

It remains to find out why extra-hardware is breaking.
From customer logs...

Jan 16 13:34:04 localhost.localdomain ironic-python-agent[781]: 2020-01-16 13:34:04.762 781 INFO root [-] The system is a virtual machine, so biosdevname utility does not provide names for virtual NICs.

Also there is no BMC address. The log ends with hardware-detect completion, similar to Dave's setup; not sure if the logs are getting truncated:

Jan 16 13:34:06 localhost.localdomain ironic-python-agent[781]: 2020-01-16 13:34:06.227 781 DEBUG ironic_python_agent.inspector [-] BMC IP address: 0.0.0.0 collect_default /usr/lib/python2.7/site-packages/ironic_python_agent/inspector.py:211
Jan 16 13:34:06 localhost.localdomain ironic-python-agent[781]: 2020-01-16 13:34:06.228 781 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): hardware-detect execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
Jan 16 13:34:07 localhost.localdomain kernel: random: crng init done
Jan 16 13:34:07 localhost.localdomain kernel: ipmi_si: IPMI System Interface driver
Jan 16 13:34:07 localhost.localdomain kernel: ipmi_si: Unable to find any System Interface(s)
Jan 16 13:34:07 localhost.localdomain ironic-python-agent[781]: 2020-01-16 13:34:07.433 781 DEBUG oslo_concurrency.processutils [-] CMD "hardware-detect" returned: 0 in 1.205s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
Jan 16 13:34:07 localhost.localdomain ironic-python-agent[781]: 2020-01-16 13:34:07.434 781 DEBUG ironic_lib.utils [-] Execution completed, command line is "hardware-detect" execute /usr/lib/python2.7/site-packages/ironic_lib/utils.py:101

The patch looks good Dave, but yeah, we need to figure out why extra-hardware also had to be removed.
I manually backported the patch from [1] and I still got numa_topology working so I reverted my numa_topology fix and will see if it breaks again. [1] https://review.rdoproject.org/r/#/c/21190/
It looks like this is the reason why IPA isn't returning the numa_topology at all in my case:

Jan 16 19:03:03 localhost.localdomain ironic-python-agent[771]: 2020-01-16 19:03:03.948 771 WARNING ironic_python_agent.numa_inspector [-] Failed to get some NUMA information (Error in NUMA node data format: Failed to gather NIC's for NUMA node eth0: [Errno 2] No such file or directory: '/sys/class/net/eth0/device/numa_node'): IncompatibleNumaFormatError: Error in NUMA node data format: Failed to gather NIC's for NUMA node eth0: [Errno 2] No such file or directory: '/sys/class/net/eth0/device/numa_node'

So in the case of a virtual NUMA environment, my patch is required, as the eth0/device/numa_node file isn't present. That would also be the case if a given PCI adapter isn't bound to a NUMA node, but I don't know hardware well enough to tell whether that situation is possible.
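The idea behind the fix can be sketched as follows (a standalone illustration against a fabricated sysfs tree, not the actual patch): treat a missing numa_node file as "no NUMA affinity" and skip that NIC instead of raising.

```python
# Defensive reading of /sys/class/net/<nic>/device/numa_node: a missing
# file (virtual NIC, USB adapter, device with no NUMA affinity) yields
# None instead of an exception that kills the whole collector.
import errno
import os
import tempfile

def get_nic_numa_node(nic, sysfs_root="/sys/class/net"):
    path = os.path.join(sysfs_root, nic, "device", "numa_node")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return None  # skip this NIC rather than raising
        raise

# Fabricated sysfs tree: ens3f0 exposes a numa_node file, eth0 does not.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "ens3f0", "device"))
os.makedirs(os.path.join(root, "eth0", "device"))
with open(os.path.join(root, "ens3f0", "device", "numa_node"), "w") as f:
    f.write("0\n")

print(get_nic_numa_node("ens3f0", root))  # 0
print(get_nic_numa_node("eth0", root))    # None
```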
> The log ends with hardware-detect completion, similar to Dave's setup, not sure if logs are getting truncated Yep, that's the log truncation problem, we've fixed it in z10. I've +2'ed the upstream patch, great catch David!
>> The log ends with hardware-detect completion, similar to Dave's setup, not sure if logs are getting truncated
> Yep, that's the log truncation problem, we've fixed it in z10.

Great. BTW, it turns out the truncation fix didn't make z10 after all and will be in z11 (https://bugzilla.redhat.com/show_bug.cgi?id=1722028).
Looks like the customer might have hit exactly this issue too:

Jan 17 11:48:00 localhost.localdomain NetworkManager[961]: <info> [1579279680.5184] device (usb0): interface index 12 renamed iface from 'usb0' to 'enp0s20f0u1u6'

and in the numa-topology collector we don't see it:

Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.163 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC eno2 in NUMA node 0 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.186 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC eno1 in NUMA node 0 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.208 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens3f1 in NUMA node 0 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.231 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens3f0 in NUMA node 0 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.254 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens6f1 in NUMA node 1 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.277 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens6f0 in NUMA node 1 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.300 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens5f1 in NUMA node 1 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.323 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens5f0 in NUMA node 1 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.345 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens4f1 in NUMA node 0 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
Jan 17 11:50:49 localhost.localdomain ironic-python-agent[3776]: 2020-01-17 11:50:49.369 3776 DEBUG ironic_python_agent.numa_inspector [-] Found a NIC ens4f0 in NUMA node 0 get_nodes_nics_info /usr/lib/python2.7/site-packages/ironic_python_agent/numa_inspector.py:202
sudo lshw | grep usb
     capabilities: pci pnp upgrade shadowing cdboot bootselect edd int14serial acpi usb biosbootspecification uefi
  *-usbhost:0
     bus info: usb@1
     logical name: usb1
     capabilities: usb-2.00
   *-usb:0
      bus info: usb@1:1
      capabilities: usb-2.00
    *-usb
       bus info: usb@1:1.6
       capabilities: usb-2.00 ethernet physical
   *-usb:1
      bus info: usb@1:4
      capabilities: usb-1.10
      configuration: driver=usbhid maxpower=100mA speed=2Mbit/s
  *-usbhost:1
     bus info: usb@2
     logical name: usb2
     capabilities: usb-3.00

Looks like they have a USB Ethernet adapter plugged into their servers. That's interesting.
Verified at customer site.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0765