Bug 1813461 - OSP 16 DCN OVS scaled node introspection failed on timed out
Summary: OSP 16 DCN OVS scaled node introspection failed on timed out
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: ---
Assignee: RHOS Maint
QA Contact: Alistair Tonner
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-13 21:05 UTC by Yuri Obshansky
Modified: 2020-03-24 21:55 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-24 21:55:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
ironic-inspector log (264.67 KB, application/gzip)
2020-03-13 21:05 UTC, Yuri Obshansky
no flags Details
ironic-conductor log (361.16 KB, application/gzip)
2020-03-13 21:05 UTC, Yuri Obshansky
no flags Details

Description Yuri Obshansky 2020-03-13 21:05:18 UTC
Created attachment 1669984 [details]
ironic-inspector log

Description of problem:
OSP 16 DCN with OVS network backend and Spine-Leaf network topology.
Scaled up node failed on introspection 
.........
Waiting for introspection to finish...
Introspection of node attempt failed:dcn3-compute-0.
Retrying 1 nodes that failed introspection. Attempt 1 of 3 
Introspection of node attempt failed:4299a8c6-98b4-4935-9372-ecaaf2794190.
Retrying 1 nodes that failed introspection. Attempt 2 of 3 
Introspection of node attempt failed:4299a8c6-98b4-4935-9372-ecaaf2794190.
Retrying 1 nodes that failed introspection. Attempt 3 of 3 
Introspection of node attempt failed:4299a8c6-98b4-4935-9372-ecaaf2794190.
Retry limit reached with 1 nodes still failing introspectionWaiting for messages on queue 'tripleo' with no timeout.
Introspection completed with errors: Retry limit reached with 1 nodes still failing introspection non-zero return code

ironic-inspector.log
2020-03-13 09:23:43.692 8 ERROR ironic_inspector.node_cache [-] Introspection for nodes ['4299a8c6-98b4-4935-9372-ecaaf2794190'] has timed out
2020-03-13 09:23:43.696 8 DEBUG ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state waiting] Attempting to acquire l
ock acquire_lock /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:105
2020-03-13 09:23:43.696 8 DEBUG ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state waiting] Successfully acquired l
ock acquire_lock /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:107
2020-03-13 09:23:43.696 8 DEBUG ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state waiting] Successfully released l
ock release_lock /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:119
2020-03-13 09:23:43.697 8 DEBUG ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state waiting] Executing fsm(waiting).
process_event(timeout) fsm_event /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:207
2020-03-13 09:23:43.697 8 INFO ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state waiting] Updating node state: wai
ting --> error
2020-03-13 09:23:43.698 8 DEBUG ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state waiting] Committing fields: {'st
ate': 'error'} _commit /usr/lib/python3.6/site-packages/ironic_inspector/node_cache.py:150
2020-03-13 09:23:43.708 8 DEBUG ironic_inspector.node_cache [-] [node: 4299a8c6-98b4-4935-9372-ecaaf2794190 state error] Committing fields: {'fini
shed_at': datetime.datetime(2020, 3, 13, 9, 23, 43, 697300), 'error': 'Introspection timeout'} _commit /usr/lib/python3.6/site-packages/ironic_ins
pector/node_cache.py:150


ironic-conductor.log
2020-03-13 15:19:06.047 7 DEBUG ironic.common.utils [req-dc3d6f4a-e35e-47e3-9db6-d203f4865372 - - - - -] Execution completed, command line is "ipmitool -I lanplus -H 172.16.10.1 -L ADMINISTRATOR -p 6234 -U admin -v -R 12 -N 5 -f /tmp/tmpfvihl7u4 power status" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:77
2020-03-13 15:19:06.047 7 DEBUG ironic.common.utils [req-dc3d6f4a-e35e-47e3-9db6-d203f4865372 - - - - -] Command stdout is: "Chassis Power is on
" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:78
2020-03-13 15:19:06.047 7 DEBUG ironic.common.utils [req-dc3d6f4a-e35e-47e3-9db6-d203f4865372 - - - - -] Command stderr is: "Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error response 0xc1 from Get PICMG Properities
Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20
Invalid completion code received: Invalid command
Discovered IPMB address 0x0
" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79


Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200226.n.1

puppet-ironic-15.4.1-0.20191022165413.8fe6978.el8ost.noarch
python3-ironicclient-3.1.0-0.20190926124316.04ef2d7.el8ost.noarch
openstack-ironic-python-agent-builder-1.1.1-0.20191203040321.a34dfda.el8ost.noarch
python3-ironic-inspector-client-3.7.0-0.20190923163033.d95a4cd.el8ost.noarch


How reproducible:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-day2-16-rhel-virthost-3cont-2comp-2leafs-x-2comp-ovs-scale-up-down/7/consoleFull
or
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-day2-16-rhel-virthost-3cont-2comp-2leafs-x-2comp-ovs-scale-up-down/2/console

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

The same test flow with OVN network backend works great.
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-day2-16-rhel-virthost-3cont-2comp-2leafs-x-2comp-ovn-scale-up-down/1/consoleFull

Comment 1 Yuri Obshansky 2020-03-13 21:05:58 UTC
Created attachment 1669985 [details]
ironic-conductor log

Comment 5 Bob Fournier 2020-03-16 16:12:54 UTC
Bogdan - note that is the neutron dhcp server you are looking at but this bug is an introspection issue, so that's the inspector dnsmasq.

Comment 6 Bogdan Dobrelya 2020-03-17 08:59:50 UTC
@Bob, indeed thanks. I've retried it for the br-ctlplane but the result look unchanged - no replies logged by tcpdump

Comment 7 Bogdan Dobrelya 2020-03-19 13:40:57 UTC
Could this perchance be a duplicate of bug 1814616?

[stack@site-undercloud-0 ~]$ sudo ovs-vsctl -V
ovs-vsctl (Open vSwitch) 2.11.0
DB Schema 7.16.1
[stack@site-undercloud-0 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.1 (Ootpa)

Comment 8 Bob Fournier 2020-03-19 13:58:39 UTC
Bogdan - interesting idea. We had been assuming that the problem with DHCP packets not being received by inspector dnsmasq was only happening with 16.1 and OVS 2.13, but it does look similar to this issue.  We haven't had any other reports from 16.0 CI of this problem though.

Comment 12 Bogdan Dobrelya 2020-03-20 14:29:02 UTC
so this is likely the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1814616#c9

Comment 13 Harald Jensås 2020-03-20 16:17:41 UTC
(In reply to Bogdan Dobrelya from comment #7)
> Could this perchance be a duplicate of bug 1814616?
> 

I think this is not a duplicate, of bug 1814616

Ironic inspector's dnsmasq does receive requests here:
[root@site-undercloud-0 ~]# tail -n 5 /var/log/containers/ironic-inspector/dnsmasq.log
Mar 20 16:14:59 dnsmasq-dhcp[7]: no address range available for DHCP request via 10.35.64.34
Mar 20 16:15:09 dnsmasq-dhcp[7]: no address range available for DHCP request via 10.35.64.34
Mar 20 16:15:09 dnsmasq-dhcp[7]: no address range available for DHCP request via 10.35.64.34
Mar 20 16:15:20 dnsmasq-dhcp[7]: no address range available for DHCP request via 10.35.64.34
Mar 20 16:15:20 dnsmasq-dhcp[7]: no address range available for DHCP request via 10.35.64.34



[root@site-undercloud-0 ~]# cat /var/lib/config-data/puppet-generated/ironic_inspector/etc/ironic-inspector/dnsmasq.conf 
port=0
interface=br-ctlplane

dhcp-range=set:leaf0,192.168.24.100,192.168.24.190,255.255.255.0,10m
dhcp-option=tag:leaf0,option:router,192.168.24.254
dhcp-option-force=tag:leaf0,option:mtu,1500
dhcp-range=set:leaf1,192.168.34.100,192.168.34.190,255.255.255.0,10m
dhcp-option=tag:leaf1,option:router,192.168.34.254
dhcp-option-force=tag:leaf1,option:mtu,1500
dhcp-range=set:leaf2,192.168.44.100,192.168.44.190,255.255.255.0,10m
dhcp-option=tag:leaf2,option:router,192.168.44.254
dhcp-option-force=tag:leaf2,option:mtu,1500
dhcp-sequential-ip
dhcp-match=ipxe,175
dhcp-match=set:efi,option:client-arch,7
dhcp-match=set:efi,option:client-arch,9
dhcp-match=set:efi,option:client-arch,11
# dhcpv6s for Client System Architecture Type (61)
dhcp-match=set:efi6,option6:61,0007
dhcp-match=set:efi6,option6:61,0009
dhcp-match=set:efi6,option6:61,0011
dhcp-userclass=set:ipxe6,iPXE
# Client is already running iPXE; move to next stage of chainloading
dhcp-boot=tag:ipxe,http://192.168.24.1:8088/inspector.ipxe
dhcp-option=tag:ipxe6,option6:bootfile-url,http://192.168.24.1:8088/inspector.ipxe
# Client is PXE booting over EFI without iPXE ROM; send EFI version of iPXE chainloader
dhcp-boot=tag:efi,tag:!ipxe,ipxe.efi
dhcp-option=tag:efi6,tag:!ipxe6,option6:bootfile-url,tftp://192.168.24.1/ipxe.efi
# Client is running PXE over BIOS; send BIOS version of iPXE chainloader
dhcp-boot=undionly.kpxe,localhost.localdomain,192.168.24.1

dhcp-hostsdir=/var/lib/ironic-inspector/dhcp-hostsdir

Comment 14 Bogdan Dobrelya 2020-03-20 16:39:52 UTC
If I add log-dhcp I can also see 
dnsmasq-dhcp[7]: no address range available for DHCP request via 192.168.54.254

logged when I do virsh reset dcn3-compute-0

so this probably a configuration issue indeed?

()[root@site-undercloud-0 /]# grep  range /etc/ironic-inspector/dnsmasq.conf                                                                                                                                
dhcp-range=172.20.0.100,172.20.0.110,255.255.255.0,10m
dhcp-range=set:leaf0,192.168.24.100,192.168.24.190,255.255.255.0,10m
dhcp-range=set:leaf1,192.168.34.100,192.168.34.190,255.255.255.0,10m
dhcp-range=set:leaf2,192.168.44.100,192.168.44.190,255.255.255.0,10m

[root@site-undercloud-0 stack]# grep 192.168.54. undercloud.conf 
cidr = 192.168.54.0/24
dhcp_start = 192.168.54.10
dhcp_end = 192.168.54.90
inspection_iprange = 192.168.54.100,192.168.54.190
gateway = 192.168.54.254

Comment 17 Yuri Obshansky 2020-03-24 21:55:23 UTC
That is not bug just configuration issue in undercloud.conf
Should be re-installed UC with new subnet in
subnets = leaf0,leaf1,leaf2,leaf3
Test passed
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-day2-16-rhel-virthost-3cont-2comp-2leafs-x-2comp-ovs-scale-up-down/16/

I appreciate your help and time

Thank you
Yuri


Note You need to log in before you can comment on or make changes to this bug.