Bug 1950268 - Removing a VM and its ports (VFs) produces a kernel crash when using a RT image in computes
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: DPDK
Version: FDP 21.B
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Flavio Leitner
QA Contact: liting
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-16 09:26 UTC by Miguel Angel Nieto
Modified: 2023-07-13 07:25 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | FD-1267 | 0 | None | None | None | 2021-09-09 16:27:06 UTC

Description Miguel Angel Nieto 2021-04-16 09:26:53 UTC
Description of problem:
Removing a VM and its ports (VFs) causes a kernel crash on compute nodes that run an RT (real-time) image. Console log from the affected compute node:

[ 8184.214770] IPv4: martian source 10.35.74.8 from 10.35.74.126, on dev eno1
[ 8184.214773] ll header: 00000000: ff ff ff ff ff ff 9c cc 83 58 1c 60 08 06        .........X.`..
[ 8192.714949] i40e 0000:05:00.2: Setting MAC b6:e2:14:b6:d6:4e on VF 8
[ 8192.800714] i40e 0000:05:00.2: Bring down and up the VF interface to make this change effective.
[ 8192.811921] iavf 0000:05:0b.0: enabling device (0000 -> 0002)
[ 8192.874279] iavf 0000:05:0b.0: Multiqueue Enabled: Queue pair count = 4
[ 8192.878943] iavf 0000:05:0b.0: MAC address: b6:e2:14:b6:d6:4e
[ 8192.878945] iavf 0000:05:0b.0: GRO is enabled
[ 8192.893759] iavf 0000:05:0b.0 enp5s0f2v8: renamed from eth0
[ 8192.999646] iavf 0000:05:0b.0: Reset warning received from the PF
[ 8192.999649] iavf 0000:05:0b.0: Scheduling reset task
[ 8193.105429] i40e 0000:05:00.2: VF 8 is now untrusted
[ 8193.108240] IPv6: ADDRCONF(NETDEV_UP): enp5s0f2v8: link is not ready
[ 8193.121854] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 8193.121856] PGD 0 P4D 0
[ 8193.121860] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 8193.121863] CPU: 21 PID: 5689 Comm: NetworkManager Kdump: loaded Not tainted 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1
[ 8193.121864] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.8.0 005/17/2018
[ 8193.121872] RIP: 0010:iavf_alloc_rx_buffers+0x4f/0x250 [iavf]
[ 8193.121874] Code: 0f 85 df 00 00 00 0f b7 47 48 41 89 f7 48 89 fb 49 89 c4 48 8d 14 40 49 89 c5 48 8b 47 20 49 c1 e4 05 4c 03 67 08 48 8d 2c d0 <48> 83 7d 08 00 0f b7 4b 46 0f 84 c1 00 00 00 48 83 83 80 00 00 00
[ 8193.121875] RSP: 0018:ffffc16857923558 EFLAGS: 00010246
[ 8193.121877] RAX: 0000000000000000 RBX: ffff9b72e22e1000 RCX: 0000000000000200
[ 8193.121878] RDX: 0000000000000000 RSI: 00000000000001ff RDI: ffff9b72e22e1000
[ 8193.121879] RBP: 0000000000000000 R08: 0000000000000600 R09: ffff9b7b220a0ec0
[ 8193.121880] R10: 0000000092492480 R11: 0000000000000000 R12: 0000000000000000
[ 8193.121881] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000001ff
[ 8193.121882] FS:  00007f7fff96d200(0000) GS:ffff9b7b3f880000(0000) knlGS:0000000000000000
[ 8193.121883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8193.121884] CR2: 0000000000000008 CR3: 0000003fe9bce001 CR4: 00000000003626e0
[ 8193.121886] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8193.121887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8193.121887] Call Trace:
[ 8193.121896]  iavf_configure+0x124/0x180 [iavf]
[ 8193.121901]  iavf_open+0x100/0x180 [iavf]
[ 8193.121905]  __dev_open+0xcd/0x160
[ 8193.121908]  __dev_change_flags+0x1ad/0x220
[ 8193.121912]  dev_change_flags+0x21/0x60
[ 8193.121916]  do_setlink+0x314/0xed0
[ 8193.121920]  ? preempt_count_add+0x79/0xb0
[ 8193.121922]  ? preempt_count_add+0x79/0xb0
[ 8193.121926]  ? __nla_validate_parse+0x51/0x840
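
When comparing crashes across runs, it can help to reduce a log like the one above to a short signature. The following is a minimal sketch, assuming the console output was saved to a file; `oops_summary` is a hypothetical helper name, not part of any existing tool. It prints the BUG line, the faulting RIP symbol, and the call trace frames that are not marked with "?".

```shell
#!/usr/bin/env bash
# oops_summary: hypothetical helper that prints a short signature of the
# first kernel oops found in a saved console log.
oops_summary() {
    log=$1
    # Fault description, e.g. "BUG: unable to handle kernel NULL pointer ..."
    grep -m1 -o 'BUG: .*' "$log"
    # Faulting instruction pointer: keep symbol+offset/size and the module name.
    grep -m1 'RIP: ' "$log" | sed 's/^.*RIP: [0-9a-f]*://'
    # Call trace: stop at the first line that is not a frame, skip frames
    # the kernel marks with "?", and strip the timestamp prefix.
    awk '
        /Call Trace:/ { in_trace = 1; next }
        in_trace && !/\+0x/ { exit }
        in_trace && /\] *\? / { next }
        in_trace { sub(/^\[[^]]*\] */, ""); print }
    ' "$log"
}
```

Calling `oops_summary console.log` on a file containing the log above would print the BUG line, `iavf_alloc_rx_buffers+0x4f/0x250 [iavf]`, and then the frames from `iavf_configure` down to `do_setlink`.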

It can be reproduced by running the following test case:
python -m testtools.run nfv_tempest_plugin.tests.scenario.test_nfv_sriov_usecases.TestSriovScenarios.test_sriov_free_resource
with the following deployment templates:
https://gitlab.cee.redhat.com/mnietoji/deployment_templates/-/tree/460218cb433959a6b73597a437882966391b1417/tht/panther08/ospd-16.1-geneve-ovn-dpdk-sriov-ctlplane-dataplane-bonding-rt-hybrid-performance-panther08

The testcase does something similar to the following:
#!/usr/bin/env bash

#networks
openstack network create --provider-network-type geneve mgmt
openstack subnet create --gateway 10.10.10.254  --network mgmt --subnet-range 10.10.10.0/24  --dhcp --dns-nameserver 10.46.0.31 --dns-nameserver 8.8.8.8 --allocation-pool start=10.10.10.100,end=10.10.10.200 mgmt_subnet
openstack network create --provider-physical-network sriov-1 --provider-network-type vlan sriov_vf
openstack subnet create --gateway 40.0.0.254  --network sriov_vf --subnet-range 40.0.0.0/24  --dhcp --dns-nameserver 10.46.0.31 --dns-nameserver 8.8.8.8 --allocation-pool start=40.0.0.100,end=40.0.0.200 sriov_vf_subnet

#ports
openstack port create --network mgmt --vnic-type normal mgmt_1
openstack port create --network mgmt --vnic-type normal mgmt_2
openstack port create --network mgmt --vnic-type normal mgmt_3
openstack port create --network mgmt --vnic-type normal mgmt_4
openstack port create --network sriov_vf --vnic-type direct sriov_vf_1
openstack port create --network sriov_vf --vnic-type direct sriov_vf_2
openstack port create --network sriov_vf --vnic-type direct sriov_vf_3
openstack port create --network sriov_vf --vnic-type direct sriov_vf_4
#flavor
openstack flavor create --ram 8192 --disk 20 --vcpus 6 nfv_qe_base_flavor
openstack flavor set nfv_qe_base_flavor --property hw:mem_page_size=large --property hw:cpu_policy=dedicated --property hw:cpu_realtime=yes --property hw:cpu_emulator_threads=isolate --property hw:cpu_realtime_mask=^0-1

#image
curl -o rhel-guest-image-7-6-210-x86-64-qcow2 http://rhos-qe-mirror-tlv.usersys.redhat.com/brewroot/packages/rhel-guest-image/7.6/210/images/rhel-guest-image-7.6-210.x86_64.qcow2
openstack image  create --disk-format qcow2 --container-format bare --public --file ./rhel-guest-image-7-6-210-x86-64-qcow2 rhel-guest-image-7-6-210-x86-64-qcow2

#keypair
openstack keypair create --public-key /home/stack/.ssh/id_rsa.pub mykeypair

#vms
openstack server create --key-name  mykeypair --flavor nfv_qe_base_flavor --image rhel-guest-image-7-6-210-x86-64-qcow2 --security-group default --port mgmt_1 --port sriov_vf_1 myinstance1
openstack server create --key-name  mykeypair --flavor nfv_qe_base_flavor --image rhel-guest-image-7-6-210-x86-64-qcow2 --security-group default --port mgmt_2 --port sriov_vf_2 myinstance2
openstack server create --key-name  mykeypair --flavor nfv_qe_base_flavor --image rhel-guest-image-7-6-210-x86-64-qcow2 --security-group default --port mgmt_3 --port sriov_vf_3 myinstance3
openstack server create --key-name  mykeypair --flavor nfv_qe_base_flavor --image rhel-guest-image-7-6-210-x86-64-qcow2 --security-group default --port mgmt_4 --port sriov_vf_4 myinstance4
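#destroy ports and vms
# collect the mgmt and sriov_vf IP of every server
ips=$(openstack server list --all-projects -c Networks -f value | sed 's/[=,;]/ /g' | awk '{print $2,$4}')
# $ips is left unquoted on purpose so newlines collapse to spaces before joining with "|"
ips=$(echo $ips | sed 's/ /|/g')
# select the ports whose fixed IP matches one of the server IPs
ports=$(openstack port list -f value | grep -E "$ips" | awk '{print $1}')
servers=$(openstack server list --all-projects -c ID -f value)
for server in $servers; do
    openstack server delete "$server"
done
for port in $ports; do
    openstack port delete "$port"
done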
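
The IP-to-pattern step in the script above uses the addresses as a regular expression without escaping the dots. A slightly stricter variant could look like the sketch below. It assumes the Networks column has the form `mgmt=10.10.10.105; sriov_vf=40.0.0.120` (inferred from the `sed` expression in the script, not verified against the deployment); `ips_to_regex` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# ips_to_regex: hypothetical helper. Reads "net=ip; net=ip" lines on stdin,
# extracts the IPv4 addresses, escapes the dots and joins them with "|".
ips_to_regex() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
        | sed 's/\./\\./g' \
        | paste -sd '|' -
}
```

The result can be passed to `grep -E` in place of `$ips`. With GNU grep, adding `-w` should also prevent one address from matching as a prefix of a longer one (for example 10.10.10.10 inside 10.10.10.105).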

The crash does not occur on every run of the test case, but I have reproduced it several times.
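
Because the failure is intermittent, one way to chase it is to repeat the reproducer until a run fails. The sketch below is only an illustration: `repeat_until_fail` is a hypothetical helper, and it assumes the command passed to it exits non-zero when the crash happens (for example a wrapper that runs the test case and then checks that the compute node is still reachable).

```shell
#!/usr/bin/env bash
# repeat_until_fail: hypothetical helper. Runs a command up to MAX times and
# stops at the first failing run. Prints the number of runs performed.
# Returns 0 if a failing run was seen, 1 if all MAX runs succeeded.
repeat_until_fail() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # Send the command's own output to stderr so stdout carries only the count.
        if ! "$@" >&2; then
            echo "$i"
            return 0
        fi
    done
    echo "$i"
    return 1
}
```

For example, `repeat_until_fail 20 ./run_testcase.sh` would run a (hypothetical) wrapper script at most 20 times and print the iteration at which it first failed.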


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210323.n.0
Red Hat Enterprise Linux release 8.2 (Ootpa)
Linux computeovndpdksriovrt-1 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
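
Since the crash was seen on computes running the RT kernel, it may be useful to confirm which nodes actually run one. The check below is a minimal sketch based on the `uname -a` output above; `is_rt_kernel` is a hypothetical helper, and the `PREEMPT_RT` spelling is included on the assumption that other kernel builds may print it that way.

```shell
#!/usr/bin/env bash
# is_rt_kernel: hypothetical helper. Succeeds if the given `uname -a` style
# string (default: the running kernel) identifies a real-time kernel.
is_rt_kernel() {
    info=${1:-$(uname -a)}
    case $info in
        *"PREEMPT RT"*|*PREEMPT_RT*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Run without arguments on a compute node, it returns success only when the running kernel reports itself as RT.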

How reproducible:
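Intermittent. The crash is not triggered on every run of the test case, but it has been reproduced several times (see above).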


Actual results:
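Kernel crash on the compute node: NULL pointer dereference in iavf_alloc_rx_buffers, reached through iavf_open and iavf_configure.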

Expected results:
No kernel crash should occur.


Additional info:
I will upload sos reports and kernel crash dumps.

