Bug 1640149 - [FFU] After fast forward upgrade on SRIOV hybrid setup, PF-port instance is unreachable
Summary: [FFU] After fast forward upgrade on SRIOV hybrid setup, PF-port instance is unreachable
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Rodolfo Alonso
QA Contact: Roee Agiman
URL:
Whiteboard:
Depends On:
Blocks: 1544752
 
Reported: 2018-10-17 12:34 UTC by Roee Agiman
Modified: 2019-04-29 13:46 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-29 13:46:11 UTC
Target Upstream Version:
Embargoed:


Attachments
Ping 10.0.1.14 (PF private IP) path (4.13 MB, image/jpeg)
2018-10-26 09:01 UTC, Rodolfo Alonso

Description Roee Agiman 2018-10-17 12:34:29 UTC
Description of problem:
Ran FFU on a hybrid SRIOV setup.
Had 3 instances: one with a PF port, one with a VF port and one with a normal port.
The process finished OK; the VF-port and normal-port instances are still reachable and functioning as expected. The PF-port instance lost its connectivity and remains unreachable even after rebooting the instance.

Version-Release number of selected component (if applicable):
FFU OSP10-OSP13

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an SRIOV hybrid setup.
2. Create 3 instances, one with each of the 3 port types (normal, VF and PF); see the CLI sketch after this list.
3. Run FFU, then try to ping/ssh the instances after the process finishes.
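
For reference, a minimal CLI sketch of step 2 (the network, image and flavor names are placeholders, not taken from this setup; the VF and normal instances are created the same way with --vnic-type direct and --vnic-type normal):

openstack port create --network <tenant-net> --vnic-type direct-physical pf-port
openstack server create --image <image> --flavor <flavor> \
    --nic port-id=$(openstack port show -f value -c id pf-port) vm-pf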

Actual results:
PF-port instance unreachable

Expected results:
All instances are reachable.

Additional info:
My setup is available for investigation for now; approach me for credentials.

Comment 1 Brian Haley 2018-10-18 13:52:01 UTC
Roee - can you give us info on logging into this system? Also, after FFU, if you spawn a new SRIOV instance, does it also fail to ping?
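
One way to run that post-FFU check (a sketch; names in angle brackets are placeholders, and the qdhcp namespace naming follows the pattern shown later in this bug): create a fresh PF port and instance, then ping its fixed IP from the tenant network's DHCP namespace on a controller:

openstack port create --network <tenant-net> --vnic-type direct-physical pf-port-new
openstack server create --image <image> --flavor <flavor> \
    --nic port-id=$(openstack port show -f value -c id pf-port-new) vm-pf-new
ip netns exec qdhcp-<network-id> ping -c 3 <fixed-ip-of-vm-pf-new>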

Comment 4 Rodolfo Alonso 2018-10-26 09:01:19 UTC
Created attachment 1497663 [details]
Ping 10.0.1.14 (PF private IP) path

Ping 10.0.1.14 (PF private IP) path

Comment 5 Rodolfo Alonso 2018-10-26 09:09:08 UTC
Hello Roee:

I have re-created the PF VM using your scripts. I added a password to "cloud-user" [1] in order to access the VM.

From the controller0 DHCP namespace [2] I have connectivity to the PF machine (net-64-1-pf). As you can see in [3], the network configuration is correct.

I've verified that the ARP messages are correctly flooded across all the compute node interfaces (computesriov-0/1, p1p1 and p1p2). When the PF interface is detached from the kernel and given to libvirt, the messages continue to reach the VM.
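
One way to observe that flooding (a sketch of the kind of check described above, not necessarily the exact commands used; the interface names are the ones mentioned in this comment) is to trigger a ping toward the PF IP and watch ARP on the compute-node uplinks:

tcpdump -nnei p1p1 arp and host 10.0.1.14
tcpdump -nnei p1p2 arp and host 10.0.1.14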

I've also tested pinging from the PF machine to the NORMAL machine (10.0.1.14 --> 10.0.1.7) [4].

[1] cloud-config file:
(overcloud) [stack@undercloud-0 ~]$ cat create_int.yaml
#cloud-config
write_files:
  - path: /etc/sysconfig/network-scripts/ifcfg-eth0.228
    owner: "root"
    permissions: "777"
    content: |
      DEVICE="eth0.228"
      BOOTPROTO="dhcp"
      ONBOOT="yes"
      VLAN="yes"
      PERSISTENT_DHCLIENT="yes"
runcmd:
  - [ sh, -c , "systemctl restart network" ]

users:
    - name: cloud-user
      lock-passwd: False
      plain_text_passwd: pass
      chpasswd: { expire: False }
      sudo: ALL=(ALL) NOPASSWD:ALL
      ssh_pwauth: True
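
For completeness, a file like this would normally be passed to the instance at boot via --user-data (a sketch, assuming that is what the creation script does; image, flavor and port are placeholders):

(overcloud) [stack@undercloud-0 ~]$ openstack server create --image <image> --flavor <flavor> \
    --nic port-id=<pf-port-uuid> --user-data create_int.yaml net-64-1-pf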


[2] ip netns exec qdhcp-9907fd65-8e8d-4a58-a7b4-7a6d82b08949 ping 10.0.1.14


[3] [cloud-user@net-64-1-pf ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a0:36:9f:7f:28:b8 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23a4:a236:9fff:fe7f:28b8/64 scope global noprefixroute dynamic 
       valid_lft 2591976sec preferred_lft 604776sec
    inet6 2001::a236:9fff:fe7f:28b8/64 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::a236:9fff:fe7f:28b8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: eth0.228@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a0:36:9f:7f:28:b8 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.14/24 brd 10.0.1.255 scope global noprefixroute dynamic eth0.228
       valid_lft 85290sec preferred_lft 85290sec
    inet6 2001::a236:9fff:fe7f:28b8/64 scope global mngtmpaddr noprefixroute dynamic 
       valid_lft 86395sec preferred_lft 14395sec
    inet6 fe80::a236:9fff:fe7f:28b8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
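
Since the instance tags its own traffic on eth0.228, it is also worth confirming that VLAN 228 matches the provider segmentation ID of the tenant network (a sketch; the network name is a placeholder, and the expectation is that it reports 228):

(overcloud) [stack@undercloud-0 ~]$ openstack network show <tenant-net> -f value -c provider:segmentation_id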


[4] [cloud-user@net-64-1-pf ~]$ ping 10.0.1.7
PING 10.0.1.7 (10.0.1.7) 56(84) bytes of data.
64 bytes from 10.0.1.7: icmp_seq=1 ttl=64 time=0.339 ms
64 bytes from 10.0.1.7: icmp_seq=2 ttl=64 time=0.363 ms

Comment 6 Rodolfo Alonso 2018-10-26 09:10:56 UTC
Hello Roee:

I have re-created the PF VM using your scripts. I added a password to "cloud-user" [1] in order to access the VM.

From the controller0 DHCP namespace [2] I have connectivity to the PF machine (net-64-1-pf). As you can see in [3], the network configuration is correct. Please check the attached file to see the ping path.

I've verified that the ARP messages are correctly flooded across all the compute node interfaces (computesriov-0/1, p1p1 and p1p2). When the PF interface is detached from the kernel and given to libvirt, the messages continue to reach the VM.

I've also tested pinging from the PF machine to the NORMAL machine (10.0.1.14 --> 10.0.1.7) [4].

Can you check it again and give me feedback?

[1]-[4]: same cloud-config file, DHCP-namespace ping, ip a output and ping results as in comment 5.

Comment 13 Nate Johnston 2019-03-07 16:26:04 UTC
Roee,

If you haven't hit this issue in the 4 months since your last comment, can we conclude that the issue is resolved?  If you hit it again then we can always reopen the BZ.

Nate

Comment 15 Nate Johnston 2019-04-29 13:46:11 UTC
Closing since we can't get a reproducer.  Thanks!

