Bug 1640149

Summary: [FFU] After fast forward upgrade on SRIOV hybrid setup, PF-port instance is unreachable
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 10.0 (Newton)
Type: Bug
Severity: urgent
Priority: unspecified
Status: CLOSED INSUFFICIENT_DATA
Reporter: Roee Agiman <ragiman>
Assignee: Rodolfo Alonso <ralonsoh>
QA Contact: Roee Agiman <ragiman>
CC: amuller, augol, bhaley, chrisw, njohnston, ragiman, ralonsoh
Bug Blocks: 1544752
Last Closed: 2019-04-29 13:46:11 UTC

Attachments:
Ping 10.0.1.14 (PF private IP) path

Description Roee Agiman 2018-10-17 12:34:29 UTC
Description of problem:
Ran a fast forward upgrade (FFU) on an SRIOV hybrid setup.
Had 3 instances: one with a PF port, one with a VF port and one with a normal port.
The process finished OK; the VF-port and normal-port instances are still reachable and functioning as expected. The PF-port instance lost its connectivity, and even after rebooting the instance it is still unreachable.

Version-Release number of selected component (if applicable):
FFU OSP10-OSP13

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an SRIOV hybrid setup.
2. Create 3 instances, one with each of the 3 port types (PF, VF and normal); see the sketch after this list.
3. Run the FFU, then try to ping/ssh the instances after the process finishes.
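
(For reference, a minimal sketch of step 2 with the OpenStack CLI, assuming a network named "private" and placeholder image/flavor names; the vNIC types map to PF = direct-physical, VF = direct, and a normal OVS port = normal:)

# Create one port of each type and capture the port IDs.
PF_PORT=$(openstack port create --network private --vnic-type direct-physical pf-port -f value -c id)
VF_PORT=$(openstack port create --network private --vnic-type direct vf-port -f value -c id)
NORM_PORT=$(openstack port create --network private --vnic-type normal normal-port -f value -c id)

# Boot one instance on each port.
openstack server create --image rhel-guest --flavor m1.medium --nic port-id=$PF_PORT vm-pf
openstack server create --image rhel-guest --flavor m1.medium --nic port-id=$VF_PORT vm-vf
openstack server create --image rhel-guest --flavor m1.medium --nic port-id=$NORM_PORT vm-normal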

Actual results:
PF-port instance unreachable

Expected results:
All instances are reachable.

Additional info:
My setup is available for investigation for now; reach out to me for credentials.

Comment 1 Brian Haley 2018-10-18 13:52:01 UTC
Roee - can you give us info on logging into this system? Also, after the FFU, if you spawn a new SRIOV instance, does it also fail to ping?

Comment 4 Rodolfo Alonso 2018-10-26 09:01:19 UTC
Created attachment 1497663 [details]
Ping 10.0.1.14 (PF private IP) path

Comment 5 Rodolfo Alonso 2018-10-26 09:09:08 UTC
Hello Roee:

I have re-created the PF VM using your scripts. I added a password to "cloud-user" [1] in order to access the VM.

From the controller0 DHCP network namespace [2] I have connectivity to the PF machine (net-64-1-pf). As you can see in [3], the network configuration is correct.

I've verified that the ARP messages are correctly flooded across all the compute node interfaces (computesriov-0/1, p1p1 and p1p2). When the PF interface is detached from the kernel and handed to libvirt, the messages continue flowing inside the VM.
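
(One way to verify this sort of ARP flooding is with tcpdump on each compute-node interface; a minimal sketch, with the interface names from above, run while pinging 10.0.1.14 from the controller's DHCP namespace:)

# ARP requests for the PF instance's IP should show up on both physical interfaces.
tcpdump -nne -i p1p1 arp and host 10.0.1.14
tcpdump -nne -i p1p2 arp and host 10.0.1.14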

I've also tested pinging from the PF machine to the NORMAL machine (10.0.1.14 --> 10.0.1.7) [4].

[1] cloud-config file:
(overcloud) [stack@undercloud-0 ~]$ cat create_int.yaml
#cloud-config
write_files:
  - path: /etc/sysconfig/network-scripts/ifcfg-eth0.228
    owner: "root"
    permissions: "777"
    content: |
      DEVICE="eth0.228"
      BOOTPROTO="dhcp"
      ONBOOT="yes"
      VLAN="yes"
      PERSISTENT_DHCLIENT="yes"
runcmd:
  - [ sh, -c , "systemctl restart network" ]

users:
    - name: cloud-user
      lock-passwd: False
      plain_text_passwd: pass
      chpasswd: { expire: False }
      sudo: ALL=(ALL) NOPASSWD:ALL
      ssh_pwauth: True
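
(For context: a cloud-config file like this is applied by cloud-init at boot when passed as user data; a minimal usage sketch, with placeholder image/flavor names and a hypothetical PF port ID:)

openstack server create --image rhel-guest --flavor m1.medium \
    --user-data create_int.yaml --nic port-id=<pf-port-id> net-64-1-pf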


[2] ip netns exec qdhcp-9907fd65-8e8d-4a58-a7b4-7a6d82b08949 ping 10.0.1.14


[3] [cloud-user@net-64-1-pf ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a0:36:9f:7f:28:b8 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23a4:a236:9fff:fe7f:28b8/64 scope global noprefixroute dynamic 
       valid_lft 2591976sec preferred_lft 604776sec
    inet6 2001::a236:9fff:fe7f:28b8/64 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::a236:9fff:fe7f:28b8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: eth0.228@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a0:36:9f:7f:28:b8 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.14/24 brd 10.0.1.255 scope global noprefixroute dynamic eth0.228
       valid_lft 85290sec preferred_lft 85290sec
    inet6 2001::a236:9fff:fe7f:28b8/64 scope global mngtmpaddr noprefixroute dynamic 
       valid_lft 86395sec preferred_lft 14395sec
    inet6 fe80::a236:9fff:fe7f:28b8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


[4] [cloud-user@net-64-1-pf ~]$ ping 10.0.1.7
PING 10.0.1.7 (10.0.1.7) 56(84) bytes of data.
64 bytes from 10.0.1.7: icmp_seq=1 ttl=64 time=0.339 ms
64 bytes from 10.0.1.7: icmp_seq=2 ttl=64 time=0.363 ms

Comment 6 Rodolfo Alonso 2018-10-26 09:10:56 UTC
Hello Roee:

I have re-created the PF VM using your scripts. I added a password to "cloud-user" [1] in order to access the VM.

From the controller0 DHCP network namespace [2] I have connectivity to the PF machine (net-64-1-pf). As you can see in [3], the network configuration is correct. Please check the attached file to see the ping path.

I've verified that the ARP messages are correctly flooded across all the compute node interfaces (computesriov-0/1, p1p1 and p1p2). When the PF interface is detached from the kernel and handed to libvirt, the messages continue flowing inside the VM.

I've also tested pinging from the PF machine to the NORMAL machine (10.0.1.14 --> 10.0.1.7) [4].

Can you check it again? Please give me feedback.

[1]-[4]: identical to the references in comment 5 above.

Comment 13 Nate Johnston 2019-03-07 16:26:04 UTC
Roee,

If you haven't hit this issue in the 4 months since your last comment, can we conclude that the issue is resolved?  If you hit it again then we can always reopen the BZ.

Nate

Comment 15 Nate Johnston 2019-04-29 13:46:11 UTC
Closing since we can't get a reproducer.  Thanks!