Description of problem:
Due to an OVN bug [1], OCP installations with Kuryr fail because some pods are unable to start: subports incorrectly stay in DOWN status after they are plugged into a trunk port. Normally, detaching the subport from the trunk and attaching it again helps, and Kuryr should be able to apply that workaround when needed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1937851

Version-Release number of selected component (if applicable):

How reproducible:
~50% of cases?

Steps to Reproduce:
1. Run OCP installation with Kuryr on OSP 16.1 with OVN.

Actual results:
Some pods randomly fail to start and stay stuck in ContainerCreating. kuryr-controller keeps getting restarted. The port associated with the problematic pod is in DOWN status.

Expected results:
Everything works smoothly and all subports associated with the pods are in ACTIVE state.
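The detach/reattach workaround described above can be sketched roughly as follows. This is only an illustration, not the actual Kuryr patch: the function name reattach_if_stuck, the 120-second timeout and the FakeNetwork stand-in are all hypothetical. The network argument is assumed to behave like an openstacksdk network proxy, using get_port, delete_trunk_subports and add_trunk_subports.

```python
import time
from types import SimpleNamespace


def reattach_if_stuck(network, trunk_id, port_id, vlan_id, created_at,
                      timeout=120, now=None):
    """Detach and re-attach a subport that has stayed DOWN for too long.

    `timeout` is a hypothetical threshold, not a value taken from Kuryr.
    Returns True if a reattach was attempted, False otherwise.
    """
    now = time.time() if now is None else now
    port = network.get_port(port_id)
    if port.status == 'ACTIVE':
        return False  # nothing to fix
    if now - created_at < timeout:
        return False  # give Neutron more time before intervening

    subport = {
        'port_id': port_id,
        'segmentation_type': 'vlan',
        'segmentation_id': vlan_id,
    }
    # Remove the subport from the trunk, then add it back with the same
    # VLAN ID so the pod's network configuration stays valid.
    network.delete_trunk_subports(trunk_id, [{'port_id': port_id}])
    network.add_trunk_subports(trunk_id, [subport])
    return True


class FakeNetwork:
    """In-memory stand-in for the network client, for demonstration only."""

    def __init__(self, status):
        self.status = status
        self.calls = []

    def get_port(self, port_id):
        return SimpleNamespace(id=port_id, status=self.status)

    def delete_trunk_subports(self, trunk_id, subports):
        self.calls.append(('delete', trunk_id, subports))

    def add_trunk_subports(self, trunk_id, subports):
        self.calls.append(('add', trunk_id, subports))


if __name__ == '__main__':
    net = FakeNetwork('DOWN')
    attempted = reattach_if_stuck(net, 'trunk-1', 'port-1', 1682,
                                  created_at=0, now=137)
    print(attempted, [call[0] for call in net.calls])
```

The function only reattaches and does not wait for the port to become ACTIVE; the caller is expected to recheck the port state on a later retry.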
I'm marking this as a blocker because it affects many Kuryr installations with OSP 16.1 and OVN. The code of the workaround seems to be done.
Verified on 4.8.0-0.nightly-2021-06-16-190035

Saw in kuryr controller log:

2021-06-17 14:33:47.964 1 WARNING kuryr_kubernetes.controller.drivers.nested_vlan_vif [-] Subport ed214b6b-cad7-4be0-a3c6-df7324e00317 is in DOWN status for more than 137 seconds. This is a Neutron issue. Attempting to reattach the subport to trunk 77da3289-5191-4504-9bb3-cb7390fcc50e using VLAN ID 1682 to fix it.: kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: VIFVlanNested(active=False,address=fa:16:3e:87:89:cf,has_traffic_filtering=False,id=ed214b6b-cad7-4be0-a3c6-df7324e00317,network=Network(f18aae10-8e51-498d-a01c-02daa71e4f84),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='taped214b6b-ca',vlan_id=1682)
2021-06-17 14:33:49.167 1 WARNING kuryr_kubernetes.controller.drivers.nested_vlan_vif [-] Reattached subport ed214b6b-cad7-4be0-a3c6-df7324e00317, its state will be rechecked when event will be retried.: kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: VIFVlanNested(active=False,address=fa:16:3e:87:89:cf,has_traffic_filtering=False,id=ed214b6b-cad7-4be0-a3c6-df7324e00317,network=Network(f18aae10-8e51-498d-a01c-02daa71e4f84),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='taped214b6b-ca',vlan_id=1682)

Installation finished successfully and port is active

$ openstack port list |grep ed214b6b-cad7-4be0-a3c6-df7324e00317
| ed214b6b-cad7-4be0-a3c6-df7324e00317 | | fa:16:3e:87:89:cf | ip_address='10.128.85.86', subnet_id='2c4ed61f-29db-45f5-aa25-c3804ba68884' | ACTIVE |
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438