Bug 1950590

Summary: CNO: Too many OVN netFlows collectors causes ovnkube pods CrashLoopBackOff
Product: OpenShift Container Platform Reporter: Ross Brattain <rbrattai>
Component: Networking Assignee: Aniket Bhat <anbhat>
Networking sub component: ovn-kubernetes QA Contact: Ross Brattain <rbrattai>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aconstan, memodi, ricarril
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-07-27 23:01:42 UTC Type: Bug
Attachments:
Description Flags
bad-cno-8.yaml none

Description Ross Brattain 2021-04-17 03:51:42 UTC
Description of problem:


The exportNetworkFlows collector API spec does not specify maxItems.

Too many collectors cause the ovnkube pods to enter CrashLoopBackOff.

The container dies with
standard_init_linux.go:219: exec user process caused: argument list too long

due to
       E2BIG           Argument list too long (POSIX.1-2001).

presumably due to ovnkube command line bytes approaching `getconf ARG_MAX` bytes.

The limit seems to be less than 2 MB.

Greater than 2MB in the YAML causes etcd to reject the change with

error: networks.operator.openshift.io "cluster" could not be patched: etcdserver: request is too large
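
The ARG_MAX hypothesis above can be roughly checked by estimating how many bytes a collector list occupies once it is joined into a single command-line argument. This is a minimal sketch, not what ovnkube or CNO actually do: it assumes the collectors are comma-joined behind `--netflow-targets=` and passed as one argument string, and it assumes 4 KiB pages. The helper names (`arg_max`, `targets_arg_size`, `check`) are hypothetical. Besides the total ARG_MAX budget, Linux also caps each individual argument string at 32 pages (MAX_ARG_STRLEN, see execve(2)), so both limits are reported.

```python
import os

# Linux caps each single argv/envp string at 32 pages (MAX_ARG_STRLEN, see
# execve(2)). A 4 KiB page size is an assumption here.
MAX_ARG_STRLEN = 32 * 4096


def arg_max(default=2 * 1024 * 1024):
    """Return the system ARG_MAX, or a 2 MiB fallback where it is unavailable."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return default


def targets_arg_size(collectors, flag="--netflow-targets="):
    """Byte length of one comma-joined flag argument, including the trailing NUL.

    Assumption: the collectors end up comma-joined in a single argument.
    """
    return len((flag + ",".join(collectors)).encode()) + 1


def check(collectors):
    """Compare the estimated argument size against both limits."""
    size = targets_arg_size(collectors)
    return {
        "bytes": size,
        "fits_single_string": size <= MAX_ARG_STRLEN,
        "fits_arg_max": size < arg_max(),
    }


if __name__ == "__main__":
    # 51200 entries of the same length as the first reproducer entry.
    print(check(["10.1.1.1:2056"] * 51200))
```

Under these assumptions, 51200 entries come to roughly 700 KB, which is below 2 MB but above the 128 KiB per-string cap, while 5697 entries stay under 128 KiB. That would be consistent with the observations in this report, but the report does not confirm which limit was actually hit.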




Version-Release number of selected component (if applicable):

4.8.0-0.nightly-2021-04-16-032542

How reproducible:

Always

Steps to Reproduce:
1. generate ~51200 IP address and port pairs as a YAML list

          - 10.1.1.1:2056
          - 10.1.1.2:2056
          - 10.1.1.3:2056

2. oc edit network.operator
3. paste the list into the collectors list


  exportNetworkFlows:
    netFlow:
      collectors:
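
Step 1 can be scripted. The sketch below prints about 51200 `ip:port` entries as YAML list items with the same indentation as the sample above. The report does not say how the addresses were generated, so counting upward from 10.1.1.1 is an assumption, and the function name `collector_lines` is hypothetical. The sequence includes addresses ending in .0 and .255, which should not matter for a size test.

```python
import ipaddress


def collector_lines(count=51200, start="10.1.1.1", port=2056, indent=10):
    """Yield YAML list items such as '          - 10.1.1.1:2056'.

    Addresses simply count upward from `start` (an assumed scheme).
    """
    first = ipaddress.IPv4Address(start)
    pad = " " * indent
    for offset in range(count):
        yield f"{pad}- {first + offset}:{port}"


if __name__ == "__main__":
    # Redirect stdout to a file, then paste it under `collectors:`.
    print("\n".join(collector_lines()))
```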



Actual results:

ovnkube-node-hmg9w     3/4     CrashLoopBackOff   8

standard_init_linux.go:219: exec user process caused: argument list too long

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                 AGE
ovnkube-node     6         6         5       1            5           beta.kubernetes.io/os=linux                                   13h

Expected results:

oc edit schema validation fails using maxItems limit.


Additional info:


5697 IP:port pairs seemed to work. I did not try to measure the performance impact.

All user inputs must have a specified maximum.

Comment 1 Ross Brattain 2021-04-17 03:54:53 UTC
Created attachment 1772708 [details]
bad-cno-8.yaml

CNO YAML that fails

Comment 2 Ross Brattain 2021-04-17 04:38:35 UTC
We need to check maxItems in the API schema, in CNO renderOVNKubernetes, and in ovn-kubernetes ParseFlowCollectors, MonitoringFlags, and setupOVNNode.

We could exceed the command-line limits of the ovnkube `--netflow-targets=` argument as well as the ovs-vsctl `targets=[%s]` argument.

Comment 4 Ricardo Carrillo Cruz 2021-04-21 11:16:45 UTC
https://github.com/openshift/cluster-network-operator/pull/1068 for bumping openshift/api

Comment 6 Ross Brattain 2021-04-27 18:04:18 UTC
Verified on 4.8.0-0.nightly-2021-04-25-231500

Schema error is triggered.

# networks.operator.openshift.io "cluster" was not valid:
# * spec.exportNetworkFlows.netFlow.collectors: Invalid value: 47: spec.exportNetworkFlows.netFlow.collectors in body should have at most 10 items

The oc explain output is updated:

$ oc explain --api-version=operator.openshift.io/v1 networks.spec.exportNetworkFlows.netFlow
KIND:     Network
VERSION:  operator.openshift.io/v1

RESOURCE: netFlow <Object>

DESCRIPTION:
     netFlow defines the NetFlow configuration.

FIELDS:
   collectors   <[]string>
     netFlow defines the NetFlow collectors that will consume the flow data
     exported from OVS. It is a list of strings formatted as ip:port with a
     maximum of ten items
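
The documented constraint (at most ten entries, each formatted as ip:port) can also be checked before a patch is submitted. The following is a minimal client-side sketch, not the API server's CRD validation and not code from CNO or ovn-kubernetes. The function name `validate_collectors` and its error messages are hypothetical, and it only accepts IPv4 addresses, which is an assumption about what the field is meant to hold.

```python
import ipaddress

MAX_COLLECTORS = 10  # limit reported by the schema error above


def validate_collectors(collectors, max_items=MAX_COLLECTORS):
    """Raise ValueError if the list is too long or an entry is not ip:port."""
    if len(collectors) > max_items:
        raise ValueError(
            f"collectors should have at most {max_items} items, got {len(collectors)}"
        )
    for entry in collectors:
        host, sep, port = entry.rpartition(":")
        if not sep:
            raise ValueError(f"{entry!r} is missing a ':port' suffix")
        # AddressValueError is a subclass of ValueError, so it propagates as such.
        ipaddress.IPv4Address(host)
        if not (port.isascii() and port.isdigit() and 1 <= int(port) <= 65535):
            raise ValueError(f"{entry!r} has an invalid port")
    return collectors


if __name__ == "__main__":
    validate_collectors(["10.1.1.1:2056", "10.1.1.2:2056"])
    try:
        validate_collectors(["10.1.1.1:2056"] * 47)
    except ValueError as err:
        print(err)
```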

Comment 10 errata-xmlrpc 2021-07-27 23:01:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438