Bug 1950590 - CNO: Too many OVN netFlows collectors causes ovnkube pods CrashLoopBackOff
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Aniket Bhat
QA Contact: Ross Brattain
Depends On:
Reported: 2021-04-17 03:51 UTC by Ross Brattain
Modified: 2022-01-11 16:47 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-07-27 23:01:42 UTC
Target Upstream Version:

Attachments
bad-cno-8.yaml (43.63 KB, application/gzip)
2021-04-17 03:54 UTC, Ross Brattain

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift api pull 900 | 0 | None | open | Bug 1950590: Cap netflow/sflow/ipfix collectors to 10 | 2021-04-18 17:01:04 UTC
Github | openshift cluster-network-operator pull 1068 | 0 | None | open | Bug 1950590: Bump openshift/api and update-codegen for netflow maxitems | 2021-04-22 17:10:11 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:01:58 UTC

Description Ross Brattain 2021-04-17 03:51:42 UTC
Description of problem:

The exportNetworkFlows collector API spec does not specify maxItems.

Too many collectors cause the ovnkube pods to go into CrashLoopBackOff.

The container dies with
standard_init_linux.go:219: exec user process caused: argument list too long

due to
       E2BIG           Argument list too long (POSIX.1-2001).

presumably because the ovnkube command line approaches `getconf ARG_MAX` bytes.

The limit seems to be less than 2 MB.

More than 2 MB of YAML causes etcd to reject the change with

error: networks.operator.openshift.io "cluster" could not be patched: etcdserver: request is too large
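
A rough way to see how quickly the collector list outgrows the exec limits is sketched below. It assumes the collectors end up comma-joined in a single `--netflow-targets=` argument; the helper names and the synthetic 10.x.x.x addresses are made up for illustration. Besides ARG_MAX, which bounds the whole argument list plus environment, Linux also caps any single argument string at MAX_ARG_STRLEN (32 pages, 131072 bytes with 4 KiB pages). The report does not confirm which of the two limits was actually hit.

```python
import os

# Linux caps any single argv string at MAX_ARG_STRLEN, 32 pages
# (131072 bytes with 4 KiB pages); exceeding it makes execve fail with E2BIG.
MAX_ARG_STRLEN = 32 * 4096


def make_collectors(count, port=2056):
    """Return `count` synthetic ip:port strings in 10.0.0.0/8 (made-up test data)."""
    return [
        "10.%d.%d.%d:%d" % ((i >> 16) & 255, (i >> 8) & 255, i & 255, port)
        for i in range(count)
    ]


def targets_arg_bytes(collectors):
    """Byte length of one --netflow-targets= argument, assuming a comma-joined list."""
    return len(("--netflow-targets=" + ",".join(collectors)).encode("utf-8"))


def arg_max():
    """Value that `getconf ARG_MAX` prints, or None if it cannot be queried."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return None


print("ARG_MAX:", arg_max())
for count in (10, 5697, 51200):
    size = targets_arg_bytes(make_collectors(count))
    print(count, "collectors ->", size, "bytes; over MAX_ARG_STRLEN:", size > MAX_ARG_STRLEN)
```

With these synthetic addresses, 5697 entries stay below 131072 bytes and 51200 entries exceed it, which would be consistent with the observations in this report, but real collector strings may be longer or shorter.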

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Generate ~51200 IP address and port pairs as a YAML list.
2. oc edit network.operator
3. Paste the list into the collectors list.
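
Step 1 can be scripted. The following is only a sketch of one way to produce such a list, not the script used for this report; the function name, base address, port, and indentation are arbitrary assumptions and should be adjusted to match the YAML being edited.

```python
import ipaddress


def collector_yaml_lines(count, base="10.0.0.0", port=2056, indent="      "):
    """Return `count` YAML list items of the form '- "ip:port"' (synthetic data)."""
    start = int(ipaddress.IPv4Address(base))
    return [
        '%s- "%s:%d"' % (indent, ipaddress.IPv4Address(start + i), port)
        for i in range(count)
    ]


lines = collector_yaml_lines(51200)
# Show only the first few entries here; write all of `lines` to a file
# to get something that can be pasted into the editor.
print("\n".join(lines[:3]))
print("# ... %d entries in total" % len(lines))
```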


Actual results:

ovnkube-node-hmg9w     3/4     CrashLoopBackOff   8

standard_init_linux.go:219: exec user process caused: argument list too long

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                 AGE
ovnkube-node     6         6         5       1            5           beta.kubernetes.io/os=linux                                   13h

Expected results:

oc edit fails schema validation because the collectors list exceeds a maxItems limit.

Additional info:

5697 IP:port pairs seemed to work. I did not try to measure the performance impact.

All user inputs must have a specified maximum.

Comment 1 Ross Brattain 2021-04-17 03:54:53 UTC
Created attachment 1772708

CNO YAML that fails

Comment 2 Ross Brattain 2021-04-17 04:38:35 UTC
We need to check maxItems in the API schema, in CNO renderOVNKubernetes, and in ovn-kubernetes ParseFlowCollectors, MonitoringFlags, and setupOVNNode.

We could exceed the command line limits for the ovnkube `--netflow-targets=` argument as well as for the ovs-vsctl `targets=[%s]` argument.
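
The kind of defensive check meant here could look roughly like the sketch below. It is illustrative only and is not the code in CNO or ovn-kubernetes; `validate_collectors` is a hypothetical helper, and the cap of 10 is taken from the title of the linked openshift/api pull request.

```python
import ipaddress

MAX_COLLECTORS = 10  # cap named in the linked openshift/api pull request title


def validate_collectors(collectors, max_items=MAX_COLLECTORS):
    """Return a list of error strings; an empty list means the input passed."""
    errors = []
    if len(collectors) > max_items:
        errors.append(
            "should have at most %d items, got %d" % (max_items, len(collectors))
        )
    for entry in collectors:
        # Split on the last colon so the host part may itself contain colons.
        host, sep, port = entry.rpartition(":")
        try:
            if not sep:
                raise ValueError("missing port")
            ipaddress.ip_address(host.strip("[]"))  # raises ValueError if not an IP
            if not port.isdigit() or not 0 < int(port) <= 65535:
                raise ValueError("bad port")
        except ValueError:
            errors.append("invalid ip:port entry: %r" % entry)
    return errors


print(validate_collectors(["10.0.0.1:2056"]))
print(validate_collectors(["10.0.0.%d:2056" % i for i in range(47)]))
```

Running the same check at every layer that builds a command line keeps an oversized list from reaching exec even if the schema limit is bypassed.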

Comment 4 Ricardo Carrillo Cruz 2021-04-21 11:16:45 UTC
https://github.com/openshift/cluster-network-operator/pull/1068 for bumping openshift/api

Comment 6 Ross Brattain 2021-04-27 18:04:18 UTC
Verified on 4.8.0-0.nightly-2021-04-25-231500

The schema error is triggered:

# networks.operator.openshift.io "cluster" was not valid:
# * spec.exportNetworkFlows.netFlow.collectors: Invalid value: 47: spec.exportNetworkFlows.netFlow.collectors in body should have at most 10 items

oc explain is updated:

$ oc explain --api-version=operator.openshift.io/v1 networks.spec.exportNetworkFlows.netFlow
KIND:     Network
VERSION:  operator.openshift.io/v1

RESOURCE: netFlow <Object>

     netFlow defines the NetFlow configuration.

   collectors   <[]string>
     netFlow defines the NetFlow collectors that will consume the flow data
     exported from OVS. It is a list of strings formatted as ip:port with a
     maximum of ten items

Comment 10 errata-xmlrpc 2021-07-27 23:01:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

