Bug 1793108 - ovn-master ran out of space on system
Summary: ovn-master ran out of space on system
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.4.0
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1820106
Blocks: 1821166
TreeView+ depends on / blocked
 
Reported: 2020-01-20 17:16 UTC by Joe Talerico
Modified: 2020-05-04 11:25 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1820106 1821166 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:25:32 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 566 None closed Bug 1793108: [Back-port] Remove logfile from ovnkube configmap 2020-09-15 18:40:50 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-04 11:25:55 UTC

Description Joe Talerico 2020-01-20 17:16:54 UTC
Description of problem:
The problem manifested itself by not allowing our automation to deploy our tooling:

  Warning  FailedCreatePodSandBox  4m39s (x572 over 4h35m)  kubelet, ip-10-0-144-180.us-west-2.compute.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_scale-ci-mastervertical-mpzjn_scale-ci-tooling_7fd0a212-a03a-49c6-99df-cda39288db59_0(8e03feff97ea1b99584a86f47b8bcda2def64de8917961c85e0b043bed861427): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition


Which led us to look at the ovn-master pod, which had screens of `not able to log to /dev/stdout`

In the AWS console, the node had a warning that AWS Checks couldn't find the state of the node. I rebooted the node, and it came back up -- previously, it was unreachable even via SSH or debug node.


Version-Release number of selected component (if applicable):
OCP4.4

How reproducible:
N/A

Comment 1 Alexander Constantinescu 2020-02-26 13:11:37 UTC
Hi Joe

Wondering if you've encountered the issue again during your scale tests? 

I have not been able to reproduce the /dev/stdout issue, but I assume I am simply not generating enough logs to trigger the problem. 

/Alex

Comment 2 Joe Talerico 2020-02-26 18:07:19 UTC
Hey Alex - I have not seen this recently, we can close for now.

Comment 4 agopi 2020-03-26 18:35:22 UTC
Hi Alexander, 

I notice the same error with OCP 4.3.5.

Comment 5 Joe Talerico 2020-03-26 18:50:04 UTC
As Aakarsh mentioned, we have seen this again :

root@ip-172-31-12-199: ~/ovs-policy # oc rsh -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-sb8rp                                                                                                                                               
sh-4.2# df -h
Filesystem                            Size  Used Avail Use% Mounted on
overlay                               220G   18G  203G   8% /
tmpfs                                  64M   64M     0 100% /dev
shm                                    64M     0   64M   0% /dev/shm
tmpfs                                  63G   12M   63G   1% /run/ovn
/dev/mapper/coreos-luks-root-nocrypt  220G   18G  203G   8% /env
tmpfs                                  63G  8.0K   63G   1% /ovn-cert
tmpfs                                  63G   24K   63G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                  63G     0   63G   0% /proc/acpi
tmpfs                                  63G     0   63G   0% /proc/scsi
tmpfs                                  63G     0   63G   0% /sys/firmware

Where /dev has

sh-4.2# cd /dev
sh-4.2# ls
core  fd  full  mqueue  null  ptmx  pts  random  shm  stderr  stdin  stdout  stdout-2020-03-19T18-03-52.057  stdout-2020-03-19T18-03-52.057.gz  termination-log  tty  urandom  zero
sh-4.2# ls -talrh
total 64M
drwxrwxrwt. 2 root root   40 Mar 18 16:00 mqueue
drwxrwxrwt. 2 root root   40 Mar 18 16:00 shm
-rw-rw-rw-. 1 root root    0 Mar 18 16:00 termination-log
drwxr-xr-x. 2 root root    0 Mar 18 16:00 pts
drwxr-xr-x. 1 root root   92 Mar 18 16:00 ..
crw-rw-rw-. 1 root root 1, 5 Mar 18 16:00 zero
crw-rw-rw-. 1 root root 1, 9 Mar 18 16:00 urandom
crw-rw-rw-. 1 root root 5, 0 Mar 18 16:00 tty
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stdout-2020-03-19T18-03-52.057 -> /proc/self/fd/1
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stdin -> /proc/self/fd/0
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stderr -> /proc/self/fd/2
crw-rw-rw-. 1 root root 1, 8 Mar 18 16:00 random
lrwxrwxrwx. 1 root root    8 Mar 18 16:00 ptmx -> pts/ptmx
crw-rw-rw-. 1 root root 1, 3 Mar 18 16:00 null
crw-rw-rw-. 1 root root 1, 7 Mar 18 16:00 full
lrwxrwxrwx. 1 root root   13 Mar 18 16:00 fd -> /proc/self/fd
lrwxrwxrwx. 1 root root   11 Mar 18 16:00 core -> /proc/kcore
-rw-------. 1 root root    0 Mar 19 18:03 stdout-2020-03-19T18-03-52.057.gz
drwxr-xr-x. 5 root root  400 Mar 19 18:03 .
-rw-------. 1 root root  64M Mar 26 18:44 stdout


Which the stdout is full of :
time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port nodevertical0_nodevert-pod-2670 dynamic_addresses addresses"
time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): stdout: \"[]\\n[dynamic]\\n\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): stderr: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port nodevertical0_nodevert-pod-2670 dynamic_addresses addresses"
time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): stdout: \"\\\"76:1e:ab:82:06:89 10.130.6.136\\\"\\n[dynamic]\\n\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): stderr: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): /usr/bin/ovn-nbctl --timeout=15 lsp-set-port-security nodevertical0_nodevert-pod-2670 76:1e:ab:82:06:89 10.130.6.136/23"
time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): stdout: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): stderr: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="setAddressSet for a9236571437960368105 with [10.129.12.84 10.130.10.104 10.128.6.125 10.131.0.51 10.131.12.38 10.128.8.62 10.131.12.70 10.130.10.71 10.130.6.111 10.131.8.126 10.129.6.35 10.130.4.92 10.131.8.93
 10.131.6.116 10.130.4.128 10.129.8.55 10.128.12.62 10.128.12.114 10.130.8.128 10.130.10.57 10.129.6.29 10.131.14.35 10.129.8.99 10.128.8.128 10.131.4.135 10.130.8.9 10.128.12.24 10.130.10.78 10.131.10.96 10.128.8.99 10.130.10.24 10.128.6.51 10.129.2.58 
10.129.12.74 10.130.4.86 10.128.6.93 10.128.6.138 10.129.2.49 10.131.6.93 10.130.8.66 10.130.10.45 10.128.6.56 10.131.10.67 10.128.6.92 10.128.10.121 10.129.12.124 10.128.6.16 10.131.14.36 10.131.6.60 10.131.12.74 10.128.6.78 10.130.6.80 10.128.2.109 10.
130.10.129 10.130.10.20 10.129.8.133 10.129.12.140 10.130.6.19 10.129.8.63 10.128.12.69 10.131.8.91 10.128.8.94 10.130.10.105 10.130.4.102 10.128.2.13 10.130.8.22 10.131.10.29 10.131.14.34 10.131.14.43 10.128.10.75 10.130.8.120 10.131.4.131 10.128.12.19 
10.128.8.59 10.131.6.71 10.128.10.91 10.131.4.90 10.129.8.123 10.128.6.54 10.131.4.91 10.131.0.101 10.129.6.124 10.128.12.31 10.131.12.45 10.131.0.120 10.131.4.121 10.131.0.18 10.131.14.90 10.128.12.111 10.131.14.129 10.128.6.140 10.129.6.33 10.131.10.59
 10.130.4.75 10.130.10.97 10.131.6.118 10.131.14.133 10.129.8.21 10.131.4.19 10.128.8.46 10.130.8.84 10.128.8.17 10.130.8.114 10.129.12.128 10.131.14.97 10.131.6.45 10.130.10.74 10.128.12.80 10.129.8.95 10.131.6.120 10.131.10.120 10.130.10.12 10.131.4.96
 10.128.10.107 10.128.6.143 10.129.12.66 10.130.4.97 10.131.14.134 10.128.2.128 10.131.14.54 10.129.2.44 10.129.2.55 10.130.10.73 10.131.8.83 10.131.4.104 10.128.10.37 10.129.12.42 10.128.10.53 10.128.2.50 10.128.10.59 10.128.2.63 10.131.14.111 10.129.12
.37 10.131.12.41 10.129.8.58 10.131.10.88 10.131.0.124 10.130.4.134 10.131.10.16 10.131.12.80 10.128.10.77 10.128.6.86 10.130.8.111 10.131.0.125 10.131.14.128 10.130.4.149 10.131.12.67 10.128.6.142 10.130.4.47 10.131.12.68 10.129.2.101 10.128.2.121 10.12
8.6.128 10.130.4.27 10.130.6.55 10.130.4.98 10.130.6.30 10.128.10.84 10.129.2.91 10.128.10.132 10.131.12.134 10.130.8.139 10.131.0.23 10.130.10.50 10.129.2.24 10.128.8.38 10.131.8.70 10.131.12.77 10.131.10.122 10.129.8.127 10.131.6.138 10.130.8.26 10.128
.2.22 10.129.6.58 10.130.4.95 10.129.6.101 10.129.8.12 10.129.8.40 10.129.2.73 10.130.10.95 10.128.8.103 10.131.14.137 10.130.6.25 10.130.8.50 10.128.6.64 10.130.6.105 10.130.6.110 10.130.10.124 10.131.4.124 10.128.8.32 10.128.2.52 10.130.6.70 10.131.0.2
1 10.128.6.48 10.128.6.73 10.130.6.76 10.130.6.87 10.128.2.82 10.128.8.106 10.128.8.10 10.130.8.59 10.130.8.95 10.129.12.102 10.130.8.121 10.129.8.136 10.128.2.20 10.130.10.47 10.129.6.80 10.129.2.94 10.128.2.117 10.130.10.7 10.130.8.25 10.130.4.37 10.12
9.8.88 10.129.2.88 10.128.6.127 10.130.10.18 10.129.8.62 10.131.6.73 10.128.2.91 10.128.6.116 10.131.12.117 10.129.12.112 10.130.4.115 10.129.8.47 10.128.10.20 10.130.4.30 10.129.8.52 10.129.12.79 10.130.10.86 10.131.12.99 10.129.12.105 10.131.14.13 10.1
28.10.131 10.131.14.60 10.131.4.61 10.130.6.85 10.128.6.112 10.129.8.23 10.129.6.104 10.130.8.130 10.131.14.95 10.131.4.43 10.130.8.48 10.131.10.49 10.129.6.47 10.129.8.75 10.128.12.32 10.129.12.26 10.130.10.62 10.130.4.68 10.129.12.103 10.128.12.118 10.
130.10.121 10.128.6.23 10.128.10.69 10.129.8.92 10.131.10.139 10.130.4.38 10.130.6.42 10.131.12.106 10.130.8.116 10.129.8.15 10.129.12.47 10.130.4.53 10.128.10.67 10.131.6.77 10.130.4.120 10.130.6.135 10.131.8.45 10.129.12.70 10.129.2.68 10.129.12.73 10.
131.6.28 10.128.6.106 10.130.6.33 10.131.0.29 10.131.10.17 10.129.8.19 10.128.2.17 10.129.6.55 10.131.6.109 10.128.2.118 10.128.8.12 10.130.8.32 10.131.4.33 10.128.12.42 10.128.10.51 10.130.10.25 10.129.8.28 10.129.6.40 10.131.0.45 10.131.8.100 10.131.4.
119 10.131.4.120 10.130.4.9 10.131.4.34 10.128.12.51 10.128.10.63 10.131.10.68 10.130.10.110 10.130.10.115 10.129.2.15 10.130.10.34 10.131.0.77 10.128.10.27 10.131.14.115 10.130.10.142 10.129.12.9 10.129.12.15 10.129.8.61 10.128.12.11 10.130.10.49 10.131
.14.47 10.131.6.72 10.129.6.75 10.131.6.78 10.131.0.128 10.128.8.132 10.131.10.12 10.128.2.10 10.131.0.52 10.128.6.87 10.131.8.111 10.131.6.124 10.131.14.8 10.128.12.34 10.129.12.56 10.129.2.89 10.129.12.97 10.128.2.98 10.130.10.8 10.130.10.43 10.131.10. 
.... continues with many, many more IPs.

Comment 6 Anurag saxena 2020-03-26 18:59:47 UTC
Joe, can you also comment on number of workers and masters you used? and which build you were trying on wrt comment 5?

Comment 7 agopi 2020-03-26 19:03:00 UTC
@anurag this was with 3 masters, 4 workers and 2 infra nodes. The build was `4.3.5` version

Comment 8 Anurag saxena 2020-03-27 15:59:20 UTC
@agopi @Joe, i was not able to reproduce it with 3M-4W on Azure and AWS on 4.3.5. Is this repro always in your case? Also did it reproduce immediately post cluster install or post some testing which might have resulted into this? Any info related to that would be useful.

Comment 9 agopi 2020-03-27 16:13:56 UTC
Hey @anurag, its not immediately after the cluster install, we usually try to create a few pods and then this error seems to occur after a few days at least in my case. 

Also @Joe did a little digging and found out the following:


`/usr/bin/ovnkube --init-master ip-10-0-139-244.us-west-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --metrics-bind-address 0.0.0.0:9102 --sb-address ssl://10.0.139.244:9642,ssl://10.0.149.21:9642,ssl://10.0.171.170:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt` is how the ovnkube master is invoked, and if we take a look at the /run/ovnkube-config/ovnkube.conf file, its looking as follows:

```
[default]
mtu="8901"
cluster-subnets="10.128.0.0/14/23"
encap-port="6081"
[kubernetes]
service-cidr="172.30.0.0/16"
ovn-config-namespace="openshift-ovn-kubernetes"
apiserver="https://api-int.awsovn435aakarsh.perf-testing.devcluster.openshift.com:6443"
[logging]
logfile="/dev/stdout"
[gateway]
mode=local
nodeport=true
```

The main line is the logging.logfile value which we see is written to /dev/stdout, however /dev partition max size is 64MB... and Joe looked at the log rotating and mentioned its supposed to rotate at 100M ( Joe can add more details about it).

Comment 10 Alexander Constantinescu 2020-03-30 11:41:24 UTC
Hi guys

Thanks for the investigation! Really helpful and I know what's causing this now. I will need to double check how to apply the fix, but at least it's clear what's happening now. 

FYI: the issue is related to the log-rotation mechanism that ovn-kubernetes is using in 4.3. The fact that it's writing to /dev/stdout which is limited to 64MB triggers log-rotation to happen incorrectly, which are these files we see here in your output:

sh-4.2# ls -talrh
total 64M
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stdout-2020-03-19T18-03-52.057 -> /proc/self/fd/1
-rw-------. 1 root root    0 Mar 19 18:03 stdout-2020-03-19T18-03-52.057.gz
-rw-------. 1 root root  64M Mar 26 18:44 stdout

The stdout-DATE files indicate that log rotation happened using the internal mechanism of ovn-kubernetes, but as /dev is full it seems it's been done incorrectly. 

Part of the solution is indeed to apply the same log file as done upstreams, but that's only half the solution needed. The other half is actually already existing in 4.4 and 4.5, but I need to check if we can back-port that to 4.3

-Alex

Comment 13 Anurag saxena 2020-04-13 20:44:05 UTC
Did setup a cluster to observe the fix on 4.4 for couple of days

Comment 16 errata-xmlrpc 2020-05-04 11:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581


Note You need to log in before you can comment on or make changes to this bug.