Bug 1793108

Summary: ovn-master ran out of space on system
Product: OpenShift Container Platform
Reporter: Joe Talerico <jtaleric>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: agopi, akamra, bbennett, rsevilla, scuppett
Version: 4.4
Keywords: Reopened
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1820106 1821166 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:25:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1820106
Bug Blocks: 1821166

Description Joe Talerico 2020-01-20 17:16:54 UTC
Description of problem:
The problem manifested itself by not allowing our automation to deploy our tooling:

  Warning  FailedCreatePodSandBox  4m39s (x572 over 4h35m)  kubelet, ip-10-0-144-180.us-west-2.compute.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_scale-ci-mastervertical-mpzjn_scale-ci-tooling_7fd0a212-a03a-49c6-99df-cda39288db59_0(8e03feff97ea1b99584a86f47b8bcda2def64de8917961c85e0b043bed861427): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

That led us to look at the ovn-master pod, whose logs were full of `not able to log to /dev/stdout` messages.

In the AWS console, the node had a warning that the AWS status checks couldn't determine the state of the node. I rebooted the node and it came back up; previously, it was unreachable even via SSH or debug node.

Version-Release number of selected component (if applicable):

How reproducible:

Comment 1 Alexander Constantinescu 2020-02-26 13:11:37 UTC
Hi Joe

Wondering if you've encountered the issue again during your scale tests? 

I have not been able to reproduce the /dev/stdout issue, but I assume I am simply not generating enough logs to trigger the problem. 


Comment 2 Joe Talerico 2020-02-26 18:07:19 UTC
Hey Alex - I have not seen this recently, we can close for now.

Comment 4 agopi 2020-03-26 18:35:22 UTC
Hi Alexander, 

I noticed the same error with OCP 4.3.5.

Comment 5 Joe Talerico 2020-03-26 18:50:04 UTC
As Aakarsh mentioned, we have seen this again:

root@ip-172-31-12-199: ~/ovs-policy # oc rsh -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-sb8rp                                                                                                                                               
sh-4.2# df -h
Filesystem                            Size  Used Avail Use% Mounted on
overlay                               220G   18G  203G   8% /
tmpfs                                  64M   64M     0 100% /dev
shm                                    64M     0   64M   0% /dev/shm
tmpfs                                  63G   12M   63G   1% /run/ovn
/dev/mapper/coreos-luks-root-nocrypt  220G   18G  203G   8% /env
tmpfs                                  63G  8.0K   63G   1% /ovn-cert
tmpfs                                  63G   24K   63G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                  63G     0   63G   0% /proc/acpi
tmpfs                                  63G     0   63G   0% /proc/scsi
tmpfs                                  63G     0   63G   0% /sys/firmware

Where /dev has

sh-4.2# cd /dev
sh-4.2# ls
core  fd  full  mqueue  null  ptmx  pts  random  shm  stderr  stdin  stdout  stdout-2020-03-19T18-03-52.057  stdout-2020-03-19T18-03-52.057.gz  termination-log  tty  urandom  zero
sh-4.2# ls -talrh
total 64M
drwxrwxrwt. 2 root root   40 Mar 18 16:00 mqueue
drwxrwxrwt. 2 root root   40 Mar 18 16:00 shm
-rw-rw-rw-. 1 root root    0 Mar 18 16:00 termination-log
drwxr-xr-x. 2 root root    0 Mar 18 16:00 pts
drwxr-xr-x. 1 root root   92 Mar 18 16:00 ..
crw-rw-rw-. 1 root root 1, 5 Mar 18 16:00 zero
crw-rw-rw-. 1 root root 1, 9 Mar 18 16:00 urandom
crw-rw-rw-. 1 root root 5, 0 Mar 18 16:00 tty
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stdout-2020-03-19T18-03-52.057 -> /proc/self/fd/1
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stdin -> /proc/self/fd/0
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stderr -> /proc/self/fd/2
crw-rw-rw-. 1 root root 1, 8 Mar 18 16:00 random
lrwxrwxrwx. 1 root root    8 Mar 18 16:00 ptmx -> pts/ptmx
crw-rw-rw-. 1 root root 1, 3 Mar 18 16:00 null
crw-rw-rw-. 1 root root 1, 7 Mar 18 16:00 full
lrwxrwxrwx. 1 root root   13 Mar 18 16:00 fd -> /proc/self/fd
lrwxrwxrwx. 1 root root   11 Mar 18 16:00 core -> /proc/kcore
-rw-------. 1 root root    0 Mar 19 18:03 stdout-2020-03-19T18-03-52.057.gz
drwxr-xr-x. 5 root root  400 Mar 19 18:03 .
-rw-------. 1 root root  64M Mar 26 18:44 stdout

The stdout file is full of entries like:
time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port nodevertical0_nodevert-pod-2670 dynamic_addresses addresses"
time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): stdout: \"[]\\n[dynamic]\\n\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68668): stderr: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port nodevertical0_nodevert-pod-2670 dynamic_addresses addresses"
time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): stdout: \"\\\"76:1e:ab:82:06:89\\\"\\n[dynamic]\\n\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68669): stderr: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): /usr/bin/ovn-nbctl --timeout=15 lsp-set-port-security nodevertical0_nodevert-pod-2670 76:1e:ab:82:06:89"
time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): stdout: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="exec(68670): stderr: \"\""
time="2020-03-19T18:08:11Z" level=debug msg="setAddressSet for a9236571437960368105 with [ 10.130.10.129 10.129.12.37 10.128.6.128 10.1281 10.129.8.88 10.128.10.131 10.130.10.121 10.131.6.28 10.131.4.119 10.131.14.47 10.131.10.
.... continues with many, many more IPs.

Comment 6 Anurag saxena 2020-03-26 18:59:47 UTC
Joe, can you also comment on the number of masters and workers you used, and which build you were testing with regard to comment 5?

Comment 7 agopi 2020-03-26 19:03:00 UTC
@anurag this was with 3 masters, 4 workers and 2 infra nodes, on build 4.3.5.

Comment 8 Anurag saxena 2020-03-27 15:59:20 UTC
@agopi @Joe, I was not able to reproduce it with 3 masters and 4 workers on Azure and AWS on 4.3.5. Does it reproduce every time in your case? Also, did it reproduce immediately after cluster install, or only after some testing that might have led to this? Any related info would be useful.

Comment 9 agopi 2020-03-27 16:13:56 UTC
Hey @anurag, it's not immediately after the cluster install; we usually create a few pods, and then the error seems to occur after at least a few days in my case.

Also, @Joe did a little digging and found the following:

`/usr/bin/ovnkube --init-master ip-10-0-139-244.us-west-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --metrics-bind-address --sb-address ssl://,ssl://,ssl:// --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt` is how the ovnkube master is invoked, and if we look at the /run/ovnkube-config/ovnkube.conf file, it looks as follows:


The key line is the logging.logfile value, which we can see is set to /dev/stdout; however, the /dev partition has a maximum size of 64MB. Joe also looked at the log rotation and mentioned it is supposed to rotate at 100M (Joe can add more details about it).
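This mismatch can be illustrated with a small simulation (a sketch, not the actual ovn-kubernetes rotation code): if the rotation threshold (100M here) is larger than the filesystem holding the log file (the 64M /dev tmpfs), the filesystem fills up before rotation ever fires. The per-write size and step count below are arbitrary assumptions for the simulation.

```python
def simulate_logging(rotate_at_mb: int, capacity_mb: int,
                     write_mb: int = 1, steps: int = 1000) -> str:
    """Append write_mb per step; rotate (truncate) when the log file
    reaches rotate_at_mb; fail when the filesystem itself is full."""
    size = 0
    for _ in range(steps):
        if size + write_mb > capacity_mb:
            return "ENOSPC"  # filesystem full before rotation could fire
        size += write_mb
        if size >= rotate_at_mb:
            size = 0         # rotation succeeded, file truncated
    return "ok"

# Rotation threshold above filesystem capacity: rotation never triggers.
print(simulate_logging(rotate_at_mb=100, capacity_mb=64))  # ENOSPC
# Threshold below capacity: rotation keeps the file bounded.
print(simulate_logging(rotate_at_mb=50, capacity_mb=64))   # ok
```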

Comment 10 Alexander Constantinescu 2020-03-30 11:41:24 UTC
Hi guys

Thanks for the investigation! Really helpful and I know what's causing this now. I will need to double check how to apply the fix, but at least it's clear what's happening now. 

FYI: the issue is related to the log-rotation mechanism that ovn-kubernetes uses in 4.3. Writing to /dev/stdout, which is limited to 64MB, causes log rotation to happen incorrectly, producing the files we see in your output:

sh-4.2# ls -talrh
total 64M
lrwxrwxrwx. 1 root root   15 Mar 18 16:00 stdout-2020-03-19T18-03-52.057 -> /proc/self/fd/1
-rw-------. 1 root root    0 Mar 19 18:03 stdout-2020-03-19T18-03-52.057.gz
-rw-------. 1 root root  64M Mar 26 18:44 stdout

The stdout-DATE files indicate that log rotation happened via ovn-kubernetes's internal mechanism, but since /dev is full, it appears to have been done incorrectly.

Part of the solution is indeed to use the same log file path as upstream, but that is only half of what is needed. The other half already exists in 4.4 and 4.5; I need to check whether we can back-port it to 4.3.
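A generic preflight check along these lines (a hypothetical helper, not part of ovn-kubernetes) could catch a log target whose filesystem is smaller than the configured rotation size:

```python
import os

def rotation_fits(log_path: str, rotate_at_bytes: int) -> bool:
    """Return True if the filesystem holding log_path is at least as
    large as the configured rotation threshold."""
    st = os.statvfs(os.path.dirname(log_path) or ".")
    return rotate_at_bytes <= st.f_frsize * st.f_blocks

# On the affected nodes, rotation_fits("/dev/stdout", 100 * 1024**2)
# would be False, because /dev is a 64M tmpfs inside the container.
```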


Comment 13 Anurag saxena 2020-04-13 20:44:05 UTC
I set up a cluster and observed the fix on 4.4 for a couple of days.

Comment 16 errata-xmlrpc 2020-05-04 11:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.