Bug 1958094

Summary: Audit log files are corrupted sometimes
Product: OpenShift Container Platform
Reporter: Stefan Schimanski <sttts>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: high
Version: 4.8
CC: aos-bugs, mfojtik, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:07:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Stefan Schimanski 2021-05-07 07:35:38 UTC
Log files in /var/log/kube-apiserver are sometimes corrupted at the tail. We suspect that kubelet does not run instances of the same pod strictly in sequence, but lets them overlap.

Comment 2 Ke Wang 2021-05-18 11:20:26 UTC
$ oc version -o yaml
clientVersion:
  buildDate: "2021-05-14T22:17:07Z"
  compiler: gc
  gitCommit: 629bdbe335bbf2f68e5a5f6e3fc25de8c249fd3c
  gitTreeState: clean
  gitVersion: 4.8.0-202105142152.p0-629bdbe
  goVersion: go1.16.1
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.8.0-0.nightly-2021-05-18-033553
releaseClientVersion: 4.8.0-0.nightly-2021-05-17-231618
serverVersion:
  buildDate: "2021-05-17T20:26:19Z"
  compiler: gc
  gitCommit: 9d99e1c27544615392364de66fc7fa926bd9e752
  gitTreeState: clean
  gitVersion: v1.21.0-rc.0+9d99e1c
  goVersion: go1.16.1
  major: "1"
  minor: 21+
  platform: linux/amd64

$ oc describe pod -n openshift-kube-apiserver kube-apiserver-ip-10-0-142-223.us-east-2.compute.internal
...
Containers:
  kube-apiserver:
    Container ID:  cri-o://b588396f834dfbb0298fd22ada48bcc68ae822e50f2267b5208bd961c1905d28
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb66d45f64e61b8896d933d5efccbde43d0d1bfa59ef0e970bc21dd19f662a05
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb66d45f64e61b8896d933d5efccbde43d0d1bfa59ef0e970bc21dd19f662a05
    Port:          6443/TCP
    Host Port:     6443/TCP
    Command:
      /bin/bash
      -ec
    Args:
      LOCK=/var/log/kube-apiserver/.lock
      echo -n "Acquiring exclusive lock ${LOCK}"
      exec {LOCK_FD}>${LOCK} && flock -n "${LOCK_FD}" || {
        echo "$(date -Iseconds -u) kubelet did not terminate old kube-apiserver before new one" >> /var/log/kube-apiserver/lock.log
        echo -n ": WARNING: kubelet did not terminate old kube-apiserver before new one."
        # we didn't get an exclusive lock. We keep going with the risk to corrupt audit logs.
      }
      echo
...      

The flock call has been added to the kube-apiserver startup command.
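For readers unfamiliar with the pattern, here is a minimal sketch of how a non-blocking exclusive flock detects an overlapping instance. It is not the operator's actual script: the temporary lock file and the fixed descriptor numbers 9 and 8 are assumptions made for the demo (the pod args above use /var/log/kube-apiserver/.lock and the {LOCK_FD} auto-allocation form), and it assumes bash plus the util-linux flock utility.

```shell
#!/usr/bin/env bash
# Sketch only: simulates an "old" and a "new" instance competing for one lock file.
set -eu

# Hypothetical lock path for the demo; the pod uses /var/log/kube-apiserver/.lock.
LOCK="$(mktemp)"

# "Old" instance: open the lock file on fd 9 and take an exclusive lock
# without waiting (-n). The lock is held for as long as fd 9 stays open.
exec 9>"${LOCK}"
if flock -n 9; then first="acquired"; else first="busy"; fi

# "New" instance: a separate open of the same file (fd 8 in a subshell).
# Its lock attempt is independent of fd 9, so it fails while fd 9 holds the lock.
second="$(
  exec 8>"${LOCK}"
  if flock -n 8; then echo "acquired"; else echo "busy"; fi
)"

# The old instance terminates: closing fd 9 releases the lock.
exec 9>&-

# A later instance can now take the lock.
third="$(
  exec 8>"${LOCK}"
  if flock -n 8; then echo "acquired"; else echo "busy"; fi
)"

echo "first=${first} second=${second} third=${third}"
rm -f "${LOCK}"
```

In this sketch the "busy" result corresponds to the branch in the pod args that writes to lock.log and prints the warning. Note that the pod command keeps going after a failed lock attempt rather than exiting, so the lock serves to detect and record overlap, not to prevent it.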

On the testgrid https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-aws&sort-by-flakiness, the following e2e test case [1] has been passing since May 10th.

[1] openshift-tests.[sig-cli] oc adm must-gather when looking at the audit logs [sig-node] kubelet runs apiserver processes strictly sequentially in order to not risk audit log corruption [Suite:openshift/conformance/parallel]

Based on the above, the PR works as expected, so moving the bug to VERIFIED.

Comment 5 errata-xmlrpc 2021-07-27 23:07:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438