Bug 1958094 - Audit log files are corrupted sometimes
Summary: Audit log files are corrupted sometimes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-07 07:35 UTC by Stefan Schimanski
Modified: 2021-07-27 23:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:07:23 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-kube-apiserver-operator pull 1128 | 0 | None | open | Bug 1958094: Add flock to kube-apiserver startup | 2021-05-07 07:40:14 UTC
Github | openshift must-gather pull 231 | 0 | None | open | Bug 1958094: gather_audit_logs: ignore .lock file | 2021-05-07 12:53:02 UTC
Github | openshift origin pull 26138 | 0 | None | open | Bug 1958094: audit-logs: allow lock related files | 2021-05-07 09:23:47 UTC
Github | openshift origin pull 26139 | 0 | None | open | Bug 1958094: must-gather: add oauth-apiserver to expected audit logs | 2021-05-08 12:02:34 UTC
Github | openshift origin pull 26148 | 0 | None | open | Bug 1958094: must-gather: check file name instead of whole path | 2021-05-11 08:18:19 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:07:35 UTC

Description Stefan Schimanski 2021-05-07 07:35:38 UTC
Log files in /var/log/kube-apiserver are sometimes corrupted at the tail. We suspect that the kubelet does not run instances of the same pod strictly in sequence but lets them overlap, so that the old and the new kube-apiserver briefly write to the same files.

Comment 2 Ke Wang 2021-05-18 11:20:26 UTC
$ oc version -o yaml
clientVersion:
  buildDate: "2021-05-14T22:17:07Z"
  compiler: gc
  gitCommit: 629bdbe335bbf2f68e5a5f6e3fc25de8c249fd3c
  gitTreeState: clean
  gitVersion: 4.8.0-202105142152.p0-629bdbe
  goVersion: go1.16.1
  major: ""
  minor: ""
  platform: linux/amd64
openshiftVersion: 4.8.0-0.nightly-2021-05-18-033553
releaseClientVersion: 4.8.0-0.nightly-2021-05-17-231618
serverVersion:
  buildDate: "2021-05-17T20:26:19Z"
  compiler: gc
  gitCommit: 9d99e1c27544615392364de66fc7fa926bd9e752
  gitTreeState: clean
  gitVersion: v1.21.0-rc.0+9d99e1c
  goVersion: go1.16.1
  major: "1"
  minor: 21+
  platform: linux/amd64

$ oc describe pod -n openshift-kube-apiserver kube-apiserver-ip-10-0-142-223.us-east-2.compute.internal
...
Containers:
  kube-apiserver:
    Container ID:  cri-o://b588396f834dfbb0298fd22ada48bcc68ae822e50f2267b5208bd961c1905d28
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb66d45f64e61b8896d933d5efccbde43d0d1bfa59ef0e970bc21dd19f662a05
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb66d45f64e61b8896d933d5efccbde43d0d1bfa59ef0e970bc21dd19f662a05
    Port:          6443/TCP
    Host Port:     6443/TCP
    Command:
      /bin/bash
      -ec
    Args:
      LOCK=/var/log/kube-apiserver/.lock
      echo -n "Acquiring exclusive lock ${LOCK}"
      exec {LOCK_FD}>${LOCK} && flock -n "${LOCK_FD}" || {
        echo "$(date -Iseconds -u) kubelet did not terminate old kube-apiserver before new one" >> /var/log/kube-apiserver/lock.log
        echo -n ": WARNING: kubelet did not terminate old kube-apiserver before new one."
        # we didn't get an exclusive lock. We keep going with the risk to corrupt audit logs.
      }
      echo
...      

flock has been added to the kube-apiserver startup command, as shown in the Args above.
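
For reference, the locking behaviour the startup script relies on can be reproduced outside the cluster. The following is only a minimal sketch, not the operator's script: it assumes util-linux flock on Linux, uses a hypothetical temporary lock path instead of /var/log/kube-apiserver/.lock, and uses the fixed descriptors 9 and 8 instead of the bash-only {LOCK_FD} form so that it also runs under plain sh. The two descriptors stand in for the old and the new kube-apiserver instance.

```shell
#!/bin/sh
# Hypothetical demo path; the real script locks /var/log/kube-apiserver/.lock.
LOCK="${TMPDIR:-/tmp}/flock-demo.$$.lock"

# First "instance": open the lock file on fd 9 and request an exclusive,
# non-blocking lock. The lock is held for as long as fd 9 stays open.
exec 9>"$LOCK"
if flock -n 9; then first=acquired; else first=busy; fi

# Second "instance": a separate open of the same file (fd 8) gets its own
# open file description, so its non-blocking attempt conflicts with fd 9.
exec 8>"$LOCK"
if flock -n 8; then second=acquired; else second=busy; fi
echo "first=$first second=$second"

# Closing fd 9 releases the first lock, as happens when the old process exits.
exec 9>&-
if flock -n 8; then retry=acquired; else retry=busy; fi
echo "retry=$retry"

# Clean up the demo lock file.
exec 8>&-
rm -f "$LOCK"
```

As in the startup script, a failed flock -n only signals the overlap; it does not by itself stop the second writer. The script in the pod logs the event to lock.log and continues.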

On the testgrid https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-aws&sort-by-flakiness, the following e2e test case [1] has been passing since May 10th.

[1] openshift-tests.[sig-cli] oc adm must-gather when looking at the audit logs [sig-node] kubelet runs apiserver processes strictly sequentially in order to not risk audit log corruption [Suite:openshift/conformance/parallel]

Based on the above, the PR works as expected, so moving the bug to VERIFIED.

Comment 5 errata-xmlrpc 2021-07-27 23:07:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

