Bug 2100171

Summary: kube-controller-manager becomes degraded after time chaos test.
Product: OpenShift Container Platform
Component: kube-controller-manager
Version: 4.11
Reporter: Simon <skordas>
Assignee: Filip Krepinsky <fkrepins>
QA Contact: zhou ying <yinzhou>
CC: mfojtik
Status: CLOSED MIGRATED
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2023-01-12 15:47:45 UTC

Description Simon 2022-06-22 16:07:24 UTC
Description of problem:
Very similar issue https://bugzilla.redhat.com/show_bug.cgi?id=1997630

During time chaos test kube-controller-manager became degraded.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-22-015220

How reproducible:
100%

Steps to Reproduce:
1. git clone https://github.com/openshift-scale/kraken.git
2. cd kraken
3. Edit the config file (kraken/config/config.yml) so that the chaos_scenarios section contains only time_scenarios, and set the correct path to your KUBECONFIG (see the sketch after these steps)
4. Edit test file: (kraken/scenarios/openshift/time_scenarios_example.yml)
time_scenarios:
  - action: skew_time
    object_type: pod
    namespace: openshift-apiserver
    container_name: openshift-apiserver
    label_selector: app=openshift-apiserver-a
  - action: skew_date
    object_type: node
    label_selector: node-role.kubernetes.io/worker
5. pip3 install -r requirements.txt
6. Run: python3 run_kraken.py --config <config_file_location>
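
The chaos_scenarios edit from step 3 would leave roughly the following in kraken/config/config.yml. This is a sketch only: the key names are assumed from the kraken example config of that era and should be checked against the file in the repo, and the rest of the file stays as shipped.

kraken:
    kubeconfig_path: ~/.kube/config                          # correct path to your KUBECONFIG
    chaos_scenarios:                                         # only the time scenarios are kept
        - time_scenarios:
            - scenarios/openshift/time_scenarios_example.yml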

or manually, for the pods:

# list the openshift-apiserver pods, then skew the clock inside each one
oc get pods -n openshift-apiserver -l app=openshift-apiserver-a
oc exec $pod -n openshift-apiserver -- date -s 01:01:01

Actual results:
$ oc get co kube-controller-manager
NAME                      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.11.0-0.nightly-2022-06-22-015220   True        False         True       4h32m   StaticPodsDegraded: pod/kube-controller-manager-ip-10-0-169-52.us-east-2.compute.internal container "kube-controller-manager" is waiting: CreateContainerError: error reserving ctr name k8s_kube-controller-manager_kube-controller-manager-ip-10-0-169-52.us-east-2.compute.internal_openshift-kube-controller-manager_fe72bea3ab13c65c3fb62a7894c9a7e7_1 for id 04f611acd98f023845f45005e93ec545b40d97f57cbc2a50e97d856b2568595c: name is reserved...
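
One way to look at the container state behind the StaticPodsDegraded condition is to describe the static pod directly; the pod name and namespace below are the ones shown in the message above:

oc describe pod -n openshift-kube-controller-manager \
    kube-controller-manager-ip-10-0-169-52.us-east-2.compute.internal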

Comment 2 Filip Krepinsky 2023-01-12 15:31:16 UTC
The manual reproduction worked for me 

1. Run:
for pod in $(oc get pods -oname -n openshift-apiserver); do
  oc exec $pod -n openshift-apiserver -- date -s 01:01:01 &
done

2. wait for some time until KCM gets degraded
3. oc get co kube-controller-manager
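
Instead of re-running step 3 by hand, the flip to Degraded can also be watched (standard oc --watch flag; not part of the original steps):

oc get co kube-controller-manager -w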


This causes the KCM to crash, but the container is not restarted by the kubelet afterwards. The other KCM pod takes leadership and appears to work properly. This looks like a kubelet issue; I will create a Jira bug for it.

kubelet logs show:

Jan 12 15:20:13.985008 ip-10-0-183-145 kubenswrapper[1485]: E0112 15:20:13.984813    1485 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with RunContainerError: \"container e396b5021f210b78e470a0161cd45eb34c650b7742538655e46403428e931903 is not in created state: running\"" pod="openshift-kube-controller-manager/kube-controller-manager-ip-10-0-183-145.eu-north-1.compute.internal" podUID=40bae0f0fa982f6f44ead6c84dd248aa
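
For reference, the kubelet journal for the affected node can be pulled through the API with something along these lines (the node name is taken from the pod name in the message above, and --unit selects the kubelet service):

oc adm node-logs ip-10-0-183-145.eu-north-1.compute.internal --unit=kubelet | grep pod_workers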


The kube-controller-manager container after it fails:

  - containerID: cri-o://a1cc365f308c77e48ff6e523d95e301c1e62587144c4a2c5c7d613dd16830971
    image: registry.ci.openshift.org/ocp/4.13-2023-01-11-123219@sha256:f9d04e98d280566ac4c4dea26d6d7085c0e0640df9c2a3e462b4aef67bfe5ef0
    imageID: registry.ci.openshift.org/ocp/4.13-2023-01-11-123219@sha256:f9d04e98d280566ac4c4dea26d6d7085c0e0640df9c2a3e462b4aef67bfe5ef0
    lastState:
      terminated:
        containerID: cri-o://a1cc365f308c77e48ff6e523d95e301c1e62587144c4a2c5c7d613dd16830971
        exitCode: 1
        finishedAt: "2023-01-12T01:01:14Z"
        message: |
          atch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Unauthorized
          E0112 01:01:12.271715       1 reflector.go:140] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: the server has asked for the client to provide credentials
          W0112 01:01:12.760779       1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.ResourceQuota: Unauthorized
          E0112 01:01:12.760838       1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ResourceQuota: failed to list *v1.ResourceQuota: Unauthorized
          W0112 01:01:13.341535       1 reflector.go:424] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: failed to list *v1.PartialObjectMetadata: Unauthorized
          E0112 01:01:13.341567       1 reflector.go:140] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Unauthorized
          W0112 01:01:13.524521       1 reflector.go:424] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: failed to list *v1.PartialObjectMetadata: Unauthorized
          E0112 01:01:13.524546       1 reflector.go:140] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Unauthorized
          E0112 01:01:14.629356       1 reflector.go:140] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: the server has asked for the client to provide credentials
          E0112 01:01:14.667114       1 reflector.go:140] vendor/k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: the server has asked for the client to provide credentials
          I0112 01:01:14.693929       1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
          E0112 01:01:14.694384       1 controllermanager.go:318] "leaderelection lost"
        reason: Error
        startedAt: "2023-01-12T14:37:02Z"
    name: kube-controller-manager
    ready: false
    restartCount: 2
    started: false
    state:
      waiting:
        message: 'container e396b5021f210b78e470a0161cd45eb34c650b7742538655e46403428e931903
          is not in created state: running'
        reason: RunContainerError
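
As a cross-check on the leadership handover mentioned above, the lease named in the log ("failed to renew lease kube-system/kube-controller-manager") can be queried; holderIdentity should name the surviving kube-controller-manager instance:

oc get lease kube-controller-manager -n kube-system -o jsonpath='{.spec.holderIdentity}{"\n"}'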