Created attachment 1571779 [details]
server returned HTTP status 500 Internal Server Error

Description of problem:
Deployed OCP 4.1 on a VMware/bare-metal environment and checked the Prometheus /targets page (see the attached picture): some kubelet endpoints show "server returned HTTP status 500 Internal Server Error".

Take https://139.178.76.11:10250/metrics for example; it is on the compute-0 node:

$ oc get node -o wide | grep compute-0
compute-0   Ready   worker   24h   v1.13.4+27816e1b1   139.178.76.11   139.178.76.11   Red Hat Enterprise Linux CoreOS 410.8.20190517.0 (Ootpa)   4.18.0-80.1.2.el8_0.x86_64   cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8

Log in to this node and run the curl command below; it returns an error:
*****************************************************************
$ curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://139.178.76.11:10250/metrics
An error has occurred during metrics collection:

collected metric kubelet_container_log_filesystem_used_bytes label:<name:"container" value:"sdn" > label:<name:"namespace" value:"openshift-sdn" > label:<name:"pod" value:"sdn-snhfk" > gauge:<value:122880 > was collected before with the same name and label values
*****************************************************************
$ sudo crictl ps -a | grep sdn
de29abd833b73   9a4ee92f8b0b7aa5e61862a782b1dc2f79aef3f10672906acd99b1ff22c4a0f9   10 minutes ago   Running   sdn   4   7ec7c2aee096c
15b510d75c6d4   9a4ee92f8b0b7aa5e61862a782b1dc2f79aef3f10672906acd99b1ff22c4a0f9   20 hours ago     Exited    sdn   3   7ec7c2aee096c
*****************************************************************
$ oc -n openshift-sdn get po sdn-snhfk -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
sdn-snhfk   1/1     Running   4          24h   139.178.76.11   compute-0   <none>           <none>
*****************************************************************
$ oc -n openshift-sdn get po sdn-snhfk -oyaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2019-05-21T02:27:14Z
  generateName: sdn-
  labels:
    app: sdn
    component: network
    controller-revision-hash: 597c997bdd
    openshift.io/component: network
    pod-template-generation: "1"
    type: infra
  name: sdn-snhfk
  namespace: openshift-sdn
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: sdn
    uid: f25a96ba-7b6f-11e9-8ae8-0050568bbe97
  resourceVersion: "940577"
  selfLink: /api/v1/namespaces/openshift-sdn/pods/sdn-snhfk
  uid: f25e750e-7b6f-11e9-8ae8-0050568bbe97
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - compute-0
  containers:
  - command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -euo pipefail

      # if another process is listening on the cni-server socket, wait until it exits
      trap 'kill $(jobs -p); rm -f /etc/cni/net.d/80-openshift-network.conf ; exit 0' TERM
      retries=0
      while true; do
        if echo 'test' | socat - UNIX-CONNECT:/var/run/openshift-sdn/cni-server.sock &>/dev/null; then
          echo "warning: Another process is currently listening on the CNI socket, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently listening on the CNI socket, exiting" 2>&1
          exit 1
        fi
      done

      # local environment overrides
      if [[ -f /etc/sysconfig/openshift-sdn ]]; then
        set -o allexport
        source /etc/sysconfig/openshift-sdn
        set +o allexport
      fi

      #BUG: cdc accidentally mounted /etc/sysconfig/openshift-sdn as DirectoryOrCreate; clean it up so we can ultimately mount /etc/sysconfig/openshift-sdn as FileOrCreate
      # Once this is released, then we can mount it properly
      if [[ -d /etc/sysconfig/openshift-sdn ]]; then
        rmdir /etc/sysconfig/openshift-sdn || true
      fi

      # Take over network functions on the node
      rm -f /etc/cni/net.d/80-openshift-network.conf
      cp -f /opt/cni/bin/* /host/opt/cni/bin/

      # Launch the network process
      exec /usr/bin/openshift-sdn --config=/config/sdn-config.yaml --url-only-kubeconfig=/etc/kubernetes/kubeconfig --loglevel=${DEBUG_LOGLEVEL:-2}
    env:
    - name: OPENSHIFT_DNS_DOMAIN
      value: cluster.local
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6528d6c46b61b0ea653e82fe0ea34a2186fcb22815edc655c8b9ac1ecffc18b6
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - rm
          - -f
          - /etc/cni/net.d/80-openshift-network.conf
          - /host/opt/cni/bin/openshift-sdn
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: healthz
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: sdn
    ports:
    - containerPort: 10256
      hostPort: 10256
      name: healthz
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - test
        - -f
        - /etc/cni/net.d/80-openshift-network.conf
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
    securityContext:
      privileged: true
      procMount: Default
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /config
      name: config
      readOnly: true
    - mountPath: /etc/kubernetes/kubeconfig
      name: host-kubeconfig
      readOnly: true
    - mountPath: /var/run
      name: host-var-run
    - mountPath: /var/run/dbus/
      name: host-var-run-dbus
      readOnly: true
    - mountPath: /var/run/openvswitch/
      name: host-var-run-ovs
      readOnly: true
    - mountPath: /var/run/kubernetes/
      name: host-var-run-kubernetes
      readOnly: true
    - mountPath: /var/run/openshift-sdn
      name: host-var-run-openshift-sdn
    - mountPath: /host
      name: host-slash
    - mountPath: /host/opt/cni/bin
      name: host-cni-bin
    - mountPath: /etc/cni/net.d
      name: host-cni-netd
    - mountPath: /var/lib/cni/networks/openshift-sdn
      name: host-var-lib-cni-networks-openshift-sdn
    - mountPath: /lib/modules
      name: host-modules
      readOnly: true
    - mountPath: /etc/sysconfig
      name: etc-sysconfig
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: sdn-token-bcn77
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  nodeName: compute-0
  nodeSelector:
    beta.kubernetes.io/os: linux
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: sdn
  serviceAccountName: sdn
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 420
      name: sdn-config
    name: config
  - hostPath:
      path: /etc/kubernetes/kubeconfig
      type: ""
    name: host-kubeconfig
  - hostPath:
      path: /etc/sysconfig
      type: ""
    name: etc-sysconfig
  - hostPath:
      path: /lib/modules
      type: ""
    name: host-modules
  - hostPath:
      path: /var/run
      type: ""
    name: host-var-run
  - hostPath:
      path: /var/run/dbus
      type: ""
    name: host-var-run-dbus
  - hostPath:
      path: /var/run/openvswitch
      type: ""
    name: host-var-run-ovs
  - hostPath:
      path: /var/run/kubernetes
      type: ""
    name: host-var-run-kubernetes
  - hostPath:
      path: /var/run/openshift-sdn
      type: ""
    name: host-var-run-openshift-sdn
  - hostPath:
      path: /
      type: ""
    name: host-slash
  - hostPath:
      path: /var/lib/cni/bin
      type: ""
    name: host-cni-bin
  - hostPath:
      path: /etc/kubernetes/cni/net.d
      type: ""
    name: host-cni-netd
  - hostPath:
      path: /var/lib/cni/networks/openshift-sdn
      type: ""
    name: host-var-lib-cni-networks-openshift-sdn
  - name: sdn-token-bcn77
    secret:
      defaultMode: 420
      secretName: sdn-token-bcn77
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-05-21T02:27:14Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-05-22T02:01:17Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2019-05-22T02:01:17Z
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-05-21T02:27:14Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://de29abd833b734d03f4d07e156ab34eea950dc818f8fb00f9c17079f27e0e0ec
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6528d6c46b61b0ea653e82fe0ea34a2186fcb22815edc655c8b9ac1ecffc18b6
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6528d6c46b61b0ea653e82fe0ea34a2186fcb22815edc655c8b9ac1ecffc18b6
    lastState:
      terminated:
        containerID: cri-o://15b510d75c6d44229a54981fd488b4b5beedbd642a3559d49cfa1e28b9222819
        exitCode: 0
        finishedAt: 2019-05-22T02:01:05Z
        reason: Completed
        startedAt: 2019-05-21T06:22:52Z
    name: sdn
    ready: true
    restartCount: 4
    state:
      running:
        startedAt: 2019-05-22T02:01:09Z
  hostIP: 139.178.76.11
  phase: Running
  podIP: 139.178.76.11
  qosClass: Burstable
  startTime: 2019-05-21T02:27:14Z
*****************************************************************

Version-Release number of selected component (if applicable):
payload: 4.1.0-0.nightly-2019-05-18-050636

$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"27816e1b1", GitTreeState:"clean", BuildDate:"2019-05-17T23:03:34Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+f2cc675", GitCommit:"f2cc675", GitTreeState:"clean", BuildDate:"2019-05-18T00:35:43Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
Frequently

Steps to Reproduce:
1. Deploy OCP 4.1 on a VMware/bare-metal environment, let it run for a few days, then check the Prometheus /targets page.

Actual results:
"server returned HTTP status 500 Internal Server Error" for some kubelet endpoints.

Expected results:
The kubelet /metrics endpoints should not return this error.

Additional info:
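For anyone trying to reproduce the 500 outside a cluster: the error text comes from the Prometheus Go client, which fails the whole gather when the same metric name and label set is collected twice. A minimal standalone sketch (the collector type and values here are illustrative, not kubelet code) that triggers the same failure:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

type logFSCollector struct {
	desc *prometheus.Desc
}

func (c *logFSCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect emits the gauge twice with identical label values, mimicking a
// kubelet that reports both the Running and the Exited instance of a
// restarted container (the sdn container above has restartCount 4).
func (c *logFSCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, 122880,
			"sdn", "openshift-sdn", "sdn-snhfk",
		)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&logFSCollector{
		desc: prometheus.NewDesc(
			"kubelet_container_log_filesystem_used_bytes",
			"Bytes used by the container's logs on the filesystem.",
			[]string{"container", "namespace", "pod"}, nil,
		),
	})
	// Gather fails with "was collected before with the same name and
	// label values", the same class of error seen in the curl output.
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}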
If I'm not wrong, it is fixed by https://github.com/kubernetes/kubernetes/pull/77426.
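I haven't walked through the exact diff, but the failure mode suggests the fix amounts to emitting the log-filesystem gauge at most once per (namespace, pod, container), so a leftover Exited instance of a restarted container can no longer duplicate the Running one. A rough sketch of that dedup (all identifiers here are illustrative, not the actual kubelet names):

package kubeletstatsfix

import "github.com/prometheus/client_golang/prometheus"

// containerLogStat is an illustrative stand-in for the kubelet's
// per-container log statistics.
type containerLogStat struct {
	namespace, pod, container string
	usedBytes                 float64
}

// emitLogMetrics sends the gauge once per (namespace, pod, container)
// triple, skipping any second instance of the same container.
func emitLogMetrics(desc *prometheus.Desc, stats []containerLogStat, ch chan<- prometheus.Metric) {
	seen := map[[3]string]bool{}
	for _, s := range stats {
		key := [3]string{s.namespace, s.pod, s.container}
		if seen[key] {
			continue // duplicate label set, e.g. the Exited instance
		}
		seen[key] = true
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue,
			s.usedBytes, s.container, s.namespace, s.pod)
	}
}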
I think this can be deferred to 4.1.1; it does not significantly impact the usability of the cluster.
Agreed, we should backport the linked PR as well: https://github.com/kubernetes/kubernetes/pull/77426
Ryan, can you backport this to Origin master (4.2)? I'll clone this bug for the 4.1.z backport.
PR: https://github.com/openshift/origin/pull/22889
Moving back to ASSIGNED because we also need this: https://github.com/prometheus/client_golang/pull/513
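For context on why a client_golang change is involved at all (I'm not reproducing that PR's diff here): the promhttp handler's error-handling mode is what decides whether a gather error surfaces as an HTTP 500 or as a partial scrape. A minimal sketch of that mechanism, using a stand-in registry:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry() // stand-in for the kubelet's registry

	// HTTPErrorOnError turns any gather error (like the duplicate-metric
	// one above) into a 500 for the entire scrape; ContinueOnError would
	// serve whatever was collected successfully and only log the failure.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.HTTPErrorOnError, // vs. promhttp.ContinueOnError
	}))
	log.Fatal(http.ListenAndServe(":9999", nil)) // illustrative port
}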
*** Bug 1716913 has been marked as a duplicate of this bug. ***
*** Bug 1716914 has been marked as a duplicate of this bug. ***
The second PR is only needed for 4.1 so, I'm going to clone to track the backport of both PRs to 4.1 and move this back to MODIFIED.
Tested with 4.2.0-0.nightly-2019-06-27-204704. On the Prometheus /targets page, checked all 10250/metrics endpoints; they are all UP.
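For anyone re-verifying without clicking through the UI, a small helper equivalent to the manual curl check above (this is my own sketch; PROM_SA_TOKEN is an assumed env var holding the output of `oc sa get-token prometheus-k8s -n openshift-monitoring`):

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

// Takes node IPs as arguments and prints the HTTP status of each kubelet
// /metrics endpoint; every node should report "200 OK" after the fix.
func main() {
	client := &http.Client{Transport: &http.Transport{
		// Matches curl -k: the kubelet serves a self-signed certificate.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	for _, node := range os.Args[1:] {
		req, err := http.NewRequest("GET", "https://"+node+":10250/metrics", nil)
		if err != nil {
			fmt.Println(node, "error:", err)
			continue
		}
		req.Header.Set("Authorization", "Bearer "+os.Getenv("PROM_SA_TOKEN"))
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println(node, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(node, resp.Status)
	}
}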
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922