Created attachment 1571779 [details]
server returned HTTP status 500 Internal Server Error

Description of problem:
Deployed OCP 4.1 on a VMware/bare-metal environment and checked the Prometheus /targets page (see the attached picture): some kubelet endpoints show "server returned HTTP status 500 Internal Server Error".

Take https://139.178.76.11:10250/metrics for example; it is on the compute-0 node:

$ oc get node -o wide | grep compute-0
compute-0   Ready   worker   24h   v1.13.4+27816e1b1   139.178.76.11   139.178.76.11   Red Hat Enterprise Linux CoreOS 410.8.20190517.0 (Ootpa)   4.18.0-80.1.2.el8_0.x86_64   cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8

Log in to this node and run the curl command below; it returns an error:
*****************************************************************
$ curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://139.178.76.11:10250/metrics
An error has occurred during metrics collection:

collected metric kubelet_container_log_filesystem_used_bytes label:<name:"container" value:"sdn" > label:<name:"namespace" value:"openshift-sdn" > label:<name:"pod" value:"sdn-snhfk" > gauge:<value:122880 > was collected before with the same name and label values
*****************************************************************
$ sudo crictl ps -a | grep sdn
de29abd833b73   9a4ee92f8b0b7aa5e61862a782b1dc2f79aef3f10672906acd99b1ff22c4a0f9   10 minutes ago   Running   sdn   4   7ec7c2aee096c
15b510d75c6d4   9a4ee92f8b0b7aa5e61862a782b1dc2f79aef3f10672906acd99b1ff22c4a0f9   20 hours ago     Exited    sdn   3   7ec7c2aee096c
*****************************************************************
$ oc -n openshift-sdn get po sdn-snhfk -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
sdn-snhfk   1/1     Running   4          24h   139.178.76.11   compute-0   <none>           <none>
*****************************************************************
$ oc -n openshift-sdn get po sdn-snhfk -oyaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2019-05-21T02:27:14Z
  generateName: sdn-
  labels:
    app: sdn
    component: network
    controller-revision-hash: 597c997bdd
    openshift.io/component: network
    pod-template-generation: "1"
    type: infra
  name: sdn-snhfk
  namespace: openshift-sdn
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: sdn
    uid: f25a96ba-7b6f-11e9-8ae8-0050568bbe97
  resourceVersion: "940577"
  selfLink: /api/v1/namespaces/openshift-sdn/pods/sdn-snhfk
  uid: f25e750e-7b6f-11e9-8ae8-0050568bbe97
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - compute-0
  containers:
  - command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -euo pipefail

      # if another process is listening on the cni-server socket, wait until it exits
      trap 'kill $(jobs -p); rm -f /etc/cni/net.d/80-openshift-network.conf ; exit 0' TERM
      retries=0
      while true; do
        if echo 'test' | socat - UNIX-CONNECT:/var/run/openshift-sdn/cni-server.sock &>/dev/null; then
          echo "warning: Another process is currently listening on the CNI socket, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently listening on the CNI socket, exiting" 2>&1
          exit 1
        fi
      done

      # local environment overrides
      if [[ -f /etc/sysconfig/openshift-sdn ]]; then
        set -o allexport
        source /etc/sysconfig/openshift-sdn
        set +o allexport
      fi

      #BUG: cdc accidentally mounted /etc/sysconfig/openshift-sdn as DirectoryOrCreate; clean it up so we can ultimately mount /etc/sysconfig/openshift-sdn as FileOrCreate
      # Once this is released, then we can mount it properly
      if [[ -d /etc/sysconfig/openshift-sdn ]]; then
        rmdir /etc/sysconfig/openshift-sdn || true
      fi

      # Take over network functions on the node
      rm -f /etc/cni/net.d/80-openshift-network.conf
      cp -f /opt/cni/bin/* /host/opt/cni/bin/

      # Launch the network process
      exec /usr/bin/openshift-sdn --config=/config/sdn-config.yaml --url-only-kubeconfig=/etc/kubernetes/kubeconfig --loglevel=${DEBUG_LOGLEVEL:-2}
    env:
    - name: OPENSHIFT_DNS_DOMAIN
      value: cluster.local
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6528d6c46b61b0ea653e82fe0ea34a2186fcb22815edc655c8b9ac1ecffc18b6
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - rm
          - -f
          - /etc/cni/net.d/80-openshift-network.conf
          - /host/opt/cni/bin/openshift-sdn
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: healthz
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: sdn
    ports:
    - containerPort: 10256
      hostPort: 10256
      name: healthz
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - test
        - -f
        - /etc/cni/net.d/80-openshift-network.conf
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
    securityContext:
      privileged: true
      procMount: Default
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /config
      name: config
      readOnly: true
    - mountPath: /etc/kubernetes/kubeconfig
      name: host-kubeconfig
      readOnly: true
    - mountPath: /var/run
      name: host-var-run
    - mountPath: /var/run/dbus/
      name: host-var-run-dbus
      readOnly: true
    - mountPath: /var/run/openvswitch/
      name: host-var-run-ovs
      readOnly: true
    - mountPath: /var/run/kubernetes/
      name: host-var-run-kubernetes
      readOnly: true
    - mountPath: /var/run/openshift-sdn
      name: host-var-run-openshift-sdn
    - mountPath: /host
      name: host-slash
    - mountPath: /host/opt/cni/bin
      name: host-cni-bin
    - mountPath: /etc/cni/net.d
      name: host-cni-netd
    - mountPath: /var/lib/cni/networks/openshift-sdn
      name: host-var-lib-cni-networks-openshift-sdn
    - mountPath: /lib/modules
      name: host-modules
      readOnly: true
    - mountPath: /etc/sysconfig
      name: etc-sysconfig
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: sdn-token-bcn77
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  nodeName: compute-0
  nodeSelector:
    beta.kubernetes.io/os: linux
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: sdn
  serviceAccountName: sdn
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 420
      name: sdn-config
    name: config
  - hostPath:
      path: /etc/kubernetes/kubeconfig
      type: ""
    name: host-kubeconfig
  - hostPath:
      path: /etc/sysconfig
      type: ""
    name: etc-sysconfig
  - hostPath:
      path: /lib/modules
      type: ""
    name: host-modules
  - hostPath:
      path: /var/run
      type: ""
    name: host-var-run
  - hostPath:
      path: /var/run/dbus
      type: ""
    name: host-var-run-dbus
  - hostPath:
      path: /var/run/openvswitch
      type: ""
    name: host-var-run-ovs
  - hostPath:
      path: /var/run/kubernetes
      type: ""
    name: host-var-run-kubernetes
  - hostPath:
      path: /var/run/openshift-sdn
      type: ""
    name: host-var-run-openshift-sdn
  - hostPath:
      path: /
      type: ""
    name: host-slash
  - hostPath:
      path: /var/lib/cni/bin
      type: ""
    name: host-cni-bin
  - hostPath:
      path: /etc/kubernetes/cni/net.d
      type: ""
    name: host-cni-netd
  - hostPath:
      path: /var/lib/cni/networks/openshift-sdn
      type: ""
    name: host-var-lib-cni-networks-openshift-sdn
  - name: sdn-token-bcn77
    secret:
      defaultMode: 420
      secretName: sdn-token-bcn77
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-05-21T02:27:14Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-05-22T02:01:17Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2019-05-22T02:01:17Z
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-05-21T02:27:14Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://de29abd833b734d03f4d07e156ab34eea950dc818f8fb00f9c17079f27e0e0ec
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6528d6c46b61b0ea653e82fe0ea34a2186fcb22815edc655c8b9ac1ecffc18b6
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6528d6c46b61b0ea653e82fe0ea34a2186fcb22815edc655c8b9ac1ecffc18b6
    lastState:
      terminated:
        containerID: cri-o://15b510d75c6d44229a54981fd488b4b5beedbd642a3559d49cfa1e28b9222819
        exitCode: 0
        finishedAt: 2019-05-22T02:01:05Z
        reason: Completed
        startedAt: 2019-05-21T06:22:52Z
    name: sdn
    ready: true
    restartCount: 4
    state:
      running:
        startedAt: 2019-05-22T02:01:09Z
  hostIP: 139.178.76.11
  phase: Running
  podIP: 139.178.76.11
  qosClass: Burstable
  startTime: 2019-05-21T02:27:14Z
*****************************************************************

Version-Release number of selected component (if applicable):
payload: 4.1.0-0.nightly-2019-05-18-050636

$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"27816e1b1", GitTreeState:"clean", BuildDate:"2019-05-17T23:03:34Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+f2cc675", GitCommit:"f2cc675", GitTreeState:"clean", BuildDate:"2019-05-18T00:35:43Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
Frequently

Steps to Reproduce:
1. Deploy OCP 4.1 on a VMware/bare-metal environment, let it run for a few days, then check the Prometheus /targets page.

Actual results:
"server returned HTTP status 500 Internal Server Error" for some kubelet endpoints.

Expected results:
The kubelet /metrics endpoints should not return this error.

Additional info:
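For anyone trying to reproduce the 500 outside a cluster: the error text comes from the Prometheus Go client, which fails the whole gather when the same metric name and label set is collected twice. A minimal standalone sketch (the collector type and values here are illustrative, not kubelet code) that triggers the same failure:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

type logFSCollector struct {
	desc *prometheus.Desc
}

func (c *logFSCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect emits the gauge twice with identical label values, mimicking a
// kubelet that reports both the Running and the Exited instance of a
// restarted container (the sdn container above has restartCount 4).
func (c *logFSCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, 122880,
			"sdn", "openshift-sdn", "sdn-snhfk",
		)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&logFSCollector{
		desc: prometheus.NewDesc(
			"kubelet_container_log_filesystem_used_bytes",
			"Bytes used by the container's logs on the filesystem.",
			[]string{"container", "namespace", "pod"}, nil,
		),
	})
	// Gather fails with "was collected before with the same name and
	// label values", the same class of error seen in the curl output.
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}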
If I'm not wrong, it is fixed by https://github.com/kubernetes/kubernetes/pull/77426.
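I haven't walked through the exact diff, but the failure mode suggests the fix amounts to emitting the log-filesystem gauge at most once per (namespace, pod, container), so a leftover Exited instance of a restarted container can no longer duplicate the Running one. A rough sketch of that dedup (all identifiers here are illustrative, not the actual kubelet names):

package kubeletstatsfix

import "github.com/prometheus/client_golang/prometheus"

// containerLogStat is an illustrative stand-in for the kubelet's
// per-container log statistics.
type containerLogStat struct {
	namespace, pod, container string
	usedBytes                 float64
}

// emitLogMetrics sends the gauge once per (namespace, pod, container)
// triple, skipping any second instance of the same container.
func emitLogMetrics(desc *prometheus.Desc, stats []containerLogStat, ch chan<- prometheus.Metric) {
	seen := map[[3]string]bool{}
	for _, s := range stats {
		key := [3]string{s.namespace, s.pod, s.container}
		if seen[key] {
			continue // duplicate label set, e.g. the Exited instance
		}
		seen[key] = true
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue,
			s.usedBytes, s.container, s.namespace, s.pod)
	}
}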
I think this can be deferred to 4.1.1; it does not significantly impact the usability of the cluster.
Agreed, we should backport the linked PR as well: https://github.com/kubernetes/kubernetes/pull/77426
Ryan, can you backport this to Origin master (4.2)? I'll clone this bug for the 4.1.z backport.
PR: https://github.com/openshift/origin/pull/22889
Moving back to ASSIGNED because we also need this: https://github.com/prometheus/client_golang/pull/513
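For context on why a client_golang change is involved at all (I'm not reproducing that PR's diff here): the promhttp handler's error-handling mode is what decides whether a gather error surfaces as an HTTP 500 or as a partial scrape. A minimal sketch of that mechanism, using a stand-in registry:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry() // stand-in for the kubelet's registry

	// HTTPErrorOnError turns any gather error (like the duplicate-metric
	// one above) into a 500 for the entire scrape; ContinueOnError would
	// serve whatever was collected successfully and only log the failure.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.HTTPErrorOnError, // vs. promhttp.ContinueOnError
	}))
	log.Fatal(http.ListenAndServe(":9999", nil)) // illustrative port
}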
*** Bug 1716913 has been marked as a duplicate of this bug. ***
*** Bug 1716914 has been marked as a duplicate of this bug. ***
The second PR is only needed for 4.1 so, I'm going to clone to track the backport of both PRs to 4.1 and move this back to MODIFIED.
Tested with 4.2.0-0.nightly-2019-06-27-204704. On the Prometheus /targets page, checked all 10250/metrics endpoints; they are all UP.
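For anyone re-verifying without clicking through the UI, a small helper equivalent to the manual curl check above (this is my own sketch; PROM_SA_TOKEN is an assumed env var holding the output of `oc sa get-token prometheus-k8s -n openshift-monitoring`):

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

// Takes node IPs as arguments and prints the HTTP status of each kubelet
// /metrics endpoint; every node should report "200 OK" after the fix.
func main() {
	client := &http.Client{Transport: &http.Transport{
		// Matches curl -k: the kubelet serves a self-signed certificate.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	for _, node := range os.Args[1:] {
		req, err := http.NewRequest("GET", "https://"+node+":10250/metrics", nil)
		if err != nil {
			fmt.Println(node, "error:", err)
			continue
		}
		req.Header.Set("Authorization", "Bearer "+os.Getenv("PROM_SA_TOKEN"))
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println(node, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(node, resp.Status)
	}
}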
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922