Description of problem:
It is confusing to figure out which machine-config-daemon (MCD) pod's logs to check, because the message does not explicitly include the pod's hash suffix; it only points at the generic command `oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash>`. This makes troubleshooting more difficult.

Expected results:
The message should state which pod to check.
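Until the message names the pod, one workaround is to look it up yourself: MCD pods run one per node, so a field selector on the node name isolates the right one. A minimal sketch, assuming a placeholder node name (the `oc` commands are shown as comments since they require a live cluster):

```shell
# Placeholder node name -- substitute the node being drained.
NODE=worker-0.example.com
# Field selector matching pods scheduled on that node.
SELECTOR="spec.nodeName=${NODE}"
echo "$SELECTOR"
# Against a live cluster, this narrows the MCD pods to the one on that node:
#   oc -n openshift-machine-config-operator get pods --field-selector "$SELECTOR"
#   oc -n openshift-machine-config-operator logs machine-config-daemon-<hash> -c machine-config-daemon
```

The same selector trick is used later in this verification to confirm which machine-config-daemon pod is on the drained worker.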
Created attachment 1780869 [details]
Verification
Verified on 4.8.0-0.nightly-2021-05-07-075528. Triggered the mcd_drain_err alert with the steps below, then looked at Prometheus.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-07-075528   True        False         67m     Cluster version is 4.8.0-0.nightly-2021-05-07-075528

$ cd openshift/
$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     7s

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-9wr6012-f76d1-z7bjv-master-0         Ready    master   89m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-master-1         Ready    master   89m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-master-2         Ready    master   89m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-b-gwn8c   Ready    worker   80m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-c-c2ndb   Ready    worker   80m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-d-2sc2x   Ready    worker   80m   v1.21.0-rc.0+291e731

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ci-ln-9wr6012-f76d1-z7bjv-worker-b-gwn8c"} } }' --image=docker.io/busybox dont-evict-this-pod -- sleep 1h
pod/dont-evict-this-pod created

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          7s

$ cat file-ig3.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
          filesystem: root
          mode: 0644
          path: /etc/test

$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get nodes
NAME                                       STATUS                     ROLES    AGE    VERSION
ci-ln-9wr6012-f76d1-z7bjv-master-0         Ready                      master   100m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-master-1         Ready                      master   100m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-master-2         Ready                      master   100m   v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-b-gwn8c   Ready,SchedulingDisabled   worker   91m    v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-c-c2ndb   Ready                      worker   91m    v1.21.0-rc.0+291e731
ci-ln-9wr6012-f76d1-z7bjv-worker-d-2sc2x   Ready                      worker   91m    v1.21.0-rc.0+291e731

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-617f5a3d5cb1e6c6b2e34b8a7294d683   True      False      False      3              3                   3                     0                      98m
worker   rendered-worker-15776fcc21358742a1f4cb79346b7d50   False     True       False      3              1                   1                     0                      98m

$ oc -n openshift-monitoring get routes
NAME                HOST/PORT                                                                                             PATH   SERVICES            PORT    TERMINATION          WILDCARD
alertmanager-main   alertmanager-main-openshift-monitoring.apps.ci-ln-9wr6012-f76d1.origin-ci-int-gce.dev.openshift.com          alertmanager-main   web     reencrypt/Redirect   None
grafana             grafana-openshift-monitoring.apps.ci-ln-9wr6012-f76d1.origin-ci-int-gce.dev.openshift.com                    grafana             https   reencrypt/Redirect   None
prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.ci-ln-9wr6012-f76d1.origin-ci-int-gce.dev.openshift.com             prometheus-k8s      web     reencrypt/Redirect   None
thanos-querier      thanos-querier-openshift-monitoring.apps.ci-ln-9wr6012-f76d1.origin-ci-int-gce.dev.openshift.com             thanos-querier      web     reencrypt/Redirect   None

$ oc -n openshift-console get routes
NAME        HOST/PORT                                                                                  PATH   SERVICES    PORT    TERMINATION          WILDCARD
console     console-openshift-console.apps.ci-ln-9wr6012-f76d1.origin-ci-int-gce.dev.openshift.com            console     https   reencrypt/Redirect   None
downloads   downloads-openshift-console.apps.ci-ln-9wr6012-f76d1.origin-ci-int-gce.dev.openshift.com          downloads   http    edge/Redirect        None

$ oc get pods -A --field-selector spec.nodeName=ci-ln-9wr6012-f76d1-z7bjv-worker-b-gwn8c
NAMESPACE                                NAME                           READY   STATUS    RESTARTS   AGE
default                                  dont-evict-this-pod            1/1     Running   0          13m
openshift-cluster-csi-drivers            gcp-pd-csi-driver-node-l9sm4   3/3     Running   0          94m
openshift-cluster-node-tuning-operator   tuned-qt6d5                    1/1     Running   0          94m
openshift-dns                            dns-default-hl5bn              2/2     Running   0          93m
openshift-dns                            node-resolver-v7vdx            1/1     Running   0          94m
openshift-image-registry                 node-ca-r4mnv                  1/1     Running   0          94m
openshift-ingress-canary                 ingress-canary-knnbf           1/1     Running   0          93m
openshift-machine-config-operator        machine-config-daemon-zgc8w    2/2     Running   0          94m
openshift-monitoring                     node-exporter-qkzkz            2/2     Running   0          94m
openshift-multus                         multus-pv4px                   1/1     Running   0          94m
openshift-multus                         network-metrics-daemon-zk2vz   2/2     Running   0          94m
openshift-network-diagnostics            network-check-target-wkr6m     1/1     Running   0          94m
openshift-sdn                            sdn-dnb8q                      2/2     Running   0          94m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-zgc8w -c machine-config-daemon
E0507 18:59:57.986704    1970 daemon.go:330] WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/dont-evict-this-pod; ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/gcp-pd-csi-driver-node-l9sm4, openshift-cluster-node-tuning-operator/tuned-qt6d5, openshift-dns/dns-default-hl5bn, openshift-dns/node-resolver-v7vdx, openshift-image-registry/node-ca-r4mnv, openshift-ingress-canary/ingress-canary-knnbf, openshift-machine-config-operator/machine-config-daemon-zgc8w, openshift-monitoring/node-exporter-qkzkz, openshift-multus/multus-pv4px, openshift-multus/network-metrics-daemon-zk2vz, openshift-network-diagnostics/network-check-target-wkr6m, openshift-sdn/sdn-dnb8q
I0507 18:59:57.992498    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 18:59:58.001489    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:03.004519    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:03.012675    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:08.013652    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:08.023949    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:13.027718    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:13.038685    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:18.042846    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:18.054457    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:23.055472    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:23.067515    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:28.071655    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:28.081238    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:33.082191    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:33.092854    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:38.097446    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:38.105617    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0507 19:00:43.108816    1970 daemon.go:330] evicting pod default/dont-evict-this-pod
E0507 19:00:43.118708    1970 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
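As a side check on the MachineConfig used to trigger the update, its `data:` URL payload can be decoded locally to see exactly what the MCD would write to /etc/test (the base64 string below is copied verbatim from file-ig3.yaml):

```shell
# Base64 payload copied from the MachineConfig's data URL (the part after "base64,").
PAYLOAD='c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK'
# Decode it; this is the file content delivered to /etc/test on each worker.
printf '%s' "$PAYLOAD" | base64 -d
```

The decoded content is three harmless "server ... maxdelay 0.4 offline" lines, i.e. the config change only exists to force a worker-pool rollout and the drain.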
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438