Bug 1708602

Summary: Got error from machine-config-daemon when the rhel worker is adding to the cluster but cluster is applying new rendered machineconfig
Product: OpenShift Container Platform Reporter: weiwei jiang <wjiang>
Component: Machine Config OperatorAssignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA QA Contact: Micah Abbott <miabbott>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0CC: amurdaca, walters, xtian
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:48:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description weiwei jiang 2019-05-10 10:22:18 UTC
Description of problem:
When the rhel worker is adding to the cluster and the cluster is applying new rendered machineconfig at the same time, machineconfig daemon got error "Marking Degraded due to: failed to log pending config: exit status 1".



Version-Release number of selected component (if applicable):
oc get clusterversion                                                      
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-09-182710   True        False         4h50m   Cluster version is 4.1.0-0.nightly-2019-05-09-182710


How reproducible:
Always

Steps to Reproduce:
1. Make new kubelet machineconfig for workers
2. Add new rhel worker to the cluster during the cluster is applying new machineconfig
3.

Actual results:
# oc get nodes -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
...
===> node:> dell-r730-068.dsal.lab.eng.rdu2.redhat.com
machineconfiguration.openshift.io/currentConfig : rendered-worker-9425bc18523ebb54a929a7db9813a343
machineconfiguration.openshift.io/desiredConfig : rendered-worker-02875877c08cddc36cabb21b7788801c
machineconfiguration.openshift.io/reason : failed to log pending config: exit status 1
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Degraded
volumes.kubernetes.io/controller-managed-attach-detach : true


# oc -n openshift-machine-config-operator logs -f machine-config-daemon-vbs9h
I0510 06:52:22.794488   23426 update.go:170] Checking reconcilable for config rendered-worker-9425bc18523ebb54a929a7db9813a343 to rendered-worker-02875877c08cddc36cabb21b7788801c
I0510 06:52:22.795668   23426 update.go:743] Starting update from rendered-worker-9425bc18523ebb54a929a7db9813a343 to rendered-worker-02875877c08cddc36cabb21b7788801c
I0510 06:52:22.796636   23426 update.go:372] Updating files
I0510 06:52:22.796648   23426 update.go:574] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0510 06:52:23.334889   23426 update.go:574] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0510 06:52:23.935607   23426 update.go:574] Writing file "/var/lib/kubelet/config.json"
I0510 06:52:24.569906   23426 update.go:574] Writing file "/etc/kubernetes/ca.crt"
I0510 06:52:24.978786   23426 update.go:574] Writing file "/etc/sysctl.d/forward.conf"
I0510 06:52:25.377527   23426 update.go:574] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0510 06:52:25.785718   23426 update.go:574] Writing file "/etc/containers/registries.conf"
I0510 06:52:26.421934   23426 update.go:574] Writing file "/etc/containers/storage.conf"
I0510 06:52:26.827820   23426 update.go:574] Writing file "/etc/crio/crio.conf"
I0510 06:52:27.397023   23426 update.go:574] Writing file "/etc/kubernetes/cloud.conf"
I0510 06:52:27.866888   23426 update.go:574] Writing file "/etc/kubernetes/kubelet.conf"
I0510 06:52:28.446934   23426 update.go:574] Writing file "/etc/kubernetes/kubelet.conf"
I0510 06:52:28.832434   23426 update.go:529] Writing systemd unit "kubelet.service"
I0510 06:52:29.278484   23426 update.go:546] Enabling systemd unit "kubelet.service"
I0510 06:52:29.278529   23426 update.go:466] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0510 06:52:29.278541   23426 update.go:391] Deleting stale data
I0510 06:52:29.278558   23426 update.go:636] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0510 06:52:29.689316   23426 update.go:636] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0510 06:52:30.295779   23426 update.go:372] Updating files
I0510 06:52:30.295804   23426 update.go:574] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0510 06:52:30.702501   23426 update.go:574] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0510 06:52:31.253447   23426 update.go:574] Writing file "/var/lib/kubelet/config.json"
I0510 06:52:31.847678   23426 update.go:574] Writing file "/etc/kubernetes/ca.crt"
I0510 06:52:32.275698   23426 update.go:574] Writing file "/etc/sysctl.d/forward.conf"
I0510 06:52:32.885524   23426 update.go:574] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0510 06:52:33.289385   23426 update.go:574] Writing file "/etc/containers/registries.conf"
I0510 06:52:33.930205   23426 update.go:574] Writing file "/etc/containers/storage.conf"
I0510 06:52:34.496840   23426 update.go:574] Writing file "/etc/crio/crio.conf"
I0510 06:52:34.913196   23426 update.go:574] Writing file "/etc/kubernetes/cloud.conf"
I0510 06:52:35.118717   23426 update.go:574] Writing file "/etc/kubernetes/kubelet.conf"
I0510 06:52:35.611156   23426 update.go:529] Writing systemd unit "kubelet.service"
I0510 06:52:36.004440   23426 update.go:546] Enabling systemd unit "kubelet.service"
I0510 06:52:36.004469   23426 update.go:466] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0510 06:52:36.004475   23426 update.go:391] Deleting stale data
E0510 06:52:36.004497   23426 writer.go:132] Marking Degraded due to: failed to log pending config: exit status 1


Expected results:
Should work well

Additional info:

Comment 1 Antonio Murdaca 2019-05-10 10:38:11 UTC
Can you provide full steps to reproduce?

Comment 2 Antonio Murdaca 2019-05-10 10:52:39 UTC
I've opened https://github.com/openshift/machine-config-operator/pull/733 to provide further debug when merged.

Comment 3 Antonio Murdaca 2019-05-10 11:03:34 UTC
[root@dell-r730-068 ~]# logger --journald <<EOF
> MESSAGE_ID=machine-config-daemon-pending-state
> MESSAGE=rendered-worker-02875877c08cddc36cabb21b7788801c
> BOOT_ID=983da81ad27042e29688465f2201b1f2
> PENDING=1
> EOF
logger: unrecognized option '--journald'



The logger version shipped with rhel 7.6 hasn't that --journald options, and is thus failing.

Comment 5 Colin Walters 2019-05-13 12:54:08 UTC
Does this one need to be marked as a blocker?

Comment 6 Antonio Murdaca 2019-05-13 12:58:33 UTC
(In reply to Colin Walters from comment #5)
> Does this one need to be marked as a blocker?

uhm, not sure, same goes for https://bugzilla.redhat.com/show_bug.cgi?id=1707162 which doesn't have blocker but both have the 4.1.0 target

Comment 8 weiwei jiang 2019-05-14 10:06:17 UTC
Need wait new build to have a try.

Comment 9 weiwei jiang 2019-05-15 08:27:07 UTC
Checked with 4.1.0-0.nightly-2019-05-14-202907, and rhel 7.6 worker still can not work well now.


# oc get nodes -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
...
===> node:> dell-r730-068.dsal.lab.eng.rdu2.redhat.com
machineconfiguration.openshift.io/currentConfig : rendered-worker-dd652a6e0f249230ef0b9fe1f5f241e9
machineconfiguration.openshift.io/desiredConfig : rendered-worker-41eed7b1596539a2ea0028a651704126
machineconfiguration.openshift.io/reason : failed to log pending config: logger: unrecognized option '--journald'

Usage:
 logger [options] [message]

Options:
 -T, --tcp             use TCP only
 -d, --udp             use UDP only
 -i, --id              log the process ID too
 -f, --file <file>     log the contents of this file
 -h, --help            display this help text and exit
 -S, --size <num>      maximum size for a single message (default 1024)
 -n, --server <name>   write to this remote syslog server
 -P, --port <port>     use this port for UDP or TCP connection
 -p, --priority <prio> mark given message with this priority
 -s, --stderr          output message to standard error as well
 -t, --tag <tag>       mark every line with this tag
 -u, --socket <socket> write to this Unix socket
 -V, --version         output version information and exit

: exit status 1
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Degraded
volumes.kubernetes.io/controller-managed-attach-detach : true

Comment 10 Antonio Murdaca 2019-05-15 08:45:48 UTC
The branching to 4.1 happened and it didn't pick https://github.com/openshift/machine-config-operator/pull/734
Backported. https://github.com/openshift/machine-config-operator/pull/754

Comment 11 Antonio Murdaca 2019-05-15 11:55:26 UTC
Backport merged. Next nightly should have that.

Comment 12 Antonio Murdaca 2019-05-15 17:17:36 UTC
This nightly payload now contains the backport to be tested by QE https://openshift-release.svc.ci.openshift.org/releasestream/4.1.0-0.nightly/release/4.1.0-0.nightly-2019-05-15-151517

Comment 13 weiwei jiang 2019-05-16 07:12:45 UTC
Checked with 4.1.0-0.nightly-2019-05-15-151517 and rhel worker work well for machineconfig updating, so move to verified.

Comment 15 errata-xmlrpc 2019-06-04 10:48:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758