Description of problem:
When we create a MachineConfig resource to deploy a file with permissions numerically bigger than decimal 511 (octal 0777), the MachineConfigPool becomes degraded, the config daemon reports a config drift error, and the pool cannot be recovered by deleting the MC and editing the desiredConfig value.

Version-Release number of MCO (Machine Config Operator) (if applicable):
$ oc get co machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.10.0-0.nightly-2022-01-07-004348   True        False         False      5h37m

Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Y

How reproducible: Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:
1. Create a MachineConfig resource deploying a file using a mode bigger than decimal 511 (octal 0777)

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-test-file-permissions
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,MCO%20test%20file%20permissions%0A
        path: /etc/mco-test-file-permissions
        mode: 512
EOF

Actual results:
The worker pool is degraded, and we can see this message in the config daemon logs:

I0107 11:50:22.717823 1612 daemon.go:1198] Validating against pending config rendered-worker-d262b13e390b5082d2cf843819138dba
E0107 11:50:22.723320 1612 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-d262b13e390b5082d2cf843819138dba: mode mismatch for file: "/etc/mco-test-file-permissions"; expected: ----------/512/01000; received: ----------/0/0

Expected results:
If permissions numerically bigger than 511 (0777) are allowed, no error should happen and the permissions should be set properly.
If permissions numerically bigger than 511 (0777) are not allowed, a validation should be done so that we don't end up in an error state that cannot be recovered. This validation should report the right cause of the problem (the MachineConfig resource defining forbidden permissions for a file).

Additional info:
Notice that if permissions numerically bigger than 511 (0777) are not allowed, then users cannot configure things like the sticky bit, decimal 1023 (octal 1777), or the setuid and setgid bits.
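To make the decimal/octal arithmetic in the report easier to follow, here is a minimal Go sketch that splits a decimal mode, as written in the MachineConfig, into its special bits (setuid, setgid, sticky) and its nine permission bits. The helper name splitUnixMode is made up for illustration and is not part of the MCO or Ignition.

```go
package main

import "fmt"

// splitUnixMode is a hypothetical helper (not MCO or Ignition code).
// It separates a numeric UNIX mode into its special bits
// (setuid 04000, setgid 02000, sticky 01000) and its nine
// read/write/execute permission bits (0777).
func splitUnixMode(mode int) (special, perm int) {
	return mode & 07000, mode & 0777
}

func main() {
	// 511 is the largest "plain" permission value, 512 is the value from
	// the reproducer, and 1023 is the sticky-bit example from the report.
	for _, m := range []int{511, 512, 1023} {
		special, perm := splitUnixMode(m)
		fmt.Printf("decimal %d = octal %o: special=%o perm=%o\n", m, m, special, perm)
	}
}
```

For the reproducer's mode of 512, the special part is 01000 (the sticky bit) and the permission part is 0, which lines up with the "expected: 512/01000; received: 0/0" mismatch in the daemon log.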
Overall, this behavior isn't new but rather has existed and has been (mostly) dormant. What's changed is that we're executing this code more often and writing better test cases.

Under the hood, here's what's happening:

1. When the file is created by the MCD [1], the os.Chmod [2] function ignores the sticky, setuid, and setgid bits when they are given in their UNIX octal positions, because only the 9 least significant bits (read / write / execute for user, group, and other) are considered standard UNIX permissions [3]. See [6] for a Go Playground link which illustrates this.
2. Because of this, the permissions are set, but the mode bits are not.
3. When the config drift detection code runs, it identifies a mismatch between what's on disk vs. what the MachineConfig specifies. In Sergio's provided example which sets 01000, we can see that stat'ing the file shows that its permission and mode bits are set to 0 because of the truncation.

In addition to #1, one can see that Golang has its own internal representation of file mode bits [3]. What this means is that a valid (to Golang's internal file mode bits, anyway) representation of octal 01777 (one that os.Chmod would set correctly) would be 04000777 (1049087, decimal). While passing this into os.Chmod works, it fails Ignition validation [5], which rejects any file mode value greater than 07777. To my understanding, Golang keeps its own internal representation of this for portability across OSes [7], so this particular behavior is considered a feature.

This appears to be consistent with the current behavior in Ignition [4], meaning that if one were to use Ignition in an out-of-MCO context, the file mode would not be set correctly.

=========
Refs:
1. https://github.com/openshift/machine-config-operator/blob/release-4.10/pkg/daemon/update.go#L1567-L1600
2. https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/os/file.go;l=521-539
3. https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/io/fs/fs.go;l=166-167
4. https://github.com/coreos/ignition/blob/v2.7.0/internal/exec/util/file.go#L163-L180
5. https://github.com/coreos/ignition/blob/v2.7.0/config/v3_2/types/mode.go#L21-L26
6. https://go.dev/play/p/iLVNsA3Kf_y
7. https://github.com/golang/go/issues/25539
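The truncation described above can be illustrated with a short Go sketch. It is not the Playground program from [6], only an independent illustration using the standard os.FileMode type; the function names permOnly and stickyWithAllPerms are made up for this example.

```go
package main

import (
	"fmt"
	"os"
)

// permOnly shows what survives when a raw UNIX mode is cast directly to
// os.FileMode and reduced to its permission bits with Perm().
// (Hypothetical helper name, for illustration only.)
func permOnly(raw uint32) os.FileMode {
	return os.FileMode(raw).Perm()
}

// stickyWithAllPerms returns Go's own representation of "sticky bit plus
// rwxrwxrwx", built from os.ModeSticky instead of the UNIX 01000 bit.
func stickyWithAllPerms() uint32 {
	return uint32(os.ModeSticky | 0777)
}

func main() {
	// The UNIX sticky bit (01000) lies outside the nine permission bits,
	// so Perm() drops it.
	fmt.Printf("Perm() of 01000: %o\n", uint32(permOnly(01000)))
	fmt.Printf("Perm() of 01777: %o\n", uint32(permOnly(01777)))

	// Go keeps the sticky bit at a different position in FileMode.
	v := stickyWithAllPerms()
	fmt.Printf("os.ModeSticky|0777: octal %o, decimal %d\n", v, v)
}
```

The last value is the 04000777 (1049087) figure mentioned above: it is larger than 07777, which is why it cannot be written into an Ignition config directly.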
After explaining this to a non-engineer, I came up with a more succinct and easier-to-follow summary of my comment above:

- File permissions (read / write / execute for user, group, and other) are conferred by the three rightmost octal digits, e.g., 0755.
- Special file modes (sticky bit, setuid, setgid) are conferred by a fourth digit to the left of those three, e.g., 01755.
- What's happening is that Golang's internal representation of file modes places the special file mode bits in a different location (e.g., 04000755 for 01755).
- When os.Chmod() attempts to apply the file mode bits, it discards anything to the left of the file permission bits (e.g., the 1 in 01755, which becomes 0755) since it ultimately calls the .Perm() method on the FileMode object to do this.
- While the os.Chmod() function (or rather, the ones it delegates to) can set the special file mode bits, it will only do so if they're set in the different location referenced above. So some additional logic is needed to determine if any special file mode bits are set and to adjust the internal Golang representation accordingly.
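The "additional logic" from the last point could look roughly like the following Go sketch. This is an assumption about the shape of such a conversion, not the fix that was actually merged into the MCO or Ignition; the function name unixModeToFileMode is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// unixModeToFileMode is a hypothetical helper (not the actual MCO or
// Ignition code). It converts a numeric UNIX mode such as 01755 into an
// os.FileMode, moving the special bits to the positions Go expects.
func unixModeToFileMode(mode int) os.FileMode {
	// Start with the nine permission bits only.
	fm := os.FileMode(mode).Perm()

	// Translate each UNIX special bit into Go's own flag.
	if mode&01000 != 0 {
		fm |= os.ModeSticky
	}
	if mode&02000 != 0 {
		fm |= os.ModeSetgid
	}
	if mode&04000 != 0 {
		fm |= os.ModeSetuid
	}
	return fm
}

func main() {
	for _, m := range []int{0755, 01755, 02755, 04755} {
		fm := unixModeToFileMode(m)
		fmt.Printf("%05o -> %v (internal octal %o)\n", m, fm, uint32(fm))
	}
}
```

The resulting value could then be passed to os.Chmod, while the MachineConfig and Ignition config keep the ordinary UNIX value (at most 07777) that Ignition validation accepts.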
I was able to reproduce this issue within Ignition too, so I've opened a bug against Ignition: https://github.com/coreos/ignition/issues/1301.
Verified using

$ oc get co machine-config
NAME             VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.10.0-0.ci-2022-01-26-000911   True        False         False      3h57m

- When the MC with the wrong permissions was created, the worker pool became Degraded
- When the offending MC was deleted, the worker pool stopped being Degraded
- When a MC using valid permissions (0777) was created, the nodes were configured properly and without errors
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056