Description of problem:
I observed that the cluster-image-registry-operator-watch log was completely full of the same message, repeating forever...
W0716 14:19:07.792149 1 cmd.go:197] Unable to determine PID for "cluster-image-registry-operator" (retry: 49425, err: <nil>)
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.1     True        False         16h     Cluster version is 4.5.1
This is the first time that I have observed this particular issue. I have verified that this is happening on 3 out of 4 of my production clusters...
It's happening on 1 of my 4.5.1 clusters and 2 of my 4.4.13 clusters.
Steps to Reproduce:
1. $ oc logs -n openshift-image-registry cluster-image-registry-operator-<UID> -c cluster-image-registry-operator-watch
2. Observe that the log is full of the same message.
The "cluster-image-registry-operator-watch" container is unable to determine the PID of the operator it is supposed to be watching.
The container should be able to determine the PID of the operator and watch over it.
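For context on what that lookup involves: a process in one container is only discoverable from another container's /proc if the pod shares a PID namespace. Below is a minimal, hypothetical sketch of such a lookup (find_pid is an illustration, not the operator's actual code):

```shell
# find_pid NAME — hypothetical sketch: scan /proc for a process whose argv[0]
# basename matches NAME, print its PID, or fail with a message similar to the
# one flooding the operator-watch log. Not the watchdog's real implementation.
find_pid() {
  target="$1"
  for d in /proc/[0-9]*; do
    # First argv entry; /proc/<pid>/cmdline is NUL-separated.
    cmd=$(tr '\0' '\n' 2>/dev/null < "$d/cmdline" | head -n 1)
    [ -n "$cmd" ] || continue
    case "$(basename "$cmd")" in
      "$target") echo "${d#/proc/}"; return 0 ;;
    esac
  done
  echo "Unable to determine PID for \"$target\"" >&2
  return 1
}
```

Run inside the watch container, a lookup like `find_pid cluster-image-registry-operator` would fail in exactly this way when the containers do not share a PID namespace, because /proc then only lists the watch container's own processes.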
Created attachment 1701408 [details]
operator log file
Created attachment 1701409 [details]
operator-watch log file
I could not replicate this; could you clarify what platform you are running the cluster on?
If you can, could you run the following commands and report back here their output? That might help out in debugging.
$ oc rsh -c cluster-image-registry-operator-watch cluster-image-registry-operator-<id>
$ ps -ef
<copy pid for /usr/bin/cluster-image-registry-operator process>
$ ls -la /proc/<pid>/exe
Unfortunately, it does not look like I can access that particular container:
$ oc rsh cluster-image-registry-operator-5d8b676fdd-z7c7d -c cluster-image-registry-operator-watch
Defaulting container name to cluster-image-registry-operator.
Use 'oc describe pod/cluster-image-registry-operator-5d8b676fdd-z7c7d -n openshift-image-registry' to see all of the containers in this pod.
ERRO exec failed: container_linux.go:348: starting container process caused "exec: \"-c\": executable file not found in $PATH"
exec failed: container_linux.go:348: starting container process caused "exec: \"-c\": executable file not found in $PATH"
command terminated with exit code 1
I was able to access the container on a different cluster (experiencing the same issue):
$ oc rsh -c cluster-image-registry-operator-watch cluster-image-registry-operator-57494b6774-hvcl6
sh-4.2$ ps -ef
UID        PID  PPID  C  STIME  TTY    TIME      CMD
1000010+     1     0  0  Jul29  ?      00:06:18  cluster-image-registry-operator-watch file-watcher-watchdog --namespace=openshift-image-registry --process-name=cluster-image-registry-operator --termination-grace-period=30s --files=/va
1000010+    22     0  0  13:32  pts/0  00:00:00  /bin/sh
1000010+    28    22  0  13:33  pts/0  00:00:00  ps -ef
sh-4.2$ ls -la /proc/
1/ cgroups devices fb ioports keys loadavg modules partitions self/ sys/ tty/ xen/
22/ cmdline diskstats filesystems irq/ kmsg locks mounts pressure/ slabinfo sysrq-trigger uptime zoneinfo
acpi/ consoles dma fs/ kallsyms kpagecgroup mdstat mtrr sched_debug softirqs sysvipc/ version
buddyinfo cpuinfo driver/ interrupts kcore kpagecount meminfo net/ schedstat stat thread-self/ vmallocinfo
bus/ crypto execdomains iomem key-users kpageflags misc pagetypeinfo scsi/ swaps timer_list vmstat
sh-4.2$ ls -la /proc/1/
attr/ clear_refs cpuset fd/ limits mem net/ oom_score personality schedstat smaps_rollup status timerslack_ns
autogroup cmdline cwd/ fdinfo/ loginuid mountinfo ns/ oom_score_adj projid_map sessionid stack syscall uid_map
auxv comm environ gid_map map_files/ mounts numa_maps pagemap root/ setgroups stat task/ wchan
cgroup coredump_filter exe io maps mountstats oom_adj patch_state sched smaps statm timers
sh-4.2$ ls -la /proc/1/exe
lrwxrwxrwx. 1 1000010000 root 0 Jul 29 20:45 /proc/1/exe -> /usr/bin/cluster-image-registry-operator-watch
We have noticed that in the faulty clusters the operator has been deployed without the shared process namespace directive present on the manifest. I am forwarding this BZ to CVO team for further investigation.
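For reference, the directive in question is a pod-spec field. A simplified sketch of what the healthy Deployment is expected to carry (field names are the real Kubernetes API; the surrounding structure is abbreviated and not the operator's actual manifest):

```yaml
# Abbreviated sketch; only shareProcessNamespace is the point here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-image-registry-operator
  namespace: openshift-image-registry
spec:
  template:
    spec:
      # With this true, the watch sidecar can see the operator's PID in
      # its own /proc; without it, each container sees only itself.
      shareProcessNamespace: true
      containers:
      - name: cluster-image-registry-operator
      - name: cluster-image-registry-operator-watch
```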
> We have noticed that in the faulty clusters the operator has been deployed without the shared process namespace directive present on the manifest.
I can confirm that the CVO is indeed ignoring that pod property, which would mean that whatever the value was when the manifest was created would be preserved regardless of the value in future manifests (until we fix the CVO-side bug). I am not clear on how that would lead to 4.4 and 4.5 clusters exhibiting "Unable to determine PID" because of the CVO shortcoming, because it seems like the registry operator only began using the property in 4.2:
$ git --no-pager log --oneline -G shareProcessNamespace origin/release-4.2..origin/master -- manifests
$ git --no-pager log --oneline -G shareProcessNamespace origin/release-4.1..origin/master -- manifests
cc9e9fe05 (origin/pr/364) Integrating watchdog as a sidecar to registry operator.
3803d25ff Revert "Integrating watchdog as a sidecar to registry operator."
ffbb403ef (origin/pr/342) Integrating watchdog as a sidecar to registry operator.
Were all of the affected clusters born in 4.1?
These clusters are (very) long-lived clusters...
Currently exhibiting this issue:
Not impacted by this issue:
So do we have a story around why the born-in-4.2.2 cluster is showing 'Unable to determine PID'? If we don't, I'll spin off the CVO-should-manage-shareProcessNamespace bit into a new bug, and reset the title here.
None that I am aware of at this time.
Ok, spun the CVO's handling of shareProcessNamespace out into bug 1866554, and reset this one back to the registry folks so they can figure out what's happening with the 4.2 cluster (which should not have a divergent shareProcessNamespace setting, but may diverge from the intended manifest on other properties).
Still awaiting further investigation; postponed due to higher-priority bugs.
Can we get yaml for the deployment/pods from the born-in-4.2.2 cluster?
I'll go ahead and gather the pieces that you requested. If you're interested, DM me your SSH key and I can give you admin access to the cluster itself.
I double-checked and this property was set to true on 4.2.2 clusters. IIUC this should have been created as such by the CVO on a fresh 4.2.2 install:
$ oc adm release info 4.2.2 --commits | grep image-registry-operator
cluster-image-registry-operator https://github.com/openshift/cluster-image-registry-operator 017a9fcc24253e356dd58d32fb9fe18a071901d3
$ git checkout 017a9fcc24253e356dd58d32fb9fe18a071901d3
HEAD is now at 017a9fcc2 Merge pull request #399 from ricardomaraschini/storage-cleanup
$ grep share manifests/*
manifests/07-operator.yaml: shareProcessNamespace: true
Passing this back to CVO folks. Let me know if there is something else I can do.
That looks like a dupe of 1866554
We spun bug 1866554 out of this ticket. If that is indeed the only issue with these clusters, yeah, let's just reverse comment 16 and go back to MODIFIED.
Assigning back to Ricardo, who will understand the impact of this issue well enough to fill in the doc text ;).
Trevor, the rest of the bug is also about CVO.
The born-in-4.2.2 cluster doesn't have shareProcessNamespace, while the manifest has this property. That means the CVO created this resource incorrectly (or maybe somebody decided to change it later; I guess we don't have that information).
Feel free to mark this bug as a DUPLICATE if BZ1866554 also solves the problem with new clusters.
Bug 1866554 has nothing to do with new clusters. This issue here is that:
* Born in 4.1 clusters had no shareProcessNamespace.
* Registry folks added shareProcessNamespace=true in 4.2, but because of bug 1866554, the CVO did not set it when updating born-in-4.1 clusters.
*  dropped shareProcessNamespace in 4.6, but because of bug 1866554, the CVO did not set it when updating born-before-4.6 clusters.
* Bug 1866554 (backported to 4.5 with bug 1868478, which is not yet released) fixed the CVO handling, but the CVO treats manifests where *bool properties are unset as "manifest author does not care". This fixes born-in-4.1 clusters that update to fixed 4.5 releases, because they'll finally get shareProcessNamespace=true. I have no idea what receiving that change means for the function of the registry operator on those clusters.
* This bug's fix landed  with an explicit shareProcessNamespace=false, so when born-before- clusters update to a release that contains this bug's fix, the CVO will set shareProcessNamespace=false, as the registry folks intended in . I have no idea what receiving that change means for the function of the registry operator on those clusters.
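To make the *bool semantics concrete, here is a simplified illustration (not the actual registry manifests) of the three manifest states the fixed CVO distinguishes:

```yaml
# State 1 — explicit true (the 4.2-4.5 manifests): the fixed CVO
# reconciles the in-cluster Deployment to shareProcessNamespace: true.
spec:
  template:
    spec:
      shareProcessNamespace: true
---
# State 2 — field omitted entirely: the fixed CVO treats an unset *bool
# as "manifest author does not care" and leaves the in-cluster value alone.
spec:
  template:
    spec: {}
---
# State 3 — explicit false (the 4.6 manifest after this bug's fix):
# the fixed CVO enforces false, even on born-before-4.6 clusters.
spec:
  template:
    spec:
      shareProcessNamespace: false
```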
My take on release notes:
Cause: The cluster-version operator was not updating shareProcessNamespace on the image registry deployment.
Consequence: In 4.6, we began setting shareProcessNamespace=false; from 4.2 through 4.5 it had been shareProcessNamespace=true. That meant that clusters which were installed as 4.2, 4.3, 4.4, or 4.5 clusters would, when updating to 4.6, still have shareProcessNamespace=true and $WHATEVER_GOES_WRONG_WHEN_THAT_HAPPENS.
Fix: The cluster-version operator was fixed to manage shareProcessNamespace and the image registry deployment was fixed to explicitly set shareProcessNamespace=false.
Result: $WHATEVER_GOES_WRONG_WHEN_THAT_HAPPENS no longer happens.
Do you have recommended wording for $WHATEVER_GOES_WRONG_WHEN_THAT_HAPPENS?
Nothing will happen; I don't care about shareProcessNamespace in 4.6 that much. It's a leftover that we want to remove.
In 4.2-4.5 we had a separate container to watch for CA bundle and TLS certificate changes. As part of IR-16 it was integrated into the main container, so this option is not needed anymore. As 4.6 has only one container, it doesn't affect anything.
Probably we should've created a no-doc BZ for  instead of using this one. This one is kind of related because it's about the leftovers from the container this BZ is about. But it doesn't fix this BZ; we used it because we had to use something.
What this BZ is about, and what we care about, is that we should have shareProcessNamespace set in 4.2-4.5. Per comment 10, the born-in-4.2.2 cluster doesn't have shareProcessNamespace set, but the 4.2.2 manifests have `true` for this option. That's the only thing that is left in this BZ.
Trevor, do you want to investigate it? Otherwise I'm going to close this BZ as a DUPLICATE of BZ1866554.
> What this BZ is about and we care about is that we should have shareProcessNamespace set in 4.2-4.5.
The title of this bug mentions "Unable to determine PID". That isn't directly:
registry-operator has shareProcessNamespace=false for born-in-4.1 clusters which have updated to 4.2 through 4.5
If we retitle this bug to be about that, I agree that it can be closed as a dup of bug 1866554. But if you _really_ don't care about shareProcessNamespace in 4.6, you should revert  and unlink from this bug, because leaving the property unset in your manifest is the way to tell the CVO you don't care. I don't think we can close this bug as a dup of 1866554 as long as we have a PR still attached here.
If there is actually some connection to "Unable to determine PID" behavior, I think this bug is the place to talk about what that connection is.
> The title of this bug mentions "Unable to determine PID". That isn't directly:
That's arguable. "Unable to determine PID" is a direct consequence of shareProcessNamespace=false (one container doesn't see another one's processes).
> you should revert  and unlink from this bug, because leaving the property unset in your manifest is the way to tell the CVO you don't care.
I unlinked this PR from this BZ, but I won't revert it or I'll need to merge it again, and that's pointless bureaucracy. I think we already have enough information about this PR in this BZ. It doesn't need doc updates, as the "problem"  and the "fix"  landed in 4.6 before 4.6.0 is released.
Ideally both  and  should've been squashed into , but we didn't notice this field and didn't know about CVO details.
The reason I want to keep  is that I don't want to have different operator deployments in born-in-4.5 and brand-new-4.6 clusters. It means we should have this field set to false, because the CVO doesn't work the way `kubectl apply` works. That's not because it fixes some problem, but because I want to have things consistent.
As the only open question was about , I'm closing this BZ.
*** This bug has been marked as a duplicate of bug 1866554 ***
> I unlinked this PR from this BZ, but I won't revert it or I'll need to merge it again, and that's pointless bureaucracy.
I don't see much benefit to "we will close out this bug as a dup and leave registry-operator/591 undocumented" over "we will close out this bug linking registry-operator#591 with an errata link and No Doc Update", but if you are comfortable with the dangling PR, and also comfortable with us not having discussed registry-specific impacts in our bug 1866554 doc text, I'm fine with this conclusion.