Bug 1857782

Summary: cluster-image-registry-operator-watch container continuously logging "Unable to determine PID"
Product: OpenShift Container Platform
Reporter: brad.williams
Component: Image Registry
Assignee: Ricardo Maraschini <rmarasch>
Status: CLOSED DUPLICATE
QA Contact: Wenjing Zheng <wzheng>
Severity: medium
Priority: unspecified
Version: 4.5
CC: aos-bugs, jokerman, obulatov, scuppett, vrutkovs, wking
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-09-03 23:04:36 UTC

Attachments:
operator log file
operator-watch log file

Description brad.williams 2020-07-16 14:53:06 UTC
Description of problem:
I observed that the cluster-image-registry-operator-watch log was completely filled with the same message repeating forever:

W0716 14:19:07.792149       1 cmd.go:197] Unable to determine PID for "cluster-image-registry-operator" (retry: 49425, err: <nil>)


Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.1     True        False         16h     Cluster version is 4.5.1


How reproducible:
This is the first time that I have observed this particular issue.  I have verified that this is happening on 3 out of 4 of my production clusters...

It's happening on 1 of my 4.5.1 clusters and 2 of my 4.4.13 clusters.


Steps to Reproduce:
1. $ oc logs -n openshift-image-registry cluster-image-registry-operator-<UID> -c cluster-image-registry-operator-watch
2. Observe that the log is full of the same message.

Actual results:
The "cluster-image-registry-operator-watch" container is unable to determine the PID of the operator it is supposed to be watching.

Expected results:
The container should be able to determine the PID of the operator and watch over it.
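
For context, a watchdog sidecar like this typically finds its target by scanning /proc for a matching process. A minimal Python sketch of that lookup (the function name and the argv[0] matching rule are assumptions for illustration, not the operator's actual code):

```python
import os

def find_pid_by_name(name, proc_root="/proc"):
    """Scan proc_root for a process whose argv[0] basename matches name.

    Returns the PID, or None if no matching process is visible --
    which is exactly what happens when containers in a pod do not
    share a process namespace (shareProcessNamespace: false): each
    container's /proc shows only its own processes.
    """
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                argv0 = f.read().split(b"\x00")[0].decode()
        except OSError:
            continue  # process exited between listdir and open
        if os.path.basename(argv0) == name:
            return int(entry)
    return None
```

Under that model, the watch container never finds the operator's PID, gets None back on every pass, and retries forever, producing the repeating log line above.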

Comment 1 brad.williams 2020-07-16 14:54:40 UTC
Created attachment 1701408 [details]
operator log file

Comment 2 brad.williams 2020-07-16 14:55:40 UTC
Created attachment 1701409 [details]
operator-watch log file

Comment 4 Ricardo Maraschini 2020-08-05 13:02:50 UTC
I could not replicate this. Could you clarify which platform you are running the cluster on?
If you can, please run the following commands and report their output back here; that might help with debugging.

$ oc rsh -c cluster-image-registry-operator-watch  cluster-image-registry-operator-<id>
$ ps -ef
<copy pid for /usr/bin/cluster-image-registry-operator process>
$ ls -la /proc/<pid>/exe

Comment 5 brad.williams 2020-08-05 13:28:05 UTC
Hi Ricardo,
Unfortunately, it does not look like I can access that particular container:

$ oc rsh cluster-image-registry-operator-5d8b676fdd-z7c7d -c cluster-image-registry-operator-watch
Defaulting container name to cluster-image-registry-operator.
Use 'oc describe pod/cluster-image-registry-operator-5d8b676fdd-z7c7d -n openshift-image-registry' to see all of the containers in this pod.
ERRO[0000] exec failed: container_linux.go:348: starting container process caused "exec: \"-c\": executable file not found in $PATH" 
exec failed: container_linux.go:348: starting container process caused "exec: \"-c\": executable file not found in $PATH"
command terminated with exit code 1

Comment 6 brad.williams 2020-08-05 13:34:57 UTC
I was able to access the container on a different cluster (experiencing the same issue):

$ oc rsh -c cluster-image-registry-operator-watch cluster-image-registry-operator-57494b6774-hvcl6
sh-4.2$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
1000010+       1       0  0 Jul29 ?        00:06:18 cluster-image-registry-operator-watch file-watcher-watchdog --namespace=openshift-image-registry --process-name=cluster-image-registry-operator --termination-grace-period=30s --files=/va
1000010+      22       0  0 13:32 pts/0    00:00:00 /bin/sh
1000010+      28      22  0 13:33 pts/0    00:00:00 ps -ef
sh-4.2$ ls -la /proc/
1/             cgroups        devices        fb             ioports        keys           loadavg        modules        partitions     self/          sys/           tty/           xen/           
22/            cmdline        diskstats      filesystems    irq/           kmsg           locks          mounts         pressure/      slabinfo       sysrq-trigger  uptime         zoneinfo       
acpi/          consoles       dma            fs/            kallsyms       kpagecgroup    mdstat         mtrr           sched_debug    softirqs       sysvipc/       version        
buddyinfo      cpuinfo        driver/        interrupts     kcore          kpagecount     meminfo        net/           schedstat      stat           thread-self/   vmallocinfo    
bus/           crypto         execdomains    iomem          key-users      kpageflags     misc           pagetypeinfo   scsi/          swaps          timer_list     vmstat         
sh-4.2$ ls -la /proc/1/
attr/            clear_refs       cpuset           fd/              limits           mem              net/             oom_score        personality      schedstat        smaps_rollup     status           timerslack_ns
autogroup        cmdline          cwd/             fdinfo/          loginuid         mountinfo        ns/              oom_score_adj    projid_map       sessionid        stack            syscall          uid_map
auxv             comm             environ          gid_map          map_files/       mounts           numa_maps        pagemap          root/            setgroups        stat             task/            wchan
cgroup           coredump_filter  exe              io               maps             mountstats       oom_adj          patch_state      sched            smaps            statm            timers           
sh-4.2$ ls -la /proc/1/exe 
lrwxrwxrwx. 1 1000010000 root 0 Jul 29 20:45 /proc/1/exe -> /usr/bin/cluster-image-registry-operator-watch
sh-4.2$

Comment 8 Ricardo Maraschini 2020-08-05 15:58:06 UTC
We have noticed that in the faulty clusters the operator has been deployed without the shared process namespace[1] directive present in the manifest. I am forwarding this BZ to the CVO team for further investigation.



[1] https://github.com/openshift/cluster-image-registry-operator/blob/release-4.5/manifests/07-operator.yaml#L20
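
For reference, the directive in question lives in the pod template of the operator Deployment. A trimmed sketch of the relevant part of the 4.5 manifest (fields other than shareProcessNamespace are abbreviated here):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-image-registry-operator
  namespace: openshift-image-registry
spec:
  template:
    spec:
      # Lets the watch sidecar see the main operator's PID in /proc.
      # On the faulty clusters this field was absent, and it defaults
      # to false (each container gets its own process namespace).
      shareProcessNamespace: true
      containers:
      - name: cluster-image-registry-operator
        # ...
      - name: cluster-image-registry-operator-watch
        # ...
```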

Comment 9 W. Trevor King 2020-08-05 20:31:12 UTC
> We have noticed that in the faulty clusters the operator has been deployed without the shared process namespace[1] directive present on the manifest.

I can confirm that the CVO is indeed ignoring that pod property, which would mean that whatever the value was when the manifest was first created is preserved regardless of the value in future manifests (until we fix the CVO-side bug). I am not clear on how the CVO shortcoming would lead to 4.4 and 4.5 clusters exhibiting "Unable to determine PID", because it seems like the registry operator only began using the property in 4.2:

$ git --no-pager log --oneline -G shareProcessNamespace origin/release-4.2..origin/master -- manifests
...no hits...
$ git --no-pager log --oneline -G shareProcessNamespace origin/release-4.1..origin/master -- manifests
cc9e9fe05 (origin/pr/364) Integrating watchdog as a sidecar to registry operator.
3803d25ff Revert "Integrating watchdog as a sidecar to registry operator."
ffbb403ef (origin/pr/342) Integrating watchdog as a sidecar to registry operator.

Were all of the affected clusters born in 4.1?

Comment 10 brad.williams 2020-08-05 20:45:30 UTC
These clusters are (very) long-lived clusters...

Currently exhibiting this issue:
     us-east-2: 4.1.21
  ca-central-1: 4.1.15
     us-west-1: 4.2.2

Not impacted by this issue:
     us-east-1: 4.2.7

Comment 11 W. Trevor King 2020-08-05 20:47:34 UTC
So do we have a story around why the born-in-4.2.2 cluster is showing 'Unable to determine PID'?  If we don't, I'll spin off the CVO-should-manage-shareProcessNamespace bit into a new bug, and reset the title here.

Comment 12 brad.williams 2020-08-05 21:01:02 UTC
None that I am aware at this time.

Comment 13 W. Trevor King 2020-08-05 21:23:50 UTC
Ok, spun the CVO's handling of shareProcessNamespace out into bug 1866554, and reset this one back to the registry folks so they can figure out what's happening with the 4.2 cluster (which should not have a divergent shareProcessNamespace setting, but may diverge from the intended manifest on other properties).

Comment 16 Ricardo Maraschini 2020-08-21 12:39:08 UTC
Still awaiting further investigation; postponed due to higher-priority bugs.

Comment 17 Oleg Bulatov 2020-08-24 19:57:07 UTC
Can we get yaml for the deployment/pods from the born-in-4.2.2 cluster?

Comment 18 brad.williams 2020-08-24 20:15:21 UTC
Hi Oleg,
I'll go ahead and gather the pieces that you requested. If you're interested, you can DM me your SSH key and I can give you admin access to the cluster itself.

Thanks,
    Brad

Comment 21 Ricardo Maraschini 2020-09-01 14:22:47 UTC
I double-checked, and this property was set to true on 4.2.2 clusters. IIUC, it should have been created as such by the CVO on a fresh 4.2.2 install:

$ oc adm release info 4.2.2 --commits |grep image-registry-operator
  cluster-image-registry-operator               https://github.com/openshift/cluster-image-registry-operator               017a9fcc24253e356dd58d32fb9fe18a071901d3
$ git checkout 017a9fcc24253e356dd58d32fb9fe18a071901d3
HEAD is now at 017a9fcc2 Merge pull request #399 from ricardomaraschini/storage-cleanup
$ grep share manifests/*
manifests/07-operator.yaml:      shareProcessNamespace: true
$

Passing this back to CVO folks. Let me know if there is something else I can do.

Comment 22 Vadim Rutkovsky 2020-09-01 14:50:08 UTC
That looks like a dupe of 1866554

Comment 23 W. Trevor King 2020-09-01 20:43:09 UTC
We spun bug 1866554 out of this ticket.  If that is indeed the only issue with these clusters, then yeah, let's just reverse comment 16 and go back to MODIFIED.

Comment 25 W. Trevor King 2020-09-02 15:43:02 UTC
Assigning back to Ricardo, who will understand the impact of this issue well enough to fill in the doc text ;).

Comment 26 Oleg Bulatov 2020-09-02 16:00:31 UTC
Trevor, the rest of the bug is also about CVO.

The born-in-4.2.2 cluster doesn't have shareProcessNamespace, while the manifest has this property. That means the CVO created this resource incorrectly (or maybe somebody changed it later; I guess we don't have that information).

Feel free to mark this bug as a DUPLICATE if BZ1866554 also solves the problem with new clusters.

Comment 27 W. Trevor King 2020-09-02 23:30:11 UTC
Bug 1866554 has nothing to do with new clusters.  This issue here is that:

* Born in 4.1 clusters had no shareProcessNamespace.
* Registry folks added shareProcessNamespace=true in 4.2, but because of bug 1866554, the CVO did not set it when updating born-in-4.1 clusters.
* [1] dropped shareProcessNamespace in 4.6, but because of bug 1866554, the CVO did not set it when updating born-before-4.6 clusters.
* Bug 1866554 (backported to 4.5 with bug 1868478, which is not yet released) fixed the CVO handling, but the CVO treats manifests where *bool properties are unset as "manifest author does not care".  This fixes born-in-4.1 clusters that update to fixed 4.5 releases, because they'll finally get shareProcessNamespace=true.  I have no idea what receiving that change means for the function of the registry operator.
* This bug landed [2] with an explicit shareProcessNamespace=false, so when born-before-[1] clusters update to a release that contains this bug's fix, the CVO will set shareProcessNamespace=false, as the registry folks intended in [1].  I have no idea what receiving that change means for the function of the registry operator.
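
The *bool handling described above amounts to a small three-valued merge: an unset (nil/None) value in the manifest means "leave whatever the cluster has", while an explicit true or false is enforced. A minimal Python illustration (the function name is mine, not the CVO's actual code):

```python
def reconcile_share_pns(manifest_value, in_cluster_value):
    """Three-valued merge for an optional bool like shareProcessNamespace.

    manifest_value is None when the manifest omits the field: the
    manifest author "does not care", so the in-cluster value survives.
    An explicit True/False in the manifest is enforced on the cluster.
    """
    if manifest_value is None:
        return in_cluster_value
    return manifest_value

# Born-in-4.1 cluster (False) updating to a fixed 4.5 whose manifest
# sets True: the fixed CVO finally flips the deployment to True.
assert reconcile_share_pns(True, False) is True
# 4.6 manifest explicitly sets False, overriding older clusters' True.
assert reconcile_share_pns(False, True) is False
# Field omitted from the manifest: the cluster keeps its current value.
assert reconcile_share_pns(None, True) is True
```

This also shows why dropping the field from the 4.6 manifest (instead of setting it to false) would have left born-before-4.6 clusters with their old value forever.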

My take on release notes:

Cause: The cluster-version operator was not updating shareProcessNamespace on the image registry deployment.

Consequence: In 4.6, we began setting shareProcessNamespace=false; from 4.2 through 4.5 it had been shareProcessNamespace=true.  That meant that clusters installed as 4.2, 4.3, 4.4, or 4.5 clusters would, when updating to 4.6, still have shareProcessNamespace=true and $WHATEVER_GOES_WRONG_WHEN_THAT_HAPPENS.

Fix: The cluster-version operator was fixed to manage shareProcessNamespace and the image registry deployment was fixed to explicitly set shareProcessNamespace=false.

Result: $WHATEVER_GOES_WRONG_WHEN_THAT_HAPPENS no longer happens.

Do you have recommended wording for $WHATEVER_GOES_WRONG_WHEN_THAT_HAPPENS?

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/587
[2]: https://github.com/openshift/cluster-image-registry-operator/pull/591

Comment 28 Oleg Bulatov 2020-09-03 09:25:43 UTC
Nothing will happen; I don't care about shareProcessNamespace in 4.6 that much. It's a leftover that we want to remove.

In 4.2-4.5 we had a separate container to watch for CA bundle and TLS certificate changes. As part of IR-16 it was integrated into the main container, so this option is not needed anymore. Since in 4.6 we have only one container, it doesn't affect anything.

We probably should've created a no-doc BZ for [1] instead of using this one. This one is somewhat related, because [1] removes the leftovers of the container this BZ is about, but it doesn't fix this BZ; we used it because we had to use something.

What this BZ is about, and what we care about, is that shareProcessNamespace should be set in 4.2-4.5. Per comment 10, the born-in-4.2.2 cluster doesn't have shareProcessNamespace set, but the 4.2.2 manifests have `true` for this option. That's the only thing left in this BZ.

Trevor, do you want to investigate it? Otherwise I'm going to close this BZ as a DUPLICATE of BZ1866554.

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/591

Comment 29 W. Trevor King 2020-09-03 20:27:59 UTC
> What this BZ is about and we care about is that we should have shareProcessNamespace set in 4.2-4.5.

The title of this bug mentions "Unable to determine PID".  That isn't directly the same as:

  registry-operator has shareProcessNamespace=false for born-in-4.1 clusters which have updated to 4.2 through 4.5

If we retitle this bug to be about that, I agree that it can be closed as a dup of bug 1866554.  But if you _really_ don't care about shareProcessNamespace in 4.6, you should revert [1] and unlink it from this bug, because leaving the property unset in your manifest is the way to tell the CVO you don't care.  I don't think we can close this bug as a dup of 1866554 as long as we have a PR still attached here.

If there is actually some connection to "Unable to determine PID" behavior, I think this bug is the place to talk about what that connection is.

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/591

Comment 30 Oleg Bulatov 2020-09-03 23:04:36 UTC
> The title of this bug mentions "Unable to determine PID".  That isn't directly:

That's arguable. "Unable to determine PID" is a direct consequence of shareProcessNamespace=false (one container cannot see another container's processes).

> you should revert [1] and unlink from this bug, because leaving the property unset in your manifest is the way to tell the CVO you don't care.

I unlinked the PR from this BZ, but I won't revert it, or I'll just need to merge it again, and that's pointless bureaucracy. I think we already have enough information about this PR in this BZ. It doesn't need doc updates, as the "problem" [2][3] and the "fix" [1] both landed in 4.6 before 4.6.0 was released.

Ideally both [1] and [2] should've been squashed into [3], but we didn't notice this field and didn't know about CVO details.

The reason I want to keep [1] is that I don't want different operator deployments in born-in-4.5 and brand-new-4.6 clusters. That means we should have this field explicitly set to false, because the CVO doesn't work the way `kubectl apply` works. It's not that it fixes some problem; I want to keep things consistent.

As the only open question was about [1], I'm closing this BZ.

[1]: https://github.com/openshift/cluster-image-registry-operator/pull/591
[2]: https://github.com/openshift/cluster-image-registry-operator/pull/587
[3]: https://github.com/openshift/cluster-image-registry-operator/pull/575

*** This bug has been marked as a duplicate of bug 1866554 ***

Comment 31 W. Trevor King 2020-09-03 23:19:10 UTC
> I unlinked this PR from this BZ, but I won't revert it or I'll need to merge it again, and that's pointless bureaucracy.

I don't see much benefit to "we will close out this bug as a dup and leave registry-operator#591 undocumented" over "we will close out this bug, linking registry-operator#591 with an errata link and No Doc Update", but if you are comfortable with the dangling PR, and also comfortable with us not having discussed registry-specific impacts in our bug 1866554 doc text, I'm fine with this conclusion.