Bug 1518599

Summary: engine-setup upgrade to postgresql 9.5 sometimes fails due to missing selinux policy
Product: Red Hat Enterprise Linux 7 Reporter: Petr Lautrbach <plautrba>
Component: selinux-policyAssignee: Lukas Vrabec <lvrabec>
Status: CLOSED WONTFIX QA Contact: Milos Malik <mmalik>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.7CC: bugs, didi, edwardh, lvrabec, mgrepl, mmalik, plautrba, pstehlik, ssekidde, zpytela
Target Milestone: rcKeywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1518253 Environment:
Last Closed: 2019-03-14 10:24:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Petr Lautrbach 2017-11-29 09:40:23 UTC
+++ This bug was initially created as a clone of Bug #1518253 +++

Description of problem:

engine-setup fails with:

[ERROR] Failed to execute stage 'Misc configuration': Failed to start service 'rh-postgresql95-postgresql'

audit log has:

type=AVC msg=audit(1511854191.913:1057): avc:  denied  { getattr } for  pid=14752 comm="postgres" path="/var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf" dev="dm-0" ino=396983 scontext=system_u:system_r:postgresql_t:s0 tcontext=unconfined_u:object_r:var_t:s0 tclass=file

We run this upgrade on CI, and it works there, so there is obviously something specific to the machine this happened on. Not sure what exactly.

Comparing to a machine where it does work, I saw this difference, when running 'semanage fcontext --list': On the good machine, I have:

==========================================
SELinux Local fcontext Equivalence 

/opt/rh/rh-postgresql95/root = /
/var/opt/rh/rh-postgresql95 = /var
/etc/opt/rh/rh-postgresql95 = /etc
==========================================

This is missing on the bad machine.

This seems to be handled by the contents of the file:

selinux/targeted/contexts/files/file_contexts.subs

which is empty on the bad machine, and on the good one has:
==========================================
/opt/rh/rh-postgresql95/root /
/var/opt/rh/rh-postgresql95 /var
/etc/opt/rh/rh-postgresql95 /etc
==========================================

It seems to be created by the postinstall scriptlet of the rpm package rh-postgresql95-runtime. Based on this, I verified the following workaround:

yum reinstall rh-postgresql95-runtime

Based on logs of the bad machine, and the timestamp of the above file, it seems to have been broken by the systemd one-time service selinux-policy-migrate-local-changes. But it did not output anything to the log except for Starting/Started, so can't be sure.

Version-Release number of selected component (if applicable):

Current master

How reproducible:
Not sure

Steps to Reproduce:
1. Have a pre-postgresql9.5 setup. Current was old master (4.2), not sure it affects also 4.1
2. Try to upgrade as usual - yum update setup packages, engine-setup
3.

Actual results:
Fails as above

Expected results:
Succeeds

Additional info:

Workaround:

yum reinstall rh-postgresql95-runtime

and try again.

Opening this bug for reference, for now. Once we get more reports and/or find the root cause, we might close, move to selinux, apply some workaround during engine-setup, or something else.

--- Additional comment from Edward Haas on 2017-11-28 15:11 CET ---

Split by dd into two parts.

--- Additional comment from Edward Haas on 2017-11-28 15:14 CET ---

Split by dd into two parts.

--- Additional comment from Sandro Bonazzola on 2017-11-28 15:16:01 CET ---

Didi please get rh-postgresql95 maintainer involved, maybe some race condition in the order packages are installed due to some missing dep within rh-postgresql95 deptree.

--- Additional comment from Yedidyah Bar David on 2017-11-28 15:27:28 CET ---

Adding selinux-policy and rh-postgresql95 maintainers.

Pavel and Lukas, any of this makes sense?

--- Additional comment from Pavel Raiskup on 2017-11-28 15:39:55 CET ---

(In reply to Yedidyah Bar David from comment #0)
> type=AVC msg=audit(1511854191.913:1057): avc:  denied  { getattr } for 
> pid=14752 comm="postgres"
> path="/var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf" dev="dm-0"
> ino=396983 scontext=system_u:system_r:postgresql_t:s0
> tcontext=unconfined_u:object_r:var_t:s0 tclass=file

The postgresql.conf file should be of type postgresql_db_t.  If it is not
we need to find the reason why it is not.

> We run this upgrade on CI, and it works there, so there is obviously
> something specific to the machine this happened on. Not sure what exactly.

Do you have SELinux enabled in the CI?

> [...]
> It seems to be created by the postinstall scriptlet of the rpm package
> rh-postgresql95-runtime. Based on this, I verified the following workaround:
> 
> yum reinstall rh-postgresql95-runtime

I would guess the context is OK after postgresql installation,
and db initialization;  and that engine-setup changes the context later.
That script should be analyzed.

--- Additional comment from Yedidyah Bar David on 2017-11-28 16:22:51 CET ---

(In reply to Pavel Raiskup from comment #5)
> (In reply to Yedidyah Bar David from comment #0)
> > type=AVC msg=audit(1511854191.913:1057): avc:  denied  { getattr } for 
> > pid=14752 comm="postgres"
> > path="/var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf" dev="dm-0"
> > ino=396983 scontext=system_u:system_r:postgresql_t:s0
> > tcontext=unconfined_u:object_r:var_t:s0 tclass=file
> 
> The postgresql.conf file should be of type postgresql_db_t.  If it is not
> we need to find the reason why it is not.
> 
> > We run this upgrade on CI, and it works there, so there is obviously
> > something specific to the machine this happened on. Not sure what exactly.
> 
> Do you have SELinux enabled in the CI?

Yes

> 
> > [...]
> > It seems to be created by the postinstall scriptlet of the rpm package
> > rh-postgresql95-runtime. Based on this, I verified the following workaround:
> > 
> > yum reinstall rh-postgresql95-runtime
> 
> I would guess the context is OK after postgresql installation,
> and db initialization;  and that engine-setup changes the context later.
> That script should be analyzed.

As I wrote above, and discussed in private, engine-setup does not change the label on that file. It copies it from /var/lib/pgsql/data to the /var/opt location using python's shutil.copy2.

Adding also needinfo on Petr, who appears as the author of the script that is ran by selinux-policy-migrate-local-changes, selinux-policy-migrate-local-changes.sh. Petr - any idea?

--- Additional comment from Petr Lautrbach on 2017-11-28 21:27:39 CET ---

selinux-policy-migrate-local-changes.sh script should be run only once after a system is upgraded to rhel-7.3 or from SELinux userspace release < 2.4 to 2.5.

The only place where it touches file_contexts.subs is when it runs:

/usr/sbin/semanage export | /usr/sbin/semanage import

This generally should not fail.

At the same time there is the following line in selinux-policy.spec file which should prevent overwriting this file during selinux-policy update:

%config(noreplace) %{_sysconfdir}/selinux/%1/contexts/files/file_contexts.subs

So it looks like something else removed/erased file_contexts.subs. 

Are you able to reproduce it in a way that you have working setup and do something which results in empty file_contexts.subs ?

--- Additional comment from Yedidyah Bar David on 2017-11-29 07:45:02 CET ---

(In reply to Petr Lautrbach from comment #7)
> selinux-policy-migrate-local-changes.sh script should be run only once after
> a system is upgraded to rhel-7.3 or from SELinux userspace release < 2.4 to
> 2.5.
> 
> The only place where it touches file_contexts.subs is when it runs:
> 
> /usr/sbin/semanage export | /usr/sbin/semanage import
> 
> This generally should not fail.
> 
> At the same time there is the following line in selinux-policy.spec file
> which should prevent overwriting this file during selinux-policy update:
> 
> %config(noreplace)
> %{_sysconfdir}/selinux/%1/contexts/files/file_contexts.subs
> 
> So it looks like something else removed/erased file_contexts.subs. 

Perhaps, but I really have no idea what. Also, the timestamp of that file is ~ 1 second before selinux-policy-migrate-local-changes was "Started" (finished).

> 
> Are you able to reproduce it in a way that you have working setup and do
> something which results in empty file_contexts.subs ?

Didn't try yet.
I have a VM snapshotted and cloned from a machine in a state that has this file zeroed, I can give you access to it for further analysis. The attached sosreport is from the original machine.

Comment 2 Petr Lautrbach 2017-11-29 09:44:16 UTC
It seems to be some corner case when selinux-policy-3.13.1-102.el7.noarch is updated to 3.13.1-166.el7_4.7.noarch in a particular condition

Comment 3 Sandro Bonazzola 2017-11-29 10:25:44 UTC
Please consider backporting to 7.4.z

Comment 4 Lukas Vrabec 2017-12-07 09:57:27 UTC
I need to reproduce it on RHEL-7.2. But as Petr mentioned, it's some corner case of that script.

Comment 5 Lukas Vrabec 2018-04-24 09:18:38 UTC
Hi all, 

We're *not* able to reproduce it. Also in original bugzilla there is a workaround how to deal with this potential bugzilla [1]. Cloasing as WONTFIX. 

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1518253

Comment 6 Yedidyah Bar David 2019-02-19 07:25:59 UTC
Hi again, this happens again now on our CI system. I managed to reproduce it on my own machine, and succeeded, when using the same tools we use in CI. I see this in /var/log/messages:

Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: Starting Migrate local SELinux policy changes from the old store structure to the new structure...
Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: Starting Tell Plymouth To Write Out Runtime Data...
Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: Failed at step STDIN spawning /usr/libexec/selinux/selinux-policy-migrate-local-changes.sh: Inappropriate ioctl for device
Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: selinux-policy-migrate-local-changes: main process exited, code=exited, status=208/STDIN
Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: Failed to start Migrate local SELinux policy changes from the old store structure to the new structure.
Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: Unit selinux-policy-migrate-local-changes entered failed state.
Feb 18 06:56:55 lago-upgrade-from-release-suite-master-engine systemd: selinux-policy-migrate-local-changes failed.

Searching for 'Failed at step STDIN spawning' I found (also) bug 1049656. Looking there, I guess the root cause is something like:

1. systemd/kernel/whatever fails to assign a tty to selinux-policy-migrate-local-changes@.service when starting it.

2. This service has 'StandardInput=tty', so it fails running as above.

Our CI system uses lago, a small framework around libvirt to create several VMs and test oVirt using them. I am not that familiar with lago. Checking the running system, I see (with virsh dumpxml):

    <console type='pty' tty='/dev/pts/6'>
      <source path='/dev/pts/6'/>
      <target type='virtio' port='0'/>
      <alias name='console0'/>
    </console>

So perhaps something caused /dev/pts/6 to not be available and therefore the console to not be available, not sure.

Any idea why we need 'StandardInput=tty' for this service? I gave a brief look at selinux-policy-migrate-local-changes.sh and can't see something obvious that might require interaction. If we know about something specific there that might require a tty, perhaps allow passing an option to the script so that it runs this specific thing with '--quiet' or whatever option it has that makes it not require interaction.

Anyway, I think it's worth fixing also for systems that do not have a console. Makes sense?

Comment 7 Sandro Bonazzola 2019-02-19 08:01:31 UTC
Reopening for evaluation after comment #6 giving more details on the issue.
Proposing this bug for 7.6.z consideration.

Comment 9 Yedidyah Bar David 2019-02-24 09:49:37 UTC
Another update:

See also [1] re how we generate images in oVirt CI. TL; DR: We use virt-builder, and the specific image seems to be based on centos7.2. I guess it's due to the specific way we upgrade that we run into current bug. We created new images based on centos7.6 and updated CI to use them, and this seems to work.

Seems like virt-builder hard-codes 'console=tty0' in the kernel command line, but in CI we do not create the machines with a serial port, so /dev/console does not work.

So current bug is quite hard to reproduce, but still not sure it's worth to fix, especially if there is no concrete reason for 'StandardInput=tty'.

[1] https://docs.google.com/document/d/1fxXBcgHnQr3r2qrSzXiSlpNMMLWnbq899s8rUP38KHw/edit?usp=sharing

Comment 10 Yedidyah Bar David 2019-02-24 09:51:30 UTC
not sure it's _not_ worth to fix :-)

Comment 11 Zdenek Pytela 2019-03-14 10:24:29 UTC
This issue was not selected to be included in Red Hat Enterprise Linux 7.7 because it is seen either as low or moderate impact to a small number of use-cases. The next release will be in Maintenance Support 1 Phase, which means that qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available.

We will now close this issue, but if you believe that it qualifies for the Maintenance Support 1 Phase, please re-open; otherwise, we recommend moving the request to Red Hat Enterprise Linux 8 if applicable.