Bug 1761941 - insights-client kills rpm/yum processes leaving stale lock files in the rpm database [rhel-7.9.z]
Summary: insights-client kills rpm/yum processes leaving stale lock files in the rpm database [rhel-7.9.z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: insights-client
Version: 7.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: jcrafts
QA Contact: Pavol Kotvan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-15 16:25 UTC by Ryan Mullett
Modified: 2024-10-01 16:21 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 11:23:37 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5398681 0 None None None 2020-09-17 17:28:23 UTC
Red Hat Product Errata RHBA-2020:5462 0 None None None 2020-12-15 11:23:42 UTC

Internal Links: 1935846

Description Ryan Mullett 2019-10-15 16:25:19 UTC
Description of problem:
insights-client kills rpm/yum processes leaving stale lock files in the rpm database at /var/lib/rpm/__db*

Version-Release number of selected component (if applicable):
insights-client-3.0.6-2.el7_6.noarch

How reproducible:
Always on the affected systems; never on other systems.

Steps to Reproduce:
Have a yum process running while insights-client does automatically scheduled run

Actual results:
insights-client kills yum process leaving lock files in the rpm database

Expected results:
insights-client will not kill yum processes

Additional info:
Here is the data used to reach this conclusion:

From the audit logs, here is the yum PID (22351) that created the lock files. We have an audit rule set up to log every rpm database modification:

type=PROCTITLE msg=audit(1571031348.818:17961): proctitle=72706D002D71002D2D716600257B56455253494F4E7D2D257B52454C454153457D00696E7369676874732D636C69656E74
type=SYSCALL msg=audit(1571031356.824:17962): arch=c000003e syscall=2 success=yes exit=3 a0=109f070 a1=42 a2=1a4 a3=3 items=2 ppid=22350 pid=22351 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="yum" exe="/usr/bin/python2.7" key="rpm"
type=CWD msg=audit(1571031356.824:17962):  cwd="/"
type=PATH msg=audit(1571031356.824:17962): item=0 name="/var/lib/rpm/" inode=67155796 dev=fd:00 mode=040755 ouid=0 ogid=0 rdev=00:00 objtype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1571031356.824:17962): item=1 name="/var/lib/rpm/.dbenv.lock" inode=67155797 dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00 objtype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1571031356.824:17962): proctitle=2F7573722F62696E2F707974686F6E002F7573722F62696E2F79756D002D43007265706F6C697374
type=SYSCALL msg=audit(1571031356.824:17963): arch=c000003e syscall=2 success=no exit=-17 a0=1099f60 a1=c2 a2=1a4 a3=2 items=2 ppid=22350 pid=22351 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="yum" exe="/usr/bin/python2.7" key="rpm"
type=CWD msg=audit(1571031356.824:17963):  cwd="/"
type=PATH msg=audit(1571031356.824:17963): item=0 name="/var/lib/rpm/" inode=67155796 dev=fd:00 mode=040755 ouid=0 ogid=0 rdev=00:00 objtype=PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1571031356.824:17963): item=1 name="/var/lib/rpm/__db.001" inode=67167052 dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00 objtype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1571031356.824:17963): proctitle=2F7573722F62696E2F707974686F6E002F7573722F62696E2F79756D002D43007265706F6C697374
type=SYSCALL msg=audit(1571031356.824:17964): arch=c000003e syscall=2 success=yes exit=4 a0=1099f60 a1=2 a2=0 a3=2 items=1 ppid=22350 pid=22351 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="yum" exe="/usr/bin/python2.7" key="rpm"
type=CWD msg=audit(1571031356.824:17964):  cwd="/"
type=PATH msg=audit(1571031356.824:17964): item=0 name="/var/lib/rpm/__db.001" inode=67167052 dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00 objtype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1571031356.824:17964): proctitle=2F7573722F62696E2F707974686F6E002F7573722F62696E2F79756D002D43007265706F6C697374
type=SYSCALL msg=audit(1571031356.824:17965): arch=c000003e syscall=2 success=yes exit=4 a0=1099f60 a1=2 a2=1a4 a3=2 items=1 ppid=22350 pid=22351 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="yum" exe="/usr/bin/python2.7" key="rpm"
type=CWD msg=audit(1571031356.824:17965):  cwd="/"
type=PATH msg=audit(1571031356.824:17965): item=0 name="/var/lib/rpm/__db.001" inode=67167052 dev=fd:00 mode=0100644 ouid=0 ogid=0 rdev=00:00 objtype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1571031356.824:17965): proctitle=2F7573722F62696E2F707974686F6E002F7573722F62696E2F79756D002D43007265706F6C697374
type=SYSCALL msg=audit(1571031356.824:17966): arch=c000003e syscall=2 success=yes exit=6 a0=107b4c0 a1=42 a2=1a4 a3=2 items=2 ppid=22350 pid=22351 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="yum" exe="/usr/bin/python2.7" key="rpm"
type=CWD msg=audit(1571031356.824:17966):  cwd="/"
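
For context, an audit watch along these lines (the exact rule used on these hosts is not shown in this bug, so treat this as illustrative) would produce events keyed "rpm" whenever the rpm database directory is written:

    # illustrative watch rule; adjust to match the actual audit configuration in use
    auditctl -w /var/lib/rpm/ -p wa -k rpm

    # persisted form, e.g. in /etc/audit/rules.d/rpm.rules
    -w /var/lib/rpm/ -p wa -k rpm

    # events recorded under that key can then be pulled back with
    ausearch -k rpm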

=============================================================================================================================== 

We have a debugging application installed on the client that tracks every kill signal generated on the system. This was used in combination with the above audit logs to determine what killed the PID that created the locks listed in the audit logs.

This shows that insights-client was the application that killed the yum process:

Mon Oct 14 01:36:25 2019 EDT: sys.kill: swapper/0[0] -> systemd[1] -> insights-client[22230] -> insights-client[22317] -> timeout[22342] -> python[22343] -> timeout(22350) called kill 22351, SIGKILL killed yum
Mon Oct 14 01:36:25 2019 EDT: sig.send: SIGKILL was sent to yum (pid:22351) by swapper/0[0] -> systemd[1] -> insights-client[22230] -> insights-client[22317] -> timeout[22342] -> python[22343] -> timeout (pid:22350) uid:0 using signal_generate
Mon Oct 14 01:36:25 2019 EDT: sys.kill: swapper/0[0] -> systemd[1] -> insights-client[22230] -> insights-client[22317] -> timeout[22342] -> python[22343] -> timeout(22350) called kill 0, SIGKILL killed 
Mon Oct 14 01:36:25 2019 EDT: sig.send: SIGKILL was sent to yum (pid:22351) by swapper/0[0] -> systemd[1] -> insights-client[22230] -> insights-client[22317] -> timeout[22342] -> python[22343] -> timeout (pid:22350) uid:0 using signal_generate

=============================================================================================================================== 

Just to be thorough, and to make sure the PID lines up, I confirmed that the locks in the rpm database we analyzed were created by the same PID:

# db_stat -CA -h .   
Default locking region information:
1222	Last allocated locker ID
0x7fffffff	Current maximum unused locker ID
5	Number of lock modes
100	Initial number of locks allocated
0	Initial number of lockers allocated
100	Initial number of lock objects allocated
0	Maximum number of locks possible
0	Maximum number of lockers possible
0	Maximum number of lock objects possible
100	Current number of locks allocated
18	Current number of lockers allocated
100	Current number of lock objects allocated
20	Number of lock object partitions
1031	Size of object hash table
2	Number of current locks
29	Maximum number of locks at any one time
3	Maximum number of locks in any one bucket
0	Maximum number of locks stolen by for an empty partition
0	Maximum number of locks stolen for any one partition
2	Number of current lockers
18	Maximum number of lockers at any one time
2	Number of current lock objects
25	Maximum number of lock objects at any one time
2	Maximum number of lock objects in any one bucket
0	Maximum number of objects stolen by for an empty partition
0	Maximum number of objects stolen for any one partition
1807327	Total number of locks requested
1807324	Total number of locks released
0	Total number of locks upgraded
908	Total number of locks downgraded
0	Lock requests not available due to conflicts, for which we waited
0	Lock requests not available due to conflicts, for which we did not wait
0	Number of deadlocks
0	Lock timeout value
0	Number of locks that have timed out
0	Transaction timeout value
0	Number of transactions that have timed out
280KB	Region size
0	The number of partition locks that required waiting (0%)
0	The maximum number of times any partition lock was waited for (0%)
0	The number of object queue operations that required waiting (0%)
0	The number of locker allocations that required waiting (0%)
0	The number of region locks that required waiting (0%)
2	Maximum hash bucket length
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock REGINFO information:
Environment	Region type
1	Region ID
./__db.001	Region name
0x7f59a0d14000	Region address
0x7f59a0d140a0	Region allocation head
0x7f59a0d146a8	Region primary address
0	Region maximum allocation
0	Region allocated
Region allocations: 101 allocations, 0 failures, 0 frees, 1 longest
Allocations by power-of-two sizes:
  1KB	97
  2KB	1
  4KB	1
  8KB	0
 16KB	0
 32KB	1
 64KB	0
128KB	1
256KB	0
512KB	0
1024KB	0
REGION_SHARED	Region flags
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock region parameters:
2	Lock region region mutex [0/1415 0% 23/140022927968064]
131	locker table size
1031	object table size
2320	obj_off
111448	locker_off
0	need_dd
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock conflict matrix:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by lockers:
Locker   Mode      Count Status  ----------------- Object ---------------
     4c5 dd= 0 locks held 1    write locks 0    pid/thread 22351/140325946337088 flags 0    priority 100       
     4c5 READ          1 HELD    (201342d fd00 6f89e70e 3000 0) handle        0
     4c6 dd= 0 locks held 1    write locks 0    pid/thread 22351/140325946337088 flags 0    priority 100       
     4c6 READ          2 HELD    0x1e5e0 len:  20 data: 2d34010200fd00000ee7896f0030000000000000
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by object:
Locker   Mode      Count Status  ----------------- Object ---------------
     4c6 READ          2 HELD    0x1e5e0 len:  20 data: 2d34010200fd00000ee7896f0030000000000000

     4c5 READ          1 HELD    (201342d fd00 6f89e70e 3000 0) handle        0

=============================================================================================================================== 

Note in the above, under "Locks grouped by lockers:", we can see that the PID that created those locks is 22351, which lines up with the PID of the yum process as well as the PID that was killed by insights-client.

Both the insights-client config and the redhat-access-insights.conf config are in their default state.

insights-client on the affected hosts is set to run automatically via the --enable-schedule option. We have disabled automatic scheduling on the hosts in question to see whether the issue still reproduces without it enabled.
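
Not shown above, but for completeness: the usual way to confirm and clear stale Berkeley DB locks left behind in /var/lib/rpm (standard rpm database recovery practice, not a remediation specific to this bug) is roughly:

    # list the lockers still registered in the rpm DB environment;
    # a pid/thread entry whose process no longer exists points to a stale lock
    db_stat -CA -h /var/lib/rpm | grep 'pid/thread'

    # with no rpm/yum processes running, remove the stale environment files
    # and rebuild the database indexes
    rm -f /var/lib/rpm/__db.*
    rpm --rebuilddb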

Comment 14 Pavol Kotvan 2020-07-15 14:50:25 UTC
I was able to reproduce RPM DB corruption by running this command:

COUNT=0; while true; do time /bin/timeout -s 9 0.2 /bin/rpm -qa --qf [name:%{NAME},epoch:%{EPOCH},version:%{VERSION},release:%{RELEASE},arch:%{ARCH},installtime:%{INSTALLTIME:date},buildtime:%{BUILDTIME},vendor:%{VENDOR},buildhost:%{BUILDHOST},sigpgp:%{SIGPGP:pgpsig}]; rpm -qa | grep insights; let COUNT+=1; echo "======COUNT: ${COUNT}======"; sleep 3; done

RPM DB became corrupted after 67 iterations.

By changing the timeout signal to SIGTERM, the above loop could iterate more than 200 times without corrupting the DB (I stopped the loop after that many iterations, not because the DB got corrupted).
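
For reference, the SIGTERM variant of the loop differs only in the signal handed to timeout (query format shortened here for readability):

COUNT=0; while true; do time /bin/timeout -s TERM 0.2 /bin/rpm -qa --qf '[name:%{NAME},version:%{VERSION},release:%{RELEASE}]'; rpm -qa | grep insights; let COUNT+=1; echo "======COUNT: ${COUNT}======"; sleep 3; done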

The insights-client developers are preparing a fix, which will be released in the next EGG release; that should be this Thursday.

Comment 15 Stephen Adams 2020-07-16 19:48:52 UTC
The fix for this issue was released in EGG/core version 3.0.174.

Comment 16 Kyle Walker 2020-09-08 13:23:04 UTC
@Stephen,

Can you clarify how your previous update fixes this particular problem? Your comment indicates that version 3.0.174 includes the necessary change, but as of this morning there was another instance of the problem. I went ahead and registered a system with Insights and checked the version included:

    ~]# yum install insights-client
    ~]# insights-client --register
    ~]# cd $(mktemp -d)
    tmp.hCztyxd0qJ]# unzip /etc/insights-client/rpm.egg >/dev/null 2>&1
    tmp.hCztyxd0qJ]# grep ^Version EGG-INFO/PKG-INFO
    Version: 3.0.150


I assume that you mean that this particular change fixed the problem:

    $ git show 5122e0d51f8d392bbeebff27f5a8fbab4bed8ba4
    commit 5122e0d51f8d392bbeebff27f5a8fbab4bed8ba4
    Author: Jeremy Crafts <gravitypriest.github.com>
    Date:   Thu Jul 16 10:24:15 2020 -0400
    
        use SIGTERM for rpm and yum commands (#2630)
        
        Signed-off-by: Jeremy Crafts <jcrafts>
    
    diff --git a/insights/client/insights_spec.py b/insights/client/insights_spec.py
    index 1e2de068..d0b9037b 100644
    --- a/insights/client/insights_spec.py
    +++ b/insights/client/insights_spec.py
    @@ -47,8 +47,17 @@ class InsightsCommand(InsightsSpec):
     
             # all commands should timeout after a long interval so the client does not hang
             # prepend native nix 'timeout' implementation
    -        timeout_command = 'timeout -s KILL %s %s' % (
    -            self.config.cmd_timeout, self.command)
    +
    +        # use TERM for rpm/yum commands, KILL for everything else
    +        if (self.command.startswith('/bin/rpm') or
    +           self.command.startswith('yum') or
    +           self.command.startswith('/usr/bin/yum')):
    +            signal = 'TERM'
    +        else:
    +            signal = 'KILL'
    +
    +        timeout_command = 'timeout -s %s %s %s' % (
    +            signal, self.config.cmd_timeout, self.command)
     
             # ensure consistent locale for collected command output
             cmd_env = {'LC_ALL': 'C',


But that is _not_ in the version pulled by my RHEL 7 client:

    # awk '/InsightsCommand/,/timeout -s/' insights/client/insights_spec.py
    class InsightsCommand(InsightsSpec):
        '''
        A command spec
        '''
        def __init__(self, config, spec, exclude, mountpoint, parent_pid=None):
            InsightsSpec.__init__(self, config, spec, exclude, parent_pid)
            self.command = spec['command'].replace(
                '{CONTAINER_MOUNT_POINT}', mountpoint)
            self.archive_path = mangle.mangle_command(self.command)
            self.is_hostname = spec.get('symbolic_name') == 'hostname'
            if not six.PY3:
                self.command = self.command.encode('utf-8', 'ignore')

        def get_output(self):
            '''
            Execute a command through system shell. First checks to see if
            the requested command is executable. Returns (returncode, stdout, 0)
            '''
            # let systemd know we're still going
            systemd_notify(self.parent_pid)

            if self.is_hostname:
                # short circuit for hostame with internal method
                return determine_hostname()

            # all commands should timeout after a long interval so the client does not hang
            # prepend native nix 'timeout' implementation
            timeout_command = 'timeout -s KILL %s %s' % (

I'm reopening this bug at this time.

Comment 18 jcrafts 2020-09-17 13:46:27 UTC
Hello,

Could you provide the output of insights-client --version? Also, are you receiving automatic egg updates? (i.e. in insights-client.conf, auto_updates=True)
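
For anyone collecting the same data, something along these lines covers both questions (the config path is the default one seen earlier in this bug; the exact option name is whatever your shipped insights-client.conf uses):

    # client and egg versions
    insights-client --version

    # is automatic egg updating enabled?
    grep -i auto_update /etc/insights-client/insights-client.conf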

Thanks!

Comment 22 Kyle Walker 2020-09-22 12:21:08 UTC
Just for completeness,

Discussion and internal investigation of the testing outcome mentioned in my previous update identified a separate issue that external end users would not be subject to. However, this does not rule out the failures being seen. The update dated "2020-07-16 19:48:52 UTC" indicating that the fix is present applies only to the auto-update behaviour; the core implementation shipped within the RPM has not been updated at this time.

Comment 28 Chris Williams 2020-11-11 21:38:09 UTC
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear that the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, so they will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7,Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

Comment 29 Kyle Walker 2020-11-18 13:09:26 UTC
Reopening and setting back to ASSIGNED due to the severity of the issue and the ongoing effort to fix this problem.

Comment 37 errata-xmlrpc 2020-12-15 11:23:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (insights-client bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5462

