Bug 1207721 - Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
Summary: Runtime Error could not execute statement at org.postgresql.core.v3.QueryExec...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Candlepin
Classification: Community
Component: candlepin
Version: 0.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: candlepin-bugs
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks: rhsm-rhel67
 
Reported: 2015-03-31 14:46 UTC by John Sefler
Modified: 2019-11-14 06:40 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 19:05:49 UTC
Embargoed:


Attachments
foreman-debug (474.39 KB, application/x-xz), attached 2016-07-19 16:05 UTC by Roman Plevka

Description John Sefler 2015-03-31 14:46:32 UTC
Description of problem:
Test automation runs have started hitting this error recently.

Version-Release number of selected component (if applicable):
[root@jsefler-os6candlepin candlepin]# git log | head -1
commit c04ffb6b2716644c44355f88f03087c33da21d87
[root@jsefler-os6server ~]# subscription-manager version
server type: Red Hat Subscription Management
subscription management server: 0.9.47-1
subscription management rules: 5.14
subscription-manager: 1.14.2-1.git.1.6470ef0.el6
python-rhsm: 1.14.2-1.git.0.fafb02d.el6


How reproducible:
Not often - I do not know a repeatable use case.

Steps to Reproduce:
Pasted from a test automation log...

201503310319:14.101 - FINE: ssh root.redhat.com subscription-manager unregister
201503310319:17.043 - FINE: Stdout: 
201503310319:17.043 - FINE: Stderr: Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
201503310319:17.043 - FINE: ExitCode: 70
201503310319:17.044 - FINE: ssh root.redhat.com LINE_NUMBER=$(grep --line-number 'Making request:' /var/log/rhsm/rhsm.log | tail --lines=1 | cut --delimiter=':' --field=1); if [ -n "$LINE_NUMBER" ]; then tail -n +$LINE_NUMBER /var/log/rhsm/rhsm.log; fi;
201503310319:17.127 - WARNING: Last request from /var/log/rhsm/rhsm.log:
2015-03-31 03:19:17,873 [DEBUG] rhsmd:22939 @connection.py:494 - Making request: GET /candlepin/consumers/589555ff-42bf-45e4-9799-1419bc945006/compliance
2015-03-31 03:19:18,270 [DEBUG] rhsmd:22939 @connection.py:521 - Response: status=200, requestUuid=eabfdf97-edf7-44ac-91ee-815e82c350be
2015-03-31 03:19:18,339 [DEBUG] rhsmd:22939 @cache.py:272 - Started thread to write cache: /var/lib/rhsm/cache/entitlement_status.json
2015-03-31 03:19:19,130 [DEBUG] subscription-manager:22933 @connection.py:521 - Response: status=500
2015-03-31 03:19:19,130 [ERROR] subscription-manager:22933 @managercli.py:161 - Unregister failed
2015-03-31 03:19:19,130 [ERROR] subscription-manager:22933 @managercli.py:162 - Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
Traceback (most recent call last):
  File "/usr/share/rhsm/subscription_manager/managercli.py", line 1240, in _do_command
    managerlib.unregister(self.cp, self.identity.uuid)
  File "/usr/share/rhsm/subscription_manager/managerlib.py", line 788, in unregister
    uep.unregisterConsumer(consumer_uuid)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 990, in unregisterConsumer
    return self.conn.request_delete(method)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 614, in request_delete
    return self._request("DELETE", method, params)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 530, in _request
    self.validateResponse(result, request_type, handler)
  File "/usr/lib64/python2.6/site-packages/rhsm/connection.py", line 572, in validateResponse
    raise RestlibException(response['status'], error_msg)
RestlibException: Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102

Comment 1 Devan Goodwin 2015-03-31 15:38:13 UTC
John are there candlepin logs available for this?

Comment 2 Devan Goodwin 2015-03-31 18:15:14 UTC
From the candlepin logs this appears to always be something like:

2015-03-31 07:23:14,999 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=] INFO  org.candlepin.common.filter.LoggingFilter - Request: verb=DELETE, uri=/candlepin/consumers/512a05fc-6d1c-4e10-a626-0f8e2a91cd1f
2015-03-31 07:23:15,052 [req=d144fe22-045b-4bee-90f6-ee65827c6d8b, org=] INFO  org.candlepin.common.filter.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/512a05fc-6d1c-4e10-a626-0f8e2a91cd1f/compliance
2015-03-31 07:23:15,131 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] INFO  org.candlepin.controller.CandlepinPoolManager - Revoked entitlement: 2c90af964c6dffde014c6eb644ff459d
2015-03-31 07:23:15,141 [req=d144fe22-045b-4bee-90f6-ee65827c6d8b, org=admin] INFO  org.candlepin.common.filter.LoggingFilter - Response: status=200, content-type="application/json", time=88
2015-03-31 07:23:15,149 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] INFO  org.candlepin.controller.CandlepinPoolManager - Revoked entitlement: 2c90af964c6dffde014c6eb643b64598
2015-03-31 07:23:15,168 [req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9, org=] INFO  org.candlepin.common.filter.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/512a05fc-6d1c-4e10-a626-0f8e2a91cd1f/compliance
2015-03-31 07:23:15,219 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: 0, SQLState: 23503
2015-03-31 07:23:15,219 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - ERROR: update or delete on table "cp_consumer" violates foreign key constraint "fk_consumer_checkin_consumer" on table "cp_consumer_checkin"
  Detail: Key (id)=(2c90af964c6dffde014c6eb569a544e3) is still referenced from table "cp_consumer_checkin".
2015-03-31 07:23:15,224 [req=7d53040c-29a9-4250-b22e-bc76968affef, org=admin] ERROR org.candlepin.common.exceptions.mappers.CandlepinExceptionMapper - Runtime Error could not execute statement at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse:2,102
org.hibernate.exception.ConstraintViolationException: could not execute statement


Looking closely at this, the failing request is the consumer DELETE. Before the DELETE request can complete, a simultaneous GET request inserts another consumer check-in timestamp into the db, but the DELETE request has already loaded the consumer for deletion.

As a result the delete fails, since the db sees a foreign key reference still in place.
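To make the interleaving concrete, here is a minimal, hypothetical JDBC sketch of the same race against a local test database that has these two tables. The column names (consumer_id, checkin), the ids, and the connection details are assumptions made purely for illustration; Candlepin itself goes through Hibernate rather than raw JDBC like this.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CheckinRaceSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for a local test database.
        String url = "jdbc:postgresql://localhost/candlepin";
        try (Connection deleteTx = DriverManager.getConnection(url, "candlepin", "secret");
             Connection checkinTx = DriverManager.getConnection(url, "candlepin", "secret")) {

            deleteTx.setAutoCommit(false);

            // 1. The DELETE /consumers request loads the consumer and removes its
            //    existing check-in rows, but its transaction has not committed yet.
            try (Statement s = deleteTx.createStatement()) {
                s.executeUpdate("DELETE FROM cp_consumer_checkin WHERE consumer_id = 'abc'");
            }

            // 2. Meanwhile the concurrent GET /compliance request (separate,
            //    autocommitted connection) records a brand new check-in row
            //    for the same consumer.
            try (Statement s = checkinTx.createStatement()) {
                s.executeUpdate(
                    "INSERT INTO cp_consumer_checkin (id, consumer_id, checkin) "
                    + "VALUES ('new-checkin', 'abc', now())");
            }

            // 3. The DELETE transaction now removes the consumer itself. PostgreSQL
            //    rejects it: the new check-in row still references the consumer via
            //    fk_consumer_checkin_consumer.
            try (Statement s = deleteTx.createStatement()) {
                s.executeUpdate("DELETE FROM cp_consumer WHERE id = 'abc'");
                deleteTx.commit();
            } catch (SQLException e) {
                deleteTx.rollback();
                System.out.println("Delete failed as in the bug: " + e.getMessage());
            }
        }
    }
}

Step 3 fails with SQLState 23503, matching the foreign key error in the log above.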

John, this feels like something the tests should be adjusted for. Can you think of any way you'd be deleting a test consumer while its compliance is still being calculated? Can you somehow ensure all API calls for the consumer complete before trying to delete it?

This would be exceedingly difficult to hit in production, and a re-try on the delete would succeed. Arguably everything is working as expected, but I'm waiting to hear more from jsefler.

Comment 3 John Sefler 2015-03-31 18:33:44 UTC
(In reply to Devan Goodwin from comment #2)
> John this feels like tests that should be adjusted, can you think of any way
> you'd be deleting a test consumer but it would still be calculating it's
> compliance?

This question makes me think that bug 1207403 (which is easy to reproduce) may be the instigator of this bug.

Comment 4 Devan Goodwin 2015-04-01 11:39:56 UTC
That looks very likely, good find.

Comment 6 Adrian Likins 2015-04-07 19:56:11 UTC
(In reply to John Sefler from comment #3)
> (In reply to Devan Goodwin from comment #2)
> > John this feels like tests that should be adjusted, can you think of any way
> > you'd be deleting a test consumer but it would still be calculating it's
> > compliance?
> 
> This question makes me think that bug 1207403 (which is easy to
> reproduce) may be the instigator of this bug.

The logs here look like the post-DELETE request for the consumer's compliance still knows the consumer's UUID, though. So the ordering is different.

Is there any more of the rhsm.log available? I'm curious when the DELETE request started.

My guess would be that the test registers, attaches (auto or otherwise), then immediately unregisters. The subman register and/or attach triggers rhsmd (pid 22939), then subman exits. An immediate 'unregister' then starts another subman process, calling the DELETE. In the meantime the rhsmd process has made the GET compliance request.

subman register --autoattach
                                            rhsmd starts
subman unregister
<DELETE, req=7d53040c-29a9-4250-b22e-bc76968affef>
                                            <GET compliance with the correct id>
                                             req=d144fe22-045b-4bee-90f6-ee65827c6d8b

                     <cp does delete request>
                                            rhsmd triggered because of change
                                            <makes  req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9
                                             with id cert, still valid atm,
                                             exists on clients>
                     <cp attempts to update the now-being-deleted consumer for req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9>
                     <The DB gets grumpy>
<DELETE fails> 
                     <req=1e234b82-cfbc-4f7f-8cfc-c3a54d04a2c9  fails?>
                                            rhsmd

Comment 7 Devan Goodwin 2015-04-09 18:44:39 UTC
We believe this will be fixed by reverting the dbus code that triggers events on compliance status changes.

subscription-manager.git: cf23450fc1d519a7d0e90611499047ce2e7c3334
Will appear in subscription-manager-1.15.3-1

Comment 8 Devan Goodwin 2015-04-10 17:34:37 UTC
Some more analysis: I believe the issue is the rhsm-icon being triggered. We send out a dbus request to update status and potentially pop up the icon on many CLI commands: unregister (not likely our candidate here though), attach, remove, repos, etc. This is likely caused by automated tests where we run one of these commands; the dbus request goes out but is handled outside the process. The test then runs unregister, and very shortly afterward dbus kicks in and the icon tries to update its status. This has to happen after the unregister request is sent, but before it completes.

This is only showing up now because the insertion of a "last updated" row in the db is a new change to fix the lock wait timeouts and row contention in the database. Unfortunately it means that if you request to delete a consumer while other requests for that consumer are still arriving, the delete will fail, because the delete transaction couldn't see the last check-in row in the db when it started.

The server is behaving correctly here, I think, and I believe the issue is limited to automated situations where we're doing subscription events and then immediately unregistering. It's quite difficult to reproduce even then; my best bet was putting a sleep into unregister, at which point it becomes quite easy.

At present I can't really see any solution here. The client is correct to trigger the icon when it does, and everything is perfectly valid when that happens: the consumer still exists because the unregister hasn't completed yet.

The only option I can think of is attempting to catch the exception server side and re-try (which will be difficult across three databases). Re-tries are generally something we try to avoid.

Or, we could perhaps try to signal on the client that an unregister is underway, and have the icon no-op when it sees this...

Comment 9 Devan Goodwin 2015-04-10 17:44:05 UTC
Given the effort involved, and given this is probably not possible in normal usage (only automated testing), we're going to propose at the very least deferring this, but it's probably worth closing. The fix would be quite a mess, and the benefit would be very small.

Requesting needinfo from John for his thoughts.

Comment 10 Devan Goodwin 2015-04-10 17:58:51 UTC
We did think of a possible server-side solution to this. We could investigate dropping the foreign key from the last check-in rows to the consumer and replacing it with an index so that lookups remain fast. We'd need provisions to clean up orphaned rows, but this way a GET request during a delete would not cause the delete to fail.
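A purely illustrative sketch of that schema change, again as raw JDBC DDL against a test database. The consumer_id column and the index name are assumptions, and Candlepin's real schema migrations are not shown in this bug.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CheckinFkToIndexSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/candlepin"; // assumed test database
        try (Connection conn = DriverManager.getConnection(url, "candlepin", "secret");
             Statement stmt = conn.createStatement()) {

            // Drop the constraint that makes a concurrent check-in INSERT fatal
            // to a consumer DELETE (fk_consumer_checkin_consumer, per the log above).
            stmt.executeUpdate(
                "ALTER TABLE cp_consumer_checkin DROP CONSTRAINT fk_consumer_checkin_consumer");

            // Keep lookups by consumer fast with an ordinary index instead.
            stmt.executeUpdate(
                "CREATE INDEX idx_consumer_checkin_consumer ON cp_consumer_checkin (consumer_id)");

            // Orphaned check-in rows would then need periodic cleanup, e.g.:
            //   DELETE FROM cp_consumer_checkin cc WHERE NOT EXISTS
            //     (SELECT 1 FROM cp_consumer c WHERE c.id = cc.consumer_id);
        }
    }
}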

Steps to reproduce:

- Put a Thread.sleep somewhere in ConsumerResource.deleteConsumer and deploy (a sketch of this follows the steps below).
- subscription-manager unregister
- switch to another terminal and subscription-manager status
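A hypothetical stand-in for the first step; the real ConsumerResource and its curator aren't shown in this bug, so the types below are stubs purely to mark where the sleep widens the race window between loading and deleting the consumer.

public class DeleteConsumerSleepSketch {

    // Stub of the persistence layer, for illustration only.
    interface ConsumerCurator {
        Object findByUuid(String uuid);
        void delete(Object consumer);
    }

    private final ConsumerCurator consumerCurator;

    DeleteConsumerSleepSketch(ConsumerCurator consumerCurator) {
        this.consumerCurator = consumerCurator;
    }

    public void deleteConsumer(String uuid) throws InterruptedException {
        Object consumer = consumerCurator.findByUuid(uuid);  // consumer loaded for deletion

        // Pause long enough for a concurrent GET /compliance (e.g. from
        // 'subscription-manager status' in another terminal) to insert a new
        // cp_consumer_checkin row before this delete transaction commits.
        Thread.sleep(10_000);

        consumerCurator.delete(consumer);  // now fails with the FK violation seen above
    }
}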

Moving this to Candlepin and off 6.7.

Comment 11 Roman Plevka 2016-07-19 16:05:12 UTC
Created attachment 1181700: foreman-debug

Hit this bug with sat6.2 snap20.2 while creating repositories simultaneously.

Attaching a foreman-debug (happened at 2016-07-19 11:14:40,556).

Comment 12 Filip Nguyen 2016-07-20 15:59:04 UTC
Roman, I think your bug is unrelated to this one. This bug concerns logic around cp_consumer_checkin; that table is not present in the Candlepin that is integrated into Sat 6.2. Even looking at your logs, the error [1] is in a different code path, on a different table.

Your bug is caused by concurrent demotions/promotions of content, which is a programming error on the client's side (in your case Katello is the client). It is not a Candlepin server bug.

[1] 
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "cp_env_cnt_env_id_cntid_ky"
585     at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102) ~[postgresql-9.0-801.jdbc4.jar:n

Comment 13 Roman Plevka 2016-07-20 21:40:39 UTC
(In reply to Filip Nguyen from comment #12)
> Roman I think your bug is unrelated to this. This bug concerns logic around
> cp_consumer_checkin. That table is not present in Candlepin that is
> integrated in Sat 6.2. Even looking at your logs, the error [1] is in
> different code path, on different table. 
> 
> Your bug is caused by concurrent demotions/promotions of content, which is
> programming error on client's side (in your case Katello is the client). Not
> Candlepin server bug.
> 
> [1] 
> Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value
> violates unique constraint "cp_env_cnt_env_id_cntid_ky"
> 585     at
> org.postgresql.core.v3.QueryExecutorImpl.
> receiveErrorResponse(QueryExecutorImpl.java:2102)
> ~[postgresql-9.0-801.jdbc4.jar:n

Thanks Filip. I'll report this as a Katello issue then.

Comment 14 Michael Stead 2017-08-01 19:05:49 UTC
As per comment 13, closing as NOT_A_BUG.

