
Bug 1635213

Summary: civetweb get shows http status 500
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Jos Collin <jcollin>
Component: RGW
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED NOTABUG
QA Contact: Tejas <tchandra>
Severity: low
Docs Contact:
Priority: unspecified
Version: 2.5
CC: cbodley, ceph-eng-bugs, jcollin, kbader, mbenjamin, mkogan, sweil
Target Milestone: rc
Target Release: 3.*
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-23 07:06:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: radosgw-log (no flags)

Description Jos Collin 2018-10-02 11:36:46 UTC
Created attachment 1489404 [details]
radosgw-log

Description of problem:
There is a failed_req field in the ceph radosgw "perf dump" output: "failed_req - Aborted requests". The customer is seeing "errno -125" and "errno -5/500" entries in radosgw.log. The log file is attached to this bug.
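The two errno values in the log map to standard Linux error codes, which can be decoded with a minimal Python sketch (assuming a Linux host, since errno numbering is platform-specific; RGW logs the errno negated):

```python
import errno
import os

# RGW logs the raw negative errno; strip the sign to look it up.
# On Linux: -125 -> ECANCELED, -5 -> EIO.
for code in (-125, -5):
    num = abs(code)
    name = errno.errorcode.get(num, "UNKNOWN")
    print(f"errno {code}: {name} ({os.strerror(num)})")
```

ECANCELED typically points at a request that was canceled mid-flight (e.g. the client dropping the connection), while EIO is a generic I/O error that RGW maps to an HTTP 500.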

Version-Release number of selected component (if applicable):
RHCS 2.5

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
There is a failed_req field in the ceph radosgw "perf dump" output: "failed_req - Aborted requests". The customer is getting "errno -125" and "errno -5/500" errors.

Expected results:


Additional info:

Additional Info from the customer:

There is a failed_req field in the ceph radosgw "perf dump" output, and the documentation is a bit terse about it:
"failed_req - Aborted requests"

We're constantly seeing these, so it would be useful to know what it actually counts as an "Aborted request".
Does it count 404 (not found) as failed_req, or 500 (internal error), or is it something in the underlying system (network, etc.)?

# ceph daemon /var/run/ceph/ceph-client.radosgw.es-si-s3-z4.asok perf dump|grep failed_req
        "failed_req": 6228224,
-------------------------------------------

# tail -f /var/log/ceph/client.radosgw.es-ka-s3-dhn-14.log> /tmp/lokki&date; ceph daemon client.radosgw.es-ka-s3-dhn-14 perf dump|grep failed_req; sleep 10; date; ceph daemon client.radosgw.es-ka-s3-dhn-14 perf dump|grep failed_req; kill %1
[1] 58910
Mon Sep 10 12:34:00 EEST 2018
        "failed_req": 290254755,
Mon Sep 10 12:34:10 EEST 2018
        "failed_req": 290255005,
(generated log-file attached)
So over this 10-second window we had 290255005 - 290254755 = 250 failed requests.
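The delta arithmetic above can be sketched programmatically. In this hypothetical Python version the two JSON samples are hard-coded from the readings quoted above rather than taken from a live admin socket, and the counter is assumed to live under an "rgw" section of the perf dump, as in RHCS 2.x output:

```python
import json

def failed_req(perf_dump_json: str) -> int:
    """Extract the failed_req counter from `ceph daemon ... perf dump` JSON output.

    Assumes the counter lives under an "rgw" section (RHCS 2.x layout).
    """
    return json.loads(perf_dump_json)["rgw"]["failed_req"]

# Hard-coded samples matching the two readings quoted above; a live check
# would take two `perf dump` snapshots ~10 seconds apart instead.
before = failed_req('{"rgw": {"failed_req": 290254755}}')
after = failed_req('{"rgw": {"failed_req": 290255005}}')

print(after - before)  # -> 250 aborted requests in the sampling window
```

Because failed_req is a monotonically increasing counter, only the delta between two snapshots is meaningful, not the absolute value.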

If we count the entries in the log file, it almost matches the number of internal errors plus "not found" responses:
[root@esclosto014008 ~]# grep http_status= /tmp/lokki |egrep -v '=200|=206'|wc -l
    256 

For the 10-second period we have 3 internal errors (which doesn't feel good for such a short window):
[root@esclosto014008 ~]# grep http_status= /tmp/lokki |grep  =500|wc -l
      3 
Those seem to be these:
2018-09-10 12:34:06.309861 7fc6bffcf700  0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2018-09-10 12:34:06.309902 7fc6bffcf700  0 WARNING: set_req_state_err err_no=5 resorting to 500
2018-09-10 12:34:06.309930 7fc6bffcf700  0 ERROR: s->cio->send_content_length() returned err=-5
2018-09-10 12:34:06.309933 7fc6bffcf700  0 ERROR: s->cio->print() returned err=-5
2018-09-10 12:34:06.309935 7fc6bffcf700  0 ERROR: STREAM_IO(s)->print() returned err=-5
2018-09-10 12:34:06.309941 7fc6bffcf700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2018-09-10 12:34:06.309957 7fc6bffcf700  1 ====== req done req=0x7fc6bffc9710 op status=-5 http_status=500 ======

And 253 "not found" responses:
[root@esclosto014008 ~]# grep http_status= /tmp/lokki |grep  =404|wc -l
    253 
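The grep-based counting above can be reproduced with a short Python sketch. The sample lines below are a hypothetical subset in the `req done ... http_status=NNN` format shown in the log excerpts, and the regex assumes that field format:

```python
import re
from collections import Counter

# Hypothetical sample lines in the "req done" format from the RGW log above.
log_lines = [
    "====== req done req=0x7fc6bffc9710 op status=0 http_status=200 ======",
    "====== req done req=0x7fc6bffc9711 op status=-2 http_status=404 ======",
    "====== req done req=0x7fc6bffc9712 op status=-5 http_status=500 ======",
    "====== req done req=0x7fc6bffc9713 op status=-2 http_status=404 ======",
]

status_re = re.compile(r"http_status=(\d+)")
counts = Counter(m.group(1) for line in log_lines
                 if (m := status_re.search(line)))

# Everything outside 200/206 is what the grep pipeline above counted.
failed = sum(n for status, n in counts.items() if status not in ("200", "206"))
print(counts, failed)
```

On the real log file this reproduces the `egrep -v '=200|=206'` pipeline while also breaking the failures down per status code, which makes it easier to see whether 404s or 500s dominate failed_req.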


Interestingly, the same object first works (status=200), then returns status=500, and the next request is 200 again:
...
2018-09-10 13:02:21.848517 7f6c5ffbf700  1 civetweb: 0x7f6f48002e90: 127.0.0.1 - - [10/Sep/2018:13:02:17 +0300] "GET /cd262a96-a981-4027-853d-cb9fe63d915e/workspace.tar.lz4 HTTP/1.1" 200 0 - Wget/1.16 (linux-gnu)
2018-09-10 13:02:21.906691 7f6c677ce700  1 civetweb: 0x7f6d94001540: 127.0.0.1 - - [10/Sep/2018:13:02:21 +0300] "GET /cd262a96-a981-4027-853d-cb9fe63d915e/workspace.tar.lz4 HTTP/1.1" 500 0 - Wget/1.16 (linux-gnu)
2018-09-10 13:02:23.785745 7f6c637c6700  1 civetweb: 0x7f6f94002e90: 127.0.0.1 - - [10/Sep/2018:13:02:22 +0300] "GET /cd262a96-a981-4027-853d-cb9fe63d915e/workspace.tar.lz4 HTTP/1.1" 200 0 - Wget/1.16 (linux-gnu)
...

Comment 21 Giridhar Ramaraju 2019-08-05 13:12:09 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 22 Giridhar Ramaraju 2019-08-05 13:12:57 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 24 Red Hat Bugzilla 2023-09-15 00:12:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days