Bug 997939

Summary: raw output format not reliably parseable -- newlines not encoded in "description" field
Product: [Fedora] Fedora
Reporter: Steve Tyler <stephent98>
Component: python-bugzilla
Assignee: Will Woods <wwoods>
Status: CLOSED DEFERRED
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 19
CC: bugs.michael, crobinso, dzickus, jskarvad, stephent98, wwoods
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-28 17:51:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (description; flags):
raw-not-reliably-parseable-1.txt; none
bz997939-raw-not-reliably-parseable-2.txt; none
[prototype] bugzilla-raw-output-json-prototype-1.patch; none
bugzilla-raw-output-json-prototype-1.txt example raw output with two bugs; none

Description Steve Tyler 2013-08-16 15:04:43 UTC
Created attachment 787302
raw-not-reliably-parseable-1.txt

Description of problem:
The raw output format has bug fields tagged with a string beginning "ATTRIBUTE":

ATTRIBUTE[id]: 741734
ATTRIBUTE[foo]: bar

Such a string could appear at the beginning of a bug description line, which would then be incorrectly interpreted as a bug field tag.

Also, the first line has a bug id string: "Bugzilla 741734:" that is:
1. Inconsistent with the format of the other lines.
2. Redundant with:
ATTRIBUTE[id]: 741734

Disclaimer: I did not actually submit a bug that would exhibit this problem.

Version-Release number of selected component (if applicable):
python-bugzilla-0.9.0-1.fc19.noarch

How reproducible:
Always.

Steps to Reproduce:
1. $ bugzilla query -b 741734 --raw > raw-not-reliably-parseable-1.txt

Actual results:
Incorrect parsing if a bug description line begins with a string such as:
ATTRIBUTE[fff]: ...
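
A minimal sketch of the failure, written for Python 3. It assumes a naive line-based consumer; the `parse_raw` helper and the sample text are hypothetical and are not part of python-bugzilla:

```python
import re

# Hypothetical raw output: the description of bug 997939 contains
# lines that look exactly like field tags.
RAW = (
    "Bugzilla 997939: \n"
    "ATTRIBUTE[description]: Description of problem:\n"
    "ATTRIBUTE[id]: 741734\n"
    "ATTRIBUTE[foo]: bar\n"
    "ATTRIBUTE[id]: 997939\n"
)

TAG = re.compile(r"^ATTRIBUTE\[(\w+)\]: (.*)$")


def parse_raw(text):
    """Naive parser: treat every line that matches TAG as a field."""
    fields = []
    for line in text.splitlines():
        match = TAG.match(line)
        if match:
            fields.append((match.group(1), match.group(2)))
    return fields


fields = parse_raw(RAW)
ids = [value for name, value in fields if name == "id"]
print(ids)  # two "id" values; the parser cannot tell which one is real
```

The description text leaks into the field list as a bogus "id" and a bogus "foo", and the real description is truncated to its first line.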

Expected results:
Raw output format can be reliably parsed regardless of the content of the bug description.

Additional info:
The attached example is from:

$ bugzilla query -b 741734
#741734 NEW        - Red Hat Kernel Manager - FEAT: /sys should provide more accurate current cpu frequency

Or, with the output manually modified to enable BZ autolinkification:
Bug 741734 NEW        - Red Hat Kernel Manager - FEAT: /sys should provide more accurate current cpu frequency

Comment 1 Steve Tyler 2013-08-16 15:18:31 UTC
Created attachment 787325
bz997939-raw-not-reliably-parseable-2.txt

The problem appears to be that newlines in the "description" field are not encoded as "\n". That is in contrast with the "comments" field. The attached raw output illustrates the problem.

Attachment generated with:
$ bugzilla query -b 997939 --raw > bz997939-raw-not-reliably-parseable-2.txt

Comment 2 Steve Tyler 2013-08-16 15:37:15 UTC
This example explicitly illustrates the problem:

$ cat bz997939-raw-not-reliably-parseable-2.txt | egrep '^ATTRIBUTE\[(id|foo)\]'
ATTRIBUTE[id]: 741734
ATTRIBUTE[foo]: bar
ATTRIBUTE[id]: 741734
ATTRIBUTE[id]: 997939

Comment 3 Steve Tyler 2013-08-16 15:52:23 UTC
(In reply to Steve Tyler from comment #0)
...
> Also, the first line has a bug id string: "Bugzilla 741734:" that is:
> 1. Inconsistent with the format of the other lines.
> 2. Redundant with:
> ATTRIBUTE[id]: 741734
...

I have opened a separate bug for this problem. Changing bug summary accordingly.

Bug 997963 - raw output header "Bugzilla NNNNNN:" inconsistent and redundant

Comment 4 Steve Tyler 2013-08-16 21:20:02 UTC
After looking at this in pdb, it appears that the description has newlines encoded as "\n" before line 698, and that the newline encoding has been removed after line 698:

$ less -N python-bugzilla-0.9.0/bin/bugzilla
...
    690 def _format_output(bz, opt, buglist):
    691     if opt.output == 'raw':
    692         buglist = bz.getbugs([b.bug_id for b in buglist])
    693         for b in buglist:
    694             print "Bugzilla %s: " % b.bug_id
    695             for a in dir(b):
    696                 if a.startswith("__") and a.endswith("__"):
    697                     continue
    698                 print to_encoding(u"ATTRIBUTE[%s]: %s" % (a, getattr(b, a)))
    699             print "\n\n"
    700         return
...

https://git.fedorahosted.org/cgit/python-bugzilla.git/tree/bin/bugzilla?id=0.9.0#n698

This commit added that line:
https://git.fedorahosted.org/cgit/python-bugzilla.git/commit/?id=e40b423fd4785b8a6df25959b9f97b6c5c06642a

Comment 5 Steve Tyler 2013-08-16 22:13:30 UTC
This doesn't happen with comments, because they are lists, not strings.

The newline encoding is lost for comment text too, if a single comment is indexed out of the comments list and printed on its own:
(Pdb) print b.comments[0]['text']

Further, this produces the same output:
(Pdb) print b.comments[0]['text'].encode(locale.getpreferredencoding(), 'replace')
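
One plausible reading of these pdb observations, sketched for Python 3: formatting a list with %s applies repr() to its contents, so newlines show up as the two characters "\n", while formatting a string with %s emits the newline itself. The sample comment text below is made up for illustration:

```python
# A comment list shaped like the one inspected in pdb; the text is made up.
comments = [{"text": "first line\nsecond line"}]

# Formatting the whole list uses repr() on its contents, so the newline
# appears as backslash + n and the result stays on one line.
as_list = "ATTRIBUTE[%s]: %s" % ("comments", comments)

# Formatting the string itself uses str(), so the newline is emitted raw
# and the result spans two lines.
as_text = "ATTRIBUTE[%s]: %s" % ("description", comments[0]["text"])

print(as_list)
print(as_text)
print(len(as_list.splitlines()), len(as_text.splitlines()))
```

Under that reading, the "comments" field stays on one line only as a side effect of being a list, not because the raw formatter encodes anything.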

Comment 6 Steve Tyler 2013-08-18 20:56:43 UTC
Created attachment 787846
[prototype] bugzilla-raw-output-json-prototype-1.patch

The attached prototype patch:
1. Fixes this bug.
2. Fixes Bug 998256.
3. Implements the ideas in Bug 997963, Comment 1 by adding a BUGZILLA keyword.

The output format retains the ATTRIBUTE keyword, but uses the Python json module[1] to encode the data for each attribute. Thus, this format is a hybrid of the raw output format and the json format. The result is generally readable, except for the description, which has newlines encoded as "\n". This is unavoidable if the format is to be reliably machine-readable.

Two known issues are that:
1. Bug comments are missing, because the json encoder does not know how to encode comments.
2. Some spurious fields are generated:
ATTRIBUTE[__doc__]: ...
ATTRIBUTE[__module__]: "bugzilla.bug"
ATTRIBUTE[__weakref__]: null
These appear because they can be json encoded (no TypeError is raised).

[1] http://docs.python.org/2.7/library/json.html
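
A rough Python 3 sketch of the hybrid format described in this comment. It is not the attached patch: `FakeBug`, `format_raw`, and `parse_raw` are hypothetical names, skipping dunder attributes is one possible way to avoid the spurious fields, and `default=str` is only an assumed fallback for values the json encoder rejects:

```python
import json


class FakeBug:
    """Hypothetical stand-in for a bug object; not the bugzilla.bug class."""

    def __init__(self, bug_id, description):
        self.id = bug_id
        self.description = description


def format_raw(bug):
    """Emit one ATTRIBUTE line per attribute, JSON-encoding each value."""
    lines = []
    for name in dir(bug):
        if name.startswith("__") and name.endswith("__"):
            continue  # skip __doc__, __module__ and similar
        # default=str is an assumption: it stringifies values json cannot encode.
        value = json.dumps(getattr(bug, name), default=str)
        lines.append("ATTRIBUTE[%s]: %s" % (name, value))
    return "\n".join(lines)


def parse_raw(text):
    """Inverse of format_raw: every line is exactly one field."""
    fields = {}
    for line in text.splitlines():
        tag, _, value = line.partition(": ")
        fields[tag[len("ATTRIBUTE["):-1]] = json.loads(value)
    return fields


bug = FakeBug(997939, "Steps:\nATTRIBUTE[id]: 741734\nATTRIBUTE[foo]: bar")
raw = format_raw(bug)
print(raw)
assert parse_raw(raw) == vars(bug)
```

Because json.dumps escapes the newlines, each field occupies exactly one line, so description text that begins with "ATTRIBUTE" can no longer be mistaken for a tag.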

Comment 7 Steve Tyler 2013-08-18 21:15:00 UTC
Created attachment 787849
bugzilla-raw-output-json-prototype-1.txt example raw output with two bugs

This is example output from the patched raw output generator with two bugs.[1]

Notable features are:
1. There is a tagged header for each bug:
BUGZILLA[bug_id]: 986069
BUGZILLA[timestamp]: "2013-08-18 20:22:31 UTC"
BUGZILLA[version]: "0.9.0"

2. The bug description is machine-readable, because newlines are encoded as "\n". In particular, lines in the description beginning "ATTRIBUTE" cannot confuse a parser. The second bug illustrates this, because it contains the string "ATTRIBUTE" several times in the description.

An enhancement that would improve readability would be to convert lines of text into elements of a list, so that each line of text could be listed separately.

[1] $ PYTHONPATH=. ./bin/bugzilla query -b 986069,997939 --raw > bugzilla-raw-output-json-prototype-1.txt
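
The list-of-lines enhancement suggested in this comment could look roughly like the following Python 3 sketch. The sample description is made up, and using `indent=2` to put each line on its own row is an assumption about what "listed separately" would mean in practice:

```python
import json

description = "Description of problem:\nATTRIBUTE[id]: 741734\nATTRIBUTE[foo]: bar"

# Encode the text as a JSON list with one element per line of text.
encoded = json.dumps(description.splitlines(), indent=2)
print("ATTRIBUTE[description]: " + encoded)

# Continuation lines start with whitespace or a bracket, never with
# "ATTRIBUTE", so a line-based reader can still find field boundaries.
assert not any(line.startswith("ATTRIBUTE") for line in encoded.splitlines())

# A reader joins the elements to recover the original text.
decoded = "\n".join(json.loads(encoded))
assert decoded == description
```

One caveat: str.splitlines() drops a trailing newline and treats "\r\n" as a single break, so this round trip is exact only for text without those.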

Comment 8 Cole Robinson 2013-10-28 17:51:54 UTC
I think it's reasonable to add a json output option or similar, but I'm not personally interested in implementing it. If you want to follow up, please take this to the upstream mailing list where more interested parties are watching.