Bug 823679

Summary: Review Request: python-pdfminer - PDF data parsing library and tool
Product: [Fedora] Fedora Reporter: Ben Rosser <rosser.bjr>
Component: Package ReviewAssignee: Lubomir Rintel <lkundrak>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rawhideCC: lkundrak, package-review, tibbs
Target Milestone: ---Flags: lkundrak: fedora‑review+
limburgher: fedora‑cvs+
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: python-pdfminer-20140328-2.fc19 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-09-09 18:06:49 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On: 859246    
Bug Blocks:    

Description Ben Rosser 2012-05-21 17:00:20 EDT
Spec URL: http://venus.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM URL: http://venus.arosser.com/fedora/pdfminer/python-pdfminer-20110515-0.fc16.src.rpm
Description: PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows to obtain the exact location of texts in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis.

This is my first package, so I am in need of sponsorship to proceed further- I'll be glad to start doing reviews of other packages as well as part of this.

Fedora Account System Username: tc01
Comment 1 Michael Schwendt 2012-05-22 06:28:58 EDT
Just a brief look. There are a few eyebrow raisers in the spec file:

> %prep
> %setup -q -n %{srcname}-%{version}
> export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/lib/pkgconfig

What's the purpose of this export? Nonarch pkgconfig files are stored in /usr/share/pkgconfig/, so there should not be any reason to alter the search path like that.


> %build
> CFLAGS="$RPM_OPT_FLAGS"

This isn't needed at all, is it? If any C code/data files were compiled in this Python based package, that would deserve a comment in the spec file and would ask for a closer look during review.


> %{__python} setup.py build
> make cmap
> chmod +x build/lib/pdfminer/*
> mv build/scripts-2.7/dumppdf.py build/scripts-2.7/dumppdf
> mv build/scripts-2.7/pdf2txt.py build/scripts-2.7/pdf2txt
> mv build/scripts-2.7/latin2ascii.py build/scripts-2.7/latin2ascii

I would move the "chmod" line and the three "mv" lines into the %install section. In order to distinguish between the steps, which are needed to build the upstream software, and the additional steps, which may only be necessary to bring into shape the files for installation/packaging.


> %install
> [...]
> mkdir -p %{buildroot}%{_mandir}/man1/

No manual pages are available, however.


> %files
> [...]
> %{python_sitelib}/%{srcname}-20110515-py2.7.egg-info
> %{python_sitelib}/%{srcname}/*

https://fedoraproject.org/wiki/Packaging:UnownedDirectories


> %defattr(-,root,root,-)
> %doc docs/*

https://fedoraproject.org/wiki/Packaging:Guidelines#File_Permissions

If there's a special/secret reason for putting the %defattr there, please document it.
Comment 2 Ben Rosser 2012-05-22 08:11:40 EDT
Fixed the unowned directories problem, and moved the chmod and mv commands to %install.

Got rid of the %defattr (I guess I pulled that from an older RPM reference somewhere), the export, the CFLAGS, and the man mkdir- none of these were needed.

Reposted spec file and SRPM; they're at the same URLs:
Spec URL: http://venus.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM URL: http://venus.arosser.com/fedora/pdfminer/python-pdfminer-20110515-0.fc16.src.rpm
Comment 3 Ben Rosser 2012-05-29 14:16:37 EDT
As I realized I completely forgot to increment the release tag with these changes (thus changing the version number to 20110515-1).. I've now reuploaded the spec and a new SRPM:

Spec: http://venus.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://venus.arosser.com/fedora/pdfminer/python-pdfminer-20110515-1.fc16.src.rpm
Comment 4 Jason Tibbitts 2012-07-05 13:27:31 EDT
Just a note that this code appears to bundle some data from Adobe that is not under the MIT license.  According to cmapsrc/README.txt it appears to be under the 3-clause BSD license.

So there are two issues:

The License: tag is wrong; it needs to be something like "MIT and BSD" and then indicate which parts of the final package are under which license.  (And looking at the built package that's not immediately clear to me.)

The bundling may or may not run afoul of the general bundling prohibition: https://fedoraproject.org/wiki/Packaging:No_Bundled_Libraries.  Since, again, it's not immediately clear to me what happens to that cmap data, I can't really say what should happen here.

Also, if you do not intend to package this for RHEL5, you can remove BuildRoot:, the entire %clean section, and the first line of %install.
Comment 5 Ben Rosser 2012-07-05 20:39:52 EDT
Update, with RHEL 5 legacy stuff removed and license modified to reflect the cmap content.

Spec: http://venus.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://venus.arosser.com/fedora/pdfminer/python-pdfminer-20110515-2.fc17.src.rpm

(In reply to comment #4)
> Just a note that this code appears to bundle some data from Adobe that is
> not under the MIT license.  According to cmapsrc/README.txt it appears to be
> under the 3-clause BSD license.
> 
> So there are two issues:
> 
> The License: tag is wrong; it needs to be something like "MIT and BSD" and
> then indicate which parts of the final package are under which license. 
> (And looking at the built package that's not immediately clear to me.)
> 
> The bundling may or may not run afoul of the general bundling prohibition:
> https://fedoraproject.org/wiki/Packaging:No_Bundled_Libraries.  Since,
> again, it's not immediately clear to me what happens to that cmap data, I
> can't really say what should happen here.

So... it looks like, because I was calling "make cmap" *after* "python setup.py build", the cmap data is not being installed into the package. I've fixed this, and now the cmap data is being stored in: "/usr/lib/python2.7/site-packages/pdfminer/cmap". So this folder would be under the BSD license, then- I've updated the License tag and put a comment there explaining this.

I'm not sure about the bundling though. One option would be for me to simply remove it, as it's not a vital component of the package. The maintainer marks it as an optional step for CJK language support here- http://www.unixuser.org/~euske/python/pdfminer/index.html#install.

So, if the bundling of the cmap data does violate the bundling prohibition, it can just be removed.

> Also, if you do not intend to package this for RHEL5, you can remove
> BuildRoot:, the entire %clean section, and the first line of %install.

Ah, okay. Yeah, I have no intention of packaging for RHEL5, so I've removed all that.
Comment 6 Ben Rosser 2012-09-20 17:36:43 EDT
Update: effectively removed cmap bundling (by creating a separate cmap package here: https://bugzilla.redhat.com/show_bug.cgi?id=859246).

Spec: http://venus.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://venus.arosser.com/fedora/pdfminer/python-pdfminer-20110515-3.fc17.src.rpm
Comment 7 Ben Rosser 2014-07-28 20:38:39 EDT
Additional update: these are now the valid URLs.

Spec: http://mars.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://mars.arosser.com/fedora/pdfminer/python-pdfminer-20110515-3.fc17.src.rpm

I'll need to update this spec to the latest version of pdfminer, and make it depend on the right cmap packages too.
Comment 8 Lubomir Rintel 2014-08-11 08:12:30 EDT
I guess you'd want to update the SPEC file for newly packaged cmap files; something along the lines of

BuildRequires: cmap-japan1-6
BuildRequires: cmap-korean1-2
BuildRequires: cmap-gb1-5
BuildRequires: cmap-cns1-6
...
cp %{_datadir}/cmap/cmap-japan*/cid2code.txt cmaprsrc/cid2code_Adobe_Japan1.txt
cp %{_datadir}/cmap/cmap-korea*/cid2code.txt cmaprsrc/cid2code_Adobe_Korea1.txt
cp %{_datadir}/cmap/cmap-gb*/cid2code.txt cmaprsrc/cid2code_Adobe_GB1.txt
cp %{_datadir}/cmap/cmap-cns*/cid2code.txt cmaprsrc/cid2code_Adobe_CNS1.txt

(Maybe cmap-* packaging could be changed to avoid use of wildcards above?)
Comment 9 Ben Rosser 2014-08-16 17:41:06 EDT
Hmm, yes, the version should probably no longer be in the /usr/share/cmap/cmap-* path. That's an artifact of when "1.6" was the version, rather than the file modification date. I think it should probably become /usr/share/cmap/cmap-japan1-6/ (for example).

I'll push updates to the other packages, but for now, here's an updated spec.

Spec: http://mars.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://mars.arosser.com/fedora/pdfminer/python-pdfminer-20140328-0.fc20.src.rpm
Comment 10 Lubomir Rintel 2014-08-18 05:27:14 EDT
0.) rpmlint unhappy:

python-pdfminer.src:17: W: mixed-use-of-spaces-and-tabs (spaces: line 3, tab: line 17)
The specfile mixes use of spaces and tabs for indentation, which is a cosmetic
annoyance.  Use either spaces or tabs for indentation, not both.

1.) You don't need to quote your name with single quotes in changelog:


* Sat Aug 16 2014 'Ben Rosser' <rosser.bjr@gmail.com> 20140328-0

2.) Please move the files only after actual install in %{buildroot}.

This makes %install non-idempotent and thus breaks short-circuit of install phase (rpmbuild -bi --short-circuit).

%install
...
# Rename the python scripts to get rid of the *.py suffix
mv build/scripts-2.7/dumppdf.py build/scripts-2.7/dumppdf
mv build/scripts-2.7/pdf2txt.py build/scripts-2.7/pdf2txt
mv build/scripts-2.7/latin2ascii.py build/scripts-2.7/latin2ascii

3.) The initial release number should be >= 1.

Release numbers < 1 are reserved to pre-release snapshots.

Release:        0%{?dist}
Comment 11 Ben Rosser 2014-08-23 16:10:51 EDT
Fixed the release numbers, the quotes in changelog, and the tabs. rpmlint is now happy.

I changed the script renaming to this (and it happens after "python setup.py install") now:

%install
chmod +x build/lib/pdfminer/*
%{__python} setup.py install --skip-build --root %{buildroot}

# Rename the python scripts to get rid of the *.py suffix
mv %{buildroot}/usr/bin/dumppdf.py %{buildroot}/usr/bin/dumppdf
mv %{buildroot}/usr/bin/pdf2txt.py %{buildroot}/usr/bin/pdf2txt
mv %{buildroot}/usr/bin/latin2ascii.py %{buildroot}/usr/bin/latin2ascii

Is this better?

Spec: http://mars.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://mars.arosser.com/fedora/pdfminer/python-pdfminer-20140328-1.fc20.src.rpm
Comment 12 Ben Rosser 2014-08-23 16:33:55 EDT
Er, updated again to use %{_bindir} in those lines instead of /usr/bin.

Spec: http://mars.arosser.com/fedora/pdfminer/python-pdfminer.spec
SRPM: http://mars.arosser.com/fedora/pdfminer/python-pdfminer-20140328-2.fc20.src.rpm
Comment 13 Lubomir Rintel 2014-08-25 08:25:10 EDT
* Package named correctly
* Package version is correct
* Packaging the latest version
* Source file checksum matches
* SPEC file clean and legible
* License tag correct
* License fine for fedora
* Filelist sane
* Provides & requirements okay
* Builds fine in mock
* rpmlint happy

APPROVED
Comment 14 Ben Rosser 2014-08-25 22:38:15 EDT
New Package SCM Request
=======================
Package Name: python-pdfminer
Short Description: PDF data parsing library and tool
Upstream URL: http://www.unixuser.org/~euske/python/pdfminer/index.html
Owners: tc01
Branches: f19 f20 f21
InitialCC:
Comment 15 Jon Ciesla 2014-08-26 07:58:47 EDT
Git done (by process-git-requests).
Comment 16 Fedora Update System 2014-08-26 15:21:20 EDT
python-pdfminer-20140328-2.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/python-pdfminer-20140328-2.fc19
Comment 17 Fedora Update System 2014-08-26 16:08:55 EDT
python-pdfminer-20140328-2.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/python-pdfminer-20140328-2.fc20
Comment 18 Fedora Update System 2014-08-28 11:32:30 EDT
python-pdfminer-20140328-2.fc19 has been pushed to the Fedora 19 testing repository.
Comment 19 Fedora Update System 2014-09-09 18:06:49 EDT
python-pdfminer-20140328-2.fc20 has been pushed to the Fedora 20 stable repository.
Comment 20 Fedora Update System 2014-09-09 18:17:53 EDT
python-pdfminer-20140328-2.fc19 has been pushed to the Fedora 19 stable repository.