Bug 2319926 - Review Request: python-html-text - Extract text from HTML
Summary: Review Request: python-html-text - Extract text from HTML
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: Package Review
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Gwyn Ciesla
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-10-19 20:10 UTC by Benson Muite
Modified: 2024-11-04 04:23 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2024-11-03 02:38:35 UTC
Type: ---
Embargoed:
gwync: fedora-review+


Description Benson Muite 2024-10-19 20:10:27 UTC
spec: https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-rawhide-x86_64/08156160-python-html-text/python-html-text.spec
srpm: https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-rawhide-x86_64/08156160-python-html-text/python-html-text-0.6.2-1.fc42.src.rpm

description:
How is html_text different from .xpath('//text()') from LXML
or .get_text() from Beautiful Soup?

- Text extracted with html_text does not contain inline styles,
javascript, comments and other text that is not normally visible
to users;

- html_text normalizes whitespace, but in a way smarter than
.xpath('normalize-space()'), adding spaces around inline elements
(which are often used as block elements in html markup), and trying
to avoid adding extra spaces for punctuation;

- html_text can add newlines (e.g. after headers or paragraphs), so
that the output text looks more like how it is rendered in browsers.
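
As a rough illustration of the points above (not html_text's actual implementation, which builds on lxml), here is a standard-library sketch that drops script, style and comment content, collapses whitespace, and breaks lines after block elements. The names VisibleTextParser and visible_text and both tag sets are assumptions made for this example only.

```python
from html.parser import HTMLParser

# Tags whose contents are not shown to users (illustrative subset, an assumption).
HIDDEN_TAGS = {"script", "style", "noscript", "head"}
# Tags after which a line break is emitted (illustrative subset, an assumption).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside hidden elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1
        elif tag == "br":
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped implicitly.
        if self.hidden_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the visible text with whitespace collapsed on each line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.chunks).split("\n"):
        collapsed = " ".join(line.split())
        if collapsed:
            lines.append(collapsed)
    return "\n".join(lines)
```

This sketch does not attempt the punctuation-aware spacing described above; it only shows why a plain //text() query would return more than a user sees.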

fas: fed500

Comments:
Pytest7 warning seems spurious as pytest7 is not installed.

Reproducible: Always

Comment 1 Fedora Review Service 2024-10-19 20:10:45 UTC
The ticket summary is not in the correct format.
Expected:

    Review Request: <main package name here> - <short summary here>

Found:

    Review-request: python-html-text - Extract text from HTML

As a consequence, the package name cannot be parsed and submitted to
be automatically built. Please modify the ticket summary and trigger a
build by typing [fedora-review-service-build].
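
The expected format can be checked with a short regular expression. This is only a sketch under the assumption that the package name contains no whitespace; the pattern and the function name parse_summary are hypothetical and are not taken from the fedora-review-service code.

```python
import re

# Hypothetical pattern for "Review Request: <name> - <short summary>".
SUMMARY_RE = re.compile(r"^Review Request: (?P<name>\S+) - (?P<summary>.+)$")


def parse_summary(summary):
    """Return (package_name, short_summary), or None if the format is wrong."""
    match = SUMMARY_RE.match(summary.strip())
    if match is None:
        return None
    return match.group("name"), match.group("summary")
```

With this pattern, the "Review-request:" spelling quoted above yields None, while the corrected summary yields the package name python-html-text.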


---
This comment was created by the fedora-review-service
https://github.com/FrostyX/fedora-review-service

If you want to trigger a new Copr build, add a comment containing new
Spec and SRPM URLs or [fedora-review-service-build] string.

Comment 2 Benson Muite 2024-10-19 20:12:40 UTC
[fedora-review-service-build]

Comment 3 Fedora Review Service 2024-10-19 20:16:53 UTC
Copr build:
https://copr.fedorainfracloud.org/coprs/build/8158558
(succeeded)

Review template:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2319926-python-html-text/fedora-rawhide-x86_64/08158558-python-html-text/fedora-review/review.txt

Found issues:

- python3-pytest7 is deprecated, you must not depend on it.
  Read more: https://docs.fedoraproject.org/en-US/packaging-guidelines/deprecating-packages/

Please know that there can be false-positives.

Comment 4 Gwyn Ciesla 2024-10-23 15:22:15 UTC
Package Review
==============

Legend:
[x] = Pass, [!] = Fail, [-] = Not applicable, [?] = Not evaluated
[ ] = Manual review needed


Issues:
=======
- Package must not depend on deprecated() packages.
  Note: python3-pytest7 is deprecated, you must not depend on it.
  See: https://docs.fedoraproject.org/en-US/packaging-guidelines/deprecating-packages/


===== MUST items =====

Generic:
[x]: Package is licensed with an open-source compatible license and meets
     other legal requirements as defined in the legal section of Packaging
     Guidelines.
[x]: License field in the package spec file matches the actual license.
     Note: Checking patched sources after %prep for licenses. Licenses
     found: "Unknown or generated", "MIT License", "*No copyright* MIT
     License". 23 files have unknown license. Detailed output of
     licensecheck in /home/gwyn/2319926-python-html-text/licensecheck.txt
[x]: Package must own all directories that it creates.
     Note: Directories without known owners:
     /usr/lib/python3.13/site-packages, /usr/lib/python3.13
[x]: Package contains no bundled libraries without FPC exception.
[x]: Changelog in prescribed format.
[x]: Sources contain only permissible code or content.
[-]: Package contains desktop file if it is a GUI application.
[-]: Development files must be in a -devel package
[x]: Package uses nothing in %doc for runtime.
[x]: Package consistently uses macros (instead of hard-coded directory
     names).
[x]: Package is named according to the Package Naming Guidelines.
[x]: Package does not generate any conflict.
[x]: Package obeys FHS, except libexecdir and /usr/target.
[x]: If the package is a rename of another package, proper Obsoletes and
     Provides are present.
[x]: Requires correct, justified where necessary.
[x]: Spec file is legible and written in American English.
[-]: Package contains systemd file(s) if in need.
[x]: Package is not known to require an ExcludeArch tag.
[x]: Package complies to the Packaging Guidelines
[x]: Package successfully compiles and builds into binary rpms on at least
     one supported primary architecture.
[x]: Package installs properly.
[x]: Rpmlint is run on all rpms the build produces.
     Note: There are rpmlint messages (see attachment).
[x]: If (and only if) the source package includes the text of the
     license(s) in its own file, then that file, containing the text of the
     license(s) for the package is included in %license.
[x]: The License field must be a valid SPDX expression.
[x]: Package requires other packages for directories it uses.
[x]: Package does not own files or directories owned by other packages.
[x]: Package uses either %{buildroot} or $RPM_BUILD_ROOT
[x]: Package does not run rm -rf %{buildroot} (or $RPM_BUILD_ROOT) at the
     beginning of %install.
[x]: Macros in Summary, %description expandable at SRPM build time.
[x]: Dist tag is present.
[x]: Package does not contain duplicates in %files.
[x]: Permissions on files are set properly.
[x]: Package use %makeinstall only when make install DESTDIR=... doesn't
     work.
[x]: Package is named using only allowed ASCII characters.
[x]: Package does not use a name that already exists.
[x]: Package is not relocatable.
[x]: Sources used to build the package match the upstream source, as
     provided in the spec URL.
[x]: Spec file name must match the spec package %{name}, in the format
     %{name}.spec.
[x]: File names are valid UTF-8.
[x]: Large documentation must go in a -doc subpackage. Large could be size
     (~1MB) or number of files.
     Note: Documentation size is 4760 bytes in 1 files.
[x]: Packages must not store files under /srv, /opt or /usr/local

Python:
[x]: Python eggs must not download any dependencies during the build
     process.
[-]: A package which is used by another package via an egg interface should
     provide egg info.
[x]: Package meets the Packaging Guidelines::Python
[x]: Package contains BR: python2-devel or python3-devel
[x]: Packages MUST NOT have dependencies (either build-time or runtime) on
     packages named with the unversioned python- prefix unless no properly
     versioned package exists. Dependencies on Python packages instead MUST
     use names beginning with python2- or python3- as appropriate.
[x]: Python packages must not contain %{pythonX_site(lib|arch)}/* in %files
[x]: Binary eggs must be removed in %prep

===== SHOULD items =====

Generic:
[x]: If the source package does not include license text(s) as a separate
     file from upstream, the packager SHOULD query upstream to include it.
[x]: Final provides and requires are sane (see attachments).
[x]: Package functions as described.
[x]: Latest version is packaged.
[x]: Package does not include license text files separate from upstream.
[-]: Sources are verified with gpgverify first in %prep if upstream
     publishes signatures.
     Note: gpgverify is not used.
[x]: Package should compile and build into binary rpms on all supported
     architectures.
[x]: %check is present and all tests pass.
[x]: Packages should try to preserve timestamps of original installed
     files.
[x]: Reviewer should test that the package builds in mock.
[x]: Buildroot is not present
[x]: Package has no %clean section with rm -rf %{buildroot} (or
     $RPM_BUILD_ROOT)
[x]: No file requires outside of /etc, /bin, /sbin, /usr/bin, /usr/sbin.
[x]: Packager, Vendor, PreReq, Copyright tags should not be in spec file
[x]: Sources can be downloaded from URI in Source: tag
[x]: SourceX is a working URL.
[x]: Spec use %global instead of %define unless justified.

===== EXTRA items =====

Generic:
[x]: Rpmlint is run on all installed packages.
     Note: There are rpmlint messages (see attachment).
[x]: Spec file according to URL is the same as in SRPM.


Rpmlint
-------
Checking: python3-html-text-0.6.2-1.fc42.noarch.rpm
          python-html-text-0.6.2-1.fc42.src.rpm
===================================== rpmlint session starts ====================================
rpmlint: 2.5.0
configuration:
    /usr/lib/python3.12/site-packages/rpmlint/configdefaults.toml
    /etc/xdg/rpmlint/fedora-legacy-licenses.toml
    /etc/xdg/rpmlint/fedora-spdx-licenses.toml
    /etc/xdg/rpmlint/fedora.toml
    /etc/xdg/rpmlint/scoring.toml
    /etc/xdg/rpmlint/users-groups.toml
    /etc/xdg/rpmlint/warn-on-functions.toml
rpmlintrc: [PosixPath('/tmp/tmp73bs6qsk')]
checks: 32, packages: 2

python-html-text.src: E: spelling-error ('xpath', '%description -l en_US xpath -> path, x path, expat')
python-html-text.src: E: spelling-error ('javascript', '%description -l en_US javascript -> java script, java-script, JavaScript')
python-html-text.src: E: spelling-error ('whitespace', '%description -l en_US whitespace -> white space, white-space, whites pace')
python3-html-text.noarch: E: spelling-error ('xpath', '%description -l en_US xpath -> path, x path, expat')
python3-html-text.noarch: E: spelling-error ('javascript', '%description -l en_US javascript -> java script, java-script, JavaScript')
python3-html-text.noarch: E: spelling-error ('whitespace', '%description -l en_US whitespace -> white space, white-space, whites pace')
 2 packages and 0 specfiles checked; 6 errors, 0 warnings, 9 filtered, 6 badness; has taken 3.4 s 




Rpmlint (installed packages)
----------------------------
============================ rpmlint session starts ============================
rpmlint: 2.5.0
configuration:
    /usr/lib/python3.13/site-packages/rpmlint/configdefaults.toml
    /etc/xdg/rpmlint/fedora-spdx-licenses.toml
    /etc/xdg/rpmlint/fedora.toml
    /etc/xdg/rpmlint/scoring.toml
    /etc/xdg/rpmlint/users-groups.toml
    /etc/xdg/rpmlint/warn-on-functions.toml
checks: 32, packages: 1

python3-html-text.noarch: E: spelling-error ('xpath', '%description -l en_US xpath -> path, x path, expat')
python3-html-text.noarch: E: spelling-error ('javascript', '%description -l en_US javascript -> java script, java-script, JavaScript')
python3-html-text.noarch: E: spelling-error ('whitespace', '%description -l en_US whitespace -> white space, white-space, whites pace')
 1 packages and 0 specfiles checked; 3 errors, 0 warnings, 5 filtered, 3 badness; has taken 0.4 s 



Source checksums
----------------
https://github.com/zytedata/html-text/archive/0.6.2/html-text-0.6.2.tar.gz :
  CHECKSUM(SHA256) this package     : 2bda73192e3009bacb626c8feacc9ab5f0685947eb5847e181fb1d330410bcc3
  CHECKSUM(SHA256) upstream package : 2bda73192e3009bacb626c8feacc9ab5f0685947eb5847e181fb1d330410bcc3
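
The comparison above can be reproduced by hand. The following is a minimal sketch, assuming both tarballs have already been downloaded to local paths; the function names are hypothetical and this is not how fedora-review itself is implemented.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sources_match(packaged_path, upstream_path):
    """True if the packaged source and the upstream source are byte-identical."""
    return sha256_of_file(packaged_path) == sha256_of_file(upstream_path)
```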


Requires
--------
python3-html-text (rpmlib, GLIBC filtered):
    python(abi)
    python3.13dist(lxml)
    python3.13dist(lxml-html-clean)



Provides
--------
python3-html-text:
    python-html-text
    python3-html-text
    python3.13-html-text
    python3.13dist(html-text)
    python3dist(html-text)



Generated by fedora-review 0.10.0 (e79b66b) last change: 2023-07-24
Command line :/usr/bin/fedora-review -b 2319926
Buildroot used: fedora-rawhide-x86_64
Active plugins: Python, Generic, Shell-api
Disabled plugins: fonts, Java, Haskell, SugarActivity, C/C++, Ocaml, PHP, Perl, R
Disabled flags: EXARCH, EPEL6, EPEL7, DISTTAG, BATCH

Comment 5 Benson Muite 2024-10-24 15:18:00 UTC
The build log shows that pytest, not pytest7, is used:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2319926-python-html-text/fedora-rawhide-x86_64/08158558-python-html-text/builder-live.log.gz
This seems to be an error that occurs when running fedora-review.

The spelling seems OK, but I can change it to what rpmlint suggests if required.

Comment 6 Gwyn Ciesla 2024-10-24 15:35:06 UTC
Agreed. Approved.

Comment 7 Fedora Admin user for bugzilla script actions 2024-10-25 02:33:53 UTC
The Pagure repository was created at https://src.fedoraproject.org/rpms/python-html-text

Comment 8 Fedora Update System 2024-10-25 03:51:15 UTC
FEDORA-2024-2cc8b58f3c (python-html-text-0.6.2-1.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-2cc8b58f3c

Comment 9 Fedora Update System 2024-10-25 03:52:20 UTC
FEDORA-2024-8b68e67ed9 (python-html-text-0.6.2-1.fc41) has been submitted as an update to Fedora 41.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-8b68e67ed9

Comment 10 Fedora Update System 2024-10-26 04:23:09 UTC
FEDORA-2024-2cc8b58f3c has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf install --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-2cc8b58f3c \*`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-2cc8b58f3c

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 11 Fedora Update System 2024-10-27 18:19:49 UTC
FEDORA-2024-8b68e67ed9 has been pushed to the Fedora 41 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf install --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-8b68e67ed9 \*`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-8b68e67ed9

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 12 Fedora Update System 2024-11-03 02:38:35 UTC
FEDORA-2024-2cc8b58f3c (python-html-text-0.6.2-1.fc40) has been pushed to the Fedora 40 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 13 Fedora Update System 2024-11-04 04:23:08 UTC
FEDORA-2024-8b68e67ed9 (python-html-text-0.6.2-1.fc41) has been pushed to the Fedora 41 stable repository.
If the problem still persists, please make note of it in this bug report.
