Bug 2388154 - Review Request: python-tokenizers - Implementation of today's most used tokenizers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: Package Review
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Miroslav Suchý
QA Contact: Fedora Extras Quality Assurance
URL: https://github.com/huggingface/tokeni...
Whiteboard:
Depends On: 2358553 2385892 2391936 2397580 2397581
Blocks:
Reported: 2025-08-13 04:40 UTC by Alexander Lent
Modified: 2026-02-12 03:27 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2026-02-12 03:27:45 UTC
Type: ---
Embargoed:
msuchy: fedora-review+


Attachments
The .spec file difference from Copr build 9782382 to 10104590 (8.55 KB, patch)
2026-02-08 18:43 UTC, Fedora Review Service
The .spec file difference from Copr build 10104590 to 10104981 (2.60 KB, patch)
2026-02-09 02:03 UTC, Fedora Review Service

Description Alexander Lent 2025-08-13 04:40:47 UTC
Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/655e9c5b2346332198e0ea00f822c35a4ed7455a/python-tokenizers.spec
SRPM URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/655e9c5b2346332198e0ea00f822c35a4ed7455a/python-tokenizers-0.21.4-1.fc42.src.rpm
Description: Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Bindings over the rust-tokenizers implementation.
Fedora Account System Username: xanderlent

Build failures are expected because this depends on bug 2358553 and bug 2385892.

Comment 1 Fedora Review Service 2025-08-13 22:10:43 UTC
Copr build:
https://copr.fedorainfracloud.org/coprs/build/9410995
(failed)

Build log:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/09410995-python-tokenizers/builder-live.log.gz

Please make sure the package builds successfully at least for Fedora Rawhide.

- If the build failed for unrelated reasons (e.g. temporary network
  unavailability), please ignore it.
- If the build failed because of missing BuildRequires, please make sure they
  are listed in the "Depends On" field


---
This comment was created by the fedora-review-service
https://github.com/FrostyX/fedora-review-service

If you want to trigger a new Copr build, add a comment containing new
Spec and SRPM URLs or [fedora-review-service-build] string.

Comment 2 Gordon Messmer 2025-09-23 00:51:51 UTC
This package will also require the Rust module pyo3-async-runtimes, which is not yet packaged for Fedora.

Comment 3 Gordon Messmer 2025-10-02 18:34:14 UTC
pyo3-async-runtimes is available now, but python-tokenizers will actually depend on pyo3-async-runtimes0.25, so I've requested new repos for that version as well. They should be available shortly.

With 0.25 modules available locally, I see two problems with this spec.

First, there are three instances of the command "cd ../../" and all of them need to be removed.

Second, there are some failures and errors in the tests, and at least some of them look like they're trying to use network resources:

FAILED tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_stream_fallback
FAILED tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_skip_special_tokens
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_basic_encoding
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_encode - hug...
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_with_special_tokens
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_with_truncation_padding
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_various_input_formats
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_error_handling
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_concurrency
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_decode - hug...
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_large_batch
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_numpy_inputs
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_async_methods_existence
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_performance_comparison

I'd like to ask the project to mark tests that require network access so that this spec doesn't need a comprehensive list of tests to disable.
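
Until such markers exist upstream, the failing tests could be deselected by name in %check. A minimal sketch, assuming the spec runs the Python tests through the %pytest macro (the real spec may be organized differently); the test IDs come from the output above:

```spec
%check
# Illustrative only: deselect the two failing TestTokenizer cases by node ID
# and skip the whole TestAsyncTokenizer class by keyword.
%pytest \
    --deselect tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_stream_fallback \
    --deselect tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_skip_special_tokens \
    -k "not TestAsyncTokenizer"
```

The drawback is the one noted above: such a list has to be revisited on every upstream release.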

Comment 4 Gordon Messmer 2025-10-02 20:30:52 UTC
The PR to mark network tests upstream is here: https://github.com/huggingface/tokenizers/pull/1872

Comment 5 Gordon Messmer 2025-10-02 21:31:00 UTC
I've pushed the changes required to package the current version, here: https://github.com/gordonmessmer/python-tokenizers

Comment 6 Gordon Messmer 2025-10-02 23:43:00 UTC
> there are three instances of the command "cd ../../" and all of them need to be removed

Never mind, that was a mistake on my part.

The package seems mostly good, other than:

- It needs to be updated to the current version.
- The current version has more tests that require network access and must be disabled. The patch in my git repo addresses this, and has been offered upstream.
- It would be nice to add a bcond to optionally enable network tests.
- It depends on pyo3-async-runtimes0.25, and I'm waiting on the creation of repos for that package.
- There is one use of %define that should instead be %global (or should include a justification).
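
The optional switch for network tests could be sketched as follows. The bcond name network_tests and the keyword filter are hypothetical, chosen only for illustration:

```spec
# Hypothetical build conditional, disabled by default because
# regular builds have no network access.
%bcond network_tests 0

%check
%if %{with network_tests}
# Run everything, including tests that download resources.
%pytest
%else
# Offline build: leave out the class whose tests errored in comment 3.
%pytest -k "not TestAsyncTokenizer"
%endif
```

A local build could then opt in with "rpmbuild --with network_tests", or with "mock --with network_tests --enable-network".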

Comment 7 Alexander Lent 2025-11-09 21:14:15 UTC
Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/db2f2c06fc5299db64875dfaa4ef02a52de58724/python-tokenizers.spec
SRPM URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/db2f2c06fc5299db64875dfaa4ef02a52de58724/python-tokenizers-0.22.1-1.fc44.src.rpm

Hi Gordon, sorry for the delay, things have been very busy at $DAYJOB. Looking forward to continuing the work on the Huggingface suite.

I've got this package building with only two minor issues:
- For some reason the generated BuildRequires don't capture one cargo package.
- The test section needs a look-over because the cargo tests fail to link against Python's C libraries and there are some skipped or deselected pytests.

Comment 8 Fedora Review Service 2025-11-09 21:19:34 UTC
Copr build:
https://copr.fedorainfracloud.org/coprs/build/9782382
(failed)

Build log:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/09782382-python-tokenizers/builder-live.log.gz

Please make sure the package builds successfully at least for Fedora Rawhide.

- If the build failed for unrelated reasons (e.g. temporary network
  unavailability), please ignore it.
- If the build failed because of missing BuildRequires, please make sure they
  are listed in the "Depends On" field


Comment 9 Gordon Messmer 2025-12-05 05:58:26 UTC
(In reply to Alexander Lent from comment #7)
> Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/db2f2c06fc5299db64875dfaa4ef02a52de58724/python-tokenizers.spec

I've resolved a few issues, with an alternative dist-git, here: https://codeberg.org/gordonmessmer/python-tokenizers

First change: https://codeberg.org/gordonmessmer/python-tokenizers/commit/f7d1e0a44ea364cd6100f9596b9ef6f597db002c

The package will not build in mock unless we disable a fairly large number of tests that require network access (including the two you had disabled in the spec, above). I've offered this change to the project (https://github.com/huggingface/tokenizers/pull/1872) though they are currently not inclined to merge the change. If you know them, maybe you'll have more luck advocating for the PR.

Second change: https://codeberg.org/gordonmessmer/python-tokenizers/commit/75e082e59dc3b7ef12b327ac0d5ca3c5df9d100a

This is a nit-pick: the guidelines request that we use "global" rather than "define" unless we really need to do otherwise.
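
As a sketch of the difference (the macro name and value below are hypothetical, not taken from the spec):

```spec
# Discouraged unless justified: a macro created with "define" inside a
# parametric macro body is dropped when that body finishes expanding.
#   %%define bindings_dir bindings/python

# Preferred: "global" keeps the definition for the rest of the spec and
# expands its body once, at definition time.
%global bindings_dir bindings/python
```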

Third change: https://codeberg.org/gordonmessmer/python-tokenizers/commit/ed3c1a9fe16ecbc3f88111a49975969ca368b10a

dev-dependencies will not be selected for dependency generation unless there is a "check" build condition, so line 1 "bcond check" is required, and that allows us to remove the manual specification of the buildreq on tempfile.

cargo test requires the --no-default-features argument, as indicated in the Makefile for the "test" rule. With the addition of that argument, the tests link to libpython correctly and the tests run successfully. (Though I couldn't really explain the purpose of that flag.)
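
Put together, the third change might look roughly like this. It is a sketch written in the current rust2rpm style, assuming the spec uses the standard %cargo_test macro, whose -n option passes --no-default-features; the commit linked above is authoritative:

```spec
# First line of the spec: with this build condition present, the
# dependency generator also picks up dev-dependencies such as tempfile.
%bcond check 1

%check
%if %{with check}
# Same switch as the upstream Makefile "test" rule: --no-default-features
%cargo_test -n
%endif
```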

Comment 10 Miroslav Suchý 2026-02-04 14:27:05 UTC
%files should include
%license LICENSE.dependencies
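
For reference, Rust-based specs generated by rust2rpm usually create that file during %build. A sketch of the usual pattern, assuming this spec follows it (the %files line is the one quoted from the spec in comment 15):

```spec
%build
# Print a summary of the crates' licenses into the build log, then write
# the per-crate license list to the file that must be shipped.
%cargo_license_summary
%{cargo_license} > LICENSE.dependencies

%files -n python3-tokenizers -f %{pyproject_files}
%license LICENSE.dependencies
```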

Comment 11 Miroslav Suchý 2026-02-04 14:37:00 UTC
@lx the build still fails for me with:
Problem: nothing provides requested (crate(numpy/default) >= 0.25.0 with crate(numpy/default) < 0.26.0~)

That is because rawhide already has a newer numpy.


> # Fill in the actual package description to submit package to Fedora
Can you remove this comment, as you have already done what the instructions say?

Comment 13 Fedora Review Service 2026-02-08 18:43:36 UTC
Created attachment 2128693 [details]
The .spec file difference from Copr build 9782382 to 10104590

Comment 14 Fedora Review Service 2026-02-08 18:43:47 UTC
Copr build:
https://copr.fedorainfracloud.org/coprs/build/10104590
(succeeded)

Review template:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/10104590-python-tokenizers/fedora-review/review.txt

Please take a look if any issues were found.


Comment 15 Miroslav Suchý 2026-02-08 22:42:45 UTC
> # License expression simplified by eliminatation, as permitted

License tag is missing:
  (BSD-2-Clause OR Apache-2.0 OR MIT) AND (MIT OR Apache-2.0) AND (Apache-2.0 OR MIT)

You should not evaluate and simplify the boolean formula. See https://docs.fedoraproject.org/en-US/legal/license-field/#_no_effective_license_analysis 

> # A patch I wrote, updating the sources for PyO3 0.27 support
> Patch:          https://github.com/huggingface/tokenizers/pull/1941.patch

This is better written as:

# A patch I wrote, updating the sources for PyO3 0.27 support
# https://github.com/huggingface/tokenizers/pull/1941
Patch:          1941.patch

This way you can easily check the status of the PR and see whether it has already been merged. I will treat it as a style preference and will not block on this.

> # Fill in the actual package description to submit package to Fedora

This is from the template and can be removed.

> # Update some deps to the latest version

Personally, I would change these to >= constraints instead of pinning one specific version; otherwise maintenance will be painful for you. I will not block on this.

> %files -n python3-tokenizers -f %{pyproject_files}

You have to add the following here:
%license LICENSE
%license LICENSE.dependencies

Comment 17 Fedora Review Service 2026-02-09 02:03:41 UTC
Created attachment 2128723 [details]
The .spec file difference from Copr build 10104590 to 10104981

Comment 18 Fedora Review Service 2026-02-09 02:03:43 UTC
Copr build:
https://copr.fedorainfracloud.org/coprs/build/10104981
(succeeded)

Review template:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/10104981-python-tokenizers/fedora-review/review.txt

Please take a look if any issues were found.


Comment 19 Miroslav Suchý 2026-02-09 07:53:11 UTC
Package Review
==============

Legend:
[x] = Pass, [!] = Fail, [-] = Not applicable, [?] = Not evaluated
[ ] = Manual review needed



===== MUST items =====

C/C++:
[-]: Development (unversioned) .so files in -devel subpackage, if present.
     Note: Unversioned so-files in private %_libdir subdirectory (see
     attachment). Verified they are not in ld path. 

Generic:
[x]: Package successfully compiles and builds into binary rpms on at least
     one supported primary architecture.
[x]: Package is licensed with an open-source compatible license and meets
     other legal requirements as defined in the legal section of Packaging
     Guidelines.
[x]: License field in the package spec file matches the actual license.
[x]: Package must own all directories that it creates.
[x]: %build honors applicable compiler flags or justifies otherwise.
[x]: Package contains no bundled libraries or specifies bundled libraries
     with Provides: bundled(<libname>) if unbundling is not possible.
[x]: Changelog in prescribed format.
[x]: Sources contain only permissible code or content.
[-]: Package contains desktop file if it is a GUI application.
[x]: Development files must be in a -devel package
[x]: Package uses nothing in %doc for runtime.
[x]: Package consistently uses macros (instead of hard-coded directory
     names).
[x]: Package is named according to the Package Naming Guidelines.
[x]: Package does not generate any conflict.
[x]: Package obeys FHS, except libexecdir and /usr/target.
[-]: If the package is a rename of another package, proper Obsoletes and
     Provides are present.
[x]: Requires correct, justified where necessary.
[x]: Spec file is legible and written in American English.
[-]: Package contains systemd file(s) if in need.
[x]: Useful -debuginfo package or justification otherwise.
[x]: Package is not known to require an ExcludeArch tag.
[-]: Large documentation must go in a -doc subpackage. Large could be size
     (~1MB) or number of files.
     Note: Documentation size is 28466 bytes in 2 files.
[x]: Package complies to the Packaging Guidelines
[x]: Package installs properly.
[x]: Rpmlint is run on all rpms the build produces.
     Note: No rpmlint messages.
[x]: If (and only if) the source package includes the text of the
     license(s) in its own file, then that file, containing the text of the
     license(s) for the package is included in %license.
[x]: The License field must be a valid SPDX expression.
[x]: Package requires other packages for directories it uses.
[x]: Package does not own files or directories owned by other packages.
[x]: Package uses either %{buildroot} or $RPM_BUILD_ROOT
[x]: Package does not run rm -rf %{buildroot} (or $RPM_BUILD_ROOT) at the
     beginning of %install.
[x]: Macros in Summary, %description expandable at SRPM build time.
[x]: Dist tag is present.
[x]: Package does not contain duplicates in %files.
[x]: Permissions on files are set properly.
[x]: Package must not depend on deprecated() packages.
[x]: Package use %makeinstall only when make install DESTDIR=... doesn't
     work.
[x]: Package is named using only allowed ASCII characters.
[x]: Package does not use a name that already exists.
[x]: Package is not relocatable.
[x]: Sources used to build the package match the upstream source, as
     provided in the spec URL.
[x]: Spec file name must match the spec package %{name}, in the format
     %{name}.spec.
[x]: File names are valid UTF-8.
[x]: Packages must not store files under /srv, /opt or /usr/local

Python:
[x]: Python eggs must not download any dependencies during the build
     process.
[-]: A package which is used by another package via an egg interface should
     provide egg info.
[x]: Package meets the Packaging Guidelines::Python
[x]: Package contains BR: python2-devel or python3-devel
[x]: Packages MUST NOT have dependencies (either build-time or runtime) on
     packages named with the unversioned python- prefix unless no properly
     versioned package exists. Dependencies on Python packages instead MUST
     use names beginning with python2- or python3- as appropriate.
[x]: Python packages must not contain %{pythonX_site(lib|arch)}/* in %files
[x]: Binary eggs must be removed in %prep

===== SHOULD items =====

Generic:
[x]: Reviewer should test that the package builds in mock.
[x]: If the source package does not include license text(s) as a separate
     file from upstream, the packager SHOULD query upstream to include it.
[x]: Final provides and requires are sane (see attachments).
[?]: Package functions as described.
[x]: Latest version is packaged.
[x]: Package does not include license text files separate from upstream.
[x]: Patches link to upstream bugs/comments/lists or are otherwise
     justified.
[-]: Sources are verified with gpgverify first in %prep if upstream
     publishes signatures.
     Note: gpgverify is not used.
[?]: Package should compile and build into binary rpms on all supported
     architectures.
[x]: %check is present and all tests pass.
[x]: Packages should try to preserve timestamps of original installed
     files.
[x]: Buildroot is not present
[x]: Package has no %clean section with rm -rf %{buildroot} (or
     $RPM_BUILD_ROOT)
[x]: No file requires outside of /etc, /bin, /sbin, /usr/bin, /usr/sbin.
[x]: Packager, Vendor, PreReq, Copyright tags should not be in spec file
[x]: Sources can be downloaded from URI in Source: tag
[x]: SourceX is a working URL.
[x]: Spec use %global instead of %define unless justified.

===== EXTRA items =====

Generic:
[x]: Rpmlint is run on all installed packages.
     Note: No rpmlint messages.
[x]: Large data in /usr/share should live in a noarch subpackage if package
     is arched.


APPROVED

Comment 20 Alexander Lent 2026-02-12 02:31:35 UTC
Thank you!

Comment 21 Fedora Admin user for bugzilla script actions 2026-02-12 02:32:55 UTC
The Pagure repository was created at https://src.fedoraproject.org/rpms/python-tokenizers

Comment 22 Fedora Update System 2026-02-12 03:23:29 UTC
FEDORA-2026-99231e372d (python-tokenizers-0.22.2-1.fc45) has been submitted as an update to Fedora 45.
https://bodhi.fedoraproject.org/updates/FEDORA-2026-99231e372d

Comment 23 Fedora Update System 2026-02-12 03:23:35 UTC
FEDORA-2026-e05183e1a7 (python-tokenizers-0.22.2-1.fc44) has been submitted as an update to Fedora 44.
https://bodhi.fedoraproject.org/updates/FEDORA-2026-e05183e1a7

Comment 24 Fedora Update System 2026-02-12 03:27:45 UTC
FEDORA-2026-99231e372d (python-tokenizers-0.22.2-1.fc45) has been pushed to the Fedora 45 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 25 Fedora Update System 2026-02-12 03:27:49 UTC
FEDORA-2026-e05183e1a7 (python-tokenizers-0.22.2-1.fc44) has been pushed to the Fedora 44 stable repository.
If problem still persists, please make note of it in this bug report.

