Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/655e9c5b2346332198e0ea00f822c35a4ed7455a/python-tokenizers.spec
SRPM URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/655e9c5b2346332198e0ea00f822c35a4ed7455a/python-tokenizers-0.21.4-1.fc42.src.rpm
Description: Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Bindings over the rust-tokenizers implementation.
Fedora Account System Username: xanderlent

Build failures are expected; this depends on bug 2358553 and bug 2385892.
Copr build: https://copr.fedorainfracloud.org/coprs/build/9410995 (failed)
Build log: https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/09410995-python-tokenizers/builder-live.log.gz

Please make sure the package builds successfully at least for Fedora Rawhide.

- If the build failed for unrelated reasons (e.g. temporary network unavailability), please ignore it.
- If the build failed because of missing BuildRequires, please make sure they are listed in the "Depends On" field

---
This comment was created by the fedora-review-service
https://github.com/FrostyX/fedora-review-service

If you want to trigger a new Copr build, add a comment containing new Spec and SRPM URLs or [fedora-review-service-build] string.
This package will also require the Rust crate pyo3-async-runtimes, which is not yet packaged for Fedora.
pyo3-async-runtimes is available now, but python-tokenizers will actually depend on pyo3-async-runtimes0.25, so I've requested new repos for that version as well. They should be available shortly.

With 0.25 modules available locally, I see two problems with this spec.

First, there are three instances of the command "cd ../../" and all of them need to be removed.

Second, there are some failures and errors in the tests, and at least some of them look like they're trying to use network resources:

FAILED tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_stream_fallback
FAILED tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_skip_special_tokens
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_basic_encoding
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_encode - hug...
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_with_special_tokens
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_with_truncation_padding
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_various_input_formats
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_error_handling
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_concurrency
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_decode - hug...
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_large_batch
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_numpy_inputs
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_async_methods_existence
ERROR tests/bindings/test_tokenizer.py::TestAsyncTokenizer::test_performance_comparison

I'd like to ask the project to mark tests that require network access so that this spec doesn't need a comprehensive list of tests to disable.
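Until upstream marks these tests, the "comprehensive list" approach might look roughly like the sketch below. It assumes the spec runs the suite with the %pytest macro from a directory where these node IDs resolve, and that the failures listed above really are network-related; neither is confirmed here. pytest treats each --deselect argument as a node ID prefix, so the class-level ID covers every TestAsyncTokenizer test.

```spec
%check
# Sketch only: deselect the tests reported as FAILED or ERROR above,
# on the assumption that they need network access.
%pytest \
    --deselect tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_stream_fallback \
    --deselect tests/bindings/test_tokenizer.py::TestTokenizer::test_decode_skip_special_tokens \
    --deselect tests/bindings/test_tokenizer.py::TestAsyncTokenizer
```

A list like this has to be revisited whenever upstream adds or renames tests, which is the maintenance cost an upstream marker would avoid.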
The PR to mark network tests upstream is here: https://github.com/huggingface/tokenizers/pull/1872
I've pushed the changes required to package the current version here: https://github.com/gordonmessmer/python-tokenizers
> there are three instances of the command "cd ../../" and all of them need to be removed

N/M... mistake on my part.

Package seems mostly good, other than:

- Needs to update to the current version.
- Current version has more tests that require network access and must be disabled. The patch in my git repo addresses this, and has been offered upstream. It would be nice to add a bcond to optionally enable network tests.
- Dependency on pyo3-async-runtimes0.25, and I'm waiting on creation of repos for that package.
- There is one use of %define that should instead be %global (or should include a justification).
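A bcond for network tests could be sketched as follows. The bcond name network_tests and the pytest marker name network are hypothetical (the marker depends on what the upstream patch actually defines), and the single-line %bcond form assumes rpm 4.17.1 or later.

```spec
# Hypothetical bcond, off by default so that builds work in mock.
# Enable it with: --with network_tests
%bcond network_tests 0

%check
%if %{with network_tests}
# Run the full suite, including tests that reach the network.
%pytest
%else
# Assumes the network tests carry a marker named "network" (hypothetical name).
%pytest -m "not network"
%endif
```

With the default of 0, Koji and mock builds skip the network tests, while a local rebuild can opt in when network access is available.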
Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/db2f2c06fc5299db64875dfaa4ef02a52de58724/python-tokenizers.spec
SRPM URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/db2f2c06fc5299db64875dfaa4ef02a52de58724/python-tokenizers-0.22.1-1.fc44.src.rpm

Hi Gordon, sorry for the delay, things have been very busy at $DAYJOB. Looking forward to continuing the work on the Huggingface suite.

I've got this package building with only two minor issues:

- For some reason the generated BuildRequires don't capture one cargo package.
- The test section needs a look-over because the cargo tests fail to link against Python's C libraries and there are some skipped or deselected pytests.
Copr build: https://copr.fedorainfracloud.org/coprs/build/9782382 (failed)
Build log: https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/09782382-python-tokenizers/builder-live.log.gz
(In reply to Alexander Lent from comment #7)
> Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/db2f2c06fc5299db64875dfaa4ef02a52de58724/python-tokenizers.spec

I've resolved a few issues, with an alternative dist-git, here: https://codeberg.org/gordonmessmer/python-tokenizers

First change: https://codeberg.org/gordonmessmer/python-tokenizers/commit/f7d1e0a44ea364cd6100f9596b9ef6f597db002c

The package will not build in mock unless we disable a fairly large number of tests that require network access (including the two you had disabled in the spec, above). I've offered this change to the project (https://github.com/huggingface/tokenizers/pull/1872) though they are currently not inclined to merge the change. If you know them, maybe you'll have more luck advocating for the PR.

Second change: https://codeberg.org/gordonmessmer/python-tokenizers/commit/75e082e59dc3b7ef12b327ac0d5ca3c5df9d100a

This is a nit-pick: the guidelines request that we use "global" rather than "define" unless we really need to do otherwise.

Third change: https://codeberg.org/gordonmessmer/python-tokenizers/commit/ed3c1a9fe16ecbc3f88111a49975969ca368b10a

dev-dependencies will not be selected for dependency generation unless there is a "check" build condition, so line 1 "bcond check" is required, and that allows us to remove the manual specification of the buildreq on tempfile.

cargo test requires the --no-default-features argument, as indicated in the Makefile for the "test" rule. With the addition of that argument, the tests link to libpython correctly and the tests run successfully. (Though I couldn't really explain the purpose of that flag.)
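The third change can be sketched as the fragment below. This is an illustration rather than the contents of the linked commit: it assumes Fedora's Rust macros (%cargo_generate_buildrequires, %cargo_test), and it assumes the -n option of %cargo_test is what passes --no-default-features to cargo. The Python-side build requirements and pytest invocation are left out.

```spec
# Declared near the top of the spec; without it, dev-dependencies are
# not included when BuildRequires are generated.
%bcond check 1

%generate_buildrequires
%cargo_generate_buildrequires

%check
%if %{with check}
# -n is assumed to map to cargo's --no-default-features option.
%cargo_test -n
%endif
```

Because the generated BuildRequires then cover dev-dependencies, a manual BuildRequires line for tempfile should no longer be needed, as described in the comment above.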
%files should include %license LICENSE.dependencies
@lx the build still fails for me with:

Problem: nothing provides requested (crate(numpy/default) >= 0.25.0 with crate(numpy/default) < 0.26.0~)

That is because Rawhide already has a newer numpy.

> # Fill in the actual package description to submit package to Fedora

Can you remove this comment, since you have already done what the instructions say?
Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/06fa50d88c25a1efbf59bc29db18c60a948182f5/python-tokenizers.spec
SRPM URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/06fa50d88c25a1efbf59bc29db18c60a948182f5/python-tokenizers-0.22.2-1.fc45.src.rpm

Apologies for the delay. I've published a new version, patched to work with 0.27.
Created attachment 2128693 [details] The .spec file difference from Copr build 9782382 to 10104590
Copr build: https://copr.fedorainfracloud.org/coprs/build/10104590 (succeeded)
Review template: https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/10104590-python-tokenizers/fedora-review/review.txt

Please take a look if any issues were found.
> # License expression simplified by eliminatation, as permitted

License tag is missing:

(BSD-2-Clause OR Apache-2.0 OR MIT) AND (MIT OR Apache-2.0) AND (Apache-2.0 OR MIT)

You should not evaluate and simplify the boolean formula. See https://docs.fedoraproject.org/en-US/legal/license-field/#_no_effective_license_analysis

> # A patch I wrote, updating the sources for PyO3 0.27 support
> Patch: https://github.com/huggingface/tokenizers/pull/1941.patch

This is better written as:

# A patch I wrote, updating the sources for PyO3 0.27 support
# https://github.com/huggingface/tokenizers/pull/1941
Patch: 1941.patch

This way you can easily check the status of the PR, i.e. whether it has already been merged. But I will treat it as a style preference and will not block on this.

> # Fill in the actual package description to submit package to Fedora

This is from the template and can be removed.

> # Update some deps to the latest version

Personally, I would require >= these versions instead of one specific version; otherwise maintenance will be painful for you. I will not block on this.

> %files -n python3-tokenizers -f %{pyproject_files}

You have to add here:

%license LICENSE
%license LICENSE.dependencies
Spec URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/b49b710026fd581cd1e21b0661b3ec506840ca10/python-tokenizers.spec
SRPM URL: https://gist.github.com/xanderlent/425254e3fc437b2558a3c8063b0b573a/raw/b49b710026fd581cd1e21b0661b3ec506840ca10/python-tokenizers-0.22.2-1.fc45.src.rpm

Please take a look; I think this addresses all of the issues.
Created attachment 2128723 [details] The .spec file difference from Copr build 10104590 to 10104981
Copr build: https://copr.fedorainfracloud.org/coprs/build/10104981 (succeeded)
Review template: https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2388154-python-tokenizers/fedora-rawhide-x86_64/10104981-python-tokenizers/fedora-review/review.txt
Package Review
==============

Legend:
[x] = Pass, [!] = Fail, [-] = Not applicable, [?] = Not evaluated
[ ] = Manual review needed


===== MUST items =====

C/C++:
[-]: Development (unversioned) .so files in -devel subpackage, if present.
     Note: Unversioned so-files in private %_libdir subdirectory (see
     attachment). Verified they are not in ld path.

Generic:
[x]: Package successfully compiles and builds into binary rpms on at least one supported primary architecture.
[x]: Package is licensed with an open-source compatible license and meets other legal requirements as defined in the legal section of Packaging Guidelines.
[x]: License field in the package spec file matches the actual license.
[x]: Package must own all directories that it creates.
[x]: %build honors applicable compiler flags or justifies otherwise.
[x]: Package contains no bundled libraries or specifies bundled libraries with Provides: bundled(<libname>) if unbundling is not possible.
[x]: Changelog in prescribed format.
[x]: Sources contain only permissible code or content.
[-]: Package contains desktop file if it is a GUI application.
[x]: Development files must be in a -devel package
[x]: Package uses nothing in %doc for runtime.
[x]: Package consistently uses macros (instead of hard-coded directory names).
[x]: Package is named according to the Package Naming Guidelines.
[x]: Package does not generate any conflict.
[x]: Package obeys FHS, except libexecdir and /usr/target.
[-]: If the package is a rename of another package, proper Obsoletes and Provides are present.
[x]: Requires correct, justified where necessary.
[x]: Spec file is legible and written in American English.
[-]: Package contains systemd file(s) if in need.
[x]: Useful -debuginfo package or justification otherwise.
[x]: Package is not known to require an ExcludeArch tag.
[-]: Large documentation must go in a -doc subpackage. Large could be size (~1MB) or number of files.
     Note: Documentation size is 28466 bytes in 2 files.
[x]: Package complies to the Packaging Guidelines
[x]: Package installs properly.
[x]: Rpmlint is run on all rpms the build produces.
     Note: No rpmlint messages.
[x]: If (and only if) the source package includes the text of the license(s) in its own file, then that file, containing the text of the license(s) for the package is included in %license.
[x]: The License field must be a valid SPDX expression.
[x]: Package requires other packages for directories it uses.
[x]: Package does not own files or directories owned by other packages.
[x]: Package uses either %{buildroot} or $RPM_BUILD_ROOT
[x]: Package does not run rm -rf %{buildroot} (or $RPM_BUILD_ROOT) at the beginning of %install.
[x]: Macros in Summary, %description expandable at SRPM build time.
[x]: Dist tag is present.
[x]: Package does not contain duplicates in %files.
[x]: Permissions on files are set properly.
[x]: Package must not depend on deprecated() packages.
[x]: Package use %makeinstall only when make install DESTDIR=... doesn't work.
[x]: Package is named using only allowed ASCII characters.
[x]: Package does not use a name that already exists.
[x]: Package is not relocatable.
[x]: Sources used to build the package match the upstream source, as provided in the spec URL.
[x]: Spec file name must match the spec package %{name}, in the format %{name}.spec.
[x]: File names are valid UTF-8.
[x]: Packages must not store files under /srv, /opt or /usr/local

Python:
[x]: Python eggs must not download any dependencies during the build process.
[-]: A package which is used by another package via an egg interface should provide egg info.
[x]: Package meets the Packaging Guidelines::Python
[x]: Package contains BR: python2-devel or python3-devel
[x]: Packages MUST NOT have dependencies (either build-time or runtime) on packages named with the unversioned python- prefix unless no properly versioned package exists. Dependencies on Python packages instead MUST use names beginning with python2- or python3- as appropriate.
[x]: Python packages must not contain %{pythonX_site(lib|arch)}/* in %files
[x]: Binary eggs must be removed in %prep


===== SHOULD items =====

Generic:
[x]: Reviewer should test that the package builds in mock.
[x]: If the source package does not include license text(s) as a separate file from upstream, the packager SHOULD query upstream to include it.
[x]: Final provides and requires are sane (see attachments).
[?]: Package functions as described.
[x]: Latest version is packaged.
[x]: Package does not include license text files separate from upstream.
[x]: Patches link to upstream bugs/comments/lists or are otherwise justified.
[-]: Sources are verified with gpgverify first in %prep if upstream publishes signatures.
     Note: gpgverify is not used.
[?]: Package should compile and build into binary rpms on all supported architectures.
[x]: %check is present and all tests pass.
[x]: Packages should try to preserve timestamps of original installed files.
[x]: Buildroot is not present
[x]: Package has no %clean section with rm -rf %{buildroot} (or $RPM_BUILD_ROOT)
[x]: No file requires outside of /etc, /bin, /sbin, /usr/bin, /usr/sbin.
[x]: Packager, Vendor, PreReq, Copyright tags should not be in spec file
[x]: Sources can be downloaded from URI in Source: tag
[x]: SourceX is a working URL.
[x]: Spec use %global instead of %define unless justified.


===== EXTRA items =====

Generic:
[x]: Rpmlint is run on all installed packages.
     Note: No rpmlint messages.
[x]: Large data in /usr/share should live in a noarch subpackage if package is arched.


APPROVED
Thank you!
The Pagure repository was created at https://src.fedoraproject.org/rpms/python-tokenizers
FEDORA-2026-99231e372d (python-tokenizers-0.22.2-1.fc45) has been submitted as an update to Fedora 45. https://bodhi.fedoraproject.org/updates/FEDORA-2026-99231e372d
FEDORA-2026-e05183e1a7 (python-tokenizers-0.22.2-1.fc44) has been submitted as an update to Fedora 44. https://bodhi.fedoraproject.org/updates/FEDORA-2026-e05183e1a7
FEDORA-2026-99231e372d (python-tokenizers-0.22.2-1.fc45) has been pushed to the Fedora 45 stable repository. If the problem still persists, please make note of it in this bug report.
FEDORA-2026-e05183e1a7 (python-tokenizers-0.22.2-1.fc44) has been pushed to the Fedora 44 stable repository. If the problem still persists, please make note of it in this bug report.