Bug 1686078 - bytecompilation is not reproducible, leading to incompatibilities between arched packages
Summary: bytecompilation is not reproducible, leading to incompatibilities between arc...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: python-rpm-macros
Version: 33
Hardware: Unspecified
OS: Unspecified
low
unspecified
Target Milestone: ---
Assignee: Lumír Balhar
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-06 16:46 UTC by Zbigniew Jędrzejewski-Szmek
Modified: 2020-11-03 15:37 UTC (History)
16 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-02 12:03:40 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Python 34033 0 None None None 2020-01-21 14:07:45 UTC

Description Zbigniew Jędrzejewski-Szmek 2019-03-06 16:46:22 UTC
I know that python maintainers are aware of the issue, so I'm writing this up mostly for other people who might encounter the same issue.

Description of problem:
During upgrade F29→F30 I got a transaction error from dnf that glib2-devel.i686 conflicts with glibc2-devel.x84_64 in file /usr/share/glib-2.0/codegen/__pycache__/codegen_docbook.cpython-37.pyc. Checking the difference between files reveals multiple single-bytes differences throughout the file.

This is essentially the problem described and solved by
https://www.python.org/dev/peps/pep-0552/
and 
https://bugs.python.org/issue34033. Based on help from the Miro, Petr, I came up with the following solution:
https://src.fedoraproject.org/rpms/glib2/c/de2e4aad989fff9bdd232b7f68b5e619198f5ef6?branch=master.

tl;dr: py_bytecompile output depends on hash seed and some other implementation details, so it's not reproducible between architectures. This means that .py files that have the exact same contents and same timestamps and other metadata result in differing .pyc files. This might lead to two kinds of issues:
- multi-arch packages are not co-installable,
- noarch packages built as part of an arch-full build might be rejected by koji because they have different contents.
Finally, there's the general consideration that build reproducubility is lost. It is not an explicit goal in Fedora (yet), but it's certainly worth thinking about.



Version-Release number of selected component (if applicable):
python3-3.7.2-6.fc30.x86_64

Comment 1 Zbigniew Jędrzejewski-Szmek 2019-03-13 08:34:26 UTC
It turns out that SOURCE_DATE_EPOCH=0 (which forces invalidation-mode == checked-hash) does not work.
glib2-devel-2.60.0-2.fc30.x86_64.rpm and glib2-devel-2.60.0-2.fc30.i686.rpm still conflict.
Inspecting the headers of the pyc files reveals that:
1. codegen.cpython-37.pyc starts with "0d42 0a0d 0003 0000 6c90 0170 a1f5 94cc", which
   means that the invalidation-mode==checked-hash (00000003 encoded as little-endian as
   the second 32-bit word)
2. the same header appears in files in both rpms
3. the contents of the two files differ later on.


│ ├── ./usr/share/glib-2.0/codegen/__pycache__/codegen_docbook.cpython-37.pyc
│ │ @@ -79,15 +79,15 @@
│ │  000004e0: 6f64 73da 036d 6178 da03 6c65 6eda 046e  ods..max..len..n
│ │  000004f0: 616d 65da 0769 6e5f 6172 6773 da09 7369  ame..in_args..si
│ │  00000500: 676e 6174 7572 65da 086f 7574 5f61 7267  gnature..out_arg
│ │  00000510: 73da 036f 7574 da05 7772 6974 6572 0500  s..out..writer..
│ │  00000520: 0000 da0f 646f 7473 5f74 6f5f 6879 7068  ....dots_to_hyph
│ │  00000530: 656e 7329 0972 0b00 0000 da01 69da 016d  ens).r......i..m
│ │  00000540: da0b 696e 5f73 796e 6f70 7369 735a 0e6d  ..in_synopsisZ.m
│ │ -00000550: 6178 5f6d 6574 686f 645f 6c65 6eda 025f  ax_method_len.._
│ │ +00000550: 6178 5f6d 6574 686f 645f 6c65 6e5a 025f  ax_method_lenZ._
│ │  00000560: 6dda 116d 6178 5f73 6967 6e61 7475 7265  m..max_signature
│ │  00000570: 5f6c 656e da01 61da 0563 6f75 6e74 720c  _len..a..countr.
│ │  00000580: 0000 0072 0c00 0000 720d 0000 00da 1670  ...r....r......p
│ │  00000590: 7269 6e74 5f6d 6574 686f 645f 7072 6f74  rint_method_prot
│ │  000005a0: 6f74 7970 6526 0000 0073 4200 0000 0001  otype&...sB.....
│ │  000005b0: 0401 0401 0c01 1602 1002 0401 0401 0c01  ................
│ │  000005c0: 0c01 1401 0c01 1a02 0c01 1401 0c01 1402  ................
│ │ @@ -128,15 +128,15 @@
│ │  000007f0: 252a 7320 2573 7203 0000 007a 0329 3b0a  %*s %sr....z.);.
│ │  00000800: 290a da07 7369 676e 616c 7372 1200 0000  )...signalsr....
│ │  00000810: 7213 0000 0072 1400 0000 da04 6172 6773  r....r......args
│ │  00000820: 7216 0000 0072 1800 0000 7219 0000 0072  r....r....r....r
│ │  00000830: 0500 0000 721a 0000 0029 0972 0b00 0000  ....r....).r....
│ │  00000840: 721b 0000 00da 0173 721d 0000 005a 0e6d  r......sr....Z.m
│ │  00000850: 6178 5f73 6967 6e61 6c5f 6c65 6e5a 025f  ax_signal_lenZ._
│ │ -00000860: 7372 1f00 0000 7220 0000 0072 2100 0000  sr....r ...r!...
│ │ +00000860: 7372 1e00 0000 721f 0000 0072 2000 0000  sr....r....r ...
│ │  00000870: 720c 0000 0072 0c00 0000 720d 0000 00da  r....r....r.....
│ │  00000880: 1670 7269 6e74 5f73 6967 6e61 6c5f 7072  .print_signal_pr
│ │  00000890: 6f74 6f74 7970 654e 0000 0073 3000 0000  ototypeN...s0...
│ │  000008a0: 0001 0401 0401 0c01 1602 1002 0401 0401  ................
│ │  000008b0: 0c01 0c01 1a02 0c01 1402 0401 0801 2a02  ..............*.
│ │  000008c0: 0801 1a01 0401 0c01 0a01 1801 2601 0c01  ............&...
│ │  000008d0: 7a2b 446f 6362 6f6f 6b43 6f64 6547 656e  z+DocbookCodeGen
...


This essentially means that the hashed invalidation mode is busted.

Comment 2 Petr Viktorin (pviktori) 2019-03-13 10:19:33 UTC
It should be noted that the file is not *incompatible* between arches. It is functionally identical, despite not being bit-by-bit identical.

Let me give a bit of background to enable others to think about this problem.

In this case, the culprit is the "references flag".
.pyc files use the `marshal` serializer, which saves some objects to look up later so they can be loaded by reference instead of by the data. This sometimes saves disk space, but more crucially, it ensures references to a single mutable object are stored properly. For example:

>>> import marshal
>>> import binascii
>>> empty = []
>>> # Serialize the list [a, a] and output in hexadecimal format:
>>> binascii.hexlify(marshal.dumps([a, a]))
b'db02000000db000000007201000000'

That's:
0x5b ('[', TYPE_LIST) -- read a list
0x02000000 -- length 2 (2 items follow):
  - 0xdb (0x5b | 0x80):
        0x5b ('[', TYPE_LIST) -- read a list
        0x80 (FLAG_REF) -- save a reference to it (in the next free slot which is the first one -- number zero)
    0x00000000 -- length 0 (0 items follow)
  - 0x72 ('r', TYPE_REF) -- object from the list of references
    0x00000000 -- number zero

Note that two references to the same object. It's different from `marshal.dumps([[], []])`.



Back to the .pyc example from the last comment. Here, the differences are:
- 0xda vs. 0x5a -- type code 'Z' (short ASCII interned string), with or without reference flag (0x80). In this case, the string "_m" (probably an attribute name) is either saved in the reference table (at the current position, which happens to be 0x20), or not. This presumably doesn't matter, since it's not looked up later. This being one-pass encoding, there's no way to know that the rest of the file won't need the reference. In one case we were able to determine that it *cannot* be referenced elsewhere; in the other, we were not.
Later, another object is added to the list of references. If "_m" was added at 0x20, it gets the index 0x21; otherwise it's 0x20.

- 0x72 'r' followed by 0x21000000 vs. 0x20000000 -- this references that other object.


So, how does Python determine something *cannot* be referenced elsewhere? It looks at the reference count of the object it's serializing. If the refcount is 1, it knows it won't see any more references to that object. But if it's not (here: if the string "_m" is used *anywhere else* in the currently running script) it deduces that it's *possible* for more references to follow in the rest of the data being serialized.


How to solve this? It's tough, because marshal needs to be *fast* (for other use cases than we have in Fedora).
For this particular case, I could imagine making interned strings shorter than N characters not use references at all (the ['Z' 2 '_' 'm'] is shorter than ['r' 2 0 0 0]!). But that won't solve the larger problem.
We could also add a second pass -- an optional postprocessor that removes unused references (and maybe fixes up other discrepancies, like set order).
Or something else.

That discussion should happen on the upstream tracker. On the Fedora side, I'm happy to mentor anyone interested in the problem -- i.e. get you up to speed to contribute upstream.

Comment 3 Zbigniew Jędrzejewski-Szmek 2019-03-13 10:50:49 UTC
Thanks for the explanation. It sounds like making marshaling slower would be an acceptable tradeoff for .pyc serialization.

Comment 4 Petr Viktorin (pviktori) 2019-03-13 11:44:36 UTC
Not in general: the serialization is also used for caching scripts that change often (when you're developing them).

It would make sense as an option or additional optional step, which we would turn on in rpmbuild.

Comment 5 Victor Stinner 2019-03-19 14:56:30 UTC
Open issues upstream to make Python more reproducible:

* https://bugs.python.org/issue29708
* https://bugs.python.org/issue34033
* https://bugs.python.org/issue34093

Comment 6 Miro Hrončok 2019-04-16 13:36:03 UTC
We realize this problem, however it is not in the top part of our TODO list.

I see you've managed to build glib2 somehow, let me know if this is still blocking you.

Comment 7 Zbigniew Jędrzejewski-Szmek 2019-04-17 07:15:15 UTC
Yes, the issue has been worked-around in glib2, so priority=low is OK for me.

Comment 10 Ben Cotton 2019-08-13 17:10:53 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 11 Ben Cotton 2019-08-13 19:26:06 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 12 Miro Hrončok 2020-01-21 14:09:30 UTC
Random idea: We can make this a student project: bytecode canonizer

Comment 13 Petr Viktorin (pviktori) 2020-04-01 12:45:41 UTC
Lumír is looking into this as a low-priority side project.

Comment 14 Lumír Balhar 2020-05-01 05:42:42 UTC
My side-project is growing and its name is MarshalParser. The goal of this project is to be able to parse any data serialized by marshal (`.pyc` file) and fix all problems related to byte-compilation reproducibility. Problem number one to tackle is "reference flags" deeply described by Petr in comment#2. This part is almost done - I am testing it on Python standard library (all pycs from python3-libs RPM) and fixing bugs this testing uncovers.

https://github.com/frenzymadness/MarshalParser

Comment 15 Lumír Balhar 2020-07-10 12:32:17 UTC
MarshalParser is now able to fix reference flags for Python 3.6+. Now, I'll try to implement it into RPM as a new macro: https://src.fedoraproject.org/rpms/python-rpm-macros/pull-request/68

Comment 16 Lumír Balhar 2020-07-16 12:24:01 UTC
A new way how this can be implemented with a new PR: https://src.fedoraproject.org/rpms/redhat-rpm-config/pull-request/96

See the description there and feel free to ask question if anything is not clear.

Comment 17 Miro Hrončok 2020-07-21 18:28:19 UTC
Lumír, apart from using %py_reproducible_pyc_path, it seems that PYTHONHASHSEED=0 still needs to be used to have reproducible frozenset bytecode.

Should we advise packagers to do:

export PYTHONHASHSEED=0
%py_byte_compile %{python3} %{buildroot}%{_datadir}
%global py_reproducible_pyc_path %{buildroot}%{_datadir}

Or should we set PYTHONHASHSEED=0 in %py_byte_compile (and possibly in brp-python-bytecompile) ourselves?

Comment 18 Miro Hrončok 2020-07-21 18:49:08 UTC
Alternatively, should marshalparser sort frozensets?

Comment 19 Lumír Balhar 2020-07-22 05:16:37 UTC
(In reply to Miro Hrončok from comment #18)
> Alternatively, should marshalparser sort frozensets?

Yes, the plan is to implement this in marshalparser but it's not easy to do. In the meantime, I'd say that setting PYTHONHASHSEED=0 explicitly in specfiles is better.

The closest next step now is to document the new feature, right?

Comment 20 Miro Hrončok 2020-07-22 08:53:50 UTC
> I'd say that setting PYTHONHASHSEED=0 explicitly in specfiles is better.

FTR for %py_byte_compile, you can do that, but for brp-python-bytecompile, you cannot.


> The closest next step now is to document the new feature, right?

Yes, ideally near https://docs.fedoraproject.org/en-US/packaging-guidelines/Python_Appendix/#manual-bytecompilation

Comment 21 Miro Hrončok 2020-07-22 10:06:44 UTC
Let's add PYTHONHASHSEED=0 %py_byte_compile and brp-python-bytecompile, there should be no risk and it makes using %py_reproducible_pyc_path much easier.

Comment 22 Miro Hrončok 2020-07-22 10:07:47 UTC
I've meant: Let's add PYTHONHASHSEED=0 to %py_byte_compile ...

Comment 24 Petr Viktorin (pviktori) 2020-08-05 12:05:07 UTC
The next steps are:
- document this in the Python appendix
- try to robustly fix frozenset order in marshalparser

Comment 25 Ben Cotton 2020-08-11 15:34:36 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.


Note You need to log in before you can comment on or make changes to this bug.