Bug 2217097 - Review Request: rocm-smi - ROCm System Management Interface Library
Summary: Review Request: rocm-smi - ROCm System Management Interface Library
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: Package Review
Version: rawhide
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Tom Rix
QA Contact: Fedora Extras Quality Assurance
URL: https://github.com/RadeonOpenCompute/...
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-06-23 21:20 UTC by Jeremy Newton
Modified: 2023-07-14 21:06 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-14 21:06:10 UTC
Type: ---
Embargoed:
trix: fedora-review+


Attachments (Terms of Use)

Description Jeremy Newton 2023-06-23 21:20:09 UTC
Spec URL: https://mystro256.fedorapeople.org/rocm_smi_lib.spec
SRPM URL: https://mystro256.fedorapeople.org/rocm_smi_lib-5.5.1-1.fc39.src.rpm
COPR BUILD: https://copr.fedorainfracloud.org/coprs/mystro256/rocm-hip/build/6109029/
Description:
The ROCm System Management Interface Library, or ROCm SMI library, is part of
the Radeon Open Compute ROCm software stack . It is a C library for Linux that
provides a user space interface for applications to monitor and control GPU
applications.
Fedora Account System Username: mystro256

Implementation notes:
- the doxygen input doesn't appear to work on 1.9.6 or older, but I didn't look further as my focus is more on rawhide right now
- This replaces the retired "rocm-smi" fedora package, so I added an obsoletes/provides, for reference, upstream rewrote rocm-smi as rocm_smi_lib and dropped rocm-smi in ROCm 4.1
- I copied upstream's name, but I'm not sure if rocm-smi-lib or rocm_smi is better style for a fedora package
- A part of me is considering reviving the retired rocm-smi package, where I would put the python frontend in "rocm-smi" and add a "rocm-smi-lib" subpackage for the library to go; any feedback/opinions welcome

Comment 1 Jeremy Newton 2023-06-23 22:15:27 UTC
Side note:
I'm aware there's smi tests, but they require rocrtst and HW, which isn't being built right now in rocm-runtime. I can look into it later.

Comment 2 Tom Rix 2023-06-24 12:24:27 UTC
I had looked at this because it is needed by tensile which in needed by rocblas.
For the naming, I was thinking using this new src and reviving the old name.
My reasoning is that all the other rocm packages, though they have libraries, do not have a *-lib suffix, the lib suffix is redundant with -devel. so we would have rocm-smi and rocm-smi-devel.
Whatcha think ?

Comment 3 Jeremy Newton 2023-06-24 13:00:35 UTC
Well I guess it depends if there's a strong need for link against the library without the use of the python frontend, but I can always split the package into rocm-smi/rocm-smi-lib/rocm-smi-devel at a later date if I see the need.

I'll get the unretiring process going later today, but feel free to informally review now if you see issues.

Comment 4 Tom Rix 2023-06-24 13:59:18 UTC
My opinion on the naming is not a strong opinion, we can keep moving this review forward.

This package does seem newer upstream so it has some quirks.
The one that jumps out at me is the *64 suffix, is that really needed ? 
I know it would be difficult to chase all the lib uses now, but it would be nice if the library name was something like librocmsmi.so. instead of librocm_smi64.so.

how stable is the versioning ? will the so.1.0 hold when rocm 6.0 comes out or would it be better to have a so.5.5.0, so.6.0.0 ?

Here is a warning cleanup that is a nice to have
https://github.com/RadeonOpenCompute/rocm_smi_lib/pull/123
There were a couple of other warnings on the build that also looked fine

Comment 5 Tom Rix 2023-06-24 15:05:39 UTC
fedora-review is failing for me after this morning update of rawhide with.
Unknown argument "-C" for command "repoquery".
I'll assume this is resolved for the *.2, or can you paste the review with *.2 ?

I played a bit with the rocm-smi binary, seems ok.
The cmake/rocm_smi resolved the 'where is smi?' issue i was having running some of the scripts Tensile installs.

ExclusiveArch, for the other rocm packages we have be setting the to just x86_64, are we going to confuse folks when just this rpm shows up on ppc ?

looks in good shape

Comment 6 Jeremy Newton 2023-06-24 22:31:39 UTC
> My opinion on the naming is not a strong opinion, we can keep moving this review forward.

Honestly the idea has kind of grown on me, I'll change to use the old fedora name in my next update. It shouldn't slow down this review as un-retiring a package is pretty much the same as a new package (review + pagure request).

> The one that jumps out at me is the *64 suffix, is that really needed ? 
> I know it would be difficult to chase all the lib uses now, but it would be nice if the library name was something like librocmsmi.so. instead of librocm_smi64.so.

A lot of the low level ROCm packages have this suffix. I believe it's a requirement for Windows, as they share a lot of source with Windows.
I can propose a cmake patch for only using this suffix on Windows, but my gut feels like they will reject pulling it in and I'd rather not maintain any long term patches in Fedora if possible.

> how stable is the versioning ? will the so.1.0 hold when rocm 6.0 comes out or would it be better to have a so.5.5.0, so.6.0.0 ?

I'll contact the developers on Monday to confirm, but I think it's actually wrong because the logic tries to look for a git tag and fails, falling back to 1.0.0 as a default. I believe the soversion should be "5" and when they bump to "6" it will change to that.

> Here is a warning cleanup that is a nice to have

ACK

> fedora-review is failing for me after this morning update of rawhide with.

Yeah it's why I always make COPR builds, it has an option to generate the review form automatically, so you can just use that instead of running locally.
I noticed it didn't actually generate for rawhide, which is probably because of the breakage, but you can use the f38 copy for now:
https://download.copr.fedorainfracloud.org/results/mystro256/rocm-hip/fedora-38-x86_64/06109029-rocm_smi_lib/fedora-review/review.txt


> ExclusiveArch, for the other rocm packages we have be setting the to just x86_64, are we going to confuse folks when just this rpm shows up on ppc ?

There's initial unofficial support for ppc64le and aarch64. I'd like to try to build on all three platforms when possible to encourage adoption.
To be clear, SMI just requires amdgpu kernel support, which has been in Fedora for a while now.
On the other hand, the HIP math libraries are pretty broken on non-x86, so I'm ok with just x86_64 there, as the other parts of the stack is still useful without the mathlibs.

Comment 7 Tom Rix 2023-06-25 00:57:40 UTC
Thanks for the review.txt, nothing to do.
Everything else is fine.
Only issue is the .so, and that will be a bit of back-n-forth with upstream.

Comment 8 Jeremy Newton 2023-06-25 21:20:12 UTC
So I spent a bit of time reverse engineering how the SOVERSION works. They use the git tag to store the soversion and it's a bit buggy, so I need to talk to upstream about how to fix it.

For the time being, I made a patch to allow setting the string via CMake variable, so that should be better for Fedora in the long run, as I assume we don't want to deal with git tags. I just need to confirm what value it actually should be. Upstream packages have "5.0.0" in the ROCm 5.5.1 packages for some reason, even though the tag on github is 5.5.1.

If it's supposed to be "5.5.1", I'll send my patches upstream and notify them of the bug. If it's supposed to be "5.0.0", then I'm going to ask upstream if they can maintain and update the cmake default value to the correct value instead of my patch.

Anyway, here's the latest:
Spec URL: https://mystro256.fedorapeople.org/rocm-smi.spec
SRPM URL: https://mystro256.fedorapeople.org/rocm-smi-5.5.1-2.fc39.src.rpm
COPR: https://copr.fedorainfracloud.org/coprs/mystro256/rocm-hip/build/6112127/

We can resume once I figure out the version string situation.

Comment 9 Tom Rix 2023-06-26 00:20:46 UTC
Some problems with the old version
> rocm-smi -h
Unable to find /usr/libexec/rocm_smi/../../lib64/librocm_smi64.so.1 . Trying /opt/rocm*

We do not really want to deal with git tags, what is released should be buildable from the tarball on an isolated system.

Comment 10 Jeremy Newton 2023-06-26 01:36:56 UTC
Ah sorry I didn't update my used logic in %prep to account for this. None the less, I get your point and I'll bug upstream.

Comment 11 Richard W.M. Jones 2023-06-26 12:58:09 UTC
(In reply to Tom Rix from comment #5)
> fedora-review is failing for me after this morning update of rawhide with.
> Unknown argument "-C" for command "repoquery".
> I'll assume this is resolved for the *.2, or can you paste the review with
> *.2 ?

FYI: https://bugzilla.redhat.com/show_bug.cgi?id=2217496

Comment 12 Jeremy Newton 2023-06-30 18:27:41 UTC
Ok so I think I'm happy with it now.

I had an extensive email conversation with upstream and we've agreed on taking in some proposed patches from me (plus the one from Tom).

Regarding the SOVERSION, they don't plan on bumping this any time soon, or at all for that matter, ast heir goal is actually to merge a bunch of tools together into a project called amdsmi:
https://github.com/RadeonOpenCompute/amdsmi

With that said, they agreed it's better to maintain it in cmake rather than git tags.

As far as I know, they're still a ways out before deprecating rocm-smi, but I'll need to keep that in mind for later. There's been some shift of ownership, as smi wasn't really maintained by a dedicated group for a while, so they're still getting things working, such as mirroring over their development branch to github, pulling github requests, etc.

They've accepted more or less all the patches attached to this package as far as I know, so we can move forward now:

Spec URL: https://mystro256.fedorapeople.org/rocm-smi.spec
SRPM URL: https://mystro256.fedorapeople.org/rocm-smi-5.6.0-1.fc39.src.rpm
COPR BUILD: https://copr.fedorainfracloud.org/coprs/mystro256/rocm-hip/build/6130687/

let me know if there's anything else outstanding.

FYI fedora-review complains about this:
- Package does not use a name that already exists.
  Note: A package with this name already exists. Please check
  https://src.fedoraproject.org/rpms/rocm-smi
  See: https://docs.fedoraproject.org/en-US/packaging-
  guidelines/Naming/#_conflicting_package_names

But obviously this is intended.

Comment 13 Fedora Review Service 2023-06-30 18:37:41 UTC
Copr build:
https://copr.fedorainfracloud.org/coprs/build/6130875
(succeeded)

Review template:
https://download.copr.fedorainfracloud.org/results/@fedora-review/fedora-review-2217097-rocm-smi/fedora-rawhide-x86_64/06130875-rocm-smi/fedora-review/review.txt

Please take a look if any issues were found.

---
This comment was created by the fedora-review-service
https://github.com/FrostyX/fedora-review-service

If you want to trigger a new Copr build, add a comment containing new
Spec and SRPM URLs or [fedora-review-service-build] string.

Comment 14 Tom Rix 2023-06-30 19:39:38 UTC
wrt amdsmi project, big repo name changes is not great, this would be the third i know of.  if this is needed maybe it could the last change ?  we do not need to solve that now.

Looks good.
Approved.

Comment 15 Jeremy Newton 2023-07-01 01:48:31 UTC
Agreed, but amdsmi isn't just a rewrite or rename. It's a merge of a bunch of different projects and forks of smi.

The original rocm-smi -> rocm_smi_lib appears to just be a rewrite to make it more into a library, but it was forked and reimplemented a few times, and other sub projects popped up, which I suspect was due to a lack of strong maintainer-ship or direction/design.

As far as I understand, they're taking their time with it to make sure they do it right this time.

Comment 16 Jeremy Newton 2023-07-05 21:18:56 UTC
FYI
https://pagure.io/releng/issue/11509

I'll get rocblas working locally while we wait.
Once I confirm it works locally, I'll push a review for tensile.


Note You need to log in before you can comment on or make changes to this bug.