Releases: StanfordLegion/legion
Version 24.09.0 (September 27, 2024)
- Legion
- Bug fixes for control replication and multi-node configurations
- Regent
- Fixes for ROCm 6.0 code generation
- Tools
- Legion Prof now uses subcommands (e.g., legion_prof view) to clarify which options apply to which actions
- Legion Prof now tracks backtraces at the points where blocking wait calls are performed by the application
- Legion Prof reports more detailed timing information for tasks
- Legion Prof calculates clock skew between nodes and reports it when relevant
- Commonly used features of Legion Prof are now enabled by default
- The old Python Legion Prof implementation is no longer supported
- Realm
- Point fields x, y, z and w have been replaced by methods
- Support for launching CUDA tasks onto a CUDA stream asynchronously via cuCtxRecordEvent without the need for CUDA hijack
- Support for CUDA fabric sharing
- Support for host-to-host copies via CUDA DMA
- Support for querying number of NUMA nodes from the NumaModuleConfig
- Added reference counting for preimage operations
- Make std::atomic the default atomic implementation
- Remove REALM_CXX_STANDARD, and bump the minimum requirement to C++17
- Implemented an ABI stable wrapper for GASNetEX
- Additional unit tests including CircularQueue, ReplicatedHeap, find_fastest_path, DynamicTableAllocator, generate_gather_paths, TransferIteratorIndexSpace
- Dead code cleanups and bug fixes
Version 24.06.0 (June 28, 2024) – Nonidempotent Traces
- Build
- Minimum required C++ standard is now 17
- Embedded GASNet build in CMake now automatically enables GPU memory kinds
- Legion
- Support for nonidempotent traces (where the postconditions do not imply the preconditions of the trace)
- Deletions are now committed in program order, making it easier for users to reason about when their effects take place
- All tasks (and other operations) are now committed in order (a prerequisite for anticipated, but not yet implemented, precise exception support)
- Improvements to Legion's internal algorithm for virtual instances, fixing various correctness bugs in the implementation
- Improvements to the DefaultMapper handling of task layout constraints
- Regent
- Improvements to make compiler more deterministic
- Improvements to auto-detect CUDA
- Support for complex numbers in std/format
- Static control replication (SCR) and RDIR have been completely removed. All SCR and RDIR related flags (-fflow-*) have been removed, except for -fflow 0, which is permitted (but no longer does anything, and now issues a warning)
- Tools
- Restore profiler's ability to render dependent partitioning channels
- Render mapper information on mapper calls in the profiler
- Render user-provided profiling information in the profiler
- Realm
- UVM support for the HIP module
- Error code support for command line parser
- Support for querying MIG devices from NVML
- Add indirection channel query
- Additional unit tests and bug fixes
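The definition of a nonidempotent trace given in the Legion notes for this release (postconditions that do not imply the preconditions) can be illustrated with a toy model. This is only a sketch and not Legion's implementation: it assumes conditions are plain sets of facts, so "postconditions imply preconditions" reduces to a subset check. The function name and the fact strings are hypothetical.

```python
def is_idempotent(preconditions, postconditions):
    """Toy check: a trace is idempotent when every precondition
    is re-established by the trace's own postconditions.

    Assumption: conditions are modeled as sets of opaque facts,
    which is a simplification made only for illustration.
    """
    return set(preconditions) <= set(postconditions)


# The trace leaves its preconditions intact, so it could be
# replayed back to back in this toy model.
idempotent_trace = is_idempotent(
    preconditions={"A is valid"},
    postconditions={"A is valid", "B is valid"},
)

# The trace needs A valid on node 0 but leaves it valid on node 1,
# so its postconditions do not imply its preconditions.
nonidempotent_trace = is_idempotent(
    preconditions={"A is valid on node 0"},
    postconditions={"A is valid on node 1"},
)

print(idempotent_trace, nonidempotent_trace)
```

In this model the second trace is the nonidempotent case: replaying it immediately would find its preconditions unmet.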
Version 24.03.0 (March 27, 2024) – Control Replication
Legion is an implicitly parallel, distributed runtime system for heterogeneous supercomputers.
The most notable feature in this release is control replication, a feature that we have been working on for many years that makes Legion dramatically more scalable in typical usage scenarios. In fact, the vast majority of users have already been using control replication, so this is the first stable release of Legion that is usable in a practical manner for most of our users.
If you are not familiar with control replication, there is a wiki page that describes it, and of course the original paper.
As of this release, that means that the old control_replication branch is no longer being updated, and will be deleted at some point in the future. All updates from now on will go into the master branch, and it is our intention to avoid any long-standing feature branches in the future.
This release also finally removes some old Legion features that have been deprecated for nearly 10 years at this point. If you were somehow using those features, you will need to update to their replacements.
In addition, with this release, we are now packaging Legion Prof via crates.io. That means you can now install Legion Prof with:
cargo install --all-features --locked legion_prof@0.2403.0
(Note the version format is 0.YYMM.0. This is required because Rust uses semver while Legion uses calver.)
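As a small illustration of the version mapping described above, the following sketch converts a Legion calver string into the crate version format 0.YYMM.0. The function name is hypothetical, and carrying the final component through unchanged is an assumption (the notes only show it as 0).

```python
def legion_to_crate_version(legion_version):
    """Map a Legion calver string (YY.MM.P) to the 0.YYMM.P form.

    Assumption: the last component is copied as-is; the release
    notes only document the case where it is 0.
    """
    year, month, patch = legion_version.split(".")
    if len(year) != 2 or len(month) != 2:
        raise ValueError("expected a version of the form YY.MM.P")
    return f"0.{year}{month}.{patch}"


crate_version = legion_to_crate_version("24.03.0")
print(f"cargo install --all-features --locked legion_prof@{crate_version}")
```

For the 24.03.0 release this produces the same version string as the install command above.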
Full release notes:
- Build
- ROCm 6.0 is now supported, and support for ROCm 4.x has been removed
- Legion
- Support for control replication has been merged
- Support for discarding region contents on task completion
- Long-deprecated APIs, such as the old HighLevel namespace, have been removed
- Mappers
- Default mapper support for control replication
- Default and null mappers now use the C++ override keyword
- Regent
- Support for pure projection functors that capture arguments
- Static control replication (SCR) has been deprecated and will be removed in a future release
- Tools
- The profiler now correctly recognizes the logger format version and throws an error if it does not match
- The profiler now reports when a profile was generated with debug mode (or another expensive setting) enabled
- Many profiler fixes for correctly rendering runtime and mapper calls
- Profiler now renders GPU device and host execution separately
- Optimizations to improve profiler memory usage and running time
- Rust profiler now requires at least Rust 1.74
- Realm
- Support for registration of dynamically allocated buffers
- Support for handling poisoned events for reservation
- Refactor CUDA allocation and IPC paths
- Support for querying CUDA device information (GPU UUID and ID), process information (process ID, hostname, host ID) and timer calibration error from the profiler
- Remove address alignment from serializer and deserializer
- Support for creating network shared peers using IPC mailbox
- Support OMP thread binding and allow for multiple OMP parallel sections when enabling system OMP runtime
- Add Realm unit tests
- Fixes for Realm tests, sparsity map, MemoryQuery, dynamic framebuffer memory and memcpy channel
Version 23.12.0 (December 14, 2023)
- Regent
- Support for HIP multi-GPU per runtime
- Realm
- Improve scalability of startup by replacing point-to-point communication with allgatherv for machine model announcements
- Support shared memory communication for system memory
- Provide sanity check for GPU tasks to detect any leak of CUDA streams
- Support for GPU transposes in CUDA-DMA
- Bug fixes for CUDA-DMA
Version 23.09.0 (September 28, 2023)
- Regent
- Elide future maps in index launches
- Improvements to Pygion interop
- Realm
- Add a machine configuration API that allows applications to configure the machine model without using the command line
- Expose Realm managed CUDA/HIP stream to applications to launch GPU tasks without device-wise synchronization when hijack is disabled
- Change timers to use rdtsc
- Improve performance for getting highest priority task available in any task queue
- Implement framebuffer memory with cuMemMap
- Initial work for moving STL dependencies to header only
Version 23.06.0 (June 28, 2023)
- Build
- Fixes for CMake build on macOS
- Fixes for HIP build when arch is specified
- Realm
- Support for better backtraces via libdw and libunwind
- Improve scalability and performance in task spawning by caching the triggering operation of an event if one is provided
- Fix a minor issue with affinity queries to properly clear the user-provided vector before populating it
- Add more accurate GPU memory bandwidth affinity calculations if NVML is available
- Refactor CPU core topology enumeration to serve systems without NUMA capabilities (like Jetson ARM systems)
- Improve scalability and performance of task spawning by moving event reuse freelists to be per-processor, reducing lock contention
- Add a microbenchmark for measuring task throughput more accurately
- Add a series of Realm API tutorials
- Replace CU_EVENT_DEFAULT with CU_EVENT_DISABLE_TIMING for better performance of CUDA events
- Support Kokkos interop for the HIP module
- Fixes for Realm tests on macOS
- Tools
- Legion Prof now supports search in the new profiler UI
- Legion Prof now supports an HTTP client/server interface. Launch the server with --serve (on port 8080 by default) and attach a client to it with --attach http://127.0.0.1:8080
- Legion Prof now supports a new archival mode via the --archive flag. Generate an offline profile and view it either via --attach or by uploading it to a server and navigating to https://legion.stanford.edu/prof-viewer/?url=...
- Legion Prof modes (client/server/viewer) are now parallel by default, and perform heavy computations off the UI thread for better responsiveness
- Add support for rendering indirect copies (i.e., gather/scatter)
- Fix rendering of profiles over HTTP with old profiler UI
- Fix profiling of copies with different numbers of hops between instances
Version 23.03.0 (March 27, 2023)
- Build
- Minimum supported CMake version is now 3.16. (Some optional features may continue to require even newer versions.)
- Minimum supported GCC version is now 8.
- Minimum supported CUDA version is now 10.
- Legion
- Added support for padded layout constraints to provide scratch space in instances for tasks to use (see examples/padded_instances).
- Added support for tiled layout constraints to provide an ability to layout instances by breaking down dimensions (see examples/tiling).
- Realm
- An experimental UCX network backend has been added.
- Updated the Kokkos interop to support Kokkos 4.0.
- Python
- Support loading Legion as a library from a stock Python interpreter.
- Regent
- Fixes to avoid leaking futures.
- Improvements to Regent's predicate optimization.
- Tools
- Legion Prof now supports a native viewer UI. Enable it with the viewer feature (e.g., cargo run --features=viewer) and use the flag --view.
- Legion Prof now has better support for rendering a subset of available nodes. Pass all log files (from all nodes) into Legion Prof and add the --subnodes flag to specify which ones to render. This ensures all copies in/out of those nodes will be shown correctly.
Version 22.12.0 (December 30, 2022)
- Regent
- Support for nested predication of if and while statements
- Realm
- Support priorities for Copy operations
- Support building with multiple network backends enabled, and use -ll:networks (gasnetex/gasnet1/mpi/none) to pick which one to use at runtime
- Separate CUDA runtime from Realm by removing all references to CUDA runtime and relying only on driver API, which fixes an issue when mixing static and dynamic cudart across an application and improves Realm’s compatibility across driver versions
- Tools
- Legion Prof now supports visualization of channels for indirect copies, and of instances used by different operations, including tasks, copies and fills
Version 22.09.0 (September 30, 2022)
- Python
- Support for running packages via legion_python -m
- Support for Jupyter Notebook on single node execution.
- Regent
- Deprecated support for LLVM versions less than 11 in setup_env.py. These versions will be removed in the next release. LLVM 13 is recommended, except on ARM where LLVM 11 is currently required
- Added support for provenance for all launcher operations
- Debug info is no longer generated by default in order to optimize compile times. To re-enable it, run with -fdebuginfo 1
- Legion
- Most Legion APIs now support passing a provenance string. This provenance information is passed through to tools like Legion Spy and Legion Prof so users can map what they are seeing back to their source code. In the future, provenance strings will also be used by all Legion error messages.
- Realm
- Support for fills of arbitrary instances (via multi-hop paths where needed)
- Fixed crashes when using external instances and network-registered memory at the same time
- Removed all direct references to CUDA runtime library in CUDA module
- Caching of minimum-cost data transfer path for repeated copies
- Dependent partitioning support for image and preimage using structured (~affine) transforms in addition to existing unstructured (field-based) images/preimages
Version 22.06.0 (June 29, 2022)
- Regent
- Support for cross-products in index launches, as well as multi-level projection functors.
- Support for HIP on AMD GPUs has been added. All tasks marked with __demand(__cuda) are automatically eligible. Note that the name of the annotation may change in the future to something more general, but for now no change is being made. Some CUDA flags have migrated to more general names. See below.
- The flag -fcuda 1 is deprecated. Use -fgpu cuda instead.
- The flag -fcuda-offline is deprecated. Use -fgpu-offline instead.
- The flag -fcuda-arch is deprecated. Use -fgpu-arch instead.
- Enable HIP support with -fgpu hip and use the -fgpu-offline and -fgpu-arch flags as necessary/appropriate.
- Support for new flag -ffast-math 1 which enables fast-math optimizations on CPU and GPU. By default, CPU code has this disabled, and GPU code uses only the contract flag in LLVM to generate FMA instructions. For compute-intensive applications, additional performance can sometimes be unlocked by enabling the full suite of optimizations with -ffast-math 1, at the cost of numerical accuracy.
- Performance improvements for CUDA allow recent LLVM versions (e.g., 13) to match or exceed the performance of LLVM 3.8. Previously, performance regressions made LLVM 3.8 the most performant version for use with CUDA. The recommended LLVM version moving forward is 13, and setup_env.py has been updated to set this on all platforms.
- The versions of GASNet and Terra are now pinned by default in setup_env.py. You can choose versions explicitly with GASNET_VERSION (as before, though the previous default was unpinned) and --terra-branch, respectively.
- Realm
- Allow use of system OpenMP runtime (instead of Realm-provided one) with -DLegion_OpenMP_SYSTEM_RUNTIME=ON. This allows inter-operation with libraries that have already been linked to the system runtime, but limits each process to a single OMP processor.
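For scripts that still pass the Regent flags deprecated in 22.06.0, the renames listed in the Regent notes above can be applied mechanically. The following is a minimal sketch of such a rewrite: the helper name, the script name and the ARCH placeholder are hypothetical, and only the three renames stated in the notes are handled.

```python
# Renames taken from the 22.06.0 Regent notes. "-fcuda 1" carries a
# value, so it is handled separately in the loop below.
RENAMED_FLAGS = {
    "-fcuda-offline": "-fgpu-offline",
    "-fcuda-arch": "-fgpu-arch",
}


def migrate_regent_flags(args):
    """Return a copy of args with deprecated -fcuda* flags rewritten."""
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        # "-fcuda 1" becomes "-fgpu cuda"; other values of -fcuda are
        # left untouched because the notes do not describe them.
        if arg == "-fcuda" and i + 1 < len(args) and args[i + 1] == "1":
            out.extend(["-fgpu", "cuda"])
            i += 2
            continue
        out.append(RENAMED_FLAGS.get(arg, arg))
        i += 1
    return out


# Hypothetical command line; "example.rg" and "ARCH" are placeholders.
old_args = ["example.rg", "-fcuda", "1", "-fcuda-arch", "ARCH"]
print(migrate_regent_flags(old_args))
```

Flags that are not in the rename table pass through unchanged, so the helper is safe to run on command lines that have already been migrated.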