Releases: kyegomez/LongNet
0.4.8
Changelog for DilatedAttention with ParallelWrapper:
1. Added ParallelWrapper Class
   - Introduced a `ParallelWrapper` class to simplify the usage of data parallelism.
   - The `ParallelWrapper` class:
     - Takes a neural network model as input.
     - Allows the user to specify a device ("cuda" or "cpu").
     - Contains a flag `use_data_parallel` to enable or disable data parallelism.
     - Checks if multiple GPUs are available and applies `nn.DataParallel` to the model accordingly.
     - Redirects attribute accesses to the internal model for seamless usage.
2. Modified Usage of the DilatedAttention Model
   - Wrapped the `DilatedAttention` model using the `ParallelWrapper` class.
   - Enabled the model to be run on multiple GPUs if available.
3. Device Assignment
   - Explicitly defined a device and used it to specify where the `DilatedAttention` model should be loaded.
   - The device defaults to GPU (`cuda:0`) if CUDA is available; otherwise, it defaults to CPU.
4. Example Usage
   - Provided an example of how to initialize and use the `ParallelWrapper` with the `DilatedAttention` model; a sketch is shown below.
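A minimal sketch of what such usage might look like. The constructor signature, flag name, and internals below are inferred from the notes above and may not match the repository's actual `ParallelWrapper`:

```python
import torch
import torch.nn as nn

class ParallelWrapper:
    """Sketch of a data-parallel wrapper as described above (not the library's exact code)."""

    def __init__(self, model: nn.Module, device: str = "cuda", use_data_parallel: bool = True):
        self.model = model.to(device)
        self.device = device
        self.use_data_parallel = use_data_parallel
        # Apply nn.DataParallel only when requested and more than one GPU is visible.
        if use_data_parallel and torch.cuda.device_count() > 1:
            self.model = nn.DataParallel(self.model)

    def __call__(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    def __getattr__(self, name):
        # Redirect attribute access to the wrapped model for seamless usage.
        return getattr(self.model, name)

# Hypothetical usage with the DilatedAttention model:
# device = "cuda:0" if torch.cuda.is_available() else "cpu"
# attention = ParallelWrapper(DilatedAttention(...), device=device, use_data_parallel=True)
# output = attention(input_tensor)
```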
Summary:
The key addition is the `ParallelWrapper` class, which facilitates easy and configurable use of data parallelism with the provided `DilatedAttention` model. This ensures scalability across multiple GPUs without any significant change to the existing workflow. The user can now enable or disable data parallelism with a single flag.
0.4.3
Changelog:
- Tensor Shape Adjustments:
  - Ensured a consistent shape of tensors across all operations.
  - Squeezed `a_indices` to 2D to match the dimensions of `att_denom_sums`: `a_indices = a_indices[:, :, 0].squeeze(-1).squeeze(-1)`
  - Sliced `a_indices` to the unpadded sequence length before scattering: `a_indices = a_indices[:, :unpadded_seq_len]`
- Scatter and Gather Operations:
  - Scatter with the squeezed 2D `a_indices` and gather the sparse sums with the same indices (see the sketch after this list): `att_denom_sums.scatter_add_(1, a_indices, a_denoms)` followed by `sparse_att_denom_sum = torch.gather(att_denom_sums, 1, a_indices)`
- DataType Handling:
  - Converted the 'sparse indices' tensors to `torch.int64` (or `torch.long`) to ensure compatibility with PyTorch's indexing operations.
  - Retained the `torch.float16` dtype for the `X` tensor to make it memory-efficient.
- Code Cleaning:
  - Removed repeated lines that print the shape and datatype of "sparse indices" to declutter the code.
  - Standardized debug print statements to have a consistent format.
  - Printed the shapes of tensors before scattering to verify that dimensions match.
  - Added comments explaining dimension squeezing, slicing, and other adjustments for clarity.
- Validation Checks:
  - Added checks to ensure tensors are on the same device (either all on CPU or all on CUDA).
  - Checked whether the size of the tensor `X` matches the expected shape before operations.
- Enhanced Error Messages:
  - Improved the debug error messages to be more descriptive.
- Optimizations:
  - Removed unnecessary tensor operations that don't contribute to the final result.
  - Optimized tensor slicing and indexing operations to be more memory-efficient.
- Edge Case Handling:
  - Handled the edge case of a negative `head_idx`.
- Other Minor Fixes:
  - Ensured that the code uses math- or memory-efficient attention only if the input tensor is on CUDA and a non-A100 GPU is detected.
  - Made sure tensor operations are consistent with PyTorch best practices.
- Documentation:
  - Added comments to highlight important changes and to explain certain decisions in the code.
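For reference, a self-contained sketch of the scatter/gather pattern from the Scatter and Gather Operations item above. The tensor names follow the notes, but the shapes, values, and surrounding logic are illustrative assumptions rather than the repository's actual code:

```python
import torch

# Illustrative shapes only: batch of 2, padded length 8, unpadded length 6.
batch, padded_len, unpadded_seq_len = 2, 8, 6

# Per-position attention denominators for the sparse (dilated) positions.
a_denoms = torch.rand(batch, unpadded_seq_len)

# Indices must be int64 (torch.long) for scatter/gather; squeeze to 2D and
# slice to the unpadded sequence length, as described in the changelog.
a_indices = torch.randint(0, padded_len, (batch, unpadded_seq_len, 1)).long()
a_indices = a_indices[:, :, 0]                 # -> (batch, unpadded_seq_len)
a_indices = a_indices[:, :unpadded_seq_len]

# Accumulate the denominators into the full-length sums, then gather the
# summed values back for each sparse position using the same indices.
att_denom_sums = torch.zeros(batch, padded_len)
att_denom_sums.scatter_add_(1, a_indices, a_denoms)
sparse_att_denom_sum = torch.gather(att_denom_sums, 1, a_indices)

print(sparse_att_denom_sum.shape)  # torch.Size([2, 6])
```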
0.4.2
0.4.1
Changelog
Bug Fixes
- Bug: Size mismatch in tensor operations in the forward method of the `DilatedAttentionLLAMA` class.
  - Root Cause: The tensors being operated upon did not have matching dimensions due to incorrect striding operations.
  - Resolution: Modified the dilation process by introducing an inner loop over the split tensors to handle each part separately, which resolved the dimension mismatch issues (see the sketch after this list).
- Bug: Index out of range error while transposing tensors.
  - Root Cause: The index provided to the transpose operation was larger than the total number of dimensions in the tensor.
  - Resolution: Corrected the index passed to the transpose operation to fit within the number of dimensions in the tensor.
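A minimal sketch of the kind of per-split inner loop described in the first resolution. The segment size, dilation rate, and function name are illustrative assumptions, not the repository's exact code:

```python
import torch

def dilated_slices(x: torch.Tensor, segment_size: int = 4, dilation: int = 2) -> torch.Tensor:
    """Split the sequence into segments and subsample each with the dilation rate.

    Handling each split in its own loop iteration keeps the strided views'
    dimensions consistent, avoiding the size-mismatch errors described above.
    """
    assert x.dim() == 3, "expected (batch, seq_len, dim)"
    segments = torch.split(x, segment_size, dim=1)   # tuple of (batch, <=segment_size, dim)
    outputs = []
    for seg in segments:
        # Subsample every `dilation`-th position within this segment only.
        outputs.append(seg[:, ::dilation, :])
    return torch.cat(outputs, dim=1)

x = torch.randn(2, 16, 8)
print(dilated_slices(x).shape)  # torch.Size([2, 8, 8]) with segment_size=4, dilation=2
```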
Improvements
- Optimized Tensor Operations: The tensor operations in the forward method were optimized to ensure they all operate on tensors with matching dimensions, improving the efficiency of the model.
- Added Error Handling: We added checks for dimension mismatches in tensor operations to throw useful error messages when the input data does not match the expected shape.
Features
- DilatedAttentionLLAMA Class: Introduced a new `DilatedAttentionLLAMA` class that uses a dilated attention mechanism in its forward method. This new implementation is designed to be more efficient for larger sequence lengths.
- Performance Testing: Added a simple performance test to benchmark the speed of the forward method in the `DilatedAttentionLLAMA` class (see the timing sketch below).
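For illustration, a forward-pass timing of this shape might look like the following. The stand-in module and input sizes are assumptions; in practice the model under test would be `DilatedAttentionLLAMA(...)` from the package:

```python
import time
import torch

# Hypothetical stand-in for the model under test; swap in DilatedAttentionLLAMA(...) in practice.
model = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 1024, 512)

start = time.time()
with torch.no_grad():
    out, _ = model(x, x, x)
elapsed = time.time() - start
print(f"Forward pass took {elapsed:.4f} s, output shape {tuple(out.shape)}")
```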
0.4.0
Changelog
Bug Fixes
Issue: ValueError: too many values to unpack (expected 3)
Root Cause: The attention function was returning more than three values, but the code was trying to unpack its return values into only three variables.
Resolution: Modified the line where the attention function is called to collect all additional return values into a list using the * operator.
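As a toy illustration of that resolution (the function and variable names here are made up), Python's extended unpacking collects any extra return values instead of raising the error:

```python
def attention_stub():
    # Pretend the attention call returns extra diagnostics beyond the expected three values.
    return "output", "weights", "denominators", "extra_1", "extra_2"

# Before: `out, weights, denoms = attention_stub()` raised
# "ValueError: too many values to unpack (expected 3)".
out, weights, *rest = attention_stub()
print(out, weights, rest)  # rest collects the remaining return values into a list
```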
Issue: RuntimeError: The size of tensor a (64) must match the size of tensor b (2) at non-singleton dimension 1
Root Cause: The code was trying to add two tensors of different sizes in the forward method of the DynamicDilatedAttention class.
Resolution: Modified the line where the tensors are added to ensure that attn_output has the same size as the corresponding slice of outputs before trying to add them.
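A small illustration of the idea behind that resolution, with made-up shapes standing in for the DynamicDilatedAttention tensors:

```python
import torch

# Illustrative shapes only: the real DynamicDilatedAttention tensors differ.
outputs = torch.zeros(4, 64, 32)        # accumulator over the full sequence
attn_output = torch.randn(4, 2, 32)     # attention result for one dilated segment
start = 10                              # hypothetical offset of this segment

# Add into the slice of `outputs` whose size matches attn_output at dim 1,
# instead of adding against the whole tensor (which raised the size mismatch).
seg_len = attn_output.size(1)
outputs[:, start:start + seg_len] += attn_output
```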
Issue: ValueError: not enough values to unpack (expected 7, got 6)
Root Cause: The flash_attn function in the FlashAttention class was trying to unpack the shape of the q tensor into seven variables, but the q tensor only had six dimensions.
Resolution: Modified the forward method of the DilatedAttention class to reshape the x tensor correctly before passing it to the attention function.
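A generic illustration of reshaping an input to the rank an attention kernel expects before calling it; the shapes and layout are assumptions, and the actual FlashAttention call may expect a different layout:

```python
import torch

# Reshape a flat (batch, seq_len, embed_dim) input into a per-head layout
# before handing it to an attention function, so the kernel's shape unpacking succeeds.
batch, seq_len, heads, head_dim = 2, 128, 8, 64
x = torch.randn(batch, seq_len, heads * head_dim)
q = x.view(batch, seq_len, heads, head_dim)  # rank now matches what the kernel unpacks
print(q.shape)  # torch.Size([2, 128, 8, 64])
```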
Improvements
Improvement: Added assertions to check the types and values of the parameters in the `__init__` method of the DilatedAttention class to prevent incorrect usage.
Improvement: Added a check for the Distributed parameter in the `__init__` method of the DilatedAttention class to decide whether to use the DataParallel wrapper for the FlashAttention modules.
Improvement: Modified the forward method of the DilatedAttention class to process each segment of the input separately for each attention head, allowing the attention heads to share information between different segments.
Improvement: Modified the forward method of the DilatedAttention class to use a buffer to store the attn_output_resized tensor instead of creating a new tensor of zeros in every forward pass, improving efficiency.
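As an illustration of the last improvement, registering a persistent buffer avoids allocating a fresh zeros tensor on every forward pass; the module, names, and shapes below are assumptions rather than the actual DilatedAttention code:

```python
import torch
import torch.nn as nn

class BufferedModule(nn.Module):
    """Toy module showing the reuse-a-buffer pattern described above."""

    def __init__(self, seq_len: int = 64, dim: int = 32):
        super().__init__()
        # Allocated once and moved with the module (.to(device)), instead of
        # calling torch.zeros(...) inside every forward pass.
        self.register_buffer("attn_output_resized", torch.zeros(seq_len, dim))

    def forward(self, attn_output: torch.Tensor) -> torch.Tensor:
        out = self.attn_output_resized
        out.zero_()                                 # reset the reused buffer
        out[: attn_output.size(0)] = attn_output    # copy the (possibly shorter) result in
        return out

m = BufferedModule()
print(m(torch.randn(10, 32)).shape)  # torch.Size([64, 32])
```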