Merge branch 'main' into brosko/no_cluster_desc_path
broskoTT authored Nov 29, 2024
2 parents 5492b12 + 427a7c9 commit 25302c3
Showing 17 changed files with 587 additions and 357 deletions.
14 changes: 7 additions & 7 deletions .pre-commit-config.yaml
@@ -23,10 +23,10 @@ repos:
rev: v1.35.1
hooks:
- id: yamllint
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v19.1.4
hooks:
- id: clang-format
entry: git-clang-format
types_or: [c++, c]
args: ["--style=file"]
# - repo: https://github.com/pre-commit/mirrors-clang-format
# rev: v19.1.4
# hooks:
# - id: clang-format
# entry: git-clang-format
# types_or: [c++, c]
# args: ["--style=file"]
38 changes: 38 additions & 0 deletions CONTRIBUTING.md
@@ -429,6 +429,44 @@ cat generated/watcher/watcher.log # See k_ids field for each core in the last d
- In the future, this tool will be expanded to show more debug information available from the host side.

## Contribution standards
This project has adopted C++ formatting and style as defined in `.clang-format`.
There are additional requirements such as license headers.

## Pre-commit Hook Integration for Formatting and Linting

To keep code formatting consistent across the project, we use the [pre-commit](https://pre-commit.com/) framework. Its hooks automatically check and format code before each commit, so that changes adhere to the project's coding standards.

### What is Pre-commit?

Pre-commit is a framework for managing and maintaining multi-language pre-commit hooks. It helps catch common issues early by running a set of hooks before code is committed, automating tasks like:

- Formatting code (e.g., `clang-format`, `black`, fixing trailing whitespace, enforcing end-of-file newlines)
- Running linters (e.g., `flake8`, `yamllint`)
- Checking for merge conflicts or other common issues

For more details on pre-commit, you can visit the [official documentation](https://pre-commit.com/).

### How to Set Up Pre-commit Locally

To set up pre-commit on your local machine, follow these steps:

1. **Install Pre-commit**:
Ensure you have Python installed, then run:
```bash
pip install pre-commit
```
*Note:* pre-commit is already installed if you are using the project's Python virtual environment.
2. **Install the Git Hook Scripts**:
In your local repository, run the following command to install the pre-commit hooks:
```bash
pre-commit install
```
This command will configure your local Git to run the defined hooks automatically before each commit.
3. **Run Pre-commit Hooks Manually**:
You can also run the hooks manually against all files at any time with:
```bash
pre-commit run --all-files
```

### File structure and formats

2 changes: 2 additions & 0 deletions tests/scripts/test_moreh_microbenchmark.py
@@ -1005,6 +1005,8 @@ def test_dram_read_remote_cb_sync(
elif test == "Matmul":
if arch == "wormhole_b0":
bw_bound = 18.0
if use_sub_devices:
pytest.xfail("Tests using sub-devices are not correctly set up for BW measurements")
assert bw_bound <= throughput


Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions tests/ttnn/unit_tests/test_to_dtype.py
@@ -34,6 +34,5 @@ def test_to_dtype(height, width, from_dtype, to_dtype):
assert output_tensor.layout == ttnn.ROW_MAJOR_LAYOUT
assert tuple(output_tensor.shape) == (height, width)

output_tensor = ttnn.to_torch(output_tensor).to(torch_input_tensor.dtype)

output_tensor = ttnn.to_torch(output_tensor, dtype=torch_input_tensor.dtype)
assert_with_pcc(torch_input_tensor, output_tensor)
16 changes: 16 additions & 0 deletions tt_metal/host_api.hpp
@@ -213,6 +213,7 @@ const CircularBufferConfig& GetCircularBufferConfig(Program& program, CBHandle c
// clang-format off
/**
* Update the total size of the circular buffer at the given circular buffer handle. Updating a program-local circular buffer requires all circular buffers in the program to be reallocated.
* To update both the address and the total size of a dynamic circular buffer, use `UpdateDynamicCircularBufferAddressAndTotalSize`.
*
* Return value: void
*
@@ -244,6 +245,7 @@ void UpdateCircularBufferPageSize(Program& program, CBHandle cb_handle, uint8_t
// clang-format off
/**
* Update the address of a dynamic circular buffer. Dynamic circular buffers share the same address space as L1 buffers.
* To update both the address and the total size of a dynamic circular buffer, use `UpdateDynamicCircularBufferAddressAndTotalSize`.
*
* Return value: void
*
@@ -257,6 +259,20 @@ void UpdateCircularBufferPageSize(Program& program, CBHandle cb_handle, uint8_t
void UpdateDynamicCircularBufferAddress(Program& program, CBHandle cb_handle, const Buffer& buffer);

// clang-format off
/**
* Update the address and total size of a dynamic circular buffer. Dynamic circular buffers share the same address space as L1 buffers.
*
* Return value: void
*
* | Argument | Description | Type | Valid Range | Required |
* |------------|------------------------------------------------------------------------------------------|------------------------------|-------------|----------|
* | program | The program containing the circular buffer | Program & | | Yes |
* | cb_handle | ID of the circular buffer, returned by `CreateCircularBuffers` | CBHandle (uintptr_t) | | Yes |
* | buffer | Dynamically allocated L1 buffer that shares address space of circular buffer `cb_handle` | const Buffer & | L1 buffer | Yes |
* | total_size | New size of the circular buffer in bytes | uint32_t | | Yes |
*/
void UpdateDynamicCircularBufferAddressAndTotalSize(Program& program, CBHandle cb_handle, const Buffer& buffer, uint32_t total_size);

/**
* Initializes semaphore on all cores within core range (inclusive). Each core can have up to eight 4B semaphores aligned to L1_ALIGNMENT.
*
19 changes: 14 additions & 5 deletions tt_metal/impl/buffers/circular_buffer_types.cpp
@@ -86,6 +86,11 @@ CircularBufferConfig& CircularBufferConfig::set_total_size(uint32_t total_size)
}

CircularBufferConfig& CircularBufferConfig::set_globally_allocated_address(const Buffer& buffer) {
return this->set_globally_allocated_address_and_total_size(buffer, this->total_size_);
}

CircularBufferConfig& CircularBufferConfig::set_globally_allocated_address_and_total_size(
const Buffer& buffer, uint32_t total_size) {
if (not buffer.is_l1()) {
TT_THROW("Only L1 buffers can have an associated circular buffer!");
}
@@ -94,28 +99,32 @@ CircularBufferConfig& CircularBufferConfig::set_globally_allocated_address(const
this->max_size_ = buffer.aligned_size_per_bank();
this->buffer_size_ = buffer.aligned_size();
this->shadow_global_buffer = &buffer;
if (this->total_size_ > this->max_size_) {
if (total_size > this->max_size_) {
TT_ASSERT(
false,
"Cannot set to globally allocated buffer. Circular buffer size {} B exceeds allocated L1 buffer bank "
"size of {} B",
this->total_size_,
total_size,
this->max_size_);
#ifndef DEBUG
log_warning(
"Circular buffer size {} B exceeds allocated L1 buffer bank size of {} B. This may allow this circular "
"buffer to write outside the allocated buffer space.",
this->total_size_,
total_size,
this->max_size_);
if (this->total_size_ > this->buffer_size_) {
if (total_size > this->buffer_size_) {
TT_THROW(
"Cannot set to globally allocated buffer. Circular buffer size {} B exceeds allocated L1 buffer "
"size of {} B",
this->total_size_,
total_size,
this->buffer_size_);
}
#endif
}
if (total_size == 0) {
TT_THROW("Total size for circular buffer must be non-zero!");
}
this->total_size_ = total_size;
return *this;
}
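
The order of the size checks in `set_globally_allocated_address_and_total_size` can be sketched in isolation. This is a minimal model under stated assumptions, not the tt_metal implementation: `MockBuffer` and `MockCircularBufferConfig` are hypothetical stand-ins, standard exceptions replace `TT_THROW`, a message on `std::cerr` replaces `log_warning`, and only the non-debug path (warn, then throw) is modelled.

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for Buffer: only the fields the checks read.
struct MockBuffer {
    bool is_l1;
    uint32_t aligned_size_per_bank;  // plays the role of max_size_
    uint32_t aligned_size;           // plays the role of buffer_size_
};

// Hypothetical stand-in for CircularBufferConfig (non-debug behaviour only).
class MockCircularBufferConfig {
public:
    explicit MockCircularBufferConfig(uint32_t total_size) : total_size_(total_size) {}

    // Address-only setter: forwards the currently stored size.
    MockCircularBufferConfig& set_globally_allocated_address(const MockBuffer& buffer) {
        return set_globally_allocated_address_and_total_size(buffer, total_size_);
    }

    // Combined setter: validates the NEW size against the NEW buffer.
    MockCircularBufferConfig& set_globally_allocated_address_and_total_size(
        const MockBuffer& buffer, uint32_t total_size) {
        if (!buffer.is_l1) {
            throw std::invalid_argument("only L1 buffers can back a circular buffer");
        }
        if (total_size > buffer.aligned_size_per_bank) {
            // Exceeding the per-bank size only warns...
            std::cerr << "warning: size " << total_size << " B exceeds bank size "
                      << buffer.aligned_size_per_bank << " B\n";
            // ...but exceeding the whole buffer is an error.
            if (total_size > buffer.aligned_size) {
                throw std::length_error("size exceeds allocated L1 buffer size");
            }
        }
        if (total_size == 0) {
            throw std::invalid_argument("total size must be non-zero");
        }
        total_size_ = total_size;  // stored only after every check has passed
        return *this;
    }

    uint32_t total_size() const { return total_size_; }

private:
    uint32_t total_size_ = 0;
};
```

In this model a failed call leaves the stored size untouched, and the address-only setter reuses the combined one, so both entry points share a single set of checks.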

39 changes: 20 additions & 19 deletions tt_metal/impl/buffers/circular_buffer_types.hpp
@@ -23,52 +23,53 @@ inline namespace v0 {

using CBHandle = uintptr_t;


class CircularBufferConfig {
public:
public:
// Static circular buffer spec
CircularBufferConfig(uint32_t total_size, const std::map<uint8_t, tt::DataFormat> &data_format_spec);
CircularBufferConfig(uint32_t total_size, const std::map<uint8_t, tt::DataFormat>& data_format_spec);

// User is expected to use the builder here.
CircularBufferConfig(uint32_t total_size);

// Dynamic circular buffer spec
CircularBufferConfig(
uint32_t total_size, const std::map<uint8_t, tt::DataFormat> &data_format_spec, const Buffer &buffer);
uint32_t total_size, const std::map<uint8_t, tt::DataFormat>& data_format_spec, const Buffer& buffer);

CircularBufferConfig& set_page_size(uint8_t buffer_index, uint32_t page_size);

CircularBufferConfig& set_total_size(uint32_t total_size);

CircularBufferConfig& set_globally_allocated_address(const Buffer &buffer);
CircularBufferConfig& set_globally_allocated_address(const Buffer& buffer);

CircularBufferConfig& set_globally_allocated_address_and_total_size(const Buffer& buffer, uint32_t total_size);

CircularBufferConfig& set_tile_dims(uint8_t buffer_index, const Tile& tile);

const std::array<std::optional<Tile>, NUM_CIRCULAR_BUFFERS> &tiles() const;
const std::array<std::optional<Tile>, NUM_CIRCULAR_BUFFERS>& tiles() const;

uint32_t total_size() const;

std::optional<uint32_t> globally_allocated_address() const;

const std::array<std::optional<tt::DataFormat>, NUM_CIRCULAR_BUFFERS> &data_formats() const;
const std::array<std::optional<tt::DataFormat>, NUM_CIRCULAR_BUFFERS>& data_formats() const;

const std::array<std::optional<uint32_t>, NUM_CIRCULAR_BUFFERS> &page_sizes() const;
const std::array<std::optional<uint32_t>, NUM_CIRCULAR_BUFFERS>& page_sizes() const;
const Buffer* shadow_global_buffer{nullptr};

class Builder {
public:
Builder(CircularBufferConfig &parent, uint8_t buffer_index);
public:
Builder(CircularBufferConfig& parent, uint8_t buffer_index);

const Builder &set_data_format(tt::DataFormat data_format) const;
const Builder& set_data_format(tt::DataFormat data_format) const;

const Builder &add_size(uint32_t size) const;
const Builder& add_size(uint32_t size) const;

const Builder &set_page_size(uint32_t page_size) const;
const Builder& set_page_size(uint32_t page_size) const;

const Builder &set_tile_dims(const Tile &tile) const;
const Builder& set_tile_dims(const Tile& tile) const;

private:
CircularBufferConfig &parent_;
private:
CircularBufferConfig& parent_;
uint8_t buffer_index_;
};

@@ -77,9 +78,9 @@ class CircularBufferConfig {
friend bool operator==(const CircularBufferConfig& lhs, const CircularBufferConfig& rhs);
friend bool operator!=(const CircularBufferConfig& lhs, const CircularBufferConfig& rhs);


private:
void set_config(const std::map<uint8_t, tt::DataFormat> &data_format_spec);
private:
void set_config(const std::map<uint8_t, tt::DataFormat>& data_format_spec);
void validate_total_size(uint32_t total_size);

uint32_t total_size_ = 0;
std::optional<uint32_t> globally_allocated_address_ = std::nullopt;
6 changes: 6 additions & 0 deletions tt_metal/tt_metal.cpp
@@ -1084,6 +1084,12 @@ void UpdateDynamicCircularBufferAddress(Program &program, CBHandle cb_handle, co
circular_buffer->assign_global_address();
}

void UpdateDynamicCircularBufferAddressAndTotalSize(Program& program, CBHandle cb_handle, const Buffer& buffer, uint32_t total_size) {
auto circular_buffer = detail::GetCircularBuffer(program, cb_handle);
circular_buffer->config().set_globally_allocated_address_and_total_size(buffer, total_size);
circular_buffer->assign_global_address();
}

uint32_t CreateSemaphore(
Program &program,
const std::variant<CoreRange, CoreRangeSet> &core_spec,
@@ -314,13 +314,13 @@ operation::ProgramWithCallbacks bcast_multi_core_hw(
}

if (src0_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_src0, *src_buffer_a);
UpdateCircularBufferTotalSize(program, cb_src0, num_tiles_per_core_group_1 * src0_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src0, *src_buffer_a, num_tiles_per_core_group_1 * src0_single_tile_size);
}

if (out_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_output, *dst_buffer);
UpdateCircularBufferTotalSize(program, cb_output, num_tiles_per_core_group_1 * dst_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_output, *dst_buffer, num_tiles_per_core_group_1 * dst_single_tile_size);
}
};

@@ -1899,13 +1899,13 @@ operation::ProgramWithCallbacks transpose_wh_multi_core_sharded(const Tensor& a,
uint32_t num_tiles_per_shard = shard_spec.numel() / TILE_HW;

if (src0_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_src0, *src_buffer);
UpdateCircularBufferTotalSize(program, cb_src0, num_tiles_per_shard * src0_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src0, *src_buffer, num_tiles_per_shard * src0_single_tile_size);
}

if (out_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_output, *dst_buffer);
UpdateCircularBufferTotalSize(program, cb_output, num_tiles_per_shard * dst_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_output, *dst_buffer, num_tiles_per_shard * dst_single_tile_size);
}

uint32_t Wt = shard_spec.shape[1] / TILE_WIDTH;
@@ -384,13 +384,13 @@ void BinaryDeviceOperation::BroadcastHeightAndWidthMultiCore::override_runtime_a
}

if (src0_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_src0, *src_buffer_a);
UpdateCircularBufferTotalSize(program, cb_src0, num_tiles_per_core_group_1 * src0_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src0, *src_buffer_a, num_tiles_per_core_group_1 * src0_single_tile_size);
}

if (out_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_output, *dst_buffer);
UpdateCircularBufferTotalSize(program, cb_output, num_tiles_per_core_group_1 * dst_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_output, *dst_buffer, num_tiles_per_core_group_1 * dst_single_tile_size);
}
}

@@ -91,7 +91,7 @@ BinaryDeviceOperation::BroadcastHeightMultiCoreShardedOptimized::create(
TT_FATAL(input_tile_size == output_tile_size, "Input and output tile size should be same");
uint32_t shard_size_in_bytes = shard_spec.numel() * a.element_size();

uint32_t num_tile_per_core = (shard_size_in_bytes + input_tile_size - 1) / TILE_HW; // ceil value
uint32_t num_tile_per_core = (shard_size_in_bytes + input_tile_size - 1) / input_tile_size; // ceil value
TT_FATAL(input_tile_size <= shard_size_in_bytes, "Input tile size should be less than shard size");

uint32_t Wt, Ht;
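
The corrected `num_tile_per_core` line is a ceiling division: the shard size in bytes is divided by the tile size in bytes, not by the number of elements in a tile. A small sketch of the difference, where `div_up` is a hypothetical helper and the sizes are illustrative values, not ones taken from the repository:

```cpp
#include <cstdint>

// Ceiling division for unsigned integers: the smallest q with q * divisor >= value.
// Assumes divisor > 0 and that value + divisor - 1 does not overflow uint32_t.
constexpr uint32_t div_up(uint32_t value, uint32_t divisor) {
    return (value + divisor - 1) / divisor;
}

// Illustrative numbers only: a tile of 1024 two-byte elements occupies 2048 bytes.
constexpr uint32_t kTileElems = 1024;   // plays the role of TILE_HW (element count)
constexpr uint32_t kTileBytes = 2048;   // plays the role of input_tile_size (bytes)
constexpr uint32_t kShardBytes = 5000;  // plays the role of shard_size_in_bytes

// Bytes divided by bytes: 5000 B needs 3 tiles of 2048 B.
static_assert(div_up(kShardBytes, kTileBytes) == 3, "ceil(5000 / 2048) == 3");

// Mixing units (byte numerator, element-count divisor) gives 6 for the same shard.
static_assert((kShardBytes + kTileBytes - 1) / kTileElems == 6, "unit mismatch overcounts");
```

The two results agree only when a tile's byte size equals its element count, that is, for one-byte elements.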
@@ -273,16 +273,16 @@ inline __attribute__((always_inline)) void set_eltwise_binary_runtime_args(
}

if (src0_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_src0, *src_buffer_a);
UpdateCircularBufferTotalSize(program, cb_src0, num_tiles_per_core_group_1 * src0_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src0, *src_buffer_a, num_tiles_per_core_group_1 * src0_single_tile_size);
}
if (src1_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_src1, *src_buffer_b);
UpdateCircularBufferTotalSize(program, cb_src1, num_tiles_per_core_group_1 * src1_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src1, *src_buffer_b, num_tiles_per_core_group_1 * src1_single_tile_size);
}
if (out_sharded) {
UpdateDynamicCircularBufferAddress(program, cb_output, *dst_buffer);
UpdateCircularBufferTotalSize(program, cb_output, num_tiles_per_core_group_1 * dst_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_output, *dst_buffer, num_tiles_per_core_group_1 * dst_single_tile_size);
}
}
BinaryDeviceOperation::ElementWiseMultiCore::cached_program_t BinaryDeviceOperation::ElementWiseMultiCore::create(
@@ -572,8 +572,8 @@ operation::ProgramWithCallbacks multi_core_group_attn_matmul(
if (in0_is_sharded) {
uint32_t cb0_num_input_tiles =
a.shard_spec().value().numel() / TILE_HW; // Should be full MtKt and C should be 1
UpdateDynamicCircularBufferAddress(program, cb_src0, *src0_buffer);
UpdateCircularBufferTotalSize(program, cb_src0, cb0_num_input_tiles * in0_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src0, *src0_buffer, cb0_num_input_tiles * in0_single_tile_size);
} else {
uint32_t cb0_num_input_tiles =
in0_block_w; // TODO: Generalize; double buffer and add blocking along inner dim if we have Mt > 1
@@ -586,17 +586,17 @@ operation::ProgramWithCallbacks multi_core_group_attn_matmul(
if (in1_is_sharded) {
uint32_t cb2_num_input_tiles =
b.shard_spec().value().numel() / TILE_HW; // Should be full CKtNt and batch must be 32
UpdateDynamicCircularBufferAddress(program, cb_src2, *src1_buffer);
UpdateCircularBufferTotalSize(program, cb_src2, cb2_num_input_tiles * in1_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_src2, *src1_buffer, cb2_num_input_tiles * in1_single_tile_size);
}

UpdateCircularBufferTotalSize(program, cb_interm1, MtNt * interm_single_tile_size);

if (output_is_sharded) {
uint32_t num_output_tiles =
output.shard_spec().value().numel() / TILE_HW; // Should be full MtNt and C should be 1
UpdateDynamicCircularBufferAddress(program, cb_output, *dst_buffer);
UpdateCircularBufferTotalSize(program, cb_output, num_output_tiles * output_single_tile_size);
UpdateDynamicCircularBufferAddressAndTotalSize(
program, cb_output, *dst_buffer, num_output_tiles * output_single_tile_size);
} else {
uint32_t num_output_tiles =
MtNt; // TODO: Should be MtNt if Mt > 1? Or, produce one Nt at a time and double buffer?