Commit
…3133) This commit adds the initial support for reduce scatter. However, only a few cases are functional. Future work will improve correctness across more cases.

======= Line Reduce Scatter Algorithm =======

The line reduce scatter algorithm sends the minimal amount of data over each link and out of each chip. All diagrams are for an example 4-chip line reduce scatter.

First, the operation fractures each input tensor:

      Input Tensors
     ----------------
     |    |    |    |
     |    |    |    |
     v    v    v    v
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
Chip 0    1    2    3

          |
          | Fracture
          | Tensors
          v

      Input Tensors
     ----------------
     |    |    |    |
     |    |    |    |
     v    v    v    v
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
    | |  | |  | |  | |
    |-|  |-|  |-|  |-|
Chip 0    1    2    3

After fracturing, the tensors are reduced and collapsed onto the diagonal across the chips, where the diagonal shows how the fractured chunks spatially map to the final outputs. For example, the first output is generated by reducing the top chunk of each input tensor. The reduction is performed by having each chip forward its input to its neighbour. Chips that are not at the end of the line reduce the incoming data with their local input before forwarding.

    |-|    |-|    |-|    |-|
    |#|<---| |<---| |<---| |
    |-|    |-|    |-|    |-|
    | |--->|#|<---| |<---| |
    |-|    |-|    |-|    |-|
    | |--->| |--->|#|<---| |
    |-|    |-|    |-|    |-|
    | |--->| |--->| |--->|#|
    |-|    |-|    |-|    |-|
Chip 0      1      2      3

However, note that each arrow heading out of a chip in a given direction shares Ethernet resources with all other arrows heading in the same direction from that chip, so there is inherent serialization here. For that reason, the chunks must be scheduled. The general scheduling strategy is to send the chunks that are furthest from the final reduce output first, and to step through chunks that are incrementally closer to the final output. Each direction from a chip can be processed independently. The diagram below is annotated with the "timesteps" at which each chunk is sent. Each timestep is marked relative to the chunk source.

    |-| t=0|-| t=0|-| t=0|-|
    |#|<---| |<---| |<---| |
    |-|t=2 |-| t=1|-| t=1|-|
    | |--->|#|<---| |<---| |
    |-|t=1 |-|t=1 |-| t=2|-|
    | |--->| |--->|#|<---| |
    |-|t=0 |-|t=0 |-|t=0 |-|
    | |--->| |--->| |--->|#|
    |-|    |-|    |-|    |-|
Chip 0      1      2      3

Finally, note that producing the final output requires a reduction of the partial results from both directions. Given that the two directions of the line execute completely independently, some sort of merge operation is required. At the time of this commit, the merge strategy is to designate a master and a slave reducer direction. We arbitrarily choose the 'right' or 'clockwise' direction as the master. The master direction writes its output to the output tensor, but note that this is only a partial output. The slave direction reads from the output tensor to merge it with the data from the producer chip. It reads from the output tensor based on credits passed from the master (implemented via semaphores).

    -------Input Tensor
    |     |
    |     |
    |     |    |---------|---------|
    |     |--->| Reader  | Sender  |---      From EDM
    |          | (master)| (master)|  |     --------->
    |          |         |         |  |
    |          |---------|---------|  |
    |                                 |
    |          |----- Output Tensor <-|
    |          |   ^
    |          |   |---------
    |          |             |
    |          |---------|---------|
    |--------->| Reader  | Sender  |      From EDM
               | (slave) | (slave) |     --------->
               |         |         |
               |---------|---------|
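The master/slave credit passing can be sketched as follows. This is only a schematic model of the semaphore-based handshake described above, written as host-side Python with a standard threading semaphore standing in for the device semaphore; it is not the kernel code or its API.

    # Schematic model of the master/slave merge via credit passing (not device code).
    # A Python Semaphore stands in for the device semaphore used to pass credits.
    from threading import Semaphore

    credits = Semaphore(0)

    def master_direction(partials, output_tensor):
        # The master (clockwise) direction writes its partial result for each chunk
        # into the output tensor, then releases a credit for that chunk.
        for i, partial in enumerate(partials):
            output_tensor[i] = partial      # partial output only
            credits.release()               # credit: chunk i may now be merged

    def slave_direction(partials, output_tensor):
        # The slave direction waits for a credit, reads the master's partial result
        # back from the output tensor, merges in its own partial result, and writes
        # the final value.
        for i, partial in enumerate(partials):
            credits.acquire()
            output_tensor[i] = output_tensor[i] + partial

Running master_direction and slave_direction on two threads against the same output_tensor models the intended ordering guarantee: the slave never reads a chunk before the master has written it.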
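Putting the pieces together, a minimal host-side golden model of the data movement above (fracturing, furthest-first scheduling, and per-chunk reduction from both directions) might look as follows. This is an illustrative sketch only, not the device implementation; the function names and the choice of split dimension are assumptions.

    # Illustrative host-side golden model of line reduce scatter (not the device code).
    # Each of the num_chips chips contributes one input tensor; output chunk c lands on chip c.
    import numpy as np

    def fracture(tensor, num_chips):
        # Split one input tensor into num_chips equal chunks along dim 0.
        return np.split(tensor, num_chips, axis=0)

    def send_schedule(chip, num_chips):
        # Per direction, chunks furthest from their final output are sent first,
        # matching the timestep annotations in the diagram above.
        left = list(range(0, chip))                   # e.g. chip 3 sends chunk 0, then 1, then 2 leftward
        right = list(range(num_chips - 1, chip, -1))  # e.g. chip 0 sends chunk 3, then 2, then 1 rightward
        return left, right

    def line_reduce_scatter(inputs):
        num_chips = len(inputs)
        chunks = [fracture(t, num_chips) for t in inputs]
        outputs = []
        for c in range(num_chips):
            # Chip c reduces its own chunk with the partial sums arriving from the
            # left (chips 0..c-1) and from the right (chips c+1..num_chips-1).
            from_left = sum(chunks[src][c] for src in range(0, c))
            from_right = sum(chunks[src][c] for src in range(c + 1, num_chips))
            outputs.append(chunks[c][c] + from_left + from_right)
        return outputs

For example, with four chips, send_schedule(0, 4) returns ([], [3, 2, 1]): chip 0 has nothing to send leftward, and sends the chunk destined for chip 3 at t=0, chip 2 at t=1, and chip 1 at t=2, matching the annotated diagram.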
As a part of the line reduce scatter implementation, new CCL components were added: ccl_send and CCL command generators/readers.

The ccl_send kernel was used to implement the starting ends of the line (i.e. the first senders). Although ccl_send provides more generic send capabilities than line reduce scatter currently requires, it was chosen because it is a basic building block for future CCL send/recv "operations" and higher level CCL programming models.

======= CCL Send (Kernel) =======

The ccl_send kernel acts like an interpreter of CCL commands. CCL commands are, so far, limited to sending a tensor slice from a tensor to the EDM. A command specifies some information about the tensor (shape, slice/view shape, view offset, etc.). ccl_send is capable of executing multiple commands back to back. In the context of line reduce scatter, ccl_send implements the separate sends of the fractured chunks on the left and right ends of the line. To do this for a line reduce scatter, we invoke n commands, where n = #chips in the line. Future commands will let an invoker specify this basic pattern as a single command. Looking at the timestep diagram above, for the tensors at the ends of the line, each timestep maps directly to a separate CCL command.

======= CCL Command Generators/Readers =======

To facilitate command generation, initial components have been added to let the host serialize commands for the ccl_send kernel. Correspondingly, command unpacking logic is specified for each command. This helps simplify command generation on the host. A schematic example of such a command and its host-side packing is sketched after the limitations list below.

Note that ccl_send as a standalone kernel and operation is experimental and has several limitations:
- Slice reads are currently constrained to page-aligned slices
- Host command generation doesn't properly support 4D shapes (although the kernel side will internally represent shapes as 4D)
- Only one command is currently supported (send tensor slice to EDM)
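To make the command flow concrete, the sketch below shows what a "send tensor slice" command and its host-side packing might look like. The struct name, field names, flat-int encoding, and the build_end_of_line_commands helper are all illustrative assumptions and do not reflect the actual ccl_send command format.

    # Illustrative sketch of a "send tensor slice to EDM" command and its host-side
    # serialization. Field names and encoding are hypothetical, not the real format.
    from dataclasses import dataclass, astuple

    @dataclass
    class SendTensorSliceCommand:
        tensor_shape: tuple   # full tensor shape (kernel side represents shapes as 4D)
        slice_shape: tuple    # shape of the slice/view to send
        slice_offset: tuple   # offset of the view into the full tensor

        def pack(self):
            # Host side: flatten the command into a stream of ints (e.g. runtime args).
            args = []
            for field in astuple(self):
                args.extend(field)
            return args

    def build_end_of_line_commands(num_chips, tensor_shape, chunk_dim=2):
        # For the ends of the line, one command is issued per fractured chunk
        # (n commands for an n-chip line), one per timestep.
        slice_shape = list(tensor_shape)
        slice_shape[chunk_dim] //= num_chips
        commands = []
        for timestep in range(num_chips):
            offset = [0, 0, 0, 0]
            offset[chunk_dim] = timestep * slice_shape[chunk_dim]
            commands.append(SendTensorSliceCommand(tuple(tensor_shape),
                                                   tuple(slice_shape),
                                                   tuple(offset)))
        return commands

In such a scheme, the kernel-side unpacking logic would simply mirror pack(), reading the same fields in the same order before executing the send.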