Suggestion: profiling #13
Replies: 30 comments · 94 replies
-
In general, I use OpenCL to make things platform independent, so Metal and DirectML aren't relevant. I haven't tested dlprimitives on macOS since I don't have a Mac, but you are welcome to compare. Regarding DirectML, I just don't have the capacity to check against it. I have a good baseline reference of CUDA+cuDNN for nVidia, and I have AMD's MIOpen on GCN and RDNA (once they release it). There is also oneDNN for Intel, but it has poor performance with channel-first layout: oneapi-src/oneDNN#1194. Bottom line: I know lots of performance improvements can still be made. I don't yet reach cuDNN's level of performance, but at this point it is mostly good enough to be highly useful. I'd be glad to receive help writing better kernels, adopting code from other projects, and improving performance. At this point I'm trying to cover many more operators, make pytorch/dlprimitives much more usable, and provide decent inference support.
-
Training is the most meaningful use case for this stuff. Are you planning to restrict implementations to inference?
-
By no means! It is one of the important use cases and I'm working on it (see the pytorch backend I'm working on).
But I also want to have a lightweight inference tool: see https://github.com/artyom-beilis/dlprimitives/tree/onednn_integration
-
I meant https://github.com/artyom-beilis/dlprimitives/tree/onnx_api, the ONNX support. (oneDNN is useless for channel-first layout.)
-
Do you have a short benchmark suite? If so, I could translate it to Metal Performance Shaders and run it on a few GPU models. Can you test your code on an Intel UHD 630 or Iris Plus? Or what AMD GPUs do you have? The UHD 630 is the one I have instant access to; the others are ones my friends have.
-
Yes: https://github.com/artyom-beilis/dlprimitives/blob/master/docs/build.md#benchmarking
Also there is dlprim_flops for specific kernel performance measurements.
See https://github.com/artyom-beilis/dlprimitives#tested-gpus. On the Intel side I test on a UHD 530, which is basically the same as the 630 up to slight clock differences. To be honest, Intel GPUs at ~400 GFLOPS are quite useless for training: they have about the same performance as CPU training, and CPU training with mklDNN (which is what frameworks like pytorch/tf use) may actually be way faster. Also, with a modern multi-core CPU the built-in GPU would likely be slower. Maybe when I implement float16 support, at ~700 GFLOPS, it would be a little more useful for inference.
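(For readers wanting to reproduce this kind of per-kernel measurement, here is a minimal sketch using the standard OpenCL event-profiling API. It is not dlprim_flops' actual code; the kernel and work sizes are assumed to be set up elsewhere.)

```cpp
// Minimal OpenCL kernel timing via event profiling (C API).
// Assumes ctx, dev, kernel and the work sizes are created elsewhere.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

double time_kernel_ms(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                      size_t global_size, size_t local_size)
{
    cl_int err = CL_SUCCESS;
    // Profiling must be requested when the queue is created.
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    if (err != CL_SUCCESS)
        return -1.0;

    cl_event ev;
    err = clEnqueueNDRangeKernel(q, kernel, 1, nullptr,
                                 &global_size, &local_size, 0, nullptr, &ev);
    if (err != CL_SUCCESS) {
        clReleaseCommandQueue(q);
        return -1.0;
    }
    clWaitForEvents(1, &ev);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, nullptr);

    clReleaseEvent(ev);
    clReleaseCommandQueue(q);
    return (end - start) * 1e-6; // device timestamps are in nanoseconds
}
```

Dividing the kernel's known operation count by this time gives the GFLOPS figure such a tool would report.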
-
I agree the UHD 630 is kind of useless. I could also run your benchmark suite on it on Windows to provide the most accurate comparison. I think there's a difference between the HD 530 and UHD 530, so be careful with the naming if that's not what you meant.
-
I was expecting pre-recorded benchmarks of how many milliseconds each test takes on a particular GPU. Then I would make a simple command-line utility to compare performance on macOS. I was not expecting to have to set up all the OpenCL stuff by myself and modify the code to collect benchmarks. It's not worth my time right now if I have to do all of that.
-
I suggest reading the docs and some of the blog posts; there are plenty of benchmark results there. However, since every GPU and system is a little bit different, it is better to run the comparison on the same machine.
-
I don't understand why there's an order of magnitude performance difference between Caffe and PlaidML. The difference is the same across weaker and stronger GPUs, so it seems like Caffe is using the GPU. I'm confused because I thought PyTorch trained exclusively on the CPU except on Nvidia. Also, what do the percentages mean? Do larger percentages mean more efficiency or more execution time, and is that convention consistent across all of your benchmarks?
-
Because PlaidML has very poor performance.
Caffe has OpenCL, CUDA, and HIP/ROCm versions, but Caffe isn't developed any more and its memory use is not optimal at all.
There is:
-
Moved to discussions
-
I don't think I'll be able to provide benchmarks for this.
-
Now that I think about it, I'm going to make universal GPU acceleration a primary focus of the new S4TF. My initial goals are Metal and DirectX because 99% of people have access to macOS or Windows. Linux already has CUDA acceleration through TensorFlow, but some people with an Intel or AMD GPU would be forced to use native Windows for hardware acceleration via DirectML. I see this repository as being able to enhance some Linux workflows, including Docker VMs on my M1 Mac where Metal isn't available. The DL primitives would accelerate what's possible, falling back to regular TensorFlow otherwise. DLPrimitives has a very finite instruction set, but I have two questions:
-
Isn't S4TF dead? Also, can you elaborate on what you are trying to do? If you want a new backend for TF, there is what is called pluggable-device support in TF >= 2.5. The documentation can hardly be called documentation, but it is still an option.
It is more than that. OpenCL is a universal compute platform that works even on smartphone GPUs. It is designed for computing, unlike Direct3D. I see no reason whatsoever to use Direct3D or Metal unless you are looking to lock somebody into your own platform. Finally, OpenCL is very similar in code and concepts to CUDA, which is the most common compute API today.
What is regular TF exactly - CPU?
Every project has a finite instruction set ;-)
Of course it is planned to add 3D convolutions and others, and I work on it step by step, adding different operators. And there are a lot of them!
There are two problems with SYCL:
So although SYCL looks interesting on paper, it isn't good enough for me.
You want to add a universal backend for TensorFlow? I suggest taking a look at how to build a pluggable device for TF; it is the official API for doing so. At this point I mostly work on the pytorch backend since:
So if you want to add a universal backend for TF, start from a pluggable device and use dlprimitives as the backbone for operations. Once you have the basic things working - let's say you can train resnet18 using dlprimitives in a channel-first setup - the rest becomes much easier and it will be a matter of adding operators one by one.
Trivial operators are trivial to implement using dlprimitives. See the example I linked before.
-
You're correct. There's a whole ton of context I haven't given. Hang in there because this comment is going to be long. (Wrapped in a dropdown to not clutter this discussion.)

Back in February 2021, I was using Metal to accelerate real-time world reconstruction within the thermal envelope of a mobile device. This was incredibly resource-constrained, but thanks to the optimized GPGPU capabilities of Metal, I pulled it off (see scene color reconstruction). At the same time, Google killed Swift for TensorFlow - a project that promised to change the face of deep learning. Now that I'm done with my augmented reality projects, I'm realizing my dream of resurrecting S4TF. I'm doing this mostly by myself, although a few people in the OSS community have pitched in to help. S4TF was promising for many reasons, but one I saw - compatibility with anything - was unprecedented. Python can't run on iPhones, but Swift can. Swift also runs faster on embedded devices because it's pre-compiled and avoids garbage collection. I got a bit off track, but Swift's compatibility means it can be used to author a GPU ML backend for any platform - unlike frameworks with a Python API.

However, PluggableDevice is not compatible with my efforts. First, it requires the file structure of Python TensorFlow - one incompatible with the Swift Package Manager. It seems like TF assumes you're using it in tandem with their Python API. Second, it requires mastering some TF-specific C APIs and passing kernel execution calls through a layer of TF. Thus, calling straight through to a GPGPU API would be faster CPU-side than mediating with TF's PluggableDevice. I'm planning to make a second accelerator backend for S4TF, which is an alternative to CTensorFlow. It's highly customizable, and you can even design a backend that calls back into the original CTensorFlow. It's also easy to use - with a simple, idiomatic Swift API instead of TF-specific C APIs. You mentioned that PyTorch's C++ API is easier. The Swift API would support language errors and much more. In fact, your experience could help me design a better API. I currently plan to name this backend PluggableDevice, but it's not the same as CTensorFlow's PluggableDevice. To resolve ambiguity, I'll label it
Yes, but it’s not well optimized. The PyTorch core team chose to not use OpenCL because of massive driver latencies. To quote them, “every vendor has a walled garden and nobody’s willing to fix driver latencies”. It sacrifices performance for being cross-platform, although developers like you still make amazing things with it.
From my experience, Metal is the second-closest thing to CUDA. It has compute kernels built into the API, and two libraries of highly optimized kernels - Metal Performance Shaders (MPS) and MPS Graph. This mirrors cuDNN and NVIDIA's array of optimized libraries, but is less feature-complete. I did a lot of studying, and learned (the hard way) that there's no way to beat Apple's level of intense optimization. They cover every edge case of matrix multiplication, and utilize a form of hardware acceleration on M1 that is only accessible through Metal (analogous to tensor cores). Driver latencies are also a massive factor, besides GPU-side performance. Metal has a 10-microsecond round-trip latency for a command buffer, while encoding a command within that buffer takes less than 1 microsecond. In machine learning, CPU-side driver latencies can drastically lower performance. That's one of the problems graph execution helps solve - pre-compiling to avoid round trips between APIs. I had assumed that DLPrimitives had minimized the driver latency incurred while using OpenCL.
I'm not using Direct3D itself. More specifically, I'm going to use DirectML. Microsoft's DirectX is a wide variety of APIs, ranging from audio drivers to file storage APIs. It is primarily for making games, but they recently realized its potential for machine learning. So they made a library using D3D12 compute shaders that's incredibly optimized for low driver overhead. It runs on Intel, AMD, NVIDIA, and Qualcomm GPUs, and Microsoft is already working on open-source backends for TF and Torch. I plan to explore Windows programming after implementing the MPSGraph and BNNS backends for S4TF. Since everyone with a computer can usually run macOS, iOS, or Windows on it, that should make it cross-platform. Yes, I'm making two separate GPGPU backends instead of one OpenCL backend, but the end product will have a much smoother user experience because of lower driver overhead. In addition, the frameworks I chose are very feature-complete - I made a spreadsheet outlining the operations S4TF needs from them.
I’m not sure whether S4TF requires the ability to use channel-first APIs. I’m planning to make obscure features like that (and Lanczos image resampling) crash initially when a backend isn’t feature-complete. I may even remove some obscure ops like FractionalMaxPool, if existing S4TF users don’t utilize them. And on platforms besides iOS, a custom
One more thing - Metal has this problem covered very well. Metal Shading Language has function constants, which let you mostly pre-compile a shader while still specializing it at runtime. MPS uses function constants internally to dynamically generate shader permutations fine-tuned for a specific device. And MPS Graph fuses consecutive 1D ops into a runtime-generated Metal shader. I don't know if XLA or ATen delegates to the NVIDIA shader compiler to do this.
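(As a rough OpenCL analogue of the function-constant specialization described above: OpenCL kernels can be specialized at runtime by rebuilding the program with -D preprocessor options. The kernel source and the WG_SIZE parameter below are illustrative only, not dlprimitives' actual mechanism.)

```cpp
// Runtime specialization of an OpenCL kernel through -D build options,
// loosely analogous to Metal function constants.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <string>

static const char *src = R"CLC(
#ifndef WG_SIZE
#define WG_SIZE 64
#endif
__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
__kernel void scale(__global float *x, float a)
{
    x[get_global_id(0)] *= a;
}
)CLC";

// Each distinct option string yields a separately compiled, device-tuned
// permutation of the same source.
cl_program build_specialized(cl_context ctx, cl_device_id dev, int wg_size)
{
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    std::string opts = "-DWG_SIZE=" + std::to_string(wg_size);
    clBuildProgram(prog, 1, &dev, opts.c_str(), nullptr, nullptr);
    return prog;
}
```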
-
This is so untrue that I can't even explain. Pytorch does not use OpenCL because they use what everybody else in the deep learning world is using: CUDA and nVidia's cuDNN libraries. They are THE reason to use CUDA. cuDNN performance is superior because nVidia spent lots of time optimizing it, and nVidia did a great job creating proprietary tools to control the entire DL market. There is nobody who really competes with nVidia; AMD and Intel are true toddlers in this field. To be honest, they don't even really try (see RDNA and MIOpen/HIP support). Maybe Google's TPUs are some competition, but they aren't relevant for most end users. Now, OpenCL is fine and is the primary compute API for Intel and AMD (oneDNN, ROCm). MIOpen has OpenCL and HIP APIs that have (I tested) the same performance, but they created HIP to auto-convert CUDA code for AMD - because otherwise AMD would be even less relevant. ARM's GPUs use OpenCL for compute as well (there is a library for this). Now, with regard to nVidia: I did some kernel writing, and the same kernel written in CUDA and in OpenCL has the same performance - they use the same PTX and runtime. I tested my GEMM kernels on CUDA and got exactly the same results as in OpenCL performance-wise, and these are the most important ones. So you are missing the entire point of using an open platform. Your statement couldn't be more inaccurate. Regarding latency: one of the big things in DL is that most operations are compute-bound. I couldn't find a difference, for example, between running over PCI-E x4 and x16, because the contributions are negligible. As long as you can feed the GPU faster than the GPU can compute, you are OK. I suggest taking a look at pytorch's async design.
You don't use Python on end devices; that is what C++ is for. That is why, for both pytorch and tensorflow, Python is merely a driver that calls ops implemented in C++. Finally, if you are interested in inference and not training, I'd suggest you look into ONNX and its implementations.
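(To make the CUDA/OpenCL similarity above concrete, here is a trivial SAXPY kernel in both dialects; it is only an illustration, not one of the GEMM kernels being discussed.)

```c
// OpenCL C:
__kernel void saxpy(int n, float a, __global const float *x, __global float *y)
{
    int i = get_global_id(0);
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* The CUDA version differs mainly in qualifiers and how the index is derived:

   __global__ void saxpy(int n, float a, const float *x, float *y)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           y[i] = a * x[i] + y[i];
   }
*/
```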
-
FYI: Why do we need OpenCL based solution for deep learning? http://blog.dlprimitives.org/post/2
-
Maybe I'm reading it the wrong way, but I sensed a non-constructive tone. I know that NVIDIA is #1 in this field, and I'm trying to bring ML to other platforms.
On an issue on the PyTorch repository, the leader of PyTorch outlined their reason for not going with OpenCL. They experimented with OpenCL recently because their users wanted it.
I intend to use iOS/iPadOS for ML training. For a year, I had an iPhone that was much more powerful than my Mac. MLCompute ran on it, but was not as ergonomic as TF or S4TF. Swift is a language that replaces both Python and C++, and I saw unrealized potential for hardware acceleration. I'm not sure where this conversation is going, but I would hope to collaborate with you in software development. Are you open to helping me author a DLPrimitives backend for a machine learning framework in the future?
-
The point was to show that the claim that OpenCL isn't suitable is inaccurate. It wasn't meant to be non-constructive.
I see. It surprises me a little. Also, I'm not sure the iPad's GPU is stronger than its CPU.
Yes, I created dlprimitives as a common core library similar to cuDNN/TensorRT, so I'm happy wherever it is going to be used. If you find some critical operators missing, just drop a word and I'll prioritise them.
Let's say they wouldn't spend time developing an OpenCL backend themselves, but there is no problem making an out-of-tree backend. Note that I frequently discuss various topics on the pytorch dev forums.
-
I'm looking forward to doing it.
-
Intel GPUs are useless in this sense. The latest multi-core Intel CPUs have very good performance and decent teraflops. In any case, note that TFLOPS aren't the only thing that matters; memory bandwidth is critical as well. Now: 1.0 TFLOPS isn't really useful for anything but MNIST-level training. 2.6 TFLOPS is a very basic start (for example, the gtx960 and rx560 I have are 2.5 TFLOPS grade), but these are mostly toy cases, learning cases, or very basic networks for small data sets/small images. Only starting from 5-10 TFLOPS can some real work be done.
What GPU does the iPad use? It isn't only about the driver; it is much more about writing good kernels. I did optimization for AMD, nVidia, and to some extent Intel. Core ops like GEMM currently aren't optimized for mobile GPUs since I don't have an easy way to test them.
-
Interesting, since I don't have experience with sequences and RNNs.
-
I've gone ahead and forked this repository. I'm going to experiment with a scheme where I refactor your files into a Swift package. Because Git remembers the files' original locations in the folders, I can push your future commits into my repository with a pull request. I'll also want a second layer of control over a backend I officially support and integrate with my ML framework.
-
Let's say that in the future, I make OpenCL kernels for ML that build on your work in DLPrimitives. But, since I mostly have access to my Apple silicon device, they're optimized for the Apple architecture. From what I've read, the Apple and AMD GPU architectures have a lot of similarities. What would the performance implications be when they're run on Intel and AMD devices?
-
Instant compile failure. Use the C script at https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/ as a reference. On Apple, the OpenCL headers are imported with
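(For reference, and assuming the truncated sentence above was about Apple's framework-style header path, the portable include is usually written like this.)

```c
/* OpenCL headers live under a framework path on Apple platforms. */
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
```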
-
Just to organize our conversation, I'm separating part of it into a different reply thread. Here are the results of running the CLBlast autotuner for:
-
Also separating this out from other talk. The ratio of threads to "cores" on the Apple7 generation is 256 (8192 threads / 32 cores). However, that number doubled with the Apple8 generation. We would need to check for A15+ or M2+ in the device's name to bump the ratio to 512. Do you have a suggestion for how we can future-proof the algorithm to get the correct number of cores on the Apple8 architecture?
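(One way to express the heuristic above is a small, self-contained helper; the function names and the A15/A16/M2 name checks are assumptions for illustration, not an official API.)

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper: pick the generation-dependent threads-per-"core" ratio
// from the reported device name (256 for Apple7, 512 for Apple8 as described
// above). New generations would need their tags added here, which is exactly
// the future-proofing problem raised in the question.
static int threads_per_core_for(const std::string &device_name)
{
    for (const char *tag : {"A15", "A16", "M2"})
        if (device_name.find(tag) != std::string::npos)
            return 512; // assumed Apple8-class
    return 256;         // Apple7 and earlier
}

static int estimated_core_count(const std::string &device_name, int max_threads)
{
    return max_threads / threads_per_core_for(device_name);
}

// Example: estimated_core_count("Apple M1", 8192) == 32
```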
-
I'm going to reserve this comment for discussion about compiling any code I create.
---
I got the idea to make dedicated Swift bindings for OpenCL, mirroring how the C++ API is just a wrapper around the C API. Then, I might validate it by rewriting DLPrimitives using Swift (or just package tests inspired by your code). Would you be willing to run my finished product on your machine to see if it runs correctly?
-
@artyom-beilis I figured out why DLPrimitives runs so slowly on macOS. The Apple GPU has extremely slow threadgroup memory; Apple made it slow because threadgroup memory is very power hungry, and this is one of several things I've realized they sacrificed for power efficiency. In another code base, I saw a 20% : 50% ratio of performance between a 10 TFLOPS Apple GPU and a 10 TFLOPS Nvidia GPU, and that code base also used threadgroup memory. Therefore, Apple made
I'm planning to make an OpenCL 3.0 driver for macOS, where
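(For context: Metal's threadgroup memory corresponds to OpenCL's __local memory, which tiled kernels such as GEMM stage data through, so a slow local-memory path hits exactly this kind of pattern. A deliberately simplified sketch:)

```c
// Schematic tile load: each work-group stages a TILE x TILE block in
// __local (threadgroup) memory before consuming it.
#define TILE 16

__attribute__((reqd_work_group_size(TILE, TILE, 1)))
__kernel void tile_demo(__global const float *A, int lda, __global float *out)
{
    __local float tileA[TILE][TILE];

    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_global_id(0), gy = get_global_id(1);

    tileA[ly][lx] = A[gy * lda + gx];    // staged through local memory
    barrier(CLK_LOCAL_MEM_FENCE);        // make the tile visible to the group

    out[gy * lda + gx] = tileA[ly][lx];  // consumers read from local memory
}
```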
-
Apple has highly optimized DL primitives in Metal Performance Shaders. By partially de-compiling Metal Performance Shaders, I saw an insane number of permutations. They optimized for all sorts of edge cases, and I saw the term "winograd" once or twice in function names. I tried comparing custom Metal shaders to Apple's MPS and mine were terrible, but I imagine you have more time to thoroughly investigate performance deltas.
Given that Metal works on most AMD and Intel GPUs, it would be wise to run your OpenCL code on macOS and compare your performance to Apple's. That would ensure your kernels utilize the GPU as much as physically possible. Another suggestion is to try comparing DirectML, although I suspect that Apple is more optimized due to the sheer number of permutations they created. You can examine the DirectML source code to see if Microsoft takes the permutation approach too.