Suggestion: profiling #13
Replies: 30 comments · 94 replies
-
In general, I use OpenCL to make things platform independent, so Metal and DirectML aren't relevant. I haven't tested dlprimitives on macOS since I don't have a Mac, but you are welcome to compare. Regarding DirectML, I just don't have the capacity to check against it. I have a good baseline reference of CUDA+cuDNN for nVidia, and I have AMD's MIOpen on GCN and RDNA (once they release it). There is also oneDNN for Intel, but it has poor performance with channel-first layout: oneapi-src/oneDNN#1194. Bottom line: I know lots of performance improvements can still be made. I don't yet reach cuDNN's level of performance, but at this point it is mostly good enough to be highly useful. I'd be glad to receive help writing better kernels, adopting code from other projects, and improving performance. At this point I'm trying to cover many more operators, make pytorch/dlprimitives much more usable, and provide decent inference support.
-
Training is the most meaningful use case for this stuff. Are you planning to restrict implementations to inference?
-
By no means! It is one of the important use cases and I'm working on it (see the pytorch backend I'm working on).
But I also want to have a lightweight inference tool: see https://github.com/artyom-beilis/dlprimitives/tree/onednn_integration
-
I meant https://github.com/artyom-beilis/dlprimitives/tree/onnx_api, the ONNX support. (oneDNN is useless for channel-first layout.)
-
Do you have a short benchmark suite? If so, I could translate it to Metal Performance Shaders and run it on a few GPU models. Can you test your code on an Intel UHD 630 or Iris Plus? Or what AMD GPUs do you have? The UHD 630 is the one I have instant access to; the others are ones my friends have.
-
Yes: https://github.com/artyom-beilis/dlprimitives/blob/master/docs/build.md#benchmarking
Also there is dlprim_flops for specific kernel performance measurements.
See https://github.com/artyom-beilis/dlprimitives#tested-gpus. On the Intel side I test on a UHD 530, which is basically the same as the 630 up to slight clock differences. To be honest, Intel GPUs at ~400 GFLOPS are quite useless for training: they have about the same performance as CPU training, and CPU training with mklDNN (which is what frameworks like pytorch/tf use) may actually be way faster. Also, with a modern multi-core CPU the built-in GPU would likely be slower. Maybe when I implement float16 support, at ~700 GFLOPS, it would be a little more useful for inference.
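(For readers wanting to reproduce this kind of per-kernel measurement, here is a minimal sketch using the standard OpenCL event-profiling API. It is not dlprim_flops' actual code; the kernel and work sizes are assumed to be set up elsewhere.)

```cpp
// Minimal OpenCL kernel timing via event profiling (C API).
// Assumes ctx, dev, kernel and the work sizes are created elsewhere.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

double time_kernel_ms(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                      size_t global_size, size_t local_size)
{
    cl_int err = CL_SUCCESS;
    // Profiling must be requested when the queue is created.
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    if (err != CL_SUCCESS)
        return -1.0;

    cl_event ev;
    err = clEnqueueNDRangeKernel(q, kernel, 1, nullptr,
                                 &global_size, &local_size, 0, nullptr, &ev);
    if (err != CL_SUCCESS) {
        clReleaseCommandQueue(q);
        return -1.0;
    }
    clWaitForEvents(1, &ev);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, nullptr);

    clReleaseEvent(ev);
    clReleaseCommandQueue(q);
    return (end - start) * 1e-6; // device timestamps are in nanoseconds
}
```

Dividing the kernel's known operation count by this time gives the GFLOPS figure such a tool would report.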
-
I agree the UHD 630 is kind of useless. I could also run your benchmark suite on it on Windows to provide the most accurate comparison. I think there's a difference between the HD 530 and UHD 530, so be careful with the naming if that's not what you meant.
-
I was expecting pre-recorded benchmarks of how many milliseconds each test takes on a particular GPU. Then I would make a simple command-line utility to compare performance on macOS. I was not expecting to have to set up all the OpenCL stuff by myself and modify the code to collect benchmarks. It's not worth my time right now if I have to do all of that.
-
I suggest reading the docs and some of the blog posts; there are plenty of benchmark results there. However, since every GPU and system is a little bit different, it is better to run the comparison on the same machine.
-
I don't understand why there's an order of magnitude performance difference between Caffe and PlaidML. The difference is the same across weaker and stronger GPUs, so it seems like Caffe is using the GPU. I'm confused because I thought PyTorch trained exclusively on the CPU except on Nvidia. Also, what do the percentages mean? Do larger percentages mean more efficiency or more execution time, and is that convention consistent across all of your benchmarks?
-
Because PlaidML has very poor performance.
Caffe has OpenCL, CUDA, and HIP/ROCm versions, but Caffe isn't developed any more and its memory use is not optimal at all.
There is:
-
Moved to discussions
-
I don't think I'll be able to provide benchmarks for this.
-
Now that I think about it, I'm going to make universal GPU acceleration a primary focus of the new S4TF. My initial goals are Metal and DirectX because 99% of people have access to macOS or Windows. Linux already has CUDA acceleration through TensorFlow, but some people with an Intel or AMD GPU would be forced to use native Windows for hardware acceleration via DirectML. I see this repository as being able to enhance some Linux workflows, including Docker VMs on my M1 Mac where Metal isn't available. The DL primitives would accelerate what's possible, falling back to regular TensorFlow otherwise. DLPrimitives has a very finite instruction set, but I have two questions:
-
Isn't S4TF dead? Also, can you elaborate on what you are trying to do? If you want a new backend for TF, there is what is called pluggable-device support in TF >= 2.5. The documentation can hardly be called documentation, but it is still an option.
It is more than that. OpenCL is a universal compute platform that works even on smartphone GPUs. It is designed for computing, unlike Direct3D. I see no reason whatsoever to use Direct3D or Metal unless you are looking to lock somebody into your own platform. Finally, OpenCL is very similar in code and concepts to CUDA, which is the most common compute API today.
What is regular TF exactly - CPU?
Every project has a finite instruction set ;-)
Of course it is planned to add 3D convolutions and others, and I work on it step by step, adding different operators. And there are a lot of them!
There are two problems with SYCL:
So although SYCL looks interesting on paper, it isn't good enough for me.
You want to add a universal backend for TensorFlow? I suggest taking a look at how to build a pluggable device for TF; it is the official API for doing so. At this point I mostly work on the pytorch backend since:
So if you want to add a universal backend for TF, start from a pluggable device and use dlprimitives as the backbone for operations. Once you have the basic things working - let's say you can train resnet18 using dlprimitives in a channel-first setup - the rest becomes much easier and it will be a matter of adding operators one by one.
Trivial operators are trivial to implement using dlprimitives. See the example I linked before.
-
You're correct. There's a whole ton of context I haven't given. Hang in there because this comment is going to be long. (Wrapped in a dropdown to not clutter this discussion.)

Back in February 2021, I was using Metal to accelerate real-time world reconstruction within the thermal envelope of a mobile device. This was incredibly resource-constrained, but thanks to the optimized GPGPU capabilities of Metal, I pulled it off (see scene color reconstruction). At the same time, Google killed Swift for TensorFlow - a project that promised to change the face of deep learning. Now that I'm done with my augmented reality projects, I'm realizing my dream of resurrecting S4TF. I'm doing this mostly by myself, although a few people in the OSS community have pitched in to help. S4TF was promising for many reasons, but one I saw - compatibility with anything - was unprecedented. Python can't run on iPhones, but Swift can. Swift also runs faster on embedded devices because it's pre-compiled and avoids garbage collection. I got a bit off track, but Swift's compatibility means it can be used to author a GPU ML backend for any platform - unlike frameworks with a Python API.

However, PluggableDevice is not compatible with my efforts. First, it requires the file structure of Python TensorFlow - one incompatible with the Swift Package Manager. It seems like TF assumes you're using it in tandem with their Python API. Second, it requires mastering some TF-specific C APIs and passing kernel execution calls through a layer of TF. Thus, calling straight through to a GPGPU API would be faster CPU-side than mediating with TF's PluggableDevice. I'm planning to make a second accelerator backend for S4TF, which is an alternative to CTensorFlow. It's highly customizable, and you can even design a backend that calls back into the original CTensorFlow. It's also easy to use - with a simple, idiomatic Swift API instead of TF-specific C APIs. You mentioned that PyTorch's C++ API is easier. The Swift API would support language errors and much more. In fact, your experience could help me design a better API. I currently plan to name this backend PluggableDevice, but it's not the same as CTensorFlow's PluggableDevice. To resolve ambiguity, I'll label it
Yes, but it’s not well optimized. The PyTorch core team chose to not use OpenCL because of massive driver latencies. To quote them, “every vendor has a walled garden and nobody’s willing to fix driver latencies”. It sacrifices performance for being cross-platform, although developers like you still make amazing things with it.
From my experience, Metal is the second-closest thing to CUDA. It has compute kernels built into the API, and two libraries of highly optimized kernels - Metal Performance Shaders (MPS) and MPS Graph. This mirrors cuDNN and NVIDIA's array of optimized libraries, but is less feature-complete. I did a lot of studying, and learned (the hard way) that there's no way to beat Apple's level of intense optimization. They cover every edge case of matrix multiplication, and utilize a form of hardware acceleration on M1 that is only accessible through Metal (analogous to tensor cores). Driver latencies are also a massive factor, besides GPU-side performance. Metal has a 10-microsecond round-trip latency for a command buffer, while encoding a command within that buffer takes less than 1 microsecond. In machine learning, CPU-side driver latencies can drastically lower performance. That's one of the problems graph execution helps solve - pre-compiling to avoid round trips between APIs. I had assumed that DLPrimitives had minimized the driver latency incurred while using OpenCL.
I'm not using Direct3D itself. More specifically, I'm going to use DirectML. Microsoft's DirectX is a wide variety of APIs, ranging from audio drivers to file storage APIs. It is primarily for making games, but they recently realized its potential for machine learning. So they made a library using D3D12 compute shaders that's incredibly optimized for low driver overhead. It runs on Intel, AMD, NVIDIA, and Qualcomm GPUs, and Microsoft is already working on open-source backends for TF and Torch. I plan to explore Windows programming after implementing the MPSGraph and BNNS backends for S4TF. Since everyone with a computer can usually run macOS, iOS, or Windows on it, that should make it cross-platform. Yes, I'm making two separate GPGPU backends instead of one OpenCL backend, but the end product will have a much smoother user experience because of lower driver overhead. In addition, the frameworks I chose are very feature-complete - I made a spreadsheet outlining the operations S4TF needs from them.
I’m not sure whether S4TF requires the ability to use channel-first APIs. I’m planning to make obscure features like that (and Lanczos image resampling) crash initially when a backend isn’t feature-complete. I may even remove some obscure ops like FractionalMaxPool, if existing S4TF users don’t utilize them. And on platforms besides iOS, a custom
One more thing - Metal has this problem covered very well. Metal Shading Language has function constants, which let you mostly pre-compile a shader while still specializing it at runtime. MPS uses function constants internally to dynamically generate shader permutations fine-tuned for a specific device. And MPS Graph fuses consecutive 1D ops into a runtime-generated Metal shader. I don't know if XLA or ATen delegates to the NVIDIA shader compiler to do this.
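(As a rough OpenCL analogue of the function-constant specialization described above: OpenCL kernels can be specialized at runtime by rebuilding the program with -D preprocessor options. The kernel source and the WG_SIZE parameter below are illustrative only, not dlprimitives' actual mechanism.)

```cpp
// Runtime specialization of an OpenCL kernel through -D build options,
// loosely analogous to Metal function constants.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <string>

static const char *src = R"CLC(
#ifndef WG_SIZE
#define WG_SIZE 64
#endif
__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
__kernel void scale(__global float *x, float a)
{
    x[get_global_id(0)] *= a;
}
)CLC";

// Each distinct option string yields a separately compiled, device-tuned
// permutation of the same source.
cl_program build_specialized(cl_context ctx, cl_device_id dev, int wg_size)
{
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    std::string opts = "-DWG_SIZE=" + std::to_string(wg_size);
    clBuildProgram(prog, 1, &dev, opts.c_str(), nullptr, nullptr);
    return prog;
}
```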
-
This is so untrue that I can't even explain. Pytorch does not use OpenCL because they use what everybody else in the deep learning world is using: CUDA and nVidia's cuDNN libraries. They are THE reason to use CUDA. cuDNN performance is superior because nVidia spent lots of time optimizing it, and nVidia did a great job creating proprietary tools to control the entire DL market. There is nobody who really competes with nVidia; AMD and Intel are true toddlers in this field. To be honest, they don't even really try (see RDNA and MIOpen/HIP support). Maybe Google's TPUs are some competition, but they aren't relevant for most end users. Now, OpenCL is fine and is the primary compute API for Intel and AMD (oneDNN, ROCm). MIOpen has OpenCL and HIP APIs that have (I tested) the same performance, but they created HIP to auto-convert CUDA code for AMD - because otherwise AMD would be even less relevant. ARM's GPUs use OpenCL for compute as well (there is a library for this). Now, with regard to nVidia: I did some kernel writing, and the same kernel written in CUDA and in OpenCL has the same performance - they use the same PTX and runtime. I tested my GEMM kernels on CUDA and got exactly the same results as in OpenCL performance-wise, and these are the most important ones. So you are missing the entire point of using an open platform. Your statement couldn't be more inaccurate. Regarding latency: one of the big things in DL is that most operations are compute-bound. I couldn't find a difference, for example, between running over PCI-E x4 and x16, because the contributions are negligible. As long as you can feed the GPU faster than the GPU can compute, you are OK. I suggest taking a look at pytorch's async design.
You don't use Python on end devices; that is what C++ is for. That is why, for both pytorch and tensorflow, Python is merely a driver that calls ops implemented in C++. Finally, if you are interested in inference and not training, I'd suggest you look into ONNX and its implementations.
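(To make the CUDA/OpenCL similarity above concrete, here is a trivial SAXPY kernel in both dialects; it is only an illustration, not one of the GEMM kernels being discussed.)

```c
// OpenCL C:
__kernel void saxpy(int n, float a, __global const float *x, __global float *y)
{
    int i = get_global_id(0);
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* The CUDA version differs mainly in qualifiers and how the index is derived:

   __global__ void saxpy(int n, float a, const float *x, float *y)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           y[i] = a * x[i] + y[i];
   }
*/
```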
-
FYI: Why do we need OpenCL based solution for deep learning? http://blog.dlprimitives.org/post/2
-
Maybe I'm reading it the wrong way, but I sensed a non-constructive tone. I know that NVIDIA is #1 in this field, and I'm trying to bring ML to other platforms.
On an issue on the PyTorch repository, the leader of PyTorch outlined their reason for not going with OpenCL. They experimented with OpenCL recently because their users wanted it.
I intend to use iOS/iPadOS for ML training. For a year, I had an iPhone that was much more powerful than my Mac. MLCompute ran on it, but was not as ergonomic as TF or S4TF. Swift is a language that replaces both Python and C++, and I saw unrealized potential for hardware acceleration. I'm not sure where this conversation is going, but I would hope to collaborate with you in software development. Are you open to helping me author a DLPrimitives backend for a machine learning framework in the future?
-
The point was to show that the claim that OpenCL isn't suitable is inaccurate. It wasn't meant to be non-constructive.
I see. It surprises me a little. Also, I'm not sure the iPad's GPU is stronger than its CPU.
Yes, I created dlprimitives as a common core library similar to cuDNN/TensorRT, so I'm happy wherever it is going to be used. If you find some critical operators missing, just drop a word and I'll prioritise them.
Let's say they wouldn't spend time developing an OpenCL backend themselves, but there is no problem making an out-of-tree backend. Note that I frequently discuss various topics on the pytorch dev forums.
-
I'm looking forward to doing it.
-
Intel GPUs are useless in this sense. The latest multi-core Intel CPUs have very good performance and decent teraflops. In any case, note that TFLOPS aren't the only thing that matters; memory bandwidth is critical as well. Now: 1.0 TFLOPS isn't really useful for anything but MNIST-level training. 2.6 TFLOPS is a very basic start (for example, the gtx960 and rx560 I have are 2.5 TFLOPS grade), but these are mostly toy cases, learning cases, or very basic networks for small data sets/small images. Only starting from 5-10 TFLOPS can some real work be done.
What GPU does the iPad use? It isn't only about the driver; it is much more about writing good kernels. I did optimization for AMD, nVidia, and to some extent Intel. Core ops like GEMM currently aren't optimized for mobile GPUs since I don't have an easy way to test them.
-
Interesting, since I don't have experience with sequences and RNNs.
-
I've gone ahead and forked this repository. I'm going to experiment with a scheme where I refactor your files into a Swift package. Because Git remembers the files' original locations in the folders, I can push your future commits into my repository with a pull request. I'll also want a second layer of control over a backend I officially support and integrate with my ML framework.
-
Let's say that in the future, I make OpenCL kernels for ML that build on your work in DLPrimitives. But, since I mostly have access to my Apple silicon device, they're optimized for the Apple architecture. From what I've read, the Apple and AMD GPU architectures have a lot of similarities. What would the performance implications be when they're run on Intel and AMD devices?
-
Instant compile failure. Use the C script at https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/ as a reference. On Apple, the OpenCL headers are imported with
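(For reference, and assuming the truncated sentence above was about Apple's framework-style header path, the portable include is usually written like this.)

```c
/* OpenCL headers live under a framework path on Apple platforms. */
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
```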
-
Just to organize our conversation, I'm separating part of it into a different reply thread. Here are the results of running the CLBlast autotuner for:
-
Also separating this out from other talk. The ratio of threads to "cores" on the Apple7 generation is 256 (8192 threads / 32 cores). However, that number doubled with the Apple8 generation. We would need to check for A15+ or M2+ in the device's name to bump the ratio to 512. Do you have a suggestion for how we can future-proof the algorithm to get the correct number of cores on the Apple8 architecture?
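(One way to express the heuristic above is a small, self-contained helper; the function names and the A15/A16/M2 name checks are assumptions for illustration, not an official API.)

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper: pick the generation-dependent threads-per-"core" ratio
// from the reported device name (256 for Apple7, 512 for Apple8 as described
// above). New generations would need their tags added here, which is exactly
// the future-proofing problem raised in the question.
static int threads_per_core_for(const std::string &device_name)
{
    for (const char *tag : {"A15", "A16", "M2"})
        if (device_name.find(tag) != std::string::npos)
            return 512; // assumed Apple8-class
    return 256;         // Apple7 and earlier
}

static int estimated_core_count(const std::string &device_name, int max_threads)
{
    return max_threads / threads_per_core_for(device_name);
}

// Example: estimated_core_count("Apple M1", 8192) == 32
```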
-
I'm going to reserve this comment for discussion about compiling any code I create.
---
I got the idea to make dedicated Swift bindings for OpenCL, mirroring how the C++ API is just a wrapper around the C API. Then, I might validate it by rewriting DLPrimitives using Swift (or just package tests inspired by your code). Would you be willing to run my finished product on your machine to see if it runs correctly?
-
@artyom-beilis I figured out why DLPrimitives runs so slowly on macOS. The Apple GPU has extremely slow threadgroup memory; Apple made it slow because threadgroup memory is very power hungry, and this is one of several things I've realized they sacrificed for power efficiency. In another code base, I saw a 20% : 50% ratio of performance between a 10 TFLOPS Apple GPU and a 10 TFLOPS Nvidia GPU, and that code base also used threadgroup memory. Therefore, Apple made
I'm planning to make an OpenCL 3.0 driver for macOS, where
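(For context: Metal's threadgroup memory corresponds to OpenCL's __local memory, which tiled kernels such as GEMM stage data through, so a slow local-memory path hits exactly this kind of pattern. A deliberately simplified sketch:)

```c
// Schematic tile load: each work-group stages a TILE x TILE block in
// __local (threadgroup) memory before consuming it.
#define TILE 16

__attribute__((reqd_work_group_size(TILE, TILE, 1)))
__kernel void tile_demo(__global const float *A, int lda, __global float *out)
{
    __local float tileA[TILE][TILE];

    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_global_id(0), gy = get_global_id(1);

    tileA[ly][lx] = A[gy * lda + gx];    // staged through local memory
    barrier(CLK_LOCAL_MEM_FENCE);        // make the tile visible to the group

    out[gy * lda + gx] = tileA[ly][lx];  // consumers read from local memory
}
```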
-
Apple has highly optimized DL primitives in Metal Performance Shaders. By partially de-compiling Metal Performance Shaders, I saw an insane number of permutations. They optimized for all sorts of edge cases, and I saw the term "winograd" once or twice in function names. I tried comparing custom Metal shaders to Apple's MPS and mine were terrible, but I imagine you have more time to thoroughly investigate performance deltas.
Given that Metal works on most AMD and Intel GPUs, it would be wise to run your OpenCL code on macOS and compare your performance to Apple's. That would ensure your kernels utilize the GPU as much as physically possible. Another suggestion is to try comparing DirectML, although I suspect that Apple is more optimized due to the sheer number of permutations they created. You can examine the DirectML source code to see if Microsoft takes the permutation approach too.