
GafferML #6150

Open · wants to merge 12 commits into base: 1.5_maintenance
Conversation

@johnhaddon (Member) commented Nov 19, 2024

This dips Gaffer's littlest toe into the swirling waters of machine learning, adding a handful of nodes to do processing using ONNX Runtime. These are :

  • DataToTensor : Converts Gaffer data to tensors.
  • Inference : Loads ONNX models and performs inference using an array of input tensors.
  • ImageToTensor : Converts images to tensors for use with the Inference node.
  • TensorToImage : Converts tensors back to images following inference.

At this point we are deliberately not shipping any actual ML models with Gaffer, nor any nodes to do specific tasks. This is all on a strictly "bring your own" basis, whereby the preparation of the .onnx models is entirely up to the user using external tools. But we believe that Gaffer provides a decent environment for packaging such models into useful Box-based end-user tools that interoperate with Gaffer's other modules, with internal nodes for wrangling the data into appropriate shape, and external plugs to provide user control.

Currently the "bring your own" philosophy extends as far as us not even shipping ONNX Runtime with Gaffer - instead you must download it yourself and provide it via ONNX_ROOT, much as is done with third-party renderers. In future we might ship the runtime, but right now it adds more to the package size than is justified by GafferML's slightly experimental nature. Development and testing to date has used v1.19.2 from https://github.com/microsoft/onnxruntime/releases/tag/v1.19.2.

@lucienfostier has guided me expertly through the development process on this, and has been busy testing it on various more advanced image-processing tasks in the background. Hopefully he'll be able to share details on that soon, but in the meantime, here are a couple of screenshots I've prepared of Gaffer doing basic image processing using the Segment Anything and Depth Anything models respectively :

[screenshot: Segment Anything segmentation in Gaffer]
[screenshot: Depth Anything depth map in Gaffer]

@lucienfostier (Collaborator) left a comment:

This looks good to me!
I had to spread the review across three days to review everything carefully, but hopefully that will be useful.
Looking forward to integrating this in production!
Thanks!

@@ -0,0 +1,49 @@
##########################################################################
#
# Copyright (c) 2012, John Haddon. All rights reserved.
lucienfostier (Collaborator):

This copyright is different from that in most other files; is it on purpose?

johnhaddon (Member Author):

That was a sloppy copy-paste - fixed in 26dfaf8.

__import__( "GafferImage" )

if hasattr( os, "add_dll_directory" ) :
	os.add_dll_directory( ( pathlib.Path( os.environ["ONNX_ROOT"] ) / "lib" ).resolve() )
lucienfostier (Collaborator):

Just a question, this is only necessary on Windows?

johnhaddon (Member Author):

Yes, correct - Python won't find DLL dependencies of a module unless you tell it where they are with add_dll_directory(). The code only runs on Windows, because hasattr() returns False on other platforms.
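As a hedged illustration of that Windows-only guard (the helper name is invented; this is not Gaffer's actual module code):

```python
import os
import pathlib

def register_onnx_dlls(onnx_root):
    """Hypothetical helper: on Windows, tell Python's extension-module
    loader where the ONNX Runtime DLLs live. os.add_dll_directory()
    only exists on Windows, so the hasattr() check makes this a no-op
    on Linux/macOS, where the dynamic linker resolves shared libraries
    itself (rpath, LD_LIBRARY_PATH, etc.)."""
    if hasattr(os, "add_dll_directory"):
        os.add_dll_directory(str((pathlib.Path(onnx_root) / "lib").resolve()))
        return True
    return False
```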

GafferML.Tensor( IECore.IntVectorData( [ 1, 2, 3, 4 ] ), [ 4 ] )
]

self.assertEqual( len( { t.hash() for t in tensors } ), len( tensors ) )
lucienfostier (Collaborator):

What is this testing exactly?

johnhaddon (Member Author):

It checks that each tensor has a unique hash with respect to the others. If any tensor wasn't unique, the size of the set of hashes would be smaller than the number of tensors.
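The idiom can be sketched in isolation (a generic helper for illustration, not GafferML API):

```python
def all_unique(hashes):
    """True iff no two hashes are equal: duplicates collapse when the
    sequence is poured into a set, shrinking its length."""
    return len(set(hashes)) == len(hashes)

# Distinct hashes keep the set the same size as the list...
assert all_unique(["h1", "h2", "h3"])
# ...while any collision shrinks it, which the test above would catch.
assert not all_unique(["h1", "h2", "h1"])
```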

lucienfostier (Collaborator):

Oh, I didn't notice the set; now that makes sense.

self.assertNotEqual( tensor1, tensor2 ) # Different shape

tensor2 = GafferML.Tensor( IECore.IntVectorData( [ 3, 2, 1 ] ), [ 3 ] )
self.assertNotEqual( tensor1, tensor2 ) # Different shape
lucienfostier (Collaborator):

Should the comment be # Different data instead?

johnhaddon (Member Author):

Indeed it should - fixed in 9967936.

allocator, shape.data(), shape.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_BOOL
);
std::copy( typedData->readable().begin(), typedData->readable().end(), value.GetTensorMutableData<bool>() );
m_state = new State{ std::move( value ), nullptr };
lucienfostier (Collaborator):

Why are we not passing the typedData as the DataPtr arg here?

johnhaddon (Member Author):

Because the Ort::Value owns its own copy of the data in this case. We only keep the DataPtr when the Ort::Value is referencing the data that we own within that DataPtr.
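A loose Python analogy for that ownership distinction (illustrative only; Ort::Value itself is a C++ type):

```python
source = [1, 2, 3]

owned = list(source)   # like an Ort::Value holding its own copy: the
                       # source buffer can go away afterwards
alias = source         # like an Ort::Value referencing external data:
                       # the referenced buffer (the DataPtr member in
                       # Tensor's State) must be kept alive alongside it

source[0] = 99
assert owned[0] == 1   # the copy is unaffected
assert alias[0] == 99  # the reference observes the change
```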

lucienfostier (Collaborator):

Makes sense

Comment on lines +78 to +90
size_t i = shape.size() - 3;
for( size_t d = 0; d < i; ++d )
{
	if( shape[d] != 1 )
	{
		throw IECore::Exception(
			fmt::format(
				"Expected {} dimensional tensor to have size 1 in dimension {}",
				shape.size(), d
			)
		);
	}
}
lucienfostier (Collaborator):

This might need a revisit once we support models that process frame sequences.
For example, the one I'm working with right now has the shape [1, 55, 3, 296, 720], which is respectively:

  • batch size
  • frame number
  • channels
  • height
  • width

johnhaddon (Member Author):

So in that case, would a single tensor contain all the frames? And we'd need to add a TensorToImage::framePlug() to say which one to extract?

Happy to rejig this in future PRs, but I think it's worth getting this merged as-is and then working from that baseline. It's one of the reasons I've documented that we're not guaranteeing ABI stability at this point.

lucienfostier (Collaborator):

Yes I think we should merge as-is but in the future we might need to do what you suggest and have the same for the ImageToTensor to gather a frame range into a single tensor.
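A small NumPy sketch of what such a frame-selection plug might do (shapes shrunk for illustration; framePlug() is hypothetical, as discussed above):

```python
import numpy as np

# [batch, frame, channel, height, width], modelled on the
# [1, 55, 3, 296, 720] example above, with smaller spatial sizes.
video = np.zeros((1, 55, 3, 8, 16), dtype=np.float32)

# A hypothetical TensorToImage framePlug() would supply this index,
# reducing the tensor to the [batch, channel, height, width] layout
# that the node handles today.
frame = 10
image = video[:, frame]
assert image.shape == (1, 3, 8, 16)
```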

johnhaddon (Member Author):

Cool - those both sound like good future additions.

src/GafferML/TensorToImage.cpp (resolved thread)
const Box2i validTileBound = BufferAlgo::intersection( dataWindow, tileBound );
out.resize( ImagePlug::tileSize() * ImagePlug::tileSize() );

const float *sourceData = tensorData->value().GetTensorData<float>();
lucienfostier (Collaborator):

Have you already considered the case where an image tensor is stored as a different data type than float, for example as double or FLOAT16?

johnhaddon (Member Author):

Oh, no, I have not! I think it would crash right now, at least for the FLOAT16 case. Should I throw if it isn't float? Or do some templating to support the conversion? And if the latter, do I need to worry about integer formats too?

lucienfostier (Collaborator):

I haven't tested a model where the current code wouldn't work, but I was considering that for my TensorToMesh.
We can worry about it later.
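A sketch of the kind of element-type handling being discussed, in NumPy terms (the helper name is invented; the actual fix lands in C++ in a later commit):

```python
import numpy as np

def tensor_buffer_as_float32(buf):
    """Promote half/double tensor data to float32 before it is copied
    into image tiles, and reject element types the node doesn't handle,
    rather than crashing on a misinterpreted buffer."""
    if buf.dtype not in (np.float16, np.float32, np.float64):
        raise TypeError(f"Unsupported tensor element type: {buf.dtype}")
    return buf.astype(np.float32, copy=False)

half = np.array([0.5, 1.0], dtype=np.float16)
assert tensor_buffer_as_float32(half).dtype == np.float32
```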

Changes.md (two resolved threads)
@johnhaddon (Member Author) commented:
Thanks for the thorough review @lucienfostier! I think I've been through and replied to all your comments now. There are a couple where I might still need to push code changes, pending your feedback. I can get onto those tomorrow as necessary...

@lucienfostier (Collaborator) left a comment:
Thanks for addressing the comments, LGTM!



bool Tensor::isEqualTo( const IECore::Object *other ) const
{
	if( !Object::isEqualTo( other ) )
lucienfostier (Collaborator):

Makes sense


object tensorGetItemTyped( const Tensor &tensor, const std::vector<int64_t> &location )
{
	return object(
		const_cast<Ort::Value &>( tensor.value() ).At<T>( location )
lucienfostier (Collaborator):

ah I see!

script2 = Gaffer.ScriptNode()
script2.execute( script.serialise() )
self.assertIsInstance( script2["node"]["user"]["p"], GafferML.TensorPlug )
self.assertEqual( script2["node"]["user"]["p"].getValue(), GafferML.Tensor() )
lucienfostier (Collaborator):

Alright, I guess we don't need that now.

Comment on lines +251 to +262
for( int x = validTileBound.min.x; x < validTileBound.max.x; ++x )
{
	dstData[dstIndex] = sourceData[srcIndex++];
	dstIndex += dstStride;
}
lucienfostier (Collaborator):

I see your point, let's just merge it like that and we can always revisit if it becomes a problem.

vector<int64_t> shape;
if( interleaveChannels )
{
	shape = { 1, dataWindow.size().y, dataWindow.size().x, (int64_t)channels.size() };
lucienfostier (Collaborator):

Makes sense
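The two layouts that plug chooses between, sketched with NumPy (sizes invented for illustration):

```python
import numpy as np

h, w, c = 4, 5, 3

# interleaveChannels on: NHWC layout, channel values interleaved
# per pixel, matching the shape built in the snippet above.
nhwc = np.zeros((1, h, w, c), dtype=np.float32)

# interleaveChannels off: NCHW layout, one contiguous plane per
# channel, as many ONNX image models expect.
nchw = np.transpose(nhwc, (0, 3, 1, 2))
assert nchw.shape == (1, c, h, w)
```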

static_cast<TensorPlug *>( output )->setValue( tensor );
}

ComputeNode::compute( output, context );
lucienfostier (Collaborator):

nice



Commit descriptions:

  • This is a wrapper class that will allow us to pass ONNX values through Gaffer's computation graph.
  • This will be used for passing Tensor values between nodes.
  • This allows data from elsewhere in Gaffer to be converted for use in GafferML.
  • This forms the meat of GafferML, loading ONNX models and performing inference using data from an array of input TensorPlugs.
  • This is a node which converts images from GafferImage into tensors for use by the Inference node.
  • This allows tensors to be converted back to GafferImage images, after they have been processed by the Inference node.
  • And advertise them in Changes.md.
@johnhaddon (Member Author) commented:

> Thanks for addressing the comments, LGTM!

I've squashed all the fixups into the relevant commits, omitting the linear-search one. I've also added one final commit which checks the input tensor type in TensorToImage. Happy to merge?

@lucienfostier (Collaborator) commented:
Yep, let's merge!
