Sparse net leaf/predict #12

Open
yvdriess opened this issue Sep 27, 2017 · 3 comments

Comments

@yvdriess commented Sep 27, 2017

I have a multi-target workload with very sparse labels. In other words, a single observable/sample is not a (bool label, SparseF[] features) tuple, but rather a (SparseF[] labels, SparseF[] features) tuple.
During the learn/optimize steps, the predict pass currently appears to produce a dense vector of predictions (e.g. a 65k-element vector), but only a few elements (e.g. 200) of that prediction vector will have a label and contribute to the gradient.
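For concreteness, the sample shape I have in mind is roughly the following (the struct is purely illustrative, not an existing vectorflow type; SparseF is vectorflow's sparse index/value pair):

    import vectorflow; // assuming the package module exposes SparseF

    struct MultiTargetObs
    {
        SparseF[] labels;   // only ~200 of ~65k targets observed per sample
        SparseF[] features; // sparse input features
    }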

a) Is there a way in vectorflow to deal with this multi-target learning scenario? I suppose I should write a loss function that scatters the non-zero gradients into a zero vector of the prediction-vector size?
b) While I can see myself doing a), it is rather wasteful, since it means producing a large vector of predictions of which only a handful are used to compute the gradient.
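To make a) concrete, the dense-gradient loss callback I have in mind would look roughly like this (the callback signature is a guess at the learning interface, and I'm assuming SparseF exposes an index id and a value val):

    // Scatter the sparse label errors into a full-size dense gradient,
    // leaving every unlabeled coordinate at zero.
    void scatterSquaredLossGrad(float[] netOut, ref MultiTargetObs o, ref float[] grads)
    {
        grads[] = 0.0f; // most of the 65k coordinates carry no gradient
        foreach (ref l; o.labels)
            grads[l.id] = netOut[l.id] - l.val; // gradient of 0.5 * (pred - label)^2
    }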

I implemented b) in TensorFlow by gathering columns of the last-layer weight matrix, where the gathered columns match the label's non-zero coordinates (viz. the target indices). For vectorflow, it probably makes more sense to have a SparseData() leaf/output layer. It appears that a dense float[] output is currently hardcoded in the learning interface. The loss function interface could probably be kept the same, since it should not care about the actual target index, but there needs to be some logic that produces a float[] from the SparseF[] output before passing it to the loss function, and then transforms the dense gradient vector back into a sparse gradient for back-prop. A SparseLinear layer, perhaps? (dense prev-vector + SparseData input, SparseData output)

@rostyboost (Contributor)
Thanks for your question @yvdriess. If I understand correctly, you still want a dense forward prop for your loss, but the gradient that needs to be backpropagated is sparse, because it only involves a subset of the output neurons (presumably the coordinates matching what's in your SparseF[] labels?). This is definitely possible in vectorflow. There is an overload of the callback mechanism, resolved at compile time, in which you populate a sparse gradient for your loss. It's not well documented, but you can find a dummy example here. You can simply append the non-zero SparseF coordinates of the gradient into this buffer. This would be the most efficient approach for your problem, as it avoids any unnecessary "dense to sparse" conversion.
Typically vectorflow works by reference and tries to avoid any memcpy, so there is no "gather": it is assumed that the memcpy cost outweighs any potential SIMD gain from "densifying" a sparse operation, which tends to be the case for most IO-bound / shallow problems.
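Roughly, the sparse version for your case would look something like this (sketch only; I'm writing the SparseF fields as id/val and the gradient buffer as a plain dynamic array, so double-check against the dummy example):

    // Populate a sparse gradient: append one SparseF per labeled output
    // coordinate instead of filling a dense 65k-wide gradient vector.
    void sparseSquaredLossGrad(float[] netOut, ref MultiTargetObs o, ref SparseF[] grads)
    {
        grads.length = 0; // reuse the buffer across observations
        foreach (ref l; o.labels)
            grads ~= SparseF(l.id, netOut[l.id] - l.val); // only labeled targets
    }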

@yvdriess (Author) commented Oct 9, 2017

Thanks for the response. I bit the bullet, dove in, and indeed found the SparseF[] grad overload for the loss functions. That takes care of a), but I am still not sure how to do b) without implementing a SparseLinearLayer or similar myself.

The basic idea is that:

  • The network output (the loss function's input) can currently only be dense: float[], the output of the dense Linear layer.
  • A loss function that produces sparse gradients based on the SparseF[] label coordinates only uses the matching coordinates of the network output float[].
  • Similarly, for the last linear layer's weights, only the columns matching the label coordinates are involved in the loss/gradient calculation.

Optimizing for the above case by making a sparse linear layer that takes an extra coordinates argument and produces SparseF[] output would decrease the amount of memory touched. (The compute probably is not a factor here.)
In our workload, the labels vector is very sparse (~1/1500), so this optimization should have a big impact.
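A minimal sketch of the forward pass I have in mind, assuming the last-layer weights W are stored with one row per output neuron (everything here is hypothetical, not an existing vectorflow layer):

    // Compute only the output coordinates that carry a label, instead of
    // the full 65k-wide dense prediction.
    SparseF[] sparsePredict(const float[][] W, const float[] hidden, const SparseF[] labels)
    {
        SparseF[] outp;
        outp.reserve(labels.length);
        foreach (ref l; labels)
        {
            float dot = 0.0f;
            foreach (j, h; hidden)
                dot += W[l.id][j] * h; // gather one weight row per labeled target
            outp ~= SparseF(l.id, dot);
        }
        return outp;
    }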

I guess the real question is: Is float[] network output baked in everywhere? Or should it be relatively straightforward for me to make a SparseLinearLayer that produces a SparseF[] network output?

@yvdriess (Author)
If I may resurrect this issue: I just noticed while adding a SparseLinear layer to Vectorflow that only the first layer can be sparse.

if(parents[i].type == LayerT.SPARSE) // TODO: fixme
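For context, the stack I am trying to build is roughly the following (SparseLinear is the hypothetical output layer discussed above; the other calls follow the README example, and the dimensions are placeholders):

    auto nn = NeuralNet()
        .stack(SparseData(1_000_000))  // sparse input features leaf
        .stack(Linear(200))            // dense hidden layer
        .stack(SparseLinear(65_000));  // hypothetical sparse output layer (~65k targets)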

Is this work in progress, or should I provide an implementation?
