TODO List for 0.1 release #17
@pavanky : It doesn't necessarily have to be a classifier though.
I think the following would be helpful from an API standpoint:

struct Model {
    int add(Layer layer_type);
    int compile(Optimizer opt, Loss loss, int max_iter = 200, bool early_stop = true);
    // Batch training, up to max_iter or until early stopping kicks in if enabled.
    // Accepts either explicit cross-validation data or a ratio for splitting the given data.
    float fit(DataSet train_data, DataSet target_data,
              std::tuple<DataSet, DataSet> validation = {},
              float validation_split = 0.0f);
    float train(DataSet train_data, DataSet target_data); // single step for online methods (can be called from fit)
    DataSet predict(DataSet test_data); // for evaluating new data
};

All the ints are return codes in the above. This will give maximum flexibility in:
The Layer class, as you mentioned, should do the following:

struct Layer {
    int connect(Layer prev); // to connect to the previous layer in a deep net
    DataSet derivative(DataSet x);
    DataSet forwardPass(DataSet x);
    DataSet input();   // merely returns the input that was last used (or the output of the previous layer)
    Weights weights(); // returns just the weights
    Bias bias();       // returns the bias (or stack of biases) if any (otherwise nullptr maybe?)
    std::map<std::string, std::string> conf(); // getter to return the config of the layer itself
};
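For concreteness, here is a minimal, self-contained sketch of how the proposed return-code style and call sequence might look. All the types below are toy stand-ins I invented for the example; they are not the proposed AFML signatures.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in: a real DataSet would wrap af::array data.
using DataSet = std::vector<float>;

struct Layer {
    // identity layer, just for illustration
    DataSet forwardPass(const DataSet& x) { return x; }
};

struct Model {
    std::vector<Layer> layers;
    bool compiled = false;

    // ints are return codes, as in the proposal (0 == success)
    int add(Layer layer) { layers.push_back(layer); return 0; }
    int compile() { compiled = true; return 0; }

    // evaluate new data by chaining every layer's forward pass
    DataSet predict(const DataSet& x) {
        DataSet out = x;
        for (auto& l : layers) out = l.forwardPass(out);
        return out;
    }
};
```

The return-code pattern keeps error handling explicit without exceptions, which matches the style of many C APIs that ArrayFire users already know.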
According to much recent research, the most powerful neural networks are no longer stacked layers but rather arbitrarily complex graphs, e.g. advanced recursive neural networks, Facebook AI Research's Memory Networks, Google DeepMind's Neural Turing Machine, etc. So "node" is a more general name than "layer".
The following API is inspired by Caffe's Blob, Layer and Net.

typedef shared_ptr<array> ArrayPtr;

class Data {
public:
    explicit Data(vector<int>& size);
    int nDimension() const;
    vector<int> size();
    // Caffe exposes the raw CPU/GPU pointers for use in BLAS functions.
    // array has a high-level API, so there's no need to do so.
    ArrayPtr weights() const;
    ArrayPtr gradients() const;
};

typedef shared_ptr<Data> DataPtr;
typedef vector<DataPtr> DataPtrVec;

class Node {
public:
    explicit Node(NodeParam& nodeParam);
    virtual ~Node();
    // Calls initNode, which subclasses can override
    void init();
    // Input and output are more general than Caffe's top and bottom
    virtual void forward(const DataPtrVec& input,
                         const DataPtrVec& output);
    // propagate_back is more general than Caffe's propagate_down
    virtual void backward(const DataPtrVec& input,
                          const vector<bool>& propagate_back,
                          const DataPtrVec& output);
    // The model is a DAG (Directed Acyclic Graph);
    // it's more intuitive for the predecessor to add the successor
    void addSuccessor(Node& node);
    void addSuccessors(vector<Node>& nodes);
protected:
    virtual void initNode();
};

// Dtype is float or double
template <typename Dtype>
class Graph {
public:
    explicit Graph(GraphParam& graphParam);
    virtual ~Graph();
    virtual void forward(const DataPtrVec& inputs, DataPtrVec* outputs,
                         Dtype* loss = NULL);
    /**
     * (Caffe) The network backward should take no input and output, since it solely
     * computes the gradient w.r.t. the parameters, and the data has already been
     * provided during the forward pass.
     */
    virtual void backward();
    Dtype forwardBackward(const DataPtrVec& inputs) {
        Dtype loss;
        DataPtrVec outputs;
        forward(inputs, &outputs, &loss);
        backward();
        return loss;
    }
};
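The forward/backward contract above (backward takes no arguments because it reuses data cached during forward, and forwardBackward returns the loss) can be sketched with a deliberately tiny one-parameter "network". This is an invented example, not the proposed classes:

```cpp
#include <cassert>

// Toy network: output = w * input, loss = 0.5 * (output - target)^2.
// Illustrates the Caffe-style contract where backward needs no inputs
// because everything it needs was cached during the forward pass.
struct SimpleGraph {
    float w = 2.0f;
    float target = 6.0f;
    float cached_input = 0.0f; // cached during forward, consumed by backward
    float grad_w = 0.0f;

    float forward(float input, float* loss) {
        cached_input = input;
        float out = w * input;
        if (loss) {
            float diff = out - target;
            *loss = 0.5f * diff * diff;
        }
        return out;
    }

    // backward takes no arguments: the data it needs was cached in forward
    void backward() {
        float diff = w * cached_input - target;
        grad_w = diff * cached_input; // d(loss)/dw
    }

    float forwardBackward(float input) {
        float loss = 0.0f;
        forward(input, &loss);
        backward();
        return loss;
    }
};
```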
Microsoft Research's "Computational Networks: A Generalization of Deep Learning Models" presented its open source deep learning framework CNTK as
I like the generalizations. A few notes: void addSuccessor(Node& node); seems redundant with void addSuccessors(vector<Node>& nodes); since a node can have subnodes as well, right? One very important item that your schema is missing is some form of accuracy.
If I am reading correctly, the following is for cases when a single node is connected to multiple successors (like the CN with shared params diagram).
I am not sure a linked list is the solution here.
@futurely Thanks for the great feedback! The proposed API looks solid.
@futurely Sorry for jumping the gun. I re-read the entire discussion again. It looks like the proposed
@futurely @jramapuram Would it be possible to continue the discussion over here: https://gitter.im/arrayfire/arrayfire_ml ?
Suggestions from @alcinos:
Here is a proposition including the modifications:
The Graph or Network class is not needed at all. Here's a simple illustration:
@futurely #22.
@alcinos Can you explain how having an adjacency list helps in the situations you mentioned? I still think it is the better option to have a centralized location for the representation; however, I do not see it solving the problems of greedy layer-by-layer training.
@pavanky Well, let's say we have a 3-layer stacked autoencoder. Eventually, depending on the application, it is likely that the interesting part of the trained net is the output of E3 (high-level features of the input). Once trained, we'll thus only use the first part of the net: I -> E1 -> E2 -> E3 The point is that in all those training steps, the architecture of the net is different, hence it makes more sense to store this architecture independently of the nodes.
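The phase-dependent architecture idea can be sketched by keeping one pool of nodes and describing each training phase as a separate edge list, so greedy layer-wise pretraining swaps edge lists instead of rewiring nodes. Everything here (runChain, the node functions) is invented for illustration:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Each node is reduced to a float->float function for this toy example.
using NodeFn = std::function<float(float)>;
// An architecture is just a list of (from, to) node-id edges.
using EdgeList = std::vector<std::pair<int, int>>;

// Walk a linear chain described by an edge list over the shared node pool,
// applying each successor node's function in turn.
float runChain(const std::vector<NodeFn>& nodes, const EdgeList& edges,
               int start, float x) {
    int cur = start;
    float v = x;
    bool advanced = true;
    while (advanced) {
        advanced = false;
        for (auto& e : edges) {
            if (e.first == cur) {
                v = nodes[e.second](v);
                cur = e.second;
                advanced = true;
                break;
            }
        }
    }
    return v;
}
```

The same nodes appear in every phase; only the edge list changes, which is exactly why storing the architecture outside the nodes helps.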
I understand what autoencoders are doing. My question was more about implementation. What you are suggesting requires updating the adjacency list after each step or creating a new network after each step. Am I correct in this observation?
Absolutely!
@futurely The code you are showing doesn't feature any encapsulation. This is a problem:
Caffe creator's plan on "Improving Caffe: Some Refactoring". |
Minerva's DAG and operators based implementation. |
@futurely @alcinos @jramapuram I pushed some of the basic classes out. I do not have any convolution nodes yet. There are no network classes yet. I am going to push some sample network classes out later today to demonstrate how the structure can be stored. The network class will also extend I understand that this is not much right now, but feedback is welcome.
Looks great so far, @pavanky! A few queries / comments:
I will take a deeper look a little later, but good job!
@pavanky @jramapuram I believe forward and backward computations have to be refactored out of the base interface too, because I can see no way to support forward-phase gradient computations with the current Node class. I mean RTLR, for example.
@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each subclass. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)
But for calculating RTLR you'll need a totally different kind of network traversal logic other than forward or backward. I meant that if traversal logic is refactored out of the Node interfaces into separate classes (ForwardPhase, BackwardPhase, RTLRPhase, etc.), it's easy to define separate ones for every possible algorithm AFML wants to support in the future and after the future.
I think we might be talking about two separate things. Are you talking about Real-Time-Recurrent-Learning: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.9724&rep=rep1&type=pdf ?
@jramapuram Exactly.
Exactly in that it is the same? Or different? I didn't find any mention online of RTLR. RTRL involves just computing the gradients of time n+1 using gradients and activations of time n, in addition to accumulating the gradients from 0 --> T_current. It is fully online. My solution of having virtual forward/backward functions should easily solve this problem.
@jramapuram :) Yeah, I have trouble with acronyms; I often write LSTM as LTSM also. Sorry. Ok, but for calculating gradients you'll need to feed desired outputs and do a special kind of forward iteration of the whole network for each weight: http://www.willamette.edu/~gorr/classes/cs449/rtrl.html (18) Or maybe @pavanky doesn't want to support RTRL at all, in favor of LSTM, which is a separate structure.
@unbornchikken : no worries, just wanted to be on the same page :) . So RTRL is an optimization algorithm. It can be applied to LSTMs/GRUs/...(insert your RNN flavor here). If you look at that link (specifically steps #7 & #8) it isn't really anything fancy. You need to keep an accumulator for the error up to the current timestep as well as the gradient (as opposed to BPTT, where things are unfolded all the way from T_current --> T0). You can ignore a lot of the delta function stuff that is mentioned there. It is, as in the paper, a way to unify the input/output pairs
@jramapuram What I tried to say is that to support RTRL with your rigid Node abstract class, you've got to define some other methods in it accepting and holding this forward propagated
If you look at Linear.hpp it implements its own forward / backward.
If you are implementing a new algorithm then you will have to do this. This is C++, not Python. Now, that being said, it makes sense to have an Optimizer class. However, this is separate from the Nodes, as optimizers just take a cost and a gradient and do some form of ascent/descent.
@jramapuram You're talking about object-oriented design and I'm talking about composite design there. In your case RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported exception, a forward method that accepts a whole new vector for storing derivatives along with the input, and another forward method propagating p values - declared in the RTRLNode class. In my case Node doesn't have forward and backward methods; it only provides connections, weights and other weight-related values. Forward, backward, etc. cases are implemented separately, and an
RTRLNode will have two private internal members that merely accumulate the error & derivatives locally. All RTRL is doing is estimating derivatives for t+1 using derivatives for t (this is done on a node by node basis). Each unit is a linear combination of the previous layer's activations coupled with a gemv call. The only thing that needs to be passed along is the current update (which is already being done).
In either case, how does your solution help you prototype faster?
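To make the accumulator idea concrete, here is a toy RTRL sketch for a single linear recurrent unit y_t = w * y_{t-1} + u * x_t under a squared-error loss. The class and its members are invented for illustration; it only shows the two "private internal members" pattern (a running sensitivity and an accumulated gradient), not the AFML Node API:

```cpp
#include <cassert>
#include <cmath>

// One recurrent unit trained online with RTRL: the sensitivity dy/dw is
// carried from step t to t+1 instead of unrolling through time.
class RTRLUnit {
    float w, u;
    float y_prev = 0.0f;
    float p = 0.0f;        // sensitivity dy_t/dw, propagated forward in time
    float grad_acc = 0.0f; // accumulated dLoss/dw over the sequence
public:
    RTRLUnit(float w_, float u_) : w(w_), u(u_) {}

    // one fully online step: update sensitivity, state, and gradient
    float step(float x, float target) {
        // p_t = y_{t-1} + w * p_{t-1}  (product rule on y_t = w*y_{t-1} + u*x_t)
        p = y_prev + w * p;
        float y = w * y_prev + u * x;
        float err = y - target;  // dL/dy for L = 0.5*(y - target)^2
        grad_acc += err * p;     // this step's dL/dw, accumulated online
        y_prev = y;
        return y;
    }

    float gradient() const { return grad_acc; }
};
```

Note that nothing here needs a backward pass over the network; the gradient is built up during the forward iteration, which is the crux of the interface discussion above.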
Maybe this is not the right issue to discuss the internals of RTRL, but we can agree that you will end up having an RTRLNode class that has to throw a not-supported exception in its backward method, right? That's because your fundamental design hardcoded the ability to serve a single algorithm suite. Btw, I haven't talked about anything that helps pavan prototype faster. I've talked about something that I believe makes the fundamental design more extensible. It was just only that: "I believe forward and backward computations have to be refactored from the base interface too" :)
The LR was greater than 1 because I knew it was going to converge for that simple test.
Each "Node" here is the equivalent of a "Layer" from caffe and "Module" from torch-nn.
If you can point to some examples, I will look into this.
That is how things are at the moment.
Each "Node" can be a simple layer or a composite node like Multi-Layer Perceptrons, Recurrent Neural Networks, Autoencoders, Restricted Boltzmann Machines, etc. The composite nodes can be used as is or by plugging them into a larger network. The "forward" and "backward" methods help the composite nodes interface with other nodes in the larger network. For training these networks, the methods used will obviously be different. That said, none of the API is final and will change until we can address all the problems. This project is still in the prototyping phase after all. P.S. If you want to have prolonged discussions, can you move to the gitter room instead :-)
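The composite-node idea, where a node whose forward simply chains through its children can be dropped into a larger network as a single unit, might look like this minimal sketch (all names invented, not the AFML classes):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Minimal node interface: a composite exposes the same forward as a leaf.
struct SimpleNode {
    virtual ~SimpleNode() = default;
    virtual float forward(float x) = 0;
};

// A leaf node that just scales its input.
struct ScaleNode : SimpleNode {
    float s;
    explicit ScaleNode(float s_) : s(s_) {}
    float forward(float x) override { return s * x; }
};

// A composite node: from the outside it behaves like one node, internally
// it delegates forward to its children in sequence (an MLP, autoencoder,
// etc. could be packaged this way and plugged into a larger network).
struct CompositeNode : SimpleNode {
    std::vector<std::unique_ptr<SimpleNode>> children;
    void add(std::unique_ptr<SimpleNode> n) { children.push_back(std::move(n)); }
    float forward(float x) override {
        for (auto& c : children) x = c->forward(x);
        return x;
    }
};
```

Because CompositeNode implements the same interface as a leaf, composites nest arbitrarily, which is the property that lets a whole sub-network act as one node.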
@pavanky : This is what I was referring to in perceptron.cpp: const double lr = 10;
Yea, noticed that in the later comments. I need to learn to read :)
Do you mean examples of other normalization strategies? An example would be whitening using SVD. @unbornchikken : we can discuss further if you like, but I think the only other area where removing the forward/backward paradigm would help is when straying away from neural networks. In that case you can simply ignore 'backward'. Otherwise, extending them should prove sufficient. RTRL still has a forward and a backward btw; you don't need to throw an undefined exception.
@jramapuram The LR was only high because the test (binary AND) is a linear operation and weights will always get updated in the same direction once the algorithm starts. I just wanted to speed things up a little while testing. I understand why learning rates usually are small, but for this case it does not really matter.
I do not see how such strategies can be applied at each node. From what I understand this can be calculated near the output and the norm can be propagated back for scaling up / down. Perhaps I should rename the method to simply say "scale"
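If I am reading the suggestion right, the "compute the norm near the output and propagate the scale back" strategy amounts to global gradient-norm scaling: measure one norm across all nodes, then rescale every node's gradients by the same factor. A rough, invented sketch (not AFML code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gradients for all nodes, one vector per node.
using GradientSet = std::vector<std::vector<float>>;

// Compute the single global L2 norm over every node's gradients.
float globalNorm(const GradientSet& grads) {
    float sq = 0.0f;
    for (auto& g : grads)
        for (float v : g) sq += v * v;
    return std::sqrt(sq);
}

// Scale all gradients by one shared factor so the global norm does not
// exceed max_norm; the scale is "propagated back" to every node.
void scaleToNorm(GradientSet& grads, float max_norm) {
    float n = globalNorm(grads);
    if (n <= max_norm || n == 0.0f) return;
    float factor = max_norm / n;
    for (auto& g : grads)
        for (float& v : g) v *= factor;
}
```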
Scale sounds more appropriate imo. But do we want to do internal node scaling?
I agree. I will just merge the normalize and update steps for now.
@pavanky "The "forward" and "backward" methods help the composite nodes interface with other nodes in the larger network." Oh, I get it! This was not obvious; I've seen many ML libraries out there that had this fw/bw-only traversal hard-coded in their fundamentals. Now I'm starting to get excited to see where your design leads. I cannot wait to get my hands on something that supports this meta-network building approach. I started to imagine something like Peltarion Synapse but at 100x the performance.
Also see: arrayfire/arrayfire#1441
Updated the issue to reflect the new architecture. |
Base Classes
Autograd
Neural Network
Solvers / Optimizers
Examples