Dataset design #4
Do we agree this cannot be separated from the In the above I guess that Concretely, Then If so then we can consider Lastly we have So I think we need to include the above in scope - thoughts please?
@SemanticBeeng If I'm understanding you correctly, I think the
Having a unified API for both streaming and batch use cases means forbidding several useful operations on batch datasets (computing
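The comment above is cut off mid-example, but a plausible illustration of the point (my assumption, not necessarily the operations the author had in mind) is that some operations are only well-defined on bounded data. A minimal sketch, using a plain `List` as the stand-in for a batch:

```scala
// Sketch of operations that make sense on bounded data but not on an
// unbounded stream; names and representation are illustrative only.
object BoundedOnly {
  // A total count terminates only because the data does.
  def count[A](data: List[A]): Long =
    data.foldLeft(0L)((n, _) => n + 1)

  // Reversal needs the *last* element first, so it has no meaning on an
  // unbounded stream; a unified API must forbid it or change its semantics.
  def reversed[A](data: List[A]): List[A] =
    data.reverse
}
```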
Yeah, so ultimately the low-level description language probably makes sense to be streaming-based, since streams are a superset of batch operations: https://data-artisans.com/blog/batch-is-a-special-case-of-streaming Ultimately, the user-facing APIs will just be a way to construct this lower-level description of a program. I think that exposing two APIs (
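To make the "batch is a special case of streaming" point concrete, here is a minimal sketch (my own illustration, not the project's actual types) in which the batch interpreter is just the streaming interpreter applied to a source that eventually returns `None`:

```scala
object StreamCore {
  // The streaming run: consume elements one at a time, threading state,
  // until the source signals exhaustion (which may never happen).
  def runStreaming[A, S](poll: () => Option[A])(init: S)(step: (S, A) => S): S = {
    var state = init
    var cur   = poll()
    while (cur.isDefined) {
      state = step(state, cur.get)
      cur = poll()
    }
    state
  }

  // The batch run adds nothing new: a batch is just a stream that ends.
  def runBatch[A, S](data: List[A])(init: S)(step: (S, A) => S): S = {
    val it = data.iterator
    runStreaming(() => if (it.hasNext) Some(it.next()) else None)(init)(step)
  }
}
```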
Thoughts?
@camjo absolutely, that would buy us the ability to implement operations over
I have a very rough API design proposal here: https://gist.github.com/camjo/10cb0f25b9da10f08f9b30cbd9419985 I've done a lot of thinking about the nature of batch and stream, what it means to do computation on each, and how we should expose it as an API. I think it fundamentally boils down to data vs. computation. Data can roughly be conceptualised as either Bounded or Unbounded. Computation can be thought of as (online/streaming/unbounded) vs. (offline/batch/bounded). The Gist explains what I mean by this in more detail. Hopefully you can follow the types to understand how the various concepts fit together. There are still a fair few gaps listed as @jdegoes Would be keen to get your thoughts on this so far. Would it be easier as a PR that we can work on improving before merging, or is a gist fine for now? Perhaps if this seems like a suitable direction, it can form the basis of another design session. I'm aware that it's not implemented in terms of
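My reading of that data vs. computation split, as a type sketch (all names here are illustrative, not the gist's actual API): an online computation is an incremental fold that can run over either kind of data, while an offline computation needs the whole input at once and is therefore only defined on bounded data.

```scala
sealed trait Data[A]
final case class Bounded[A](values: Vector[A]) extends Data[A]
final case class Unbounded[A](pull: () => A)   extends Data[A] // never ends

// Online: incremental, advanced one element at a time.
final case class Online[A, S](init: S, step: (S, A) => S)
// Offline: needs all the input, so only bounded data qualifies.
final case class Offline[A, B](run: Vector[A] => B)

object Run {
  def online[A, S](d: Bounded[A], c: Online[A, S]): S =
    d.values.foldLeft(c.init)(c.step)

  // On unbounded data we can only ever observe a prefix of the running state.
  def onlinePrefix[A, S](d: Unbounded[A], n: Int, c: Online[A, S]): S =
    (0 until n).foldLeft(c.init)((s, _) => c.step(s, d.pull()))

  // Note there is no overload taking Unbounded here: offline computations
  // simply don't exist for unbounded data.
  def offline[A, B](d: Bounded[A], c: Offline[A, B]): B =
    c.run(d.values)
}
```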
@camjo Another session works, too. Some quick thoughts:
The issue seems to be that you have 3 cases caused by the data/computation split.
And consequently
Note specifically that a bounded computation on a batch doesn't seem to have the same signature as a bounded computation on a stream. The alternative here would be to say "OK, let's treat a batch as a finite stream and collapse them". Sure, but you're not removing any of those cases; you're simply renaming "batch" to "finite stream". The 3 cases of semantics still need to exist. Perhaps there is a more precise encoding that I'm missing though? In terms of prior work:
Thoughts?
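A sketch of how the three cases might differ in signature (my illustration of the point above; the `Batch`/`Stream` representations are placeholders, not proposed types):

```scala
object ThreeCases {
  type Batch[A]  = Vector[A]
  type Stream[A] = LazyList[A] // possibly infinite

  // 1. Bounded computation on a batch: consumes everything, returns a value.
  def foldBatch[A, S](b: Batch[A])(init: S)(step: (S, A) => S): S =
    b.foldLeft(init)(step)

  // 2. Bounded computation on a stream: only a prefix can be consumed, so
  //    the signature also has to say what "the rest" is -- a different shape.
  def foldPrefix[A, S](s: Stream[A], n: Int)(init: S)(step: (S, A) => S): (S, Stream[A]) =
    (s.take(n).foldLeft(init)(step), s.drop(n))

  // 3. Unbounded computation on a stream: the result is itself a stream of
  //    running values, never a single final one.
  def scanStream[A, S](s: Stream[A])(init: S)(step: (S, A) => S): Stream[S] =
    s.scanLeft(init)(step)
}
```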
r.e.
@idc101 It seems to me that the usage of a group by in frameworks that expose it with an intermediate structure like
The thing is, other libraries/frameworks can't express it like this:
The reason is that they allow arbitrary functions in the group-by clause, so they can't optimise common aggregates. Since we capture functions and reify everything, we can actually provide that API and still optimise it to run more like the first code snippet above, by comparing equality of the functions. This would remove that state entirely. Can anyone see why this might not work? I wonder if we could even apply the same logic to
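The idea in miniature (a sketch; `Aggregate` and `groupAgg` are hypothetical names, not the proposed API): because the aggregate is described as data rather than as an arbitrary closure over a materialised group, each group can be folded incrementally and no intermediate grouped structure is ever built, only one running state per key.

```scala
object GroupByReified {
  // A reified aggregate: an initial state plus a step function.
  final case class Aggregate[A, S](init: S, step: (S, A) => S)

  // Fold each group as elements arrive; no Grouped intermediate is kept.
  def groupAgg[A, K, S](data: List[A])(key: A => K)(agg: Aggregate[A, S]): Map[K, S] =
    data.foldLeft(Map.empty[K, S]) { (acc, a) =>
      val k = key(a)
      acc.updated(k, agg.step(acc.getOrElse(k, agg.init), a))
    }
}
```

Two structurally identical `Aggregate` values can also be compared for the equality-based optimisation described above, which an opaque `A => B` closure does not allow.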
I like the idea of the data vs. computation split, since the nature of computation on a stream is often very different from that on a batch, resulting in different optimisation strategies that might be applied and different expectations of a runtime.
You can: if you group by all columns, the result will be equivalent to a distinct (surprisingly, producing a more efficient query plan in some cases).
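The observation above in miniature (an illustrative sketch): grouping by the entire row and keeping one representative per group is exactly a distinct, which is why `SELECT a, b FROM t GROUP BY a, b` and `SELECT DISTINCT a, b FROM t` agree.

```scala
object DistinctViaGroupBy {
  // Group by the whole row; the key set of the grouping is the distinct rows.
  def distinctRows[A](rows: List[A]): Set[A] =
    rows.groupBy(identity).keySet
}
```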
I think possibly windowed and grouped are aspects of the same thing. The only reason I put emphasis on this is that many SQL streaming libraries provide the same set of operators for both batch and streaming computations — albeit with different interpretations. I do think it's a good point that "group by" only really exists for purposes of aggregation, so it may not even need an intermediate representation.
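One way to see "windowed and grouped are aspects of the same thing" (a sketch; the names are mine): a tumbling window is just a group-by whose key is derived from the timestamp.

```scala
object Windowing {
  final case class Event(timestampMs: Long, value: Int)

  // A tumbling window of width `windowMs` is a group-by on the window index.
  def tumbling(events: List[Event], windowMs: Long): Map[Long, List[Event]] =
    events.groupBy(e => e.timestampMs / windowMs)
}
```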
Hi, potential support dev here. Not sure where this discussion is at this point, but here is yet another stab at framing the API discussion, which is that we could divide DataSet into InfiniteDataSet and FiniteDataSet types. I don't know all of the problems we really need to solve here, but here are some ideas on how this would flesh out. InfiniteDataSet would be a high-level trait describing both the batch and streaming case. StreamDataSet would really just be an instance of BatchDataSet from a high-level perspective, treated as the case where input batch size is exactly 1. These would both expose a

FiniteDataSet would be the case where a complete DataSet has been loaded into Scalaz-Analytics, such as from disk/Hadoop. This would expose a

P.S. I agree with the suggestion that Grouped may not need to be exposed to the user - I can't think of the last time I needed a GroupedRDD in Spark or didn't immediately follow with
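A sketch of that hierarchy with a minimal list-backed instance (the method names are my guesses, since the comment's original identifiers were lost in extraction):

```scala
trait InfiniteDataSet[A] {
  def map[B](f: A => B): InfiniteDataSet[B]
  // A running fold is meaningful even if the input never ends.
  def runningFold[S](init: S)(step: (S, A) => S): InfiniteDataSet[S]
}

trait FiniteDataSet[A] extends InfiniteDataSet[A] {
  // A terminating fold is only well-defined once the data is known bounded.
  def fold[S](init: S)(step: (S, A) => S): S
}

// Minimal in-memory instance, just to show the shape.
final case class ListDataSet[A](data: List[A]) extends FiniteDataSet[A] {
  def map[B](f: A => B): InfiniteDataSet[B] = ListDataSet(data.map(f))
  def runningFold[S](init: S)(step: (S, A) => S): InfiniteDataSet[S] =
    ListDataSet(data.scanLeft(init)(step))
  def fold[S](init: S)(step: (S, A) => S): S = data.foldLeft(init)(step)
}
```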
Thank you to everyone who has left feedback so far. I've created a PR with my new proposal over here: #14. I'm pretty sure it addresses all the comments left so far as well as my own gripes with the versions I had proposed earlier. Please take a look and make sure to comment if you have ideas/issues! :)
+1 for unifying and introducing
@diminou this article discusses streaming vs. batch, with "incremental" in between; you may find it interesting.
Creating this issue to discuss the detailed design of `Dataset[A]` (and potentially `DataStream[A]`). As discussed in the meeting with John, I think it's worth thinking through what this API would look like as both batch and stream. We can go several ways with this:
The first would be a unified API like Spark's Dataset (and structured streaming).
https://spark.apache.org/docs/2.3.1/structured-streaming-programming-guide.html#programming-model
Another would be Flink's Dataset/DataStream API (which are built on top of their stateful streams abstraction).
https://ci.apache.org/projects/flink/flink-docs-release-1.5/concepts/programming-model.html
I'd like to flesh out what this API should look like and how it should function in more detail here.