PTH_10: DPL to Apache Spark Translator

Translates Data Processing Language (DPL) commands to Apache Spark actions and transformations. Uses ANTLR visitors to generate a list of step objects, which contain the actual implementations of the commands using the Apache Spark API.

Features

  • Translates a string-based DPL command into Apache Spark actions and transformations using the parse tree generated by the PTH_03 ANTLR-based parser.

  • Fetches data from a datasource provider (by default, the PTH_06 datasource provider) and filters it with the filters specified in the DPL command.

  • Applies various transformations and actions to the data with simple, easy-to-understand commands.

  • Supports parallel and sequential modes depending on which kinds of commands are used. If a command requires batch-based processing, sequential mode is used; otherwise, processing remains in parallel mode, allowing stream processing.

  • Spark API implementations are enclosed in so-called Step objects, which take a Dataset as input and return the transformed Dataset, making these objects easy to reuse.

  • The ANTLR-based visitor functions only gather the parameters needed by these objects; they contain no implementation logic for the commands themselves.
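As a rough illustration of this design, the sketch below models a Step as a dataset-to-dataset function and folds an ordered list of steps over the input. All names here (StepSketch, Step, run) are hypothetical, and a plain List<String> stands in for a Spark Dataset so the example runs without Spark on the classpath; the real Step objects wrap Spark API calls instead.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch of the Step pattern: each step takes a dataset-like
// value and returns the transformed value, so steps compose freely.
public class StepSketch {
    // A Step is just a dataset-to-dataset transformation.
    interface Step extends UnaryOperator<List<String>> {}

    // Applies the steps in order, like the translator folding its
    // visitor-generated step list over the incoming dataset.
    static List<String> run(List<Step> steps, List<String> dataset) {
        List<String> result = dataset;
        for (Step step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two illustrative steps, analogous to a filter and a projection.
        Step keepErrors = ds -> ds.stream()
                .filter(s -> s.contains("error"))
                .collect(Collectors.toList());
        Step upperCase = ds -> ds.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        List<String> data = List.of("error: disk full", "ok", "error: timeout");
        System.out.println(run(List.of(keepErrors, upperCase), data));
        // prints [ERROR: DISK FULL, ERROR: TIMEOUT]
    }
}
```

Because the visitor only collects parameters and emits such steps, executing a translated DPL command reduces to folding the step list over the source dataset.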

Documentation

See the official documentation on docs.teragrep.com.

Limitations

Not all commands in the Data Processing Language are yet implemented.

How to

Use:

  • Create a new DPLParserCatalystContext. It requires a SparkSession object and a com.typesafe.config.Config. The config is usually provided by the Zeppelin component.

    DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);

  • Create a new DPLParserCatalystVisitor, passing it the DPLParserCatalystContext.

    DPLParserCatalystVisitor catVisitor = new DPLParserCatalystVisitor(catCtx);

  • Visit the parse tree generated by PTH_03 using the DPLParserCatalystVisitor.visit() function.

    CatalystNode n = (CatalystNode) catVisitor.visit(tree);

  • The result of that call is a CatalystNode. It contains a DataStreamWriter, which can be started to begin execution.

    n.getDataStreamWriter();

  • Set the visitor’s Consumer to a function of your liking to view the resulting Dataset or move it to the desired component.

    catVisitor.setConsumer((ds, id) -> {
        ds.show();
    });
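The consumer hook can be illustrated without Spark. This hedged sketch assumes a hypothetical pipeline class (ConsumerSketch, emit) that hands each resulting batch and its identifier to the registered BiConsumer, mirroring the (ds, id) lambda signature shown above; a List<String> again stands in for a Spark Dataset.

```java
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical, Spark-free illustration of the consumer hook: the pipeline
// passes the resulting dataset (here a List<String>) and a batch identifier
// to whatever consumer the caller registered.
public class ConsumerSketch {
    // Default consumer does nothing until the caller registers one.
    private BiConsumer<List<String>, String> consumer = (ds, id) -> {};

    void setConsumer(BiConsumer<List<String>, String> consumer) {
        this.consumer = consumer;
    }

    // Called once per completed batch in this sketch.
    void emit(List<String> dataset, String batchId) {
        consumer.accept(dataset, batchId);
    }

    public static void main(String[] args) {
        ConsumerSketch pipeline = new ConsumerSketch();
        pipeline.setConsumer((ds, id) -> System.out.println("batch " + id + ": " + ds));
        pipeline.emit(List.of("row1", "row2"), "0");
        // prints batch 0: [row1, row2]
    }
}
```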

For a more concrete example, check out the PTH_07 Zeppelin DPL Interpreter project.

Compile:

mvn clean install -Pbuild

Contributing

You can get involved with our project by opening an issue or submitting a pull request.

Contribution requirements:

  1. All changes must be accompanied by a new or changed test. If you think testing is not required in your pull request, include a sufficient explanation of why you think so.

  2. Security checks must pass.

  3. Pull requests must align with the principles and values of Extreme Programming.

  4. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our Contributing Guideline.

Contributor License Agreement

Contributors must sign the Teragrep Contributor License Agreement before a pull request is accepted into the organization’s repositories.

You need to submit the CLA only once. After submitting it, you can contribute to all of Teragrep’s repositories.