Skip to content

Windowing functions

Dinesh Chandnani edited this page Apr 15, 2019 · 7 revisions

Advanced scenarios in streaming jobs require fine control over time windows. This tutorial walks you through various windowing functionality in Data Accelerator.

Batch Interval

User can control the interval at which the incoming data is processed. The image below shows the rate of incoming data and how data is processed based on batch interval setting. With batch interval of 10 seconds, the data in that interval is processed every 10 seconds.
Batch

To do so, you can alter the 'Batch interval in Seconds' setting. The default is 30 seconds. Furthermore, you can specify the maximum number of events to be processed in each batch by entering the value in 'Maximum Events per Batch Interval' field. The default value is already filled in. These settings will depend on incoming rate of the data, number of cores and memory in the cluster and of course scenario need.
Batch

Sliding Window

For every batch processed, you can control look back period. This is called Sliding Window. The image below shows batch interval of 5 seconds (i.e. data is processed every 5 seconds), with a sliding window of 10 seconds (i.e. look back is 10 seconds). So, past 10 seconds of data is processed every 5 seconds.
Sliding

You can alter the batch interval as shown above. To change the sliding window you can do it easily in-line in SQL using the TIMEWINDOW keyword. The query below will count the number of events every batch interval, looking back 10 seconds.

T1 = SELECT COUNT(*) AS Count
FROM DataXProcessedInput
TIMEWINDOW(‘10 seconds’)

Late Arrival Data Handling

There are times when you may want to wait for late arriving data before processing the data. This gives an opportunity to process the data even though it may be late, however if it arrives beyond the late arrival limit, the data would be dropped.

Check the example below. The color of the data represents the timestamp on the data and its position on the timeline shows when it arrived. For this example, we have set the Sliding Window to 10 seconds and Late Arrival Wait time is set to 5 seconds.

At T=20 seconds, all the data with timestamp between 5 and 15 seconds will be processed (yellow data). Notice that we will include data that arrived late. However, if yellow data arrives later than 5 seconds, then it will be dropped (i.e. data arriving after 20 seconds is dropped).

Next, the data will be processed at T=30 seconds because the batch interval is set to 10 seconds. At this time, data with timestamp between 15 and 25 seconds will be processed. The colors on the data show how they are included/excluded based on their timestamp and arrival time.

Using this setting often has trade-offs between latency (i.e. how long you want to wait for the late arrival data) and completeness (i.e. how to include as much data as possible in processing).
Late arrival

You can specify the wait time for late arriving data in 'Wait time for late arriving data' field in the input tab.
Late arrival

Time Column

When using late arriving data or sliding window, you need to specify the time column to use for the windowing functions. This is specified on the input tab.
Timestamp

Normalization

At times the incoming data may not have the time column in the expected format. For example, in the home automation sample, the eventTime column comes in as string instead of timestamp. To fix this you can specify a normalization statement in SQL that will be applied to the incoming raw data before being processed to DataXProcessedInput which you can use to set up rules or alerts in the Rules tab, or to write SQL queries against in the Query tab.

To specify normalization, toggle 'Show Normalization' to on, and write the SQL statement above the default Raw.* statement. For the home automation sample, the normalization is shown below. This will be applied to the incoming data and be used for windowing functions.
Normalization

Putting it all together

Data Accelerator gives you very powerful windowing functions to accomplish most complex of tasks that can be easily expressed in the UI or inline in SQL with keywords. The home automation sample included with Data Accelerator makes use of these dials that you can explore further.

Links

Data Accelerator

Install

Docs

Clone this wiki locally