Windowing functions
Advanced scenarios in streaming jobs require fine control over time windows. This tutorial walks you through the various windowing features in Data Accelerator.
You can control the interval at which the incoming data is processed. The image below shows the rate of incoming data and how the data is processed based on the batch interval setting. With a batch interval of 10 seconds, the data in each interval is processed every 10 seconds.
To do so, you can alter the 'Batch interval in Seconds' setting. The default is 30 seconds. Furthermore, you can specify the maximum number of events to be processed in each batch by entering a value in the 'Maximum Events per Batch Interval' field. The default value is already filled in. The right values for these settings depend on the incoming data rate, the number of cores and amount of memory in the cluster, and, of course, the needs of your scenario.
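As a rough illustration (the numbers here are purely hypothetical), an input arriving at about 1,000 events per second with a batch interval of 10 seconds yields roughly 10,000 events per batch, which gives a sense of the scale that 'Maximum Events per Batch Interval' needs to accommodate.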
For every batch processed, you can control the look-back period. This is called a sliding window. The image below shows a batch interval of 5 seconds (i.e. data is processed every 5 seconds) with a sliding window of 10 seconds (i.e. the look-back is 10 seconds). So, the past 10 seconds of data is processed every 5 seconds.
You can alter the batch interval as shown above. To change the sliding window, you can set it inline in SQL using the TIMEWINDOW keyword. The query below counts the number of events in every batch interval, looking back 10 seconds.
T1 = SELECT COUNT(*) AS Count
FROM DataXProcessedInput
TIMEWINDOW('10 seconds')
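The TIMEWINDOW clause combines with regular SQL in the query. As a sketch only (assuming TIMEWINDOW composes with a standard GROUP BY, and using a hypothetical deviceId column; the exact clause ordering and column names in your schema may differ), a per-device count over the same 10-second look-back might look like:
T2 = SELECT deviceId, COUNT(*) AS Count
FROM DataXProcessedInput
TIMEWINDOW('10 seconds')
GROUP BY deviceId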
There are times when you may want to wait for late-arriving data before processing. This gives an opportunity to process data even though it arrives late; however, if it arrives beyond the late-arrival limit, the data is dropped.
Check the example below. The color of the data represents the timestamp on the data, and its position on the timeline shows when it arrived. For this example, the sliding window is set to 10 seconds and the late-arrival wait time is set to 5 seconds.
At T=20 seconds, all the data with timestamps between 5 and 15 seconds is processed (yellow data). Notice that this includes data that arrived late. However, if yellow data arrives more than 5 seconds late (i.e. after T=20 seconds), it is dropped.
Next, the data is processed at T=30 seconds because the batch interval is set to 10 seconds. At this time, data with timestamps between 15 and 25 seconds is processed. The colors on the data show how items are included or excluded based on their timestamp and arrival time.
Using this setting is a trade-off between latency (i.e. how long you are willing to wait for late-arriving data) and completeness (i.e. how much of the data you include in processing).
You can specify the wait time for late-arriving data in the 'Wait time for late arriving data' field on the Input tab.
When using late-arriving data or a sliding window, you need to specify the time column to use for the windowing functions. This is also specified on the Input tab.
At times the incoming data may not have the time column in the expected format. For example, in the home automation sample, the eventTime column comes in as a string instead of a timestamp. To fix this, you can specify a normalization statement in SQL that is applied to the incoming raw data before it becomes DataXProcessedInput, which you can then use to set up rules or alerts in the Rules tab, or to write SQL queries against in the Query tab.
To specify normalization, toggle 'Show Normalization' on and write the SQL statement above the default Raw.* statement. For the home automation sample, the normalization is shown below. It is applied to the incoming data and used for the windowing functions.
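As a rough sketch (assuming a simple cast is sufficient and that the column is named eventTime as described above; the statement shipped with the sample may differ), the normalization could look like:
CAST(eventTime AS Timestamp) AS eventTime,
Raw.*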
Data Accelerator gives you powerful windowing functions that let you express even the most complex tasks easily in the UI or inline in SQL with keywords. The home automation sample included with Data Accelerator makes use of these dials, so you can explore them further there.