Azure Stream Analytics Team Blog

Spark Streaming and Azure Stream Analytics


As Microsoft adds support for both proprietary and open source technologies for processing and analytics of streaming data, customers have been asking us how to choose between Spark Streaming and Azure Stream Analytics. Here is my perspective on this subject, given my close involvement in the development and adoption of both these technologies.

Let’s first start with the technologies.

There are 3 layers in most big data analytics systems.

  1. Programming model, or language
  2. On node runtime
  3. Distributed runtime

 

Programming Model

Spark Streaming’s programming model is based on programmatic construction of a physical processing plan in the form of a DAG. It’s inspired by DryadLINQ (http://research.microsoft.com/en-us/projects/DryadLINQ/), a Microsoft Research project. Custom expressions are very natural in this programming model, because the code describing the DAG and the custom expressions are written in the same language, in the same program.

In contrast, Azure Stream Analytics only exposes a high-level, SQL-like language to describe the logical processing plan. The abstraction level is similar to Spark SQL; however, before Spark 2.0, Spark SQL could not be used for stream processing. User code in Azure Stream Analytics is introduced explicitly as user-defined functions, which can be Azure Machine Learning web service functions or JavaScript functions (currently in private preview). Because of the higher level of abstraction, Azure Stream Analytics supports many temporal processing patterns in the language right out of the box, e.g. the use of application time (link), windowed aggregates (link), temporal joins (link), and temporal analytic functions (link). Our intention here is to enable users to do most of what they want using a very SQL-like language with some additional functions.
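
As a minimal sketch of what this looks like in practice (the input name, field names, threshold, and window size below are illustrative, not from the original post):

SELECT
    deviceId,
    System.Timestamp AS windowEnd,
    AVG(temperature) AS avgTemperature
FROM input TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(minute, 5)
HAVING AVG(temperature) > 80

TIMESTAMP BY assigns application time from the event payload, and the tumbling window plus the HAVING clause replace the state management a user would otherwise have to write by hand.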

More recently, Spark Streaming introduced the use of application time in the 2.0 release (link), but the support is restricted to windowed aggregates. More advanced operators such as temporal joins and analytic functions still need to be built by users with updateStateByKey or the newer mapWithState operators. Both are very low-level operators that leave the burden on the user to implement the temporal logic they want, because they only deal with streaming raw events and maintaining state on the user’s behalf. The Azure Stream Analytics team, in collaboration with the Office 365 Custom Fabric team, made an attempt to implement Azure Stream Analytics-like operators on Spark Streaming 1.4. Only reorder, windowed aggregates, and temporal joins are implemented at this time. The work was presented during Spark Summit 2016 (link). Our goal, in some ways, is to bring the simplicity of Azure Stream Analytics to Spark Streaming and to evangelize Azure Stream Analytics temporal processing abstractions.

Implementing these operators is not trivial, because it’s not just the functional aspects one needs to consider; memory usage and scalability must also be taken into account. In essence, you are programming at a very low level, but if you want to fully control the processing behavior, or the operator you need doesn’t exist (e.g. session window), that’s what you have to do. Azure Stream Analytics doesn’t give you that level of control today. During Spark Summit 2016, more details about Spark Structured Streaming were revealed. The new Spark Streaming programming model is even more unified with batch and interactive processing, so Spark SQL can be used to express stream processing as well. However, these new functionalities are still not considered production ready by Databricks. There are outstanding issues such as state cleanup (to prevent the state size from growing forever), and partial result generation/aggregation for windowed aggregates, where more controls have to be exposed to users to enable the desired stream processing behavior.

 

On node runtime

Because Spark Streaming doesn’t provide many temporal operators out of the box, there is really no temporal runtime in use. In version 2.0, it is moving towards sharing the same runtime as the batch operators through Structured Streaming, so the same query optimizer (Catalyst) and processing engine (Tungsten) can be shared.

Azure Stream Analytics, being a streaming-only solution, uses an on-node engine highly tuned for stream processing called Trill (link), the fruit of many years of Microsoft Research work and of learnings from the production CEP engine StreamInsight (shipped with SQL Server). The design points of Spark Streaming and Azure Stream Analytics start to diverge from here onwards (you will see more divergence in the distributed runtime).

Spark Streaming and Azure Stream Analytics are not targeting exactly the same space. Spark covers the spectrum of batch, streaming, and interactive processing. One of the design decisions the Spark team made was to use the same underlying infrastructure to support all three usage patterns.

Azure Stream Analytics only targets stream processing and stream analytics scenarios at this time, so its design and the technology it is built on are very much optimized for stream processing. For example, when it comes to enabling the use of application time, Azure Stream Analytics sorts the incoming events by application time first, using a reorder window. Subsequent temporal operators are performed on the temporally sorted events. The implementation and semantics are much cleaner, and the amount of memory required for state is much smaller. For instance, if you are computing aggregates at a one-minute interval for one million groups, Azure Stream Analytics only needs to keep one million aggregates in memory; Spark Structured Streaming, on the other hand, keeps historical aggregates in memory in order to handle events arriving out of order. Today, because no state cleanup policy is allowed, all historical aggregates have to be kept in memory forever, which won’t work in a production system that keeps running for a long period of time. Another example is left outer join, which doesn’t exist in Spark Streaming yet; if it did, the NULL output for the left outer join would need to be either delayed or retracted when a late event arrives that creates a match for the join condition.

Upfront reordering in Azure Stream Analytics ensures that an operator has already seen all events in the past, so output can be emitted right away. When it comes to implementing temporal analytic functions (e.g. the Last function), out-of-order processing can quickly make the algorithm intractable, because the temporal logic has a strong dependency on the order of the events. This is especially true if you are searching for a temporal pattern in the events. Upfront reordering is often necessary just to ensure the correctness of the implementation. The downside of this approach is that the whole processing pipeline is delayed by the duration of the reorder window, and additional memory is needed to sort the events.
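
For illustration, a temporal analytic function such as LAG (a sketch with hypothetical input and field names) leans entirely on that ordering:

SELECT
    deviceId,
    reading,
    LAG(reading) OVER (PARTITION BY deviceId LIMIT DURATION(minute, 5)) AS previousReading
FROM input TIMESTAMP BY eventTime

Because events reach the operator already sorted by event time, the previous reading per device is well defined without any user-managed state.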

Personally, I think the RDD abstraction that is the cornerstone of Spark has the weakness of not being optimizable for each usage pattern. The RDD abstraction has driven a tremendous amount of growth for Spark Streaming in the past, because of the promise that once code is written to operate on RDDs, it can be used for all three usage patterns. RDDs created the platform where all application-level logic converges; for example, MLlib can be used for both batch and streaming. This is unlike Azure Stream Analytics, which has to call out to the Azure Machine Learning service explicitly to perform scoring. It will be interesting to see how much more mileage they get out of it, as different optimizations need to be applied to different usage patterns.

 

Distributed runtime

At the distributed runtime layer, Spark’s core processing unit is the RDD. Everything boils down to processing one RDD after another. Stream processing is modeled the same way: events in a fixed wall-clock duration are captured in an RDD and processed by the pipeline. As a result, it’s difficult to apply back pressure when there aren’t enough compute resources; the input reader keeps producing RDDs even when the rest of the pipeline cannot keep up. Also, because the wall-clock duration for each RDD is fixed (set by the user), the end-to-end latency cannot be shorter than that duration, yet if you shorten the duration too much (e.g. under 1 second), the number of RDDs scheduled for processing will overwhelm Spark’s scheduler. In my testing the system could keep up with a 10-second duration just fine, but anything under that caused growth in the event processing backlog, even with very minimal processing logic.

In order to maximize throughput, Azure Stream Analytics also processes events in batches, but the batch size is variable. When more events are queued, events are processed in larger batches; when fewer events are queued, smaller batches are processed to achieve lower latency. There is no scheduler for such batches in Azure Stream Analytics; it’s just a GetBatch/Process loop, so the overhead is very small. At the distributed runtime layer, Azure Stream Analytics uses a generalized anchor protocol that is offset based instead of time based. This allows multiple timelines in the event streams being processed, a useful concept for IoT scenarios because the clocks on the devices can be significantly out of sync. Spark Streaming doesn’t have that capability at this time; you would have to drop down to the lower-level operators and do it with custom code, if it’s possible at all.

For processing resiliency against node failures, the newest Spark Structured Streaming uses a write-ahead log (WAL) to track all state changes. Azure Stream Analytics, on the other hand, writes a full state checkpoint at a fixed interval. Either approach works well for some scenarios but not necessarily for every scenario. There is a Microsoft Research paper discussing the various recovery strategies for your reference (link).

 

Conclusion

It is definitely possible for Spark Streaming to become more sophisticated over time and overcome many of the shortcomings I’ve mentioned above. With the proper level of investment, I can see that happening. At the same time, Azure Stream Analytics is also making significant strides as it grows into a highly mature managed service: improving reliability and the debugging experience, and enabling users to complete more end-to-end scenarios out of the box. For example, Power BI output is built in for dashboard scenarios, and geospatial support is in private preview for connected car scenarios. We are also going to add an interactive event exploration experience (currently in private preview) to help users understand their data in order to write meaningful Azure Stream Analytics queries.

Ultimately, the choice of the technology depends on your scenario, but hopefully the analysis above can help you get started on the right foot!


Output to Azure Data Lake Store is Generally Available


We are excited to announce that the capability to output from Azure Stream Analytics to Azure Data Lake Store, a hyper-scale repository for big data analytics workloads, is now generally available.

This integration makes it even easier to enable a Lambda architecture, where the data that is subjected to real-time stream analytics is also stored and later subjected to offline batch processing to unlock powerful insights. A large number of batch processing possibilities are enabled by virtue of Data Lake Store’s integration with Azure Data Lake Analytics, Azure HDInsight, Microsoft Revolution-R Enterprise, and Hadoop distributions from various industry-leading providers.

Azure Data Lake Store is built to support the storage needs of big data analytics systems that require massive throughput to query and analyze petabytes of data. These capabilities will be increasingly useful as the data to be analyzed continues to grow exponentially, especially in streaming scenarios like IoT. Data Lake Store also provides low-latency read/write access and high throughput for data from typical Stream Analytics scenarios such as real-time event stream analysis, real-time web log analysis, and analytics on Internet of Things (IoT) sensor data.

The ability to output to Azure Data Lake Store from Stream Analytics can be enabled by selecting “Data Lake Store” as the output choice when adding a new output:


Additional details on outputting to Data Lake Store from Stream Analytics are covered here.

 

— Sam Chandrashekar, Program Manager

How to achieve exactly-once delivery for SQL output


Azure Stream Analytics guarantees exactly-once processing within its processing pipeline. However, it currently doesn’t ensure end-to-end exactly-once delivery to the output sink; instead, it guarantees at-least-once delivery. When using SQL output, we can achieve exactly-once semantics if the following requirements are met:

  • all output streaming events have a natural key, i.e. are uniquely identifiable either by a field or a combination of fields.
  • the output SQL table has a unique constraint (or primary key) created using the natural key of the output events.

This is sufficient to avoid duplicates, because the SQL output honors any constraints placed on the table by skipping events that cause a unique constraint violation.
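
To make this concrete, here is a sketch of what such a destination table could look like (table and column names are hypothetical; adapt the key to whatever uniquely identifies your output events):

-- Hypothetical SQL output table for an Azure Stream Analytics job.
-- The primary key over the natural key causes retried (duplicate) writes
-- to violate the constraint, so they are skipped instead of inserted twice.
CREATE TABLE dbo.DeviceWindowAggregates
(
    DeviceId      NVARCHAR(64)  NOT NULL,
    WindowEndTime DATETIME2(3)  NOT NULL,
    AvgValue      FLOAT         NOT NULL,
    CONSTRAINT PK_DeviceWindowAggregates PRIMARY KEY (DeviceId, WindowEndTime)
);

If the job restarts and re-delivers rows for a window that was already written, the duplicates violate the primary key and are skipped, leaving exactly one row per device and window.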

Processing Configurable Threshold Based Rules in Azure Stream Analytics


This post covers the usage of Azure Stream Analytics to process configurable threshold based rules.

Canonical scenarios where an alert is to be generated when an event with a certain value occurs or when an aggregated value exceeds a certain threshold can be articulated as threshold based rules in Azure Stream Analytics. These queries are simple to author if the threshold values are predetermined. However, in many cases, either the threshold values need to be configurable, or there are numerous devices or users processed by the same query with each of them having a different threshold value.

This can be achieved with the use of reference data using the following steps:

  • Store the threshold values in the reference data per key.
  • Join the events with the reference data on the key.
  • Use the keyed value from the reference data as the threshold value.

In the example illustrated below, alerts are generated when the aggregate of data streaming in from devices in a minute-long window matches the stipulated values in the rule supplied as reference data.

In the query, for each deviceId, and for each metricName under that deviceId, you can configure from 0 to 5 dimensions to GROUP BY. Only the events having the corresponding filter values are grouped. Once grouped, windowed aggregates of Min, Max, and Avg are calculated over a 60-second tumbling window. Filters on the aggregated values are then applied as per the threshold configured in the reference data, to generate the alert.

Reference data:

{
    "ruleId": 1234,
    "deviceId" : "978648",
    "metricName": "CPU",
    "alertName": "hot node AVG CPU over 90",
    "operator" : "AVGGREATEROREQUAL",
    "value": 90,
    "includeDim": {
        "0": "FALSE",
        "1": "FALSE",
        "2": "TRUE",
        "3": "FALSE",
        "4": "FALSE"
    },
    "filter": {
        "0": "",
        "1": "",
        "2": "C1",
        "3": "",
        "4": ""
    }   
}

Query:

WITH transformedInput AS
(
    SELECT
        dim0 = CASE rules.includeDim.[0] WHEN 'TRUE' THEN metrics.custom.dimensions.[0].value ELSE NULL END,
        dim1 = CASE rules.includeDim.[1] WHEN 'TRUE' THEN metrics.custom.dimensions.[1].value ELSE NULL END,
        dim2 = CASE rules.includeDim.[2] WHEN 'TRUE' THEN metrics.custom.dimensions.[2].value ELSE NULL END,
        dim3 = CASE rules.includeDim.[3] WHEN 'TRUE' THEN metrics.custom.dimensions.[3].value ELSE NULL END,
        dim4 = CASE rules.includeDim.[4] WHEN 'TRUE' THEN metrics.custom.dimensions.[4].value ELSE NULL END,
        metric = metrics.metric.value,
        metricName = metrics.metric.name,
        deviceId = rules.deviceId, 
        ruleId = rules.ruleId, 
        alertName = rules.alertName,
        ruleOperator = rules.operator, 
        ruleValue = rules.value
    FROM 
        metrics
        timestamp by eventTime
    JOIN 
        rules
        ON metrics.deviceId = rules.deviceId AND metrics.metric.name = rules.metricName
    WHERE
        (rules.filter.[0] = '' OR metrics.custom.filters.[0].value = rules.filter.[0]) AND 
        (rules.filter.[1] = '' OR metrics.custom.filters.[1].value = rules.filter.[1]) AND
        (rules.filter.[2] = '' OR metrics.custom.filters.[2].value = rules.filter.[2]) AND
        (rules.filter.[3] = '' OR metrics.custom.filters.[3].value = rules.filter.[3]) AND
        (rules.filter.[4] = '' OR metrics.custom.filters.[4].value = rules.filter.[4])
)

SELECT
    System.Timestamp as time, 
    transformedInput.deviceId as deviceId,
    transformedInput.ruleId as ruleId,
    transformedInput.metricName as metric,
    transformedInput.alertName as alert,
    AVG(metric) as avg,
    MIN(metric) as min, 
    MAX(metric) as max, 
    dim0, dim1, dim2, dim3, dim4
FROM
    transformedInput
GROUP BY
    transformedInput.deviceId,
    transformedInput.ruleId,
    transformedInput.metricName,
    transformedInput.alertName,
    dim0, dim1, dim2, dim3, dim4,
    ruleOperator, 
    ruleValue, 
    TumblingWindow(second, 60)
HAVING
    (
        (ruleOperator = 'AVGGREATEROREQUAL' AND avg(metric) >= ruleValue) OR
        (ruleOperator = 'AVGEQUALORLESS' AND avg(metric) <= ruleValue) 
    )

Sample input:

{
    "eventTime": "2017-03-08T14:50:23.1324132Z",
    "deviceId": "978648",
    "custom": {
        "dimensions": {
            "0": {
                "name": "NodeType",
                "value": "N1"
            },
            "1": {
                "name": "Cluster",
                "value": "C1"
            },
            "2": {
                "name": "NodeName",
                "value": "N024"
            }
        },
        "filters": {
            "0": {
                "name": "application",
                "value": "A1"
            },
            "1": {
                "name": "deviceType",
                "value": "T1"
            },
            "2": {
                "name": "cluster",
                "value": "C1"
            },
            "3": {
                "name": "nodeType",
                "value": "N1"
            }
        }
    },
    "metric": {
        "name": "CPU",
        "value": 98,
        "count": 1.0,
        "min": 98,
        "max": 98,
        "stdDev": 0.0
    }
}
{
    "eventTime": "2015-03-08T14:50:24.1324138Z",
    "deviceId": "978648",
    "custom": {
        "dimensions": {
            "0": {
                "name": "NodeType",
                "value": "N2"
            },
            "1": {
                "name": "Cluster",
                "value": "C1"
            },
            "2": {
                "name": "NodeName",
                "value": " N024"
            }
        },
        "filters": {
            "0": {
                "name": "application",
                "value": "A1"
            },
            "1": {
                "name": "deviceType",
                "value": "T1"
            },
            "2": {
                "name": "cluster",
                "value": "C1"
            },
            "3": {
                "name": "nodeType",
                "value": "N2"
            }
        }
    },
    "metric": {
        "name": "CPU",
        "value": 95,
        "count": 1,
        "min": 95,
        "max": 95,
        "stdDev": 0
    }
}
{
    "eventTime": "2015-03-08T14:50:37.1324130Z",
    "deviceId": "978648",
    "custom": {
        "dimensions": {
            "0": {
                "name": "NodeType",
                "value": "N3"
            },
            "1": {
                "name": "Cluster",
                "value": "C1 "
            },
            "2": {
                "name": "NodeName",
                "value": "N014"
            }
        },
        "filters": {
            "0": {
                "name": "application",
                "value": "A1"
            },
            "1": {
                "name": "deviceType",
                "value": "T1"
            },
            "2": {
                "name": "cluster",
                "value": "C1"
            },
            "3": {
                "name": "nodeType",
                "value": "N3"
            }
        }
    },
    "metric": {
        "name": "CPU",
        "value": 80,
        "count": 1,
        "min": 80,
        "max": 80,
        "stdDev": 0
    }
} 

Output:

{"time":"2017-01-15T02:03:00.0000000Z","deviceid":"978648","ruleid":1234,"metric":"CPU","alert":"hot node AVG CPU over 90","avg":98.0,"min":98.0,"max":98.0,"dim0":null,"dim1":null,"dim2":"N024","dim3":null,"dim4":null}

Using Azure Stream Analytics JavaScript UDF to lookup values in JSON array


JavaScript UDF (User-Defined Function) allows you to handle complex JSON schema and keep your query clean. In this blog you will learn how to handle nested JSON arrays with a JavaScript UDF. Below is an example event generated by an IoT gateway. There is an array of two devices – device01 and device02; each device has an array of measurements which represent three types of sensor data: temperature, humidity, and pressure.

 

[sourcecode language='javascript'  padlinenumbers='true']
[{
    "gateway": "3a40067f-9d21-4a02-ad5f-926f31648f71",
    "timestamp": "2017-03-13T14:33:02",
    "devices": [{
        "deviceId": "device01",
        "timestamp": "2017-03-13T14:33:02",
                "measurements": [{
            "type": "Temperature",
            "unit": "F",
            "value": 76.34
          }, {
            "type": "Humidity",
            "unit": "%RH",
            "value": 53.32
          }, {
            "type": "Pressure",
            "unit": "psi",
            "value": 21.2586
          }
        ]
      }, {
        "deviceId": "device02",
        "timestamp": "2013-03-20T09:33:12",
        "measurements": [{
            "type": "Temperature",
            "unit": "F",
            "value": 75.6875
          }, {
            "type": "Humidity",
            "unit": "%RH",
            "value": 52.9668
          }, {
            "type": "Pressure",
            "unit": "psi",
            "value": 22.1355
          }
        ]
      }
    ]
  }
]
[/sourcecode]

 

Below is the desired output, which converts each device into one output event, with all measurements under a device becoming properties.

Timestamp               DeviceId    Temperature   Humidity   Pressure
2017-03-13T14:33:02Z    device01    76.34         53.32      21.2586
2013-03-20T09:33:12Z    device02    75.6875       52.9668    22.1355

 

The Solution:

We can use two techniques to handle the arrays: GetArrayElements() with CROSS APPLY generates one row for each device record in the devices array, and the UDF getValue() looks up a value in the measurements array. First, let’s define a JavaScript UDF getValue(identifier, arr) as below. If you don’t know how to add a JavaScript UDF, take a look at https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-javascript-user-defined-functions.

 

[sourcecode language='javascript'  padlinenumbers='true']
// Name the UDF as getValue()
// Set return type as any
// identifier == measurement type name to look up (e.g. 'Temperature')
// arr == JSON array containing objects with type/value pairs
// returns the matching value, or null if not found
function main(identifier, arr) {
  var result = null;

  if (Object.prototype.toString.call(arr) == "[object Array]") {
    for (var i = 0; i < arr.length; i++) {
      if (arr[i].type == identifier) {
        result = arr[i].value;
      }
    }
  }
  return result;
}
[/sourcecode]

 

Your query will be like below:

 

[sourcecode language='sql' ]
SELECT
    d.arrayvalue.timestamp,
    d.arrayvalue.deviceId,
    udf.getValue('Temperature', d.arrayvalue.measurements) as Temperature,
    udf.getValue('Humidity', d.arrayvalue.measurements) as Humidity,
    udf.getValue('Pressure', d.arrayvalue.measurements) as Pressure
FROM input
CROSS APPLY GetArrayElements(input.devices) as d
[/sourcecode]

 

 

An alternative solution:

You might have noticed that the above query calls the getValue() function against the same array multiple times. If there are many identifier values to look up, multiple calls to a JavaScript UDF may increase your job’s end-to-end latency. We can modify the UDF to return multiple values in a JSON object, and then parse that JSON object in the query language.

The below JavaScript UDF does a lookup for all type-value pairs and returns a JSON object.

 

[sourcecode language='javascript' ]
// Name the UDF as getValues()
// Set return type as any
// arr == JSON array containing type and value pairs
// returns a JSON object with one property per type found (empty object if none)
function main(arr) {
  var result = {};
  if (Object.prototype.toString.call(arr) == "[object Array]") {
    for (var i = 0; i < arr.length; i++) {
      var identifier = arr[i].type;
      if (identifier != null) {
        result[identifier] = arr[i].value;
      }
    }
  }
  return result;
}
[/sourcecode]

 

 

Your query gets the JSON object first and then separates it into multiple properties.

 

[sourcecode language='sql' ]
WITH flattened AS
(
    SELECT
        d.arrayvalue.timestamp,
        d.arrayvalue.deviceId,
        udf.getValues(d.arrayvalue.measurements) as measurements
    FROM input
    CROSS APPLY GetArrayElements(input.devices) as d
)

SELECT
    timestamp,
    deviceId,
    measurements.Temperature,
    measurements.Humidity,
    measurements.Pressure
FROM flattened f
[/sourcecode]

 

 

SU Utilization Metric


In order to achieve low-latency stream processing, Azure Stream Analytics jobs perform all processing in memory. When they run out of memory, the streaming job fails. As a result, for a production job, it’s important to monitor a streaming job’s resource usage and make sure there are enough resources allocated to keep the job running 24/7.
One important metric for monitoring resource usage is the SU % utilization metric. The metric is a percentage number ranging from 0% to 100%. For a streaming job with a minimal footprint, the SU % utilization metric is usually under 10%. It’s best to keep the metric below 80% to account for occasional spikes.

You can set an alert on the metric.
See https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-alerts-portal

There are several factors contributing to an increase in SU % utilization.

    1. Stateful query logic
      One of the unique capabilities of an Azure Stream Analytics job is stateful processing, such as windowed aggregates, temporal joins, and temporal analytic functions. Each of these operators keeps some state.
      • The state size of a windowed aggregate is proportional to the number of groups (cardinality) in the group by operator.
        For example, in the query

        SELECT count(*) from input group by clusterid, tumblingwindow (minutes, 5)

        the number of distinct clusterid values is the cardinality of the query.
        To ameliorate issues caused by high cardinality, send events to Event Hub partitioned by clusterid, and scale out the query by allowing the system to process each input partition separately using PARTITION BY, as shown:

        SELECT count(*) from input PARTITION BY PartitionId GROUP BY PartitionId, clusterid, tumblingwindow (minutes, 5)

        Once the query is partitioned out, it is spread over multiple nodes. As a result, the number of clusterid values coming into each node is reduced, thereby reducing the cardinality of the group by operator.
        Events should be partitioned in Event Hub by the grouping key to avoid the need for a reduce step. Additional details are covered here: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-overview.
      • The state size of a temporal join is proportional to the number of events in the temporal wiggle room of the join, which is the event input rate multiplied by the wiggle room size.
        The number of unmatched events in the join affects the memory utilization of the query. The following query looks for the ad impressions that generate clicks:

        SELECT id from clicks INNER JOIN impressions on impressions.id = clicks.id AND DATEDIFF(hour, impressions, clicks) between 0 AND 10

        It is possible that lots of ads are shown, few people click on them, and all the events must be kept in the time window. Memory consumed is proportional to the window size and the event rate.
        To remediate this, send events to Event Hub partitioned by the join key (id in this case), and scale out the query by allowing the system to process each input partition separately using PARTITION BY, as shown:

        SELECT id from clicks PARTITION BY PartitionId INNER JOIN impressions PARTITION BY PartitionId on impressions.PartitionId = clicks.PartitionId AND impressions.id = clicks.id AND DATEDIFF(hour, impressions, clicks) between 0 AND 10

        Once the query is partitioned out, it is spread over multiple nodes. As a result, the number of events coming into each node is reduced, thereby reducing the size of the state kept in the join window.
      • The state size of a temporal analytic function is proportional to the event rate multiplied by the duration.
        The remediation is the same as for a temporal join: scale out the query using PARTITION BY.
    2. Out-of-order buffer
      Users can configure the out-of-order buffer size in the Event Ordering configuration pane. The buffer is used to hold inputs for the duration of the out-of-order window and reorder them. The size of the buffer is proportional to the event input rate multiplied by the out-of-order window size. The default window size is 0.
      To remediate this, scale out the query using PARTITION BY. Once the query is partitioned out, it is spread over multiple nodes. As a result, the number of events coming into each node is reduced, thereby reducing the number of events in each reorder buffer.
    3. Input partition count
      Each input partition of a job input has a buffer. The larger the number of input partitions, the more resources the job consumes. For each SU, Azure Stream Analytics can process roughly 1 MB/s of input, so you may want to match the number of ASA SUs to the number of partitions of your Event Hub. Typically, a 1 SU job is sufficient for an Event Hub with 2 partitions (which is the minimum for Event Hub). If the Event Hub has more partitions, your ASA job consumes more resources, but does not necessarily use the extra throughput provided by Event Hub. For a 6 SU job, you may need 4 or 8 partitions on the Event Hub. Using an Event Hub with 16 or more partitions in a 1 SU job often contributes to excessive resource usage, and should be avoided.

When tuning your jobs to reduce SU % utilization, and/or deciding how many SUs to use for a job, the above factors should be considered.
At this time, the tuning process is largely trial and error. You can run the job with typical input and examine the SU % utilization metric to find out whether the number of SUs allocated is sufficient. If 6 SUs still don’t meet your needs, you will need to consider partitioning your query with PARTITION BY as illustrated above, so the work can be distributed across partitions. For partitioned queries, it’s recommended to use a multiple of 6 SUs for the job, where the multiplier is the number of partitions in the Event Hub input. It’s possible to use a smaller multiplier, but the buffers for each partition may add memory pressure. You will need to experiment and see what works best for your job. For more information on how to scale out jobs, see
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-scale-jobs

Event Processing Ordering Design Choices for Azure Stream Analytics


In this blog, I would like to explain the design choices we made with Azure Stream Analytics for:

  • ‘When’, in processing time, the results are materialized, and
  • ‘How’ the results are materialized.

These design choices reduce implementation complexities and run time resource usage while meeting customers’ requirements for their scenarios.

Firstly, I would like to point out that these are two very different issues. The ‘when’ part is a user-visible behavior, preferably controllable by the user; the ‘how’ part is purely an implementation detail. However, depending on the user-visible behavior, some implementations are more efficient than others.
In Google Dataflow’s design, the concept of a Trigger is introduced to control ‘when’ the results are materialized. This is most intuitively useful for computing windowed aggregates, and perhaps inner joins.
For example,

SELECT DeviceId, COUNT(*) FROM Input TIMESTAMP BY EventTime GROUP BY DeviceId, TumblingWindow(minute, 5)

In this case, the full count can only be generated when the high-water mark of EventTime moves beyond the end of each 5-minute window, but a partial count may be desirable to display on a dashboard as the count increases. This is especially relevant when events arrive out of order, and in cases where the count in multiple 5-minute windows can increase as late-arriving events are accounted for.

For outer joins, where a lack of match may have to be retracted, and for analytic functions, where event ordering is important for pattern matching, early output doesn’t provide as much value.
For example

SELECT I.Id as Id
FROM Impression AS I TIMESTAMP BY ImpressionTime
LEFT OUTER JOIN Click AS C TIMESTAMP BY ClickTime
ON I.Id = C.Id
AND DATEDIFF(minute, I, C) BETWEEN 0 AND 60
WHERE C.Id IS NULL

This query outputs all ad impressions without clicks within a 60-minute interval. Generating output before 60 minutes have elapsed from the ad impression may result in the need to retract the output when a click finally arrives. When composed with additional query processing downstream, the computation caused by such retractions can become overwhelming.

For complex windowed aggregates, e.g. sliding windows, the same applies: late-arriving events can alter the results of multiple windows that contain the late-arriving event. If we introduce a session window, a useful window type for certain scenarios, a late-arriving event may combine two session windows into one. This is especially true for alerting scenarios, a primary use case for stream processing, where, in most cases, retracting an alert is much less desirable than waiting to make sure some pattern has indeed occurred and is worthy of an alert. Beyond a single-step windowed aggregate query, composing such behavior with other downstream operators can also become intractable for users to understand. As a result, we made the design choice not to expose the ability to generate partial results. Results are only generated when the high-water mark of event time has advanced beyond the close of the window, or beyond all events that may contribute to the results.

Not needing to produce partial results gives us the opportunity to use in-order processing to compute the results: users specify a reorder window to wait for late-arriving events, and the high-water mark of the incoming events’ timestamps is used as a punctuation that moves the rest of the processing pipeline forward. Events coming out of the reorder window are fully sorted by event time per partition, and subsequent operator implementations can assume that incoming events are already ordered by event time.

For example, in the ad impression and click matching example above, the user may anticipate event delays of up to 5 minutes. A reorder window of 5 minutes may be specified as part of the query configuration. You can think of the reorder buffer as a priority queue. As the high-water mark of incoming events advances, all events with a timestamp more than 5 minutes older than the high-water mark are pushed out of the queue, ordered by timestamp, and then processed by the join operator. The join operator only needs to buffer 60 minutes of events; as time advances, it can drop events more than 60 minutes old. The same logic applies to the windowed aggregate query shown above.

In contrast, processing events as they arrive, without reordering (also known as out-of-order processing), has the advantage of reducing the buffered state size for windowed aggregates, because only the aggregated state needs to be kept around. In the windowed aggregate query (the first example above), if events can be out of order by up to 5 minutes, only two counters have to be kept per device: one for the current 5 minutes and one for the previous 5 minutes. There is no need to buffer all incoming events in a 5-minute reorder buffer, a potentially large saving in memory if the event rate is high.

However, there is no memory advantage for joins and analytic functions, because the raw events have to be buffered anyway. When multiple operators are composed together, such event buffering needs to happen at each operator, which consumes much more memory than the per-input reorder buffer used by in-order processing. CPU-wise, out-of-order processing does have the advantage of amortizing the computation over time, because the computation is performed as events arrive, so CPU usage is less spiky. However, the operator implementations are often much more complex (e.g. multiple counters have to be kept around in the windowed aggregate example above), which in turn increases overall CPU usage and can prevent other optimization techniques from being applied (e.g. columnar processing, instruction cache and compiler optimizations). The Trill paper, https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf, describes many such techniques. ASA uses Trill as the on-node processing engine.

The drawback of in-order processing is primarily the potentially large reorder buffer when a large reorder window is specified. This happens most commonly in IoT scenarios, where a large number of event senders can result in significantly divergent timelines because of varying network delays and clock skews. However, in such scenarios the different timelines are logically separate in the query semantics. As a result, we have introduced the concept of a substream query to allow the timelines to be decoupled, so the reorder buffer for each timeline can be kept small while the time gap between timelines can be large.

For example,

SELECT DeviceId, COUNT(*) FROM Input TIMESTAMP BY EventTime OVER DeviceId GROUP BY DeviceId, TumblingWindow(minute, 5)

This query counts events by device id. The devices can have clocks completely out of sync, but for a specific device the events are guaranteed to arrive mostly in order. We can then use a very small reorder window (e.g. a few seconds) to avoid incurring large memory usage for the reorder buffer and delaying the output.

To conclude, the two design choices, namely 1) not producing partial results and 2) in-order processing, allow on-node processing in Azure Stream Analytics to be performed very efficiently, with overall lower memory usage as a result of eliminating the need to buffer events at every operator. In the future, there might be opportunities to combine in-order and out-of-order processing techniques to achieve even higher memory and CPU efficiency where appropriate.

See also

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-out-of-order-and-late-events
https://blogs.msdn.microsoft.com/streamanalytics/2015/05/17/out-of-order-events/

Improved Stream Analytics output support in Visual Studio


Power BI and Azure Data Lake Store outputs are now supported in the Visual Studio tools. Previously, users could only create or view jobs with these types of outputs in the Azure portal; now all these actions can also be performed in the Visual Studio tools.

With this capability, developers can enjoy the same powerful developer experience for all types of Stream Analytics projects. For more information about Stream Analytics tools for Visual Studio, please see the Get started documentation.

Getting started

To create a job with Power BI or Data Lake Store output, just create a Stream Analytics project as before and choose the Sink type from the dropdown list.

 

When Power BI or Data Lake Store is selected as an output in the Stream Analytics project, you need to authorize the use of your existing Power BI or Data Lake Store before submitting the job.

 

You can also create a Visual Studio project from an existing job with a Data Lake Store or Power BI output, just like with other output types. Find the job in Server Explorer under the Stream Analytics node, right-click it, and choose to export it to a Stream Analytics project.

 

If you encounter any issues, please contact ASAToolsFeedback@microsoft.com .

You are also very welcome to share your feedback at https://feedback.azure.com/forums/270577-stream-analytics.


Now in Public Preview: Visual Studio tools for Azure Stream Analytics on IoT Edge


Azure Stream Analytics (ASA) on IoT Edge empowers developers to deploy real-time analytical intelligence closer to IoT devices so that they can unlock the full value of device-generated data. Today we are happy to announce that the Visual Studio tools for ASA now support development of ASA on IoT Edge, in addition to development of cloud jobs. These tools greatly simplify the experience of developing ASA Edge jobs.

With these Visual Studio tools, you can easily author an ASA Edge script with IntelliSense support, test it on your local machine against local data inputs and then create a corresponding job in the cloud for deployment on the IoT Edge.

Below we take a look at some of the key features available, or you can quickly get started by following the tutorial.

Script Authoring

As with an ASA cloud project, you can start by creating an ASA on IoT Edge project. This lets you create and open queries in an editor that understands IoT Edge syntax and offers several cool features, including keyword completion, error markers, and syntax highlighting. These features save editing time and help you catch compilation errors as early as possible.

Local testing

Since we released Visual Studio tools for ASA cloud jobs in 2017, local testing has become the most popular feature among users. The tools encapsulate a single-box local runtime which allows you to run the query solely on the local machine. In this way you can focus on verifying the query logic even in a disconnected mode before deploying to devices.

Submit an ASA Edge job

When the local development and testing is done, you can submit an ASA Edge job to Azure and then use IoT Hub to deploy it to your IoT Edge device(s).

If you encounter any issues, please contact ASAToolsFeedback@microsoft.com .

You are also very welcome to share your feedback at https://feedback.azure.com/forums/270577-stream-analytics.

8 reasons to choose Azure Stream Analytics for real-time data processing


Processing big data in real-time is now an operational necessity for many businesses. Azure Stream Analytics is Microsoft’s serverless real-time analytics offering for complex event processing. It enables customers to unlock valuable insights and gain competitive advantage by harnessing the power of big data. Here are eight reasons why you should choose ASA for real-time analytics.

1. Fully integrated with the Azure ecosystem: Build powerful pipelines with a few clicks

Whether you have millions of IoT devices streaming data to Azure IoT Hub or apps sending critical telemetry events to Azure Event Hubs, it only takes a few clicks to connect multiple sources and sinks to create an end-to-end pipeline. Azure Stream Analytics provides best-in-class integration with stores for your output, like Azure SQL Database, Azure Cosmos DB, and Azure Data Lake Store. It also enables you to trigger custom workflows downstream with Azure Functions, Azure Service Bus Queues, or Azure Service Bus Topics, or to create real-time dashboards using Power BI.

2. Developer productivity

One of the biggest advantages of Stream Analytics is the simple SQL-based query language with its powerful temporal constraints for analyzing data in motion. Familiarity with the SQL language is sufficient to author powerful queries. Azure Stream Analytics supports language extensibility via JavaScript user-defined functions (UDFs) or user-defined aggregates to perform complex calculations as part of a Stream Analytics query. With the Stream Analytics Visual Studio tools you can author queries offline and use CI/CD to submit jobs to Azure. Native support for geospatial functions makes it easy to tackle complex scenarios like fleet management and mobile asset tracking.
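
As an illustration (a sketch only; the input, field names, and reference coordinates are made up), the built-in geospatial functions can be used directly in the query language, for example to compute how far each vehicle is from a fixed site:

SELECT
    vehicleId,
    -- Distance between the vehicle's reported position and a fixed reference point
    ST_DISTANCE(CreatePoint(latitude, longitude), CreatePoint(47.64, -122.13)) AS distanceFromSite
FROM input TIMESTAMP BY eventTime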

3. Intelligent edge

Most data becomes useless just seconds after it’s generated. In many cases, processing data closer to the point of generation is becoming more and more critical. This allows for lower bandwidth costs and the ability of a system to function even with intermittent connectivity. Azure Stream Analytics on IoT Edge enables you to deploy real-time analytics closer to IoT devices so that you can unlock the full value of device-generated data.

4. Easily leverage the power of machine learning

Azure Stream Analytics offers real-time event scoring by integrating with Azure Machine Learning solutions. Additionally, Stream Analytics offers built-in support for commonly used scenarios such as anomaly detection, which reduces the complexity of building and deploying an ML model in your hot-path analytics pipeline to a simple function call. Users can easily detect common anomalies such as spikes, dips, and slow positive or negative trends with these online learning and scoring models.
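
As a sketch of what such a function call can look like (the input, field names, and the confidence, history-size, and window parameters below are illustrative), the built-in spike-and-dip detection is invoked inline and returns a record whose properties can be extracted in the query:

WITH anomalies AS
(
    SELECT
        deviceId,
        CAST(temperature AS float) AS temperature,
        -- Score the last 120 seconds of readings per device at 95% confidence
        AnomalyDetection_SpikeAndDip(CAST(temperature AS float), 95, 120, 'spikesanddips')
            OVER (PARTITION BY deviceId LIMIT DURATION(second, 120)) AS scores
    FROM input TIMESTAMP BY eventTime
)
SELECT
    deviceId,
    temperature,
    CAST(GetRecordPropertyValue(scores, 'Score') AS float) AS anomalyScore,
    CAST(GetRecordPropertyValue(scores, 'IsAnomaly') AS bigint) AS isAnomaly
FROM anomalies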

5. Lower your cost of innovation

There are zero upfront costs, and you only pay for the number of streaming units you consume to process your data streams. There is absolutely no commitment or cluster provisioning, allowing you to focus on making the best use of this technology.

6. Best-in-class financially backed SLA by the minute

We understand it is critical for businesses to prevent data loss and maintain business continuity. Stream Analytics processes millions of events every second and can deliver results with low latency. This is why Stream Analytics guarantees event processing with a 99.9 percent availability SLA at the minute level, which is unparalleled in the industry.

7. Scale instantly

Stream Analytics is a fully managed serverless (PaaS) offering on Azure. There is no infrastructure to worry about, and no servers, virtual machines, or clusters to manage. We do all the heavy lifting for you in the background. You can instantly scale up or scale out the processing power from one to hundreds of streaming units for any job.

8. Reliable

Stream Analytics guarantees exactly-once event processing and at-least-once delivery of events. It has built-in recovery capabilities in case the delivery of an event fails, so you never have to worry about your events getting dropped.

Getting started is easy

There is a strong and growing developer community that supports Stream Analytics. We at Microsoft are committed to improving Stream Analytics and always listening for your feedback to improve our product! Learn how to get started and build a real-time fraud detection system.

Processing compressed event streams with Azure Stream Analytics


Did you know that Azure Stream Analytics now supports input data in compressed formats such as GZIP and Deflate streams? This is especially critical in scenarios where bandwidth is limited and/or message size is high.

This blog shows you how to squeeze the most value out of your data with compressed input formats. Introduced last fall after a growing number of requests from customers, Azure Stream Analytics offers support for GZIP and Deflate streams.

Azure Stream Analytics is now a viable option for many customers who previously faced these bandwidth and message size limitations.

Putting the squeeze on your data with GZipStream

Compressed input can be used alongside most supported serialization formats, including JSON and CSV, if desired. Avro supports its own compression options, so additional compression is not necessary when using Avro serialization. This blog post outlines how to spin up a job using an Event Hub input, JSON serialization, and GZIP compression. Both the VSTS project shown and the sample data used can be found on our GitHub.

Step 1: Defining the Events

Let’s start by defining the events that would be sent to input source:

[DataContract]
public class SampleEvent
{
    [DataMember]
    public int Id { get; set; }
}

We will be using the Newtonsoft.Json library for JSON serialization. You will need to add a NuGet reference for this library through Project -> Manage NuGet Packages.

I have added the package for Service Bus in the same manner.

Here is what the packages.config file looks like after adding the NuGet references:

<?xml version="1.0" encoding="utf-8"?>
<packages>

  <package id="Newtonsoft.Json" version="11.0.2" targetFramework="netstandard2.0" />
  <package id="WindowsAzure.ServiceBus" version="4.1.10" targetFramework="net45" />

</packages>

Step 2: Serialize the Data

Now let’s look at the example serialization code. We will be using classes from Newtonsoft.Json namespace. The sample file can be found on our GitHub.

private class CompressedEventsJsonSerializer
{
    public byte[] GetSerializedEvents<T>(IEnumerable<object> events)
    {
        if (events == null)
        {
            throw new ArgumentNullException(nameof(events));
        }

        JsonSerializer serializer = new JsonSerializer();
        using (var memoryStream = new MemoryStream())
        {
            // Compress the JSON directly into the memory stream; leaveOpen keeps
            // memoryStream usable after the GZipStream is closed and flushed.
            using (GZipStream gzipStream = new GZipStream(memoryStream, CompressionLevel.Optimal, leaveOpen: true))
            using (StreamWriter sw = new StreamWriter(gzipStream))
            using (JsonWriter writer = new JsonTextWriter(sw))
            {
                foreach (var current in events)
                {
                    serializer.Serialize(writer, current);
                }
            }

            // Return the compressed payload only after the writers are disposed,
            // so the GZIP footer has been written.
            return memoryStream.ToArray();
        }
    }
}

The code above serializes the events in JSON format and compresses them into a single payload. Note that a single payload (container) carries multiple events.

Step 3: Send the events to Event Hub

Finally, let’s send events to Event Hub:

private static void Main(string[] args)
{
    var eventHubClient =
        EventHubClient.CreateFromConnectionString(
            "",
            "");
    var compressedEventsJsonSerializer = new CompressedEventsJsonSerializer();
    while (true)
    {
        var eventsPayload =
            compressedEventsJsonSerializer.GetSerializedEvents(
                Enumerable.Range(0, 5).Select(i => new SampleEvent() { Id = i }));
        eventHubClient.Send(new EventData(eventsPayload));
        Thread.Sleep(TimeSpan.FromSeconds(10));
    }
}

We have now seen example code for sending GZIP-compressed JSON events to Event Hub; these events can be consumed by a Stream Analytics job by configuring an Event Hub input with the JSON serialization format and GZIP compression.

Please keep the feedback coming! The Azure Stream Analytics team is highly committed to listening to your feedback and letting the user voice influence our future investments. We welcome you to join the conversation and make your voice heard on UserVoice, learn more about what you can do with Azure Stream Analytics, and connect with us on Twitter.
