1. What are streaming units?

Let us have a look at streaming units, which are used to measure and optimize the performance of Azure Stream Analytics jobs.
To optimize performance, we need to monitor and adjust the resources allocated to Stream Analytics jobs. Streaming Units are the measure of these resources: the higher the number of Streaming Units allocated to a job, the more CPU and memory resources are available to it.
The SU% utilization metric measures resource usage as a percentage of the allocated Streaming Units and can range from 0% to 100%. It is recommended to keep SU% utilization below 80% to accommodate future spikes, and it is best practice to set an alert at 80% so that more Streaming Units can be allocated before the job runs out of resources. One indication that a Stream Analytics job requires more Streaming Units is a high number of backlogged input events.
The Azure portal provides several metrics to measure the throughput of a job. Besides adjusting the input partitions and optimizing the query, one step that can be taken to improve throughput is to add more Streaming Units to the job.
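The 80% alert described above can be created with the Azure CLI. The following is only a sketch: the alert name, resource group, and job resource ID are placeholders, and the metric name for SU% utilization is assumed here to be `ResourceUtilization`, so verify it against your job's available metrics before using it.

```
az monitor metrics alert create \
  --name "asa-su-utilization-alert" \
  --resource-group "my-resource-group" \
  --scopes "/subscriptions/<sub-id>/resourceGroups/my-resource-group/providers/Microsoft.StreamAnalytics/streamingjobs/my-asa-job" \
  --condition "avg ResourceUtilization > 80" \
  --description "SU% utilization above 80% - consider allocating more Streaming Units"
```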
2. Can you call an Azure Function from Azure Stream Analytics?
Let us see how we can run an Azure Function from an Azure Stream Analytics job.
Step 1: Create an Azure Stream Analytics job. This can be done using the Azure portal UI.
Step 2: Create an Azure Cache for Redis instance. After creating the cache, make a note of the Primary connection string under Access Keys.
Step 3: Create an Azure Function. This will be an HTTP-triggered function. For detailed steps and the code for the Azure Function, please see the Reference link below this post.
Step 4: Update the Stream Analytics job with the Azure Function as output. In this step, we add the Azure Function as the sink for the Stream Analytics job created in Step 1.
The Azure Stream Analytics job should now be sending data to the Redis cache through the Azure Function created in Step 3.
The HTTP request timeout is set to 100 seconds. This means that if the Azure Function takes more than 100 seconds to process a batch, the Stream Analytics job will report an error. If this occurs, Azure Stream Analytics will retry sending the failed batch to the Azure Function.
3. What is Azure Synapse Link?
Azure Synapse Link for Azure Cosmos DB provides seamless integration between Synapse Analytics and Cosmos DB. Analytics workloads have evolved from the traditional structured data warehouse requiring ETL (Extract, Transform, Load) integration to more flexible analysis of semi-structured, multi-format big data. Azure Synapse Link caters to this by providing a hybrid transactional and analytical processing (HTAP) capability within the Azure cloud.
Azure Cosmos DB provides a built-in analytical column store, which connects seamlessly with Azure Synapse Link. This connection gives data engineers and data scientists the ability to process and analyse transactional data with near-real-time latency using Synapse SQL or Synapse Spark.
The key advantage of Azure Synapse Link is that it provides this integration without the need for ETL pipelines; the approach is therefore called no-ETL integration. Changes in the transactional data store are synced automatically to the analytical store within Cosmos DB, and these changes are then reflected in Azure Synapse Analytics query results in near real time, without running an ETL pipeline.
Since the analytical and transactional stores in Cosmos DB are provisioned independently of each other, querying the analytical store through Azure Synapse Link has no performance impact on the transactional workloads within Cosmos DB.
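As a sketch of what a no-ETL query looks like, a serverless SQL pool in Synapse can read the Cosmos DB analytical store directly with OPENROWSET. The account, database, key, and container names below are placeholders, not values from this post:

```sql
-- Query the Cosmos DB analytical store directly; no ETL pipeline involved.
SELECT TOP 10 *
FROM OPENROWSET(
        'CosmosDB',
        'Account=my-cosmos-account;Database=my-database;Key=<account-key>',
        Orders  -- container whose analytical store is queried
     ) AS orders;
```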
4. What are the machine learning features available in Azure Synapse Analytics?
Machine learning is one of the most important techniques in data science and analytics. Azure Synapse Analytics supports several ways of implementing machine learning out of the box.
The basic process flow of a typical machine learning project is as follows:
The output of the first step is to define the basics of the project in terms of scope and outcomes.
Azure Synapse Analytics has robust data ingestion and acquisition features (Pipelines and close integration with Azure Data Factory).
The next step is model selection and training. Azure Synapse Analytics incorporates two major ways to achieve this:
Spark MLlib: Spark MLlib can be used to train models in the more classical, code-first sense. This option is ideal for those who are already familiar with Spark MLlib.
Automated ML: If you are new to machine learning, Azure Synapse Analytics provides this automated way to train models. This method allows the user to select the best model based on chosen metrics.
Apart from Spark MLlib, Synapse Analytics supports other popular machine learning libraries, such as scikit-learn.
The final step in a machine learning project is model deployment and scoring. Synapse Analytics supports deployment and batch scoring of models trained within Azure (using one of the techniques mentioned above) or outside the Azure environment. There are two ways to run batch scoring in Azure Synapse: the T-SQL PREDICT function and Apache Spark pools for Azure Synapse.
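As an illustration of the first option, the T-SQL PREDICT function scores rows against a model stored in a table. This is only a sketch: the dbo.Models and dbo.InputData objects and the Score output column are hypothetical names, and the model is assumed to be in ONNX format, which is the runtime Synapse dedicated SQL pools expect:

```sql
-- Batch scoring with the T-SQL PREDICT function (model assumed to be ONNX).
SELECT d.*, p.Score
FROM PREDICT(
        MODEL = (SELECT Model FROM dbo.Models WHERE ModelName = 'my-model'),
        DATA = dbo.InputData AS d,
        RUNTIME = ONNX) WITH (Score FLOAT) AS p;
```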
5. What is Azure Security Benchmark?
Information security is one of the most important and crucial aspects of any enterprise-grade software system. For a cloud ecosystem like Azure, sturdy information security (including data security) is fundamental. Microsoft has been constantly improving the robustness of the Azure platform ever since its launch.
Recently, Microsoft released the second version of its recommended security practices and configurations, called the Azure Security Benchmark, to help customers adopt information security best practices across the Azure platform.
The benchmark provides design and configuration guidelines across various cloud-centric control areas, called Control Domains. These are based on well-known security benchmarks such as the Center for Internet Security (CIS) Controls Version 7.1 and the National Institute of Standards and Technology (NIST) SP 800-53.
The Control Domains included in the Azure Security Benchmark v2 are:

Azure Security Benchmark v2: Control Domains
Each Information Security Control Domain mentioned above includes the following:
Azure ID: A unique Security Benchmark ID to identify the recommendation.
CIS Controls v7.1 ID(s): The corresponding CIS Controls v7.1 ID(s) (single or multiple) for the recommendation.
NIST SP 800-53 r4 ID(s): The corresponding NIST SP 800-53 r4 ID(s) (single or multiple) to identify the recommendation.
Details: A detailed explanation of the rationale, along with helpful guidelines on how to implement the recommendation. Information about whether the recommendation is supported by Azure Security Center is also included.
Responsibility: Identifies the entity (customer, service provider, or both) responsible for implementing the security recommendation. Due to the overarching nature of the Azure Security Benchmark, some recommendations have a shared responsibility between the customer and the service provider; these are specified in the baseline recommendations for the individual service.
Customer Security Stakeholders: The security function in the customer organization's security department responsible for the respective control. This can differ based on the ownership and organizational structure of different customers.
6. What are the various ways to change the DWU allocation in Azure Synapse Analytics?
DWUs and cDWUs are the basic units for measuring performance in Azure Synapse Analytics. In this post, let's look at the various methods available in Synapse Analytics to update the DWU allocation.
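The first method uses PowerShell. The command itself is not reproduced here, so the following is a sketch using the Az module's Set-AzSqlDatabase cmdlet with the database, server, and service objective names from the explanation below; the resource group name is a placeholder:

```powershell
# Scale the dedicated SQL pool AzureDESQLDW on server AzureDEServer to DW1000c.
# The resource group name is a placeholder.
Set-AzSqlDatabase -ResourceGroupName "my-resource-group" `
    -ServerName "AzureDEServer" `
    -DatabaseName "AzureDESQLDW" `
    -RequestedServiceObjectiveName "DW1000c"
```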
The above command sets the service level objective to DW1000c.
Explanation:
AzureDESQLDW – Database for which we would like the DWU allocation to be updated.
AzureDEServer – Name of the server where the database has been created.
DW1000c – Number of DWUs to be allocated.
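The second method uses T-SQL. The command is not reproduced here, so the following is a sketch built from the names in the explanation below:

```sql
-- Run while connected to the master database.
ALTER DATABASE AzureDESQLDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');
```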
Please note that the above SQL command should be run after connecting to the master SQL database.
Explanation:
AzureDESQLDW – Database for which we would like the DWU allocation to be updated.
DW1000c – Number of DWUs to be allocated.
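The third method uses the Azure CLI. The command is not reproduced here, so the following is a sketch of the az sql dw update command using the names from the explanation below:

```
az sql dw update \
  --name azuredeSQLDB \
  --server azurede_server \
  --resource-group rg_azurede \
  --service-objective DW1000c
```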
The above command updates the DWU allocation to DW1000c for the database azuredeSQLDB, hosted on the server azurede_server, which is provisioned in the rg_azurede resource group.
Explanation:
azuredeSQLDB – Database for which we would like the DWU allocation to be updated.
azurede_server – Name of the server where the database has been created.
rg_azurede – Resource group name for the database server.
DW1000c – Number of DWUs to be allocated.
7. What are serverless SQL pools?
Serverless SQL pool is a SQL query service built into Synapse Analytics. It enables users to query non-relational data sources without provisioning the underlying hardware or software resources. (Internally, it uses pooled resources, hence the name serverless SQL pool.) A default service endpoint is provided inside the Azure Synapse Analytics instance.
The major benefit, as well as the primary use case, of serverless SQL pool is quick data discovery and exploration. This feature enables data engineers to quickly assess the structure, quality, and volume of data in non-relational formats (CSV, JSON, Parquet, etc.) stored in the data lake using familiar T-SQL syntax.
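A typical data-discovery query over files in the data lake looks like the following. This is a sketch: the storage account, container, and path are placeholders:

```sql
-- Explore Parquet files in the data lake using familiar T-SQL.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS rows;
```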
There are also other secondary use cases for serverless SQL pool, such as the following:
Apart from enabling data engineers and data scientists to perform ad-hoc querying, serverless SQL pool also provides out-of-the-box connectivity for any tool that can establish a TDS connection.
8. What are dedicated SQL pools?
We are going to discuss Dedicated SQL Pool (formerly known as SQL DW) in Azure Synapse Analytics. Microsoft launched Azure Synapse Analytics as a platform that reflects the changing analytics landscape. Today's analytics workloads and requirements have evolved from the traditional data warehousing concept of nightly ETL (Extract, Transform, Load) to hybrid scenarios that may include real-time data streaming and analyzing massive amounts of data stored in various formats.
Dedicated SQL pool (formerly SQL DW) is a collection of features that enables the implementation of a more traditional enterprise data warehousing platform using Azure Synapse Analytics. As the name suggests, a dedicated SQL pool is a collection of resources, measured in Data Warehousing Units (DWUs), provisioned using Synapse SQL.
Dedicated SQL pool uses relational tables with columnar storage, which enables it to store and analyze massive amounts of data. The benefit of columnar storage is that it reduces the amount of storage required, thereby reducing overall storage costs while improving query performance. Dedicated SQL pool uses PolyBase to import data from sources such as Hadoop and Spark.
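As a small illustration of the columnar, distributed storage described above, a typical dedicated SQL pool table is created with an explicit distribution method and a clustered columnstore index. The table and column names here are hypothetical:

```sql
-- Hash-distributed fact table stored in columnar (columnstore) format.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT        NOT NULL,
    CustomerId INT           NOT NULL,
    SaleDate   DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),  -- spreads rows across distributions by CustomerId
    CLUSTERED COLUMNSTORE INDEX       -- columnar storage for compression and scan speed
);
```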
9. What are DWUs?
Dedicated SQL pools in Azure Synapse Analytics (Synapse SQL pool for short) are optimized for the implementation of a more traditional enterprise data warehouse. To enable this, a Synapse SQL pool is provisioned using a combination of three bundled resources: CPU, memory, and IO. These bundled units are called DWUs (Data Warehousing Units). In other words, DWUs are standardized units of performance for a Synapse SQL pool.
This means that to increase performance, we have to increase the number of DWUs provisioned. Going from a lower service level to a higher one increases the number of DWUs provisioned, and vice versa. To keep things simple, changing the DWUs has no effect on storage costs, since storage is billed separately.
10. What are cDWUs? What is the difference between DWUs and cDWUs?
Besides DWUs, other settings are provided to optimize the cost and performance of a Synapse SQL pool. One such setting is the Service Level Objective (SLO). The service levels for Synapse SQL pool Gen2 are measured in cDWUs (Compute Data Warehousing Units), represented by appending a "c" to the end of the unit number, e.g. DW2000c (DW2000 in Gen1).
To summarize:
Both Gen1 and Gen2 support pausing the data warehouse compute (DWU/cDWU) resources when the data warehouse is not in use. This can be done on demand by issuing a Synapse SQL or PowerShell command. These will be discussed in a future post.
Another important point to note is that Gen2 uses a local disk-based cache, which is invalidated every time the Synapse SQL pool is paused or scaled (cDWU allocation increased or decreased). So, with Gen2, after resumption (unpause or cDWU allocation update) there is a period of cache warming before the performance of the Synapse SQL pool returns to an optimal level.
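As a sketch of the on-demand pause mentioned above, compute for a dedicated SQL pool can be suspended with the Az PowerShell module. The resource group, server, and database names below are placeholders:

```powershell
# Pause compute (DWU/cDWU) billing for the pool; storage remains billed.
Suspend-AzSqlDatabase -ResourceGroupName "my-resource-group" `
    -ServerName "my-server" `
    -DatabaseName "mySynapseSQLPool"
```

The corresponding Resume-AzSqlDatabase cmdlet resumes the pool, after which the Gen2 cache-warming period described above applies.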
11. How do you estimate the costs before starting an Azure Synapse Analytics project?
Before starting a project, it is very important to understand and plan for the estimated costs of creating and running an Azure Synapse Analytics instance. The Azure Pricing Calculator provides an easy-to-use interface where users can input their numbers and get an overall estimate of the associated costs.
To use the Azure Pricing Calculator correctly, one needs to understand the various resource types that are provisioned while creating and running a Synapse Analytics instance.
Some of the resource instances and numbers that need to be input in the Azure Pricing Calculator are as follows:
It should be noted that Azure Synapse Analytics runs on associated Azure infrastructure that is provisioned along with the Synapse Analytics instance.
One of the key pieces of infrastructure linked to an Azure Synapse Analytics instance is Azure Data Lake Storage Gen2. It is easy to miss the associated costs for Azure Data Lake Storage while estimating costs for Azure Synapse Analytics.
Also, the Azure portal shows a summary tab with the estimated costs on the create-workspace page for Azure Synapse Analytics.