1. What are the various windowing functions in Azure Stream Analytics?
Azure Stream Analytics is designed to handle real-time, high-velocity streaming data. To efficiently analyze streaming data, we need to create batches or groups of the incoming data items. This is where windowing functions come in.
A window, in the Azure Stream Analytics context, means a block of time-stamped event data (e.g. IoT telemetry, web clickstream, etc.) on which users can perform various statistical operations (most commonly, aggregations).
To partition and analyze a window in Azure Stream Analytics, four types of windowing functions are available: Tumbling window, Hopping window, Sliding window and Session window.
To understand this concept (taking the Session window as an example), let's have a look at the diagram below:
A new window begins when the first event arrives. The window is kept open for the specified timeout period, counting from the arrival time of the most recent event. This works like a countdown for closing the window: if a new event arrives within the timeout period, the countdown is reset to the timeout period; otherwise the window is closed.
If events keep arriving before the countdown hits 0, the window keeps growing until the maximum duration (specified at the time of defining the window) is reached. Another important point to note is that the check for the maximum duration is done at the defined interval, e.g. if the max duration is set to 10 minutes, the check whether the current window has reached the max duration happens at the 10th, 20th, 30th minute and so on.
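As a minimal sketch of how a windowing function is used in a Stream Analytics query (the input name IoTInput, the output name SqlOutput and the DeviceId column are assumptions), the following query counts events per device using a Session window with a 5-minute timeout and a 10-minute maximum duration:

-- Counts events per device; the session closes 5 minutes after the last event
-- arrives, or when the 10-minute maximum duration is reached.
SELECT
    DeviceId,
    COUNT(*) AS EventCount
INTO SqlOutput
FROM IoTInput
GROUP BY DeviceId, SessionWindow(minute, 5, 10)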
2. What data security options are available in Azure SQL DB?
Azure SQL Database provides layered security options for databases on Azure. The main ones are: network security (server-level and database-level firewall rules, virtual network rules), access management (SQL authentication and Azure Active Directory authentication), threat protection (auditing and Advanced Threat Protection), and information protection (Transparent Data Encryption, Always Encrypted and dynamic data masking).
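As a minimal sketch of one of these options, Transparent Data Encryption can be turned on with T-SQL (on Azure SQL Database it is typically enabled by default; the database name MyAppDb is an assumption):

-- Enable Transparent Data Encryption (TDE) for the hypothetical database MyAppDb.
ALTER DATABASE MyAppDb SET ENCRYPTION ON;

-- Check the encryption state (3 = encrypted).
SELECT database_id, encryption_state
FROM sys.dm_database_encryption_keys;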
3. Which service would you use to create a Data Warehouse in Azure?
Azure Synapse Analytics
4. Can you explain the architecture of Azure Synapse Analytics?
Azure Synapse Analytics is designed to process huge amounts of data. Tables in Synapse Analytics can sometimes have hundreds of millions of rows. Even with such huge data loads, Synapse Analytics processes complex queries and returns the query results within seconds. This is possible because Synapse SQL runs on a Massively Parallel Processing (MPP) architecture, where the processing of data is distributed across multiple nodes.
Applications connect to a control node. As the name suggests, the control node acts as the point of entry to the Synapse Analytics MPP engine. After receiving a Synapse SQL query, the control node breaks it down into an MPP-optimized format. The individual operations are then forwarded to the compute nodes, which can perform the operations in parallel, resulting in much better query performance.
To implement this parallel processing of Synapse SQL, the Data Movement Service (DMS), an internal service, manages the data movement across compute nodes as and when required.
The term for provisioned resources in Synapse Analytics is a Synapse SQL pool. Synapse SQL pools are scaled in Data Warehouse Units (DWUs). A DWU is an abstraction of compute power, i.e. CPU, memory and I/O. The important thing to note is that storage is not a part of the DWU, which means that storage can be scaled independently.
This approach of having compute and storage scale independently of each other comes with various benefits for users. For example, it is possible to pause the compute capacity without removing the data, so users pay only for the storage and not for the compute resources.
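As an illustration of DWU-based scaling, a dedicated SQL pool can be resized with a T-SQL statement run against the master database of the logical server; the pool name MySqlPool and the DW300c service objective below are assumptions:

-- Scale the dedicated SQL pool MySqlPool to 300 DWUs.
-- Run this against the master database, not the pool itself.
ALTER DATABASE MySqlPool
MODIFY (SERVICE_OBJECTIVE = 'DW300c');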
5. What are the data masking features available in Azure SQL Database?
Dynamic data masking plays a significant role in data security in the context of Azure SQL Database. It is a way of restricting the exposure of sensitive information to non-privileged users. Dynamic data masking is available for Azure SQL Database, Azure SQL Managed Instance and Azure Synapse Analytics.
Dynamic data masking can be defined as a security policy that is applied to SQL databases across an Azure subscription. Customers can control the level of masking as per their requirements. Dynamic data masking only masks the query results for the specific columns on which the masking has been applied; it does not affect the actual data stored in the database.
The dynamic data masking policy can be found under the Security configuration blade of the SQL Database in the Azure portal. For Azure Synapse Analytics and SQL Managed Instance, PowerShell or the REST API can be used.
To implement dynamic data masking for various scenarios, several masking functions are available: Default (full masking according to the data type), Credit card (exposes only the last four digits), Email (exposes the first letter and masks the domain), Random number (replaces a numeric value with a random value in a chosen range) and Custom text (exposes a chosen prefix and suffix and masks the middle). A sketch of applying masking with T-SQL is shown below.
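As a minimal sketch (the table, column and user names are assumptions), masking can be applied and controlled with T-SQL:

-- Mask an email column with the built-in email() function.
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Partially mask a phone number: expose nothing at the start,
-- pad with "XXX-XXX-", and expose the last 4 characters.
ALTER TABLE dbo.Customers
ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(0, "XXX-XXX-", 4)');

-- Privileged users can be granted the UNMASK permission to see the real values.
GRANT UNMASK TO DataAnalystUser;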
6. What is PolyBase? What are some use cases for PolyBase?
Azure Synapse Analytics has a Massively Parallel Processing (MPP) architecture with multiple compute nodes controlled by a control node. To take advantage of the MPP architecture, data ingestion must be parallelized. PolyBase optimizes the data ingestion into PDW (Parallel Data Warehouse). The other advantage that PolyBase offers is that it supports T-SQL. This enables developers to query external data transparently from supported data stores, irrespective of the storage architecture of the external data store.
PolyBase can be used to access data stored in Azure Blob Storage, Azure Data Lake Storage or any Hadoop instance such as Azure HDInsight.
Azure Synapse Analytics using PolyBase to access data stored in Azure Blob Storage
PolyBase uses an HDFS bridge to connect to the external data source, e.g. Azure Blob Storage. The connection is bidirectional and can be used to transfer data between Azure Synapse Analytics and the external source extremely fast.
PolyBase comes in very handy when joining data stored in the data warehouse (hosted on Azure Synapse Analytics) with an external source (e.g. Azure Blob Storage), thanks to its native support for T-SQL. This eliminates the need to retrieve the external data separately and load it into the data warehouse for further analysis.
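As a minimal sketch of such a join (the table and column names are assumptions), an external table can be queried alongside a regular table in the same T-SQL statement:

-- Join a regular dimension table in the SQL pool with an external table
-- that PolyBase exposes over files in Azure Blob Storage.
SELECT
    i.ItemName,
    SUM(s.SalesAmount) AS TotalSales
FROM dbo.FactSalesExternal AS s   -- external table (Blob Storage)
JOIN dbo.DimItem AS i             -- regular table in the SQL pool
    ON s.ItemKey = i.ItemKey
GROUP BY i.ItemName;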
Let's have a look at some typical use cases for PolyBase: loading data into Azure Synapse Analytics quickly through parallelized ingestion, querying data that lives in Hadoop, Azure Blob Storage or Azure Data Lake Storage with T-SQL without moving it, joining relational and external data in a single query, and exporting (archiving) data from the warehouse to cheaper external storage.
We have learnt the basics of PolyBase and how it makes data ingestion much faster. Now let's look at the steps that we need to perform to ingest data into Azure Synapse Analytics.
Step 1: Create a database master key
CREATE MASTER KEY;
GO
Explanation: Creating a database master key lets us encrypt the credentials (the database scoped credential) used in the next step.
Step 2: Create an external data source
CREATE EXTERNAL DATA SOURCE <data_source_name>
WITH
(
    LOCATION = '<prefix>://<path>'
    [, CREDENTIAL = <database_scoped_credential> ]
    , TYPE = HADOOP
);
Explanation: The LOCATION parameter is the path of the stored data that we want to ingest, and the optional CREDENTIAL parameter refers to a database scoped credential protected by the master key created in Step 1. A filled-in sketch is shown below.
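As a hypothetical filled-in example (the credential name, storage account, container and key placeholder are assumptions), an external data source over Azure Data Lake Storage Gen2 could be created like this:

-- Credential protected by the database master key from Step 1.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user',
     SECRET = '<storage-account-access-key>';

-- External data source pointing at an ADLS Gen2 container.
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH
(
    LOCATION = 'abfss://datacontainer@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = AzureStorageCredential,
    TYPE = HADOOP
);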
Step 3: Create an external file format
CREATE EXTERNAL FILE FORMAT TextDelimFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
Explanation: The above code creates an external file format for delimited text files with the "|" symbol as the column separator (FIELD_TERMINATOR). The following format types are available: PARQUET, ORC (Optimized Row Columnar), RCFILE, DELIMITEDTEXT (CSV) and JSON. The arguments change according to the file format selected.
Step 4: Create an external table
CREATE EXTERNAL TABLE [dbo].[DimItemExternal]
(
    [ItemKey] [int] NOT NULL,
    [ItemType] nvarchar(50) NULL,
    [ItemName] nvarchar(50) NULL
)
WITH
(
    LOCATION = '/DimItem/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextDelimFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Explanation: The above example creates an external table over the delimited files stored under the /DimItem/ folder of the data source in Azure Data Lake Store, using the file format from Step 3. The REJECT_TYPE and REJECT_VALUE parameters are provided for error handling: REJECT_VALUE is the error threshold and REJECT_TYPE can be either VALUE or PERCENTAGE. These will be discussed further in a future post about error handling.
Step 5: Create Table as Select (CTAS)
CREATE TABLE [dbo].[DimItem]
WITH
(
    DISTRIBUTION = REPLICATE,
    HEAP
)
AS
SELECT DISTINCT
    [ItemName],
    [ItemType]
FROM [dbo].[DimItemExternal];
Explanation: The CTAS command creates a new table and populates it with the results of a Transact-SQL SELECT statement against the external table created in the previous step. One of the important parameters to specify with CTAS is the table distribution (REPLICATE in the above example); large fact tables are usually better served by a hash distribution, as sketched below.
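As a hypothetical variation (the table and column names are assumptions), a large fact table would typically be loaded with a hash distribution and a clustered columnstore index:

-- Hash-distribute the rows on ItemKey and store them in columnstore format.
CREATE TABLE [dbo].[FactSales]
WITH
(
    DISTRIBUTION = HASH([ItemKey]),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM [dbo].[FactSalesExternal];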
7. What is reserved capacity in Azure Storage?
To optimize Azure Storage costs, Microsoft provides the option of reserved capacity on Azure Storage. Reserved capacity provides a fixed amount of storage capacity to customers on the Azure cloud for the period of the reservation.
Reserved capacity is priced at a discount compared to normal pay-as-you-go capacity on Azure Storage. Reserved capacity is available for Block Blob and Azure Data Lake Storage Gen2 data in standard storage accounts.
The cost savings depend on the following factors: the duration of the reservation (one year or three years), the total capacity reserved, the region, the access tier and the redundancy option of the storage account.
Azure storage
capacity reservations are available in units of 100 TB and 1 PB per month. The
reserved capacity can be for a single subscription or can be shared across
multiple subscriptions for a customer.
Azure Storage capacity reservation discounts are only applicable to the data storage charges. There will still be pay-as-you-go costs for other activities such as bandwidth and data transfer while accessing the stored data. Reservations can be purchased through the Azure portal.
8. What are pipelines and activities in Azure Data Factory? What is the difference between the two?
We have discussed Azure Data Factory in brief. Let's now discuss pipelines and the activities that can be performed using Azure Data Factory.
A pipeline is a grouping
of activities that are arranged to accomplish a task together. Pipelines help
users to manage the individual activities as a group and provide a quick and
easy overview of the activities involved in a complex task with multiple steps.
Below is Microsoft’s illustration to understand the relationship between activities and pipelines better:
There are a lot of
different activities available in Azure Data Factory. Some of the most widely
used ones are:
Copy Activity: used to copy data from a source data store to a sink data store, e.g. to import data from SQL Server into Azure.
Data Flow Activity: used to process and transform the data, e.g. before loading it into a service such as Synapse Analytics.
Azure Data Factory activities can be grouped into three categories: data movement activities, data transformation activities and control activities.
9. How do you manually execute an Azure Data Factory pipeline?
After creating Azure Data Factory activities and pipelines, we need to execute, or run, the pipeline to begin the task that the pipeline is designed to perform. A pipeline run is an instance of a pipeline execution.
The simplest way to run a pipeline is manual or on-demand execution, and there are various ways to do this. Here we will look at the PowerShell command to execute an Azure Data Factory pipeline programmatically using the new Az PowerShell module:
Invoke-AzDataFactoryV2Pipeline -DataFactory $df -PipelineName "AsdfPipeline" -ParameterFile .\PipelineParameters.json
Here "AsdfPipeline" is the name of the pipeline that is being run, and the ParameterFile parameter specifies the path of a JSON file with the source and sink paths.
The format of the
JSON file to be passed as a parameter to the above PowerShell command is shown
below:
{
"sourceBlobContainer": "MySourceFolder",
"sinkBlobContainer": "MySinkFolder"
}
Please note that Azure Data Factory pipelines can also be triggered programmatically using .NET, Python or the REST API. It is also possible to run a pipeline manually from the Azure portal.
10. What is the difference between control flow and data flow in the context of Azure Data Factory?
If you are familiar with SQL Server Integration Services (SSIS), then you already understand the difference between Control Flow and Data Flow. The same concepts apply to Azure Data Factory as well.
A Control Flow activity is an activity that affects the path of execution of the Data Factory pipeline, e.g. the ForEach activity, which creates a loop, or the If Condition activity, which branches the execution when conditions are met.
Data Flow transformations are used where we need to transform the input data, e.g. Join or Conditional Split.
Let's look at some of the differences between Control Flow activities and Data Flow transformations:

Control Flow Activity                  | Data Flow Transformation
---------------------------------------|-----------------------------------
Affects the execution sequence or path | Transforms the ingested data
Can be recursive                       | Non-recursive
No source/sink                         | Source and sink are required
Implemented at the pipeline level      | Implemented at the activity level
We will have a look
at various control flow activities in a future post. For now, please refer to
the links below to get the current list.
11. What are the various Data Flow partitioning schemes available in Azure Data Factory?
Partitioning Schemes are a way to optimize the performance of Data Flow. This setting can be accessed on the Optimize tab in the Configuration panel for the Data Flow Activity.
Microsoft recommends leaving the setting at the default "Use current partitioning" in most cases. This instructs Data Factory to use the native partitioning scheme.
The Single Partition option is used when users want to output to a single destination, e.g. a single file in ADLS Gen2.
As we can see from the screenshot above, multiple partition types become available after selecting the Set Partitioning radio button. The available partition schemes are Round Robin, Hash, Dynamic Range, Fixed Range and Key.
Microsoft recommends testing each of these partitioning schemes and comparing partition usage statistics using the monitoring view.
12. What is Azure Table storage? How is it different from other storage types in Azure?
Azure Storage platform is Microsoft’s cloud storage solution. Like other
cloud offerings, it is a managed service and caters to various storage
requirements and scenarios.
It includes the following storage options: Azure Blobs, Azure Files, Azure Queues, Azure Tables and Azure Disks.
Azure Table Storage is the storage service optimized for storing structured data. Table entities are the basic units of data that can be stored in an Azure Table Storage table. Table entities are equivalent to rows in a relational database table. In other words, entities are collections of properties, and each property is represented as a key-value pair. The three main system properties for table entities are the following:
PartitionKey: This property stores the partition key of the partition that the table entity belongs to. If two or more entities have the same partition key, they will be stored in the same partition. Partitions will be discussed in more detail in a future post.
RowKey: A partition can have multiple table entities. The RowKey property contains the key which uniquely identifies the entity within the partition.
Timestamp: This property stores the last-modified date/time value for the table entity. An alternative name for this property is version. The Azure Table service maintains the value of the Timestamp property during all insert and update operations.
The PartitionKey and RowKey together uniquely identify the table entity across Azure Table Storage. In other words, PartitionKey and RowKey form the composite primary key for the table entity. These two properties create a clustered index on the table: the data is sorted in ascending order by the PartitionKey and then the RowKey values.
Let's discuss some concepts about table partitions.
A partition is identified by its PartitionKey value: all entities with the same PartitionKey value belong to the same partition. Each partition is served by a partition server, and one partition server can serve multiple partitions. A partition server can serve a limited number of entities from a partition over time; this rate is around 2,000 entities per second and can vary depending on the load on the partition.
To scale Azure Table storage optimally, it is recommended to create multiple partitions. Since the entity serve rate of a partition depends on the load on the partition server, Azure Storage uses a load-balancing strategy to spread traffic evenly across multiple partition servers. Having multiple partitions therefore helps improve performance through better load balancing.
Another important concept related to partitioning in Azure Table storage is range partitions. This will be discussed in a future post.
13. What are partition sets in Azure Cosmos DB?
Partition sets are geographically distributed structures that are used to manage physical partitions using a set of keys. Unlike replica sets, which are scoped within a cluster, partition sets are scoped to the whole region and might span multiple clusters and data centres, as shown in the diagram below:
From the diagram above, partition sets can be thought of as "super replica sets" with a more widespread geographical distribution. Partition sets contain multiple replica sets with the same set of keys.
Similar to replica sets, partition sets dynamically manage their membership. Partition set membership is affected by events related to partitions, such as adding or removing a partition, adding or removing a region in Cosmos DB, failures, etc. The membership management is decentralized, meaning each partition set manages its membership independently of other partition sets.
There is an option
to configure Cosmos DB with either one or multiple write regions. Depending on
the write region configuration, partition sets can either accept writes in
exactly one or multiple regions.
To summarize, both replica sets and partition sets are integral to partition management in Cosmos DB. The dynamic membership feature of replica sets and partition sets provides high availability and scalability to Cosmos DB.
14. What is a watermark in Azure Stream Analytics?
Let's first look at some important concepts related to handling time in Azure Stream Analytics.
Event time: The instant in time when an event occurs.
Processing time: The instant in time when the ingested event reaches the processing stage. Typically, the event processing system takes a few moments to process the event data once it has been ingested.
Watermark: An event-time
marker which is a measure of the extent to which the events have entered the
streaming processor in Azure Stream Analytics. Since the incoming event stream
never stops, watermarks provide a progress checkpoint up to a certain point in
the stream.
Further, Azure Stream Analytics gives users the option to choose between two event times:
1. Arrival time: Arrival time is the time assigned to the event when it reaches the Stream Analytics input. By default, arrival time is used for processing events. Arrival time is exposed through a property that depends on the type of input used, e.g. EventEnqueuedUtcTime for Event Hubs, IoTHub.EnqueuedTime for IoT Hub and BlobLastModifiedUtcTime for Blob storage inputs.
2. Application time: Application time is included in the event payload; it is the time assigned to the event when it is generated. Application time can be used during event processing by adding the TIMESTAMP BY clause to the SELECT query.
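As a minimal sketch (the input name, output name and payload columns are assumptions), the TIMESTAMP BY clause tells the job to use the application time carried in the EventTime field instead of the arrival time:

-- Average temperature per device over 1-minute tumbling windows,
-- using the application time (EventTime) from the payload.
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature
INTO SqlOutput
FROM IoTInput TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 1)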
15. What are some optimization best practices for Azure Stream Analytics?
We will discuss the options available in Azure Stream Analytics to output results to an Azure SQL Database and how to tune the configuration settings to achieve optimal throughput.
Depending on the
SQL database schema design, indexing and compression, the easiest way to
enhance performance for the Azure Stream Analytics job is to create multiple
output partitions, with each output partition loading data into an Azure SQL
table partition. This means the write operations happen in parallel and a fully
parallel job topology can be achieved.
Further, let’s have
a look at some configuration options to optimize performance:
Inherit
Partitioning: With this configuration setting, it is possible to inherit the
partition scheme of the previous query step or the Stream Analytics input.
Batch Size: This is the maximum number of records sent with every bulk insert transaction. This configuration setting is available for the Azure Stream Analytics output. The best value depends on the nature of the destination table: for a table with a clustered columnstore index, a batch size of around 100,000 records is optimal, whereas for disk-based (rowstore) tables, a value of 10,000 or lower is recommended.
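As a minimal sketch of the large-batch case (the table and column names are assumptions), the destination table would carry a clustered columnstore index:

-- Destination table for large bulk inserts (~100,000 rows per batch)
-- from the Stream Analytics SQL output.
CREATE TABLE dbo.StreamOutput
(
    DeviceId NVARCHAR(50) NOT NULL,
    WindowEnd DATETIME2 NOT NULL,
    AvgTemperature FLOAT NULL,
    INDEX cci_StreamOutput CLUSTERED COLUMNSTORE
);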
Input Message Tuning: This configuration setting in Azure Stream Analytics increases the number of input events per message per partition, thereby increasing the write throughput.
Let’s continue the
discussion and have a look at some other optimization best practices.
Partitioned Tables and Indexes: A good way to ensure that there are no bottlenecks during partitioned writes is to use a partitioned SQL table with partitioned indexes. It is also recommended to use the same column as the partition key for both the table and the indexes.
Avoid Unique Key Violations: Unique key violations generate warning messages in the Azure Stream Analytics activity log. Most unique key violations occur during recovery cases, i.e. instances where Azure Stream Analytics jobs have to recover after a stoppage or failure. This will be discussed further in a future post.
In-Memory Table as a Temp Table: In-memory tables can be used as staging tables and improve performance by roughly 10 times compared to disk-based SQL tables. Azure Data Factory can then be used to move data from the in-memory table to the final SQL table. The main limitation of in-memory tables is size: they need to fit in memory, which makes them unsuitable for huge data loads. A sketch of such a staging table is shown below.
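As a minimal sketch (the table and column names are assumptions, and memory-optimized tables require a Premium or Business Critical Azure SQL Database tier), a non-durable, memory-optimized staging table could look like this:

-- Non-durable staging table: data lives only in memory (SCHEMA_ONLY),
-- which is acceptable for a transient staging step.
CREATE TABLE dbo.StagingEvents
(
    EventId BIGINT NOT NULL PRIMARY KEY NONCLUSTERED,
    DeviceId NVARCHAR(50) NOT NULL,
    EventTime DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);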