Azure Data Engineering Interview Questions Part2

By Daniel A G

1.What are the various windowing functions in Azure Stream Analytics?

Azure Stream Analytics is designed to handle real-time, high-velocity streaming data. To analyze streaming data efficiently, we need to create batches or groups of the incoming data items. This is where windowing functions come in.

A window, in the Azure Stream Analytics context, is a block of time-stamped event data (e.g. IoT telemetry, web clickstream etc.) that enables users to perform various statistical operations (most commonly, aggregations) on the event data.

To partition and analyze a window in Azure Stream Analytics, four types of windowing functions are available:

  1. Tumbling Window: This is the easiest to understand of all the Azure Stream Analytics windowing functions. In a tumbling window, the data stream is segmented into distinct, fixed-length time segments that do not overlap. This can be easily understood by the diagram below. (Example queries for all four window types are sketched at the end of this answer.)


  2. Hopping Window: Hopping windows are like tumbling windows (both have fixed-duration segments), but hopping windows can overlap. So, while defining a hopping window, there are two parameters that we need to specify: the window size (length of the data segment) and the hop size (the interval at which a new window starts; windows overlap whenever the hop is smaller than the window size). In the example below, the window size is 10 seconds and the hop size is 5 seconds.


  3. Sliding Window: Unlike the tumbling and hopping window functions, this windowing function does not necessarily produce an aggregation after a fixed time interval. Instead, aggregation occurs every time a new event arrives or an existing event falls out of the time window.

To understand this concept, let's have a look at the diagram below:

  • Window a: aggregation occurs when the first event (1) arrives at the 10th second.
  • Window b: when the next event (5) arrives at the 12th second, another aggregation occurs over the interval from 2 secs to 12 secs.
  • Window c: when the next two events (9 and 7) arrive at the 15th second, aggregation occurs over the interval from 5 secs to 15 secs.
  • Window d: at the 20th second, event 1 (which arrived at the 10th sec) “drops out” of the 10-second aggregation window, triggering a new aggregation, and the process continues.

  4. Session Window: This function groups events based on their time of arrival, so there is no fixed window size. Instead, there are three parameters: timeout, maximum duration and partitioning key (optional). The purpose of the session window is to eliminate quiet periods in the data stream, i.e. time periods when no events arrive.

A new window begins when the first event arrives. The window is kept open for the specified timeout period, counted from the arrival time of the most recent event; this acts like a countdown for closing the window. If a new event arrives within the timeout period, the countdown is reset to the timeout period; otherwise the window is closed.

If events keep arriving before the countdown hits zero, the window keeps growing until the maximum duration (specified when defining the window) is reached. Another important point to note is that the check for the maximum duration is done at that defined interval, e.g. if the max duration is set to 10 minutes, then the check whether the current window has reached the maximum duration happens at the 10th, 20th, 30th minute and so on.
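The windowing functions are used in the GROUP BY clause of a Stream Analytics query. Below is a minimal sketch covering all four window types; the input name, output names and column names are placeholders, the window sizes are just examples, and in a real job each query would typically be a separate step writing to its own output.

-- Tumbling window: one aggregate every 10 seconds, no overlap
SELECT DeviceId, COUNT(*) AS EventCount
INTO [TumblingOutput]
FROM [EventHubInput]
GROUP BY DeviceId, TumblingWindow(second, 10)

-- Hopping window: 10-second windows that start every 5 seconds (overlapping)
SELECT DeviceId, COUNT(*) AS EventCount
INTO [HoppingOutput]
FROM [EventHubInput]
GROUP BY DeviceId, HoppingWindow(second, 10, 5)

-- Sliding window: recomputed whenever an event enters or leaves the last 10 seconds
SELECT DeviceId, COUNT(*) AS EventCount
INTO [SlidingOutput]
FROM [EventHubInput]
GROUP BY DeviceId, SlidingWindow(second, 10)

-- Session window: 5-minute timeout, 30-minute maximum duration, partitioned by DeviceId
SELECT DeviceId, COUNT(*) AS EventCount
INTO [SessionOutput]
FROM [EventHubInput]
GROUP BY DeviceId, SessionWindow(minute, 5, 30) OVER (PARTITION BY DeviceId)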

2.What data security options are available in Azure SQL DB?

This post is going to discuss the security features for databases on Azure. Let’s have a look:

  1. Azure SQL Firewall Rules: When it comes to firewall rules, Azure provides two levels of security. Server-level firewall rules are stored in the SQL master database and determine access to the Azure database server. Users can also create database-level firewall rules, which govern access to individual databases. (A hedged T-SQL sketch follows this list.)
  2. Azure SQL Always Encrypted: The Always Encrypted feature is designed to protect sensitive data, such as credit card numbers, stored in Azure SQL Database. With Always Encrypted, data is encrypted within the client application itself using an Always Encrypted-enabled driver. The encryption keys are not shared with Azure SQL Database, which means that database admins do not have access to the sensitive data.
  3. Azure SQL Transparent Data Encryption (TDE): TDE is the technology used to encrypt data at rest in Azure SQL Database. TDE is also available for Azure SQL Managed Instance and Azure Synapse Analytics. With TDE, the encryption and decryption of the database, backups and transaction log files happens in real time.
  4. Azure SQL Database Auditing: Azure provides comprehensive auditing capabilities within the SQL Database service. Audit policies can be applied at the database server level, in which case they cascade to all the databases on the server. It is also possible to define the audit policy at the individual database level, giving users the flexibility to choose based on their requirements.
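As a minimal sketch of the firewall rules mentioned in point 1 (the rule names and IP ranges below are placeholders), server-level rules are created with sp_set_firewall_rule in the master database, while database-level rules use sp_set_database_firewall_rule in the user database:

-- Server-level rule: run while connected to the master database
EXECUTE sp_set_firewall_rule
    @name = N'AllowOfficeNetwork',
    @start_ip_address = '203.0.113.0',
    @end_ip_address = '203.0.113.255';

-- Database-level rule: run while connected to the individual user database
EXECUTE sp_set_database_firewall_rule
    @name = N'AllowReportingServer',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';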

 

3.Which service would you use to create a Data Warehouse in Azure?

Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is the service used to create an enterprise data warehouse in Azure.

4.Can you explain the architecture of Azure Synapse Analytics?

Azure Synapse Analytics is designed to process huge amounts of data. Tables in Synapse Analytics can have hundreds of millions of rows, yet even with such data volumes, Synapse Analytics processes complex queries and returns the results within seconds. This is possible because Synapse SQL runs on a Massively Parallel Processing (MPP) architecture in which the processing of data is distributed across multiple nodes.


Applications connect to a control node. As the name suggests, the control node acts as the point of entry to the Synapse Analytics MPP engine. After receiving a Synapse SQL query, the control node breaks it down into an MPP-optimized format and forwards the individual operations to the compute nodes. The compute nodes perform these operations in parallel, resulting in much better query performance.

To support this parallel processing of Synapse SQL, an internal service called the Data Movement Service (DMS) moves data across the compute nodes as and when required.

The term for provisioned SQL resources in Synapse Analytics is a Synapse SQL pool. Synapse SQL pools are scaled in Data Warehouse Units (DWUs). A DWU is an abstraction of compute power, i.e. CPU, memory and IO. The important thing to note is that storage is not part of a DWU, which means storage can be scaled independently.

This approach of scaling compute and storage independently of each other brings various benefits. For example, it is possible to pause the compute capacity without removing the data, so users pay only for storage and not for compute resources.
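As a minimal illustration of independent compute scaling (the pool name and DWU target below are placeholders), the service objective of a dedicated SQL pool can be changed with T-SQL while connected to the master database:

-- Scale the dedicated SQL pool 'MySqlPool' to 300 DWUs
ALTER DATABASE MySqlPool
MODIFY (SERVICE_OBJECTIVE = 'DW300c');

Pausing and resuming the pool (which stops compute billing while keeping the data) is done through the Azure portal, PowerShell or the REST API rather than T-SQL.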

5.What are the data masking features available in Azure SQL Database?

Dynamic data masking plays a significant role in data security for Azure SQL Database. It is a way of limiting the exposure of sensitive information to non-privileged users. Dynamic data masking is available for Azure SQL Database, Azure SQL Managed Instance and Azure Synapse Analytics.

Dynamic data masking is implemented as a security policy that can be applied to SQL databases across an Azure subscription, and customers can control the level of masking as per their requirements. Dynamic data masking only masks the query results for the specific columns on which masking has been applied; it does not affect the actual data stored in the database.

The dynamic data masking policy can be found under the SQL Database Security configuration blade in the Azure portal. For Azure Synapse Analytics and SQL Managed Instance, PowerShell or the REST API can be used.

To implement dynamic data masking for various scenarios, several masking functions are available, such as default, email, random number and custom string (partial) masks:
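Below is a minimal T-SQL sketch applying some of these masking functions; the table, column and user names are hypothetical:

-- Email mask: shows the first letter and replaces the domain
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Custom string (partial) mask: expose only the last 4 digits of a card number
ALTER TABLE dbo.Customers
ALTER COLUMN CreditCardNumber ADD MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)');

-- Random number mask for a numeric column
ALTER TABLE dbo.Customers
ALTER COLUMN MonthlyIncome ADD MASKED WITH (FUNCTION = 'random(1, 9999)');

-- Privileged users can be granted the right to see unmasked data
GRANT UNMASK TO ReportingUser;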



6.What is PolyBase? What are some use cases for PolyBase?

Azure Synapse Analytics has a Massively Parallel Processing (MPP) architecture with multiple compute nodes controlled by a control node. To take advantage of the MPP architecture, data ingestion must be parallelized, and PolyBase optimizes data ingestion into the data warehouse. The other advantage PolyBase offers is its support for T-SQL, which enables developers to query external data transparently from supported data stores, irrespective of the storage architecture of the external data store.

Polybase can be used to access data stored in Azure Blob Storage, Azure Data Lake Storage or any Hadoop instance such as Azure HDInsight.


Azure Synapse Analytics using PolyBase to access data stored in Azure Blob Storage



PolyBase uses an HDFS bridge to connect to the external data source, e.g. Azure Blob Storage. The connection is bidirectional and can be used to transfer data between Azure Synapse Analytics and the external source very quickly.

PolyBase comes in very handy when joining data stored in the data warehouse (hosted on Azure Synapse Analytics) with an external source (e.g. Azure Blob Storage), thanks to its native support for T-SQL. This eliminates the need to retrieve the external data separately and load it into the data warehouse for further analysis.

Let’s have a look at the use cases for PolyBase:

  • Query data stored in Hadoop, Azure Blob Storage or Azure Data Lake Store from Azure SQL Database or Azure Synapse Analytics – This eliminates the need to import data from the external source. Also, PolyBase supports T-SQL which makes it easy for developers to query external data.
  • Import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store – No need to install a third-party ETL tool; this can be achieved with a few simple T-SQL queries
  • Export data to Hadoop, Azure Blob Storage, or Azure Data Lake Store – Supports export and archiving of data to external data stores

 

We have now covered the basics of PolyBase and how it makes data ingestion much faster. Next, let's look at the steps that we need to perform to ingest data into Azure Synapse Analytics.

Step 1 : Create a Database Master key

CREATE MASTER KEY;

GO

Explanation: Creating a database master key helps us encrypt the login credentials later.

Step 2: Create external data source

CREATE EXTERNAL DATA SOURCE <data_source_name>
WITH
(   LOCATION      = '<prefix>://<path>'
 [, CREDENTIAL    = <database_scoped_credential> ]
  , TYPE          = HADOOP
)

Explanation: the LOCATION parameter is the path to the stored data that we want to ingest, and TYPE = HADOOP is used for PolyBase sources such as Azure Blob Storage, Azure Data Lake Store and Hadoop.

Step 3:  Create external file format

CREATE EXTERNAL FILE FORMAT TextDelimFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT
  , FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
                   , STRING_DELIMITER = ''
                   , DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
                   , USE_TYPE_DEFAULT = FALSE )
);

Explanation: The above code creates an external file format for a delimited text (CSV-style) file with the “|” symbol as the column separator (FIELD_TERMINATOR). The following format types are available: PARQUET, ORC (Optimized Row Columnar), RCFILE, DELIMITEDTEXT (CSV) and JSON; the arguments change according to the file format selected.

Step 4: Create external table

CREATE EXTERNAL TABLE [dbo].[DimItemExternal]
(
    [ItemKey]  [int]    NOT NULL,
    [ItemType] nvarchar NULL,
    [ItemName] nvarchar NULL
)
WITH
(
    LOCATION     = '/DimItem/'
  , DATA_SOURCE  = AzureDataLakeStore
  , FILE_FORMAT  = TextDelimFormat
  , REJECT_TYPE  = VALUE
  , REJECT_VALUE = 0
);

Explanation: The above example connects to the CSV file stored in Azure Data Lake Store; DATA_SOURCE references the external data source from Step 2 and FILE_FORMAT references the external file format created in Step 3. The REJECT_TYPE and REJECT_VALUE parameters are provided for error handling: REJECT_VALUE is the error threshold and REJECT_TYPE can be either VALUE or PERCENTAGE. These will be discussed further in a future post about error handling.

Step 5: Create Table as Select (CTAS)

CREATE TABLE [dbo].[DimItem]
WITH
(
    DISTRIBUTION = REPLICATE,
    HEAP
)
AS
SELECT DISTINCT
       [ItemName]
     , [ItemType]
FROM [dbo].[DimItemExternal];

Explanation: The CTAS command creates a new table and populates it with the results of a Transact-SQL SELECT statement, here reading from the external table created in the previous step. One of the important parameters to specify with CTAS is the table distribution (REPLICATE in the above example).

7.What is reserved capacity in Azure Storage?

To optimize Azure Storage costs, Microsoft provides the option of reserved capacity on Azure Storage. Reserved capacity provides a fixed amount of storage capacity to customers on the Azure cloud for the period of the reservation.

Reserved capacity is priced at a discount compared to pay-as-you-go capacity on Azure Storage. Reserved capacity is available for block blob and Azure Data Lake Storage Gen2 data in standard storage accounts.

The cost savings depend on the following factors:

  • Duration of reservation – This can be either one year or three years
  • Storage capacity reserved – The amount of storage capacity reserved (100 TB or more)
  • Access tier for the reserved capacity – Hot, Cool or Archive; to learn more about Azure Storage access tiers, read this post
  • Type of redundancy chosen – All types of Azure Storage redundancy are available; to learn more about redundancy options, read this post

Azure storage capacity reservations are available in units of 100 TB and 1 PB per month. The reserved capacity can be for a single subscription or can be shared across multiple subscriptions for a customer.

Azure Storage capacity reservation discounts apply only to data storage charges. Other associated activities, such as bandwidth and data transfer while accessing the stored data, are still billed on a pay-as-you-go basis. Reservations can be purchased through the Azure portal.

8.What are pipelines and activities in Azure Data Factory? What is the difference between the two?

We have discussed Azure Data Factory in brief earlier. This post discusses pipelines and the activities that can be performed using Azure Data Factory.

A pipeline is a grouping of activities that are arranged to accomplish a task together. Pipelines help users to manage the individual activities as a group and provide a quick and easy overview of the activities involved in a complex task with multiple steps.

Below is Microsoft’s illustration to understand the relationship between activities and pipelines better:


There are a lot of different activities available in Azure Data Factory. Some of the most widely used ones are:

Copy Activity: Used to copy data between data stores, e.g. to import data from SQL Server into Azure

Data Flow Activity: Used to process and transform data using Azure services such as Synapse Analytics

Azure Data Factory activities can be grouped into three parts:

  1. Data Movement Activities – These activities are used for ingesting data into Azure or exporting data from Azure to external data stores. The Copy activity is an example of a data movement activity.
  2. Data Transformation Activities – These activities are related to processing data and extracting information from it. The Data Flow activity is an example of a data transformation activity.
  3. Control Activities – These activities specify a condition or affect the progress of the pipeline. E.g. the ForEach activity is used for repeating control flow, while the Wait activity introduces a delay in pipeline execution.

9.How do you manually execute an Azure Data Factory pipeline?

There are various ways to manually execute ADF pipelines. One way is using PowerShell:

After creating Azure Data Factory activities and pipelines, we need to execute (run) the pipeline to begin the task it is designed to perform. A pipeline run is an instance of a pipeline execution.

The simplest way to run a pipeline is Manual or On-demand execution. There are various ways to do this. We will look at the PowerShell command to execute an Azure Data Factory Pipeline programmatically using the new Az PowerShell module:

Invoke-AzDataFactoryV2Pipeline -DataFactory $df -PipelineName "AsdfPipeline" -ParameterFile .\PipelineParameters.json

Here “AsdfPipeline” is the name of the pipeline being run, $df is the data factory object (for example, as returned by Get-AzDataFactoryV2), and the ParameterFile parameter specifies the path of a JSON file containing the source and sink paths.

The format of the JSON file to be passed as a parameter to the above PowerShell command is shown below:

{

  "sourceBlobContainer": "MySourceFolder",

  "sinkBlobContainer": "MySinkFolder"

}

Please note that Azure Data Factory pipelines can also be triggered programmatically using .NET, Python or REST API. Also, it is possible to run the pipeline manually from the Azure portal. 

10.What is the difference between control flow and data flow in the context of Azure Data Factory?

If you are familiar with SQL Server Integration Services (SSIS), then you already understand the difference between Control Flow and Data Flow. The same concepts apply to Azure Data Factory as well.

A Control Flow activity is an activity that affects the path of execution of the Data Factory pipeline, e.g. the ForEach activity, which repeats a set of activities in a loop.

Data Flow transformations are used where we need to transform the input data, e.g. Join or Conditional Split.

Let's look at some of the differences between Control Flow activities and Data Flow transformations:

Control Flow Activity                                   | Data Flow Transformation
Affects the execution sequence or path of the pipeline | Transforms the ingested data
Can be recursive                                        | Non-recursive
No source/sink                                          | Source and sink are required
Implemented at the pipeline level                       | Implemented at the activity level

We will have a look at the various control flow activities in a future post.

11.What are the various Data Flow partitioning schemes available in Azure Data Factory?

Partitioning schemes are a way to optimize the performance of data flows. This setting can be accessed on the Optimize tab of the configuration panel for the Data Flow activity.


Microsoft recommends leaving the setting at the default “Use current partitioning” in most cases. This instructs the data factory to keep the native partitioning of the data.

The Single Partition option is used when users want to output to a single destination, e.g. a single file in ADLS Gen2.

As we can see from the screenshot above, there are multiple partition types available after selecting the Set Partitioning radio button. Let's have a look at these partition schemes:

  • Round Robin – distributes the data evenly across the specified number of partitions
  • Hash – partitions rows based on a hash of the selected columns, so that similar values land in the same partition
  • Dynamic Range – uses Spark dynamic ranges over the selected columns to determine partition boundaries
  • Fixed Range – partitions data based on fixed range expressions provided by the user
  • Key – creates a partition for each unique value of the selected column


Microsoft recommends testing each of these partitioning schemes and comparing the partition usage statistics in the monitoring view.

12.What is Azure Table storage? How is it different from other storage types in Azure?

The Azure Storage platform is Microsoft's cloud storage solution. Like other cloud offerings, it is a managed service and caters to various storage requirements and scenarios.

It includes the following storage options:

  1. Azure Blobs: Blob stands for binary large object. This storage solution supports all kinds of files, including text files, videos, images, documents, binary data etc.
  2. Azure Files: Azure Files is an organized way of storing data in the cloud. The main advantage of using Azure Files over Azure Blobs is that Azure Files allows you to organize the data in a folder structure. Also, Azure Files is SMB compliant, i.e. it can be used as a file share.
  3. Azure Queues: Azure Queues is a cloud-based messaging store for establishing and brokering communication between various applications and components.
  4. Azure Tables: A NoSQL storage solution for storing structured data that does not fit the standard relational database schema.
  5. Azure Disks: This is used as a storage solution for Azure VMs (virtual machines).

Azure Table Storage is the storage service optimized for storing structured data. Table entities are the basic units of data that can be stored in an Azure Storage table; they are equivalent to rows in a relational database table. In other words, entities are collections of properties, and each property is represented as a key-value pair. The three main system properties of a table entity are the following:

  • PartitionKey: This property stores the partition key of the partition that the table entity belongs to. If two or more entities have the same partition key, they are stored in the same partition. Partitions will be discussed in more detail in a future post.
  • RowKey: A partition can contain multiple table entities. The RowKey property contains the key that identifies the entity uniquely within the partition.
  • Timestamp: This property stores the last-modified date/time value for the table entity (an alternative name for this property is the version). The Azure Table service maintains the value of the Timestamp property during all insert and update operations.

The PartitionKey and RowKey together uniquely identify a table entity across Azure Table Storage; in other words, PartitionKey and RowKey form the composite primary key for the table entity. These two properties create a clustered index on the table, and the data is sorted in ascending order by the PartitionKey and then the RowKey values.

Let's discuss some concepts about table partitions. A partition is identified by its PartitionKey value: all entities with the same PartitionKey belong to the same partition. Each partition is served by a partition server, and one partition server can serve multiple partitions. A partition server can serve a limited number of entities from a partition over time (around 2,000 entities per second), and the actual rate varies depending on the load on the partition.

To scale Azure Table Storage optimally, it is recommended to create multiple partitions. Since the entity serve rate depends on the load on the partition server, Azure Storage uses a load-balancing strategy to spread traffic evenly across multiple partition servers. Having multiple partitions therefore helps improve performance through better load balancing.

Another important concept related to partitioning in Azure Table Storage is range partitions. This will be discussed in a future post.

13.What are partition sets in Azure Cosmos DB?

Partition sets are geographically distributed structures that manage the physical partitions associated with a given set of keys. Unlike replica sets, which are scoped within a single cluster, partition sets span regions and can include multiple clusters and data centres, as shown in the diagram below:


As the diagram above suggests, a partition set can be thought of as a “super replica set” with a much wider geographical distribution: it contains multiple replica sets that own the same set of keys.

Similar to replica sets, partition sets dynamically manage their membership. Partition set membership is affected by events related to partitions, such as adding or removing a partition, adding or removing a region in Cosmos DB, failures etc. Membership management is decentralized, meaning each partition set manages its membership independently of other partition sets.

There is an option to configure Cosmos DB with either one or multiple write regions. Depending on the write region configuration, partition sets can either accept writes in exactly one or multiple regions.

To summarize, both replica sets and partition sets are imperative to partition management in Cosmos DB. The dynamic membership feature of replica sets and partition sets, provides high availability and scalability to Cosmos DB.

14.What is watermark in Azure Stream Analytics?

Let's first look at some important concepts related to handling time in Azure Stream Analytics.

Event time: The instant in time at which an event actually occurs (happens).

Processing time: The instant in time at which the ingested event reaches the processing stage. Typically, the event processing system takes a few moments to process the event data once it has been ingested.

Watermark: An event-time marker that indicates how far events have progressed into the streaming processor in Azure Stream Analytics. Since the incoming event stream never stops, watermarks provide a progress checkpoint up to a certain point in the stream.

Further, Azure Stream Analytics gives users the option to choose between two notions of event time:

1. Arrival Time: Arrival time is the time assigned to the event when it reaches the Stream Analytics input. By default, arrival time is used for processing events. Arrival time can be used through the following properties depending on the type of input used:

  • EventEnqueuedUtcTime property for Event Hubs input
  • IoTHub.EnqueuedTime property for IoT Hub input
  • BlobProperties.LastModified property for blob input

2. Application Time: Application time is included in the event payload; it is the time assigned to the event when it is generated. Application time can be used during event processing via the TIMESTAMP BY clause in the SELECT query.
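Below is a minimal sketch of using application time; the input and output names, the payload column EventTime and the aggregation are placeholders:

-- Use the application time carried in the payload (EventTime) instead of arrival time
SELECT DeviceId, AVG(Temperature) AS AvgTemperature
INTO [SqlOutput]
FROM [EventHubInput] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 1)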

15.What are some optimization best practices for Azure Stream Analytics?

We will discuss the options available in Azure Stream Analytics for writing results to an Azure SQL Database, and how to tune the configuration settings to achieve optimal throughput.

Beyond the SQL database schema design, indexing and compression, the easiest way to enhance the performance of an Azure Stream Analytics job is to create multiple output partitions, with each output partition loading data into an Azure SQL table partition. The write operations then happen in parallel and a fully parallel job topology can be achieved.
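A minimal sketch of such a partitioned (fully parallel) query is shown below; the input, output and partition column names are placeholders, and it assumes the input (e.g. Event Hubs) is itself partitioned:

-- Write each input partition to the SQL output in parallel
SELECT *
INTO [SqlOutput]
FROM [EventHubInput]
PARTITION BY PartitionId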

Further, let’s have a look at some configuration options to optimize performance:

Inherit Partitioning: With this configuration setting, it is possible to inherit the partition scheme of the previous query step or the Stream Analytics input.

Batch Size: This is the maximum number of records sent with every bulk insert transaction; it is a configuration setting available on the Azure Stream Analytics output. The optimal value depends on the nature of the destination table: for a table with a clustered columnstore index, a batch size of around 100,000 records works well, whereas for disk-based (rowstore) tables, a value of 10,000 or lower is recommended.

Input Message Tuning: This configuration setting in Azure Stream Analytics increases the number of input events packed per message per partition, thereby increasing the write throughput.

Let’s continue the discussion and have a look at some other optimization best practices.

Partitioned Tables and Indexes: A good way to ensure that there are no bottlenecks during partitioned writes is to use a partitioned SQL table with partitioned indexes, and it is recommended to use the same column as the partition key for both the table and its indexes.
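As a rough sketch of such a destination table in Azure SQL Database (the partition boundaries, table and column names are hypothetical), the table and its index are partitioned on the same column:

-- Partition function and scheme on the event date column
CREATE PARTITION FUNCTION pfEventDate (datetime2)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME psEventDate
    AS PARTITION pfEventDate ALL TO ([PRIMARY]);

-- Destination table placed on the partition scheme
CREATE TABLE dbo.StreamEvents
(
    EventDate datetime2    NOT NULL,
    DeviceId  nvarchar(50) NOT NULL,
    Reading   float        NULL
) ON psEventDate (EventDate);

-- Index partitioned (aligned) on the same column as the table
CREATE CLUSTERED INDEX cix_StreamEvents
    ON dbo.StreamEvents (EventDate)
    ON psEventDate (EventDate);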

Avoid Unique Key Violations: Unique key violations generate warning messages in the Azure Stream Analytics activity log. Most unique key violations occur during recovery, i.e. when an Azure Stream Analytics job has to recover after a stoppage or failure. This will be discussed further in a future post.

In-Memory Table as a Temp Table: In-memory tables can be used as staging tables and can improve performance about 10 times compared to disk-based SQL tables. Azure Data Factory can then be used to move data from the in-memory table to the final SQL table. The main limitation of in-memory tables is size: they need to fit in memory, which makes them unsuitable for very large data loads.


