Azure Data Engineering Interview Questions Part3

by Daniel A G

1. What are streaming units?

Streaming Units (SUs) are a measure of the compute resources allocated to an Azure Stream Analytics job, and they are the main lever for optimizing job performance.

To optimize performance, we need to monitor and adjust the resources allocated to Stream Analytics jobs. Streaming Units are how those resources are measured: the higher the Streaming Units allocated to a job, the more CPU and memory resources are available to it.

The SU% utilization metric measures resource usage as a percentage of the allocated Streaming Units and can range from 0% to 100%. It is recommended to keep SU% utilization below 80% to accommodate future spikes, and it is best practice to set an alert at 80% so that more Streaming Units can be allocated before the job runs out of resources. A high number of backlogged input events is one indication that a Stream Analytics job requires more Streaming Units.
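As a sketch of that best practice, the 80% alert can be created with the Az.Monitor PowerShell cmdlets. This is a minimal, illustrative example only: the metric name ("ResourceUtilization") and all resource IDs are assumptions to adapt to your environment.

  # Minimal sketch: alert when SU utilization exceeds 80% over 5 minutes.
  # Assumptions: the SU % utilization metric is exposed as "ResourceUtilization";
  # subscription, resource group, and job names are placeholders.
  $jobId = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers" +
           "/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
  $criteria = New-AzMetricAlertRuleV2Criteria -MetricName "ResourceUtilization" `
      -TimeAggregation Maximum -Operator GreaterThan -Threshold 80
  Add-AzMetricAlertRuleV2 -Name "asa-su-over-80" -ResourceGroupName "<rg>" `
      -TargetResourceId $jobId -WindowSize 00:05:00 -Frequency 00:05:00 `
      -Condition $criteria -Severity 2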

Monitor Job Performance

The Azure portal provides several metrics to measure the throughput of a job. Besides adjusting the partitions and optimizing the query, one step that can be taken to improve throughput is to add more Streaming Units to the job.


2. Can you call an Azure Function from Azure Stream Analytics?

Let us see how we can run an Azure Function from an Azure Stream Analytics job.


Step 1: Create an Azure Stream Analytics job: This can be done using the Azure portal UI.

Step 2: Create an Azure Cache for Redis instance: After creating a cache, make a note of the Primary connection string under Access Keys.

Step 3: Create an Azure Function: This will be an HTTP-triggered function. For detailed steps and code for the Azure function, please see the Reference link below this post.

Step 4: Update the Stream Analytics job with the Azure Function as output: In this step, we are just adding Azure Functions as the sink for the Stream Analytics job that we created in Step 1.

The Azure Stream Analytics job should now be sending data to the Redis cache using the Azure Function that we created in Step 3.

The HTTP request timeout is set to 100 seconds. This means that if the Azure Function takes more than 100 seconds to process a batch, the Stream Analytics job will error out. If this occurs, Azure Stream Analytics retries sending the failed batch to the Azure Function.

3. What is Azure Synapse Link?

Azure Synapse Link for Azure Cosmos DB provides seamless integration between Synapse Analytics and Cosmos DB. Analytics workloads have evolved from the traditional structured data warehouse requiring ETL (Extract, Transform, Load) integration to more flexible analysis of semi-structured big data and multi-format data. Azure Synapse Link caters to this by providing a hybrid transactional and analytical processing (HTAP) capability within the Azure cloud.


Azure Cosmos DB provides a built-in analytical column store, which connects seamlessly with Azure Synapse Link. This connection gives data engineers and data scientists the ability to process and analyze transactional data with near real-time latency using Synapse SQL or Synapse Spark.

The key advantage of using Azure Synapse Link is that it provides seamless integration without the need for ETL pipelines; this approach is therefore called No-ETL integration. Changes in the transactional data store are synced automatically to the analytical store within Cosmos DB, and these changes are then reflected in Azure Synapse Analytics query results without running an ETL pipeline.

Since the analytical and transactional stores in Cosmos DB are provisioned independently of each other, querying the analytical store through Azure Synapse Link has no performance impact on the transactional workloads within Cosmos DB.
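To give a flavor of the No-ETL pattern, a Synapse serverless SQL pool can query the Cosmos DB analytical store directly with OPENROWSET. This is a minimal sketch; the account, database, container and key values are placeholders.

  -- Minimal sketch: query the Cosmos DB analytical store from a Synapse
  -- serverless SQL pool. Account, database, container and key are placeholders.
  SELECT TOP 10 *
  FROM OPENROWSET(
      'CosmosDB',
      'Account=myCosmosAccount;Database=RetailDB;Key=<account-key>',
      Orders
  ) AS orders;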

4. What are the machine learning features available in Azure Synapse Analytics?

Machine Learning is one of the most important techniques of data science and analytics. Azure Synapse Analytics supports various ways of implementing machine learning out-of-the-box.

The basic process flow of a typical machine learning project is as follows:

  1. Understanding the business processes and rules
  2. Data acquisition and ingestion
  3. Analytical model selection and training
  4. Model deployment and scoring

The output of the first step defines the basics of the project in terms of scope and expected outcomes.

Azure Synapse Analytics has robust data ingestion and acquisition features (Pipelines and close integration with Azure Data Factory).

The next step is analytical model selection and training. Azure Synapse Analytics incorporates two major ways to achieve this:

Spark MLlib: Spark MLlib can be used to train models in the more classical sense. This option is ideal for those who are already familiar with Spark MLlib.

Automated ML: If you are new to machine learning, Azure Synapse Analytics provides this automated way to train models. This method allows the user to select the best model based on certain metrics.

Apart from Spark MLlib, Synapse Analytics provides support for other popular machine learning libraries, such as scikit-learn.

The final step in a machine learning project is model deployment and scoring. Synapse Analytics supports deployment and batch scoring of analytical models trained within Azure (using one of the techniques mentioned above) or outside the Azure environment. There are two ways to run batch scoring in Azure Synapse: the T-SQL PREDICT function and Apache Spark pools for Azure Synapse.
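As an illustration of the first option, batch scoring with the T-SQL PREDICT function in a dedicated SQL pool looks roughly like the sketch below. The model and input tables here are hypothetical, and the model is assumed to be stored in ONNX format as varbinary.

  -- Minimal sketch: batch scoring with T-SQL PREDICT (ONNX runtime).
  -- dbo.Models and dbo.CustomerFeatures are hypothetical tables.
  DECLARE @model VARBINARY(MAX) =
      (SELECT Model FROM dbo.Models WHERE ModelName = 'ChurnModel');

  SELECT d.CustomerId, p.Score
  FROM PREDICT(MODEL = @model,
               DATA = dbo.CustomerFeatures AS d,
               RUNTIME = ONNX)
  WITH (Score FLOAT) AS p;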

5. What is Azure Security Benchmark?

Information security is one of the most important and crucial aspects of any enterprise-grade software system. For a cloud ecosystem like Azure, having sturdy information security (including data security) is fundamental. Microsoft has been constantly improving the robustness of the Azure platform ever since its launch.

Recently, Microsoft released the second version of its recommended security practices and configurations, called the Azure Security Benchmark, to help customers adopt recommended information security best practices across the Azure platform.

The benchmark provides design and configuration guidelines across various cloud-centric control areas, called Control Domains. These are based on well-known security benchmarks such as the Center for Internet Security (CIS) Controls Version 7.1 and the National Institute of Standards and Technology (NIST) SP 800-53.

The Control Domains included in the Azure Security Benchmark v2 are:

  • Network Security (NS)
  • Identity Management (IM)
  • Privileged Access (PA)
  • Data Protection (DP)
  • Asset Management (AM)
  • Logging and Threat Detection (LT)
  • Incident Response (IR)
  • Posture and Vulnerability Management (PV)
  • Endpoint Security (ES)
  • Backup and Recovery (BR)
  • DevOps Security (DS)
  • Governance and Strategy (GS)

Each recommendation within these Control Domains includes the following:

Azure ID: A unique Security Benchmark ID to identify the recommendation.

CIS Controls v7.1 ID(s): The corresponding CIS Controls v7.1 ID(s) (one or more) for the recommendation.

NIST SP 800-53 r4 ID(s): The corresponding NIST SP 800-53 r4 ID(s) (one or more) for the recommendation.

Details: A detailed explanation of the rationale, along with guidelines on how to implement the recommendation. Information about whether the recommendation is supported by Azure Security Center is also included.

Responsibility: Identifies the entity (customer, service provider, or both) responsible for implementing the security recommendation. Due to the overarching nature of the Azure Security Benchmark, some recommendations have a shared responsibility between the customer and the service provider; these are specified in the baseline recommendations for the individual service.

Customer Security Stakeholders: The security function in the customer organization responsible for the respective control. This can differ based on the ownership and organization structure of different customers.

6. What are the various ways to change the DWU allocation in Azure Synapse Analytics?

DWUs and cDWUs are the basic units for measuring performance in Azure Synapse Analytics. In this post, let’s look at the various methods available in Synapse Analytics to update the DWU allocation.

  1. Azure Portal GUI: This method is the most obvious, no-code way of changing the DWU allocation in Azure Synapse Analytics. To do this using the portal, we simply go to the Synapse Analytics database and select Scale. The slider can then simply be moved from left to right to increase (or from right to left to decrease) the DWU allocation.

  2. PowerShell: As with most settings within Azure, the DWU allocation can be changed using PowerShell, via the Set-AzSqlDatabase cmdlet. A representative command, reconstructed to match the explanation below (the resource group name is illustrative), is:
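  # Representative command; the resource group name is illustrative.
  Set-AzSqlDatabase -ResourceGroupName "rg_azurede" `
      -ServerName "AzureDEServer" -DatabaseName "AzureDESQLDW" `
      -RequestedServiceObjectiveName "DW1000c"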


The above command sets the service level objective to DW1000c.

Explanation:

AzureDESQLDW – Database for which we would like the DWU allocation to be updated.

AzureDEServer – Name of the server where the database has been created.

DW1000c – Number of DWUs to be allocated

  3. T-SQL: This method of changing the DWUs is the most SQL-developer-friendly, since it uses the simple SQL keywords ALTER and MODIFY (a representative example is shown below):
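  -- Representative example, matching the explanation below.
  ALTER DATABASE AzureDESQLDW
  MODIFY (SERVICE_OBJECTIVE = 'DW1000c');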


Please note that the above SQL command should be run while connected to the master database.

Explanation:

AzureDESQLDW – Database for which we would like the DWU allocation to be updated.

DW1000c – Number of DWUs to be allocated

  4. REST API: The Create or Update Database REST API can be used to update the DWU allocation. A representative request, reconstructed to match the explanation below (the api-version and location values are illustrative), is:
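  PUT https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/rg_azurede/providers/Microsoft.Sql/servers/azurede_server/databases/azuredeSQLDB?api-version=2021-02-01-preview

  {
    "location": "eastus",
    "sku": { "name": "DW1000c", "tier": "DataWarehouse" }
  }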


The above request updates the DWU allocation to DW1000c for the database azuredeSQLDB, hosted on the server azurede_server, which is provisioned in the rg_azurede resource group.

Explanation:

azuredeSQLDB – Database for which we would like the DWU allocation to be updated.

azurede_server – Name of the server where the database has been created.

rg_azurede – Resource group name for the database server

DW1000c – Number of DWUs to be allocated

7. What are serverless SQL pools?

Serverless SQL pool is a SQL query service that comes built into Synapse Analytics. It enables users to query non-relational data sources without having to provision any underlying hardware or software resources. (Internally, it uses pooled resources, hence the name serverless SQL pool.) A default service endpoint is provided inside the Azure Synapse Analytics instance.

The major benefit, as well as the primary use case, of the serverless SQL pool is quick data discovery and exploration. This feature enables data engineers to quickly assess the structure, quality and volume of data stored in non-relational formats (CSV, JSON, Parquet, etc.) in the data lake using familiar T-SQL syntax, as in the sketch below.
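For example, exploring Parquet files in the data lake can be as simple as the following minimal sketch (the storage account and path are placeholders):

  -- Minimal sketch: explore Parquet files in ADLS Gen2 from a serverless
  -- SQL pool. The storage account URL and path are placeholders.
  SELECT TOP 100 *
  FROM OPENROWSET(
      BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
      FORMAT = 'PARQUET'
  ) AS rows;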

There are also other secondary use cases for serverless SQL Pool, such as the following:

  • Creating a unified relational logical data warehouse without moving the underlying files, irrespective of the various file types (CSV, JSON, Parquet etc.)
  • Simple data transformations and visualization of the stored data using BI tools such as Power BI
  • Integration with Apache Spark databases using Spark External tables

Apart from enabling data engineers/scientists to perform ad-hoc querying, serverless SQL Pool also provides out-of-the-box connectivity for any tool that can establish a TDS connection.

8. What are dedicated SQL pools?

We are going to discuss Dedicated SQL Pool (formerly known as SQL DW) in Azure Synapse Analytics. Microsoft launched Azure Synapse Analytics as a platform that reflects the changing analytics landscape. Today’s analytics workloads and requirements have evolved from the more traditional data warehousing concept of nightly ETL (Extract, Transform, Load) to hybrid scenarios such as real-time data streaming and analyzing massive amounts of data stored in various formats.


Dedicated SQL Pool (formerly SQL DW) is a collection of features which enable the implementation of the more traditional Enterprise Data Warehousing platform using Azure Synapse Analytics. As the name suggests, Dedicated SQL pool is a collection of resources measured in Data Warehousing Units (DWUs), provisioned using Synapse SQL.  

Dedicated SQL pool uses columnar storage and relational tables, which enables it to store and analyze massive amounts of data. The benefit of columnar storage is that it reduces the amount of storage required, thereby reducing overall storage costs while improving query performance at the same time. Dedicated SQL pool utilizes PolyBase to import data from sources such as Hadoop and Spark, for example as sketched below.
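As a rough sketch of the PolyBase pattern, data can be loaded from an external table (defined over files in the data lake or Hadoop) into a distributed columnstore table using CTAS. The table names here are hypothetical, and the external data source and file format are assumed to be defined already.

  -- Minimal sketch: CTAS from a hypothetical external (PolyBase) table
  -- ext.Sales into a distributed columnstore table.
  CREATE TABLE dbo.Sales
  WITH (DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX)
  AS
  SELECT * FROM ext.Sales;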

9. What are DWUs?

Dedicated SQL Pools in Azure Synapse Analytics (Synapse SQL Pool for short) are optimized for the implementation of a more traditional Enterprise Data Warehouse.  To enable this, a Synapse SQL Pool is provisioned using a combination of resources. The three bundled resources are CPU, Memory, and IO. These bundled units are called DWUs (Data Warehousing Units). In other words, DWUs are standardized units of performance for Synapse SQL Pool.

This means that in order to increase performance, we have to increase the number of DWUs provisioned. Going from a lower service level to a higher service level increases the number of DWUs provisioned, and vice versa. To keep things simple, changing the DWUs has no effect on storage costs, since storage is billed separately.

10. What are cDWUs? What is the difference between DWUs and cDWUs?

Besides DWUs, there are other settings provided to optimize the cost and performance of a Synapse SQL Pool. One such setting is the Service Level Objective (SLO). The service levels for Synapse SQL Pool Gen2 are measured in cDWUs (Compute Data Warehousing Units). This is represented by appending a “c” to the unit number, e.g. DW2000c (versus DW2000 in Gen1).

To summarize:

  • Gen1 Synapse SQL Pools are measured in DWUs
  • Gen2 Synapse SQL Pools are measured in cDWUs

Both Gen1 and Gen2 support pausing the Data Warehouse compute (DWU/cDWU) resources when the DW is not in use. This can be done on demand by issuing a Synapse SQL or PowerShell command, as sketched below; these will be discussed in more detail in a future post.
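As a quick illustration, pausing and resuming a pool can be done with the Az PowerShell module; this is a minimal sketch with placeholder names.

  # Minimal sketch: pause and resume a dedicated SQL pool (placeholder names).
  Suspend-AzSqlDatabase -ResourceGroupName "rg_azurede" `
      -ServerName "azurede_server" -DatabaseName "AzureDESQLDW"
  Resume-AzSqlDatabase -ResourceGroupName "rg_azurede" `
      -ServerName "azurede_server" -DatabaseName "AzureDESQLDW"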

Another important point to note is that Gen2 uses a local disk-based cache, which gets invalidated every time the Synapse SQL Pool is either paused or scaled (i.e. the cDWU allocation is increased or decreased). So, with Gen2, after resumption (unpause or cDWU allocation update) there will be a period of cache warming before the performance of the Synapse SQL Pool reaches an optimal level.

11. How do you estimate the costs before starting an Azure Synapse Analytics project?

Before starting a project, it is very important to understand and plan for the estimated costs associated with creating and running an Azure Synapse Analytics instance. The Azure Pricing Calculator provides an easy-to-use interface where users can input the numbers and get an overall estimate of the associated costs.

To use the Azure Pricing Calculator correctly, one needs to understand the various resource types that are provisioned while creating and running a Synapse Analytics instance.

Some of the resource instances and numbers that need to be input into the Azure Pricing Calculator are as follows:

  • Dedicated SQL pool: service level (cDWUs) and expected hours of usage
  • Serverless SQL pool: estimated volume of data processed by queries (billed per TB)
  • Apache Spark pool: node size, number of nodes and vCore-hours of usage
  • Data integration: pipeline activity runs and Data Flow vCore-hours
  • Storage: data stored in the associated Azure Data Lake Storage Gen2 account
It should be noted that Azure Synapse Analytics runs on associated Azure infrastructure that is provisioned along with the Synapse Analytics instance.

One of the key infrastructure components linked to an Azure Synapse Analytics instance is Azure Data Lake Storage Gen2. It is easy to miss the associated costs for Azure Data Lake Storage while estimating costs for Azure Synapse Analytics.

Also, the Azure Portal shows a summary tab with the estimated costs on the create workspace page for Azure Synapse Analytics.

