Azure Data Engineering Interview Questions Part 4

by Daniel A G

1. What are Mapping Data Flows?

Azure Data Factory has comprehensive data transformation and integration capabilities. For those looking for a simpler data integration experience than hand-built Data Factory Pipelines, Microsoft provides Mapping Data Flows, which require no code.

Mapping Data Flows in Azure Data Factory are a visual way to design data transformation flows. Data flows created with Mapping Data Flows internally become Azure Data Factory activities and are executed as part of Azure Data Factory Pipelines. Data Flows are therefore compatible with Azure Data Factory's scheduling, control flow, and monitoring features.

Mapping Data Flows can be created within the same Azure Data Factory resources pane as Pipelines.


On the next screen, there is a visual interface for designing Data Flows. The main feature on this page is the graph, which lets users choose the type of data transformation they would like to use.


Apart from the graphical interface for creating Data Flows, there is a configuration panel at the bottom of the screen. Optimization settings are available on this panel, with each option represented by an icon.


Mapping Data Flows become operational as Data Flow activities within Azure Data Factory Pipelines. A debug mode is also provided to preview the results of each transformation step.
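As a rough, hedged sketch (not from the original post), the snippet below shows how a pipeline containing a Data Flow activity might be deployed with the Az.DataFactory PowerShell module. The resource group, data factory, and data flow names are placeholders, and the Mapping Data Flow itself is assumed to have already been created in the factory.

```powershell
# Pipeline definition containing an ExecuteDataFlow activity that runs an
# existing Mapping Data Flow named "MyMappingDataFlow" (placeholder name).
$pipelineJson = @'
{
  "name": "RunMappingDataFlowPipeline",
  "properties": {
    "activities": [
      {
        "name": "TransformCustomerData",
        "type": "ExecuteDataFlow",
        "typeProperties": {
          "dataFlow": {
            "referenceName": "MyMappingDataFlow",
            "type": "DataFlowReference"
          },
          "compute": {
            "computeType": "General",
            "coreCount": 8
          }
        }
      }
    ]
  }
}
'@
Set-Content -Path ".\RunMappingDataFlowPipeline.json" -Value $pipelineJson

# Deploy the pipeline to the data factory (placeholder resource group / factory names)
Set-AzDataFactoryV2Pipeline -ResourceGroupName "MyDataFactoryRG" `
                            -DataFactoryName "MyDataFactory" `
                            -Name "RunMappingDataFlowPipeline" `
                            -DefinitionFile ".\RunMappingDataFlowPipeline.json"
```

The Data Flow activity then participates in the pipeline like any other activity, so it can be scheduled, chained, and monitored in the usual way.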

Mapping Data Flows performance monitoring and other settings will be discussed in a future post.

2. What is the SSIS Integration Runtime?

Azure Data Factory Integration Runtime acts as a mediator between on-premises data stores and the Azure cloud. It is possible to run SQL Server Integration Services (SSIS) packages developed on-premises using the Integration Runtime. This is an important feature, as it can save a considerable amount of time and effort that would otherwise be spent recreating SSIS packages in the Azure Data Factory environment.

Azure Data Factory Integration Runtime is the compute infrastructure that is required to provide data integration capabilities such as:

Data Flow: Execute an Azure Data Factory data flow in a managed Azure compute environment.

Data Movement: Copy data across data stores in the cloud and on premises, with support for built-in connectors, format conversion, column mapping, etc.

Activity Dispatch: Dispatch and monitor transformation activities on a variety of Azure services such as Azure Databricks, Azure HDInsight, Azure SQL Database etc.

SSIS Package execution: Native execution of SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.

Integration Runtime acts as a bridge between an activity and a linked service. A linked service is a target data store or compute service.

Azure Data Factory Integration Runtimes can be created using the Management Hub, which is available in the Azure Data Factory UI.
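Integration Runtimes can also be created from script. As a hedged sketch (all names, sizes, and counts below are placeholders, and an authenticated Az PowerShell session with the Az.DataFactory module is assumed), an Azure-SSIS Integration Runtime might be provisioned and started like this:

```powershell
# Create the Azure-SSIS Integration Runtime definition in the data factory
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyDataFactoryRG" `
                                      -DataFactoryName "MyDataFactory" `
                                      -Name "MyAzureSsisIR" `
                                      -Type Managed `
                                      -Location "EastUS" `
                                      -NodeSize "Standard_D4_v3" `
                                      -NodeCount 2 `
                                      -Edition "Standard" `
                                      -MaxParallelExecutionsPerNode 4

# Start the runtime; provisioning the managed SSIS compute typically takes several minutes
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyDataFactoryRG" `
                                        -DataFactoryName "MyDataFactory" `
                                        -Name "MyAzureSsisIR" `
                                        -Force
```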

3. What are the various Integration Runtime types available in Azure Data Factory?

There are three types of Integration Runtime: Azure, Self-hosted, and Azure-SSIS. The usage scenarios for the different integration runtimes are summarized below.



With the Azure Integration Runtime, we can run data flows, perform copy activities between various data stores in Azure, and dispatch transform activities to a variety of Azure compute services. The Azure Integration Runtime can be used when connecting to data stores and compute services that have publicly accessible endpoints.

A Self-hosted Integration Runtime can be used to run a copy activity between cloud data stores and a data store in an on-premises private network behind a firewall or inside a Virtual Private Network (VPN).
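As a small, hedged sketch of this scenario (placeholder names, Az.DataFactory module assumed), the logical self-hosted Integration Runtime is first created in the data factory, and its authentication key is then used to register the runtime software installed on the on-premises machine:

```powershell
# Create the logical self-hosted integration runtime in the data factory
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyDataFactoryRG" `
                                      -DataFactoryName "MyDataFactory" `
                                      -Name "MySelfHostedIR" `
                                      -Type SelfHosted `
                                      -Description "Runtime for on-premises data stores"

# Retrieve the authentication keys used when registering the on-premises node
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "MyDataFactoryRG" `
                                         -DataFactoryName "MyDataFactory" `
                                         -Name "MySelfHostedIR"
```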

The Azure-SSIS Integration Runtime enables users to run SSIS packages natively within the Azure environment, i.e. without making any changes to them. The Azure-SSIS Integration Runtime can be provisioned either in the public cloud or joined to a private virtual network. With the Azure-SSIS Integration Runtime, Microsoft gives users the flexibility to either build Azure Data Factory pipelines or reuse existing SSIS packages.

4. How can we monitor an Azure Data Factory integration runtime?

We are going to discuss ways to monitor the performance of Integration Runtimes. The PowerShell command below can be used to get the status and performance statistics for an Integration Runtime:
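A minimal example (the resource group, data factory, and runtime names are placeholders, and the Az.DataFactory module with an authenticated session is assumed):

```powershell
# Query an integration runtime's status and detailed monitoring information
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyDataFactoryRG" `
                                      -DataFactoryName "MyDataFactory" `
                                      -Name "MyAzureIR" `
                                      -Status
```

The -Status switch requests detailed status information; omitting it returns only the basic properties of the runtime.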


The Get-AzDataFactoryV2IntegrationRuntime cmdlet returns the current status of the Integration Runtime along with some other properties.

The Status property can have the following values:

Online: This indicates that the Integration Runtime is online and ready to be used

Offline: The integration runtime is offline due to an internal error.

For the detailed properties returned by the Get-AzDataFactoryV2IntegrationRuntime cmdlet for the Self-hosted and Azure-SSIS integration runtimes, please refer to the reference link below this post.

The Azure-SSIS integration runtime can also be monitored using the Azure portal.

5. What is Azure Data Factory trigger execution? What are the benefits of using trigger execution?

The real benefit of Azure Data Factory is that pipelines can be automated. There are various ways to automate or trigger the execution of Azure Data Factory Pipelines:

Schedule Trigger: A trigger that invokes a pipeline execution at a fixed time or on a recurring schedule, e.g. weekly or monthly.

Tumbling Window Trigger: Tumbling Window Triggers execute an Azure Data Factory Pipeline at a periodic time interval from a specified start time, while retaining state. Tumbling windows are a series of fixed-size, non-overlapping time intervals.

Event-Based Trigger: These triggers execute an Azure Data Factory Pipeline in response to an event, e.g. the arrival or deletion of a file in Azure Blob Storage.

It may be noted that both the Schedule Trigger and the Tumbling Window Trigger can be used to create recurring pipeline executions. The main difference between the two is that the Tumbling Window Trigger waits for the pipeline run to finish and captures its state: if the triggered pipeline run is cancelled, the corresponding tumbling window trigger run is marked as cancelled. A Schedule Trigger, on the other hand, does not capture the state of the pipeline run; it marks the trigger run as successful as soon as the pipeline run starts, irrespective of the outcome of the pipeline execution.
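As a hedged sketch of a recurring trigger (the trigger, pipeline, resource group, and factory names, as well as the start time, are placeholders), a daily Schedule Trigger could be defined and started with the Az.DataFactory module as follows. Note that a trigger does not fire until it has been started.

```powershell
# Daily schedule trigger that invokes a pipeline named "CopyBlobToSqlPipeline" (placeholder)
$triggerJson = @'
{
  "name": "DailyScheduleTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2023-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyBlobToSqlPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
'@
Set-Content -Path ".\DailyScheduleTrigger.json" -Value $triggerJson

# Create the trigger, then start it so it begins firing on schedule
Set-AzDataFactoryV2Trigger -ResourceGroupName "MyDataFactoryRG" `
                           -DataFactoryName "MyDataFactory" `
                           -Name "DailyScheduleTrigger" `
                           -DefinitionFile ".\DailyScheduleTrigger.json"

Start-AzDataFactoryV2Trigger -ResourceGroupName "MyDataFactoryRG" `
                             -DataFactoryName "MyDataFactory" `
                             -Name "DailyScheduleTrigger" `
                             -Force
```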

6. What are the various data sources supported by Azure Data Factory?

The current list of supported data stores can be found here:

https://docs.microsoft.com/en-us/azure/data-factory/connector-overview

7. What is a sink in Azure Data Factory?

Azure Data Factory is the primary task orchestration and data transformation/load (ETL) tool on the Azure cloud. The easiest way to move and transform data using Azure Data Factory is to use the Copy Activity within a Pipeline.

The Copy Activity is a simple activity designed to copy and move data from one location to another. To accomplish this, a set of connectors is provided. As in SSIS, there are two different sets of components available:

Source: This is where the data that we would like to copy currently resides.

Sink: This is the location/data store where we would like the data to be loaded. If you are familiar with the SSIS Data Flow Task, this is similar to the Destination component.


The Copy Activity simply reads the data from the Source and writes it to the Sink over the underlying networks (LAN/WAN) on each side.

Both Source and Sink are available as configuration settings within the Copy Activity. Both need to be linked to Datasets, and Datasets are usually linked to a Linked Service. Linked Services and Datasets for Azure Data Factory will be covered in a future post.
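For illustration only (a hedged sketch: the pipeline, dataset, resource group, and factory names are placeholders, and the two Datasets are assumed to already exist), a minimal Copy Activity pipeline with a delimited-text Source and an Azure SQL Sink might look like this:

```powershell
# Pipeline containing a single Copy Activity; "inputs" is the Source dataset
# and "outputs" is the Sink dataset (both are placeholder dataset names).
$copyPipelineJson = @'
{
  "name": "CopyBlobToSqlPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyOrdersFile",
        "type": "Copy",
        "inputs": [
          { "referenceName": "SourceBlobDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "SinkSqlTableDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}
'@
Set-Content -Path ".\CopyBlobToSqlPipeline.json" -Value $copyPipelineJson

# Deploy the pipeline definition to the data factory
Set-AzDataFactoryV2Pipeline -ResourceGroupName "MyDataFactoryRG" `
                            -DataFactoryName "MyDataFactory" `
                            -Name "CopyBlobToSqlPipeline" `
                            -DefinitionFile ".\CopyBlobToSqlPipeline.json"
```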

There are some other important concepts related to the Copy Activity, such as Dataset Mapping which will be discussed in future posts.

For a list of supported data stores and formats for Source and Sink, please visit the MS Docs link below.

8. What is a Linked Service in Azure Data Factory? Can it be parameterized?

There are other components that need to be created before we can start creating Data Factory Pipelines. Some of these are Linked Services and Datasets.

Linked Service: A Linked Service contains the connection details for a data store or compute service, e.g. the database server, database name, file path, URL, etc. A Linked Service may also include authentication information for the connection, e.g. login IDs, passwords, API keys, etc., stored in encrypted form. Linked Services can be created under the Manage tab in the Azure Data Factory UI. Linked Services can be parameterized so that a single Linked Service can be reused across similar data store types, e.g. a generic database Linked Service used to access multiple databases. This will be discussed in detail in a future post.
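As a hedged sketch of parameterization (the server, database, resource group, and factory names are placeholders, and authentication details such as a managed identity or an Azure Key Vault reference are omitted for brevity), a generic Azure SQL Database Linked Service that accepts the database name as a parameter could be defined like this:

```powershell
# Parameterized linked service: the database name is supplied at run time
# via the @{linkedService().DBName} expression (placeholder server name).
$linkedServiceJson = @'
{
  "name": "GenericAzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "parameters": {
      "DBName": { "type": "String" }
    },
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};"
    }
  }
}
'@
Set-Content -Path ".\GenericAzureSqlLinkedService.json" -Value $linkedServiceJson

Set-AzDataFactoryV2LinkedService -ResourceGroupName "MyDataFactoryRG" `
                                 -DataFactoryName "MyDataFactory" `
                                 -Name "GenericAzureSqlLinkedService" `
                                 -DefinitionFile ".\GenericAzureSqlLinkedService.json"
```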


Datasets: A Dataset is a reference to a data store and provides a specific pointer to an object within the Linked Service. For example, if a Linked Service points to a database instance, a Dataset can refer to the specific table that we would like to use as a source or sink in a Data Factory Pipeline. Datasets are created on top of an existing Linked Service in the Author tab. Once created, Datasets can be used as the source or sink in the data movement activity properties of a Data Factory Pipeline. One such activity is the Copy Data activity, which accepts Dataset names for its Source and Sink.
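Continuing the hedged sketch above, a Dataset can point at one specific table and pass a value for the Linked Service's DBName parameter (the schema, table, and database names are placeholders):

```powershell
# Dataset referencing the parameterized linked service defined earlier and
# supplying a concrete database name for the DBName parameter.
$datasetJson = @'
{
  "name": "SinkSqlTableDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "GenericAzureSqlLinkedService",
      "type": "LinkedServiceReference",
      "parameters": { "DBName": "SalesDb" }
    },
    "typeProperties": {
      "schema": "dbo",
      "table": "Orders"
    }
  }
}
'@
Set-Content -Path ".\SinkSqlTableDataset.json" -Value $datasetJson

Set-AzDataFactoryV2Dataset -ResourceGroupName "MyDataFactoryRG" `
                           -DataFactoryName "MyDataFactory" `
                           -Name "SinkSqlTableDataset" `
                           -DefinitionFile ".\SinkSqlTableDataset.json"
```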


Before we can perform a debug run of the pipeline, there is a validation process that validates all the underlying components within the pipeline including Linked Services and Datasets.

9. What do you understand by Data Engineering? What are the responsibilities of a Data Engineer?

Let’s start with the definition of Data Engineering. Simply put, data engineering is the process of designing, developing, maintaining, and enhancing the way data is acquired, processed, stored, and consumed within an organization.

Data Engineering is an important skill that is quickly moving from a “good to have” to a “must have”. A data engineer is responsible for enabling the optimal usage and consumption of data and analytics.

Main responsibilities of the data engineer may include the following:

  1. To build and maintain the data flow pipelines
  2. To monitor and administer the existing data storage solutions
  3. To maintain consistent data quality using cleansing and normalization techniques
  4. To identify and remove redundancies in the existing data platforms
  5. To optimize the usage of both hardware and software resources

The above is by no means an exhaustive list of a Data Engineer's responsibilities. The Data Engineer role is especially important in the current market scenario, since many organizations are in the process of moving their data systems from on premises to the cloud. Many organizations have adopted a hybrid model, with some of their IT systems on premises and some in the cloud. Therefore, optimization of resources, in other words “right sizing”, is an important part of the Data Engineer role.

