1. What are Mapping Data Flows?
Azure Data Factory has comprehensive data transformation and integration capabilities. For those looking for a simpler data integration experience compared to authoring Data Factory Pipelines directly, Microsoft provides Mapping Data Flows, which do not require writing any code.
Mapping Data Flows in Azure Data Factory are a visual way to design data transformation flows. The data flows created using Mapping Data Flows internally become Azure Data Factory activities and are executed as part of Azure Data Factory Pipelines. Data Flows are therefore compatible with Azure Data Factory's scheduling, control flow, and monitoring features.
Mapping Data Flows can be created within the same Azure Data Factory resources pane as Pipelines.
On the next screen, there is a visual interface for designing Data Flows. The main feature of this page is the graph, which lets users choose the types of data transformation they would like to use.
Apart from the graphical design surface for creating Data Flows, there is a configuration panel at the bottom of the screen, where the Optimize settings and other options for the selected transformation are available.
Mapping Data Flows become operational as Data Flow activities within Azure Data Factory Pipelines. A debug mode is also provided to view the results of each transformation step, which makes debugging easier.
Mapping Data Flows performance
monitoring and other settings will be discussed in a future post.
2. What is the SSIS runtime?
The Azure Data Factory Integration Runtime acts as a mediator between on-premises data stores and the Azure cloud. It is possible to run SQL Server Integration Services (SSIS) packages that were developed on-premises using the Integration Runtime. This is an important feature, as it can save a considerable amount of the time and effort that would otherwise be required to recreate the SSIS packages in the Azure Data Factory environment.
Azure Data Factory
Integration Runtime is the compute infrastructure that is required to provide
data integration capabilities such as:
Data Flow: Execute an Azure Data Factory data flow in a managed Azure compute environment.
Data Movement: Copy data across data stores in the cloud and on premises. Supports built-in connectors, format conversion, column mapping, etc.
Activity Dispatch: Dispatch and monitor transformation activities on a variety of Azure services such as Azure Databricks, Azure HDInsight, Azure SQL Database, etc.
SSIS Package Execution: Native execution of SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.
The Integration Runtime acts as a bridge between an activity and a linked service. A linked service is a target data store or compute service.
Azure Data Factory
Integration Runtimes can be created using Management Hub, which is available in
the Azure Data Factory UI.
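Integration Runtimes can also be created programmatically. Below is a minimal sketch using the Az.DataFactory PowerShell module that registers a self-hosted integration runtime and retrieves its authentication keys; the resource group, data factory and runtime names are placeholders.

```powershell
# Minimal sketch: register a self-hosted integration runtime and fetch its
# authentication keys. The resource group, data factory and runtime names
# below are placeholders.
$rg = "my-resource-group"
$df = "my-data-factory"
$ir = "MySelfHostedIR"

# Create (or update) the self-hosted integration runtime definition
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg `
    -DataFactoryName $df -Name $ir -Type SelfHosted `
    -Description "Self-hosted IR for on-premises data stores"

# Retrieve the authentication keys used to register the on-premises node
# with this integration runtime
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $rg `
    -DataFactoryName $df -Name $ir
```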
3. What are the various Integration Runtime types available in Azure Data Factory?
We will now discuss the Integration Runtime types. There are three types of Integration Runtime: Azure, Self-hosted, and Azure-SSIS. Microsoft's documentation provides a table and an illustration to help understand the usage scenarios for the different integration runtimes.
With the Azure Integration Runtime, we can run data flows, perform copy activities between various data stores in Azure, and dispatch transformation activities to various Azure compute services. The Azure Integration Runtime can be used when connecting to data stores and compute services with public access endpoints.
A Self-hosted Integration Runtime can be used to run a copy activity between cloud data stores and a data store in an on-premises private network behind a firewall or a Virtual Private Network (VPN).
The Azure-SSIS Integration Runtime enables users to run SSIS packages natively within the Azure environment, i.e. without making any changes to them. The Azure-SSIS Integration Runtime can be provisioned either in the public cloud or joined to a private virtual network (VNet). With the Azure-SSIS Integration Runtime, Microsoft gives users the flexibility to either build Azure Data Factory pipelines or reuse their existing SSIS packages.
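As an illustration, below is a hedged sketch of provisioning and starting an Azure-SSIS Integration Runtime with the Az.DataFactory PowerShell module. All names, the region and the node size are placeholders, and the SSISDB catalog configuration is omitted for brevity.

```powershell
# Minimal sketch: provision and start an Azure-SSIS Integration Runtime.
# All names, the region and the node size are placeholders; the SSISDB
# catalog configuration is omitted for brevity.
$rg = "my-resource-group"
$df = "my-data-factory"
$ir = "MyAzureSsisIR"

# Create (or update) the Azure-SSIS IR definition
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg `
    -DataFactoryName $df -Name $ir -Type Managed `
    -Location "EastUS" -NodeSize "Standard_D4_v3" -NodeCount 2 `
    -Edition "Standard" -MaxParallelExecutionsPerNode 4

# Start the runtime (provisioning usually takes several minutes)
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg `
    -DataFactoryName $df -Name $ir -Force
```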
4. How can we monitor Azure Data Factory Integration Runtimes?
We are going to discuss ways to monitor the performance of Integration Runtimes. The PowerShell cmdlet below can be used to get the status and other properties of an Integration Runtime:
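A minimal sketch, assuming the Az.DataFactory module is installed; the resource group, data factory and runtime names are placeholders:

```powershell
# Minimal sketch: the resource group, data factory and runtime names
# are placeholders.
$rg = "my-resource-group"
$df = "my-data-factory"

# Basic properties of a specific integration runtime
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg `
    -DataFactoryName $df -Name "MyAzureIR"

# Add the -Status switch to return the detailed, current state
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg `
    -DataFactoryName $df -Name "MyAzureIR" -Status
```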
The Get-AzDataFactoryV2IntegrationRuntime cmdlet returns the current status of the Integration Runtime along with other properties such as its name, type, resource group, data factory name and description.
The Status property
can have the following values:
Online: This indicates that the Integration Runtime is online and ready to be used.
Offline: The Integration Runtime is offline due to an internal error.
For the detailed properties returned by the Get-AzDataFactoryV2IntegrationRuntime cmdlet for the Self-hosted and Azure-SSIS Integration Runtimes, please refer to the reference link below this post.
The Azure-SSIS Integration Runtime can also be monitored using the Azure Portal.
5. What is Azure Data Factory trigger execution? What are the benefits of using trigger execution?
The real benefit of Azure Data Factory is that pipelines can be
automated. There are various ways to automate or trigger the execution of Azure
Data Factory Pipelines:
Schedule Trigger: A trigger that invokes a pipeline execution at a fixed time or on a fixed schedule, e.g. weekly, monthly, etc.
Tumbling Window Trigger: Tumbling Window Triggers execute an Azure Data Factory Pipeline at a periodic time interval from a specified start time while retaining state. Tumbling window time intervals are of a fixed size and do not overlap.
Event-Based Trigger: These triggers execute an Azure Data Factory Pipeline on the occurrence of an event, e.g. the arrival or deletion of a file in Azure Blob Storage.
It may be noted that both the Schedule Trigger and the Tumbling Window Trigger can be used to create recurring pipeline executions. The main difference between the two is that the Tumbling Window Trigger waits for the pipeline execution to finish and captures the state of the pipeline run; for example, if the triggered pipeline run is cancelled, the corresponding tumbling window trigger run is marked as cancelled. A Schedule Trigger, on the other hand, does not capture the state of the pipeline execution; it marks the trigger run as successful as soon as the pipeline run starts, irrespective of the outcome of the pipeline execution.
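As a hedged illustration, the sketch below registers and starts a daily Schedule Trigger against a hypothetical pipeline using the Az.DataFactory PowerShell module; the pipeline name, trigger name and schedule are placeholders.

```powershell
# Minimal sketch: register and start a daily Schedule Trigger.
# The pipeline name, trigger name and schedule are placeholders.
$rg = "my-resource-group"
$df = "my-data-factory"

@'
{
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2023-01-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "MyCopyPipeline"
        }
      }
    ]
  }
}
'@ | Set-Content -Path ".\DailyTrigger.json"

# Register the trigger, then start it (triggers are created in a stopped
# state and do not fire until they are started)
Set-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $df `
    -Name "DailyTrigger" -DefinitionFile ".\DailyTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $df `
    -Name "DailyTrigger" -Force
```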
6. What are the various data sources supported by Azure Data Factory?
The current list of supported data stores can
be found here:
https://docs.microsoft.com/en-us/azure/data-factory/connector-overview
7. What is a sink in Azure Data Factory?
Azure Data Factory is the primary task orchestration and extract-transform-load (ETL) tool on the Azure cloud. The easiest way to move and transform data using Azure Data Factory is to use the Copy Activity within a Pipeline. The Copy Activity is a simple activity designed to copy and move data from one location to another, and a number of connectors are provided for this purpose. Like SSIS, it has two different sets of components:
Source: This is where the data that we would like to copy currently resides.
Sink: This is the location/data store where we would like the data to be loaded. If you are familiar with the SSIS Data Flow Task, this is similar to the Destination component.
The Copy Activity simply copies the data from the Source and replicates it to the Sink using the underlying networks (LAN/WAN) on each side.
Both the Source and the Sink are available as configuration settings within the Copy Activity. Both need to be linked to Datasets, and the Datasets are usually linked to a Linked Service. Linked Services and Datasets for Azure Data Factory will be covered in a future post.
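As a hedged illustration of how the Source and Sink are wired up, the sketch below registers a pipeline containing a single Copy Activity using the Az.DataFactory PowerShell module. The dataset names, source/sink types and resource names are placeholders and assume the corresponding datasets and linked services already exist.

```powershell
# Minimal sketch: a pipeline with a single Copy Activity. The dataset
# names, source/sink types and resource names are placeholders and assume
# the corresponding datasets and linked services already exist.
$rg = "my-resource-group"
$df = "my-data-factory"

@'
{
  "name": "CopyBlobToSqlPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs":  [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkSqlDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ]
  }
}
'@ | Set-Content -Path ".\CopyBlobToSqlPipeline.json"

# Register the pipeline and start an on-demand run
Set-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
    -Name "CopyBlobToSqlPipeline" -DefinitionFile ".\CopyBlobToSqlPipeline.json"
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
    -PipelineName "CopyBlobToSqlPipeline"
```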
There are some other important concepts related to the Copy Activity, such as dataset mapping, which will be discussed in future posts.
For a list of
supported data stores and formats for Source and Sink,
please visit the MS Docs link below.
8. What is a Linked Service in Azure Data Factory? Can it be parameterized?
There are other components that need to be created before we can start
creating Data Factory Pipelines. Some of these are Linked
Services and Datasets.
Linked Service: A Linked Service contains the connection details for a data store or compute service, e.g. the database server, database name, file path, URL, etc. A Linked Service may also include authentication information for the connection, e.g. login IDs, passwords or API keys, stored in an encrypted format. Linked Services can be created under the Manage tab in the Azure Data Factory UI. Linked Services can be parameterized, so that one Linked Service can be reused across similar data store types, e.g. a generic database Linked Service that can access multiple databases (a sketch of this follows below). This will be discussed in detail in a future post.
Datasets: A Dataset is a reference to a data store and provides a very specific pointer to an object within the Linked Service. For example, if a Linked Service points to a database instance, a dataset can refer to the specific table that we would like to use as a source or sink in a Data Factory Pipeline. Datasets can be created on top of an existing Linked Service in the Author tab. Once created, datasets can be used in the source/sink properties of data movement activities in a Data Factory Pipeline; one such activity is the Copy Data Activity, which accepts dataset names for the Source and the Sink.
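As a hedged illustration of parameterization, the sketch below defines a generic Azure SQL Database Linked Service with a DBName parameter and registers it with the Az.DataFactory PowerShell module; the server, credentials and all names are placeholders.

```powershell
# Minimal sketch: a parameterized Azure SQL Database Linked Service that
# can be reused for multiple databases on one logical server. The server,
# user and all names are placeholders; store real credentials in Azure
# Key Vault rather than inline.
$rg = "my-resource-group"
$df = "my-data-factory"

@'
{
  "name": "GenericAzureSqlDatabase",
  "properties": {
    "type": "AzureSqlDatabase",
    "parameters": {
      "DBName": { "type": "String" }
    },
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};User ID=myuser;Password=<placeholder>;"
    }
  }
}
'@ | Set-Content -Path ".\GenericAzureSqlDatabase.json"

# Register the parameterized linked service
Set-AzDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $df `
    -Name "GenericAzureSqlDatabase" -DefinitionFile ".\GenericAzureSqlDatabase.json"
```

A dataset created on top of this Linked Service then supplies a concrete value for DBName, which is how a single Linked Service definition can serve many databases.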
Before we can
perform a debug run of the pipeline, there is a validation process
that validates all the underlying components within the pipeline including Linked
Services and Datasets.
9. What do you understand by Data Engineering? What are the responsibilities of a Data Engineer?
Let's start with the definition of Data Engineering. Simply put, data engineering is the process of designing, developing, maintaining and enhancing the way data is acquired, processed, stored and consumed within an organization.
Data Engineering is an important skill that is quickly moving from a "good to have" to a "must have". A data engineer is responsible for enabling the optimum usage and consumption of data and analytics.
The main responsibilities of a data engineer may include designing and building data pipelines, integrating data from multiple sources, managing data storage, ensuring data quality and security, and optimizing the cost and performance of data platforms.
The above is by no means an exhaustive list of the responsibilities of a Data Engineer. The Data Engineer role is especially important in the current market scenario, since a lot of organizations are in the process of moving their data systems from on premises to the cloud. Many organizations have moved to a hybrid model where they have some of their IT systems on premises and some on the cloud. Therefore, optimization of resources, in other words "right sizing", is an important part of the Data Engineer role.