Azure Data Engineering Interview Questions Part 1

by Daniel A G

1. What is Microsoft Azure?

Azure is Microsoft’s cloud computing platform. Cloud computing is an online computing service (covering both hardware and software) where the service provider runs managed services that users can access on demand. Instead of buying hardware or software outright and paying to own the product, users rent these resources from the cloud service provider and pay only for the capacity and time they actually use. The main advantages of a cloud service are flexibility, scalability, cost effectiveness and ease of use.

Cloud services can be broadly categorized into three categories based on the type of services offered:

IaaS (Infrastructure as a Service): A service provider offers cloud-based hardware and system resources such as CPU (compute), RAM and storage (disk space).

PaaS (Platform as a Service): A software platform (e.g. a database, IDE or design tool) is offered as a service. The main users of PaaS tools are software developers, power users and designers.

SaaS (Software as a Service): A finished software product is offered in the cloud and accessed by end users through a browser or app from any internet-connected device, e.g. office productivity tools such as Office 365 and Google Docs, or accounting tools such as Xero.

2. What are the various storage types available in Azure?

The Azure Storage platform is Microsoft’s cloud storage solution. Like other cloud offerings, it is a managed service and caters to a wide range of storage requirements and scenarios. It includes the following storage options:

  1. Azure Blobs: Blob stands for binary large object. This storage solution supports all kinds of files, including text files, videos, images, documents, binary data etc.
  2. Azure Files: Azure Files is an organized way of storing data in the cloud. The main advantage of Azure Files over Azure Blobs is that it allows data to be organized in a folder structure. Azure Files is also SMB compliant, i.e. it can be mounted as a file share.
  3. Azure Queues: Azure Queues is a cloud-based messaging store for establishing and brokering communication between applications and components (see the short sketch after this list).
  4. Azure Tables: A NoSQL storage solution for structured data that does not fit the standard relational database schema.
  5. Azure Disks: A storage solution for Azure VMs (Virtual Machines).
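To make the storage types a bit more concrete, here is a minimal sketch, assuming the azure-storage-queue Python SDK with a placeholder connection string and queue name, of one component sending a message to an Azure Queue and another component receiving it:

from azure.storage.queue import QueueClient

# Assumption: the connection string and queue name are placeholders, not real values.
queue = QueueClient.from_connection_string(
    "<storage-account-connection-string>", queue_name="orders"
)
queue.create_queue()  # creates the queue (raises an error if it already exists)

# One application component enqueues a message...
queue.send_message("process-order-1001")

# ...and another component retrieves and removes it later.
for message in queue.receive_messages():
    print(message.content)
    queue.delete_message(message)

The same pattern (a client object created from a connection string) applies to the Blob, File and Table clients as well.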

 

3. What is data redundancy? What data redundancy options are available in Azure?

Data redundancy is the practice of storing multiple copies of data so that the data remains available even during unexpected events such as a disk failure or a natural disaster.

To ensure high levels of data availability, Azure always keeps multiple copies of data. Depending on the criticality and time required to enable access to the replica, there are a few data redundancy options available to customers in Azure:

  1. Locally Redundant Storage (LRS): Data is replicated across different racks in the same data center. This option ensures that there are at least three copies of the data. It is the cheapest redundancy option.
  2. Zone Redundant Storage (ZRS): Data is replicated across three availability zones within the primary region. In case of a zone failure, Azure takes care of DNS repointing automatically. A few changes may be required to the network settings of any applications accessing data after the DNS repointing.
  3. Geo Redundant Storage (GRS): As the name suggests, data is replicated across two regions. This option ensures that data can be recovered even if an entire region goes down. In case of a geo failure, it may take some time for the failover to complete and for the data to become accessible in the secondary region.
  4. Geo Zone Redundant Storage (GZRS): A combination of GRS and ZRS, i.e. data is replicated across three zones in the primary region and copied to a single zone in the secondary region.
  5. Read Access Geo Redundant Storage (RA-GRS): The same as GRS, but with the added option of read access to the data in the secondary region.
  6. Read Access Geo Zone Redundant Storage (RA-GZRS): The same as GZRS, but with the added option of read access to the data in the secondary region.
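The redundancy option is chosen when the storage account is created and is reflected in the account’s SKU (Standard_LRS, Standard_ZRS, Standard_GRS, Standard_RAGRS, Standard_GZRS or Standard_RAGZRS). Below is a minimal sketch, assuming the azure-storage-blob Python SDK and a placeholder connection string, that checks which redundancy option an existing account uses:

from azure.storage.blob import BlobServiceClient

# Assumption: the connection string is a placeholder for a real storage account.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")

info = service.get_account_information()
print(info["sku_name"])      # e.g. "Standard_GZRS" for geo-zone-redundant storage
print(info["account_kind"])  # e.g. "StorageV2"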

Azure maintains data integrity by using cyclic redundancy checks (CRCs) on stored data and checksums on network traffic. If any errors are found, it automatically performs a repair using the redundant copies.

4. What are multi-model databases? What is the primary multi-model database service available on the Microsoft Azure platform?

Cosmos DB (formerly DocumentDB) is Microsoft’s premier NoSQL service offering on Azure. It was the first globally distributed, multi-model database offered in the cloud by any vendor.


Let’s have a look at the main features of Cosmos DB:

Globally Distributed: It is extremely easy to distribute data globally by using multiple Azure regions. Having data spread across multiple geographies enables faster access to data and a better overall user experience.

Automatic Indexing: Cosmos DB indexes all stored data automatically, irrespective of the underlying schema. No schema or index management is required. For customers who would like more control, there is an option to customize indexing as well.

Multi-model: It can store data in various data models such as key-value pairs, documents, graphs and column families. No matter which data model the customer chooses, the consistency, low latency, global distribution and automatic indexing features remain the same. This will be discussed in detail in a future post.

Multiple consistency models: Cosmos DB supports five consistency levels: Strong, Bounded Staleness, Session, Consistent Prefix and Eventual. This gives customers the flexibility to choose how data is replicated between multiple geographical regions. Consistency models will be discussed in detail in a future post.

Guaranteed Low Latency: Cosmos DB guarantees latency of under 10 ms at the 99th percentile for reads and writes, i.e. 99% of requests complete in 10 ms or less. This guarantee applies to all consistency levels. It is made possible by the data being distributed globally across multiple Azure regions, so users can access data from the Azure region geographically closest to their physical location.

Multi-Language and Multi-API Support: Cosmos DB supports a wide range of languages and APIs. SDKs are available for Java, .NET, Python, Node.js, JavaScript etc. The following APIs are supported: SQL (Core) API, Table API, MongoDB API, Gremlin (Graph) API, Cassandra API etc. More details about API features will be discussed in future posts.


A multi-model database supports more than one of the main NoSQL data models:

Key-value: As the name suggests, data is stored as a tuple (pair), with a unique attribute (key) mapped to its corresponding content (value). Most key-value store databases use the JSON (JavaScript Object Notation) format to implement key-value pairs. Examples of key-value store databases: Redis, Riak, Berkeley DB, Couchbase and MemcacheDB.

Column Family or Columnar: In this data model, data is stored in groups of columns (column families) that are accessed together. It is a form of key-value store where the key is mapped to a value that is a set of columns. Examples: HBase, Cassandra, Amazon DynamoDB and Google Bigtable.

Document: The document data model allows data to be stored and queried as key-value based, JSON-style documents. Document databases extend JSON and enable users to create flexible schemas. For example, an online marketplace can have thousands of products with different attributes; rather than creating a separate table for each product, it is easier to store each product as a document with its own set of attributes. Examples: MongoDB, CouchDB, IBM Domino and DocumentDB (the precursor to Azure Cosmos DB).

Graph: Data is stored in terms of entities and relationships, enabling the creation of a map or network of connections between entities. Graph databases are very useful for modeling social networking scenarios. Examples: Neo4j, OrientDB and FlockDB.
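As a quick illustration of the document model in Cosmos DB, here is a minimal sketch using the azure-cosmos Python SDK (SQL/Core API); the endpoint, key, database and container names are placeholders:

from azure.cosmos import CosmosClient, PartitionKey

# Assumptions: endpoint, key and all names below are placeholders for a real account.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
database = client.create_database_if_not_exists("marketplace")
container = database.create_container_if_not_exists(
    id="products", partition_key=PartitionKey(path="/category")
)

# Flexible schema: each document can carry its own set of attributes.
container.create_item({"id": "1", "category": "bikes", "name": "Gravel bike", "gears": 22})

for item in container.query_items(
    "SELECT c.name FROM c WHERE c.category = 'bikes'",
    enable_cross_partition_query=True,
):
    print(item["name"])

Every item is automatically indexed, so the query above works without defining any schema or index up front.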

5. What are some ways to ingest data from on-prem storage to Azure?

One of the initial challenges customers come across after purchasing Azure Storage is moving on-prem data into the Azure Storage account. Microsoft provides data transfer solutions for various scenarios; let’s have a look at them.

The main factors to consider while choosing a data transfer solution are:

  • Data Size
  • Data Transfer Frequency (One-time or Periodic)
  • Network Bandwidth

Depending on the above factors, data movement solutions can be either offline or over the network. Let’s have a look:

Offline transfer: This method is used for one-time bulk data transfers. Microsoft can provide customers with disks or secure storage devices, or customers can ship their own disks to Microsoft. The offline options are named Data Box, Data Box Disk, Data Box Heavy and Import/Export (customers provide their own disks).

Network transfer: Data transfer over a network connection can be performed in the following ways:

  • Graphical interface: Ideal for transferring a few files when there is no need to automate the data transfer. Graphical options include Azure Storage Explorer and the Azure Portal.
  • Programmatic transfer: Scriptable data transfer tools include AzCopy, Azure PowerShell and the Azure CLI. SDKs for various programming languages are also available (see the short sketch after this list).
  • On-premises devices: A physical device called Data Box Edge and a virtual Data Box Gateway are installed at the customer’s premises to optimize data transfer to Azure.
  • Managed Data Factory pipeline: Azure Data Factory pipelines can be used to move, transform and automate regular data transfers from on-prem data stores to Azure.
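As a small example of the programmatic option, the following sketch (assuming the azure-storage-blob Python SDK; the connection string, container name and local folder are placeholders) walks a local folder and uploads each file to Blob Storage, which is essentially what AzCopy automates at much larger scale:

import os
from azure.storage.blob import BlobServiceClient

# Assumptions: connection string, container name and local folder are placeholders.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("landing")

local_root = r"C:\onprem-exports"
for dirpath, _, filenames in os.walk(local_root):
    for filename in filenames:
        local_path = os.path.join(dirpath, filename)
        # Preserve the folder structure in the blob name.
        blob_name = os.path.relpath(local_path, local_root).replace("\\", "/")
        with open(local_path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)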

Microsoft’s documentation also provides an informative decision chart to help customers choose a suitable data transfer solution for their requirements.

 

6. What is the best way to migrate data from on-prem databases to Azure?

Microsoft’s SQL Server has been one of the most popular relational databases for quite some time. The cloud version of SQL Server is called Azure SQL Database.

Azure provides the following options to move data from an existing on-premises SQL Server to Azure SQL Database:

  1. SQL Server Stretch Database: This option moves some of the data from SQL Server 2016 or above to Azure. It identifies cold rows that are accessed infrequently by users and moves them to the cloud. These rows can still be accessed and queried by users, but the performance is a bit slower. This helps lower costs, as storing data in Azure is cheaper than provisioning new storage on premises. It also results in quicker backups for the on-premises database.
  2. Azure SQL Database: This option is suitable for organizations that want to adopt a cloud-only strategy and move the whole database to Azure. There are some limitations to consider while migrating data from SQL Server to Azure SQL Database, since the two are not 100% compatible with each other, e.g. a table cannot have more than 1023 columns in Azure SQL Database.
  3. SQL Server Managed Instance: This is one of the configurations supported by Azure Database as a Service (DBaaS). Database maintenance is taken care of by Microsoft, and the database is almost 100% compatible with on-premises SQL Server.
  4. SQL Server on a Virtual Machine: If a customer wants complete control over database management and maintenance, this is the suitable option. It ensures 100% compatibility with the existing on-premises instance.

Microsoft provides a tool called Data Migration Assistant which helps customers identify which option is suitable based on their existing on-premises SQL Server setup. The tool provides useful information, e.g. which tables are suitable for cloud storage with SQL Server Stretch Database, and how much restructuring and re-engineering would be required to move the SQL Server instance to Azure SQL Database.
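After migration, applications connect to Azure SQL Database with the same drivers they used on premises; typically only the server name and credentials change. A minimal connectivity sketch, assuming the pyodbc library and placeholder server, database and credentials:

import pyodbc

# Assumptions: server, database and credentials are placeholders; ODBC Driver 17
# for SQL Server must be installed on the client machine.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;"
    "Uid=<user>;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")
print(cursor.fetchone()[0])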

7. What is the difference between Azure Data Lake Storage (ADLS) and Azure Synapse Analytics?

Azure Data Lake Storage Gen2 (ADLS Gen2) and Azure Synapse Analytics (formerly SQL Data Warehouse) are both highly scalable and capable of ingesting and processing huge amounts of data (on a petabyte scale). But there are some differences:

  • ADLS Gen2 is optimized for storing and processing structured and unstructured data, while Synapse Analytics is optimized for processing structured data in a well-defined schema.
  • ADLS Gen2 is used for data analytics and exploration by data scientists and engineers, while Synapse Analytics is used for business analytics, i.e. for disseminating data to business users.
  • ADLS Gen2 is built to work with Hadoop, while Synapse Analytics is built on SQL Server.
  • ADLS Gen2 offers no built-in regulatory compliance, while Synapse Analytics is compliant with regulatory standards such as HIPAA.
  • In ADLS Gen2, data is accessed using U-SQL (a combination of T-SQL and C#) and Hadoop tools, while in Synapse Analytics data is accessed using Synapse SQL (an improved version of T-SQL).
  • ADLS Gen2 can handle data streaming using tools such as Azure Stream Analytics, while Synapse Analytics has built-in data pipelines and data streaming capabilities (currently in preview).
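To illustrate the ADLS Gen2 side, here is a hedged sketch, assuming the azure-storage-file-datalake Python SDK with placeholder names, that creates a real directory hierarchy and writes a file into it (flat blob storage only simulates folders with name prefixes):

from azure.storage.filedatalake import DataLakeServiceClient

# Assumptions: connection string, file system and path names are placeholders.
service = DataLakeServiceClient.from_connection_string("<adls-gen2-connection-string>")
file_system = service.create_file_system("raw")                  # raises if it already exists
directory = file_system.create_directory("clickstream/2020/07")  # true directories
file_client = directory.create_file("events.json")

data = b'{"userId": 42, "page": "/home"}\n'
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))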

 

8. What are the various consistency models available in Azure Cosmos DB?

Consistency models, also known as consistency levels, give developers a way to trade off consistency against availability and performance. Usually the choice is between two extremes, but Cosmos DB provides a granular scale of consistency levels that can be configured to suit specific business requirements.


Let’s have a look at each of the consistency models available in Cosmos DB:

  1. Strong: This consistency model guarantees that every read operation fetches the most recent version of the data. For this to work, Cosmos DB must ensure that each write operation is propagated to all replicas before the operation is considered complete. The cost of a read operation is higher compared to the other consistency models.
  2. Bounded Staleness: In this consistency model, there is an option to set a maximum lag between the write and the read operation. The bound can also be set in terms of item versions instead of a time interval. This consistency level is suitable for scenarios where availability and consistency have equal priority.
  3. Session: This is the default and the most popular consistency level in Cosmos DB. It guarantees consistency within a client session: a user reading from the same session (typically the same region) where the writes were performed will see the latest data, while users in other regions may read slightly older data. It offers low latency reads and writes.
  4. Consistent Prefix: This consistency level guarantees that users never see out-of-order writes, but there is no time-bound replication of data across regions. E.g. if data is written in the order 1, 2, 3, a user may see 1; 1, 2; or 1, 2, 3, but never an out-of-order sequence such as 2, 1 or 1, 3.
  5. Eventual: As the name suggests, this consistency level does not guarantee any time-bound or version-bound replication. If no new data is written, all replicas will eventually be synced and have the latest data. Out-of-order reads are possible. It has the lowest read latency but also the lowest level of consistency.
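The consistency level is set as a default on the Cosmos DB account and can be relaxed per client or per request. A minimal sketch, assuming the azure-cosmos Python SDK and a placeholder endpoint and key:

from azure.cosmos import CosmosClient

# Assumption: endpoint and key are placeholders. A client can only relax (never
# strengthen) the consistency level configured on the account.
client = CosmosClient(
    "https://<account>.documents.azure.com:443/",
    credential="<account-key>",
    consistency_level="Session",
)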

 

9. What is a Cosmos DB synthetic partition key?

It is important to select a good partition key that distributes the data evenly across multiple partitions. However, if there is no suitable property with well-distributed values, we can create a synthetic partition key. There are three ways to create one:

  1. Concatenate Properties: We can combine multiple property values to form a synthetic partition key. For example, suppose we have the following document:

{
  "deviceId": "ade-123",
  "date": 2020
}

The two properties deviceId and date can be combined into a synthetic partition key as shown below:

{
  "deviceId": "ade-123",
  "date": 2020,
  "partitionKey": "ade-123-2020"
}

  2. Random Suffix: In this method, a random number is appended to the partition key value. For example, if the partition key is a date and we append a random number between 1 and 500, the stored values will look like 2020-07-05.1, 2020-07-05.2, 2020-07-05.3 and so on. Since the suffix is random, write operations for a given date will be evenly distributed across partitions.

  3. Pre-calculated Suffix: The random suffix method above makes write operations faster, but it does not improve read performance, since we do not know which partition to go to when we want to query a specific value.

To improve read performance, we can use the pre-calculated suffix strategy. For example, suppose the partition key is a date and the data is frequently queried by Vehicle Identification Number (VIN). We can use a hash function that takes the VIN as input and returns a number within a range, say between 1 and 500, and append that number to the date value to form the synthetic partition key. When a query arrives with a specific VIN, we can apply the same hash function to determine the partition suffix and go straight to the partition containing that VIN (see the sketch below).
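A minimal sketch of the concatenation and pre-calculated suffix strategies in Python; the property names and the 1-500 bucket range come from the examples above:

import hashlib

def concatenated_key(device_id: str, date: int) -> str:
    # Strategy 1: concatenate properties, e.g. "ade-123-2020"
    return f"{device_id}-{date}"

def precalculated_suffix_key(date: str, vin: str, buckets: int = 500) -> str:
    # Strategy 3: hash a frequently queried value (the VIN) into a stable bucket 1..500
    bucket = int(hashlib.sha256(vin.encode("utf-8")).hexdigest(), 16) % buckets + 1
    return f"{date}.{bucket}"

print(concatenated_key("ade-123", 2020))                             # -> "ade-123-2020"
print(precalculated_suffix_key("2020-07-05", "1HGCM82633A004352"))   # -> "2020-07-05.<bucket>"

Because the hash is deterministic, a query that supplies the VIN can recompute the suffix and target exactly one logical partition.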

10. How do you capture streaming data (e.g. website clickstreams, social media feeds etc.) in Azure?

As connected devices become more commonplace, we are moving towards a true IoT world where every device is connected to the internet and generates a continuous stream of data. This creates a new requirement: collecting, storing and analyzing these continuous data streams. To address this need, Azure provides a dedicated analytics service, aptly named Azure Stream Analytics.


Azure Stream Analytics is designed to analyze high volumes of high-velocity data from multiple sources (primarily IoT data) in real time.

A Stream Analytics job is fully managed by the Azure cloud, i.e. users do not need to provision any hardware. It is also fully integrated with other Azure services and can be set up with a few clicks. The output of an Azure Stream Analytics job can be sent to a visualization service such as Power BI, or stored in another Azure analytics service such as Azure HDInsight or Azure Synapse Analytics for further processing.

Azure Stream Analytics provides a simple SQL-based language called Stream Analytics Query Language. Developers can extend the query language by defining additional machine learning functions.
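For illustration, here is what a simple Stream Analytics Query Language statement might look like: it counts clicks per page over a 30-second tumbling window. The query is shown inside a Python string purely for formatting; in practice it is defined in the Stream Analytics job itself, and the input/output alias names are assumptions:

# Hypothetical SAQL query; "clickstream-input" and "powerbi-output" are assumed
# aliases defined on the Stream Analytics job's inputs and outputs.
CLICKS_PER_PAGE_QUERY = """
SELECT
    page,
    COUNT(*) AS clicks
INTO [powerbi-output]
FROM [clickstream-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY page, TumblingWindow(second, 30)
"""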

Azure Stream Analytics can process massive amounts of data on the scale of over a million events per second and deliver the results with ultra-low latency.

Azure Stream Analytics complies with leading industry standards and certifications such as HIPAA, ISO 27001 and FedRAMP.

11. What is Azure Storage Explorer? What is it used for?

Azure Storage Explorer is a versatile standalone application for managing Azure Storage from any platform. It is available for Windows, macOS and Linux and can be downloaded from Microsoft.

Azure Storage Explorer provides access to multiple Azure data stores such as Cosmos DB, ADLS Gen2, Queues, Tables and Blobs, through an easy-to-navigate, rich GUI.


After downloading, the next step is to link an Azure storage account. This enables users to connect to storage resources already available in Azure or to create and manage new data stores.

One of the key features of Azure Storage Explorer is that it allows users to work even when they are disconnected from the Azure cloud service. This is achieved by attaching local emulators. Users have the option to use Azure Storage Emulator on Windows or Azurite which supports macOS and Linux. Users can benefit from cost savings and increased productivity.

The simplest way to connect to Cosmos DB is by using a connection string. Similarly, the easiest way to connect to ADLS Gen2 is by using its URI.

Azure Storage Explorer provides a one-stop solution through which users can manage data across multiple data stores seamlessly.

12. What is Azure Databricks? How is it different from the original Databricks?

Azure Databricks is the Azure implementation of Apache Spark, an open source big data processing platform. Azure Databricks was developed in consultation with Matei Zaharia, the creator of Apache Spark, who later co-founded the company Databricks.


Databricks sits in the data preparation or processing stage of the data lifecycle. This starts with data being ingested into Azure using Data Factory and stored in permanent storage (such as ADLS Gen2 or Blob Storage). Alternatively, data can be streamed in using Kafka, Event Hubs or IoT Hub.

In the next stage, data is processed in Databricks, which uses Apache Spark under the hood, often with machine learning, and the extracted insights are then loaded into one of the analytics services in Azure (Cosmos DB, Synapse Analytics or SQL Database).

These insights are now ready to be visualized and presented to end users with the help of analytical reporting tools such as Power BI.

Let’s have a look at the Apache Spark modules available in Azure Databricks:

Spark SQL and DataFrames: Spark SQL is the module that enables users to query structured data. A DataFrame is a Spark data structure equivalent to a table in SQL; data in a DataFrame is organized into named columns.
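A short example of these two modules in PySpark; the file path is a placeholder, and in a Databricks notebook a SparkSession named spark already exists, so the builder line is only needed elsewhere:

from pyspark.sql import SparkSession

# In Databricks the `spark` session is provided automatically; this line is effectively a no-op there.
spark = SparkSession.builder.getOrCreate()

# Read a CSV file into a DataFrame (named columns, schema inferred).
df = spark.read.csv("/mnt/raw/sales.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("sales")

# DataFrame operations and Spark SQL are two views of the same engine.
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()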

Streaming: This module adds real-time data processing capabilities. Data can be streamed from sources such as HDFS, Flume or Kafka.

MLlib: The machine learning library consists of a variety of machine learning tools such as ML algorithms, pipelines and utilities.

GraphX: The GraphX module provides graph computation features for various analytics use cases.

Spark Core API: Provides support for R, SQL, Python, Scala and Java.

13. What is the primary ETL (Extract, Transform, Load) service in Azure? How is it different from on-prem tools such as SSIS?

Azure Data Factory is the primary ETL service in Azure. It is similar in functionality to SSIS in terms of data transformation and integration, but offers more comprehensive task automation and orchestration features.

Azure Data Factory is a largely code-free ETL and data integration service that enables users to create data pipelines and workflows to ingest, transform and move data at scale. It can also be used to automate various maintenance tasks such as partitioning. The comparable on-prem ETL tool, bundled with SQL Server Enterprise Edition, is SQL Server Integration Services (SSIS).


Let’s have a look at some of the components and concepts related to Azure Data Factory:

Pipeline: A pipeline is a logical grouping of activities that together perform a specific task. In its most basic form, a pipeline might simply move data from a source to a target.

Activity: An activity is a processing step within a pipeline, e.g. a copy activity that copies data from a source file to a target table in a data store. Activities fall into three types: data movement activities, data transformation activities and control activities.

Linked Service: The equivalent of a connection string in SSIS. It stores the information required to connect to an external resource. A linked service can represent either a data store or a compute resource in Azure.

Dataset: Datasets in Azure Data Factory represent the structure of the data. They are simply a reference to the data that an activity uses, usually as an input or an output.

Triggers: As the name suggests, a trigger invokes an execution of the pipeline. It determines when a pipeline execution needs to be kicked off.

Control Flow: This refers to the orchestration of activities within a pipeline, which can include conditional branching, outcome-dependent data partitioning, passing parameters etc. An important control flow feature is looping containers, e.g. the ForEach iterator.

Azure Data Factory provides a UI where developers can access all the above features with no coding required. It is also possible to interact with Azure Data Factory programmatically using Azure PowerShell, .NET, Python and the REST API, as shown in the sketch below.
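For instance, an existing pipeline can be run and monitored from Python with the azure-mgmt-datafactory management SDK. This is only a sketch: the subscription, resource group, factory and pipeline names are placeholders, and the pipeline is assumed to already exist:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Assumptions: all names below are placeholders for real Azure resources.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="CopySalesData",
)
status = adf.pipeline_runs.get("<resource-group>", "<data-factory-name>", run.run_id)
print(status.status)  # e.g. Queued, InProgress, Succeeded or Failed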

14. What is serverless database computing? How is it implemented in Azure?

Serverless computing is a relatively new development in the cloud computing space. In the traditional computing model, program code resides on a server or client that someone has to provision and manage. Serverless computing changes this paradigm: code is stateless and does not require the user to provision or manage any infrastructure.

Customers only pay for the compute resources used by the code during the short time it takes to execute (usually a few seconds or milliseconds), which is very cost effective. These code snippets that perform specific tasks are called functions.

Azure Functions is the serverless computing service on the Azure platform, providing Function as a Service (FaaS) to customers. Azure Functions is very tightly integrated with Azure Cosmos DB.

Cosmos DB triggers are pieces of procedural code that run automatically when a database event such as an insert, update or delete occurs.

 There are three ways in which Azure Functions can be used with Azure Cosmos DB for serverless computing:

  1. Cosmos DB Trigger: This option invokes an Azure Function based on changes in a Cosmos DB container. When a change is made to the container, the Cosmos DB change feed is delivered to the Cosmos DB trigger, which then invokes the Azure Function (a minimal function sketch follows after this list).
  2. Input Binding: Once an input binding is created, a trigger can be used to invoke the function; when the function executes, data is read from the Cosmos DB container.
  3. Output Binding: If a function is bound to an Azure Cosmos DB container using an output binding, data is written to the container when the function executes.
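A minimal sketch of option 1 using the Azure Functions Python (v1) programming model; the Cosmos DB connection, database and container are assumed to be declared in the accompanying function.json binding, so the names here are placeholders:

# __init__.py of an Azure Function wired to a Cosmos DB trigger.
import logging
import azure.functions as func

def main(documents: func.DocumentList) -> None:
    # `documents` contains the batch of changed items delivered by the change feed.
    if documents:
        logging.info("Change feed delivered %d document(s)", len(documents))
        for doc in documents:
            logging.info("Changed document id: %s", doc["id"])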


15. How is data security implemented in ADLS Gen2?

Data security and privacy are very important and, in some parts of the world, also mandated by law. Data security is also one of the primary concerns that discourages organizations from moving data to the cloud. ADLS Gen2 has a robust, multi-layered security model.

Let’s have a look at the six data security layers of ADLS Gen2:

  1. Authentication: The first layer, which provides user account security. ADLS Gen2 supports three authentication modes: Azure Active Directory (AAD), Shared Access Signatures (SAS) and Shared Key.
  2. Access Control: The next layer restricts access to individual containers, folders or files. This can be managed using roles and Access Control Lists (ACLs), as sketched below.
  3. Network Isolation: This layer enables administrators to allow access only from particular virtual networks or IP addresses.
  4. Data Protection: Achieved by always encrypting in-transit data using HTTPS. Options to encrypt stored data (encryption at rest) are also available.
  5. Advanced Threat Protection: If enabled, ADLS Gen2 monitors attempts to access or exploit the storage account without authorization.
  6. Auditing: The sixth and final layer of security. ADLS Gen2 provides comprehensive auditing features where all account management activity is logged. These logs can be reviewed later to ensure the highest level of security.
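As an example of the Access Control layer, here is a hedged sketch using the azure-storage-file-datalake Python SDK to set a POSIX-style ACL on a directory; the account URL, credential, file system, path and AAD object id are placeholders:

from azure.storage.filedatalake import DataLakeServiceClient

# Assumptions: account URL, credential and paths are placeholders.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key-or-AAD-credential>",
)
file_system = service.get_file_system_client("raw")
directory = file_system.get_directory_client("sales/2020")

# Owner gets full access, a specific AAD object id gets read/execute, everyone else gets nothing.
directory.set_access_control(acl="user::rwx,group::r-x,other::---,user:<aad-object-id>:r-x")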

With six layers of data protection, ADLS Gen2 provides one of the best data security models in the cloud storage market today.


