1. What is Microsoft Azure?
Azure is Microsoft's cloud computing platform. Cloud computing is an on-demand computing service (which may include both hardware and software) where the service provider manages the underlying resources and makes them available to users over the internet. So instead of "buying" hardware or software, which means paying to own the product, users "rent" the hardware or software resources from the cloud service provider and only pay for the capacity and time that they actually use. The main advantages of using a cloud service are flexibility, scalability, cost effectiveness and ease of use.
Cloud services can be broadly grouped into three categories based on the type of service offered:
IaaS (Infrastructure as a Service): A service provider offers cloud-based hardware and system resources such as CPU (compute), RAM and storage (disk space).
PaaS (Platform as a Service): A software platform (e.g. a database, IDE or design tool) is offered as a service. The main users of PaaS tools are software developers, power users and designers.
SaaS (Software as a Service): A finished software product is offered in the cloud and can be accessed by end users through a browser or app from any internet-connected device, e.g. office productivity tools such as Office 365 and Google Docs, or accounting tools such as Xero.
2. What are the various storage types available in Azure?
Azure Storage is Microsoft's cloud storage platform. Like other cloud offerings, it is a managed service and caters to a variety of storage requirements and scenarios. It includes the following storage options:
Azure Blobs: object storage for unstructured data such as documents, images and backups (ADLS Gen2 is built on top of Blob storage).
Azure Files: fully managed file shares accessible via the SMB protocol.
Azure Queues: simple message storage for asynchronous communication between application components.
Azure Tables: a NoSQL key-attribute store for structured, non-relational data.
Azure Disks: block-level storage volumes for Azure virtual machines.
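The sketch below shows how an application can reach one of these stores programmatically. It is a minimal example, assuming the azure-storage-blob Python package and a placeholder connection string; the container name is illustrative.

from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceExistsError

# Connect to the storage account (replace the placeholder connection string).
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")

# Create a blob container, ignoring the error if it already exists.
try:
    service.create_container("demo-container")
except ResourceExistsError:
    pass

# List all containers in the account.
for container in service.list_containers():
    print(container.name)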
3. What is data redundancy? What data redundancy options are available in Azure?
Data redundancy is the practice of storing multiple copies of data to ensure that the data remains available even during unexpected events, e.g. a disk failure or a natural disaster.
To ensure high levels of data availability, Azure always keeps multiple copies of data. Depending on the criticality of the data and the time required to switch over to a replica, there are a few data redundancy options available to customers in Azure:
LRS (Locally redundant storage): three copies of the data are kept within a single data center in the primary region.
ZRS (Zone-redundant storage): copies are spread across three availability zones in the primary region.
GRS (Geo-redundant storage): data is replicated locally in the primary region and asynchronously copied to a secondary region.
RA-GRS (Read-access geo-redundant storage): same as GRS, with read access to the data in the secondary region.
GZRS / RA-GZRS (Geo-zone-redundant storage): combines zone redundancy in the primary region with replication to a secondary region (optionally with read access).
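As a rough illustration of how a redundancy option is chosen, the sketch below creates a storage account with the Standard_GRS SKU. It is a minimal sketch, assuming the azure-mgmt-storage and azure-identity packages; the subscription ID, resource group, account name and region are placeholders, and the SKU name can be swapped for Standard_LRS, Standard_ZRS, Standard_RAGRS etc.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The sku name selects the redundancy option for the new account.
poller = client.storage_accounts.begin_create(
    resource_group_name="my-resource-group",
    account_name="mystorageaccount123",
    parameters={
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_GRS"},
    },
)
print(poller.result().name)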
Azure maintains data integrity by using cyclic redundancy checks (CRCs) on the stored data and checksums on the network traffic. If any errors are found, it automatically performs a "repair" using the redundant copies.
4. What are multi-model databases? What is the primary multi-model database service available on the Microsoft Azure platform?
Cosmos DB (formerly DocumentDB) is Microsoft's premier NoSQL service offering on Azure. It was the first globally distributed, multi-model database offered in the cloud by any vendor.
Let's have a look at the main features of Cosmos DB:
Globally
Distributed: It is extremely easy to distribute data globally by using multiple
Azure regions. Having data spread across multiple geographies enables faster
access to data and a better overall user experience.
Automatic Indexing: Cosmos DB indexes all stored data automatically, irrespective of the underlying schema used. No schema or index management is required. For customers who would like more control, there is an option to customize indexing as well.
Multi-model: It can be used to store data in various data models such as key-value, document, graph and column-family. No matter which data model the customer chooses, the consistency, low latency, global distribution and automatic indexing features remain the same. This will be discussed in detail in a future post.
Multiple consistency models: Cosmos DB supports five consistency models: Eventual, Consistent Prefix, Session, Bounded Staleness and Strong. This gives customers the flexibility to choose the way data is replicated between multiple geographical regions. Consistency models will be discussed in detail in a future post.
Guaranteed Low Latency: Cosmos DB guarantees 10 ms latency at the 99th percentile for reads and writes, meaning 99% of requests will have a latency of 10 ms or less. This guarantee applies to all consistency levels. It is made possible by the data being distributed globally across multiple Azure regions, so users can access data from the Azure region geographically closest to their physical location.
Multi-Language and Multi-API Support: Cosmos DB supports a wide range of languages and APIs. SDKs are available for Java, .NET, Python, Node.js/JavaScript and more. The following APIs are supported: the SQL (Core) API, Table API, API for MongoDB, Gremlin (graph) API and Cassandra API. More details about API features will be discussed in future posts. (A short Python sketch using the SQL API appears after the data-model descriptions below.)
Key-value: As the name suggests, data is stored as a tuple (pair), with a unique attribute (key) mapped to its corresponding content (value). Most key-value store databases use the JSON (JavaScript Object Notation) format to implement key-value pairs. Examples of key-value store databases: Redis, Riak, Berkeley DB, Couchbase and MemcacheDB.
Column Family or Columnar: In this type of data model, data is stored in groups of column families that are accessed together. It is a form of key-value store, where the key is mapped to a value consisting of a group of columns. Examples: HBase, Cassandra, Amazon DynamoDB and Google Bigtable.
Document: The document data model allows data to be stored and queried as key-value based, JSON-style documents. Document databases extend the functionality of JSON and enable users to create flexible schemas. E.g. an online marketplace can have thousands of products with different attributes; rather than creating a separate table for each product type, it is easier to store each product as a document with its own set of attributes. Examples: MongoDB, CouchDB, IBM Domino and DocumentDB (the precursor to Azure Cosmos DB).
Graph: Data is stored in terms of entities and relationships. It enables the creation of a map or network of connections between entities. Graph databases are very useful when it comes to modeling social networking scenarios. Examples: Neo4j, OrientDB and FlockDB.
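To make the multi-model and automatic-indexing points concrete, the sketch below stores and queries a JSON document through the SQL (Core) API. It is a minimal sketch, assuming the azure-cosmos Python package; the endpoint, key, database name, container name and the /category partition key are placeholders for an existing account.

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("shop").get_container_client("products")

# Every stored item is indexed automatically, whatever its schema.
container.upsert_item({
    "id": "p-1001",
    "category": "books",
    "title": "Azure for Data Engineers",
    "price": 29.99,
})

# Query the container using the SQL API.
query = "SELECT c.id, c.title FROM c WHERE c.category = 'books'"
for item in container.query_items(query=query, enable_cross_partition_query=True):
    print(item)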
5. What are some ways to ingest data from on-prem storage to Azure?
One of the initial challenges customers come across after purchasing Azure Storage is moving on-prem data into the Azure Storage account. Microsoft provides data transfer solutions for various scenarios. In today's post we are going to have a look at these solutions.
The main factors to consider while choosing a data transfer solution are the size of the data, the available network bandwidth and whether the transfer is a one-time or a periodic operation.
Depending on the above factors, data movement solutions can be either offline or over the network. Let's have a look:
Offline transfer: This method is used for one-time bulk data transfers. Microsoft can provide customers with disks or secure storage devices, and customers also have the option of shipping their own disks to Microsoft. The offline options are named Data Box, Data Box Disk, Data Box Heavy and Import/Export (customers provide their own disks).
Network transfer: Data transfer over a network connection can be performed in the following ways: through a graphical interface such as the Azure portal or Azure Storage Explorer, through a scriptable command-line tool such as AzCopy, through the Azure Storage REST APIs/SDKs, or through a managed pipeline service such as Azure Data Factory.
Microsoft's documentation provides a helpful decision chart for choosing a suitable data transfer solution based on data size and available bandwidth.
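For small or periodic network transfers, the upload can also be scripted. The sketch below copies a single on-prem file into Blob Storage; it is a minimal example, assuming the azure-storage-blob Python package, and the connection string, container, blob name and local path are placeholders (for very large datasets, AzCopy or the Data Box family are better suited).

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="landing", blob="sales/2020/sales.csv")

# Stream the local file over the network into the target blob.
with open(r"C:\data\sales.csv", "rb") as source_file:
    blob.upload_blob(source_file, overwrite=True)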
6. What is the best way to migrate data from on-prem databases to Azure?
Microsoft's SQL Server has been one of the most popular relational databases for quite some time. The cloud version of SQL Server is called Azure SQL Database.
Azure provides the following options to move data from an existing on-premises SQL Server to Azure SQL Database: exporting and importing a BACPAC file, using the Data Migration Assistant, using the Azure Database Migration Service for larger or online migrations, and transactional replication when minimal downtime is required.
Microsoft provides a tool called Data Migration Assistant which helps customers identify which options are suitable based on their existing on-premises SQL Server setup. This tool provides useful information, e.g. which tables are suitable for configuring cloud storage for SQL Server Stretch Database, and how much restructuring and re-engineering would be required to move the SQL Server instance to Azure SQL Database.
7. What is the difference between Azure Data Lake Storage (ADLS) and Azure Synapse Analytics?
Azure Data Lake Storage Gen2 (ADLS Gen2) and Azure Synapse Analytics (formerly SQL Data Warehouse) are both highly scalable and able to ingest and process huge amounts of data (on a petabyte scale). But there are some differences:
| ADLS Gen2 | Azure Synapse Analytics |
| --- | --- |
| Optimized for storing and processing both structured and unstructured data | Optimized for processing structured data in a well-defined schema |
| Used for data analytics and exploration by data scientists and engineers | Used for business analytics, i.e. for disseminating data to business users |
| Built to work with Hadoop | Built on SQL Server |
| No regulatory compliance | Compliant with regulatory standards such as HIPAA |
| U-SQL (a combination of T-SQL and C#) and Hadoop used for accessing data | Synapse SQL (an improved version of T-SQL) used for accessing data |
| Can handle data streaming using tools such as Azure Stream Analytics | Built-in data pipelines and data streaming capabilities (currently in preview) |
8. What are the various consistency models available in Azure Cosmos DB?
Consistency models, also known as consistency levels, give developers a way to trade off data consistency against availability, latency and throughput. Usually the choice is between two extremes, but Cosmos DB provides a granular scale of consistency levels that can be configured to suit specific business requirements.
Let's have a look at each of the consistency models available in Cosmos DB:
Strong: reads are guaranteed to return the most recent committed version of an item.
Bounded Staleness: reads may lag behind writes, but by no more than a configured number of versions or a configured time interval.
Session: within a single client session, reads honour read-your-own-writes and monotonic-read guarantees; this is the default level.
Consistent Prefix: reads never see out-of-order writes; updates are observed in the order they were made.
Eventual: no ordering guarantee; in the absence of further writes, replicas eventually converge.
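A consistency level can also be requested per client. The sketch below is a minimal example, assuming the azure-cosmos Python package; the endpoint and key are placeholders, and a client may only request a level equal to or weaker than the account's default.

from azure.cosmos import CosmosClient

# Request Session consistency for this client; other valid values include
# "Eventual", "ConsistentPrefix", "BoundedStaleness" and "Strong".
client = CosmosClient(
    "https://<account>.documents.azure.com:443/",
    credential="<key>",
    consistency_level="Session",
)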
9. What is a Cosmos DB synthetic partition key?
It is important to select a good partition key that distributes the data evenly across multiple partitions. However, if there is no suitable property with well-distributed values, we can create a synthetic partition key. There are three ways to create a synthetic partition key:
1. Concatenate multiple properties: two or more properties of an item are combined into a single value. For example, consider an item with the following properties:
{
  "deviceId": "ade-123",
  "date": 2020
}
The two properties deviceId and date can be combined into a synthetic partition key as shown below:
{
  "deviceId": "ade-123",
  "date": 2020,
  "partitionKey": "ade-123-2020"
}
2. Random Suffix: In this method, a random number is appended to the end of the partition key value. For example, if the partition key is a date and we append a random number between 1 and 500, the resulting values arranged in ascending order will look like 2020-07-05.1, 2020-07-05.2, 2020-07-05.3 and so on. Since the suffix is random, write operations for a given date will be evenly distributed across multiple partitions.
3. Pre-calculated Suffix: The random suffix method above makes write operations faster, but it does not improve read performance, since we do not know which partition to go to when we want to query a specific value.
To improve read performance, we can use the pre-calculated suffix strategy. For example, suppose we have a partition key with date values and a Vehicle Identification Number (VIN) that is frequently used to query the data. We can use a hash function that takes the VIN as input and returns a number within a range, say between 1 and 500, and then use this number as a suffix to the date value when creating the synthetic partition key. This way, when a query contains a specific VIN, we can use the same hash function to determine the partition suffix and go straight to the partition containing that VIN (see the sketch below).
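The sketch below is a minimal, plain-Python illustration of this pre-calculated suffix strategy; the property names, the 1-500 bucket range and the sample VIN are all illustrative.

import hashlib

def partition_suffix(vin: str, buckets: int = 500) -> int:
    # Hash the VIN deterministically so the same VIN always yields the same suffix.
    digest = hashlib.sha256(vin.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets + 1

def synthetic_partition_key(date: str, vin: str) -> str:
    return f"{date}.{partition_suffix(vin)}"

item = {"vin": "1HGCM82633A004352", "date": "2020-07-05"}
item["partitionKey"] = synthetic_partition_key(item["date"], item["vin"])
print(item["partitionKey"])  # e.g. 2020-07-05.137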
10. How do you capture streaming data (e.g. website clickstream, social media feed etc.) in Azure?
As connected devices become more commonplace, we are moving towards a true IoT world, where every device is connected to the internet and able to generate a continuous stream of data. This creates a new requirement to collect, store and analyze this continuous stream of data. To address this need, Azure provides a dedicated analytics service, aptly named Azure Stream Analytics.
Azure Stream Analytics is designed to analyze high volumes of high-velocity data from multiple sources (primarily IoT data) in real time.
A Stream Analytics job is fully managed by the Azure cloud, i.e. users do not need to provision any hardware. It is also fully integrated with other Azure services and very easy to set up with a few clicks. The output of an Azure Stream Analytics job can be sent to a visualization service such as Power BI, or stored in another Azure analytics service such as Azure HDInsight or Azure Synapse Analytics for further processing.
Azure Stream Analytics provides a simple SQL-based language called the Stream Analytics Query Language. Azure Stream Analytics also gives developers the ability to extend the query language by defining additional Machine Learning functions.
Azure Stream
Analytics can process massive amounts of data on the scale of over a million
events per second and deliver the results with ultra-low latency.
Azure Stream Analytics is compliant with leading industry certifications such as HIPAA, ISO27001, FedRAMP etc.
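A common way to feed a Stream Analytics job is to push events into Azure Event Hubs and use the event hub as the job's input. The sketch below sends a couple of clickstream events; it is a minimal example, assuming the azure-eventhub Python package, and the connection string and event hub name are placeholders.

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="clickstream"
)

# Batch up a few clickstream events and send them to the event hub.
batch = producer.create_batch()
batch.add(EventData(json.dumps({"userId": "u-42", "page": "/home", "ts": "2020-07-05T10:15:00Z"})))
batch.add(EventData(json.dumps({"userId": "u-42", "page": "/products", "ts": "2020-07-05T10:15:07Z"})))
producer.send_batch(batch)
producer.close()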
11. What is Azure Storage Explorer? What is it used for?
Azure Storage Explorer is a versatile standalone application for managing Azure Storage from any platform. It is available for Windows, macOS and Linux and can be downloaded from Microsoft.
Azure Storage Explorer provides access to multiple Azure data stores such as Cosmos DB, ADLS Gen2, Queues, Tables and Blobs, through an easy-to-navigate, rich GUI.
After downloading, the next step is to link an Azure storage account. This enables users to connect to the storage resources already available in Azure, or to create and manage new data stores.
One of the key features of Azure Storage Explorer is that it allows users to keep working even when they are disconnected from the Azure cloud service. This is achieved by attaching local emulators: users have the option of using the Azure Storage Emulator on Windows, or Azurite, which supports macOS and Linux. Working against a local emulator can bring cost savings and increased productivity.
The simplest way to connect to Cosmos DB is to use a connection string. Similarly, the easiest way to connect to ADLS Gen2 is by using its URI.
Azure Storage Explorer provides a one stop solution through which users can manage data on multiple data stores seamlessly.
12. What is Azure Databricks? How is it different from the original Databricks?
Azure Databricks is the Azure-hosted implementation of Apache Spark, an open source big data processing platform. Azure Databricks was developed in partnership with Databricks, the company founded by the creators of Apache Spark (including its original author, Matei Zaharia).
Databricks sits in the data preparation or processing stage of the data lifecycle. This starts with the data being ingested into Azure using Data Factory and stored in permanent storage (such as ADLS Gen2 or Blob Storage). Alternatively, data can be streamed in using Kafka, Event Hubs or IoT Hub.
In the next stage, data is processed (for example with machine learning) in Databricks, which uses Apache Spark under the hood, and the extracted insights are then loaded into one of the analytical data stores in Azure (Cosmos DB, Synapse Analytics or SQL Database).
These insights are
now ready to be visualized and presented to the end users with the help of
Analytical reporting tools such as Power BI.
Let’s have a look
at the Apache Spark modules available in Azure Databricks:
Spark SQL and DataFrames: Spark SQL is the module that enables users to query structured data. A DataFrame is a Spark data structure that is the equivalent of a table in SQL; data in a DataFrame is organized into named columns.
Streaming: This module adds real-time data processing capabilities. Data can be streamed from sources such as HDFS, Flume or Kafka.
MLlib: Machine Learning
library consists of a variety of machine learning tools such as ML Algorithms,
pipelines and utilities.
GraphX: The GraphX module provides graph computation features for various analytics use cases.
Spark Core API: support for R,
SQL, Python, Scala, and Java.
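The sketch below shows the Spark SQL and DataFrame modules side by side, as they would typically be used from a Databricks notebook (where a SparkSession named spark is already provided). The input path is a placeholder for data mounted from ADLS Gen2 or Blob Storage.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already available inside Databricks

# Read raw JSON events into a DataFrame (data organized into named columns).
events = spark.read.json("/mnt/datalake/raw/events/*.json")

# DataFrame API: count events per user.
per_user = events.groupBy("userId").agg(F.count("*").alias("event_count"))
per_user.show()

# Spark SQL: the same aggregation expressed as a SQL query.
events.createOrReplaceTempView("events")
spark.sql("SELECT userId, COUNT(*) AS event_count FROM events GROUP BY userId").show()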
13. What is the primary ETL (Extract Transform Load) service in Azure? How is it different from on-prem tools such as SSIS?
Azure Data Factory is similar in functionality to SSIS in terms of data transformation and integration, with more comprehensive task automation and orchestration features.
Azure Data Factory is a largely code-free ETL and data integration service that enables users to create data pipelines and workflows to ingest, transform and transport data at scale. It can also be used to automate various maintenance tasks such as partitioning. Azure Data Factory can be seen as the cloud counterpart of SQL Server Integration Services (SSIS), the on-prem ETL tool that ships with SQL Server.
Let’s have a look
at some of the components and concepts related to Azure Data Factory:
Pipeline: A pipeline is
a logical grouping of activities to perform a specific task. In the most basic
form, a pipeline would be a dataflow to move data from a source to a target.
Activity: An activity is a processing step within a pipeline, e.g. a copy activity that copies data from a source file to a target table in a data store. Activities can be categorized into three types: data movement activities, data transformation activities and control activities.
Linked Service: This is the
equivalent of a connection string in SSIS. It stores the information that is
required to connect to an external resource. A linked service can be used to
represent either a data store or a compute resource in Azure.
Dataset: Datasets in Azure Data Factory represent the structure of the data. They are simply a reference to the data that users want to consume in an activity, usually as an input or an output.
Triggers: As the name
suggests, a trigger invokes an execution of the pipeline. It determines when a
pipeline execution needs to be kicked off.
Control Flow: This refers to the orchestration of the activities within a pipeline. It can include conditional branching, outcome-dependent data partitioning, passing parameters etc. An important control flow feature is looping containers, e.g. the ForEach iterator.
Azure Data Factory provides a UI where developers can access all of the above features with no coding required. In addition, it is possible to interact with Azure Data Factory programmatically using Azure PowerShell, .NET, Python and the REST API.
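As a rough illustration of the programmatic route, the sketch below starts a run of an existing pipeline from Python. It is a minimal example, assuming the azure-mgmt-datafactory and azure-identity packages; the subscription ID, resource group, factory name, pipeline name and parameter are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off a run of an existing pipeline, passing a pipeline parameter.
run = adf.pipelines.create_run(
    resource_group_name="my-resource-group",
    factory_name="my-data-factory",
    pipeline_name="CopySalesDataPipeline",
    parameters={"runDate": "2020-07-05"},
)
print("Started pipeline run:", run.run_id)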
14. What is serverless database computing? How is it implemented in Azure?
Serverless computing is a relatively recent development in the cloud computing space. In the traditional computing scenario, program code resides either on the server or the client. Serverless computing changes this paradigm with stateless code, i.e. code that does not require the user to provision or manage any infrastructure.
Customers only pay for the compute resources used while the code is executing (usually a few seconds or milliseconds), which is very cost effective. These code snippets that perform specific tasks are called functions.
Azure Functions is the serverless computing service on the Azure platform, providing Function as a Service (FaaS) capabilities to customers. Azure Functions is very tightly integrated with Azure Cosmos DB.
Cosmos DB triggers allow a function to run automatically when a database event such as an insert or update occurs in a container.
There are three ways in which Azure Functions can be used with Azure Cosmos DB for serverless computing:
1. Trigger: The Cosmos DB trigger uses the change feed to invoke an Azure Function automatically whenever items are inserted or updated in a container.
2. Input Binding: Once an input binding is created, data is read from the bound Cosmos DB container when the function executes.
3. Output Binding: If a function is bound to an Azure Cosmos DB container using an output binding, data is written to the container when the function executes.
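The sketch below shows the Python body of a function invoked by the Cosmos DB trigger. It is a minimal example; the trigger binding itself (monitored container, leases container, connection setting) is assumed to be declared separately in the function's function.json, and the documents parameter name must match that binding.

import logging
import azure.functions as func

def main(documents: func.DocumentList) -> None:
    # The trigger passes in the batch of items that were inserted or updated.
    if documents:
        logging.info("Processing %d changed document(s)", len(documents))
        for doc in documents:
            logging.info("Changed document id: %s", doc["id"])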
15. How is data security implemented in ADLS Gen2?
Data security and privacy are very important and, in some parts of the world, also mandated by law. Data security is also one of the primary concerns that discourages organizations from moving data to the cloud. ADLS Gen2 has a robust, multi-layered security model.
Let's have a look at the six data security layers of ADLS Gen2:
1. Authentication: access is verified using Azure Active Directory, shared keys or shared access signatures (SAS).
2. Access control: fine-grained permissions via Azure role-based access control (RBAC) and POSIX-style access control lists (ACLs) on files and directories.
3. Network isolation: storage firewalls and virtual network rules restrict which networks can reach the account.
4. Data protection: data is encrypted in transit using TLS and at rest using Microsoft-managed or customer-managed keys.
5. Advanced threat protection: monitoring for unusual or potentially harmful attempts to access the account.
6. Auditing: all account management and data access activity is logged for review.
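As an illustration of the access-control layer, the sketch below applies a POSIX-style ACL to a directory. It is a minimal example, assuming the azure-storage-file-datalake Python package; the account URL, key, file system and directory names are placeholders.

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net", credential="<account-key>"
)
directory = service.get_file_system_client("datalake").get_directory_client("raw/sales")

# Owner gets full access, the owning group read/execute, everyone else no access.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
print(directory.get_access_control())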
With six layers of
data protection, ADLS Gen2 provides one of the best data security models in the
cloud storage market today.