The ins and outs of Azure Synapse Analytics

Published on
May 9, 2022
Pravin Ghavare
Azure Synapse DBA

Azure Synapse Analytics is an analytics service with virtually unlimited scale, built to support large analytics workloads.

Most of you work with a relational database and are planning to move, or have already moved, to Azure Synapse Analytics, so it is important to understand its architecture and the best practices around it.

Compared to an on-premises SQL Server, query execution in Synapse is very different. It is important to write your queries carefully when working with Azure Synapse Analytics, because a poorly written query can severely impact overall server performance.

Let’s understand the architecture first


Control Node: A query submitted from any application arrives at the control node first. The control node is the brain of the architecture: it decides how your query will execute in parallel across the available compute nodes and is responsible for generating the query plan. It also stores all the metadata of your database, for example tables, stored procedures, and views.

Compute Node: Compute nodes are the machines responsible for executing your queries in parallel. On an on-premises SQL Server, a single machine executes your queries; in Azure Synapse Analytics, up to 60 such machines can execute your queries in parallel. (The control node assigns query work to the compute nodes.)

Compute nodes are machines with their own CPU, memory, cache, SSD, tempdb, and so on, which execute your queries. The number of compute nodes assigned to you depends on your DWU (Data Warehouse Unit) setting, which ranges from 100 DWU to 30000 DWU. The more compute nodes you have, the more room you get for query execution.
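As a sketch, the service level (and therefore the compute behind the pool) can be changed with T-SQL; the database name `MySynapseDW` here is a placeholder:

```sql
-- Run from the master database of the logical server.
-- 'DW1000c' is one of the Gen2 service objectives (DW100c .. DW30000c);
-- scaling changes the amount of compute backing the pool.
ALTER DATABASE MySynapseDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

-- Check the current service objective:
SELECT db.name, so.service_objective
FROM sys.databases AS db
JOIN sys.database_service_objectives AS so
  ON db.database_id = so.database_id;
```

Scaling briefly takes the pool through a transition, so it is normally done during a maintenance window.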

For more details on DWUs, refer to the Microsoft documentation on Data Warehouse Units.

Distribution: In simple terms, a distribution is the storage unit where your data resides; like SQL Server, each has a data file and a log file. When a user queries the system, the data is served from the cache on the control node if present; otherwise it is fetched from the distributions, which are the primary storage of the data. The distribution method determines how your data is spread across these storage units.

Azure Synapse Analytics always has a fixed 60 distributions, regardless of how many DWUs you have or what your service level is. The smaller queries into which the control node divides your query execute against these distributions. The fewer compute nodes you have, the more distributions are attached to each one (at the minimum service level, a single compute node owns all 60).
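Assuming access to the dedicated SQL pool's catalog views, this can be confirmed directly:

```sql
-- Each row is one distribution; a dedicated SQL pool always has 60.
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;

-- See how distributions map to compute nodes at the current service level.
SELECT pdw_node_id, COUNT(*) AS distributions_per_node
FROM sys.pdw_distributions
GROUP BY pdw_node_id;
```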

DMS: The Data Movement Service is responsible for moving data between nodes. Some queries need data from multiple distributions to produce the desired result, and DMS brings that data together. Excessive data movement is bad for query performance, so you must choose your distribution strategy carefully when working with data.
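As a hedged sketch, data movement for a recent query can be inspected through the request-step DMVs; the query ID `QID12345` below is a placeholder:

```sql
-- Find recent requests and their IDs.
SELECT TOP 10 request_id, status, command
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;

-- Steps of one request; ShuffleMoveOperation and BroadcastMoveOperation
-- steps indicate DMS moving data between distributions.
SELECT step_index, operation_type, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'
ORDER BY step_index;
```

Long-running shuffle or broadcast steps are usually the first place to look when a query is slower than expected.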

Distribution Types:

A distributed table appears as a single table, but its rows are stored across the 60 distributions, assigned using either a hash or a round-robin algorithm.

Hash-distributed tables:

A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables. To shard data into a hash-distributed table, a dedicated SQL pool uses a hash function to deterministically assign each row to one distribution. In the table definition, one of the columns is designated as the distribution column, and the hash function uses its values to assign each row to a distribution.

The following diagram illustrates how a full (non-distributed) table gets stored as a hash-distributed table.

There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that are frequently executed in the system.
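A minimal sketch of a hash-distributed table, assuming a hypothetical FactSales table with ProductKey as the distribution column:

```sql
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT NOT NULL,
    ProductKey  INT    NOT NULL,  -- distribution column: many distinct values, low skew
    SaleAmount  DECIMAL(18, 2),
    SaleDate    DATE
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- After loading, check for data skew across the 60 distributions:
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
```

A good distribution column has many distinct values, distributes rows evenly, and appears frequently in joins and GROUP BY clauses.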

Round-robin distributed tables:

A round-robin table is the simplest table to create and delivers fast performance when used as a staging table. A round-robin distributed table spreads data evenly across the distributions without any further optimization: a distribution is first chosen at random, and buffers of rows are then assigned to distributions sequentially. It is quick to load data into a round-robin table, but query performance is often better with hash-distributed tables, because joins on round-robin tables require data reshuffling, which takes additional time. Round-robin is also the default algorithm MicroStrategy uses for the temporary tables in its multi-pass SQL.
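A common pattern, sketched here with hypothetical table names, is to load into a round-robin staging table and then redistribute with CTAS:

```sql
-- Round-robin staging table: fastest to load, no distribution column needed.
CREATE TABLE stg.FactOrders
(
    OrderId    BIGINT,
    ProductKey INT,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP  -- heaps are typically fastest for staging loads
);

-- After loading, redistribute by hash for better join performance.
CREATE TABLE dbo.FactOrders
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM stg.FactOrders;
```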

Replicated tables:

A replicated table caches a full copy of the table on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables are best suited to small tables: extra storage is required and there is additional overhead when writing data, which makes this option impractical for large tables.
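A sketch of a replicated dimension table (the names are hypothetical):

```sql
-- Small lookup/dimension tables (commonly under ~2 GB compressed) are good
-- candidates for replication: each compute node keeps a full copy, so joins
-- against them need no data movement.
CREATE TABLE dbo.DimProduct
(
    ProductKey   INT NOT NULL,
    ProductName  NVARCHAR(100),
    Category     NVARCHAR(50)
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```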

The diagram below shows a replicated table that is cached on the first distribution on each compute node.

Below is a quick comparison between the distribution options:

| | Hash | Round Robin | Replicate |
| --- | --- | --- | --- |
| Placement | Distributed using a hashing algorithm | Distributed evenly but randomly | All data present on every node |
| Behaviour | Equal values hash to the same distribution | Doesn't require knowledge about data or queries | Simplifies many query plans and reduces data movement |
| Best for | Large fact tables | Large tables without a good hash column, or varied queries | Small lookup tables |

For a detailed explanation of how distributions work, please refer to the link below.


© 2022 Cloudaeon All Rights Reserved.