Cloud Parallel Databases
Jul 30, 2024, by Nicholas Galemmo in Architecture

What is a Parallel Database?

Cloud environments consist of numerous independent processors that perform tasks simultaneously, but not truly in parallel. Parallelism requires systematic coordination between processes to complete a specific task. In databases, this means solving a single query in a parallel manner. Since performing JOINs is essential in any relational query, the challenge becomes: how do you design a database to perform a parallel join?

In the simplest terms, you divide each table into portions (aka data slices) so that a portion of one table joins with a single portion of another table. Each process accesses its portion of each table without interfering with any other process acting on different portions of the same tables. Early versions of such systems were built from separate, independent computers connected by a high-speed network. The main system, the Controller, received and compiled SQL commands, then sent tasks to each of the parallel processors to complete. No resources are shared between the parallel processors; all interaction is with the Controller. This is referred to as a Shared-Nothing Architecture and is implemented as an MPP database.

The MPP Database

A Massively Parallel Processing (MPP) database is a hardware architecture made up of multiple independent systems (CPU, memory, and disk), each often called a data slice, managed by a controller system. The controller handles all interaction with the outside world and orchestrates the execution of the data slices.

Figure 1 below shows a typical arrangement of an MPP with 5 data slices, numbered 0 through 4. The controller will receive a query, compile it, then send directions to each of the data slices to resolve the query. Each slice handles its part of the query without communicating with other data slices. The controller assembles the results and sends them to the user.

Performance is achieved using a technique called “co-location”: distributing a table across the data slices based on its join key. When a row is inserted into such a table, the key column value is hashed (probably using a CRC) and then divided by the number of data slices; the remainder from the division selects the data slice. The objective is to distribute data evenly across the data slices while matching the distribution of an often-used JOIN target.
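As a rough illustration only (the engine’s real distribution hash is internal and vendor-specific), Redshift happens to expose an FNV_HASH() function that can mimic the idea against a hypothetical sales_fact table on a 5-slice system:

    -- Hash the key, divide by the slice count (5), and let the remainder
    -- pick the slice; slices are numbered 0 through 4.
    SELECT customer_key,
           MOD(ABS(FNV_HASH(customer_key)), 5) AS data_slice
    FROM   sales_fact
    LIMIT  10;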

How Co-Location Works

When you define the table, you choose the column(s) the database uses to decide which data slice a row is placed in. When you INSERT a new row, the system applies a deterministic algorithm to those columns to pick the data slice that receives the row. This becomes the permanent physical location of the row; it can only be accessed by the processor that owns that data slice.

For the data distribution to be beneficial, you use column(s) that are used to JOIN with another table. In the case of a Star Schema, the common practice is to distribute a Fact table on the foreign key of its most populous Dimension, and Dimensions on their primary key. Some MPP systems offered a DISTRIBUTE ALL option that placed a full copy of the table in each data slice; this would only be used for small tables.

If you distribute two tables on the same key (co-locate them) and JOIN them on that key, the system can direct all the data slices to perform the join with the data they already hold. If you join two tables that are not distributed in the same manner, the Controller will initiate a broadcast of the smaller table, causing each data slice to send the piece of that table it has to all the other data slices. Excessive broadcasts will cripple a poorly designed parallel schema.
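As a sketch with hypothetical table and column names, the following join can run fully in parallel when both tables are distributed on customer_key:

    -- Co-located: each data slice joins its own slice of both tables.
    SELECT c.customer_name, s.sale_amount
    FROM   sales_fact   s
    JOIN   customer_dim c ON c.customer_key = s.customer_key;
    -- If the tables were not distributed on the join column, the smaller
    -- table would be broadcast to every data slice instead.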

SPARC Parallel Databases

Because SPARC is a Symmetrical Multiprocessor (SMP) system in which data access is shared, none of the commercially available databases are true MPP databases. Instead, the MPP architecture is imitated by implementing a distributed table as a collection of pieces. There is still a controller process, running in its own thread, that coordinates the data slice executions.

A data slice is defined by specifying one or more columns as the distribution key. As in a true MPP, the key is hashed and then divided by the number of slices, and the remainder chooses the slice. When you join two tables on matching distribution keys, the system fires off multiple processes to do the joins in parallel, one per slice. If the distribution arrangements do not match, it still executes parallel processes, but one of the tables must be joined in its entirety in each process. This is undesirable, as performance is hampered, but it is not as bad as on a physical MPP, in which an entire table is broadcast across all data slices. Because SPARC is a shared environment, each data slice can access the pieces it needs directly.

How Distribution Works

As an example, let’s say you have a Sales fact table and two dimensions: a customer dimension and a date dimension. The date dimension is small, so it is not distributed, while both customer and sales facts are distributed on CUSTOMER_KEY. It is a small system, so there are 5 data slices. Each table slice is a separate data object.

In the figure, grey represents the date dimension, light orange the customer dimension, and blue the fact table. Each distributed slice is paired and joined independently, and the same date table is used in each process.

Because the Date dimension is not distributed, it is used in all five processes. For the other two tables, the fact-to-dimension join only uses the matching slices, allowing each JOIN to work with roughly 20% of the data.
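A query against this hypothetical schema might look like the sketch below; the fact-to-customer join runs slice by slice, while every process reads the small, non-distributed date dimension in full:

    -- customer_dim and sales_fact are co-located on customer_key (5 slices);
    -- date_dim is not distributed, so all five processes use it in full.
    SELECT d.calendar_month,
           c.customer_region,
           SUM(s.sale_amount) AS monthly_sales
    FROM   sales_fact   s
    JOIN   customer_dim c ON c.customer_key = s.customer_key
    JOIN   date_dim     d ON d.date_key     = s.date_key
    GROUP  BY d.calendar_month, c.customer_region;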

Using Star Schema

A parallel join only works between two tables distributed on the same key. Since most queries involve more than two tables, how is this approach better? If you are implementing a normalized (3NF) data structure, it isn’t. The only things that can be effectively clustered are subordinate table pairs, such as an order header table and an order line table. Performance of additional joins in the query can be dismal. MPP database systems that supported a normalized model strategy also supported a vast array of indexing structures in an attempt to achieve fast query times. Such systems required regular maintenance and tinkering to address user query performance issues.

A Star Schema, on the other hand, performs extremely well with this architecture due to the nature of the table structure:

  • Surrogate keys ensure all joins must be equijoins.
  • Most Dimension tables are relatively small.
  • Fact tables are significantly larger than Dimension tables.

Next, it is important to understand the optimizations the query compiler performs when given a query against a Star Schema. The collection of equijoins radiating from a central table tips off the optimizer. It determines whether one of the joins to a Dimension table has the same distribution as the Fact table, then projects each of the other Dimension tables into cache. It then performs the final join in parallel, with a single pass through the Fact table picking up the other Dimension attributes from cache. Under SPARC, this memory cache is shared, so it is a bit more efficient than the old MPPs.

Another advantage of this approach is that you don’t need indexes. Most of the JOINs are performed in memory against cached data, and the one distributed join touches only a slice of the Fact table in each process. Queries against properly clustered tables will always be optimal unless the Star has more than one very large dimension. In that situation you need to make your best guess on which Dimension to distribute the Facts. In extreme cases, you could maintain multiple copies of the Fact table with different distributions, but I have never seen a case that required such a solution.

Commercial Implementations

The number of data slices for all distributed tables is fixed, based on your configuration. This determines the level of parallelization attainable in your database. It is common for this to differ across environments, with development usually the smallest. QA and production environments are usually kept the same so that performance issues can be discovered and resolved before promoting a release to production. When moving tables between environments, you should rebuild them using CTAS (CREATE TABLE AS SELECT) so the rows are distributed in the same manner as the other tables in that environment. Distribution mismatches will result in serial execution of a query.
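As a sketch in Redshift-style syntax (table names are hypothetical), a CTAS rebuild that re-applies the distribution key in the target environment might look like:

    -- Rebuild so the rows are re-hashed across the target environment's
    -- data slices using the intended distribution key.
    CREATE TABLE sales_fact
    DISTSTYLE KEY
    DISTKEY (customer_key)
    AS
    SELECT * FROM staging.sales_fact_import;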

In all cases, the system determines if it can perform a parallel join prior to executing it. It does so based on the following criteria:

  • There is an equijoin between two tables (i.e., keya = keyb)
  • All distribution columns are used in the JOIN
  • The tables are distributed on the same columns

Since you must use all distribution columns between the two tables in the JOIN, common practice is to declare only one such column. In a star schema, large Dimension tables are distributed on their surrogate primary key, small Dimensions (roughly 20,000 rows or fewer) are not distributed, and Fact tables are distributed on the foreign key of their most populous Dimension.

Redshift

Redshift syntax uses the term DISTKEY, for ‘distribution key’. It allows you to choose one column in the table to serve that purpose.
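For example, using a hypothetical fact table, the DDL might look like this (DISTSTYLE KEY tells Redshift to hash-distribute rows on the declared DISTKEY column):

    CREATE TABLE sales_fact (
        sale_id      BIGINT,
        date_key     INTEGER,
        customer_key INTEGER,
        sale_amount  DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_key);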

Azure Synapse

Synapse uses DISTRIBUTION = HASH(column names…) syntax. It also supports partitioning.
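A comparable sketch for a Synapse dedicated SQL pool, again with hypothetical names (the clustered columnstore index is simply the default storage choice here):

    CREATE TABLE dbo.sales_fact
    (
        sale_id      BIGINT,
        date_key     INT,
        customer_key INT,
        sale_amount  DECIMAL(12,2)
    )
    WITH
    (
        DISTRIBUTION = HASH (customer_key),
        CLUSTERED COLUMNSTORE INDEX
        -- A PARTITION (...) clause can also be added here for partitioning.
    );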

Other Options

Besides distributing on a column value, you may also specify a ROUND-ROBIN or ALL distribution. ROUND-ROBIN means the data is distributed evenly across all slices. As no distribution column is specified, the system cannot use a parallel join, as any row could be in any slice. This is normally used for tables that are not JOINed, such as staging or interim tables. The ALL option, on an MPP, meant that a full copy of the table exists in each data slice. In SPARC, this is not necessary as such tables are visible to all processes.
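In Synapse syntax (Redshift’s rough equivalents are DISTSTYLE EVEN and DISTSTYLE ALL), a round-robin staging table might look like this hypothetical sketch:

    -- Rows are spread evenly across slices, but no parallel join is possible
    -- because any given key value could be in any slice.
    CREATE TABLE dbo.stage_sales
    (
        sale_id     BIGINT,
        raw_payload VARCHAR(4000)
    )
    WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );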

Indexing

If you are implementing a Star Schema, I do not recommend indexing of any sort. Indexes introduce significant overhead when loading and querying data, and they require regular maintenance to sustain performance. A query will only run faster using an index than a full table scan if the index isolates less than about 5% of the rows in the table.

If performance is a problem, you are usually better off increasing the number of data slices rather than implementing indexes. However, this involves reconfiguring the database. Most vendors will provide assistance in doing this.

Your Vendor

Vendors offer these databases with a range of service options. Foremost is the size of the configuration. In addition, physical attributes such as solid-state disk, dedicated cores, storage proximity, and other features may be available.

It is normal that the development environment is small. However, I would recommend identical configurations for QA and Production. This provides a means to identify performance issues before release.

Effect of Data Skew

Data skew refers to the difference between the number of rows in a particular data slice versus the average number of rows per data slice.

Skew hurts parallel queries because the overall completion time is determined by the biggest (slowest) slice; every parallel process must complete before the query can finish. Ideally, skew should be kept within +/-5% of the average. A skew of 10% or more is a worst-case scenario and should be addressed.
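Measuring skew is vendor-specific. In Redshift, for example, the SVV_TABLE_INFO system view reports a skew_rows ratio (rows in the largest slice divided by rows in the smallest), which can be checked with a query like:

    -- skew_rows close to 1.0 means an even spread; large values indicate skew.
    SELECT "table", tbl_rows, skew_rows
    FROM   svv_table_info
    ORDER  BY skew_rows DESC;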

This is a significant problem in some industries. In Retail, over 40% of transactions are anonymous. These cash transactions are a disaster for tables distributed on Customer. This can be resolved by creating multiple, randomly chosen, anonymous customer identifiers, such as “CUST001” through “CUST999”, to spread the data evenly among the slices. The chosen range should be 3 to 5 times the number of slices, and the number of synthetic values should be an exact multiple of the number of slices; anything else will itself cause a skewed distribution.
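A minimal sketch of the idea, in Redshift-flavored SQL against a hypothetical staging table, using 25 synthetic IDs for the 5-slice example above (an exact multiple of the slice count):

    -- Replace NULL (anonymous cash) customers with one of 25 synthetic IDs,
    -- CUST001 .. CUST025, so they hash across all slices instead of one.
    UPDATE staging_sales
    SET    customer_id = 'CUST' ||
           LPAD(CAST(CAST(FLOOR(RANDOM() * 25) + 1 AS INT) AS VARCHAR), 3, '0')
    WHERE  customer_id IS NULL;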

Effect of Column-Oriented Structures

A number of vendors tout their use of column-oriented tables, and this is often misconstrued as the reason for a parallel database’s performance. That is a misconception.

Tables stored using column-oriented structures provide a number of advantages to queries:

  • Only the data for the columns the query projects is read.
  • Populations can be isolated quickly based on attribute filters.

In such tables, when you read a “row,” the actual data values are retrieved through handles stored in the database. This redirection, along with data compression, adds processor effort to materializing the row. The advantage is that only the needed columns are materialized, which reduces the amount of work, including I/O, and improves caching; this can reduce query times by 20% or more.

As much as column-oriented structures can improve performance, it is Parallelism that provides orders of magnitude reduction in query times. Parallelism is always the primary avenue for optimization.

Other “Parallel” Databases

Other database products claim parallel capabilities. This is related to their scripting capabilities that allow you to coordinate and execute multiple queries at the same time. This is not the same thing as an MPP, which executes a single query in parallel.

This distinction is important as the parallel execution is inherent in the database itself. Any 3rd party query tool will execute fully parallel queries without adjustment or modification of the tool.

Design Tips

  • Don’t define distribution if you do not need to. Dimension tables under 20,000 rows are usually too small to distribute. Only distribute your largest dimensions. Non-distributed tables are not an issue in the cloud as the SMP nature of SPARC eliminates the need to broadcast the table.
  • Define a cluster using as few columns as possible; ideally, use only one column. If you use multiple columns, ALL of them MUST be used in the JOIN, because the remainder is calculated from a hash of all the columns. Without the full join condition, the system cannot pair slices and perform the query in parallel.
  • Never distribute a Fact table on more than one foreign key.
  • Choose a distribution key that spreads the data evenly among slices. As a rule, cluster size variance should be within +/-5% of the average size.
  • In a Star Schema, distribute dimensions on their primary key and fact tables on the dimension foreign key with the highest cardinality. This ensures colocation for the largest join and improves the chance for an even distribution.
