Collecting statistics on your data after loading it into SQL Data Warehouse is one of the most important things you can do to optimize your queries. The query optimizer is cost-based: it compares the cost of various query plans and then chooses the plan with the lowest cost, which in most cases is the plan that executes the fastest.
For example, if the optimizer estimates that the date you are filtering in your query will return one row, it can choose a different plan than if it estimates that the selected date will return 1 million rows.
The query optimizer creates statistics on individual columns in the query predicate or join condition to improve cardinality estimates for the query plan. Automatic creation of statistics is currently turned on by default. You can check whether your data warehouse has this configured by querying the is_auto_create_stats_on column of sys.databases.

Note: Statistics are not automatically created on temporary or external tables.

Statistics are created automatically and synchronously, so you may see slightly degraded query performance if your columns do not already have statistics created for them.
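The configuration check described above can be sketched as follows. The database name in the commented-out statement is a hypothetical placeholder, and the statement is included only as an illustration of how the option could be switched on if it were off.

```sql
-- List each database with its automatic-statistics-creation setting
-- (1 = on, 0 = off).
SELECT name,
       is_auto_create_stats_on
FROM sys.databases;

-- Assumption: MyDataWarehouse is a placeholder name, not one from this article.
-- ALTER DATABASE MyDataWarehouse SET AUTO_CREATE_STATISTICS ON;
```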
Creating statistics can take a few seconds on a single column depending on the size of the table. To avoid measuring performance degradation, especially in performance benchmarking, you should ensure stats have been created first by executing the benchmark workload before profiling the system.
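Before profiling, it can help to confirm which statistics already exist on the tables in the benchmark. The following is a minimal sketch, assuming a hypothetical table dbo.FactSales and a hypothetical statistics object named stats_order_date; neither name comes from this article.

```sql
-- List the statistics objects on one table and show whether each was
-- created automatically by the optimizer or explicitly by a user.
SELECT st.name AS stats_name,
       st.auto_created,
       st.user_created
FROM sys.stats AS st
WHERE st.object_id = OBJECT_ID('dbo.FactSales');  -- hypothetical table

-- Display the header, density vector, and histogram of one statistics object.
DBCC SHOW_STATISTICS (dbo.FactSales, stats_order_date);  -- hypothetical names
```

If the first query returns no rows for a column used in a filter or join, the first benchmark run is likely to include the cost of creating statistics for it.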
Note: The creation of stats will also be logged in sys. When automatic statistics are created, they will take the form:

To view statistics that have already been created, use DBCC SHOW_STATISTICS. The first argument is the name of the table that contains the statistics to display. This cannot be an external table. The second argument is the name of the target index, statistics, or column for which to display statistics information.

Updating statistics

One best practice is to update statistics on date columns each day as new dates are added.
Each time new rows are loaded into the data warehouse, new load dates or transaction dates are added. These change the data distribution and make the statistics out of date.
By contrast, statistics on a country column in a customer table may rarely need to be updated. Assuming the distribution is constant between customers, adding new rows to the table isn't going to change the data distribution. However, if your data warehouse only contains one country and you bring in data from a new country, resulting in data from multiple countries being stored, then you need to update statistics on the country column.

The following are recommendations for updating statistics:

Frequency of stats updates: Conservative: daily. After loading or transforming your data.
Sampling: With less than 1 billion rows, use default sampling (20 percent). With more than 1 billion rows, a 2-percent sample is good.

One of the first questions to ask when you're troubleshooting a query is, "Are the statistics up to date?"
An up-to-date statistics object might be old if there's been no material change to the underlying data. When the number of rows has changed substantially, or there is a material change in the distribution of values for a column, then it's time to update statistics.
Because there is no dynamic management view to determine if data within the table has changed since the last time statistics were updated, knowing the age of your statistics can provide you with part of the picture. You can determine the last time your statistics were updated on each table by querying the statistics catalog views together with the STATS_DATE function.

Note: Remember that if there is a material change in the distribution of values for a column, you should update statistics regardless of the last time they were updated.

Not every column needs frequent updates. Statistics on a gender column in a customer table might never need to be updated. However, if your data warehouse contains only one gender and a new requirement results in multiple genders, then you need to update statistics on the gender column. For more information, see general guidance for Statistics.

Implementing statistics management

It is often a good idea to extend your data-loading process to ensure that statistics are updated at the end of the load. The data load is when tables most often change their size or their distribution of values, so this is a logical place to implement some management processes.
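A query along the following lines reports when each statistics object was last updated. It is a sketch built from the standard catalog views and the STATS_DATE function; the column aliases are illustrative choices.

```sql
-- One row per (statistics object, column), with the date the statistics
-- object was last created or updated.
SELECT sm.name AS schema_name,
       tb.name AS table_name,
       co.name AS stats_column_name,
       st.name AS stats_name,
       STATS_DATE(st.object_id, st.stats_id) AS stats_last_updated_date
FROM sys.stats AS st
JOIN sys.stats_columns AS sc
    ON  st.stats_id  = sc.stats_id
    AND st.object_id = sc.object_id
JOIN sys.columns AS co
    ON  sc.column_id = co.column_id
    AND sc.object_id = co.object_id
JOIN sys.tables AS tb
    ON  co.object_id = tb.object_id
JOIN sys.schemas AS sm
    ON  tb.schema_id = sm.schema_id
ORDER BY stats_last_updated_date;  -- oldest statistics first
```

The date alone does not say whether the data has changed, so treat the result as a starting point for deciding what to update.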
The following guiding principles are provided for updating your statistics during the load process:

Ensure that each loaded table has at least one statistics object updated. This updates the table size (row count and page count) information as part of the statistics update.
Focus on columns participating in the load's changed data, and consider updating "ascending key" columns such as transaction dates more frequently, because newly loaded values will not be included in the statistics histogram until it is updated.
Consider updating static distribution columns less frequently.
Remember, each statistic object is updated in sequence.

For more information, see Cardinality Estimation.

Create statistics

These examples show how to use various options for creating statistics. The options that you use for each column depend on the characteristics of your data and how the column will be used in queries.

Create single-column statistics with default options

To create statistics on a column, simply provide a name for the statistics object and the name of the column.
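A minimal sketch of the default form is shown here. The table dbo.table1 and the statistics name stats_col1 are hypothetical; col1 is the column name used later in this article.

```sql
-- Create a single-column statistics object with the default options.
-- dbo.table1 and stats_col1 are placeholder names.
CREATE STATISTICS stats_col1 ON dbo.table1 (col1);
```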
This syntax uses all of the default options. However, you can adjust the sampling rate: to sample the full table, add WITH FULLSCAN, and to use a specific sample size, add WITH SAMPLE followed by a percentage.

You can also create statistics on only a portion of the rows in a table by adding a WHERE clause to the statement. This is called a filtered statistic. For example, you can use filtered statistics when you plan to query a specific partition of a large partitioned table. By creating statistics on only the partition values, the accuracy of the statistics will improve, and therefore improve query performance. A filtered statistic is typically created on a range of values.

The values can easily be defined to match the range of values in a partition. For the query optimizer to consider the filtered statistics, the query's WHERE clause needs to specify col1 values within the filtered range.

Create single-column statistics with all the options

You can also combine the options together, for example by creating a filtered statistics object with a custom sample size.

Create multi-column statistics

To create a multi-column statistics object, use the same syntax as for single-column statistics, but specify more columns.
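The sampling and filtering options can be sketched as follows. The table name, the statistics names, the 50 percent sample rate, and the range boundaries 1000 and 2000 are all illustrative assumptions, not values from this article.

```sql
-- Examine every row instead of sampling.
CREATE STATISTICS stats_col1_full ON dbo.table1 (col1) WITH FULLSCAN;

-- Use a custom sample size (50 percent is an arbitrary illustration).
CREATE STATISTICS stats_col1_sample ON dbo.table1 (col1) WITH SAMPLE 50 PERCENT;

-- Filtered statistic: cover only rows whose col1 value falls in a range.
-- The boundaries are placeholders; choose them to match a partition.
CREATE STATISTICS stats_col1_filtered ON dbo.table1 (col1)
    WHERE col1 > 1000 AND col1 < 2000;
```

Each statement creates a separate statistics object, so in practice you would usually pick one of these forms per column rather than all three.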
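Combining a filter with a custom sample size, and extending the column list, might look like the sketch below. As before, dbo.table1, col2, the statistics names, the range boundaries, and the sample rate are hypothetical.

```sql
-- Filtered single-column statistics with a custom sample size.
CREATE STATISTICS stats_col1_combined ON dbo.table1 (col1)
    WHERE col1 > 1000 AND col1 < 2000
    WITH SAMPLE 50 PERCENT;

-- Multi-column statistics: same syntax with more columns listed.
-- col2 is an assumed second column.
CREATE STATISTICS stats_col1_col2 ON dbo.table1 (col1, col2);
```

Column order matters in the multi-column form, because only the first column listed gets a histogram.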
Note: The histogram, which is used to estimate the number of rows in the query result, is only available for the first column listed in the statistics object definition.

Statistics can also be created in bulk. The stored procedure discussed here creates a single-column statistics object on every column of the database that doesn't already have statistics.

The example is intended to help you get started with your database design; feel free to adapt it to your needs. Its statistics-type parameter has a valid range of 1 (default), 2 (fullscan), or 3 (sample). This procedure uses a 20 percent sample rate.
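As a much smaller sketch of the same idea, and not the stored procedure referred to above, the following query only generates the CREATE STATISTICS statements for columns that are not yet part of any statistics object. The stat_ naming pattern and the optional 20 percent sample clause are assumptions made for illustration; the generated statements would still need to be reviewed and executed separately.

```sql
-- Generate one CREATE STATISTICS statement per column that has no statistics.
-- External tables are excluded because statistics here target regular tables.
SELECT 'CREATE STATISTICS '
       + QUOTENAME('stat_' + t.name + '_' + c.name)   -- assumed naming pattern
       + ' ON ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
       + ' (' + QUOTENAME(c.name) + ')'
       + ' WITH SAMPLE 20 PERCENT;'                   -- drop this part for defaults
       AS create_stat_ddl
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id
JOIN sys.columns AS c
    ON t.object_id = c.object_id
LEFT JOIN sys.stats_columns AS sc
    ON  sc.object_id = c.object_id
    AND sc.column_id = c.column_id
LEFT JOIN sys.external_tables AS e
    ON e.object_id = t.object_id
WHERE sc.object_id IS NULL   -- column is not in any statistics object
  AND e.object_id IS NULL;   -- table is not an external table
```

Generating the statements first, rather than executing them in a loop, keeps the sketch short and lets you drop columns that are never used in filters or joins.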
Update statistics

To update statistics, you can:

Update one statistics object. Specify the name of the statistics object you want to update.
Update all statistics objects on a table. Specify the name of the table instead of one specific statistics object.

Update one specific statistics object

To update a specific statistics object, run UPDATE STATISTICS with the table name followed by the name of the statistics object in parentheses. This requires some thought to choose the best statistics objects to update.
Update all statistics on a table

A simple method for updating all the statistics objects on a table is to run UPDATE STATISTICS with only the table name. Just remember that it updates all statistics on the table, and therefore might perform more work than is necessary. If performance is not an issue, this is the easiest and most complete way to guarantee that statistics are up to date.

Note: When updating all statistics on a table, SQL Data Warehouse does a scan to sample the table for each statistics object. If the table is large and has many columns and many statistics, it might be more efficient to update individual statistics based on need.
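The two update forms discussed in this section can be sketched side by side. The table dbo.table1 and the statistics object stats_col1 are hypothetical names.

```sql
-- Update one specific statistics object on a table.
UPDATE STATISTICS dbo.table1 (stats_col1);

-- Update every statistics object on the table (simplest, but may do
-- more work than necessary on large tables with many statistics).
UPDATE STATISTICS dbo.table1;
```

For a nightly load, the first form applied to date columns and the second applied only to small tables is one plausible split, though the right balance depends on table size and load pattern.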
For the full syntax, see Update Statistics.

Statistics metadata

There are several system views and functions that you can use to find information about statistics. For example, you can see if a statistics object might be out of date by using the STATS_DATE function to see when statistics were last created or updated.

Catalog views for statistics

These system views provide information about statistics: