Jirka's Public Notepad

Data Engineering | Python | SQL Server | Teradata

June 21, 2020 By Jiří Hubáček

EightKB Virtual Conference – Demystifying Statistics in SQL Server

I first heard about the EightKB conference on Brent Ozar‘s webcast. I checked the schedule and thought: OMG, this is precisely the kind of nerdy event I’ve been looking for! I love to see what’s under the hood, how SQL Server works internally, what affects what… In my opinion, knowing the internals supports any further learning, as all the pieces of information fit together. Additionally, it makes it easy to spot those SQL Server myths circulating the internet. 😀

The schedule was jam-packed with amazing speakers and topics.

#noteForMyself

As part of this blog post series, I wanted to extract interesting, little-known or otherwise useful information or scripts from each speaker’s talk. By writing the blog post, I had to revisit the topic and run the demo code myself. Lastly, when I write something on my blog, I know it’s there and I can always get back to it :D. For the full experience, I’d recommend watching the conference recording and visiting its GitHub page.

A huge thank you to all the speakers and organisers for the fantastic nerdy content they put together.

Demystifying Statistics in SQL Server

Erin Stellato took the second slot of the event with her session on statistics in SQL Server. She aligned everyone’s knowledge of why statistics are important and how they impact the optimizer. Throughout her demo, she hacked the statistics there and back, and talked about when they get invalidated, what the thresholds are, and how and when to optimize statistics.

Her talk starts at 1:54:08, the scripts are available on the event’s GitHub, and the demos use an extended WideWorldImporters database throughout. The extension script can be found on the “SQL Server 2019 Revealed” book’s GitHub.

Erin had the following query in the bonus section for listing all statistics on a database object. I find it particularly handy for getting a general overview, so I’m opening this blog post with it :).

SELECT
	[ss].[stats_id] AS [StatisticsId],
	[sch].[name] + '.' + [so].[name] AS [TableName] ,
	[ss].[name] AS [Statistic] ,
	STUFF(( SELECT  ', ' + [c].[name]
        FROM [sys].[stats_columns] [sc]
		JOIN [sys].[columns] [c] 
			ON [c].[column_id] = [sc].[column_id]
            AND [c].[object_id] = [sc].[OBJECT_ID]
        WHERE [sc].[object_id] = [ss].[object_id]
			AND [sc].[stats_id] = [ss].[stats_id]
        ORDER BY [sc].[stats_column_id]
        FOR XML PATH('')), 1, 2, '') AS [ColumnsInStatistic] ,
	[ss].[auto_created] AS [WasAutoCreated] ,
	[ss].[user_created] AS [WasUserCreated] ,
	[ss].[has_filter] AS [IsFiltered] ,
	[ss].[filter_definition] AS [FilterDefinition] ,
	[ss].[is_temporary] AS [IsTemporary] ,
	[sp].[last_updated] AS [StatsLastUpdated] ,
	[sp].[rows] AS [RowsInTable] ,
	[sp].[rows_sampled] AS [RowsSampled] ,
	[sp].[unfiltered_rows] AS [UnfilteredRows] ,
	[sp].[modification_counter] AS [RowModifications] ,
	[sp].[steps] AS [HistogramSteps],
	[ss].[no_recompute]
FROM [sys].[stats] [ss]
JOIN [sys].[objects] [so] 
	ON [ss].[object_id] = [so].[object_id]
JOIN [sys].[schemas] [sch] 
	ON [so].[schema_id] = [sch].[schema_id]
LEFT OUTER JOIN [sys].[indexes] AS [si] 
	ON [so].[object_id] = [si].[object_id]
	AND [ss].[name] = [si].[name]
OUTER APPLY [sys].[dm_db_stats_properties]([so].[object_id], [ss].[stats_id]) sp
WHERE [so].[object_id] = OBJECT_ID(N'Sales.Orders')
ORDER BY [ss].[stats_id];

I’ve added stats_id to the query, as this id is used when querying the dm_db_stats_properties and dm_db_stats_histogram DMVs.
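As a small side note (my own sketch, not part of Erin’s scripts), the stats_id can be fed into sys.dm_db_stats_histogram to look at the histogram steps of a single statistic – the stats_id of 1 below is just an assumption (typically the clustered primary key’s statistic):

SELECT
	[hist].[step_number],
	[hist].[range_high_key],
	[hist].[range_rows],
	[hist].[equal_rows],
	[hist].[average_range_rows]
FROM [sys].[dm_db_stats_histogram](OBJECT_ID(N'Sales.Orders'), 1) AS [hist] -- stats_id = 1 assumed
ORDER BY [hist].[step_number];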

Another way of viewing index statistics is through the DBCC SHOW_STATISTICS command. It returns statistics information such as the number of rows sampled, the number of histogram steps, density, and the histogram itself.

DBCC SHOW_STATISTICS ('Sales.Orders', FK_Sales_Orders_CustomerID);

Hacking the partition’s approximate row and page counts to affect the optimiser’s plan-building behaviour:

UPDATE STATISTICS [Sales].[Orders] PK_Sales_Orders
WITH ROWCOUNT = 1000000, PAGECOUNT = 9383;

Observing the result:

SELECT * 
FROM sys.dm_db_partition_stats
WHERE OBJECT_ID = OBJECT_ID('Sales.Orders');
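
To undo the hack afterwards (not part of the demo, just one way I’d do it), DBCC UPDATEUSAGE with the COUNT_ROWS option recalculates the page and row counts from the actual data:

-- Recalculate page and row counts from the actual data (assumes the WideWorldImporters database)
DBCC UPDATEUSAGE ('WideWorldImporters', 'Sales.Orders') WITH COUNT_ROWS;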

Turning memory grant feedback off to prevent SQL Server from adjusting queries’ memory grants – a feature introduced in SQL Server 2019 with database compatibility level 150.

ALTER DATABASE SCOPED CONFIGURATION
SET ROW_MODE_MEMORY_GRANT_FEEDBACK = OFF;

When a query executes, it asks for a memory grant to perform sorting and hashing operations. If statistics are not up to date, the optimiser may ask for too much or too little memory for the query execution.

A low-ball grant means SQL Server has to spill the data that doesn’t fit into memory to disk. An excessive grant might prevent a query from executing until the required memory becomes available – the RESOURCE_SEMAPHORE wait type occurs.

Memory grant feedback lets SQL Server observe and learn from past query executions and “heal” itself from former sub-optimal executions. The cases where this can become a bit tricky to debug I’m leaving aside for now. 🙂
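
To watch this in action, a rough sketch of mine (not from the talk’s scripts): while a workload is running, sys.dm_exec_query_memory_grants shows how far apart the requested, granted and actually used memory are – large gaps hint at estimate problems:

SELECT
	[mg].[session_id],
	[mg].[requested_memory_kb],
	[mg].[granted_memory_kb],
	[mg].[used_memory_kb],
	[mg].[max_used_memory_kb],
	[mg].[ideal_memory_kb],
	[mg].[wait_time_ms] -- long waits go hand in hand with RESOURCE_SEMAPHORE
FROM [sys].[dm_exec_query_memory_grants] AS [mg]
ORDER BY [mg].[requested_memory_kb] DESC;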

Auto-update Statistics option

There’s “generally” no reason to turn auto-update statistics off. An automatic statistics update occurs when a threshold of changed rows is reached. The threshold value is determined by the SQL Server version (a quick comparison of the two formulas is sketched below the list):

  • Old threshold: 20% + 500 rows changed
  • New threshold (130+ compatibility mode, TF 2371 down to 2008 R2 SP1): SQRT(1000*Number of rows)
From Erin Stellato’s slides
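
Out of curiosity, here’s my own sketch (not from the slides) comparing the current modification counters against both formulas using sys.dm_db_stats_properties:

SELECT
	[ss].[name] AS [Statistic],
	[sp].[rows] AS [RowsInTable],
	[sp].[modification_counter] AS [RowModifications],
	500 + 0.20 * [sp].[rows] AS [OldThreshold],	-- 20% + 500 rows changed
	SQRT(1000.0 * [sp].[rows]) AS [NewThreshold]	-- SQRT(1000 * number of rows)
FROM [sys].[stats] AS [ss]
CROSS APPLY [sys].[dm_db_stats_properties]([ss].[object_id], [ss].[stats_id]) AS [sp]
WHERE [ss].[object_id] = OBJECT_ID(N'Sales.Orders');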

Plan Validity

When statistics are updated, plans may or may not get invalidated. The behaviour depends on the SQL Server version:

  • SQL Server 2012+: when statistics are updated and NO data has changed, plans will NOT be invalidated
  • Before SQL Server 2012: if you updated stats with AUTO_UPDATE_STATS enabled, plans were invalidated; if you updated stats with AUTO_UPDATE_STATS disabled, plans were not invalidated

When debugging query plans, it might be useful to evict all plans related to a table from the cache. This is achieved with the sp_recompile procedure:

EXEC sp_recompile 'schema.table';
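
A rough way to check the effect (my assumption, not part of the demo) is to look for cached plans referencing the table before and after running sp_recompile:

SELECT
	[cp].[usecounts],
	[cp].[objtype],
	[st].[text]
FROM [sys].[dm_exec_cached_plans] AS [cp]
CROSS APPLY [sys].[dm_exec_sql_text]([cp].[plan_handle]) AS [st]
WHERE [st].[text] LIKE N'%Sales.Orders%'
	AND [st].[text] NOT LIKE N'%dm_exec_cached_plans%'; -- ignore this query itself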

Estimates are not always correct when:

  • Statistics haven’t been updated recently
  • Uneven distribution of data
  • Sample rate is too low
  • Ascending keys
  • Table variables
  • Local variables/modifying in-flight
  • Linked servers
  • Partitioned tables
  • In-memory OLTP


Filed Under: SQL Server Tagged With: eightkb, internals, SQL Server, statistics

