Nested Loop has very low estimates due to skewed data
On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.
If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.
But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.
Is SQL Server only using the Coarse Histogram Estimation if there is no where
clause?
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
Is there a way to improve the estimations on the nested loop?
Using following code you can reproduce the data:
create table MyTable
(
id int identity,
field varchar(50),
constraint pk_id primary key clustered (id)
)
go
create table SkewedTable
(
id int identity,
startdate datetime,
myTableId int,
remark varchar(50),
constraint pk_id primary key clustered (id)
)
set nocount on
insert into MyTable select top 1000 [name] from master..spt_values
go
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 1000
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go
CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
go
CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
go
--95=varchar in sys.messages
set nocount off
;with cte as
(
select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
union all
select * from cte
)
insert into skewedtable select top 40000 * from cte
option(maxrecursion 0)
go
update statistics mytable with fullscan
go
update statistics skewedtable with fullscan
go
sql-server query-performance sql-server-2016 cardinality-estimates
add a comment |
On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.
If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.
But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.
Is SQL Server only using the Coarse Histogram Estimation if there is no where
clause?
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
Is there a way to improve the estimations on the nested loop?
Using following code you can reproduce the data:
create table MyTable
(
id int identity,
field varchar(50),
constraint pk_id primary key clustered (id)
)
go
create table SkewedTable
(
id int identity,
startdate datetime,
myTableId int,
remark varchar(50),
constraint pk_id primary key clustered (id)
)
set nocount on
insert into MyTable select top 1000 [name] from master..spt_values
go
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 1000
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go
CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
go
CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
go
--95=varchar in sys.messages
set nocount off
;with cte as
(
select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
union all
select * from cte
)
insert into skewedtable select top 40000 * from cte
option(maxrecursion 0)
go
update statistics mytable with fullscan
go
update statistics skewedtable with fullscan
go
sql-server query-performance sql-server-2016 cardinality-estimates
add a comment |
On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.
If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.
But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.
Is SQL Server only using the Coarse Histogram Estimation if there is no where
clause?
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
Is there a way to improve the estimations on the nested loop?
Using following code you can reproduce the data:
create table MyTable
(
id int identity,
field varchar(50),
constraint pk_id primary key clustered (id)
)
go
create table SkewedTable
(
id int identity,
startdate datetime,
myTableId int,
remark varchar(50),
constraint pk_id primary key clustered (id)
)
set nocount on
insert into MyTable select top 1000 [name] from master..spt_values
go
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 1000
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go
CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
go
CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
go
--95=varchar in sys.messages
set nocount off
;with cte as
(
select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
union all
select * from cte
)
insert into skewedtable select top 40000 * from cte
option(maxrecursion 0)
go
update statistics mytable with fullscan
go
update statistics skewedtable with fullscan
go
sql-server query-performance sql-server-2016 cardinality-estimates
On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.
If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.
But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.
Is SQL Server only using the Coarse Histogram Estimation if there is no where
clause?
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
Is there a way to improve the estimations on the nested loop?
Using following code you can reproduce the data:
create table MyTable
(
id int identity,
field varchar(50),
constraint pk_id primary key clustered (id)
)
go
create table SkewedTable
(
id int identity,
startdate datetime,
myTableId int,
remark varchar(50),
constraint pk_id primary key clustered (id)
)
set nocount on
insert into MyTable select top 1000 [name] from master..spt_values
go
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 1000
insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go
CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
go
CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
go
--95=varchar in sys.messages
set nocount off
;with cte as
(
select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
union all
select * from cte
)
insert into skewedtable select top 40000 * from cte
option(maxrecursion 0)
go
update statistics mytable with fullscan
go
update statistics skewedtable with fullscan
go
sql-server query-performance sql-server-2016 cardinality-estimates
sql-server query-performance sql-server-2016 cardinality-estimates
edited 1 hour ago
Paul White♦
49.4k14260414
49.4k14260414
asked 2 hours ago
Frederik VanderhaegenFrederik Vanderhaegen
6861316
6861316
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
You should find the following filtered statistic helpful:
CREATE STATISTICS [stats id (field=varchar)]
ON dbo.MyTable (id)
WHERE field = 'varchar'
WITH FULLSCAN;
This gives the optimizer information about the distribution of id
values that match field = 'varchar'
, giving a much better selectivity estimate for the join:
The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).
This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).
If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.
add a comment |
Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT
operator
You could add an index or change the existing index:
CREATE INDEX IX_SkewedTable_MytableId_startdate
ON SkewedTable(myTableId,startdate)
INCLUDE(remark);
Insert the values in an intermediate temp table
CREATE TABLE #temp2(param int);
INSERT INTO #temp2(param)
SELECT t.id
FROM mytable t
WHERE t.field
= 'varchar';
Add an index on the temp table
CREATE INDEX IX_ID on #temp2(param);
And then use a CTE to remove the sort operator from the query plan
;WITH CTE AS
(
select TOP 100 PERCENT
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
Result:
Which removes the sort operator.
I am not certain that TOP 100 PERCENT
is reliable, in that case you could use:
DECLARE @counter INT;
select @counter=COUNT(*) FROM SkewedTable;
;WITH CTE AS
(
select TOP(@counter)
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227747%2fnested-loop-has-very-low-estimates-due-to-skewed-data%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
You should find the following filtered statistic helpful:
CREATE STATISTICS [stats id (field=varchar)]
ON dbo.MyTable (id)
WHERE field = 'varchar'
WITH FULLSCAN;
This gives the optimizer information about the distribution of id
values that match field = 'varchar'
, giving a much better selectivity estimate for the join:
The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).
This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).
If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.
add a comment |
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
You should find the following filtered statistic helpful:
CREATE STATISTICS [stats id (field=varchar)]
ON dbo.MyTable (id)
WHERE field = 'varchar'
WITH FULLSCAN;
This gives the optimizer information about the distribution of id
values that match field = 'varchar'
, giving a much better selectivity estimate for the join:
The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).
This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).
If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.
add a comment |
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
You should find the following filtered statistic helpful:
CREATE STATISTICS [stats id (field=varchar)]
ON dbo.MyTable (id)
WHERE field = 'varchar'
WITH FULLSCAN;
This gives the optimizer information about the distribution of id
values that match field = 'varchar'
, giving a much better selectivity estimate for the join:
The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).
This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).
If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.
Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.
You should find the following filtered statistic helpful:
CREATE STATISTICS [stats id (field=varchar)]
ON dbo.MyTable (id)
WHERE field = 'varchar'
WITH FULLSCAN;
This gives the optimizer information about the distribution of id
values that match field = 'varchar'
, giving a much better selectivity estimate for the join:
The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).
This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).
If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.
edited 1 hour ago
answered 1 hour ago
Paul White♦Paul White
49.4k14260414
49.4k14260414
add a comment |
add a comment |
Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT
operator
You could add an index or change the existing index:
CREATE INDEX IX_SkewedTable_MytableId_startdate
ON SkewedTable(myTableId,startdate)
INCLUDE(remark);
Insert the values in an intermediate temp table
CREATE TABLE #temp2(param int);
INSERT INTO #temp2(param)
SELECT t.id
FROM mytable t
WHERE t.field
= 'varchar';
Add an index on the temp table
CREATE INDEX IX_ID on #temp2(param);
And then use a CTE to remove the sort operator from the query plan
;WITH CTE AS
(
select TOP 100 PERCENT
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
Result:
Which removes the sort operator.
I am not certain that TOP 100 PERCENT
is reliable, in that case you could use:
DECLARE @counter INT;
select @counter=COUNT(*) FROM SkewedTable;
;WITH CTE AS
(
select TOP(@counter)
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
add a comment |
Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT
operator
You could add an index or change the existing index:
CREATE INDEX IX_SkewedTable_MytableId_startdate
ON SkewedTable(myTableId,startdate)
INCLUDE(remark);
Insert the values in an intermediate temp table
CREATE TABLE #temp2(param int);
INSERT INTO #temp2(param)
SELECT t.id
FROM mytable t
WHERE t.field
= 'varchar';
Add an index on the temp table
CREATE INDEX IX_ID on #temp2(param);
And then use a CTE to remove the sort operator from the query plan
;WITH CTE AS
(
select TOP 100 PERCENT
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
Result:
Which removes the sort operator.
I am not certain that TOP 100 PERCENT
is reliable, in that case you could use:
DECLARE @counter INT;
select @counter=COUNT(*) FROM SkewedTable;
;WITH CTE AS
(
select TOP(@counter)
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
add a comment |
Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT
operator
You could add an index or change the existing index:
CREATE INDEX IX_SkewedTable_MytableId_startdate
ON SkewedTable(myTableId,startdate)
INCLUDE(remark);
Insert the values in an intermediate temp table
CREATE TABLE #temp2(param int);
INSERT INTO #temp2(param)
SELECT t.id
FROM mytable t
WHERE t.field
= 'varchar';
Add an index on the temp table
CREATE INDEX IX_ID on #temp2(param);
And then use a CTE to remove the sort operator from the query plan
;WITH CTE AS
(
select TOP 100 PERCENT
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
Result:
Which removes the sort operator.
I am not certain that TOP 100 PERCENT
is reliable, in that case you could use:
DECLARE @counter INT;
select @counter=COUNT(*) FROM SkewedTable;
;WITH CTE AS
(
select TOP(@counter)
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT
operator
You could add an index or change the existing index:
CREATE INDEX IX_SkewedTable_MytableId_startdate
ON SkewedTable(myTableId,startdate)
INCLUDE(remark);
Insert the values in an intermediate temp table
CREATE TABLE #temp2(param int);
INSERT INTO #temp2(param)
SELECT t.id
FROM mytable t
WHERE t.field
= 'varchar';
Add an index on the temp table
CREATE INDEX IX_ID on #temp2(param);
And then use a CTE to remove the sort operator from the query plan
;WITH CTE AS
(
select TOP 100 PERCENT
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
Result:
Which removes the sort operator.
I am not certain that TOP 100 PERCENT
is reliable, in that case you could use:
DECLARE @counter INT;
select @counter=COUNT(*) FROM SkewedTable;
;WITH CTE AS
(
select TOP(@counter)
s.myTableId,s.id,s.remark from
SkewedTable s
order by startdate
)
SELECT s.id , s.remark
from CTE s
INNER JOIN #temp2
on s.myTableId = #temp2.param
OPTION(RECOMPILE);
edited 12 mins ago
answered 38 mins ago
Randi VertongenRandi Vertongen
1,866316
1,866316
add a comment |
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227747%2fnested-loop-has-very-low-estimates-due-to-skewed-data%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown