Nested Loop has very low estimates due to skewed data












3















On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.



If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.

But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.

Is SQL Server only using the Coarse Histogram Estimation if there is no where clause?



Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.



Is there a way to improve the estimations on the nested loop?



Using following code you can reproduce the data:



create table MyTable
(
id int identity,
field varchar(50),
constraint pk_id primary key clustered (id)
)
go

create table SkewedTable
(
id int identity,
startdate datetime,
myTableId int,
remark varchar(50),
constraint pk_id primary key clustered (id)
)

set nocount on

insert into MyTable select top 1000 [name] from master..spt_values
go

insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 1000

insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go

CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
go

CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
go

--95=varchar in sys.messages
set nocount off

;with cte as
(
select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
union all
select * from cte
)
insert into skewedtable select top 40000 * from cte
option(maxrecursion 0)
go

update statistics mytable with fullscan
go

update statistics skewedtable with fullscan
go









share|improve this question





























    3















    On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.



    If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.

    But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.

    Is SQL Server only using the Coarse Histogram Estimation if there is no where clause?



    Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.



    Is there a way to improve the estimations on the nested loop?



    Using following code you can reproduce the data:



    create table MyTable
    (
    id int identity,
    field varchar(50),
    constraint pk_id primary key clustered (id)
    )
    go

    create table SkewedTable
    (
    id int identity,
    startdate datetime,
    myTableId int,
    remark varchar(50),
    constraint pk_id primary key clustered (id)
    )

    set nocount on

    insert into MyTable select top 1000 [name] from master..spt_values
    go

    insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
    go 1000

    insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
    go

    CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
    go

    CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
    go

    --95=varchar in sys.messages
    set nocount off

    ;with cte as
    (
    select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
    union all
    select * from cte
    )
    insert into skewedtable select top 40000 * from cte
    option(maxrecursion 0)
    go

    update statistics mytable with fullscan
    go

    update statistics skewedtable with fullscan
    go









    share|improve this question



























      3












      3








      3








      On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.



      If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.

      But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.

      Is SQL Server only using the Coarse Histogram Estimation if there is no where clause?



      Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.



      Is there a way to improve the estimations on the nested loop?



      Using following code you can reproduce the data:



      create table MyTable
      (
      id int identity,
      field varchar(50),
      constraint pk_id primary key clustered (id)
      )
      go

      create table SkewedTable
      (
      id int identity,
      startdate datetime,
      myTableId int,
      remark varchar(50),
      constraint pk_id primary key clustered (id)
      )

      set nocount on

      insert into MyTable select top 1000 [name] from master..spt_values
      go

      insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
      go 1000

      insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
      go

      CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
      go

      CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
      go

      --95=varchar in sys.messages
      set nocount off

      ;with cte as
      (
      select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
      union all
      select * from cte
      )
      insert into skewedtable select top 40000 * from cte
      option(maxrecursion 0)
      go

      update statistics mytable with fullscan
      go

      update statistics skewedtable with fullscan
      go









      share|improve this question
















      On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.



      If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.

      But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.

      Is SQL Server only using the Coarse Histogram Estimation if there is no where clause?



      Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.



      Is there a way to improve the estimations on the nested loop?



      Using following code you can reproduce the data:



      create table MyTable
      (
      id int identity,
      field varchar(50),
      constraint pk_id primary key clustered (id)
      )
      go

      create table SkewedTable
      (
      id int identity,
      startdate datetime,
      myTableId int,
      remark varchar(50),
      constraint pk_id primary key clustered (id)
      )

      set nocount on

      insert into MyTable select top 1000 [name] from master..spt_values
      go

      insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
      go 1000

      insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
      go

      CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
      go

      CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
      go

      --95=varchar in sys.messages
      set nocount off

      ;with cte as
      (
      select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
      union all
      select * from cte
      )
      insert into skewedtable select top 40000 * from cte
      option(maxrecursion 0)
      go

      update statistics mytable with fullscan
      go

      update statistics skewedtable with fullscan
      go






      sql-server query-performance sql-server-2016 cardinality-estimates






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited 1 hour ago









      Paul White

      49.4k14260414




      49.4k14260414










      asked 2 hours ago









      Frederik VanderhaegenFrederik Vanderhaegen

      6861316




      6861316






















          2 Answers
          2






          active

          oldest

          votes


















          1















          Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.




          You should find the following filtered statistic helpful:



          CREATE STATISTICS [stats id (field=varchar)]
          ON dbo.MyTable (id)
          WHERE field = 'varchar'
          WITH FULLSCAN;


          This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:



          Actual execution plan



          The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).



          This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).



          If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.






          share|improve this answer

































            0














            Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT operator



            You could add an index or change the existing index:



            CREATE INDEX IX_SkewedTable_MytableId_startdate
            ON SkewedTable(myTableId,startdate)
            INCLUDE(remark);


            Insert the values in an intermediate temp table



            CREATE TABLE  #temp2(param int);
            INSERT INTO #temp2(param)
            SELECT t.id
            FROM mytable t
            WHERE t.field
            = 'varchar';


            enter image description here



            Add an index on the temp table



            CREATE INDEX IX_ID on #temp2(param);


            And then use a CTE to remove the sort operator from the query plan



            ;WITH CTE AS
            (
            select TOP 100 PERCENT
            s.myTableId,s.id,s.remark from
            SkewedTable s
            order by startdate
            )
            SELECT s.id , s.remark
            from CTE s
            INNER JOIN #temp2
            on s.myTableId = #temp2.param
            OPTION(RECOMPILE);


            Result:



            enter image description here



            Which removes the sort operator.



            I am not certain that TOP 100 PERCENT is reliable, in that case you could use:



            DECLARE @counter INT;
            select @counter=COUNT(*) FROM SkewedTable;
            ;WITH CTE AS
            (
            select TOP(@counter)
            s.myTableId,s.id,s.remark from
            SkewedTable s
            order by startdate
            )
            SELECT s.id , s.remark
            from CTE s
            INNER JOIN #temp2
            on s.myTableId = #temp2.param
            OPTION(RECOMPILE);





            share|improve this answer

























              Your Answer








              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "182"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227747%2fnested-loop-has-very-low-estimates-due-to-skewed-data%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1















              Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.




              You should find the following filtered statistic helpful:



              CREATE STATISTICS [stats id (field=varchar)]
              ON dbo.MyTable (id)
              WHERE field = 'varchar'
              WITH FULLSCAN;


              This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:



              Actual execution plan



              The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).



              This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).



              If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.






              share|improve this answer






























                1















                Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.




                You should find the following filtered statistic helpful:



                CREATE STATISTICS [stats id (field=varchar)]
                ON dbo.MyTable (id)
                WHERE field = 'varchar'
                WITH FULLSCAN;


                This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:



                Actual execution plan



                The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).



                This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).



                If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.






                share|improve this answer




























                  1












                  1








                  1








                  Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.




                  You should find the following filtered statistic helpful:



                  CREATE STATISTICS [stats id (field=varchar)]
                  ON dbo.MyTable (id)
                  WHERE field = 'varchar'
                  WITH FULLSCAN;


                  This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:



                  Actual execution plan



                  The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).



                  This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).



                  If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.






                  share|improve this answer
















                  Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.




                  You should find the following filtered statistic helpful:



                  CREATE STATISTICS [stats id (field=varchar)]
                  ON dbo.MyTable (id)
                  WHERE field = 'varchar'
                  WITH FULLSCAN;


                  This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:



                  Actual execution plan



                  The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).



                  This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).



                  If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited 1 hour ago

























                  answered 1 hour ago









                  Paul WhitePaul White

                  49.4k14260414




                  49.4k14260414

























                      0














                      Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT operator



                      You could add an index or change the existing index:



                      CREATE INDEX IX_SkewedTable_MytableId_startdate
                      ON SkewedTable(myTableId,startdate)
                      INCLUDE(remark);


                      Insert the values in an intermediate temp table



                      CREATE TABLE  #temp2(param int);
                      INSERT INTO #temp2(param)
                      SELECT t.id
                      FROM mytable t
                      WHERE t.field
                      = 'varchar';


                      enter image description here



                      Add an index on the temp table



                      CREATE INDEX IX_ID on #temp2(param);


                      And then use a CTE to remove the sort operator from the query plan



                      ;WITH CTE AS
                      (
                      select TOP 100 PERCENT
                      s.myTableId,s.id,s.remark from
                      SkewedTable s
                      order by startdate
                      )
                      SELECT s.id , s.remark
                      from CTE s
                      INNER JOIN #temp2
                      on s.myTableId = #temp2.param
                      OPTION(RECOMPILE);


                      Result:



                      enter image description here



                      Which removes the sort operator.



                      I am not certain that TOP 100 PERCENT is reliable, in that case you could use:



                      DECLARE @counter INT;
                      select @counter=COUNT(*) FROM SkewedTable;
                      ;WITH CTE AS
                      (
                      select TOP(@counter)
                      s.myTableId,s.id,s.remark from
                      SkewedTable s
                      order by startdate
                      )
                      SELECT s.id , s.remark
                      from CTE s
                      INNER JOIN #temp2
                      on s.myTableId = #temp2.param
                      OPTION(RECOMPILE);





                      share|improve this answer






























                        0














                        Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT operator



                        You could add an index or change the existing index:



                        CREATE INDEX IX_SkewedTable_MytableId_startdate
                        ON SkewedTable(myTableId,startdate)
                        INCLUDE(remark);


                        Insert the values in an intermediate temp table



                        CREATE TABLE  #temp2(param int);
                        INSERT INTO #temp2(param)
                        SELECT t.id
                        FROM mytable t
                        WHERE t.field
                        = 'varchar';


                        enter image description here



                        Add an index on the temp table



                        CREATE INDEX IX_ID on #temp2(param);


                        And then use a CTE to remove the sort operator from the query plan



                        ;WITH CTE AS
                        (
                        select TOP 100 PERCENT
                        s.myTableId,s.id,s.remark from
                        SkewedTable s
                        order by startdate
                        )
                        SELECT s.id , s.remark
                        from CTE s
                        INNER JOIN #temp2
                        on s.myTableId = #temp2.param
                        OPTION(RECOMPILE);


                        Result:



                        enter image description here



                        Which removes the sort operator.



                        I am not certain that TOP 100 PERCENT is reliable, in that case you could use:



                        DECLARE @counter INT;
                        select @counter=COUNT(*) FROM SkewedTable;
                        ;WITH CTE AS
                        (
                        select TOP(@counter)
                        s.myTableId,s.id,s.remark from
                        SkewedTable s
                        order by startdate
                        )
                        SELECT s.id , s.remark
                        from CTE s
                        INNER JOIN #temp2
                        on s.myTableId = #temp2.param
                        OPTION(RECOMPILE);





                        share|improve this answer




























                          0












                          0








                          0







                          Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT operator



                          You could add an index or change the existing index:



                          CREATE INDEX IX_SkewedTable_MytableId_startdate
                          ON SkewedTable(myTableId,startdate)
                          INCLUDE(remark);


                          Insert the values in an intermediate temp table



                          CREATE TABLE  #temp2(param int);
                          INSERT INTO #temp2(param)
                          SELECT t.id
                          FROM mytable t
                          WHERE t.field
                          = 'varchar';


                          enter image description here



                          Add an index on the temp table



                          CREATE INDEX IX_ID on #temp2(param);


                          And then use a CTE to remove the sort operator from the query plan



                          ;WITH CTE AS
                          (
                          select TOP 100 PERCENT
                          s.myTableId,s.id,s.remark from
                          SkewedTable s
                          order by startdate
                          )
                          SELECT s.id , s.remark
                          from CTE s
                          INNER JOIN #temp2
                          on s.myTableId = #temp2.param
                          OPTION(RECOMPILE);


                          Result:



                          enter image description here



                          Which removes the sort operator.



                          I am not certain that TOP 100 PERCENT is reliable, in that case you could use:



                          DECLARE @counter INT;
                          select @counter=COUNT(*) FROM SkewedTable;
                          ;WITH CTE AS
                          (
                          select TOP(@counter)
                          s.myTableId,s.id,s.remark from
                          SkewedTable s
                          order by startdate
                          )
                          SELECT s.id , s.remark
                          from CTE s
                          INNER JOIN #temp2
                          on s.myTableId = #temp2.param
                          OPTION(RECOMPILE);





                          share|improve this answer















                          Completely agree with the filtered index, this answer is added to expand on the other option that @PaulWhite mentioned, to use an intermediate temp table and consequentially get rid of the SORT operator



                          You could add an index or change the existing index:



                          CREATE INDEX IX_SkewedTable_MytableId_startdate
                          ON SkewedTable(myTableId,startdate)
                          INCLUDE(remark);


                          Insert the values in an intermediate temp table



                          CREATE TABLE  #temp2(param int);
                          INSERT INTO #temp2(param)
                          SELECT t.id
                          FROM mytable t
                          WHERE t.field
                          = 'varchar';


                          enter image description here



                          Add an index on the temp table



                          CREATE INDEX IX_ID on #temp2(param);


                          And then use a CTE to remove the sort operator from the query plan



                          ;WITH CTE AS
                          (
                          select TOP 100 PERCENT
                          s.myTableId,s.id,s.remark from
                          SkewedTable s
                          order by startdate
                          )
                          SELECT s.id , s.remark
                          from CTE s
                          INNER JOIN #temp2
                          on s.myTableId = #temp2.param
                          OPTION(RECOMPILE);


                          Result:



                          enter image description here



                          Which removes the sort operator.



                          I am not certain that TOP 100 PERCENT is reliable, in that case you could use:



                          DECLARE @counter INT;
                          select @counter=COUNT(*) FROM SkewedTable;
                          ;WITH CTE AS
                          (
                          select TOP(@counter)
                          s.myTableId,s.id,s.remark from
                          SkewedTable s
                          order by startdate
                          )
                          SELECT s.id , s.remark
                          from CTE s
                          INNER JOIN #temp2
                          on s.myTableId = #temp2.param
                          OPTION(RECOMPILE);






                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited 12 mins ago

























                          answered 38 mins ago









                          Randi VertongenRandi Vertongen

                          1,866316




                          1,866316






























                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Database Administrators Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227747%2fnested-loop-has-very-low-estimates-due-to-skewed-data%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              How to label and detect the document text images

                              Tabula Rosettana

                              Aureus (color)