Why are Shapefiles limited to 2GB in size? [on hold]












I'm hoping somebody can clarify why .shp files are limited to a 2GB file size. Having read through the ESRI considerations and technical description, I cannot understand why the limit exists.

Since the format uses dBASE for the .dbf component of the multi-file format, it must abide by dBASE limits, which cap the maximum file size at 2GB. But that raises the same question: why does that limit exist? Does it have something to do with these formats being created when 32-bit operating systems were widely used? If so, how does that influence the limit? I've seen posts describing the limit as 2^31 - 1 bytes, which is ~2.1GB, but that just tells me 32-bit addressing is involved; I'm not sure how it fits here. Other posts mention that these formats use 32-bit offsets, specifically "32-bit offsets to 16-bit words", but I don't follow that either.










shapefile dbf file-size

asked 15 hours ago, edited 15 hours ago
datta

put on hold as too broad by Vince, Hornbydd, Erik, Jochen Schwarze, whyzar 13 hours ago

  • Because they use 32-bit ints for the addressing (and change byte order halfway through the header).
    – Ian Turton, 15 hours ago

  • @IanTurton Can you possibly expand on that in an answer, with an example, if you have the time? I'm reading the technical description thoroughly, but I don't see anywhere that 32-bit addressing is used. Why does changing from big endian to little endian affect that too? I've also updated the last sentence in my OP with something that could hopefully be clarified.
    – datta, 15 hours ago

  • Table 1, Byte 24: Integer, File Length. Four bytes, 32 bits. The offsets in the SHX file are four-byte integers too, so they can't address anything past the 2GB limit, assuming signed binary.
    – Spacedman, 15 hours ago

  • Offsets are encoded as 32-bit signed integers, but with a 2-byte word length, so they could have addressed 4GB (8GB if unsigned). Which is why the specification explicitly limits the size.
    – Vince, 15 hours ago

  • @Vince Now onto the offsets and 2-byte words. I'm not sure what this means in relation to the file-size limitation. Aren't the offsets simply 32-bit numbers representing the total number of 16-bit words between the start of the main file and the specific feature record header? How does that tie back into the 2GB limit?
    – datta, 12 hours ago
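To make the arithmetic in these comments concrete, here is a minimal sketch (an illustration, not from the original thread) of what a 32-bit signed offset counted in 16-bit words can and cannot address. The encoding could reach just under 4GB; the specification's 2GB cap keeps plain byte offsets inside the signed 32-bit range.

    #include <cstdint>
    #include <iostream>

    int main() {
        // An SHX offset is a signed 32-bit count of 16-bit words,
        // measured from the start of the main (.shp) file.
        const std::int64_t max_word_offset = 2147483647; // 2^31 - 1
        const std::int64_t bytes_per_word  = 2;          // one word = 16 bits

        // Theoretical reach of the encoding: just under 4GB.
        std::cout << max_word_offset * bytes_per_word << " bytes reachable\n"; // 4294967294

        // The specification caps the file at 2GB, so a byte offset
        // (words * 2) always still fits in a signed 32-bit integer.
        std::cout << (std::int64_t{1} << 31) << " bytes spec limit\n"; // 2147483648
        return 0;
    }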















1 Answer
You're asking several history-of-computing questions here. All the reasons you've listed are true: the maximum file size on the OS was 2GB, the maximum integer value was 2GB, and the maximum file offset in the OSes was 2GB. But the shapefile format specification explicitly states that it has a 2GB limit. Isn't that enough of a reason?

There are scads of newer formats that outperform the shapefile. The file geodatabase is so much better that I haven't created an output shapefile this decade. But I've used input shapefiles because that was what was available, and I've generated new shapefiles with turn-of-the-millennium tools, because that's what was available then.

Has computing changed? Of course it has. Can you hack the shapefile format to 4GB or 8GB? Yes, but not without being non-conformant. And it's conformance that is the shapefile's greatest strength; violating conformance is what will destroy whatever utility remains in the format.






answered 15 hours ago, edited 13 hours ago
Vince
  • I appreciate your input. The driving reason behind this question is a C++ header I am writing for a library, and I wanted to add some implementation details regarding file creation for a personal project. I was trying to understand things like "Why use 32-bit signed?", since the offsets will never be negative. Are things like that just safeguards so that they could limit the file to 2GB? Why did they specify 16-bit words? And so on.
    – datta, 13 hours ago

  • The design was a combination of "simple is best" and "platform independence is hard". The header has both big-endian and little-endian values to ensure that any translator, on Intel or Motorola, would need to implement endian swapping. Unsigned offers more opportunities for undetected corruption. Processing 70 million features in one file was so far beyond the 80386 chip on which the format was implemented that it wasn't worth worrying about.
    – Vince, 13 hours ago

  • Thanks for sharing the background. I find this stuff very important when determining how to implement a library, so I can share what I learned with others in the implementation details and reasoning.
    – datta, 12 hours ago

  • You do not need this information to implement a standard, especially such an ancient standard.
    – Vince, 11 hours ago

  • Fair enough. As I said, it's a personal project. Now I am interested to know how the 32-bit offsets work with the 2-byte words to establish a file-size limit of 2GB. Do you think you could expand on that final piece? You've answered everything else in great detail, and that's the last component I'm fuzzy on.
    – datta, 10 hours ago
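Since the discussion keeps returning to the file-length field Spacedman cited, here is a minimal, hypothetical C++ sketch of reading it: per Table 1 of the technical description, bytes 24-27 of the .shp header hold the file length as a big-endian signed 32-bit integer counted in 16-bit words. The program name and all error handling are invented for illustration; real code should also validate the file code (9994).

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: shplen <file.shp>\n"; return 1; }

        std::ifstream in(argv[1], std::ios::binary);
        unsigned char buf[4];
        in.seekg(24);                                 // Table 1: byte 24, File Length
        in.read(reinterpret_cast<char*>(buf), 4);

        // This field is big-endian, so assemble the integer manually.
        std::uint32_t words = (std::uint32_t(buf[0]) << 24) |
                              (std::uint32_t(buf[1]) << 16) |
                              (std::uint32_t(buf[2]) <<  8) |
                               std::uint32_t(buf[3]);

        // The length is counted in 16-bit words; multiply by 2 for bytes.
        std::cout << std::int64_t(words) * 2 << " bytes\n";
        return 0;
    }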

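And on Vince's point above that the mixed byte orders force every translator to implement endian swapping: the file code and file length are big-endian while the version and shape type are little-endian, so a reader needs a swap on at least one field regardless of platform. A generic sketch of such a helper (the name is hypothetical):

    #include <cstdint>

    // Reverse the byte order of a 32-bit value. On a little-endian
    // machine this converts the header's big-endian fields; on a
    // big-endian machine it converts the little-endian ones.
    std::uint32_t byteswap32(std::uint32_t v) {
        return ((v & 0x000000FFu) << 24) |
               ((v & 0x0000FF00u) <<  8) |
               ((v & 0x00FF0000u) >>  8) |
               ((v & 0xFF000000u) >> 24);
    }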
















