Why are Shapefiles limited to 2GB in size? [on hold]












I'm hoping somebody can clarify why .shp files are limited to a 2GB file size. Having read through the ESRI considerations and technical description, I cannot understand why the limit exists.

Since the format uses dBASE for the .dbf component of the multi-file format, it must abide by dBASE limits, which cap the maximum file size at 2GB. But that raises the same question: why does that limit exist? Does it have something to do with these formats being created when 32-bit operating systems were widely used? If so, how does that influence the limit? I've seen posts describing the limit as 2^31 - 1 bytes, which is ~2.1GB, but that just tells me 32-bit addressing is involved; I'm not sure how it fits here. Other posts mention that these formats use 32-bit offsets, specifically "32-bit offsets to 16-bit words", but I don't follow that either.










shapefile dbf file-size

asked 15 hours ago, edited 15 hours ago
datta

put on hold as too broad by Vince, Hornbydd, Erik, Jochen Schwarze, whyzar 13 hours ago

  • Because they use 32-bit ints for the addressing (and change byte order halfway through the header).
    – Ian Turton, 15 hours ago

  • @IanTurton Can you possibly expand on that in an answer, with an example, if you have the time? I'm reading the technical description thoroughly, but I don't see anywhere that 32-bit addressing is used. Why does changing from big endian to little endian affect that too? I've also updated the last sentence in my OP with something that could hopefully be clarified.
    – datta, 15 hours ago

  • Table 1, Byte 24: Integer, File Length. Four bytes, 32 bits. The offsets in the SHX file are four-byte integers too, so they can't address anything past the 2GB limit, assuming signed binary.
    – Spacedman, 15 hours ago

  • Offsets are encoded as 32-bit signed integers, but with a 2-byte word length, so they could have addressed 4GB (8GB if unsigned). Which is why the specification explicitly limits the size.
    – Vince, 15 hours ago

  • @Vince Now onto the offsets and 2-byte words. I'm not sure what this means in relation to the file-size limitation. Aren't the offsets simply 32-bit numbers representing the total number of 16-bit words between the start of the main file and the specific feature record header? How does that tie back into the 2GB limit?
    – datta, 12 hours ago
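To make the arithmetic in these comments concrete, here is a minimal sketch (an illustration, not from the original thread) of what a 32-bit signed offset counted in 16-bit words can and cannot address. The encoding could reach just under 4GB; the specification's 2GB cap keeps plain byte offsets inside the signed 32-bit range.

    #include <cstdint>
    #include <iostream>

    int main() {
        // An SHX offset is a signed 32-bit count of 16-bit words,
        // measured from the start of the main (.shp) file.
        const std::int64_t max_word_offset = 2147483647; // 2^31 - 1
        const std::int64_t bytes_per_word  = 2;          // one word = 16 bits

        // Theoretical reach of the encoding: just under 4GB.
        std::cout << max_word_offset * bytes_per_word << " bytes reachable\n"; // 4294967294

        // The specification caps the file at 2GB, so a byte offset
        // (words * 2) always still fits in a signed 32-bit integer.
        std::cout << (std::int64_t{1} << 31) << " bytes spec limit\n"; // 2147483648
        return 0;
    }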















1 Answer
You're asking several history-of-computing questions here. All the reasons you've listed are true: the maximum file size on the OS was 2GB, the maximum integer value was 2GB, and the maximum file offset in the OSes was 2GB. But the shapefile format specification explicitly states that it has a 2GB limit. Isn't that enough of a reason?

There are scads of newer formats that outperform the shapefile. The file geodatabase is so much better that I haven't created an output shapefile this decade. But I've used input shapefiles because that was what was available, and I've generated new shapefiles with turn-of-the-millennium tools, because that's what was available then.

Has computing changed? Of course it has. Can you hack the shapefile format to 4GB or 8GB? Yes, but not without being non-conformant. And it's conformance that is the shapefile's greatest strength; violating conformance is what will destroy whatever utility remains in the format.






answered 15 hours ago, edited 13 hours ago
Vince
  • I appreciate your input. The driving reason behind this question is a C++ header I am writing for a library, and I wanted to add some implementation details regarding file creation for a personal project. I was trying to understand things like "Why use 32-bit signed?", since the offsets will never be negative. Are things like that just safeguards so that they could limit the file to 2GB? Why did they specify 16-bit words? And so on.
    – datta, 13 hours ago

  • The design was a combination of "simple is best" and "platform independence is hard". The header has both big-endian and little-endian values to ensure that any translator, on Intel or Motorola, would need to implement endian swapping. Unsigned offers more opportunities for undetected corruption. Processing 70 million features in one file was so far beyond the 80386 chip on which the format was implemented that it wasn't worth worrying about.
    – Vince, 13 hours ago

  • Thanks for sharing the background. I find this stuff very important when determining how to implement a library, so I can share what I learned with others in the implementation details and reasoning.
    – datta, 12 hours ago

  • You do not need this information to implement a standard, especially such an ancient standard.
    – Vince, 11 hours ago

  • Fair enough. As I said, it's a personal project. Now I am interested to know how the 32-bit offsets work with the 2-byte words to establish a file-size limit of 2GB. Do you think you could expand on that final piece? You've answered everything else in great detail, and that's the last component I'm fuzzy on.
    – datta, 10 hours ago
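Since the discussion keeps returning to the file-length field Spacedman cited, here is a minimal, hypothetical C++ sketch of reading it: per Table 1 of the technical description, bytes 24-27 of the .shp header hold the file length as a big-endian signed 32-bit integer counted in 16-bit words. The program name and all error handling are invented for illustration; real code should also validate the file code (9994).

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: shplen <file.shp>\n"; return 1; }

        std::ifstream in(argv[1], std::ios::binary);
        unsigned char buf[4];
        in.seekg(24);                                 // Table 1: byte 24, File Length
        in.read(reinterpret_cast<char*>(buf), 4);

        // This field is big-endian, so assemble the integer manually.
        std::uint32_t words = (std::uint32_t(buf[0]) << 24) |
                              (std::uint32_t(buf[1]) << 16) |
                              (std::uint32_t(buf[2]) <<  8) |
                               std::uint32_t(buf[3]);

        // The length is counted in 16-bit words; multiply by 2 for bytes.
        std::cout << std::int64_t(words) * 2 << " bytes\n";
        return 0;
    }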

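And on Vince's point above that the mixed byte orders force every translator to implement endian swapping: the file code and file length are big-endian while the version and shape type are little-endian, so a reader needs a swap on at least one field regardless of platform. A generic sketch of such a helper (the name is hypothetical):

    #include <cstdint>

    // Reverse the byte order of a 32-bit value. On a little-endian
    // machine this converts the header's big-endian fields; on a
    // big-endian machine it converts the little-endian ones.
    std::uint32_t byteswap32(std::uint32_t v) {
        return ((v & 0x000000FFu) << 24) |
               ((v & 0x0000FF00u) <<  8) |
               ((v & 0x00FF0000u) >>  8) |
               ((v & 0xFF000000u) >> 24);
    }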
















