Algorithm to convert a fixed-length string to the smallest possible collision-free representation?

I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).

Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.

Are you aware of any algorithm implementation for this?

edited 14 hours ago

greybeard

4131314

asked 16 hours ago

anemaria20

1061

New contributor

1

$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
13 hours ago

1

$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
6 hours ago

add a comment |

Are you aware of any algorithm implementation for this?

edited 14 hours ago

greybeard

4131314

asked 16 hours ago

anemaria20

1061

New contributor

1

$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
13 hours ago

1

$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
6 hours ago

add a comment |

Are you aware of any algorithm implementation for this?

edited 14 hours ago

greybeard

4131314

asked 16 hours ago

anemaria20

1061

New contributor

Are you aware of any algorithm implementation for this?

algorithms data-compression base-conversion

edited 14 hours ago

greybeard

4131314

asked 16 hours ago

anemaria20

1061

New contributor

edited 14 hours ago

greybeard

4131314

asked 16 hours ago

anemaria20

1061

New contributor

edited 14 hours ago

greybeard

4131314

edited 14 hours ago

greybeard

4131314

edited 14 hours ago

greybeard

4131314

asked 16 hours ago

anemaria20

1061

New contributor

asked 16 hours ago

anemaria20

1061

asked 16 hours ago

anemaria20

1061

New contributor

anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

1

$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
13 hours ago

1

$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
6 hours ago

add a comment |

1

$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
13 hours ago

1

$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
6 hours ago

Check out Huffman coding.

– Pål GD
13 hours ago

The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?

– JesseM
6 hours ago

add a comment |

2 Answers
2

active

oldest

votes

You have 10 base10 digits. This is $10^{10}$ possible values.

Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.

If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.

Unless there is a pattern in the number itself this is the best you can do.

Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.

answered 16 hours ago

ratchet freak

2,86388

add a comment |

As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:

[2-9][0-8]d - [2-9]dd - dddd

This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)

If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.

Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.

(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)

P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.

edited 11 hours ago

answered 14 hours ago

Draconis

5,252820

$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
12 hours ago

$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
12 hours ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "419"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f105488%2falgorithm-to-convert-a-fixed-length-string-to-the-smallest-possible-collision-fr%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

You have 10 base10 digits. This is $10^{10}$ possible values.

Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.

If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.

Unless there is a pattern in the number itself this is the best you can do.

Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.

answered 16 hours ago

ratchet freak

2,86388

add a comment |

You have 10 base10 digits. This is $10^{10}$ possible values.

Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.

If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.

Unless there is a pattern in the number itself this is the best you can do.

Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.

answered 16 hours ago

ratchet freak

2,86388

add a comment |

You have 10 base10 digits. This is $10^{10}$ possible values.

Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.

If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.

Unless there is a pattern in the number itself this is the best you can do.

Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.

answered 16 hours ago

ratchet freak

2,86388

You have 10 base10 digits. This is $10^{10}$ possible values.

Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.

If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.

Unless there is a pattern in the number itself this is the best you can do.

Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.

answered 16 hours ago

ratchet freak

2,86388

answered 16 hours ago

ratchet freak

2,86388

answered 16 hours ago

ratchet freak

2,86388

answered 16 hours ago

ratchet freak

2,86388

add a comment |

[2-9][0-8]d - [2-9]dd - dddd

(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)

edited 11 hours ago

answered 14 hours ago

Draconis

5,252820

$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
12 hours ago

$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
12 hours ago

add a comment |

[2-9][0-8]d - [2-9]dd - dddd

(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)

edited 11 hours ago

answered 14 hours ago

Draconis

5,252820

$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
12 hours ago

$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
12 hours ago

add a comment |

[2-9][0-8]d - [2-9]dd - dddd

(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)

edited 11 hours ago

answered 14 hours ago

Draconis

5,252820

[2-9][0-8]d - [2-9]dd - dddd

(You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)

edited 11 hours ago

answered 14 hours ago

Draconis

5,252820

edited 11 hours ago

answered 14 hours ago

Draconis

5,252820

answered 14 hours ago

Draconis

5,252820

answered 14 hours ago

Draconis

5,252820

$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
12 hours ago

$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
12 hours ago

add a comment |

$begingroup$
how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
$endgroup$
– ratchet freak
12 hours ago

$begingroup$
@ratchetfreak D'oh, of course, you're right. Let me adjust that.
$endgroup$
– Draconis
12 hours ago

how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.

– ratchet freak
12 hours ago

@ratchetfreak D'oh, of course, you're right. Let me adjust that.

– Draconis
12 hours ago

add a comment |

anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Computer Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htydjtk