Algorithm to convert a fixed-length string to the smallest possible collision-free representation?












1












$begingroup$


I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).



Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.



Are you aware of any algorithm implementation for this?










share|cite|improve this question









New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$








  • 1




    $begingroup$
    Check out Huffman coding.
    $endgroup$
    – Pål GD
    13 hours ago






  • 1




    $begingroup$
    The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
    $endgroup$
    – JesseM
    6 hours ago
















1












$begingroup$


I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).



Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.



Are you aware of any algorithm implementation for this?










share|cite|improve this question









New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$








  • 1




    $begingroup$
    Check out Huffman coding.
    $endgroup$
    – Pål GD
    13 hours ago






  • 1




    $begingroup$
    The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
    $endgroup$
    – JesseM
    6 hours ago














1












1








1





$begingroup$


I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).



Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.



Are you aware of any algorithm implementation for this?










share|cite|improve this question









New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




I have a US-based telephone number (in the format 000-000-0000) and need to convert it to a "shorter" representation. Using base32/64 produces too long of a string. I've discovered CRC-16, which produces a string of length 4, however, I'm not sure it's a suitable tool for the job (I want to prevent collisions).



Basically, if the input is 000-000-0000 (without the dashes), I want the output to be something like 4xg4. In this use case, the advantage is that I have 10 digits, thus I could "convert" them to something smaller that uses letters as well.



Are you aware of any algorithm implementation for this?







algorithms data-compression base-conversion






share|cite|improve this question









New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|cite|improve this question









New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|cite|improve this question




share|cite|improve this question








edited 14 hours ago









greybeard

4131314




4131314






New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 16 hours ago









anemaria20anemaria20

1061




1061




New contributor




anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






anemaria20 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.








  • 1




    $begingroup$
    Check out Huffman coding.
    $endgroup$
    – Pål GD
    13 hours ago






  • 1




    $begingroup$
    The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
    $endgroup$
    – JesseM
    6 hours ago














  • 1




    $begingroup$
    Check out Huffman coding.
    $endgroup$
    – Pål GD
    13 hours ago






  • 1




    $begingroup$
    The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
    $endgroup$
    – JesseM
    6 hours ago








1




1




$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
13 hours ago




$begingroup$
Check out Huffman coding.
$endgroup$
– Pål GD
13 hours ago




1




1




$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
6 hours ago




$begingroup$
The answers below are mathematically correct. With the requirement of "no collisions" and no further information which could reduce your possible inputs, those are the best you can do. But if we understood why you needed a shorter representation, we might be able to offer other solutions. Disk space or memory seem unlikely constraints if you have the CPU to compress/decompress on the fly this kind of data. So is it some sort of UI constraint where the display needs to be shorter? Or something else?
$endgroup$
– JesseM
6 hours ago










2 Answers
2






active

oldest

votes


















8












$begingroup$

You have 10 base10 digits. This is $10^{10}$ possible values.



Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.



If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.



Unless there is a pattern in the number itself this is the best you can do.



Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.






share|cite|improve this answer









$endgroup$





















    8












    $begingroup$

    As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:



    [2-9][0-8]d - [2-9]dd - dddd


    This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)



    If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.



    Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.



    (You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)



    P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.






    share|cite|improve this answer











    $endgroup$













    • $begingroup$
      how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
      $endgroup$
      – ratchet freak
      12 hours ago










    • $begingroup$
      @ratchetfreak D'oh, of course, you're right. Let me adjust that.
      $endgroup$
      – Draconis
      12 hours ago











    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "419"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });






    anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.










    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f105488%2falgorithm-to-convert-a-fixed-length-string-to-the-smallest-possible-collision-fr%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    8












    $begingroup$

    You have 10 base10 digits. This is $10^{10}$ possible values.



    Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.



    If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.



    Unless there is a pattern in the number itself this is the best you can do.



    Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.






    share|cite|improve this answer









    $endgroup$


















      8












      $begingroup$

      You have 10 base10 digits. This is $10^{10}$ possible values.



      Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.



      If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.



      Unless there is a pattern in the number itself this is the best you can do.



      Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.






      share|cite|improve this answer









      $endgroup$
















        8












        8








        8





        $begingroup$

        You have 10 base10 digits. This is $10^{10}$ possible values.



        Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.



        If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.



        Unless there is a pattern in the number itself this is the best you can do.



        Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.






        share|cite|improve this answer









        $endgroup$



        You have 10 base10 digits. This is $10^{10}$ possible values.



        Encoding this in an alphabet with 64 tokens you need $lceil log_{64}(10^{10}) rceil = 6$ characters.



        If you encode it in straight binary you only need $lceil log_{2}(10^{10}) rceil = 34$ bits total. 4 bytes + 2 bits.



        Unless there is a pattern in the number itself this is the best you can do.



        Note that this looks at the number in a vacuum. If you encode more than just that number you can use arithmetic encoding to use some of the "spare room" the rounding creates.







        share|cite|improve this answer












        share|cite|improve this answer



        share|cite|improve this answer










        answered 16 hours ago









        ratchet freakratchet freak

        2,86388




        2,86388























            8












            $begingroup$

            As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:



            [2-9][0-8]d - [2-9]dd - dddd


            This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)



            If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.



            Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.



            (You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)



            P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.






            share|cite|improve this answer











            $endgroup$













            • $begingroup$
              how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
              $endgroup$
              – ratchet freak
              12 hours ago










            • $begingroup$
              @ratchetfreak D'oh, of course, you're right. Let me adjust that.
              $endgroup$
              – Draconis
              12 hours ago
















            8












            $begingroup$

            As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:



            [2-9][0-8]d - [2-9]dd - dddd


            This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)



            If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.



            Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.



            (You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)



            P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.






            share|cite|improve this answer











            $endgroup$













            • $begingroup$
              how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
              $endgroup$
              – ratchet freak
              12 hours ago










            • $begingroup$
              @ratchetfreak D'oh, of course, you're right. Let me adjust that.
              $endgroup$
              – Draconis
              12 hours ago














            8












            8








            8





            $begingroup$

            As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:



            [2-9][0-8]d - [2-9]dd - dddd


            This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)



            If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.



            Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.



            (You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)



            P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.






            share|cite|improve this answer











            $endgroup$



            As ratchet freak says, you have ten decimal digits, which should give $10^{10}$ possible values. But in practice, there are a few more restrictions. The format of a North American telephone number looks something like this:



            [2-9][0-8]d - [2-9]dd - dddd


            This gives 8*9*10 * 8*10*10 * 10*10*10*10 values. In addition, the fifth and sixth characters can't both be 1, which removes 1% of these. This means there are 5,702,400,000 possible telephone numbers. (Let's call this number $T$.)



            If you want to encode these using an alphabet with $b$ different characters, you'll need a code $lceil log_b T rceil$ characters long. So if you want a code four characters long, you'll need $b geq 275$. If you want a code five characters long, you'll need $b geq 90$. And if you want a code six characters long, you'll need $b geq 43$.



            Unfortunately, this is the best you can do, unless you have additional information about what numbers you'll be getting. This comes down to the pigeonhole principle, a remarkably straightforward theorem that's quite useful in CS. It's mathematically impossible to use fewer characters than this without ever having any collisions. You can make collisions extremely unlikely, but never fully prevent them.



            (You can also make some of the codes be shorter without collisions, but as a consequence some of them will have to be longer. Again, pigeonhole principle.)



            P.S. Note that, while base64 also produces a six-character string (since 64 is between 43 and 90), you can take advantage of the fact that you only need 43 to choose a "nicer" character set. For example, you can get rid of o, O, 0, Q, I, l, and 1 to cut down on confusion in writing. My suggestion would be 23456789ABCDEFGHJKLMNPRSTVXYZabdeghknpqrt-=.







            share|cite|improve this answer














            share|cite|improve this answer



            share|cite|improve this answer








            edited 11 hours ago

























            answered 14 hours ago









            DraconisDraconis

            5,252820




            5,252820












            • $begingroup$
              how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
              $endgroup$
              – ratchet freak
              12 hours ago










            • $begingroup$
              @ratchetfreak D'oh, of course, you're right. Let me adjust that.
              $endgroup$
              – Draconis
              12 hours ago


















            • $begingroup$
              how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
              $endgroup$
              – ratchet freak
              12 hours ago










            • $begingroup$
              @ratchetfreak D'oh, of course, you're right. Let me adjust that.
              $endgroup$
              – Draconis
              12 hours ago
















            $begingroup$
            how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
            $endgroup$
            – ratchet freak
            12 hours ago




            $begingroup$
            how does "the fifth and sixth characters can't both be 1" remove 8 possible numbers? shouldn't that be 1 out of every 100? So 5,702,400,000 total possible numbers.
            $endgroup$
            – ratchet freak
            12 hours ago












            $begingroup$
            @ratchetfreak D'oh, of course, you're right. Let me adjust that.
            $endgroup$
            – Draconis
            12 hours ago




            $begingroup$
            @ratchetfreak D'oh, of course, you're right. Let me adjust that.
            $endgroup$
            – Draconis
            12 hours ago










            anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.










            draft saved

            draft discarded


















            anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.













            anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.












            anemaria20 is a new contributor. Be nice, and check out our Code of Conduct.
















            Thanks for contributing an answer to Computer Science Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f105488%2falgorithm-to-convert-a-fixed-length-string-to-the-smallest-possible-collision-fr%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            How to label and detect the document text images

            Vallis Paradisi

            Tabula Rosettana