Probabilistic Machine Learning model to match spatial data












      I have spatial data from multiple sources. This data consists of ID, lat, long, and time.



      My goal is that, given a new lat-long, the model returns the data points that match it, preferably with a probability. The matching should be based on the features (such as lat, long, and timestamp).



      I could only think of clustering, i.e., cluster the dataset and try to predict which cluster the new data point belongs to. The drawback is that if the cluster has a lot of points, it's hard to pinpoint which point in the cluster is closest to the new point.



      Are there any other ways to do this? Perhaps a probabilistic model (an HMM?).







      Tags: probability, model-selection, geospatial















      asked 2 days ago by ajroot


























          1 Answer



















































          So you have an existing dataset:
          $$\mathbf{X} = \{ [id_i, latitude_i, longitude_i, time_i] : i \in \{1,\dots,n\} \} $$
          And you receive a new sample:
          $$ \mathbf{x}_* = [latitude, longitude, now] $$
          And you want to determine the probability that each data point in $\mathbf{X}$ matches $\mathbf{x}_*$?



          At the risk of stating the obvious, why not just use K-nearest-neighbours with $K=1$? You would need to establish a distance metric that accounts for time, for example by equating a time difference to an equivalent difference in latitude and longitude, or you could ignore time altogether.
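As a rough illustration of the 1-nearest-neighbour idea, here is a minimal sketch in plain Python. The names `space_time_distance`, `nearest_match`, and `time_scale` are hypothetical, as are the sample records. It assumes the points lie in a small enough area that degrees can be treated as planar coordinates, and that `time_scale` is a factor you choose to convert one unit of time into an equivalent number of degrees. Setting `time_scale=0.0` ignores time altogether.

```python
import math

def space_time_distance(a, b, time_scale):
    """Euclidean distance over (lat, lon, time_scale * time).

    a and b are (lat, lon, time) tuples. time_scale is an assumed,
    problem-specific conversion from time units to degrees.
    """
    d_lat = a[0] - b[0]
    d_lon = a[1] - b[1]
    d_t = time_scale * (a[2] - b[2])
    return math.sqrt(d_lat ** 2 + d_lon ** 2 + d_t ** 2)

def nearest_match(records, query, time_scale=0.0):
    """Return the record closest to the query (1-nearest-neighbour).

    records: list of (id, lat, lon, time) tuples
    query:   (lat, lon, time) tuple
    """
    return min(
        records,
        key=lambda r: space_time_distance(r[1:], query, time_scale),
    )

# Hypothetical data: times are in seconds on an arbitrary clock.
records = [
    ("a", 51.505, -0.115, 100.0),   # very close in space, old
    ("b", 51.520, -0.100, 990.0),   # slightly further away, recent
    ("c", 48.850, 2.350, 995.0),    # far away, recent
]
query = (51.51, -0.11, 1000.0)

print(nearest_match(records, query, time_scale=0.0)[0])    # space only
print(nearest_match(records, query, time_scale=0.001)[0])  # space and time
```

No training step is involved: the stored records are scanned directly for each query. For large datasets a spatial index would be faster than this linear scan, but the logic stays the same.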



          Distance Metric



          The notion of nearest is tied to some measure of distance. On a map, each point corresponds to a coordinate and the notion of distance is intuitive: given a sample data point, you can just find the nearest data point to it on the map.
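For latitude and longitude specifically, one common way to make "distance on the map" concrete is the great-circle (haversine) distance. This is a sketch rather than something the answer prescribes: the function name `haversine_km` is hypothetical, and the Earth is assumed to be a sphere with a mean radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude, and one degree of longitude at 60 degrees north.
print(haversine_km(0.0, 0.0, 1.0, 0.0))
print(haversine_km(60.0, 0.0, 60.0, 1.0))
```

The second call shows why raw degree differences can mislead: a degree of longitude covers roughly half the ground distance at 60 degrees north that a degree of latitude does.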



          But what if you have another dimension for which distance is not intuitively obvious? This is often specific to the problem. Suppose you were searching for two criminals in a city: criminal Adam was last seen at coordinates (0,0) 1 week ago, and criminal Brian was last seen at coordinates (1,1) 1 day ago. Now you have a new sighting at (0.25, 0.25). This is geographically closer to (0,0) than to (1,1), but Brian was seen more recently, so perhaps he is the more likely lead to allocate search resources to?
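The Adam and Brian example can be worked through numerically. The sketch below assumes one particular way of combining space and time, a weighted Euclidean distance with a hypothetical weight `w` on the time difference (in days); other combinations are equally possible. With this choice, the lead switches from Adam to Brian once `w` exceeds 1/sqrt(48), about 0.144, which is where 0.125 + 49w² equals 1.125 + w².

```python
import math

# (x, y, days_ago) for each last known sighting, taken from the example.
sightings = {
    "Adam": (0.0, 0.0, 7.0),
    "Brian": (1.0, 1.0, 1.0),
}
new_sighting = (0.25, 0.25, 0.0)  # seen now

def combined_distance(old, new, w):
    """Weighted space-time distance; w is an assumed weight on time."""
    dx, dy, dt = (o - n for o, n in zip(old, new))
    return math.sqrt(dx ** 2 + dy ** 2 + (w * dt) ** 2)

def best_lead(w):
    """Name of the sighting closest to the new one for a given weight."""
    return min(
        sightings,
        key=lambda name: combined_distance(sightings[name], new_sighting, w),
    )

print(best_lead(0.0))  # time ignored: geography alone decides
print(best_lead(0.2))  # time weighted: recency changes the answer
```

Raising `w` is one way to give the time feature more importance in the nearest-neighbour search; the right value depends on how quickly a sighting goes stale in your problem.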



          When using K-nearest neighbors you might need some kind of transformation function to convert distances to probabilities. There are numerous choices; the softmax function is one possible starting point.
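One possible way to turn distances into probability-like scores is a softmax over the negated distances, so that closer points receive more weight. The function name `match_probabilities` and the `temperature` parameter are hypothetical, and the outputs are normalised scores that sum to 1 rather than calibrated probabilities.

```python
import math

def match_probabilities(distances, temperature=1.0):
    """Softmax over negative distances.

    Smaller distances give larger scores. temperature is an assumed
    tuning knob: lower values concentrate the mass on the nearest point.
    """
    scores = [-d / temperature for d in distances]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Distances from a query to its three nearest candidates (made-up values).
probs = match_probabilities([0.5, 1.0, 3.0])
print(probs)
print(sum(probs))
```

This fits the case where several stored points plausibly match the new one: take the K nearest candidates, compute their distances, and rank them by these scores.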















          answered 2 days ago by Attack68 (edited 2 days ago)












          • Yes, that makes sense. I have looked into using k-nearest neighbors. My only question is that my model must be trained on all the factors (id, time, lat, long, bearing), and I was unsure how to include all of these features (time mainly). I would also like to point out that multiple returned points may match the new data point; a probability would help determine which one to choose. Could you elaborate on "establish a distance metric"?
            – ajroot, 2 days ago

          • K-nearest-neighbours does not need training. The dataset itself serves as the model: for each new data point, you find its nearest samples.
            – Attack68, 2 days ago

          • How do you recommend giving more importance to the time feature while using KNN?
            – ajroot, yesterday



























