Elasticsearch duplicate document check





The basic query to find the duplicate count is shown below, but it will not return complete information about the documents, because a terms aggregation returns only the top 10 buckets by default.



GET /index/type/_search
{
  "size": 0,
  "aggs": {
    "db": {
      "terms": {
        "field": "source-dbtype"
      },
      "aggs": {
        "count": {
          "terms": {
            "field": "column_name",
            "min_doc_count": 2
          }
        }
      }
    }
  }
}
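As a sketch of what handling the response might look like, here is a small Python helper (the function name and the sample response fragment are my own, for illustration) that walks the nested terms buckets and collects the duplicate values found under each source-dbtype:

```python
def parse_duplicates(response):
    """Collect duplicated values per outer bucket from the nested
    terms aggregation response of the query above."""
    result = {}
    for db_bucket in response["aggregations"]["db"]["buckets"]:
        dups = {b["key"]: b["doc_count"]
                for b in db_bucket["count"]["buckets"]}
        if dups:  # keep only dbtypes that actually have duplicates
            result[db_bucket["key"]] = dups
    return result

# Hypothetical response fragment, for illustration only:
sample = {
    "aggregations": {
        "db": {
            "buckets": [
                {"key": "mysql", "doc_count": 5,
                 "count": {"buckets": [
                     {"key": "user_42", "doc_count": 3}]}},
                {"key": "oracle", "doc_count": 2,
                 "count": {"buckets": []}},
            ]
        }
    }
}

print(parse_duplicates(sample))  # {'mysql': {'user_42': 3}}
```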


To get the complete details, we first need a cardinality aggregation, as shown below, to learn how many distinct values exist.

Cardinality Aggregation

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

GET index/type/_search
{
   "size": 0,
   "aggs": {
      "maximum_match_counts": {
         "cardinality": {
            "field": "column_name",
            "precision_threshold": 100
         }
      }
   }
}
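Reading the result out of the response is a one-liner; a minimal sketch (the helper name and the sample response are illustrative, not from the post):

```python
def distinct_count(response):
    """Read the approximate distinct-value count returned by the
    cardinality aggregation above."""
    return response["aggregations"]["maximum_match_counts"]["value"]

# Hypothetical response fragment, for illustration only:
sample = {"aggregations": {"maximum_match_counts": {"value": 1523}}}
print(distinct_count(sample))  # 1523
```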
Take the value returned by the maximum_match_counts aggregation and use it as the size of a terms aggregation (below, maximum_match_counts is a placeholder for that number). Now you can get all the duplicate values:
GET index/type/_search
{
   "size": 0,
   "aggs": {
      "column_name": {
         "terms": {
            "field": "column_name",
            "size": maximum_match_counts,
            "min_doc_count": 2
         }
      }
   }
}
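Since the size placeholder above has to be filled in with the number from the first step, a small Python sketch can tie the two steps together (the function name is my own; it only builds the request body shown above):

```python
import json

def duplicates_query(field, distinct_values):
    """Build the follow-up terms query above, substituting the
    cardinality result for the size placeholder."""
    return {
        "size": 0,
        "aggs": {
            field: {
                "terms": {
                    "field": field,
                    "size": distinct_values,   # from the cardinality step
                    "min_doc_count": 2,        # only buckets with duplicates
                }
            }
        }
    }

body = duplicates_query("column_name", 1523)
print(json.dumps(body, indent=2))
```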

This will give you the complete output of the duplicates in your index.

Hope this helps :)


