Percentage of OR conditions matched in MongoDB

I have my data in the following format:

{
  "_id" : ObjectId("534fd4662d22a05415000000"),
  "product_id" : "50862224",
  "ean" : "8808992479390",
  "brand" : "LG",
  "model" : "37LH3000",
  "features" : [
    {
      "key" : "Screen Format",
      "value" : "16:9"
    }, {
      "key" : "DVD Player / Recorder",
      "value" : "No"
    }, {
      "key" : "Weight in kg",
      "value" : "12.6"
    }
    ... so on
  ]
}

I need to compare the features of one product with those of other products and divide the results into separate categories (100% match, 50-99% match) based on the percentage of features that match.

My initial thought was to prepare a dynamic query with an $or condition for each feature and do the percentage calculation in PHP. But that means MongoDB will return even those products that have only one matching feature. I think nearly all products in a category are likely to have some feature in common, so I fear I might end up processing a lot of products in PHP.

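I basically have two questions.

  1. Is there any other way to do this?
  2. Is the data structure I am using adequate for what I am trying to do, or should I consider changing it?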

Recommended answer

Your solution really should be MongoDB-specific; otherwise you will end up doing the calculations, and possibly the matching, on the client side, which is not going to be good for performance.

So what you really want is a way to have that processing done on the server side:

db.products.aggregate([

    // Match the documents that meet your conditions
    { "$match": {
        "$or": [
            { 
                "features": { 
                    "$elemMatch": {
                       "key": "Screen Format",
                       "value": "16:9"
                    }
                }
            },
            { 
                "features": { 
                    "$elemMatch": {
                       "key" : "Weight in kg",
                       // Values are stored as strings, so this range compares lexicographically
                       "value" : { "$gt": "5", "$lt": "8" }
                    }
                }
            },
        ]
    }},

    // Keep the document and a copy of the features array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "product_id": "$product_id",
            "ean": "$ean",
            "brand": "$brand",
            "model": "$model",
            "features": "$features"
        },
        "features": 1
    }},

    // Unwind the array
    { "$unwind": "$features" },

    // Find the actual elements that match the conditions
    { "$match": {
        "$or": [
            { 
               "features.key": "Screen Format",
               "features.value": "16:9"
            },
            { 
               "features.key" : "Weight in kg",
               "features.value" : { "$gt": "5", "$lt": "8" }
            },
        ]
    }},

    // Count those matched elements
    { "$group": {
        "_id": "$_id",
        "count": { "$sum": 1 }
    }},

    // Restore the document and divide the matched elements by the
    // number of elements in the "or" condition
    { "$project": {
        "_id": "$_id._id",
        "product_id": "$_id.product_id",
        "ean": "$_id.ean",
        "brand": "$_id.brand",
        "model": "$_id.model",
        "features": "$_id.features",
        "matched": { "$divide": [ "$count", 2 ] }
    }},

    // Sort by the matched percentage
    { "$sort": { "matched": -1 } }

])

Since you know the "length" of the $or condition being applied, you simply need to find out how many of the elements in the "features" array match those conditions. That is what the second $match in the pipeline is for.

Once you have that count, you simply divide it by the number of conditions that were passed in as your $or. The beauty here is that you can now do something useful with the result, such as sorting by relevance and even "paging" the results on the server side.
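Since the question mentions building the query dynamically, here is a minimal sketch of how the same pipeline could be generated from a list of feature conditions, so that the divisor always equals the number of $or branches. The helper name `buildMatchPipeline` and the shape of the `conditions` array are assumptions made for illustration; they do not come from the original question or answer.

```javascript
// Hypothetical helper (not part of MongoDB): builds the pipeline shown above
// for any list of conditions such as { key: "Screen Format", value: "16:9" }.
function buildMatchPipeline(conditions) {
  if (conditions.length === 0) {
    throw new Error("at least one feature condition is required");
  }

  // Document-level filter: $elemMatch ties key and value to one array element.
  const docLevel = conditions.map(c => ({
    "features": { "$elemMatch": { "key": c.key, "value": c.value } }
  }));

  // Element-level filter: after $unwind each document holds a single feature.
  const elemLevel = conditions.map(c => ({
    "features.key": c.key,
    "features.value": c.value
  }));

  return [
    { "$match": { "$or": docLevel } },
    // Keep the whole document inside _id so it survives the $group stage
    { "$project": {
      "_id": {
        "_id": "$_id", "product_id": "$product_id", "ean": "$ean",
        "brand": "$brand", "model": "$model", "features": "$features"
      },
      "features": 1
    }},
    { "$unwind": "$features" },
    { "$match": { "$or": elemLevel } },
    { "$group": { "_id": "$_id", "count": { "$sum": 1 } } },
    // Restore the document; the divisor is the number of conditions
    { "$project": {
      "_id": "$_id._id", "product_id": "$_id.product_id", "ean": "$_id.ean",
      "brand": "$_id.brand", "model": "$_id.model", "features": "$_id.features",
      "matched": { "$divide": [ "$count", conditions.length ] }
    }},
    { "$sort": { "matched": -1 } }
  ];
}

const pipeline = buildMatchPipeline([
  { key: "Screen Format", value: "16:9" },
  { key: "Weight in kg", value: { "$gt": "5", "$lt": "8" } }
]);
// In the mongo shell: db.products.aggregate(pipeline)
```

Generating both $match stages from one list keeps them in sync, which matters because the percentage is only meaningful when the second $match tests exactly the conditions that were counted in the divisor.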

Of course, if you want some additional "categorization" of the results, all you need to do is add another $project stage to the end of the pipeline:

    // Label each result with a bucket based on its matched ratio
    { "$project": {
        "product_id": 1,
        "ean": 1,
        "brand": 1,
        "model": 1,
        "features": 1,
        "matched": 1,
        "category": { "$cond": [
            { "$eq": [ "$matched", 1 ] },
            "100",
            { "$cond": [ 
                { "$gte": [ "$matched", .7 ] },
                "70-99",
                { "$cond": [
                   { "$gte": [ "$matched", .4 ] },
                   "40-69",
                   "under 40"
                ]} 
            ]}
        ]}
    }}

Or something similar. Either way, the $cond operator can help you here.
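If the bucket boundaries are likely to change, the nested $cond expression could be generated rather than written by hand. The following is only a sketch: `buildCategoryExpr` is a hypothetical helper name, and it tests the top bucket with $gte rather than $eq, which gives the same label as long as "matched" never exceeds 1.

```javascript
// Hypothetical helper: turns a descending list of [lowerBound, label] pairs
// into nested $cond expressions over the "matched" field.
function buildCategoryExpr(buckets, fallbackLabel) {
  // Fold from the lowest bucket upward so the highest bound is tested first.
  return buckets.reduceRight(
    (elseExpr, [lowerBound, label]) => ({
      "$cond": [ { "$gte": [ "$matched", lowerBound ] }, label, elseExpr ]
    }),
    fallbackLabel
  );
}

const categoryExpr = buildCategoryExpr(
  [ [1, "100"], [0.7, "70-99"], [0.4, "40-69"] ],
  "under 40"
);
// Use as: { "$project": { ..., "category": categoryExpr } }
```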

Your current structure should be fine as it is, since you can have a compound index on the "key" and "value" of the entries in your features array, and this should scale well for queries.
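As a rough illustration of that index, assuming the collection is named `products` as in the aggregation above, the definition might look like this. The variable name `featureIndex` is made up for the example, and older shells expose the same operation as `ensureIndex`.

```javascript
// Compound multikey index over the key/value pairs embedded in "features".
const featureIndex = { "features.key": 1, "features.value": 1 };

// `db` only exists inside the mongo shell; the guard lets this file load elsewhere.
if (typeof db !== "undefined") {
  db.products.createIndex(featureIndex);
}
```

Both indexed fields live in the same array of subdocuments, which is what allows them to share one compound index.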

Of course, if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or Elasticsearch. But a full implementation of that would be a bit lengthy to cover here.
