Return a limited number of records of a certain type, but an unlimited number of other records?
I have a query where I need to return 10 of "Type A" records, while returning all other records. How can I accomplish this?
Update: Admittedly, I could do this with two queries, but I wanted to avoid that, if possible, thinking it would be less overhead, and possibly more performant. My query already is an aggregation query that takes both kinds of records into account, I just need to limit the number of the one type of record in the results.
Update: the following is an example query that highlights the problem:
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Fiction"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Horror"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Science"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
I would like to have all these records returned in one query, but limit each type to at most 10 records. I realize that the typeSortOrder doesn't need to be conditional when the queries are broken out like this; I had it there from when this was originally one query (which is where I would like to get back to).
Answer
Problem
The results here are not impossible but are also possibly impractical. The general notes have been made that you cannot "slice" an array or otherwise "limit" the amount of results pushed onto one. And the method for doing this per "type" is essentially to use arrays.
The "impractical" part is usually about the number of results, where too large a result set is going to blow up the BSON document limit when "grouping". But, I'm going to consider this with some other recommendations on your "geo search" along with the ultimate goal to return 10 results of each "type" at most.
To first consider and understand the problem, let's look at a simplified "set" of data and the pipeline code necessary to return the "top 2 results" from each type:
{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1 },
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2 },
{ "title": "Title 3", "author": "Author 3", "type": "Fiction", "distance": 3 },
{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1 },
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2 },
{ "title": "Title 6", "author": "Author 6", "type": "Science", "distance": 3 },
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1 }
That's a simplified view of the data and somewhat representative of the state of documents after an initial query. Now comes the trick of how to use the aggregation pipeline to get the "nearest" two results for each "type":
db.books.aggregate([
{ "$sort": { "type": 1, "distance": 1 } },
{ "$group": {
"_id": "$type",
"1": {
"$first": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
},
"books": {
"$push": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
}
}},
{ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}},
{ "$unwind": "$books" },
{ "$project": {
"1": 1,
"books": 1,
"seen": { "$eq": [ "$1", "$books" ] }
}},
{ "$sort": { "_id": 1, "seen": 1 } },
{ "$group": {
"_id": "$_id",
"1": { "$first": "$1" },
"2": { "$first": "$books" },
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}},
{ "$project": {
    "1": 1,
    "2": 1,
    "pos": { "$literal": [1,2] }
}},
{ "$unwind": "$pos" },
{ "$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [
{ "$eq": [ "$pos", 1 ] },
"$1",
{ "$cond": [
{ "$eq": [ "$pos", 2 ] },
"$2",
false
]}
]
}
}
}},
{ "$unwind": "$books" },
{ "$match": { "books": { "$ne": false } } },
{ "$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books.distance",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
]
}
}},
{ "$sort": { "sortOrder": 1 } }
])
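Run against the simplified data above, this should return the two "nearest" books per "type", ordered by the computed sortOrder, so documents shaped roughly like the following (_id values omitted here, and the order within each type is not guaranteed by the final sort):

{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1, "sortOrder": 0 }
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2, "sortOrder": 0 }
{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1, "sortOrder": 1 }
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2, "sortOrder": 1 }
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1, "sortOrder": 3 }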
Of course that is just two results, but it outlines the process for getting n results, which naturally is done in generated pipeline code. Before moving on to the code, the process deserves a walk-through.
After any query, the first thing to do here is $sort the results, and this you basically want to do by both the "grouping key" (the "type") and by the "distance", so that the "nearest" items are on top.
The reason for this is shown in the $group stages that will repeat. What is done is essentially "popping" the $first result off of each grouping stack. So that the other documents are not lost, they are placed in an array using $push.
Just to be safe, the next stage is really only required after the "first step", but it could optionally be added for similar filtering in the repetition. The main check here is whether the resulting "array" is larger than just one item. Where it is not, the contents are replaced with a single value of false. The reason for this is about to become evident.
After this "first step" the real repetition cycle begins, where that array is then "de-normalized" with $unwind and a $project is made in order to "match" the document that was last "seen".
As only one of the documents will match this condition, the results are again "sorted" in order to float the "unseen" documents to the top, while of course maintaining the grouping order. The next thing is similar to the first $group step, but any kept positions are maintained and the "first unseen" document is "popped off the stack" again.
The document that was "seen" is then pushed back onto the array, not as itself but as a value of false. This will not match the kept value, and this is generally the way to handle this without being "destructive" to the array contents, where you don't want the operations to fail should there not be enough matches to cover the n results required.
Cleaning up when complete, the next "projection" adds an array to the final documents, now grouped by "type", representing each position in the n results required. When this array is unwound, the documents can again be grouped back together, but now all in a single array that possibly contains several false values but is n elements long.
Finally, unwind the array again, use $match to filter out the false values, and project to the required document form.
The problem as stated earlier is with the number of results being filtered, as there is a real limit on the number of results that can be pushed into an array. That is mostly the BSON limit, but you also don't really want thousands of items, even if that is still under the limit.
The trick here is keeping the initial "match" small enough that the "slicing operations" become practical. There are some things with the $geoNear pipeline process that can make this a possibility.
The obvious one is limit. By default this is 100, but you clearly want to have something in the range of:
(the number of categories you can possibly match) X (required matches)
But if this is essentially a number not in the thousands, then there is already some help here.
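As a sketch with hypothetical numbers (3 matchable types, 10 results wanted per type), that just means setting limit on the $geoNear stage itself:

// Hypothetical sizing: 3 matchable types, 10 results needed per type
var numCategories = 3;
var perType = 10;

db.books.aggregate([
    { "$geoNear": {
        "near": [-118.09771, 33.89244],
        "distanceField": "distance",
        "spherical": true,
        "limit": numCategories * perType   // 30, rather than the default 100
    }}
    // ...rest of the pipeline as above
])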
The others are maxDistance and minDistance, where essentially you put upper and lower bounds on how "far out" to search. The max bound is the general limiter, while the min bound is useful when "paging", which is the next helper.
When "upwardly paging", you can use the query argument in order to exclude the _id values of documents "already seen" using an $nin query. In much the same way, the minDistance can be populated with the "last seen" largest distance, or at least the smallest largest distance by "type". This allows some concept of filtering out things that have already been "seen" and getting another page.
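A rough sketch of that paging idea, where seenIds and lastDistance are hypothetical values carried over from the previous page:

// Hypothetical bookkeeping carried over from the previous page
var seenIds = [ /* _id values already returned */ ];
var lastDistance = 2;   // smallest "largest distance" by "type" seen so far

db.books.aggregate([
    { "$geoNear": {
        "near": [-118.09771, 33.89244],
        "distanceField": "distance",
        "spherical": true,
        "minDistance": lastDistance,
        "query": { "_id": { "$nin": seenIds } }
    }}
    // ...rest of the pipeline as above
])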
Really a topic in itself, but those are the general things to look for in reducing that initial match in order to make the process practical.
The general problem of returning "10 results at most, per type" is clearly going to want some code in order to generate the pipeline stages. No one wants to type that out, and practically speaking you will probably want to change that number at some point.
So now to the code that can generate the monster pipeline. All the code is in JavaScript, but the principles are easy to translate:
var coords = [-118.09771, 33.89244];
var key = "$type";
var val = {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
};
var maxLen = 10;
var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$literal": [] } } };
pipe.push({ "$geoNear": {
"near": coords,
"distanceField": "distance",
"spherical": true
}});
pipe.push({ "$sort": {
"type": 1, "distance": 1
}});
for ( var x = 1; x <= maxLen; x++ ) {
fproj["$project"][""+x] = 1;
fproj["$project"]["pos"]["$literal"].push( x );
var rec = {
"$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
};
if ( stack.length == 0 ) {
rec["$cond"].push( false );
} else {
var lval = stack.pop();
rec["$cond"].push( lval );
}
stack.push( rec );
if ( x == 1) {
pipe.push({ "$group": {
"_id": key,
"1": { "$first": val },
"books": { "$push": val }
}});
pipe.push({ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}});
} else {
pipe.push({ "$unwind": "$books" });
var proj = {
"$project": {
"books": 1
}
};
proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$books" ] };
var grp = {
"$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}
};
for ( var n = x; n >= 1; n-- ) {
if ( n != x )
proj["$project"][""+n] = 1;
grp["$group"][""+n] = ( n == x ) ? { "$first": "$books" } : { "$first": "$"+n };
}
pipe.push( proj );
pipe.push({ "$sort": { "_id": 1, "seen": 1 } });
pipe.push(grp);
}
}
pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
    "$group": {
        "_id": "$_id",
        "books": { "$push": stack[0] }
    }
});
pipe.push({ "$unwind": "$books" });
pipe.push({ "$match": { "books": { "$ne": false } }});
pipe.push({
    "$project": {
        "_id": "$books._id",
        "title": "$books.title",
        "author": "$books.author",
        "type": "$_id",
        "distance": "$books.distance",
        "sortOrder": {
            "$add": [
                { "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
                { "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
                { "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
            ]
        }
    }
});
pipe.push({ "$sort": { "sortOrder": 1, "distance": 1 } });
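The generated stages then just run as a normal aggregate, and printjson is handy if you want to inspect the monster first:

printjson( pipe );           // inspect the generated stages
db.books.aggregate( pipe );  // run them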
Alternate
Of course the end result here and the general problem with all above is that you really only want the "top 10" of each "type" to return. The aggregation pipeline will do it, but at the cost of keeping more than 10 and then "popping off the stack" until 10 is reached.
An alternate approach is to "brute force" this with mapReduce and "globally scoped" variables. Not as nice, since the results are all in arrays, but it may be a practical approach:
db.collection.mapReduce(
function () {
if ( !stash.hasOwnProperty(this.type) ) {
stash[this.type] = [];
}
if ( stash[this.type].length < maxLen ) {
stash[this.type].push({
"title": this.title,
"author": this.author,
"type": this.type,
"distance": this.distance
});
emit( this.type, 1 );
}
},
function(key,values) {
return 1; // really just want to keep the keys
},
{
"query": {
"location": {
"$nearSphere": [-118.09771, 33.89244]
}
},
"scope": { "stash": {}, "maxLen": 10 },
"finalize": function(key,value) {
return { "msgs": stash[key] };
},
"out": { "inline": 1 }
}
)
This is a real cheat which just uses the "global scope" to keep a single object whose keys are the grouping keys. The results are pushed onto an array in that global object until the maximum length is reached. Results are already sorted by nearest, so the mapper just gives up doing anything with the current document once there are 10 per key.
The reducer won't be called since only one document per key is emitted. The finalize then just "pulls" the value from the global and returns it in the result.
Simple, but of course you don't have all the $geoNear options if you really need them, and this form has the hard limit of 100 documents as output from the initial query.
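Since the output is inline, reading the per-type arrays back out in the shell is straightforward; a minimal sketch, where mapFn, reduceFn, and options are placeholder names for the pieces shown above:

var res = db.collection.mapReduce(
    mapFn,      // the map function shown above
    reduceFn,   // the reduce function shown above
    options     // the options document shown above
);
res.results.forEach(function(doc) {
    // doc._id is the grouping "type"; doc.value.msgs holds up to maxLen books
    printjson( doc.value.msgs );
});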