Libpuzzle: indexing millions of pictures?
This is about the libpuzzle library for PHP (http://libpuzzle.pureftpd.org/project/libpuzzle) by Mr. Frank Denis. I am trying to understand how to index and store the data in my MySQL database. Generating the vectors is absolutely no problem.
Example:
# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');
# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);
# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
    echo "Pictures are looking similar\n";
} else {
    echo "Pictures are different, distance=$d\n";
}
That is all clear to me, but how do I work with a large number of pictures (more than 1,000,000)? Do I calculate the vector and store it with the filename in the database? How do I then find similar pictures? If I store every vector in MySQL, I have to open each record and calculate the distance with the puzzle_vector_normalized_distance function. That procedure takes a lot of time (open each database entry, put it through the function, ...).
I read the README of the libpuzzle library and found the following:
Will it work with a database that has millions of pictures?
A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.
Similar signatures share identical "words", ie. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases, no vector distance actually requires to get computed.
Indexing through words and positions also makes it easy to split the data into multiple tables and servers.
So yes, the Puzzle library is certainly not incompatible with projects that need to index millions of pictures.
I also found this description of indexing:
------------------------ INDEXING ------------------------
How to quickly find similar pictures, if they are millions of records?
The original paper has a simple, yet efficient answer.
Cut the vector in fixed-length words. For instance, let's consider the following vector:
[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
With a word length (K) of 10, you can get the following words:
[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1
Then, index your vector with a compound index of (word + position).
Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.
Here's a very basic sample database schema:
+-----------------------------+
|         signatures          |
+-----------------------------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
|          words           |
+--------------------------+
| pos_and_word | fk_sig_id |
+--------------+-----------+
I'd recommend splitting at least the "words" table into multiple tables and/or servers.
By default (lambdas=9) signatures are 544 bytes long. In order to save storage space, they can be compressed to one third of their original size through the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().
I think compressing is the wrong approach, because then I have to uncompress every vector before comparing it.
My question now is: what is the way to handle millions of pictures, and how can they be compared quickly and efficiently? I can't understand how the "cutting of the vector" should help me with my problem.
Many thanks. Maybe I can find someone here who is working with the libpuzzle library.
Cheers.
Solution:
So, let's take a look at the example they give and try to expand on it.
Let's assume you have a table that stores information relating to each image (path, name, description, etc). In that table, you'll include a field for the compressed signature, calculated and stored when you initially populate the database. Let's define that table thus:
CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    -- the compressed signature is binary data, so BLOB rather than TEXT
    signature BLOB NOT NULL
);
When you initially compute the signature, you're also going to compute a number of words from the signature:
// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}
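To see what this loop produces, here is a minimal sketch that wraps it in a function and runs it on the 26-letter example vector from the README instead of a real signature. The function name signature_words is hypothetical and not part of libpuzzle; a real $cvec would come from puzzle_fill_cvec_from_file().

```php
<?php
// Hypothetical helper (not part of libpuzzle): cut a signature string into
// overlapping fixed-length words, keyed by the position where each one starts.
function signature_words(string $cvec, int $wordlen = 10, int $wordcnt = 100): array
{
    $words = [];
    // Never read past the end of the signature.
    $limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
    for ($i = 0; $i < $limit; $i++) {
        $words[$i] = substr($cvec, $i, $wordlen);
    }
    return $words;
}

// Stand-in for a real signature: the example vector from the README.
$words = signature_words('abcdefghijklmnopqrstuvwxyz');
echo count($words), "\n";  // 17 words, at positions 0 through 16
echo $words[0], "\n";      // abcdefghij
echo $words[2], "\n";      // cdefghijkl
```

A 26-value vector only yields 17 words of length 10, which is why the loop caps the count with min(); a full 544-byte signature has room for all 100.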
Now you can put those words into a table, defined thus:
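CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    -- VARBINARY rather than TEXT: MySQL cannot index a TEXT column without a prefix length
    sig_word VARBINARY(32) NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    INDEX (image_id, sig_word),
    -- the search joins on sig_word, so it needs an index that leads with it
    INDEX (sig_word, image_id)
);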
Now you insert into that table, prepending the position at which each word was found, so that a match on a word also means it occurred at the same place in the signature:
// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
    // figure a suitably defined db abstraction layer...
    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id, '$sig_word')");
}
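Because the words are raw signature bytes, interpolating them straight into the SQL string can break the statement. Below is a minimal sketch of the same insert with bound parameters, assuming PDO is available. The in-memory SQLite connection and the hex encoding via bin2hex() are my assumptions to keep the example self-contained; in practice a MySQL DSN would take the place of the SQLite one.

```php
<?php
// Assumption: PDO with the SQLite driver, used here only so the sketch runs
// without a MySQL server. Swap the DSN for your real database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE img_sig_words (image_id INTEGER NOT NULL, sig_word TEXT NOT NULL)');

// Stand-ins for values computed earlier: the image's primary key and its words.
$image_id = 1;
$words = ["\x01\x02\xFF", "\x02\xFF\x00"];

$stmt = $pdo->prepare('INSERT INTO img_sig_words (image_id, sig_word) VALUES (?, ?)');
$pdo->beginTransaction();  // one transaction per image keeps bulk loads fast
foreach ($words as $index => $word) {
    // Hex-encode the raw bytes so the stored word is plain ASCII.
    $stmt->execute([$image_id, $index . '__' . bin2hex($word)]);
}
$pdo->commit();

$rows = $pdo->query('SELECT sig_word FROM img_sig_words ORDER BY sig_word')
            ->fetchAll(PDO::FETCH_COLUMN);
print_r($rows);
```

Hex encoding doubles the stored word length, so it trades some space for values that are safe to log, compare and index as ordinary strings.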
Your data thus initialized, you can grab images with matching words relatively easily:
// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) AS strength
    FROM images i
    JOIN img_sig_words isw ON i.image_id = isw.image_id
    JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word
        AND isw.image_id != isw_search.image_id
    WHERE isw_search.image_id = $image_id
    GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
    ORDER BY strength DESC");
You could improve the query by adding a HAVING clause that requires a minimum strength, thus further reducing your matching set.
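A minimal sketch of such a HAVING clause, run against toy data so the effect is visible. It assumes PDO with an in-memory SQLite database standing in for MySQL, leaves out the join to the images table for brevity, and uses an arbitrary minimum strength of 2; a sensible cut-off for real signatures would have to be found by experiment.

```php
<?php
// Assumption: in-memory SQLite via PDO stands in for the MySQL tables.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE img_sig_words (image_id INTEGER NOT NULL, sig_word TEXT NOT NULL)');

// Toy data: image 2 shares two words with image 1, image 3 shares only one.
$rows = [
    [1, '0__aa'], [1, '1__bb'], [1, '2__cc'],
    [2, '0__aa'], [2, '1__bb'], [2, '2__zz'],
    [3, '0__aa'], [3, '1__yy'], [3, '2__xx'],
];
$insert = $pdo->prepare('INSERT INTO img_sig_words (image_id, sig_word) VALUES (?, ?)');
foreach ($rows as $row) {
    $insert->execute($row);
}

$stmt = $pdo->prepare(
    'SELECT isw.image_id, COUNT(*) AS strength
     FROM img_sig_words isw
     JOIN img_sig_words isw_search
       ON isw.sig_word = isw_search.sig_word
      AND isw.image_id != isw_search.image_id
     WHERE isw_search.image_id = :image_id
     GROUP BY isw.image_id
     HAVING COUNT(*) >= :min_strength
     ORDER BY strength DESC'
);
// Bind as integers so the HAVING comparison is numeric on every driver.
$stmt->bindValue(':image_id', 1, PDO::PARAM_INT);
$stmt->bindValue(':min_strength', 2, PDO::PARAM_INT);
$stmt->execute();
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);
print_r($matches);  // only image 2 remains; image 3 shares a single word
```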
I make no guarantees that this is the most efficient setup, but it should be roughly functional to accomplish what you're looking for.
Basically, splitting and storing the words in this manner allows you to do a rough distance check without having to run a specialized function on the signatures.
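If the rough check is not precise enough, the handful of candidates it returns can still be confirmed with the exact distance, so only those few signatures ever need uncompressing. The sketch below takes the distance function as a parameter so that it runs without the libpuzzle extension; the helper name confirm_matches, the toy distance and the 0.5 threshold are all hypothetical.

```php
<?php
// Hypothetical helper: keep only the candidates whose exact distance to the
// query signature is below $threshold, closest first.
function confirm_matches(string $query, array $candidates, callable $distance, float $threshold): array
{
    $confirmed = [];
    foreach ($candidates as $image_id => $cvec) {
        $d = $distance($query, $cvec);
        if ($d < $threshold) {
            $confirmed[$image_id] = $d;
        }
    }
    asort($confirmed);  // sort by distance, preserving image ids as keys
    return $confirmed;
}

// Toy distance for demonstration only: fraction of differing positions.
$toy_distance = function (string $a, string $b): float {
    $diff = 0;
    $len = min(strlen($a), strlen($b));
    for ($i = 0; $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            $diff++;
        }
    }
    return $len > 0 ? $diff / $len : 1.0;
};

// Candidate signatures keyed by image id, as the word query might return them.
$candidates = [2 => 'abcdefghiz', 3 => 'zzzzzzzzzz', 4 => 'abcdefghij'];
$result = confirm_matches('abcdefghij', $candidates, $toy_distance, 0.5);
print_r($result);  // images 4 and 2, in that order; image 3 is rejected
```

With the extension loaded, the callable would be 'puzzle_vector_normalized_distance' and the threshold PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD, with each stored signature passed through puzzle_uncompress_cvec() first.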