NoSQL Solution for Persisting Graphs at Scale

2022-01-13 00:00:00 python networkx graph-theory nosql

Problem description

I'm hooked on using Python and NetworkX for analyzing graphs, and as I learn more I want to use more and more data (guess I'm becoming a data junkie :-). Eventually I think my NetworkX graph (which is stored as a dict of dicts) will exceed the memory on my system. I know I can probably just add more memory, but I was wondering whether there is a way to integrate NetworkX with HBase or a similar solution instead?

I looked around and couldn't really find anything, but I also couldn't find anything about using even a simple MySQL backend.

Is this possible? Does anything exist that allows connectivity to some kind of persistent storage?

Thanks!

Update: I remember seeing this subject in 'Social Network Analysis for Startups'; the author talks about other storage methods (including HBase, S3, etc.) but does not show how to do this or whether it's possible.


Solution

There are two general types of containers for storing graphs:

  1. True graph databases: e.g., Neo4J, agamemnon, GraphDB, and AllegroGraph. These not only store a graph but also understand what a graph is, so you can query the database directly, e.g., how many nodes lie on the shortest path between node X and node Y?

  2. Static graph containers: Twitter's MySQL-adapted FlockDB is the best-known example here. These DBs can store and retrieve graphs just fine, but to query the graph itself you first have to retrieve it from the DB and then use a library (e.g., Python's excellent NetworkX) to run the query.
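To make the difference concrete, here is a minimal sketch of what "querying the graph yourself" means for the second category: once the graph has been fetched into memory, a question such as "how many nodes lie on the shortest path between X and Y" has to be answered client-side. The `shortest_path` helper and the sample graph are illustrative assumptions, not part of any of the databases named above; in practice NetworkX provides this kind of query for you.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search over a dict-of-lists adjacency structure.

    Returns the nodes on one shortest path, or None if target is unreachable.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical graph, as it might look after being fetched from a static container.
graph = {"X": ["a", "b"], "a": ["c"], "b": ["Y"], "c": ["Y"], "Y": []}
path = shortest_path(graph, "X", "Y")
nodes_between = len(path) - 2  # exclude the two endpoints
```

A true graph database would run the equivalent traversal server-side, so only the answer crosses the wire rather than the whole graph.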

The redis-based graph container I discuss below is in the second category, though apparently redis is also well suited to containers in the first category, as evidenced by redis-graph, a remarkably small Python package for implementing a graph database in redis.

Redis will work beautifully here.

Redis is a heavy-duty, durable data store suitable for production use, yet it's also simple enough to use for command-line analysis.

Redis differs from other databases in that it has multiple data structure types; the one I would recommend here is the hash data type. Using this redis data structure lets you closely mimic a "list of dictionaries", a conventional schema for storing graphs in which each item in the list is a dictionary of edges keyed to the node from which those edges originate.

You first need to install redis and the Python client. The DeGizmo Blog has an excellent "up-and-running" tutorial that includes a step-by-step guide to installing both.

Once redis and its Python client are installed, start a redis server, which you do like so:

  • cd to the directory in which you installed redis (/usr/local/bin on *nix if you installed via make install); next

  • type redis-server at the shell prompt, then press Enter

You should now see the server log tailing in your shell window.

>>> import numpy as NP
>>> import networkx as NX

>>> # start a redis client & connect to the server:
>>> from redis import StrictRedis as redis
>>> r1 = redis(db=1, host="localhost", port=6379)

In the snippet below, I store a four-node graph; each call invokes hmset on the redis client and stores one node together with the edges connected to that node (0 => no edge, 1 => edge). (In practice, of course, you would abstract these repetitive calls into a function; here I'm showing each call because it's likely easier to understand that way.)

>>> r1.hmset("n1", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
      True

>>> r1.hmset("n2", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
      True

>>> r1.hmset("n3", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
      True

>>> r1.hmset("n4", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
      True

>>> # retrieve the edges for a given node:
>>> r1.hgetall("n2")
      {'n1': '1', 'n2': '0', 'n3': '0', 'n4': '1'}
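As a rough sketch of how those repetitive calls could be folded into a function: the `store_graph` helper and the `InMemoryClient` class below are assumptions made for illustration, not part of redis-py. The stand-in client exists only so the example runs without a server; a real `StrictRedis` instance exposes the same `hset(name, mapping=...)` and `hgetall(name)` calls (`hset` with `mapping` replaces the deprecated `hmset` in redis-py 3.5 and later).

```python
def store_graph(client, adjacency):
    """Write one hash per node: key = node, fields = every node, values = 0 or 1."""
    nodes = sorted(adjacency)
    for node in nodes:
        row = {other: int(other in adjacency[node]) for other in nodes}
        client.hset(node, mapping=row)
    return nodes


class InMemoryClient:
    """Hypothetical stand-in exposing only hset/hgetall, so the sketch runs offline."""

    def __init__(self):
        self.hashes = {}

    def hset(self, name, mapping):
        # Redis stores hash values as strings, so mimic that here.
        self.hashes.setdefault(name, {}).update(
            {field: str(value) for field, value in mapping.items()}
        )
        return len(mapping)

    def hgetall(self, name):
        return dict(self.hashes.get(name, {}))


# The four-node graph from the snippet above, as a dict of neighbor sets.
edges = {
    "n1": {"n2", "n3", "n4"},
    "n2": {"n1", "n4"},
    "n3": {"n1", "n4"},
    "n4": {"n2", "n3", "n4"},
}
client = InMemoryClient()  # with a live server: StrictRedis(db=1, host="localhost", port=6379)
store_graph(client, edges)
row_n2 = client.hgetall("n2")
```

Passing the client in as an argument keeps the function independent of how the connection was created, which also makes it easy to exercise without a running server.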

Now that the graph is persisted, retrieve it from the redis DB as a NetworkX graph.

There are many ways to do this; below I do it in two steps:

  1. extract the data from the redis database into an adjacency matrix, implemented as a 2D NumPy array; then

  2. convert that directly to a NetworkX graph using a NetworkX built-in function.

Reduced to code, these two steps are:

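>>> # keys() returns the node names in arbitrary order, so sort them to fix row/column order:
>>> nodes = sorted(r1.keys("*"))
>>> AM = NP.array([[int(r1.hget(row, col)) for col in nodes] for row in nodes])
>>> # now convert this adjacency matrix back to a networkx graph
>>> # (from_numpy_matrix was removed in NetworkX 3.0; use NX.from_numpy_array there):
>>> G = NX.from_numpy_matrix(AM)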

>>> # verify that G in fact holds the original graph:
>>> type(G)
      <class 'networkx.classes.graph.Graph'>
>>> G.nodes()
      [0, 1, 2, 3]
>>> G.edges()
      [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
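The extraction step can also be written without relying on the order in which redis returns keys or hash fields. The sketch below is one possible way to do that, under the assumption that the hashes have already been fetched into a plain dict; `adjacency_matrix` and `_text` are hypothetical helper names, and the sample `rows` use bytes because redis-py returns bytes unless the client was created with `decode_responses=True`.

```python
def _text(value):
    """Normalize redis-py return values (bytes or str) to str."""
    return value.decode() if isinstance(value, bytes) else str(value)


def adjacency_matrix(rows):
    """Build (nodes, matrix) from {node: hgetall(node)} using a fixed node order."""
    table = {
        _text(node): {_text(field): int(value) for field, value in row.items()}
        for node, row in rows.items()
    }
    nodes = sorted(table)
    # Missing fields are treated as "no edge".
    matrix = [[table[r].get(c, 0) for c in nodes] for r in nodes]
    return nodes, matrix


# Hypothetical fetched data, deliberately out of order, shaped like hgetall() results.
rows = {
    b"n3": {b"n4": b"1", b"n1": b"1", b"n2": b"0", b"n3": b"0"},
    b"n1": {b"n1": b"0", b"n2": b"1", b"n3": b"1", b"n4": b"1"},
    b"n4": {b"n1": b"0", b"n2": b"1", b"n3": b"1", b"n4": b"1"},
    b"n2": {b"n1": b"1", b"n2": b"0", b"n3": b"0", b"n4": b"1"},
}
nodes, matrix = adjacency_matrix(rows)
```

With a live server, `rows` would be something like `{node: r1.hgetall(node) for node in r1.keys("*")}`, and `NP.array(matrix)` can then be handed to the NetworkX conversion function; keeping `nodes` around lets you map the integer node ids in `G` back to the original names.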

When you end a redis session, you can shut down the server from the client like so:

>>> r1.shutdown()

Redis saves to disk just before it shuts down, so this is a good way to ensure all writes are persisted.

So where is the redis DB? It is stored under the default file name, dump.rdb, in the default location, which is the server's working directory (the directory from which you started redis-server, e.g. your home directory if you launched it there).

To change this, edit the redis.conf file (included with the redis source distribution); go to the lines reading:

# The filename where to dump the DB
dbfilename dump.rdb

Change dump.rdb to anything you wish, but leave the .rdb extension in place.

Next, to change the file path, find this line in redis.conf:

# Note that you must specify a directory here, not a file name

The line below that is the directory location for the redis database. Edit it so that it names the location you want. Save your revisions and rename this file, but keep the .conf extension. You can store this config file anywhere you wish; just provide its full path and name on the same line as the redis-server command when you start the server.

So the next time you start a redis server, you must do it like so (from the shell prompt):

$> cd /usr/local/bin    # or the directory in which you installed redis 

$> redis-server /path/to/redis.conf

Finally, the Python Package Index lists a package specifically for implementing a graph database in redis. The package is called redis-graph; I have not used it.
