Retrieving distinct values from the hash key - DynamoDB

2022-01-13 00:00:00 python boto nosql amazon-dynamodb

Problem description

I have a DynamoDB table that stores email attribute information. It has a hash key on the email and a range key on a timestamp (number). The original idea behind using the email as the hash key was to be able to query all records for a given email. Now I also want to retrieve all email ids (the hash key values). I am using boto for this, but I am unsure how to retrieve the distinct email ids.

My current code to pull 10,000 email records is:

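    import boto.dynamodb2
    from boto.dynamodb2.table import Table

    # Connect to the region that hosts the table
    conn = boto.dynamodb2.connect_to_region('us-west-2')
    email_attributes = Table('email_attributes', connection=conn)
    # Scan up to 10,000 records, returning only the 'email' attribute
    s = email_attributes.scan(limit=10000, attributes=['email'])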

But to retrieve the distinct records, I have to do a full table scan and then pick out the distinct values in code. Another idea is to maintain a second table that stores only these emails, using a conditional write that writes an email id only if it does not already exist. I am trying to work out whether this would be more expensive, given that every write would be a conditional write.

Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?

Q2.) Is there a good way to calculate the cost per query?

Solution

With a DynamoDB Scan, you need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a hash+range (H+R) table keyed on email_id+timestamp called stamped_emails, the list of all unique email_ids is a materialized view of that table. You could enable a DynamoDB Stream on stamped_emails and subscribe a Lambda function to that Stream which does a PutItem (email_id) into a hash-only table called emails_only. You could then Scan emails_only and get no duplicates.
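The Stream-to-Lambda idea above could look roughly like the following. This is only a sketch under stated assumptions: the hash key attribute is assumed to be a string named email_id, the target table is assumed to be named emails_only as in the answer, and the helper name email_ids_from_stream_event and the optional table argument are hypothetical, added so the logic can be exercised without AWS. It uses boto3 (the SDK available in the Lambda Python runtime) rather than the older boto used in the question.

```python
def email_ids_from_stream_event(event, key_name="email_id"):
    """Collect the distinct hash-key values carried by one Streams batch.

    Assumes the hash key is a string attribute called `key_name`.
    """
    seen = set()
    for record in event.get("Records", []):
        if record.get("eventName") == "REMOVE":
            continue  # deletions never introduce a new email_id
        keys = record["dynamodb"]["Keys"]
        seen.add(keys[key_name]["S"])
    return seen


def handler(event, context, table=None):
    """Lambda entry point: copy each email_id into the hash-only table.

    `table` is a hypothetical injection point for testing; in Lambda it is
    left as None and the real emails_only table is looked up with boto3.
    """
    if table is None:
        import boto3  # imported lazily so the helper above has no AWS dependency
        table = boto3.resource("dynamodb").Table("emails_only")
    email_ids = email_ids_from_stream_event(event)
    for email_id in sorted(email_ids):
        # PutItem overwrites an existing item with the same key, so writing
        # an email_id that is already present does not create a duplicate.
        table.put_item(Item={"email_id": email_id})
    return len(email_ids)
```

Because PutItem simply overwrites an item with the same key, the function does not need a conditional write to keep emails_only free of duplicates. The Stream itself and the Lambda event source mapping still have to be configured separately.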

Finally, regarding your question about cost. First, Scan reads entire items even if you request only certain projected attributes from them. Second, Scan has to read through every item, even those filtered out by a FilterExpression (a condition expression). Third, Scan reads through items sequentially, which means that each Scan call is treated as one big read for metering purposes. The cost implication is that a Scan call that reads 200 different items will not necessarily cost 100 RCU (the cost of 200 separate eventually consistent reads at 0.5 RCU each). If each of those items is 100 bytes, that Scan call will cost ROUND_UP((20000 bytes / 1024 bytes per KB) / 8 KB per eventually consistent (EC) RCU) = 3 RCU. Even if the call returns only 123 items, you still incur 3 RCU if the Scan had to read 200 items.
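The arithmetic above can be written as a small helper. This is a sketch that mirrors the formula in this answer only; the function name scan_rcu and its parameters are hypothetical, and the actual metered figure for a given call may differ from this estimate.

```python
import math


def scan_rcu(items_read, item_size_bytes, kb_per_rcu=8):
    """Estimate the RCU cost of one Scan call using the formula above.

    The call is metered as one big read: total bytes read, converted to KB,
    divided by the KB covered by one RCU (8 KB for an eventually consistent
    RCU, as in the answer), then rounded up.
    """
    total_kb = items_read * item_size_bytes / 1024
    return math.ceil(total_kb / kb_per_rcu)


# 200 items of 100 bytes each: 20000 bytes is about 19.5 KB, and 19.5 / 8 rounds up to 3
print(scan_rcu(200, 100))  # 3
```

Note that the number of items returned does not appear in the calculation: only the items the Scan had to read count toward the cost.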
