解析与使用 Coldfusion & 比较数据MySQL

2022-01-09 00:00:00 mysql coldfusion

首先,我将解释我需要做什么,然后是我如何认为我可以实现它.我目前的计划在理论上似乎效率很低,所以我的问题是是否有更好的方法来完成它.

First, I'll explain what I need to do, then how I think I can achieve it. My current plan seems very inefficient in theory, so my question is whether there is a better way of accomplishing it.

我有 2 个表 - 我们称它们为Products"和Products_Temp",两者是相同的.我需要从供应商处下载包含产品详细信息(库存、定价等)的大量文件(XML 或 XLS).然后将它们解析到 Products_Temp 表中.现在,我计划使用 CF 计划任务来处理下载,并使用 Navicat 进行实际解析 - 我很高兴这足够有效.

I have 2 Tables - lets call them 'Products' and 'Products_Temp', both are identical. I need to download a large number of files (XML or XLS) which contain product details (stock, pricing etc) from suppliers. These are then parsed into the Products_Temp table. Right now, I plan to use CF Scheduled Tasks to handle the downloading, and Navicat to do the actual parsing - I'm happy enough this is adequate and efficient enough.

下一步是我苦苦挣扎的地方——一旦文件被下载和解析,我需要寻找数据中的任何变化.这将与 Products 表进行比较.如果发现更改,则应该添加或更新该行(如果应该删除它,那么我需要标记它而不是仅仅删除它).比较完所有数据后,应该清空 products_temp 表.

The next step is where I'm struggling - once the file has been downloaded and parsed, I need to look for any changes in the data. This will be compared against the Products table. If a change is found, then that row should be added or updated (if it should be removed, then I'll need to flag it rather than just delete it). Once all the data has been compared, the products_temp table should be emptied.

我知道比较表并相应地同步它们的方法,但是我遇到的问题是我将处理来自不同来源的多个文件.我曾考虑仅使用 products 表和追加/更新,但我不确定如何管理标记已删除"要求.

I'm aware of methods to compare tables and sync them accordingly, however the issue I have is the fact I'll be handling multiple files from different sources. I had considered using only the products table and append/update, but I'm unsure how I could manage the 'flag deleted' requirement.

现在,我知道唯一能让它工作的方法是循环遍历 products_temp 表,执行各种 cfquery 并在完成后删除该行.但是,这似乎非常低效,而且考虑到我们可能要处理数十万行,如果我们每天更新所有内容,就不太可能有效.

Right now, the only way I know I can make it work is to loop through the products_temp table, do various cfquerys and delete the row once complete. However, that seems incredibly inefficient, and given the fact we're likely to be dealing with hundreds of thousands of rows, unlikely to be effective if we update everything daily.

任何关于更好路线的指示或建议将不胜感激!

Any pointers or advice on a better route would be appreciated!

推荐答案

两种回答都有可能.只是为了稍微扩展您的选择..

Both responses have possibilities. Just to expand on your options a little ..

IF mySQL 支持某种散列,基于每行,您可以使用 comodoro 的建议 避免硬删除.

IF mySQL supports some sort of hashing, on a per row basis, you could use a variation of comodoro's suggestion to avoid hard deletes.

识别更改

要识别更改,请对主键进行内部连接并检查哈希值.如果它们不同,则产品已更改并应更新:

To identify changes, do an inner join on the primary key and check the hash values. If they are different, the product was changed and should be updated:

    UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.ProductName = tmp.ProductName
           , p.Stock = tmp.Stock
           , ...
           , p.DateLastChanged = now()
           , p.IsDiscontinued  = 0
    WHERE  tmp.TheRowHash <> p.TheRowHash

识别已删除

使用简单的外连接来识别临时表中不存在的记录,并将它们标记为已删除"

Use a simple outer join to identify records that do not exist in the temp table, and flag them as "deleted"

    UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.DateLastChanged = now()
           , p.IsDiscontinued = 1
    WHERE  tmp.ProductID IS NULL

识别新的

最后,使用类似的外连接插入任何新"产品.

Finally, use a similar outer join to insert any "new" products.

    INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, .. )
    SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
    FROM   Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
    WHERE  p.ProductID IS NULL

选项 #2

如果每行散列不可行,另一种方法是 Sharondio 的建议的变体.

在临时表中添加一个状态"列,并通过一系列连接将所有导入的记录标记为新"、已更改"或未更改".(默认应该是更改").

Add a "status" column to the temp table and flag all imported records as "new", "changed" or "unchanged" through a series of joins. (The default should be "changed").

识别未更改的

首先在所有字段上使用内连接来识别未更改的产品.(注意,如果您的表包含任何可为空的字段,请记住使用类似 coalesce 之类的内容,否则,结果可能会出现偏差,因为 null 值不等于任何值.

First use an inner join, on all fields, to identify products that have NOT changed. (Note, if your table contains any nullable fields, remember to use something like coalesce Otherwise, the results may be skewed because null values are not equal to anything.

    UPDATE  Products_Temp tmp INNER JOIN Products p ON tmp.ProductID = p.ProductID
    SET     tmp.Status = 'Unchanged'
    WHERE   p.ProductName = tmp.ProductName
    AND     p.Stock = tmp.Stock
    ... 

识别新的

像以前一样,使用外连接来识别新"记录.

Like before, use an outer join to identify "new" records.

    UPDATE  Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
    SET     tmp.Status = 'New'
    WHERE   p.ProductID IS NULL

通过消除过程,临时表中的所有其他记录都更改"了.计算完状态后,您可以更新 Products 表:

By process of elimination, all other records in the temp table are "changed". Once you have calculated the statuses, you can update the Products table:

    /*  update changed products */
    UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.ProductName = tmp.ProductName
           , p.Stock = tmp.Stock
           , ...
           , p.DateLastChanged = now()
           , p.IsDiscontinued = 0
    WHERE  tmp.status = 'Changed'

    /*  insert new products */
    INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, .. )
    SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
    FROM   Products_Temp tmp
    WHERE  tmp.Status = 'New'

    /* flag deleted records */
    UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.DateLastChanged = now()
           , p.IsDiscontinued = 1
    WHERE  tmp.ProductID IS NULL

相关文章