How to share data across an organization

What are some good ways for an organization to share key data across many departments and applications?

To give an example, let's say there is one primary application and database that manages customer data. Ten other applications and databases in the organization read that data and relate it to their own. Currently this data sharing is done through a mixture of database (DB) links, materialized views, triggers, staging tables, re-keying of information, web services, etc.

Are there any other good approaches for sharing data? And how do your approaches compare to the ones above with respect to concerns like:

  • duplicate data
  • error-prone data synchronization processes
  • tight vs. loose coupling (reducing dependencies/fragility/test coordination)
  • architectural simplification
  • security
  • performance
  • well-defined interfaces
  • other relevant concerns?

    Keep in mind that the shared customer data is used in many ways, from simple single-record queries to complex multi-predicate, multi-sort queries that join with other organizational data stored in different databases.

    Thanks for your suggestions and advice...

    Recommended answer

    I'm sure you saw this coming: "It depends."

    It depends on everything. And the solution for sharing Customer data with department A may be completely different from the one for sharing Customer data with department B.

    My favorite concept to have risen up over the years is "Eventual Consistency". The term was popularized by Amazon when talking about distributed systems.

    The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.

    For example, when a customer record gets updated on system A, system B's copy of that customer data is now stale and no longer matches. But "eventually" the record from A will be sent to B through some process. So, eventually, the two instances will match.
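The A-to-B example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `system_a`, `system_b`, `pending`, and both functions are hypothetical names, and plain dictionaries stand in for the two databases. The "some process" that carries the change across is modeled as a queue that gets drained later.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the two systems' customer tables.
system_a = {}
system_b = {}
# Changes made on A that have not yet been applied to B.
pending = deque()

def update_on_a(customer_id, record):
    """Write to A (the source of truth) and queue the change for B."""
    system_a[customer_id] = dict(record)
    pending.append((customer_id, dict(record)))

def replicate_to_b():
    """Drain the queue; once this has run, B has caught up with A."""
    while pending:
        customer_id, record = pending.popleft()
        system_b[customer_id] = record

update_on_a(42, {"name": "Acme Corp", "city": "Boston"})
stale = system_a != system_b        # B has not seen the change yet
replicate_to_b()
consistent = system_a == system_b   # the two copies match again
```

The time between `update_on_a` and `replicate_to_b` is the window in which B is stale. How long that window is allowed to be is the design decision the rest of this answer is about.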

    When you work with a single system, you don't have "EC". Rather, you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.

    The more your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a Data Warehouse (DW) used by sales. They use the DW to run their daily reports, but they don't run those reports until the early morning, and they always look at "yesterday's" (or earlier) data. So there's no real-time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move over the day's transactions and activities en masse in a single, large update operation.
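A close-of-business load like that might look roughly like the following. It is a minimal sketch, not a real ETL job: the transaction tuples, `warehouse`, `load_business_day`, and `daily_total` are all made up for illustration, and lists stand in for the operational database and the DW.

```python
from datetime import date

# Hypothetical operational records: (transaction date, customer id, amount).
operational_transactions = [
    (date(2024, 3, 1), 42, 100.0),
    (date(2024, 3, 1), 7, 25.5),
    (date(2024, 3, 2), 42, 60.0),
]
warehouse = []  # stand-in for the DW fact table

def load_business_day(day):
    """Copy one day's transactions into the warehouse in a single batch."""
    batch = [t for t in operational_transactions if t[0] == day]
    warehouse.extend(batch)
    return len(batch)

def daily_total(day):
    """Report query: reads only the warehouse, never the live system."""
    return sum(amount for d, _, amount in warehouse if d == day)

loaded = load_business_day(date(2024, 3, 1))  # run once, at close of business
total = daily_total(date(2024, 3, 1))         # run the next morning
```

Because the report reads only the warehouse copy, nothing it sees can change underneath it while it runs.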

    You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, and no worry that a report's data is going to change in the middle of accumulating its statistics because the report made two separate queries to the live database. There's no need for high-detail chatter to suck up network and CPU capacity during the day.

    Now, that's an extreme, simplified, and very coarse example of EC.

    But consider a large system like Google. As consumers of Search, we have no idea how long it takes for a result that Google harvests to show up on a search page. 1ms? 1s? 10s? 10hrs? It's easy to imagine that if you hit Google's West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by and large, they are mostly consistent. And for their use case, their consumers aren't really affected by the lag and delay.

    Consider email. A wants to send a message to B, but along the way the message is routed through systems C, D, and E. Each system accepts the message, assumes complete responsibility for it, and then hands it off to the next. The sender sees the email go on its way. The receiver doesn't really miss it, because they don't necessarily know it's coming. So there is a big window of time that the message can take to move through the system without anyone concerned knowing or caring how fast it is.

    On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"

    Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B's inbox.

    These delays, and the acceptance of stale data, whether it's a day old or 1-5s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.

    This is true right down to the cores in your CPU. Modern multi-core, multi-threaded applications running on the same system can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data that is potentially inconsistent, then happy day, it zips along. If not, you need to pay special attention to ensuring your data is completely consistent, using techniques like volatile memory qualifiers or locking constructs. All of which, in their way, cost performance.
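As a small illustration of the locking side of that trade-off, here is a sketch using Python's standard `threading` module. Python has no volatile qualifier, so only the lock is shown, and the names `state`, `state_lock`, and `add_many` are invented for this example. Every increment pays for acquiring the lock, which is the performance cost mentioned above.

```python
import threading

state = {"count": 0}
state_lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, one locked update at a time."""
    for _ in range(n):
        # The lock makes the read-modify-write step atomic across threads,
        # so no increment is lost to an interleaved update.
        with state_lock:
            state["count"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_count = state["count"]
```

With the lock in place, all four threads' increments are counted. Without it, the result would depend on how the threads happen to interleave.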

    So, this is the base consideration. All of the other decisions start here. Answering it can tell you how to partition applications across machines, what resources are shared, and how they are shared. It tells you what protocols and techniques are available to move the data, and how much the transfer will cost in terms of processing. Replication, load balancing, data shares, and so on are all based on this concept.

    Edit, in response to first comment.

    Correct, exactly. The game here is, for example: if B can't change customer data, then what is the harm if the customer data changes on A? Can you "risk" it being out of date for a short time? Perhaps your customer data comes in slowly enough that you can replicate it from A to B immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1s). Even so, it would be "out of transaction" with the original change, and so there's a small window where A would have data that B does not.

    Now the mind really starts spinning. What happens during that 1s of "lag"? What's the worst possible scenario? And can you engineer around it? If you can engineer around a 1s lag, you may be able to engineer around a 5s, 1min, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. It's hard to imagine anything more being necessary than a Customer ID and perhaps a name. Just something to roughly identify who the order is for while it's being assembled.

    The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system that is perhaps more current, especially with shipping information. So in the end the picking system needs hardly any customer data at all. In fact, you could EMBED and denormalize the customer information within the picking order, so there's no need for, or expectation of, synchronizing later. As long as the Customer ID is correct (it will never change anyway) and the name is correct (it changes so rarely it's not worth discussing), that's the only real reference you need, and all of your pick slips are perfectly accurate at the time of creation.
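To make the embedding idea concrete, here is a minimal sketch. `PickOrder`, `customers`, and `create_pick_order` are hypothetical, and a dictionary stands in for the master customer system. The pick order copies the ID and name when it is created and never looks them up again.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PickOrder:
    order_id: int
    customer_id: int     # stable key from the customer system
    customer_name: str   # snapshot copied at creation time, never re-synced
    items: tuple

# Hypothetical stand-in for the master customer system.
customers = {42: {"name": "Acme Corp", "ship_to": "1 Main St"}}

def create_pick_order(order_id, customer_id, items):
    """Embed only the customer fields the picking task actually needs."""
    customer = customers[customer_id]
    return PickOrder(order_id, customer_id, customer["name"], tuple(items))

order = create_pick_order(1001, 42, ["widget", "gasket"])

# A later change in the master system does not touch the existing pick order.
customers[42]["name"] = "Acme Corporation"
```

The pick slip still reads "Acme Corp", which was accurate when the order was created. The shipping address was never copied at all, because picking doesn't need it.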

    The trick is the mindset of breaking the systems up and focusing on the essential data that's necessary for the task. Data you don't need doesn't need to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they come from the relational data modeling world. And with good reason: it should be approached with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying the data wholesale now. So you may as well be smarter about it.

    All of this can be mitigated through solid procedures and a thorough understanding of the workflow. Identify the risks and work up policies and procedures to handle them.

    But the hard part is breaking the chain to the central DB at the beginning, and teaching folks that they can't "have it all" the way they might expect to with a single, central, perfect store of information.
