Oracle pagination strategy
I want to fetch millions of rows from a table between two timestamps and then do processing over them. Firing a single query and retrieving all the records at once looks like a bad strategy, as it might exceed the memory capabilities of my JVM program.
I have read this article:
http://oracle.readthedocs.io/en/latest/sql/indexes/top-n-pagination.html
Hence, I am planning to do pagination in batches of 1000 and have come up with the following strategy:
Let's say Start_Date = X and End_Date = Y.
- Fire the query:
  select * from table where CREATE_TIMESTAMP > X AND CREATE_TIMESTAMP < Y
- If I get fewer than 1000 rows, that means all records have been processed. If I get exactly 1000 rows, there might be more records.
- Set X = CREATE_TIMESTAMP of the 1000th record and fire the query again:
  select * from table where CREATE_TIMESTAMP > X AND CREATE_TIMESTAMP < Y
- This repeats until I get fewer than 1000 records (a rough sketch of this loop follows below).
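For illustration, a rough sketch of that loop in JDBC terms might look like the following (this is not part of the original question; the table name, the helper structure, and the exact ORDER BY / FETCH FIRST wording of the query are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class BatchFetchSketch {

    private static final int BATCH_SIZE = 1000;

    // Fetch at most 1000 rows newer than the last seen timestamp, process them,
    // then continue from the timestamp of the last row of the batch.
    static void processRange(Connection conn, Timestamp start, Timestamp end) throws SQLException {
        String sql = "SELECT * FROM my_table "                       // hypothetical table name
                   + "WHERE CREATE_TIMESTAMP > ? AND CREATE_TIMESTAMP < ? "
                   + "ORDER BY CREATE_TIMESTAMP "
                   + "FETCH FIRST " + BATCH_SIZE + " ROWS ONLY";
        Timestamp lastSeen = start;
        int rows;
        do {
            rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, lastSeen);
                ps.setTimestamp(2, end);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        lastSeen = rs.getTimestamp("CREATE_TIMESTAMP"); // X = timestamp of the last (1000th) record
                        // ... process the row ...
                    }
                }
            }
        } while (rows == BATCH_SIZE); // fewer than 1000 rows means the range is exhausted
    }
}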
Does anyone see any issues with this approach?
Recommended Answer
The pagination pattern was invented for the purpose of website presentation (as opposed to scrolling navigation), and it works best there. In short, a live user is practically unable to view thousands or millions of records at once, so the information is divided into short pages (50~200 records), and usually one query is sent to the database for each page. The user typically clicks through only a few pages rather than browsing all of them, and needs a bit of time to read each page, so the queries are not sent to the database one after another but at long intervals. The time to retrieve one chunk of data is much shorter than retrieving all the millions of records, so the user is happy not having to wait long for subsequent pages, and the overall system load is smaller.
But it seems from the question that the nature of your application is oriented towards batch processing rather than web presentation. The application must fetch all records and perform some operations/transformations (calculations) on each of them. In this case, completely different design patterns are used (stream/pipelined processing, a sequence of steps, parallel steps/operations, etc.), and pagination will not work; if you go that way, you will kill your system's performance.
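As a rough illustration of the stream/pipelined idea on the JDBC side (a sketch, not part of the original answer; the table and columns are those of the example below, and the fetch size value is an arbitrary choice), the point is to run one query and consume its ResultSet row by row, letting the driver pull rows from the server in bounded chunks so memory use stays flat:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingSketch {

    // Process the whole range with a single query; setFetchSize keeps only a
    // bounded number of rows buffered in the JDBC driver at any moment.
    static long sumNameLengths(Connection conn) throws SQLException {
        String sql = "SELECT my_date, object_name FROM pagination "
                   + "WHERE my_date between date '2017-10-01' and date '2017-10-21'";
        long total = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(1000); // rows fetched from the server per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    total += rs.getString("object_name").length(); // the "processing"
                }
            }
        }
        return total;
    }
}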
Instead of fancy theory, let's look at a simple and practical example which will show what difference in speed we are talking about here.
Let's say there is a table PAGINATION with about 7 million records:
create table pagination as
select sysdate - 200 * dbms_random.value as my_date, t.*
from (
    select o.* from all_objects o
    cross join (select * from dual connect by level <= 100)
    fetch first 10000000 rows only
) t;
select count(*) from pagination;
COUNT(*)
----------
7369600
Let's say there is an index created on the MY_DATE column, and the index statistics are fresh:
create index PAGINATION_IX on pagination( my_date );
BEGIN dbms_stats.gather_table_stats( 'TEST', 'PAGINATION', method_opt => 'FOR ALL COLUMNS' ); END;
/
Let's say that we are going to process about 10% of the records from the table, between the dates below:
select count(*) from pagination
where my_date between date '2017-10-01' and date '2017-10-21';
COUNT(*)
----------
736341
And finally, let's say that our "processing", for simplicity, will consist of summing the lengths of one of the fields.
This is a simple paging implementation:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import oracle.jdbc.pool.OracleDataSource;

public class Pagination {

    public static class RecordPojo {
        Date myDate;
        String objectName;

        public Date getMyDate() {
            return myDate;
        }
        public RecordPojo setMyDate(Date myDate) {
            this.myDate = myDate;
            return this;
        }
        public String getObjectName() {
            return objectName;
        }
        public RecordPojo setObjectName(String objectName) {
            this.objectName = objectName;
            return this;
        }
    }

    static class MyPaginator {
        private Connection conn;
        private int pageSize;
        private int currentPage = 0;

        public MyPaginator(Connection conn, int pageSize) {
            this.conn = conn;
            this.pageSize = pageSize;
        }

        static final String QUERY = ""
                + "SELECT my_date, object_name FROM pagination "
                + "WHERE my_date between date '2017-10-01' and date '2017-10-21' "
                + "ORDER BY my_date "
                + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";

        // Fetches one page of records; each call re-executes the query with a growing OFFSET.
        List<RecordPojo> getNextPage() {
            List<RecordPojo> list = new ArrayList<>();
            ResultSet rs = null;
            try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
                ps.setInt(1, pageSize * currentPage++);
                ps.setInt(2, pageSize);
                rs = ps.executeQuery();
                while (rs.next()) {
                    list.add(new RecordPojo().setMyDate(rs.getDate(1)).setObjectName(rs.getString(2)));
                }
            } catch (SQLException e) {
                e.printStackTrace();
            } finally {
                try { rs.close(); } catch (Exception e) {}
            }
            return list;
        }

        public int getCurrentPage() {
            return currentPage;
        }
    }

    public static void main(String... x) throws SQLException {
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");

        long startTime = System.currentTimeMillis();
        long value = 0;
        int pageSize = 1000;

        try (Connection conn = ds.getConnection()) {
            MyPaginator p = new MyPaginator(conn, pageSize);
            List<RecordPojo> list;
            while ((list = p.getNextPage()).size() > 0) {
                value += list.stream().map(y -> y.getObjectName().length()).mapToLong(Integer::longValue).sum();
                System.out.println("Page: " + p.getCurrentPage());
            }
            System.out.format("==================\nValue = %d, Pages = %d, time = %d seconds",
                    value, p.getCurrentPage(), (System.currentTimeMillis() - startTime) / 1000);
        }
    }
}
The result is:
Value = 18312338, Pages = 738, time = 2216 seconds
Now let's test a very simple stream-based solution - just take one record, process it, discard it (freeing up memory), and take the next one.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import oracle.jdbc.pool.OracleDataSource;

public class NoPagination {

    static final String QUERY = ""
            + "SELECT my_date, object_name FROM pagination "
            + "WHERE my_date between date '2017-10-01' and date '2017-10-21' "
            + "ORDER BY my_date ";

    public static void main(String[] args) throws SQLException {
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");

        long startTime = System.currentTimeMillis();
        long count = 0;
        ResultSet rs = null;
        PreparedStatement ps = null;
        try (Connection conn = ds.getConnection()) {
            ps = conn.prepareStatement(QUERY);
            rs = ps.executeQuery();
            while (rs.next()) {
                // processing - RecordPojo is the same POJO as in the Pagination example above
                RecordPojo r = new RecordPojo().setMyDate(rs.getDate(1)).setObjectName(rs.getString(2));
                count += r.getObjectName().length();
            }
            System.out.format("==================\nValue = %d, time = %d seconds",
                    count, (System.currentTimeMillis() - startTime) / 1000);
        } finally {
            try { rs.close(); } catch (Exception e) {}
            try { ps.close(); } catch (Exception e) {}
        }
    }
}
The result is:
Value = 18312328, time = 11 seconds
Yes - 2216 seconds / 11 seconds = about 201 times faster - 20,100% faster!
Unbelievable? You can test it yourself.
This example shows how important it is to choose the right solution (the right design pattern) to solve the problem.