How to detect whether String.substring copies the character data
I know that for Oracle Java 1.7 update 6 and newer, String.substring copies the internal character array of the String, while in older versions the array is shared. But I found no official API that would tell me the current behavior.
Use Case
My use case is a parser in which I would like to detect whether String.substring copies or shares the underlying character array. The problem is that if the character array is shared, my parser needs to explicitly "un-share" it using new String(s) to avoid memory problems. However, if String.substring copies the data anyway, this is unnecessary, and the explicit copy in the parser could be avoided. For example:
// possibly the query is very very large
String query = "select * from test ...";
// the identifier is used outside of the parser
String identifier = query.substring(14, 18);
// avoid if possible for speed,
// but needed if identifier internally
// references the large query char array
identifier = new String(identifier);
What I Need
Basically, I would like to have a static method boolean isSubstringCopyingForSure() that detects whether new String(..) is unnecessary. It is fine if the detection doesn't work when a SecurityManager is present. The detection should be conservative: to avoid memory problems, I'd rather use new String(..) even when it is not necessary.
Options
I have a few options, but I'm not sure whether they are reliable, especially for non-Oracle JVMs:
Checking for the String.offset field
// requires: import java.lang.reflect.Field;

/**
 * @return true if substring is copying, false if not or if it is not clear
 */
static boolean isSubstringCopyingForSure() {
    if (System.getSecurityManager() != null) {
        // we can not reliably check it
        return false;
    }
    try {
        for (Field f : String.class.getDeclaredFields()) {
            if ("offset".equals(f.getName())) {
                // an offset field means substrings can share the array
                return false;
            }
        }
        return true;
    } catch (Exception e) {
        // weird, we do have a security manager?
    }
    return false;
}
Checking the JVM version
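static boolean isSubstringCopyingForSure() {
    // but what about non-Oracle JREs?
    // note: compareTo is lexicographic, so e.g. "1.7.0_101"
    // sorts before "1.7.0_45" and is reported as not copying
    return System.getProperty("java.vendor").startsWith("Oracle") &&
        System.getProperty("java.version").compareTo("1.7.0_45") >= 0;
}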
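Because the string comparison above is lexicographic, a numeric comparison of the version components may be safer. The following is only a sketch: the class name JavaVersionCheck and the method isAtLeast7u6 are hypothetical, it assumes version strings shaped like "1.7.0_45" or "11.0.2", and it says nothing about the vendor, so it is only meaningful for class libraries known to have made this change in 7u6.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any JDK API.
final class JavaVersionCheck {

    // Matches "1.7.0_45", "1.8.0_201", "9", "11.0.2", "9-ea", ...
    private static final Pattern VERSION =
        Pattern.compile("^(\\d+)(?:\\.(\\d+))?(?:\\.(\\d+))?(?:_(\\d+))?.*$");

    /** Returns true only if the string parses and denotes 7u6 or newer. */
    static boolean isAtLeast7u6(String version) {
        if (version == null) {
            return false;
        }
        Matcher m = VERSION.matcher(version);
        if (!m.matches()) {
            return false; // unknown format: stay conservative
        }
        try {
            int major = Integer.parseInt(m.group(1));
            int minor = m.group(2) == null ? 0 : Integer.parseInt(m.group(2));
            int update = m.group(4) == null ? 0 : Integer.parseInt(m.group(4));
            // "1.7.x" means Java 7; since Java 9 the first number is the release
            int feature = (major == 1) ? minor : major;
            if (feature != 7) {
                return feature > 7;
            }
            return update >= 6;
        } catch (NumberFormatException e) {
            return false; // absurdly long number: stay conservative
        }
    }
}
```

It could be called as JavaVersionCheck.isAtLeast7u6(System.getProperty("java.version")), combined with whatever vendor check is considered trustworthy.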
Checking the behavior
There are two options, and both are rather complicated. One is to create a string using a custom charset, then create a new string b using substring, then modify the original string and check whether b has also changed. The second option is to create a huge string, then a few substrings, and check the memory usage.
Solution
Right, indeed this change was made in 7u6. There is no API change for this, as this change is strictly an implementation change, not an API change, nor is there an API to detect which behavior the running JDK has. However, it is certainly possible for applications to notice a difference in performance or memory utilization because of the change. In fact, it's not difficult to write a program that works in 7u4 but fails in 7u6 and vice versa. We expect that the tradeoff is favorable for the majority of applications, but undoubtedly there are applications that will suffer from this change.
It's interesting that you're concerned about the case where string values are shared (prior to 7u6). Most people I've heard from have the opposite concern, where they like the sharing and the 7u6 change to unshared values is causing them problems (or, they're afraid it will cause problems).
In any case the thing to do is measure, not guess!
First, compare the performance of your application between similar JDKs with and without the change, e.g. 7u4 and 7u6. Probably you should be looking at GC logs or other memory monitoring tools. If the difference is acceptable, you're done!
Assuming that the shared string values prior to 7u6 cause a problem, the next step is to try the simple workaround of new String(s.substring(...)) to force the string value to be unshared. Then measure that. Again, if the performance is acceptable on both JDKs, you're done!
If it turns out that in the unshared case, the extra call to new String() is unacceptable, then probably the best way to detect this case and make the "unsharing" call conditional is to reflect on a String's value field, which is a char[], and get its length:
int getValueLength(String s) throws Exception {
    Field field = String.class.getDeclaredField("value");
    field.setAccessible(true);
    return ((char[]) field.get(s)).length;
}
Consider a string resulting from a call to substring() that returns a string shorter than the original. In the shared case, the substring's length() will differ from the length of the value array retrieved as shown above. In the unshared case, they'll be the same. For example:
String s = "abcdefghij".substring(2, 5);
int logicalLength = s.length();
int valueLength = getValueLength(s);
System.out.printf("%d %d ", logicalLength, valueLength);
if (logicalLength != valueLength) {
    System.out.println("shared");
} else {
    System.out.println("unshared");
}
On JDKs older than 7u6, the value's length will be 10, whereas on 7u6 or later, the value's length will be 3. In both cases, of course, the logical length will be 3.
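Putting the reflection check into the conservative method the question asks for might look like the sketch below. The class name SubstringProbe and the helper unshare are hypothetical. The sketch assumes that any failure (a SecurityManager, a missing or inaccessible value field, or a value field that is not a char[], as may be the case on newer JDKs) should be treated as "not sure", so it falls back to copying.

```java
import java.lang.reflect.Field;

// Hypothetical helper built on the reflection check described above.
final class SubstringProbe {

    // Computed once; the probe itself is cheap but uses reflection.
    private static final boolean COPYING = isSubstringCopyingForSure();

    /**
     * Returns true only if a probe substring was verified to own a char[]
     * of exactly its logical length; false if shared or if unclear.
     */
    static boolean isSubstringCopyingForSure() {
        try {
            String sub = "abcdefghij".substring(2, 5);
            Field field = String.class.getDeclaredField("value");
            field.setAccessible(true);
            Object value = field.get(sub);
            if (value instanceof char[]) {
                return ((char[]) value).length == sub.length();
            }
            // Unknown internal layout: do not guess.
            return false;
        } catch (Exception e) {
            // Security restrictions, missing field, inaccessible module, ...
            return false;
        }
    }

    /** Copies the string unless substring() is known to have copied already. */
    static String unshare(String s) {
        return COPYING ? s : new String(s);
    }
}
```

In the parser, the identifier would then be obtained with SubstringProbe.unshare(query.substring(14, 18)). Whether this saves anything measurable compared with always calling new String(..) is something to measure rather than assume.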