Validating the format of a .txt file in Java
What is the best way to validate whether a .txt file is:
In fact a .txt file and not another type of file with only the extension changed.
The format of the .txt file matches the specified format (so it is able to be parsed correctly, contains all the relevant information, etc.)
This is all being done in Java, where a file will be retrieved and then needs to be checked to make sure it is what it is supposed to be. So far I have only found JHOVE (and now JHOVE2) as tools for this task but have not found much in the way of documentation for implementing it within Java code as opposed to through the command line. Thanks for your help.
Recommended answer
As it sounds like you're looking for a general sort of formatting option, could I recommend regular expressions to you? You can do all sorts of different kinds of matching using regex. I've written a simple example below [for all those regex experts out there, have mercy on me if I didn't use the perfect expression ;) ]. You could put the REGEX and MAX_LINES_TO_READ constants into a properties file and modify that to make it even more generalized.
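As a rough sketch of the properties-file idea, something like the following could work. The key names `validation.regex` and `validation.maxLines`, the default of 5 lines, and the class name `ValidationConfig` are all made up for illustration; only `java.util.Properties` and `java.util.regex.Pattern` are standard.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical holder for the two settings the answer suggests externalizing.
final class ValidationConfig {
    final Pattern linePattern;
    final int maxLinesToRead;

    ValidationConfig(Pattern linePattern, int maxLinesToRead) {
        this.linePattern = linePattern;
        this.maxLinesToRead = maxLinesToRead;
    }

    // Builds the config from already-loaded properties (key names are assumptions).
    static ValidationConfig fromProperties(Properties props) {
        String regex = props.getProperty("validation.regex");
        if (regex == null) {
            throw new IllegalArgumentException("Missing property: validation.regex");
        }
        // Falls back to 5 lines when the key is absent.
        int maxLines = Integer.parseInt(props.getProperty("validation.maxLines", "5"));
        return new ValidationConfig(Pattern.compile(regex), maxLines);
    }

    // Reads the properties file from disk, then delegates to fromProperties.
    static ValidationConfig load(String propertiesFile) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(Paths.get(propertiesFile), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return fromProperties(props);
    }

    public static void main(String[] args) {
        // In-memory stand-in for a properties file, so the sketch runs without one.
        Properties props = new Properties();
        props.setProperty("validation.regex", "\\d{2}/\\d{2}/\\d{4}");
        props.setProperty("validation.maxLines", "10");
        ValidationConfig config = fromProperties(props);
        System.out.println(config.maxLinesToRead + " lines, pattern " + config.linePattern);
    }
}
```

One thing to watch: `Properties.load` treats the backslash as an escape character, so inside the properties file the regex still needs doubled backslashes (`\\d{2}`), just as it does in a Java string literal.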
You would basically test your ".txt" file for a maximum number of lines (however many lines are needed to establish the formatting is good - you could also use regular expressions for a header line or do multiple different regular expressions as needed to test the formatting) and if all those lines matched, the file would be flagged as "valid".
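To illustrate the "header line plus multiple regular expressions" idea, here is a minimal sketch. The header layout (`Category Payee Amount Date`) is invented for the example, since the sample file in this answer has no header; the record pattern is the same one used in the source below. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class HeaderAndBodyCheck {
    // Hypothetical header layout, invented for this sketch.
    static final Pattern HEADER = Pattern.compile("Category\\s+Payee\\s+Amount\\s+Date");
    // Same fixed-width record layout as the REGEX constant in this answer.
    static final Pattern RECORD = Pattern.compile(
            ".{15}[ ]{5}.{15}[ ]{5}[-]\\d{2}\\.\\d{2}[ ]{9}\\d{2}/\\d{2}/\\d{4}");

    // True when the first line matches HEADER and each of the next maxRecords
    // lines (or fewer, if the input is shorter) matches RECORD.
    static boolean isValid(List<String> lines, int maxRecords) {
        if (lines.isEmpty() || !HEADER.matcher(lines.get(0)).matches()) {
            return false;
        }
        int limit = Math.min(lines.size(), maxRecords + 1);
        for (int i = 1; i < limit; i++) {
            if (!RECORD.matcher(lines.get(i)).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // %-20s pads each text field to 15 characters plus the 5-space gap;
        // %-15s pads the 6-character amount with the 9-space gap.
        String record = String.format("%-20s%-20s%-15s%s",
                "Electric", "Electric Co.", "-50.99", "12/28/2011");
        List<String> lines = Arrays.asList("Category Payee Amount Date", record);
        System.out.println("valid = " + isValid(lines, 5));
    }
}
```

Compiling each `Pattern` once and reusing it also avoids recompiling the expression for every line, which `String.matches` does on each call.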
This is just an example for you to possibly run with. For one thing, you should implement proper exception handling rather than just catching "Exception".
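As one possible shape for that exception handling, the sketch below catches the specific I/O failures separately and reports them as distinct outcomes. The `Outcome` enum and the class name are hypothetical, and decoding the file as UTF-8 is an assumption about the input; treating an empty file as invalid is also just a choice made here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class CheckedValidation {
    // Hypothetical outcome codes, invented for this sketch.
    enum Outcome { VALID, INVALID_FORMAT, FILE_NOT_FOUND, NOT_TEXT, IO_ERROR }

    static Outcome checkFile(Path file, Pattern linePattern, int maxLines) {
        // try-with-resources closes the reader on every path out of the block.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int checked = 0;
            String line;
            while (checked < maxLines && (line = reader.readLine()) != null) {
                if (!linePattern.matcher(line).matches()) {
                    return Outcome.INVALID_FORMAT;
                }
                checked++;
            }
            // An empty file gives no evidence of the format, so reject it.
            return checked > 0 ? Outcome.VALID : Outcome.INVALID_FORMAT;
        } catch (NoSuchFileException ex) {
            return Outcome.FILE_NOT_FOUND;
        } catch (MalformedInputException ex) {
            // The bytes are not valid UTF-8, so this is probably not the expected text file.
            return Outcome.NOT_TEXT;
        } catch (IOException ex) {
            // Anything else that went wrong while opening or reading.
            return Outcome.IO_ERROR;
        }
    }

    public static void main(String[] args) {
        Pattern record = Pattern.compile(
                ".{15}[ ]{5}.{15}[ ]{5}[-]\\d{2}\\.\\d{2}[ ]{9}\\d{2}/\\d{2}/\\d{4}");
        System.out.println(checkFile(Paths.get("transactions.txt"), record, 5));
    }
}
```

Returning an outcome instead of printing lets the caller decide whether a missing file, a non-text file, and a badly formatted file should be handled differently.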
For testing your regular expressions in Java, http://www.regexplanet.com/simple/index.html works very well.
Here's the "ValidateTxtFile" source...
import java.io.BufferedReader;
import java.io.FileReader;

public class ValidateTxtFile {

    private static final int MAX_LINES_TO_READ = 5;
    // Fixed-width record: 15-char field, 5 spaces, 15-char field, 5 spaces,
    // a negative amount such as -50.99, 9 spaces, then a date as NN/NN/NNNN.
    // Backslashes are doubled because this is a Java string literal.
    private static final String REGEX =
            ".{15}[ ]{5}.{15}[ ]{5}[-]\\d{2}\\.\\d{2}[ ]{9}\\d{2}/\\d{2}/\\d{4}";

    public void testFile(String fileName) {
        int lineCounter = 1;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            String line = br.readLine();
            while ((line != null) && (lineCounter <= MAX_LINES_TO_READ)) {
                // Validate the line is formatted correctly based on regular expressions
                if (line.matches(REGEX)) {
                    System.out.println("Line " + lineCounter + " formatted correctly");
                } else {
                    System.out.println("Invalid format on line " + lineCounter + " (" + line + ")");
                }
                line = br.readLine();
                lineCounter++;
            }
        } catch (Exception ex) {
            // Broad catch kept for brevity; real code should catch specific exceptions
            System.out.println("Exception occurred: " + ex.toString());
        }
    }

    public static void main(String[] args) {
        ValidateTxtFile vtf = new ValidateTxtFile();
        vtf.testFile("transactions.txt");
    }
}
Here's what's in "transactions.txt"...
Electric            Electric Co.        -50.99         12/28/2011
Food                Food Store          -80.31         12/28/2011
Clothes             Clothing Store      -99.36         12/28/2011
Entertainment       Bowling             -30.4393       12/28/2011
Restaurant          Mcdonalds           -10.35         12/28/11
The output when I ran the app was...
Line 1 formatted correctly
Line 2 formatted correctly
Line 3 formatted correctly
Invalid format on line 4 (Entertainment       Bowling             -30.4393       12/28/2011)
Invalid format on line 5 (Restaurant          Mcdonalds           -10.35         12/28/11)
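The regex check covers the second point of the question (does the content match the expected format). For the first point (is it really a text file rather than something else with a renamed extension), one rough heuristic is to check that the bytes decode cleanly and contain no unexpected control characters. This is only a sketch under the assumption that the file should be UTF-8 (which includes plain ASCII); a file in another encoding could be rejected. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class TextSniffer {
    // Heuristic: the bytes must be valid UTF-8 and contain no control
    // characters other than tab, line feed, and carriage return.
    static boolean looksLikeText(byte[] data) {
        CharBuffer chars;
        try {
            chars = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
        } catch (CharacterCodingException ex) {
            return false; // not valid UTF-8
        }
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                return false; // NUL bytes and similar are typical of binary formats
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "transactions.txt";
        // Reads the whole file; checking only a prefix could cut a multi-byte
        // character in half and report a false negative.
        byte[] data = Files.readAllBytes(Paths.get(fileName));
        System.out.println(fileName + " looks like text: " + looksLikeText(data));
    }
}
```

A check like this cannot prove a file is "a .txt file", since any byte sequence that happens to be clean text will pass. In practice the format check does most of the work: a renamed image or archive is very unlikely to match the line pattern.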
EDIT 12/29/2011 about 10:00am
Not sure whether performance is a concern here, but as an FYI: I duplicated the entries in "transactions.txt" several times to build a text file with about 1.3 million rows, and I was able to get through the whole file in about 7 seconds on my PC. I changed the System.out calls to just show a grand total at the end of the invalid (524,288) and valid (786,432) formatted entries. "transactions.txt" was about 85 MB in size.
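For reference, a counting variant along those lines might look like the sketch below. It is not the exact code used for the timing above; the class name `FormatTally` is hypothetical, and it precompiles the pattern instead of calling `String.matches` per line. The reported totals are in the same 3:2 ratio as the five sample lines (786,432 = 3 × 2^18 and 524,288 = 2 × 2^18).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

final class FormatTally {
    // Compiled once and reused for every line.
    static final Pattern RECORD = Pattern.compile(
            ".{15}[ ]{5}.{15}[ ]{5}[-]\\d{2}\\.\\d{2}[ ]{9}\\d{2}/\\d{2}/\\d{4}");

    // Returns {validCount, invalidCount} for the given lines.
    static long[] tally(Iterable<String> lines) {
        long valid = 0;
        long invalid = 0;
        for (String line : lines) {
            if (RECORD.matcher(line).matches()) {
                valid++;
            } else {
                invalid++;
            }
        }
        return new long[] {valid, invalid};
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "transactions.txt";
        // Files.lines streams the file lazily, so it is not held in memory at once.
        try (Stream<String> lines = Files.lines(Paths.get(fileName))) {
            long[] totals = tally(lines::iterator);
            System.out.println("Valid: " + totals[0] + ", invalid: " + totals[1]);
        }
    }
}
```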