Extracting Text Content and the Author Property from Word, Excel, PPT, and TXT Files in Java with POI

2022-07-02 00:00:00 text extraction, file properties


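This was the first task of my internship at a new company. After reading a few blog posts I came across POI, which gives Java an API for reading and writing Microsoft Office files.
The jars can be downloaded from the Apache site: http://poi.apache.org/download.html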
The API documentation is at http://poi.apache.org/components/index.html

1. Create a plain Maven project

POI is split across quite a few jars, so I pull them in from the Maven repository. Start by creating a plain Maven project.

Then click Next and give the project a name.

2. Add the POI dependencies to pom.xml

Add the following inside the <dependencies> element:

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>3.17</version>
</dependency>
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-scratchpad</artifactId>
  <version>3.17</version>
</dependency>
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>3.17</version>
</dependency>

3. Extract text and author from Word

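At first I just copied code from other people's blogs, but much of it did not match what I needed, was incomplete, or assumed a different set of imports, so I was never satisfied. Searching took a lot of time for poor results, so I turned to the examples on the official site instead: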
http://poi.apache.org/components/document/quick-guide.html
HWPF handles .doc files and XWPF handles .docx files. Excel and PPT follow the same pattern (HSSF/XSSF for Excel, HSLF/XSLF for PPT).
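The naming pattern can be written down as a small lookup. This is only a sketch of my own, not part of POI: the `OfficeFormat` enum and its `fromPath` method are hypothetical names, and the only thing taken from POI is the component prefixes. It also lower-cases the path first, so that an upper-case extension such as `.DOCX` is still recognised.

```java
import java.util.Locale;

/** Maps a file name to the POI component family that handles it (illustrative helper). */
enum OfficeFormat {
    HWPF(".doc"), XWPF(".docx"),
    HSSF(".xls"), XSSF(".xlsx"),
    HSLF(".ppt"), XSLF(".pptx");

    private final String extension;

    OfficeFormat(String extension) {
        this.extension = extension;
    }

    /** Returns the matching format, or null if the extension is not an Office type. */
    static OfficeFormat fromPath(String path) {
        // Lower-case first so that "REPORT.DOCX" is recognised too.
        String lower = path.toLowerCase(Locale.ROOT);
        for (OfficeFormat format : values()) {
            if (lower.endsWith(format.extension)) {
                return format;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fromPath("E:\\test\\test1.doc"));  // HWPF
        System.out.println(fromPath("E:\\test\\TEST1.XLSX")); // XSSF
        System.out.println(fromPath("notes.txt"));            // null
    }
}
```

Note that `"test1.docx".endsWith(".doc")` is false, so the binary and XML variants never collide in this lookup.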

package poiUtils;

import com.google.common.base.CharMatcher;
import com.google.common.collect.Lists;
import com.google.gson.Gson;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class WordUtil {
    // Returns the paragraphs of a .doc/.docx file as a JSON array and prints the author.
    public static String readWordFile(String path) {
        List<String> contextList = Lists.newArrayList();
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(new File(path));
            if (path.endsWith(".doc")) {
                // HWPF reads the binary .doc format
                HWPFDocument document = new HWPFDocument(inputStream);
                System.out.println("Author: " + document.getSummaryInformation().getAuthor());
                WordExtractor extractor = new WordExtractor(document);
                String[] contextArray = extractor.getParagraphText();
                // Strip all whitespace (including line breaks) from each paragraph
                Arrays.asList(contextArray).forEach(context -> contextList.add(CharMatcher.whitespace().removeFrom(context)));
                extractor.close();
                document.close();
            } else if (path.endsWith(".docx")) {
                // XWPF reads the XML-based .docx format
                XWPFDocument document = new XWPFDocument(inputStream);
                System.out.println("Author: " + document.getProperties().getCoreProperties().getCreator());
                List<XWPFParagraph> paragraphList = document.getParagraphs();
                paragraphList.forEach(paragraph -> contextList.add(CharMatcher.whitespace().removeFrom(paragraph.getParagraphText())));
                document.close();
            } else {
                //LOGGER.debug("File {} is not a Word file", path);
                return "Not a Word file: " + path;
            }
        } catch (IOException e) {
            // Reading failed; an empty JSON array is returned below
            e.printStackTrace();
            //LOGGER.debug("Failed to read Word file");
            System.out.println("Failed to read Word file: " + path);
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return new Gson().toJson(contextList);
    }
}

WordUtil uses Google's Guava utilities for the collection and string handling, so the Guava dependency has to be added to pom.xml first:

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>24.0-jre</version>
</dependency>

Gson, also from Google, serializes the collection into a JSON string. Its dependency has to be added as well:

<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.2.4</version>
</dependency>

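As the code shows, getSummaryInformation() returns the document's summary information, and the author is read from that. For .docx the equivalent call is getProperties().getCoreProperties().getCreator().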
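For the XML-based formats it helps to know where the creator actually lives. A .docx, .xlsx, or .pptx file is a ZIP package, and the core properties are conventionally stored in the part `docProps/core.xml` as a Dublin Core `dc:creator` element. The sketch below reads that element with nothing but the JDK. It is not how POI is implemented; the class name `CorePropertiesPeek` and both method names are my own, and it assumes the part sits at the conventional path rather than resolving it through the package relationships.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class CorePropertiesPeek {
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Parses core-properties XML and returns the dc:creator text, or null if absent. */
    static String parseCreator(InputStream coreXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document xml = factory.newDocumentBuilder().parse(coreXml);
            NodeList creators = xml.getElementsByTagNameNS(DC_NS, "creator");
            return creators.getLength() > 0 ? creators.item(0).getTextContent() : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a readable core.xml", e);
        }
    }

    /** Opens an OOXML file as a ZIP archive and reads the creator from docProps/core.xml. */
    static String readCreator(String ooxmlPath) throws IOException {
        try (ZipFile zip = new ZipFile(ooxmlPath)) {
            ZipEntry entry = zip.getEntry("docProps/core.xml");
            if (entry == null) {
                return null; // no core properties part at the conventional path
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parseCreator(in);
            }
        }
    }

    public static void main(String[] args) {
        String sample = "<cp:coreProperties"
                + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:creator>ying</dc:creator></cp:coreProperties>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseCreator(in)); // ying
    }
}
```

The binary formats (.doc, .xls, .ppt) keep the author in an OLE2 property stream instead, which is why POI exposes it through a different call there.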

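At first I did not know how to get the author with POI and assumed the JDK's own methods could read file properties. It turned out they only return basic file-system attributes, for example: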

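import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileOwnerAttributeView;

Path testPath = Paths.get("E:\\test\\test1.xls");
FileOwnerAttributeView ownerView = Files.getFileAttributeView(testPath, FileOwnerAttributeView.class);
System.out.println("File owner: " + ownerView.getOwner());
BasicFileAttributes attrs = Files.readAttributes(testPath, BasicFileAttributes.class);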

These classes and methods come from Oracle's JDK documentation on file I/O:
https://docs.oracle.com/javase/tutorial/essential/io/fileio.html
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#getAttribute-java.nio.file.Path-java.lang.String-java.nio.file.LinkOption...-
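To make the limitation concrete, here is a small self-contained sketch of what `BasicFileAttributes` does offer: size, timestamps, and file kind, all maintained by the file system. None of these is the Office author, which is stored inside the document itself. The class `FileSystemMetadata` and its two helper methods are names I made up for the demonstration; it works on a temporary file so it can run anywhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

final class FileSystemMetadata {
    /** Reads the attributes the file system itself tracks for a file. */
    static BasicFileAttributes basicAttributes(Path file) {
        try {
            return Files.readAttributes(file, BasicFileAttributes.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes three bytes to a temporary file and returns the size the file system reports. */
    static long sizeOfThreeByteTempFile() {
        try {
            Path temp = Files.createTempFile("poi-demo", ".bin");
            try {
                Files.write(temp, new byte[] {1, 2, 3});
                BasicFileAttributes attrs = basicAttributes(temp);
                System.out.println("regular file:  " + attrs.isRegularFile());
                System.out.println("last modified: " + attrs.lastModifiedTime());
                return attrs.size();
            } finally {
                Files.deleteIfExists(temp); // clean up the demo file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("size in bytes: " + sizeOfThreeByteTempFile());
    }
}
```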

Then I found Spire, another tool similar to POI that can work with Office files:
https://www.e-iceblue.com/Download/doc-for-java-free.html
First add the repository and the dependency:

<repositories>
  <repository>
    <id>com.e-iceblue</id>
    <name>e-iceblue</name>
    <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.doc.free</artifactId>
    <version>2.0.0</version>
  </dependency>
</dependencies>

Then read the document properties:

Document doc = new Document("E:\\test\\test1.doc");
// Read the built-in document properties
System.out.println("Author: " + doc.getBuiltinDocumentProperties().getAuthor());

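However, I could not find a Java XLS package from Spire, and whenever I tried to pull in the Presentation package as a dependency, Maven could not resolve it, so I gave up on this route. Since POI also works with Office formats, I figured it must have its own way to read document properties, and that is how I found the calls used in the code above. The other Office formats work in a similar way.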

4. Extract text and author from Excel

I came across an excellent and very detailed blog post on working with Excel in POI: https://www.cnblogs.com/huajiezh/p/5467821.html

package poiUtils;

import com.google.common.collect.Lists;
import com.google.gson.Gson;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public class ExcelUtil {
    // Returns all data rows (header row skipped) as a JSON array of arrays and prints the author.
    public static String readExcelFile(String path) {
        List<List<String>> rowlist = Lists.newArrayList();
        InputStream inputStream = null;
        String str = "";
        Workbook workbook = null;
        try {
            // Open the file input stream
            inputStream = new FileInputStream(new File(path));
            // Create the workbook: HSSF for .xls, XSSF for .xlsx
            if (path.endsWith(".xls")) {
                workbook = new HSSFWorkbook(inputStream);
                System.out.println("Author: " + ((HSSFWorkbook) workbook).getSummaryInformation().getAuthor());
            } else if (path.endsWith(".xlsx")) {
                workbook = new XSSFWorkbook(inputStream);
                System.out.println("Author: " + ((XSSFWorkbook) workbook).getProperties().getCoreProperties().getCreator());
            } else {
                //LOGGER.debug("File {} is not an Excel file", path);
                return "Not an Excel file: " + path;
            }
            // Iterate over every sheet in the workbook
            for (Sheet sheet : workbook) {
                for (Row row : sheet) {
                    // Skip the first row (the header)
                    if (row.getRowNum() == 0) {
                        continue;
                    }
                    List<String> cellList = Lists.newArrayList();
                    for (Cell cell : row) {
                        switch (cell.getCellTypeEnum()) {
                            case STRING:
                                cellList.add(cell.getRichStringCellValue().getString());
                                break;
                            case NUMERIC:
                                // Dates are stored as numbers; the cell style tells them apart
                                if (DateUtil.isCellDateFormatted(cell)) {
                                    cellList.add("" + cell.getDateCellValue());
                                } else {
                                    cellList.add("" + cell.getNumericCellValue());
                                }
                                break;
                            case BOOLEAN:
                                cellList.add("" + cell.getBooleanCellValue());
                                break;
                            case FORMULA:
                                // Adds the formula text, not its evaluated result
                                cellList.add(cell.getCellFormula());
                                break;
                            case BLANK:
                                cellList.add("");
                                break;
                            default:
                                cellList.add("");
                        }
                    }
                    if (cellList.size() > 0) {
                        rowlist.add(cellList);
                    }
                }
            }
            Gson gson = new Gson();
            str = gson.toJson(rowlist);
            // Close the workbook
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
            //LOGGER.debug("Failed to read Excel file");
            System.out.println("Failed to read Excel file: " + path);
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return str;
    }
}
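Concatenating a cell value with an empty string yields text such as "19.0" for a whole number and the `Date.toString()` form for a date. If plainer strings are wanted, the conversion can be done by two small helpers like the ones sketched below. They use only the JDK; `CellText`, `formatNumber`, and `formatDate` are hypothetical names of mine, and the `yyyy-MM-dd` pattern is just one possible choice. POI also ships a `DataFormatter` class whose `formatCellValue(cell)` returns the value as Excel would display it, which may be the better fit when the cell's own number format matters.

```java
import java.math.BigDecimal;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;

final class CellText {
    /** Renders 19.0 as "19" and 19.5 as "19.5", dropping the trailing ".0". */
    static String formatNumber(double value) {
        // valueOf goes through Double.toString, so 19.5 stays exactly "19.5";
        // toPlainString avoids scientific notation such as "1E+2" for 100.0.
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    /** Renders a date as yyyy-MM-dd in the JVM's default time zone. */
    static String formatDate(Date date) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        return new SimpleDateFormat("yyyy-MM-dd").format(date);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(19.0));  // 19
        System.out.println(formatNumber(19.5));  // 19.5
        System.out.println(formatNumber(100.0)); // 100
        Date birthday = new GregorianCalendar(1999, 10, 8).getTime(); // month is zero-based
        System.out.println(formatDate(birthday)); // 1999-11-08
    }
}
```

In `ExcelUtil` these would replace the two `"" + ...` expressions in the NUMERIC branch.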

5. Extract text and author from PPT

package poiUtils;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import com.google.common.collect.Lists;
import com.google.gson.Gson;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.sl.usermodel.Shape;
import org.apache.poi.sl.usermodel.Slide;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.usermodel.TextShape;
import org.apache.poi.xslf.usermodel.XMLSlideShow;

public class PPtUtil {
    // Extracts the text of every text shape on every slide and prints the author.
    public static String readPPtFile(String path) {
        List<String> textList = Lists.newArrayList();
        InputStream inputStream = null;
        SlideShow<?, ?> ppt = null;
        try {
            // Open the file input stream
            inputStream = new FileInputStream(new File(path));
            if (path.endsWith(".ppt")) {
                // HSLF reads the binary .ppt format
                HSLFSlideShow hslf = new HSLFSlideShow(inputStream);
                System.out.println("Author: " + hslf.getSlideShowImpl().getSummaryInformation().getAuthor());
                ppt = hslf;
            } else if (path.endsWith(".pptx")) {
                // XSLF reads the XML-based .pptx format
                XMLSlideShow xslf = new XMLSlideShow(inputStream);
                System.out.println("Author: " + xslf.getProperties().getCoreProperties().getCreator());
                ppt = xslf;
            } else {
                //LOGGER.debug("File {} is not a PPT file", path);
                return "Not a PPT file: " + path;
            }
            // Walk through the slides and their shapes
            for (Slide<?, ?> slide : ppt.getSlides()) {
                for (Shape<?, ?> sh : slide.getShapes()) {
                    // Only text shapes carry text
                    if (sh instanceof TextShape) {
                        textList.add(((TextShape<?, ?>) sh).getText());
                    }
                }
            }
            // Close the slide show
            ppt.close();
        } catch (IOException e) {
            e.printStackTrace();
            //LOGGER.debug("Failed to read PPT file");
            System.out.println("Failed to read PPT file: " + path);
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return new Gson().toJson(textList);
    }
}

6. Extract text from TXT (plain text files carry no author property)

package poiUtils;

import com.google.common.collect.Lists;
import com.google.gson.Gson;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

public class TxtUtil {
    // Returns the lines of a .txt file as a JSON array.
    public static String readTxtFile(String path) {
        if (!path.endsWith(".txt")) {
            return "Not a TXT file: " + path;
        }
        List<String> txtList = Lists.newArrayList();
        // try-with-resources closes the BufferedReader, which also closes the wrapped FileReader.
        // Note: FileReader decodes with the JVM's default charset.
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(path))) {
            String s;
            while ((s = bufferedReader.readLine()) != null) {
                txtList.add(s);
            }
        } catch (IOException e) {
            e.printStackTrace();
            //LOGGER.debug("Failed to read TXT file");
            System.out.println("Failed to read TXT file: " + path);
        }
        return new Gson().toJson(txtList);
    }
}
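One thing to watch with text files is the encoding: `FileReader` decodes with the JVM's default charset, so a UTF-8 file can come out garbled on a machine whose default is something else. A possible alternative, sketched here with JDK classes only, is to name the charset explicitly via `Files.readAllLines`. The class `TextLines` and its methods are illustrative names of mine, and UTF-8 is an assumption about the input files, not something the original task specified.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class TextLines {
    /** Reads all lines of a text file, decoding with the given charset. */
    static List<String> read(Path file, Charset charset) {
        try {
            return Files.readAllLines(file, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the lines to a temporary UTF-8 file and reads them back. */
    static List<String> roundTrip(List<String> lines) {
        try {
            Path temp = Files.createTempFile("poi-demo", ".txt");
            try {
                Files.write(temp, lines, StandardCharsets.UTF_8);
                return read(temp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(temp); // clean up the demo file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII sample text written with Unicode escapes to keep the source ASCII-only
        List<String> sample = Arrays.asList("caf\u00e9", "na\u00efve", "plain");
        System.out.println(roundTrip(sample).equals(sample)); // true
    }
}
```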

7. Test class

import poiUtils.ExcelUtil;
import poiUtils.PPtUtil;
import poiUtils.TxtUtil;
import poiUtils.WordUtil;

public class App {
    public static void main(String[] args) {
        // The read methods are static, so they are called on the class, not on an instance
        System.out.println("E:\\test\\test1.doc: " + WordUtil.readWordFile("E:\\test\\test1.doc"));
        System.out.println("E:\\test\\test1.docx: " + WordUtil.readWordFile("E:\\test\\test1.docx"));
        System.out.println("E:\\test\\test1.xls: " + ExcelUtil.readExcelFile("E:\\test\\test1.xls"));
        System.out.println("E:\\test\\test1.xlsx: " + ExcelUtil.readExcelFile("E:\\test\\test1.xlsx"));
        System.out.println("E:\\test\\test1.ppt: " + PPtUtil.readPPtFile("E:\\test\\test1.ppt"));
        System.out.println("E:\\test\\test1.pptx: " + PPtUtil.readPPtFile("E:\\test\\test1.pptx"));
        System.out.println("E:\\test\\test1.txt: " + TxtUtil.readTxtFile("E:\\test\\test1.txt"));
    }
}

Output:

Author: ying
E:\test\test1.doc: ["文档内容","第一段","第二段"]
Author: ying
E:\test\test1.docx: ["文档内容","第一段","第二段"]
Author: ying
E:\test\test1.xls: [["小王","男","19.0","Mon Nov 08 00:00:00 CST 1999","true"],["大王","女","20.0","Wed Nov 04 00:00:00 CST 1998","false"]]
Author: ying
E:\test\test1.xlsx: [["小王","男","19.0","Mon Nov 08 00:00:00 CST 1999","true"],["大王","女","20.0","Wed Nov 04 00:00:00 CST 1998","false"]]
Author: ying
E:\test\test1.ppt: ["这个标题","这个内容"]
Author: ying
E:\test\test1.pptx: ["这个标题","这个内容"]
E:\test\test1.txt: ["txt内容","第二行","第三行"]
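All of the utilities in this post pick the parser from the file extension, so a renamed file ends in an exception from POI. A sturdier check looks at the first bytes of the file: the binary formats (.doc, .xls, .ppt) are OLE2 compound files that start with the signature D0 CF 11 E0 A1 B1 1A E1, while .docx, .xlsx, and .pptx are ZIP packages that start with "PK" followed by the bytes 03 04. The sketch below shows the comparison on a byte array. `ContainerSniffer` and `sniff` are hypothetical names, and a ZIP signature only proves that the file is a ZIP archive, not that it is an Office document.

```java
import java.util.Arrays;

final class ContainerSniffer {
    // OLE2 compound file signature, used by the binary .doc/.xls/.ppt formats
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\3\4", used by .docx/.xlsx/.pptx packages
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file by its leading bytes: "OLE2", "ZIP", or "UNKNOWN". */
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(zipHeader));                          // ZIP
        System.out.println(sniff(OLE2_MAGIC));                         // OLE2
        System.out.println(sniff("plain text".getBytes()));            // UNKNOWN
    }
}
```

For Excel specifically, POI's `WorkbookFactory.create(InputStream)` already detects the container and returns the matching `Workbook`, so the extension check in `ExcelUtil` could be dropped in favour of it.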

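    Original author: polar-bear-lily
    Original URL: https://blog.csdn.net/qq_39380838/article/details/100726709
    This article is reposted from the web for the purpose of sharing knowledge. If it infringes your rights, please contact the blogger to have it removed.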
