Converting PDFs to Google Docs using Drive API/DriveApp

This problem has been successfully resolved. I am editing my post to document my experience for posterity and future reference.

I have 117 PDF files (average size ~238 KB) uploaded to Google Drive. I want to convert them all to Google Docs and keep them in a different Drive folder.

I attempted to convert the files using Drive.Files.insert. However, in most cases only 5 files could be converted this way before the function terminated prematurely with this error:

Limit Exceeded: DriveApp. (line #, file "Code")

where the line referenced above is the one on which the insert function is called. After the first call to this function, subsequent calls typically failed immediately, with no additional Google Doc created.

I used 3 main ways to achieve my goal. One was using the Drive.Files.insert, as mentioned above. The other two involved using Drive.Files.copy and sending a batch of HTTP requests. These last two methods were suggested by Tanaike, and I recommend reading his answer below for more information. The insert and copy functions are from Google Drive REST v2 API, while batching multiple HTTP requests is from Drive REST v3.

With Drive.Files.insert, I experienced issues dealing with execution limitations (explained in the Problem section above). One solution was to run the functions multiple times. And for that, I needed a way to keep track of which files were converted. I had two options for this: using a spreadsheet and a continuation token. Therefore, I had 4 different methods to test: the two mentioned in this paragraph, batching HTTP requests, and calling Drive.Files.copy.
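The "run it several times and remember where you stopped" idea can be sketched in plain JavaScript, so that it can be tried outside Apps Script. The helper name `processWithBudget`, its parameters, and the injectable clock are hypothetical and are not part of the functions I actually ran. In a real script the returned `nextIndex` would have to be saved between runs, for example in a spreadsheet cell or a script property.

```javascript
// Hypothetical helper: process items until a time budget is used up,
// then report the index at which the next run should resume.
// `now` is an injectable clock (defaults to Date.now) so the logic can be tested.
function processWithBudget(items, startIndex, budgetMs, handler, now) {
  now = now || Date.now;
  var started = now();
  var i = startIndex;
  // Stop as soon as the list is exhausted or the budget is spent.
  while (i < items.length && now() - started < budgetMs) {
    handler(items[i]);
    i++;
  }
  return { nextIndex: i, done: i >= items.length };
}
```

Each run would call the helper with the `nextIndex` saved by the previous run and stop scheduling further runs once `done` is true.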

Because team drives behave differently from regular drives, I felt it necessary to try each of those methods twice, one in which the folder containing the PDFs is a regular non-Team Drive folder and one in which that folder is under a Team Drive. In total, this means I had 8 different methods to test.

These are the exact functions I used. Each of these was used twice, with the only variations being the ID of the source and destination folders (for reasons stated above):

Method A: Using Drive.Files.insert and a spreadsheet

function toDocs() {
  var sheet = SpreadsheetApp.openById(/* spreadsheet id*/).getSheets()[0];
  var range = sheet.getRange("A2:E118");
  var table = range.getValues();
  var len = table.length;
  var resources = {
    title: null,
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [{id: /* destination folder id */}]
  };
  var count = 0;
  var files = DriveApp.getFolderById(/* source folder id */).getFiles();
  while (files.hasNext()) {
    var blob = files.next().getBlob();
    var blobName = blob.getName();
    for (var i=0; i<len; i++) {
      if (table[i][0] === blobName.slice(5, 18)) {
        if (table[i][4])
          break;
        resources.title = blobName;
        Drive.Files.insert(resources, blob);  // Limit Exceeded: DriveApp. (line 51, file "Code")
        table[i][4] = "yes";
      }
    }

    if (++count === 10) {
      range.setValues(table);
      Logger.log("time's up");
    }
  }
}

Method B: Using Drive.Files.insert and a continuation token

function toDocs() {
  var folder = DriveApp.getFolderById(/* source folder id */);
  var sprop = PropertiesService.getScriptProperties();
  var contToken = sprop.getProperty("contToken");
  var files = contToken ? DriveApp.continueFileIterator(contToken) : folder.getFiles();
  var options = {
    ocr: true
  };
  var resource = {
    title: null,
    mimeType: null,
    parents: [{id: /* destination folder id */}]
  };

  while (files.hasNext()) {
    var blob = files.next().getBlob();
    resource.title = blob.getName();
    resource.mimeType = blob.getContentType();
    Drive.Files.insert(resource, blob, options);  // Limit Exceeded: DriveApp. (line 113, file "Code")
    sprop.setProperty("contToken", files.getContinuationToken());
  }
}

Method C: Using Drive.Files.copy

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
    Logger.log(res) 
  }
}

Method D: Sending batches of HTTP requests

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 20; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

What Worked and What Didn't

    • None of the functions using Drive.Files.insert worked. Every function using insert for conversion failed with this error:

      Limit Exceeded: DriveApp. (line #, file "Code")

      (line number replaced with generic symbol). No further details or description of the error could be found. A notable variation was one in which I used a spreadsheet and the PDFs were in a team drive folder; while all other methods failed instantly without converting a single file, this one converted 5 before failing. However, when considering why this variation did better than the others, I think it was more of a fluke than any reason related to the use of particular resources (spreadsheet, team drive, etc.)

    • Using Drive.Files.copy and batch HTTP requests worked only when the source folder was a personal (non-Team Drive) folder.

    • Attempting to use the copy function while reading from a Team Drive folder fails with this error:

      File not found: 1RAGxe9a_-euRpWm3ePrbaGaX5brpmGXu (line #, file "Code")

      (line number replaced with generic symbol). The line being referenced is:

      var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
      

    • Using batch HTTP requests while reading from a Team Drive folder does nothing -- no doc files are created and no errors are thrown. Function silently terminates without having accomplished anything.
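One possible reason the batch variant looks silent is that the script sets `muteHttpExceptions: true` and leaves the `Logger.log(res)` line commented out, so a failure inside an individual part of the batch would never surface as an exception. The sketch below is a hedged illustration, not something I ran at the time: it assumes the batch response body contains one `HTTP/1.1 <code> <reason>` status line per part, and the function names `extractBatchStatuses` and `failedStatuses` are hypothetical.

```javascript
// Hypothetical helpers for inspecting the text returned by a batch request.
// Assumption: each part of the response carries its own "HTTP/1.1 <code> <reason>" line.
function extractBatchStatuses(responseText) {
  var statuses = [];
  var pattern = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\s*(.*)$/gm;
  var match;
  while ((match = pattern.exec(responseText)) !== null) {
    statuses.push({ code: Number(match[1]), reason: match[2].trim() });
  }
  return statuses;
}

// Keep only the parts whose status code signals a client or server error.
function failedStatuses(responseText) {
  return extractBatchStatuses(responseText).filter(function (s) {
    return s.code >= 400;
  });
}
```

Passing the string returned by `getContentText()` to `failedStatuses` and logging the result would show whether individual copy requests were rejected.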

      If you wish to convert a large number of PDFs to Google Docs or text files, use Drive.Files.copy or send batches of HTTP requests, and make sure that the PDFs are stored in a personal drive rather than a Team Drive.

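      Special thanks to @tehhowch for taking such an interest in my question and repeatedly coming back to provide feedback, and to @Tanaike for providing code along with explanations that successfully solved my problem (with a caveat; see above for details).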

      Recommended answer

      You want to convert the PDF files in a folder to Google Documents. The PDF files are in a Team Drive folder, and you want to save the converted files to a folder in your own Google Drive. If my understanding is correct, how about this method?

      PDFs can be converted to Google Documents not only with Drive.Files.insert() but also with Drive.Files.copy(). The advantages of Drive.Files.copy() are:

      • Drive.Files.insert() has a size limit of 5 MB, whereas Drive.Files.copy() can handle files larger than 5 MB.
      • In my environment, it was faster than Drive.Files.insert().

      For this method, I would like to propose the following 2 patterns.

      Pattern 1: Using Drive API v2

      In this case, Drive API v2 of Advanced Google Services is used for converting files.

      function myFunction() {
        var sourceFolderId = "/* source folder id */";
        var destinationFolderId = "/* dest folder id */";
        var files = DriveApp.getFolderById(sourceFolderId).getFiles();
        while (files.hasNext()) {
          var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
      //    Logger.log(res) // If you use this, please remove the comment.
        }
      }
      

      Pattern 2: Using Drive API v3

      In this case, Drive API v3 is used for converting files. Here I used batch requests, because a single batch request can carry up to 100 API calls. This can resolve the API quota issue.

      function myFunction() {
        var sourceFolderId = "/* source folder id */";
        var destinationFolderId = "/* dest folder id */";
      
        var files = DriveApp.getFolderById(sourceFolderId).getFiles();
        var rBody = [];
        while (files.hasNext()) {
          rBody.push({
            method: "POST",
            endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
            requestBody: {
              mimeType: "application/vnd.google-apps.document",
              parents: [destinationFolderId]
            }
          });
        }
        var cycle = 100; // Number of API calls at 1 batch request.
        for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
          var offset = i * cycle;
          var body = rBody.slice(offset, offset + cycle);
          var boundary = "xxxxxxxxxx";
          var contentId = 0;
          var data = "--" + boundary + "\r\n";
          body.forEach(function(e){
            data += "Content-Type: application/http\r\n";
            data += "Content-ID: " + ++contentId + "\r\n\r\n";
            data += e.method + " " + e.endpoint + "\r\n";
            data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
            data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
            data += "--" + boundary + "\r\n";
          });
          var options = {
            method: "post",
            contentType: "multipart/mixed; boundary=" + boundary,
            payload: Utilities.newBlob(data).getBytes(),
            headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
            muteHttpExceptions: true,
          };
          var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
      //    Logger.log(res); // If you use this, please remove the comment.
        }
      }
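The multipart body that the loop above assembles by string concatenation can also be built by a small pure function, which makes the exact layout of each part easier to see and to test outside Apps Script. This is only a sketch: the name `buildBatchPayload` is hypothetical, and unlike the script above it ends the body with the standard multipart closing delimiter (`--boundary--`) instead of a plain boundary line.

```javascript
// Hypothetical helper: build a multipart/mixed body for a batch request.
// `requests` is an array of {method, endpoint, requestBody} objects, like rBody above.
function buildBatchPayload(requests, boundary) {
  var CRLF = "\r\n";
  var parts = requests.map(function (req, index) {
    var lines = [
      "--" + boundary,
      "Content-Type: application/http",
      "Content-ID: " + (index + 1),
      "", // blank line separates the part headers from the embedded request
      req.method + " " + req.endpoint
    ];
    if (req.requestBody) {
      lines.push("Content-Type: application/json; charset=utf-8", "", JSON.stringify(req.requestBody));
    } else {
      lines.push("");
    }
    return lines.join(CRLF) + CRLF;
  });
  // Closing delimiter: the boundary followed by two extra hyphens.
  return parts.join("") + "--" + boundary + "--" + CRLF;
}
```

The returned string would take the place of `data` when building `payload` in the options object.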
      

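      Note:

      • If the number of API calls in one batch request is too large (the current value is 100), please modify var cycle = 100.
      • If Drive API v3 cannot be used for Team Drive, please tell me. I can convert it to Drive API v2.
      • If Team Drive is the cause of your problem, can you try this after copying the PDF files to your Google Drive?

      Reference:

      • Batching requests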
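The effect of changing `cycle` can be checked with a small helper that splits the request list the same way the loop does. The function name `chunk` is hypothetical; this is a sketch in plain JavaScript, not part of the script above.

```javascript
// Hypothetical helper: split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!(size >= 1)) {
    throw new Error("size must be at least 1");
  }
  var groups = [];
  for (var i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}
```

For the 117 files in the question, a size of 100 gives two batches (100 and 17 requests), while a size of 20 gives six batches.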

        If these are not useful for you, I'm sorry.
