Extract text
Extract plain text from selected pages in a PDF document
With this building block you can extract the text of a PDF document.
You can select the pages from which to extract the text.
Use cases
- Extract values from incoming invoices and automatically update your records
Configuration
Clicking on given file in the title of the building block lets you pick the desired file from your Google Drive to extract the text from.
You can deselect the picked file by clicking on the x button on the right of the selected filename.
If no file has been picked, Ultradox will load the given file that is stored in the input variable.
When loading files stored in a variable, make sure that the input prefix matches the output prefix of the building block that provides the document.
Click on the bold part of the title of the building block to open the configuration dialog and choose the pages from which the text will be extracted.
Enter the page numbers to be extracted, separated by commas.
Page numbers start at 1. If you enter 1,3,5 the resulting document will contain the text from the first, the third and the fifth page of the given PDF document.
You can also specify ranges of pages, e.g. 2,4-6 will extract the text from page 2 and pages 4, 5 and 6 into the target document.
If the entered page numbers are greater than the number of pages, text extraction will end at the last page. For example, if you enter 2-999 and the PDF document has only 5 pages, the text from all pages except the first will be extracted.
If you enter negative values, the pages are counted from the end of the document. For example, entering -3--1 will extract the text of the last two pages of the document.
Make sure not to include any spaces in the list of pages!
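The page list rules above can be illustrated with a small sketch. This is not Ultradox code: the function name expand_pages and its parameters are hypothetical, and the sketch only covers positive page numbers, ranges, and the cut-off at the last page. Negative (from-the-end) values are deliberately left out because their exact semantics are not spelled out here.

```python
import re

# Hypothetical helper, not part of Ultradox: expands a page list such as
# "2,4-6" into 1-based page numbers, cut off at the document length.
_TOKEN = re.compile(r"(\d+)(?:-(\d+))?")


def expand_pages(spec: str, page_count: int) -> list[int]:
    # The page list must not contain spaces.
    if " " in spec:
        raise ValueError("page list must not contain spaces")
    pages = []
    for token in spec.split(","):
        match = _TOKEN.fullmatch(token)
        if match is None:
            # Negative (from-the-end) values are not handled in this sketch.
            raise ValueError(f"unsupported entry: {token!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if first < 1:
            raise ValueError("page numbers start at 1")
        # Ranges running past the end of the document stop at the last page.
        pages.extend(range(first, min(last, page_count) + 1))
    return pages
```

Under these assumptions, expand_pages("2-999", 5) yields [2, 3, 4, 5], which matches the example of a 5-page document where every page except the first is extracted.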
The extracted text contains return/newline characters “\r\n”. If you want to output the content to an HTML document, you’ll have to use “(wrap)” to convert them to HTML breaks.
<h1 id="extractPages" style="text-align: left"> <span>Extract pages</span> </h1>
<h4 id="extractPagesFromAPDFDocumentIntoANewDocument" style="text-align: left"> <span>Extract pages from a PDF document into a new document</span> </h4>
<a id="t.cf95b98610bb1f133523ee5d2f8a3eadc1337364" shape="rect"></a>
<a id="t.0" shape="rect"></a>
<table class="fs-building-block">
<tr>
<td colspan="2">
<table class="fs-toolbar">
<tr>
<td class="fs-type fs-pdf icon-pdfExtractPages"></td>
<td class="fs-title"><p style="text-align: left"> <span>Extract pages </span><strong>2</strong><span> to </span><strong>4</strong><span> of </span><strong>given file</strong> </p></td>
<td class="fs-block-action hidden-xs"><a class="fs-button "></a></td>
<td class="fs-block-action hidden-xs"><a href="https://help.ultradox.com/en/reference/overview.html#ultradoxHelp">
<div class="fs-helpButton icon-condition"></div></a></td>
<td class="fs-block-action hidden-xs"><a href="https://help.ultradox.com/en/reference/overview.html#licenseIndicator">
<div class="fs-licenseIndicator">
3
</div></a></td>
<td class="fs-block-action hidden-xs"><a href="https://help.ultradox.com/en/reference/overview.html#conditionalExecution">
<div class="fs-indicator icon-off"></div></a></td>
<td class="fs-block-action hidden-xs"><a href="https://help.ultradox.com/en/reference/overview.html#breakpoint">
<div class="fs-indicator icon-breakePointOff"></div></a></td>
</tr>
</table></td>
</tr>
<tr class="fs-variables">
<td class="fs-parameters">
<table class="fs-inout">
<tr>
<td class="icon-input"><td class="fs-title"></td></td>
<td class="fs-block-action hidden-xs"><a class="fs-button icon-keyboard" href="https://help.ultradox.com/en/reference/overview.html#enterValues"></a></td>
<td class="fs-block-action hidden-xs"><a class="fs-button icon-forms" href="https://help.ultradox.com/en/reference/overview.html#createForm"></a></td>
<td class="fs-block-action hidden-xs"><a class="fs-button icon-prefix" href="https://help.ultradox.com/en/reference/overview.html#inputPrefix"></a></td>
</tr>
</table></td>
<td class="fs-parameters">
<table class="fs-inout">
<tr>
<td class="icon-output"><td class="fs-title"></td></td>
</tr>
</table></td>
</tr>
<tr class="fs-variables">
<td class="fs-parameters">
<table class="fs-inout">
<tr>
<td class="fs-title">pdf.file</td>
</tr>
</table></td>
<td class="fs-parameters">
<table class="fs-inout">
<tr>
<td class="fs-title">pdf.file</td>
</tr>
</table></td>
</tr>
<tr class="fs-variables">
<td class="fs-parameters">
<table class="fs-inout">
<tr>
<td class="fs-title"></td>
</tr>
</table></td>
<td class="fs-parameters">
<table class="fs-inout">
<tr>
<td class="fs-title">pdf.mimeType</td>
</tr>
</table></td>
</tr>
</table>
<p style="text-align: left"> <span>With this building block you can extract one or more pages from an existing PDF document and store them as a new document.</span> </p>
<div class="fs-docs-section">
<h3 id="useCases" style="text-align: left"> <span>Use cases</span> </h3>
<ul>
<li style="text-align: left"> <span>Extracting relevant pages from a large document</span> </li>
</ul>
</div>
<div class="fs-docs-section">
<h3 id="configuration" style="text-align: left"> <span>Configuration</span> </h3>
<p style="text-align: left"> <span>Clicking on </span><strong>given file</strong><span> in the title of the building block lets you pick the desired file from your Google Drive to extract pages from.</span> </p>
<p style="text-align: left"> <span>You can deselect the picked file by clicking on the </span><code>x</code><span> button on the right of the selected filename.</span> </p>
<p style="text-align: left"> <span>If no file has been picked, Ultradox will load the given file that is stored in the input variable.</span> </p>
<a id="t.0f1683bb99b5557eff6267dbfff5b1807dbbadbc" shape="rect"></a>
<a id="t.1" shape="rect"></a>
<div class="alert alert-info" role="alert" style="text-align: left">
<span class="glyphicon glyphicon-star-empty" aria-hidden="true"></span>
<p style="text-align: left"> <span>When loading files stored in a variable, make sure that the input prefix matches the output prefix of the building block that provides the document.</span> </p>
</div>
<p style="text-align: left"> <span>Click on the </span><strong>bold</strong><span> part of the title of the building block to open the configuration dialog and choose the pages to be extracted.</span> </p>
<p style="text-align: left"> <span>Enter the page numbers to be extracted, separated by commas.</span> </p>
<a id="t.0cfc835372e10baa1fdb666224d9990cd7d4b37a" shape="rect"></a>
<a id="t.2" shape="rect"></a>
<div class="alert alert-info" role="alert" style="text-align: left">
<span class="glyphicon glyphicon-star-empty" aria-hidden="true"></span>
<p style="text-align: left"> <span>Page numbers start at 1. If you enter </span><code>1,3,5</code><span> the resulting document will contain the first, the third and the fifth page of the given PDF document.</span> </p>
</div>
<p style="text-align: left"> <span>You can also specify ranges of pages, e.g. </span><code>2,4-6</code><span> will extract page 2 and pages 4, 5 and 6 into the target document.</span> </p>
<p style="text-align: left"> <span>If the entered page numbers are greater than the number of pages, extraction will end at the last page. For example, if you enter </span><code>2-999</code><span> and the PDF document has only 5 pages, all pages except the first will make it into the target document.</span> </p>
<a id="t.d80a144c11a42e0d3fa27ddba840e0f2fc360988" shape="rect"></a>
<a id="t.3" shape="rect"></a>
<div class="alert alert-info" role="alert" style="text-align: left">
<span class="glyphicon glyphicon-star-empty" aria-hidden="true"></span>
<p style="text-align: left"> <span>If you enter negative values, the pages are counted from the end of the document. For example, entering </span><code>-3--1</code><span> will extract the last two pages of the document.</span> </p>
</div>
<a id="t.22105d76f18dfe7285c41dc4e37427bb081bc3a2" shape="rect"></a>
<a id="t.4" shape="rect"></a>
<div class="alert alert-danger" role="alert" style="text-align: left">
<span class="glyphicon glyphicon-exclamation-sign" aria-hidden="true"></span>
<p style="text-align: left"> <span>Make sure not to include any spaces in the list of pages to be extracted!</span> </p>
</div>
</div>
Last Updated: 3/4/19