.NET/SQL Analyst Developer +20 years experience
PdfLogicExtractor

PdfLogicExtractor is a piece of software designed to extract information from PDF documents in a logical and orderly manner, so that it can later be processed by the systems it integrates with.
The system is based on adaptable logic implemented in a template system that can process all documents of a certain type.
A template can adapt to the variations established in its definition, such as months with different numbers of days, displaced document areas, or differences between pages within the same document, and in general to any type of extraction logic needed.
Template extraction logic can clean results according to predefined rules, converting them to plain data types or removing insignificant parts of the text.
Template extraction logic can perform calculations based on results obtained from different extractions or predefined values, such as calculating totals in a table based on price per unit.
Likewise, any functionality or exception that the template logic of a document type needs in order to be more effective can be programmed specifically.
The system is contained in a dynamic-link library (DLL), distributed as a NuGet package, which can be incorporated into any type of platform or software project in the .NET ecosystem.
To integrate it directly into a .NET project, add the NuGet package to the project and use the standard calls described in the documentation or suggested by Visual Studio's IntelliSense.
Open the NuGet Package Manager in Visual Studio and search for 'Angelves' to find our packages.
Manual installation by console:
The integration exposes a very simple software interface with a few overloaded calls and a directly accessible response data model, which can easily be processed as JSON.

    namespace Angelves.PdfLogicalExtractor.PublicInterface
    {
        public interface ILetsGoToExtraction
        {
            ExtractionResult Start(Template template, string filePath, DocumentType type);
            ExtractionResult Start(string templatePath, string filePath, DocumentType type);
            ExtractionResult Start(string templatePath, Stream fileStream, DocumentType type);
            ExtractionResult Start(Template template, Stream fileStream, DocumentType type);
            string GetWordsInPdf(string filePath, string? filter1 = null, int round = 0);
        }
    }

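As a rough illustration of how these overloads might be called, the sketch below runs one extraction through the `Stream` overload and serializes the result. It compiles only with the package referenced. This document does not show how an `ILetsGoToExtraction` instance is obtained, which members `DocumentType` has, or in which namespaces `Template`, `ExtractionResult` and `DocumentType` live, so the helper receives those values as arguments. The `ExtractionSample` class and the assumption that `ExtractionResult` serializes with `System.Text.Json` are illustrative only.

```csharp
using System.IO;
using System.Text.Json;
// Assumption: the public types are reachable from this namespace.
using Angelves.PdfLogicalExtractor.PublicInterface;

// Hypothetical helper, not part of the package.
public static class ExtractionSample
{
    // Runs one extraction through the Stream overload and returns the result as JSON text.
    public static string ExtractToJson(
        ILetsGoToExtraction extractor, // obtained however the package provides it
        string templatePath,
        string pdfPath,
        DocumentType type)             // the caller chooses the value
    {
        using FileStream pdf = File.OpenRead(pdfPath);
        ExtractionResult result = extractor.Start(templatePath, pdf, type);

        // Assumption: the response model is a plain object graph that serializes as-is.
        return JsonSerializer.Serialize(result, new JsonSerializerOptions { WriteIndented = true });
    }
}
```

The `Stream` overloads are convenient when the PDF arrives from an upload or a message queue rather than from disk; the path overloads avoid opening the file yourself.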
A Template is a JSON file that implements extraction logic against a PDF document.
The goal of these templates is to obtain an organized result from a plain-text reading, one that can be processed by an application or system.
The following table describes the sections of a template, where you can define the extraction boxes.
Section | Description |
---|---|
Name | Sets the name of the template. |
Settings | Holds the general configuration of the template. |
Offsets [pro] | Establishes control points used to detect movements of the text relative to the original format. |
Metaboxes [pro] | Defines the Metaboxes. These elements are extraction boxes in themselves but have no impact on the output, as they are used in the calculation of formulas. |
Boxes | Defines the Boxes or extraction boxes, which are the data extracted from the document, calculations, tables, etc. |
Renames | Establishes the renamings of output fields. This section is useful for redefining the names generated automatically by extract operations. |
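
Because a template is plain JSON, its sections can be inspected with any JSON reader before it is handed to the extractor. The following is a minimal sketch using `System.Text.Json`; `TemplateInspector` is a hypothetical helper that is not part of the package, and the property names (`templatename`, `config`, `offsets`, `metaboxes`, `boxes`, `renames`, `name`) are taken from the example template in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of the package: reads a template as plain JSON.
public static class TemplateInspector
{
    // Property names as they appear in the example template of this document.
    private static readonly string[] KnownSections =
        { "templatename", "config", "offsets", "metaboxes", "boxes", "renames" };

    // Lists which of the known top-level sections a template contains.
    public static List<string> PresentSections(string templateJson)
    {
        using JsonDocument doc = JsonDocument.Parse(templateJson);
        var present = new List<string>();
        if (doc.RootElement.ValueKind != JsonValueKind.Object) return present;

        foreach (string section in KnownSections)
        {
            if (doc.RootElement.TryGetProperty(section, out JsonElement _)) present.Add(section);
        }
        return present;
    }

    // Collects the "name" of every entry in the top-level "boxes" array.
    public static List<string> BoxNames(string templateJson)
    {
        using JsonDocument doc = JsonDocument.Parse(templateJson);
        var names = new List<string>();
        if (doc.RootElement.ValueKind != JsonValueKind.Object) return names;

        if (doc.RootElement.TryGetProperty("boxes", out JsonElement boxes) &&
            boxes.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement box in boxes.EnumerateArray())
            {
                if (box.ValueKind == JsonValueKind.Object &&
                    box.TryGetProperty("name", out JsonElement name) &&
                    name.ValueKind == JsonValueKind.String)
                {
                    names.Add(name.GetString()!);
                }
            }
        }
        return names;
    }
}
```

For example, `TemplateInspector.BoxNames(File.ReadAllText(path))` could be used in a unit test to confirm that a template still defines the boxes a downstream system expects.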
The following table contains a description of the commands that you can use within a Box in the template definition.
Functionality | Description |
---|---|
Name | Box name. This property is important for calculations. |
id (*1) [pro] | Offset identifier. |
type | Defines the type of the Box. |
idoffset | Links the Box to the id of an offset control. |
required | Whether the Box is required. |
metabox [pro] | Converts a normal Box into a Metabox that will not be reflected in the output. |
x1, x2, y1 and y2 | Text extraction coordinates. |
deviationbetweenpages [pro] | Deviation of the Y coordinate between pages. |
master (*2) | Boolean that marks the column that drives the Table extraction. |
additionaldata | Additional data required by some types of Box. |
parameters | Array of parameters required by some types of Box. |
alternativecoords [pro] | Alternative coordinates based on conditions. JSON array of conditions, each with a "condition" field that expresses a formula compared against a result, and with four optional fields x1, x2, y1 and y2 that set the new coordinates when the condition is met. |
extractionrules | Rules applied during data extraction. JSON array with the fields "action", "target" and "parameters[]". |
formula [pro] | Calculation formulas using numbers in the template and/or Metaboxes or Boxes. |
comments | Comments; not used in the process. |
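
To make the effect of `extractionrules` more concrete, here is a minimal sketch of how a rule list could be applied to an extracted string. It is not the package's implementation. It assumes that rules run in the order they are listed, that `Erase` removes every occurrence of `target`, and that `QuitSpaces` removes all whitespace; only these two action names appear in the example template of this document, and `RuleSketch` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating assumed rule semantics; not the package's code.
public static class RuleSketch
{
    // Applies each rule in order and returns the cleaned text.
    public static string Apply(string text, IEnumerable<(string Action, string? Target)> rules)
    {
        foreach (var (action, target) in rules)
        {
            text = action switch
            {
                // Assumption: QuitSpaces strips every whitespace character.
                "QuitSpaces" => string.Concat(text.Where(c => !char.IsWhiteSpace(c))),
                // Assumption: Erase deletes all occurrences of the target text.
                "Erase" when !string.IsNullOrEmpty(target) => text.Replace(target!, string.Empty),
                // Unknown action or missing target: leave the text unchanged.
                _ => text
            };
        }
        return text;
    }
}
```

Under these assumptions the order matters: with `QuitSpaces` first, a raw value such as `N. ORDER: 12345` becomes `N.ORDER:12345`, so a following `Erase` with target `N.ORDER:` matches and leaves `12345`.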

    {
      "templatename": "EXAMPLE TEMPLATE",
      "config": {
        "externalsfieldsintables": false,
        "decimalseparator": ","
      },
      "offsets": [
        {
          "id": 1,
          "text": "PROGRAM",
          "x": 82.92,
          "y": 153.95
        },
        {
          "id": 2,
          "text": "ADVERTISER:",
          "x": 82.92,
          "y": 101.99
        }
      ],
      "metaboxes": [
        {
          "name": "end_month_1",
          "type": "Number",
          "x1": 714,
          "y1": 146,
          "x2": 723,
          "y2": 153
        },
        {
          "name": "end_month_2",
          "type": "Number",
          "x1": 705,
          "y1": 146,
          "x2": 714,
          "y2": 143
        },
        {
          "name": "end_month_3",
          "type": "Number",
          "x1": 696,
          "y1": 146,
          "x2": 705,
          "y2": 153
        }
      ],
      "boxes": [
        {
          "name": "station",
          "required": true,
          "x1": 600,
          "y1": 60,
          "x2": 800,
          "y2": 80
        },
        {
          "name": "advertiser",
          "idoffset": 2,
          "required": true,
          "x1": 160,
          "y1": 90,
          "x2": 350,
          "y2": 110
        },
        {
          "name": "product",
          "idoffset": 2,
          "required": true,
          "x1": 160,
          "y1": 106,
          "x2": 350,
          "y2": 116
        },
        {
          "name": "campaign",
          "type": "Empty"
        },
        {
          "name": "reference",
          "idoffset": 2,
          "required": true,
          "x1": 600,
          "y1": 90,
          "x2": 800,
          "y2": 110,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "N.ORDER:"
            }
          ]
        },
        {
          "name": "invoicedate",
          "type": "DateTime",
          "idoffset": 2,
          "required": true,
          "format": "dd/MM/yyyy",
          "x1": 600,
          "y1": 106,
          "x2": 800,
          "y2": 125,
          "extractionrules": [
            {
              "action": "Erase",
              "target": "DATE:"
            },
            {
              "action": "QuitSpaces"
            }
          ]
        },
        {
          "name": "table1",
          "type": "Table",
          "header": [
            {
              "name": "format",
              "idoffset": 1,
              "required": true,
              "master": true,
              "x1": 235,
              "y1": 171.10,
              "x2": 273,
              "y2": 178.41,
              "extractionrules": [
                {
                  "action": "QuitSpaces"
                },
                {
                  "action": "Erase",
                  "target": "20"
                },
                {
                  "action": "Erase",
                  "target": "\""
                }
              ]
            },
            {
              "name": "duration",
              "idoffset": 1,
              "required": true,
              "x1": 235,
              "x2": 280,
              "extractionrules": [
                {
                  "action": "QuitSpaces"
                },
                {
                  "action": "Erase",
                  "target": "CRADLE"
                },
                {
                  "action": "Erase",
                  "target": "\""
                }
              ]
            },
            {
              "name": "program",
              "idoffset": 1,
              "required": true,
              "x1": 70,
              "x2": 180
            },
            {
              "name": "startend_hour",
              "type": "TextSplit",
              "idoffset": 1,
              "parameters": [ "-" ],
              "x1": 180,
              "x2": 235,
              "extractionrules": [
                {
                  "action": "QuitSpaces"
                },
                {
                  "action": "Erase",
                  "target": "("
                },
                {
                  "action": "Erase",
                  "target": "MF"
                },
                {
                  "action": "Erase",
                  "target": "S-U"
                },
                {
                  "action": "Erase",
                  "target": ")"
                }
              ]
            },
            {
              "name": "swap",
              "type": "ArraySwap",
              "idoffset": 1,
              "additionaldata": "Number",
              "deletenullsinarray": true,
              "x1": 444.21,
              "x2": 723.43,
              "w": 9.007,
              "arrayindex": {
                "y1": 146,
                "y2": 153
              },
              "alternativecoords": [
                {
                  "condition": "[BOX:end_month_1] ! 31",
                  "x2": 714.42
                },
                {
                  "condition": "[BOX:end_month_2] ! 30",
                  "x2": 705.41
                },
                {
                  "condition": "[BOX:end_month_3] ! 29",
                  "x2": 696.41
                }
              ]
            },
            {
              "name": "totalpasses",
              "type": "Decimal",
              "idoffset": 1,
              "x1": 310,
              "x2": 350
            },
            {
              "name": "unitprice",
              "type": "Decimal",
              "formula": "[BOXROW:totalprice] / [BOXROW:totalpasses]"
            },
            {
              "name": "discount",
              "type": "Decimal",
              "idoffset": 1,
              "x1": 380,
              "x2": 410,
              "extractionrules": [
                {
                  "action": "Erase",
                  "target": "%"
                },
                {
                  "action": "QuitSpaces"
                }
              ]
            },
            {
              "name": "agencydiscount",
              "type": "Empty"
            },
            {
              "name": "totalprice",
              "type": "Decimal",
              "idoffset": 1,
              "required": true,
              "x1": 410,
              "x2": 445,
              "extractionrules": [
                {
                  "action": "Erase",
                  "target": "€"
                },
                {
                  "action": "QuitSpaces"
                }
              ]
            }
          ]
        },
        {
          "name": "comments",
          "idoffset": 2,
          "x1": 60,
          "y1": 230,
          "x2": 720,
          "y2": 260,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "COMMENTS:"
            }
          ]
        }
      ],
      "renames": [
        {
          "name": "startend_hour.INDEX[0]",
          "rename": "starthour",
          "exact": true,
          "casesensitive": false
        },
        {
          "name": "startend_hour.INDEX[1]",
          "rename": "endhour",
          "exact": true,
          "casesensitive": false
        },
        {
          "name": "swap.SWAP[",
          "rename": "passes",
          "exact": false,
          "casesensitive": false
        }
      ]
    }

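
The `unitprice` column in the example has no coordinates; its value comes from the formula `[BOXROW:totalprice] / [BOXROW:totalpasses]`. The sketch below reproduces that arithmetic for a single row, outside the library, to show what the formula amounts to when `decimalseparator` is a comma. `FormulaSketch` is a hypothetical helper, and returning no value when `totalpasses` is zero is an assumption; how the package handles that case is not described here.

```csharp
using System;
using System.Globalization;

// Hypothetical helper illustrating the row formula; not the package's code.
public static class FormulaSketch
{
    // Parses a number written with the template's "decimalseparator".
    public static decimal ParseDecimal(string text, string decimalSeparator)
    {
        var format = new NumberFormatInfo
        {
            NumberDecimalSeparator = decimalSeparator,
            // Keep the group separator different from the decimal separator.
            NumberGroupSeparator = decimalSeparator == "." ? "," : "."
        };
        return decimal.Parse(
            text.Trim(),
            NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
            format);
    }

    // Mirrors "[BOXROW:totalprice] / [BOXROW:totalpasses]" for one table row.
    public static decimal? UnitPrice(string totalPrice, string totalPasses, string decimalSeparator)
    {
        decimal passes = ParseDecimal(totalPasses, decimalSeparator);
        if (passes == 0m) return null; // assumption: no unit price without passes
        return ParseDecimal(totalPrice, decimalSeparator) / passes;
    }
}
```

A row with a total price of `120,50` and `5` passes would be evaluated as `FormulaSketch.UnitPrice("120,50", "5", ",")`. Using `decimal` rather than `double` keeps monetary values exact.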
This software is provided under the terms of this Commercial Use License ("License"). By downloading, installing, or using this software, you agree to the terms and conditions of this License.
This software is a NuGet package that can be freely downloaded. The purpose of the software is to provide functionality for logically ordering the reading of a PDF in plain text provided by third-party software. The user is granted a limited, non-exclusive, non-transferable license to use this software for evaluation purposes during a 30-day trial period. After this period, the user must acquire a commercial license to continue using this software.
The user may not decompile, modify, or resell this software. However, the user may redistribute the software when integrated into their own software under a commercial license. The user must acquire a valid license to use this software for commercial purposes.
The third-party software provided for PDF reading may not be able to read some types of PDFs, such as those from scanned images. The license holder shall not be liable for any loss or damage arising from the third-party software's inability to read a specific PDF.
All ownership and intellectual property rights of this software are owned by the license holder. This software is protected by copyright laws and other applicable laws.
This software is provided "as is," without warranties of any kind, whether express or implied. The license holder shall not be liable for any direct, indirect, incidental, special, exemplary, or consequential damages arising out of the use or inability to use this software.
This License shall be governed and construed in accordance with the laws of the State of Spain without regard to its conflict of law principles.
The terms and conditions of this License may be subject to change without prior notice. It is the user's responsibility to periodically review the terms of this License.
By downloading, installing, or using this software, you acknowledge that you have read and understood the terms and conditions of this License and agree to comply with them.