In the research group, when it comes to expense reimbursement, it is inevitable to create inbound and outbound orders based on invoices. It's manageable when there are few items, but it becomes really troublesome when there are many. So, while I was still involved in the reimbursement work, I created a small tool that allows for easy creation of inbound and outbound orders, as shown in the image below:
Although it reduced the mental burden, it still required manual input of invoice numbers, codes, invoice dates, and other information. Now that I am no longer involved in the reimbursement work, I suddenly thought how great it would be if I could directly upload a file to obtain all the invoice information. So I decided to do it. The initial project was done in js/ts, but this time I switched to Python, after all, life is short, and I prefer Python.
The main code for extracting invoice information is implemented as follows, primarily relying on the pdfplumber library and regular expressions:
import pdfplumber
import re
from typing import List, Dict, Optional
class InvoiceExtractor:
def _invoice_pdf2txt(self, pdf_path: str) -> Optional[str]:
"""
Extract text from a PDF file using pdfplumber.
:param pdf_path: Path to the PDF file.
:return: Extracted text as a string, returns None if extraction fails.
"""
try:
with pdfplumber.open(pdf_path) as pdf:
text = '\n'.join(page.extract_text() for page in pdf.pages if page.extract_text())
return text
except Exception as e:
#print(f"Error extracting text from {pdf_path}: {e}")
return None
def _extract_invoice_product_content(self, content: str) -> str:
"""
Extract product-related content from the invoice text.
:param content: Complete text of the invoice.
:return: Extracted product-related content as a string.
"""
lines = content.splitlines()
start_pattern = re.compile(r"^(Goods or Taxable Services|Project Name)")
end_pattern = re.compile(r"^Total Amount Including Tax")
start_index = next((i for i, line in enumerate(lines) if start_pattern.match(line)), None)
end_index = next((i for i, line in enumerate(lines) if end_pattern.match(line)), None)
if start_index is not None and end_index is not None:
extracted_lines = lines[start_index:end_index + 1]
return '\n'.join(extracted_lines).strip()
return "No matching content found"
def construct_invoice_product_data(self, raw_text: str) -> List[Dict[str, str]]:
"""
Process the extracted text to construct a list of invoice product data.
:param raw_text: Extracted raw text.
:return: List of product data, each product as a dictionary.
"""
blocks = re.split(r'(?=Goods or Taxable Services|Project Name)', raw_text.strip())
records = []
for block in blocks:
lines = [line.strip() for line in block.splitlines() if line.strip()]
if not lines:
continue
current_record = ""
for line in lines[1:]:
if line.startswith("Total") or line.startswith("Total Amount Including Tax"):
continue
if line.startswith("*"):
if current_record:
self._process_record(current_record, records)
current_record = line
else:
if " " in current_record:
first_space_index = current_record.index(" ")
current_record = current_record[:first_space_index] + line + current_record[first_space_index:]
if current_record:
self._process_record(current_record, records)
return records
def _process_record(self, record: str, records: List[Dict[str, str]]):
"""
Process a single record and add it to the record list.
:param record: String of a single record.
:param records: Record list.
"""
parts = record.rsplit(maxsplit=7)
if len(parts) == 8:
try:
records.append({
"product_name": parts[0].strip(),
"specification": parts[1].strip(),
"unit": parts[2].strip(),
"quantity": parts[3].strip(),
"unit_price": float(parts[4].strip()),
"amount": float(parts[5].strip()),
"tax_rate": parts[6].strip(),
"tax_amount": float(parts[7].strip())
})
except ValueError as e:
print(f"Failed to parse record: {record}, Error: {e}")
pass
In the end, a dictionary will be obtained, containing the product name, specification, unit, quantity, unit price, total price, tax rate, and tax amount of the invoice. Following this script, combined with fastapi and vue3, I created an application that allows for drag-and-drop to obtain invoice information and export inbound and outbound orders:
Of course, I am no longer responsible for the reimbursement work, but what I created benefits my junior colleagues, regardless of whether they use it or not; I have made it.