Oisin003/Financial-NLP-System

Financial-NLP-System

L00172671 - Oisin Gibson

File Overview (Brief)

Root

Server

Server Models

Server Routes

Server Middleware

Server Services

Server Tests

NER (Named Entity Recognition) Accuracy Testing

The nerAccuracy.test.js file evaluates entity extraction quality against sample financial documents with known ground truth.

Running the tests:

# With Jest (requires NLP service running on port 8000)
cd server
npm test -- nerAccuracy.test.js

# Standalone mode with detailed output (run from the repository root)
node server/tests/nerAccuracy.test.js --standalone

Metrics measured:

  • Precision: Of entities extracted, what percentage were correct?
  • Recall: Of known entities, what percentage were found?
  • F1 Score: Harmonic mean of precision and recall (a single score that balances the two)
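The three metrics above can be sketched as follows. This is a minimal illustration, not the code in nerAccuracy.test.js: the `scoreEntities` helper, the `{ type, text }` entity shape, and the case-insensitive exact-match rule are all assumptions made for the example.

```javascript
// Hypothetical helper: scores extracted entities against ground truth.
// Assumption: two entities match when type and text are equal
// (ignoring case and surrounding whitespace). The real test file
// may use a looser matching rule.
function scoreEntities(extracted, expected) {
  const key = (e) => `${e.type}::${e.text.trim().toLowerCase()}`;
  const extractedKeys = new Set(extracted.map(key));
  const expectedKeys = new Set(expected.map(key));

  // True positives: extracted entities that also appear in the ground truth.
  let truePositives = 0;
  for (const k of extractedKeys) {
    if (expectedKeys.has(k)) truePositives += 1;
  }

  // Guard against division by zero when either set is empty.
  const precision = extractedKeys.size ? truePositives / extractedKeys.size : 0;
  const recall = expectedKeys.size ? truePositives / expectedKeys.size : 0;
  const f1 = precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  return { precision, recall, f1 };
}

// Illustrative data only, using example strings from the entity table.
const expected = [
  { type: 'ORG', text: 'Apple Inc.' },
  { type: 'PERSON', text: 'Tim Cook' },
  { type: 'MONEY', text: '$394.3 billion' },
  { type: 'GPE', text: 'California' },
];
const extracted = [
  { type: 'ORG', text: 'Apple Inc.' },
  { type: 'PERSON', text: 'Tim Cook' },
  { type: 'GPE', text: 'London' },
];

const result = scoreEntities(extracted, expected);
console.log(result);
```

Here two of the three extracted entities are correct (precision 2/3) and two of the four known entities were found (recall 1/2), giving an F1 of 4/7, roughly 57%.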

Entity types evaluated:

| Type | Description | Example |
|------|-------------|---------|
| ORG | Organizations | "Apple Inc.", "FDIC" |
| PERSON | People names | "Tim Cook", "Jamie Dimon" |
| MONEY | Monetary values | "$394.3 billion", "£8.4 billion" |
| DATE | Dates/periods | "January 26, 2023", "Q3 2023" |
| GPE | Countries/cities | "California", "London" |
| PERCENT | Percentages | "7.8%", "24.5%" |
| CARDINAL | Numbers | "1.5 million", "30" |

Interpreting results:

  • 🟢 Good (F1 ≥ 70%): Entity type is reliably extracted
  • 🟡 Fair (40% ≤ F1 < 70%): Partial extraction, may need refinement
  • 🔴 Needs Work (F1 < 40%): Consider custom NER rules or training
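The rating bands above amount to a simple threshold check. The sketch below assumes F1 is expressed as a fraction between 0 and 1; the `rateF1` function name is hypothetical and is not taken from the test file.

```javascript
// Hypothetical helper: maps an F1 score (0..1) to the rating bands above.
function rateF1(f1) {
  if (f1 >= 0.7) return 'Good';       // reliably extracted
  if (f1 >= 0.4) return 'Fair';       // partial extraction
  return 'Needs Work';                // consider custom NER rules or training
}

// Illustrative per-type scores, not real results.
const sampleScores = { ORG: 0.82, DATE: 0.55, CARDINAL: 0.31 };
const ratings = Object.fromEntries(
  Object.entries(sampleScores).map(([type, f1]) => [type, rateF1(f1)])
);
console.log(ratings);
```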

Server Contracts

Client

Client Entry

Client Components

Client Hooks

Client Utils

Scripts

Reference Material

  • Tokenization Concepts

    • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  • Stopword Removal

    • Common English stopwords list based on NLTK (Natural Language Toolkit)
    • Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.

PDF Processing

Tika, OCR, and Tesseract

JPEG2000 OCR Support (Scanned PDFs)

Some scanned PDFs use JPEG2000 (JP2) images. To OCR these, add the JAI Image I/O JARs:

  1. Download:
    • jai-imageio-core-*.jar
    • jai-imageio-jpeg2000-*.jar
  2. Place both files in server/lib
  3. Restart Tika (npm run tika)
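Before restarting Tika, it can help to confirm that both JARs are actually in place. The sketch below is an assumption-laden illustration, not part of the repository: `missingJars` and `REQUIRED_JARS` are hypothetical names, and the file names in the example use a placeholder version string.

```javascript
// Hypothetical check: given the file names found in server/lib,
// report which of the two required JAI Image I/O JARs are missing.
const REQUIRED_JARS = [
  { label: 'jai-imageio-core-*.jar', pattern: /^jai-imageio-core-.*\.jar$/ },
  { label: 'jai-imageio-jpeg2000-*.jar', pattern: /^jai-imageio-jpeg2000-.*\.jar$/ },
];

function missingJars(fileNames) {
  return REQUIRED_JARS
    .filter(({ pattern }) => !fileNames.some((name) => pattern.test(name)))
    .map(({ label }) => label);
}

// Placeholder listing; "x.y.z" stands in for whatever version was downloaded.
const libListing = ['jai-imageio-core-x.y.z.jar', 'README.txt'];
const missing = missingJars(libListing);
console.log(missing.length ? `Missing: ${missing.join(', ')}` : 'All JARs present');
```

In practice the listing could come from Node's `fs.readdirSync('server/lib')`, run from the repository root.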

Web Development Frameworks

UI/UX Design

Authentication & Security

File Upload Handling

Data Retention & Scheduling

File Upload Security (Validation & Sanitization)
