Financial-NLP-System

L00172671 - Oisin Gibson

File Overview (Brief)

Root

package.json: Workspace scripts and shared deps.
README.md: Project notes and references.

Server

server/package.json: Server deps and scripts.
server/server.js: Express app, DB init, routes, cleanup job.
server/createAdmin.js: One‑off admin user creator.
server/tika-config.xml: Tika OCR configuration.

Server Models

server/models/User.js: User schema + auth helpers.
server/models/Document.js: Document schema + NLP fields.

Server Routes

server/routes/auth.js: Login and registration endpoints.
server/routes/users.js: Admin user management endpoints.
server/routes/documents.js: Upload, download, NLP endpoints.

Server Middleware

server/middleware/auth.js: JWT auth guard.

Server Services

server/services/nlpProcessor.js: Text extraction + NLP pipeline.

Server Tests

server/tests/auth.test.js: Auth route tests.
server/tests/documents.test.js: Document route tests.
server/tests/users.test.js: User route tests.
server/tests/nerAccuracy.test.js: NER accuracy evaluation tests.

NER (Named Entity Recognition) Accuracy Testing

The nerAccuracy.test.js file evaluates entity extraction quality against sample financial documents with known ground truth.

Running the tests:

# With Jest (requires NLP service running on port 8000)
cd server
npm test -- nerAccuracy.test.js

# Standalone mode (detailed output), run from the repository root
node server/tests/nerAccuracy.test.js --standalone

Metrics measured:

Precision: Of entities extracted, what percentage were correct?
Recall: Of known entities, what percentage were found?
F1 Score: Harmonic mean of precision and recall (a single balanced score for extraction quality)
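
As a rough illustration of how the three metrics relate, the sketch below scores one entity type against a ground-truth list. The function name scoreEntities, the case-insensitive exact matching, and the sample entity lists are assumptions made for this example; nerAccuracy.test.js may match and aggregate entities differently.

```javascript
// Minimal sketch: precision, recall and F1 for a single entity type.
// Assumption: an extracted entity counts as correct only when it matches
// a ground-truth entity exactly (ignoring case and surrounding whitespace).
function scoreEntities(expected, extracted) {
  const normalize = (text) => text.trim().toLowerCase();
  const expectedSet = new Set(expected.map(normalize));
  const extractedSet = new Set(extracted.map(normalize));

  // True positives: extracted entities that also appear in the ground truth.
  let truePositives = 0;
  for (const entity of extractedSet) {
    if (expectedSet.has(entity)) truePositives += 1;
  }

  // Guard against division by zero when either list is empty.
  const precision = extractedSet.size === 0 ? 0 : truePositives / extractedSet.size;
  const recall = expectedSet.size === 0 ? 0 : truePositives / expectedSet.size;
  const f1 = precision + recall === 0
    ? 0
    : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}

// Illustrative data only (not taken from the project's test fixtures).
const orgScores = scoreEntities(
  ['Apple Inc.', 'FDIC', 'JPMorgan Chase', 'Barclays'], // ground truth
  ['Apple Inc.', 'FDIC', 'Cupertino']                   // extracted as ORG
);
console.log(orgScores);
```

Here two of the three extracted entities are correct (precision 2/3) and two of the four known entities were found (recall 1/2), so F1 works out to 4/7, roughly 57%.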

Entity types evaluated:

Type	Description	Example
ORG	Organizations	"Apple Inc.", "FDIC"
PERSON	People names	"Tim Cook", "Jamie Dimon"
MONEY	Monetary values	"$394.3 billion", "£8.4 billion"
DATE	Dates/periods	"January 26, 2023", "Q3 2023"
GPE	Countries/cities	"California", "London"
PERCENT	Percentages	"7.8%", "24.5%"
CARDINAL	Numbers	"1.5 million", "30"

Interpreting results:

🟢 Good (F1 ≥ 70%): Entity type is reliably extracted
🟡 Fair (40% ≤ F1 < 70%): Partial extraction, may need refinement
🔴 Needs Work (F1 < 40%): Consider custom NER rules or training
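
The bands above can be expressed as a small helper. This is only a sketch: rateF1 is a hypothetical name, and it assumes F1 is passed as a fraction between 0 and 1 rather than as a percentage.

```javascript
// Minimal sketch: map an F1 score (0..1) to the rating bands listed above.
// Assumption: 0.7 and 0.4 are inclusive lower bounds, matching "F1 ≥ 70%" and "F1 ≥ 40%".
function rateF1(f1) {
  if (f1 >= 0.7) return 'Good';
  if (f1 >= 0.4) return 'Fair';
  return 'Needs Work';
}

// Illustrative values only.
for (const score of [0.82, 0.55, 0.31]) {
  console.log(`${(score * 100).toFixed(0)}% -> ${rateF1(score)}`);
}
```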

Scripts

scripts/startTika.js: Starts Tika server with config.

=======================================================================================================================================

Reference Material

Tokenization Concepts
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Stopword Removal
- Common English stopwords list based on NLTK (Natural Language Toolkit)
- Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.

PDF Processing

pdf-parse Library
- GitHub: https://github.com/modesty/pdf-parse
- Uses Mozilla's PDF.js for parsing

Tika, OCR, and Tesseract

Apache Tika
- Download: https://tika.apache.org/download.html
- Server: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
Tesseract OCR (Windows builds)
- Downloads: https://github.com/UB-Mannheim/tesseract/wiki
Tesseract OCR (Official)
- Project: https://github.com/tesseract-ocr/tesseract

JPEG2000 OCR Support (Scanned PDFs)

Some scanned PDFs use JPEG2000 (JP2) images. To OCR these, add the JAI Image I/O JARs:

- Download jai-imageio-core-*.jar and jai-imageio-jpeg2000-*.jar
- Place both files in server/lib
- Restart Tika (npm run tika)
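
Before restarting Tika, it can help to confirm that both JARs are actually in place. The following is a minimal sketch, not part of the project: findMissingJars is a hypothetical helper, and it assumes the script is run from the repository root so that server/lib resolves correctly.

```javascript
// Minimal sketch: report which JAI Image I/O JARs are missing from a lib directory.
// Assumption: JAR file names start with the prefixes below and end in ".jar".
const fs = require('node:fs');
const path = require('node:path');

const REQUIRED_PREFIXES = ['jai-imageio-core-', 'jai-imageio-jpeg2000-'];

function findMissingJars(libDir) {
  // A missing directory is treated the same as an empty one.
  const files = fs.existsSync(libDir) ? fs.readdirSync(libDir) : [];
  return REQUIRED_PREFIXES.filter(
    (prefix) => !files.some((name) => name.startsWith(prefix) && name.endsWith('.jar'))
  );
}

const missing = findMissingJars(path.join('server', 'lib'));
console.log(
  missing.length === 0
    ? 'Both JPEG2000 JARs found in server/lib'
    : `Missing from server/lib: ${missing.map((p) => `${p}*.jar`).join(', ')}`
);
```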

React (W3Schools): https://www.w3schools.com/react/
SQL (W3Schools): https://www.w3schools.com/sql/
Node.js (GeeksforGeeks): https://www.geeksforgeeks.org/nodejs/
Express.js (GeeksforGeeks): https://www.geeksforgeeks.org/express-js/
JWT (GeeksforGeeks): https://www.geeksforgeeks.org/json-web-token-jwt/
bcrypt (GeeksforGeeks): https://www.geeksforgeeks.org/bcrypt-hashing-in-nodejs/

Web Development Frameworks

React Documentation
- Official Docs: https://react.dev/
- React Hooks: https://react.dev/reference/react
Express.js - Web framework for Node.js
- Official Guide: https://expressjs.com/
Sequelize ORM
- Documentation: https://sequelize.org/docs/v6/

UI/UX Design

Bootstrap 5
- Documentation: https://getbootstrap.com/docs/5.0/
- Icons: https://icons.getbootstrap.com/
Component-Based Architecture
- Fowler, M. (2003). "Patterns of Enterprise Application Architecture"

Authentication & Security

JSON Web Tokens (JWT)
- jwt.io: https://jwt.io/introduction
bcrypt - Password hashing
- GitHub: https://github.com/kelektiv/node.bcrypt.js

File Upload Handling

Multer - Node.js middleware for multipart/form-data
- Documentation: https://github.com/expressjs/multer

Data Retention & Scheduling

node-cron - Scheduled jobs in Node.js
- Documentation: https://www.npmjs.com/package/node-cron

File Upload Security (Validation & Sanitization)

OWASP File Upload Cheat Sheet
- Guidance: https://cheatsheetseries.owasp.org/cheatsheets/File_Upload_Cheat_Sheet.html
file-type - Detect file signature (magic bytes)
- Documentation: https://www.npmjs.com/package/file-type
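
To illustrate what a magic-byte check does, the sketch below tests whether a buffer starts with the PDF signature (%PDF-). It is a hand-rolled example, not the file-type API and not necessarily how server/routes/documents.js validates uploads; looksLikePdf is a hypothetical name.

```javascript
// Minimal sketch: check the leading bytes of an upload against the PDF signature.
// A client-supplied file extension or MIME type can be spoofed; the first bytes
// of the content are harder to fake without breaking the file.
const PDF_SIGNATURE = Buffer.from('%PDF-', 'latin1');

function looksLikePdf(buffer) {
  return (
    buffer.length >= PDF_SIGNATURE.length &&
    buffer.subarray(0, PDF_SIGNATURE.length).equals(PDF_SIGNATURE)
  );
}

// Illustrative inputs only.
console.log(looksLikePdf(Buffer.from('%PDF-1.7\n', 'latin1'))); // PDF header
console.log(looksLikePdf(Buffer.from('MZ', 'latin1')));         // not a PDF header
```

A signature check of this kind is only one layer; the OWASP cheat sheet above also covers size limits, file name sanitization, and storage location.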
