L00172671 - Oisin Gibson
- package.json: Workspace scripts and shared deps.
- README.md: Project notes and references.
- server/package.json: Server deps and scripts.
- server/server.js: Express app, DB init, routes, cleanup job.
- server/createAdmin.js: One‑off admin user creator.
- server/tika-config.xml: Tika OCR configuration.
- server/models/User.js: User schema + auth helpers.
- server/models/Document.js: Document schema + NLP fields.
- server/routes/auth.js: Login and registration endpoints.
- server/routes/users.js: Admin user management endpoints.
- server/routes/documents.js: Upload, download, NLP endpoints.
- server/middleware/auth.js: JWT auth guard.
- server/services/nlpProcessor.js: Text extraction + NLP pipeline.
- server/tests/auth.test.js: Auth route tests.
- server/tests/documents.test.js: Document route tests.
- server/tests/users.test.js: User route tests.
- server/tests/nerAccuracy.test.js: NER accuracy evaluation tests.
The nerAccuracy.test.js file evaluates entity extraction quality against sample financial documents with known ground truth.
Running the tests:
# With Jest (requires NLP service running on port 8000)
cd server
npm test -- nerAccuracy.test.js
# Standalone mode (detailed output)
node server/tests/nerAccuracy.test.js --standalone
Metrics measured:
- Precision: Of entities extracted, what percentage were correct?
- Recall: Of known entities, what percentage were found?
- F1 Score: Harmonic mean of precision and recall (a single balanced measure of extraction quality)
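As a rough illustration of how these three metrics relate, the sketch below scores a list of extracted entities against a ground-truth list. The function name `scoreEntities`, the `"TYPE:text"` string encoding, and exact-match comparison are all assumptions made for this example; the actual test file may match entities differently (for instance, case-insensitively or by partial overlap).

```javascript
// Hypothetical helper (not taken from nerAccuracy.test.js): computes
// precision, recall and F1 using exact string matching on "TYPE:text" keys.
function scoreEntities(extracted, expected) {
  const extractedSet = new Set(extracted);
  const expectedSet = new Set(expected);

  // True positives: extracted entities that also appear in the ground truth.
  let truePositives = 0;
  for (const entity of extractedSet) {
    if (expectedSet.has(entity)) truePositives += 1;
  }

  // Guard against division by zero when either list is empty.
  const precision = extractedSet.size === 0 ? 0 : truePositives / extractedSet.size;
  const recall = expectedSet.size === 0 ? 0 : truePositives / expectedSet.size;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);

  return { truePositives, precision, recall, f1 };
}

// Sample data is invented for illustration only.
const sampleExpected = [
  "ORG:Apple Inc.",
  "PERSON:Tim Cook",
  "MONEY:$394.3 billion",
  "GPE:California",
];
const sampleExtracted = ["ORG:Apple Inc.", "PERSON:Tim Cook", "GPE:London"];

const sampleScore = scoreEntities(sampleExtracted, sampleExpected);
console.log(sampleScore);
```

Here two of the three extracted entities are correct (precision 2/3) and two of the four known entities were found (recall 1/2), which gives an F1 of 4/7, roughly 57%.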
Entity types evaluated:
| Type | Description | Example |
|---|---|---|
| ORG | Organizations | "Apple Inc.", "FDIC" |
| PERSON | People names | "Tim Cook", "Jamie Dimon" |
| MONEY | Monetary values | "$394.3 billion", "£8.4 billion" |
| DATE | Dates/periods | "January 26, 2023", "Q3 2023" |
| GPE | Countries/cities | "California", "London" |
| PERCENT | Percentages | "7.8%", "24.5%" |
| CARDINAL | Numbers | "1.5 million", "30" |
Interpreting results:
- 🟢 Good (F1 ≥ 70%): Entity type is reliably extracted
- 🟡 Fair (40% ≤ F1 < 70%): Partial extraction, may need refinement
- 🔴 Needs Work (F1 < 40%): Consider custom NER rules or training
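The thresholds above could be applied in code along these lines. This is a minimal sketch: `rateF1` is a hypothetical name, and it assumes F1 is passed as a fraction between 0 and 1 rather than as a percentage.

```javascript
// Hypothetical helper: maps an F1 score (0..1) to the rating bands listed above.
function rateF1(f1) {
  if (f1 >= 0.7) return "Good";       // F1 >= 70%
  if (f1 >= 0.4) return "Fair";       // 40% <= F1 < 70%
  return "Needs Work";                // F1 < 40%
}

console.log(rateF1(0.82)); // Good
console.log(rateF1(0.55)); // Fair
console.log(rateF1(0.12)); // Needs Work
```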
- server/contracts/nlpResults.json: NLP results JSON contract.
- client/package.json: Client deps and scripts.
- client/public/index.html: HTML entry point.
- client/public/manifest.json: PWA metadata.
- client/src/index.js: React bootstrap.
- client/src/App.js: Routes and app layout.
- client/src/config.js: API base URL.
- client/src/index.css: Global styles.
- client/src/App.css: App styles.
- client/src/reportWebVitals.js: Perf reporting.
- client/src/setupTests.js: Test setup.
- client/src/components/AdminPanel.js: Admin user management UI.
- client/src/components/AlertMessage.js: Reusable alert UI.
- client/src/components/Dashboard.js: Main dashboard UI.
- client/src/components/DocumentCard.js: Document list card.
- client/src/components/Documents.js: Document list page.
- client/src/components/DocumentStatistics.js: Document stats UI.
- client/src/components/EmptyDocuments.js: Empty state UI.
- client/src/components/FileDropZone.js: Drag‑and‑drop upload area.
- client/src/components/Footer.js: Footer UI.
- client/src/components/Header.js: Header/nav UI.
- client/src/components/Login.js: Login form.
- client/src/components/Logo.js: Logo SVG.
- client/src/components/NLPAnalysis.js: NLP modal container.
- client/src/components/NLPAnalysisView.js: NLP modal UI.
- client/src/components/NLPAnalysis.styles.js: NLP modal styles.
- client/src/components/Register.js: Registration form.
- client/src/components/SelectedFileCard.js: Selected file preview.
- client/src/components/UploadDocument.js: Upload page.
- client/src/components/UploadGuidelines.js: Upload tips UI.
- client/src/hooks/useAlert.js: Alert state hook.
- client/src/hooks/useDocuments.js: Documents data hook.
- client/src/hooks/useFileUpload.js: Upload state hook.
- client/src/utils/alertUtils.js: Alert helpers.
- client/src/utils/documentUtils.js: Document helpers.
- client/src/utils/fileUtils.js: File helpers.
- scripts/startTika.js: Starts Tika server with config.
Reference Material
- Tokenization Concepts
  - Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
- Stopword Removal
  - Common English stopwords list based on NLTK (Natural Language Toolkit)
  - Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
- pdf-parse Library
  - GitHub: https://github.com/modesty/pdf-parse
  - Uses Mozilla's PDF.js for parsing
- Apache Tika
- Tesseract OCR (Windows builds)
- Tesseract OCR (Official)
Some scanned PDFs use JPEG2000 (JP2) images. To OCR these, add the JAI Image I/O JARs:
- Download: jai-imageio-core-*.jar and jai-imageio-jpeg2000-*.jar
- Place both files in server/lib
- Restart Tika (npm run tika)
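Before restarting Tika, it can help to confirm that both JARs are actually in place. The following is a small sketch of such a check, not part of the project's scripts; the names `findMissingJars` and `checkLibDir` are hypothetical, and the directory path is taken from the step above.

```javascript
const fs = require("node:fs");

// The two JAR name patterns named in the steps above (version part varies).
const REQUIRED_JARS = [
  { label: "jai-imageio-core-*.jar", pattern: /^jai-imageio-core-.*\.jar$/ },
  { label: "jai-imageio-jpeg2000-*.jar", pattern: /^jai-imageio-jpeg2000-.*\.jar$/ },
];

// Returns the labels of required JARs that are absent from a list of file names.
function findMissingJars(fileNames) {
  return REQUIRED_JARS
    .filter(({ pattern }) => !fileNames.some((name) => pattern.test(name)))
    .map(({ label }) => label);
}

// Reads a directory (treated as empty if it does not exist) and reports missing JARs.
function checkLibDir(libDir) {
  const fileNames = fs.existsSync(libDir) ? fs.readdirSync(libDir) : [];
  return findMissingJars(fileNames);
}

// Assumes the script is run from the repository root.
const missingJars = checkLibDir("server/lib");
if (missingJars.length > 0) {
  console.warn("Missing JAI Image I/O JARs:", missingJars.join(", "));
} else {
  console.log("Both JAI Image I/O JARs found in server/lib");
}
```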
- React (W3Schools): https://www.w3schools.com/react/
- SQL (W3Schools): https://www.w3schools.com/sql/
- Node.js (GeeksforGeeks): https://www.geeksforgeeks.org/nodejs/
- Express.js (GeeksforGeeks): https://www.geeksforgeeks.org/express-js/
- JWT (GeeksforGeeks): https://www.geeksforgeeks.org/json-web-token-jwt/
- bcrypt (GeeksforGeeks): https://www.geeksforgeeks.org/bcrypt-hashing-in-nodejs/
- React Documentation
  - Official Docs: https://react.dev/
  - React Hooks: https://react.dev/reference/react
- Express.js - Web framework for Node.js
  - Official Guide: https://expressjs.com/
- Sequelize ORM
  - Documentation: https://sequelize.org/docs/v6/
- Bootstrap 5
  - Documentation: https://getbootstrap.com/docs/5.0/
  - Icons: https://icons.getbootstrap.com/
- Component-Based Architecture
  - Fowler, M. (2003). "Patterns of Enterprise Application Architecture"
- JSON Web Tokens (JWT)
  - jwt.io: https://jwt.io/introduction
- bcrypt - Password hashing
- Multer - Node.js middleware for multipart/form-data
  - Documentation: https://github.com/expressjs/multer
- node-cron - Scheduled jobs in Node.js
  - Documentation: https://www.npmjs.com/package/node-cron
- OWASP File Upload Cheat Sheet
- file-type - Detect file signature (magic bytes)
  - Documentation: https://www.npmjs.com/package/file-type
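For readers unfamiliar with the term, "magic bytes" are fixed byte sequences at the start of a file that identify its format regardless of the file extension. The sketch below illustrates the idea with a hand-written check for three common signatures. It does not use the file-type package or reproduce its API; `sniffMimeType` and the signature table are hypothetical and cover only a tiny subset of formats.

```javascript
// Illustrative subset of well-known leading byte signatures.
const SIGNATURES = [
  { type: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { type: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { type: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
];

// Returns the MIME type whose signature matches the start of the buffer,
// or null if none of the known signatures match.
function sniffMimeType(buffer) {
  const match = SIGNATURES.find(
    ({ bytes }) =>
      buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)
  );
  return match ? match.type : null;
}

console.log(sniffMimeType(Buffer.from("%PDF-1.7\n")));       // application/pdf
console.log(sniffMimeType(Buffer.from("plain text upload"))); // null
```

A check like this is useful on upload because a renamed file (for example, an executable saved as report.pdf) will not carry the expected signature.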