file-processing

file-processing is a Python library that offers a single interface for processing many different file types. Its unified File class uses the Strategy Pattern to select the appropriate processor based on the file extension. The library ships with more than 20 file processors and is designed to be easy to extend.

Features

Unified File Interface: Interact with different file types using a single File class.
Strategy Pattern Implementation: Dynamically selects the appropriate processor based on file extensions.
Extensible Design: Easily add support for new file types by creating custom processors.
Metadata Extraction: Extracts comprehensive metadata and text content from files.
Lazy Loading for Optional Features: Supports optional OCR and transcription capabilities via file-processing-ocr and file-processing-transcription.

Installation

The file-processing library is not packaged yet, so install it from GitHub by passing the repository URL to pip:

pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing.git

Note: Optional dependencies for OCR and transcription are available through file-processing-ocr and file-processing-transcription.

Quick Start

Here's how to get started with file-processing:

from file_processing import File

# Initialize a File object
file = File('path/to/your/file.pdf')

# Access metadata
print(f"File Name: {file.file_name}")
print(f"File Size: {file.size} bytes")
print(f"Owner: {file.owner}")

# Access extracted text (if applicable)
print(f"Text Content: {file.metadata.get('text', 'No text extracted')}")

Supported File Types

The library supports a wide range of file types:

Text-Based Files: .txt, .csv, .json, .xml, .html, .py, .ipynb, .gitignore
Documents: .pdf, .docx, .rtf, .xlsx, .pptx, .msg
Images: .png, .jpg, .jpeg, .gif, .tif, .tiff, .heic, .heif
Audio/Video Files: .mp3, .wav, .mp4, .flac, .aiff, .ogg
Archives: .zip
Packaged Software: .whl, .exe
Model Files: .gguf (used with file-processing-models)
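
As a rough illustration of extension-based lookup, the sketch below maps a path to one of the categories listed above. The `SUPPORTED_EXTENSIONS` table and the `category_for` helper are hypothetical names written for this example and are not part of the library's API; the dotfile handling is likewise an assumption about how a name such as `.gitignore` could be matched.

```python
from pathlib import Path
from typing import Optional

# Categories and extensions copied from the list above (illustrative table only).
SUPPORTED_EXTENSIONS = {
    "Text-Based Files": {".txt", ".csv", ".json", ".xml", ".html", ".py", ".ipynb", ".gitignore"},
    "Documents": {".pdf", ".docx", ".rtf", ".xlsx", ".pptx", ".msg"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".heic", ".heif"},
    "Audio/Video Files": {".mp3", ".wav", ".mp4", ".flac", ".aiff", ".ogg"},
    "Archives": {".zip"},
    "Packaged Software": {".whl", ".exe"},
    "Model Files": {".gguf"},
}


def category_for(path: str) -> Optional[str]:
    """Return the category for a path, or None if its extension is not listed."""
    p = Path(path)
    # Path(".gitignore").suffix is "", so fall back to the full name for dotfiles.
    extension = (p.suffix or p.name).lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if extension in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ("report.PDF", ".gitignore", "notes.unknown"):
        print(name, "->", category_for(name))
```

The lookup lowercases the extension so that `report.PDF` and `report.pdf` are treated alike; whether the library itself is case-insensitive is not stated here.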

Optional Features

The file-processing library can be extended with OCR and transcription capabilities by installing additional packages:

OCR: file-processing-ocr
Transcription: file-processing-transcription
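
One common way to support optional packages like these is to import them only when they are present. The sketch below shows that general technique with the standard library; it is not the library's actual loading code, `load_optional` is a hypothetical helper, and the import name `file_processing_ocr` is an assumption based on the package name.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a top-level module if it is installed, otherwise return None."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


if __name__ == "__main__":
    # "file_processing_ocr" is an assumed import name for file-processing-ocr.
    ocr = load_optional("file_processing_ocr")
    if ocr is None:
        print("OCR support is not installed; continuing without it.")
    else:
        print("OCR support is available.")
```

Checking with `find_spec` before importing keeps the core package usable when the optional extras are absent.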

Architecture

The library uses the Strategy Pattern to select the appropriate processor based on the file extension. Here's how it works:

The File class acts as a context that delegates the processing to a specific FileProcessorStrategy.
Each file type has a corresponding processor class that implements the FileProcessorStrategy interface.
If a file type is not explicitly supported, a GenericFileProcessor is used as a fallback.
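
The three points above can be sketched in a few lines of self-contained Python. This is a simplified illustration of the pattern, not the library's source: `ProcessorStrategy`, `TextProcessor`, `GenericProcessor`, and `FileContext` are hypothetical stand-ins for `FileProcessorStrategy`, a concrete processor, `GenericFileProcessor`, and `File`.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ProcessorStrategy(ABC):
    """Stand-in for the strategy interface that every processor implements."""

    def __init__(self, file_path: str) -> None:
        self.file_path = Path(file_path)
        self.metadata: dict = {}

    @abstractmethod
    def process(self) -> None:
        """Populate self.metadata for this file type."""


class TextProcessor(ProcessorStrategy):
    def process(self) -> None:
        text = self.file_path.read_text(encoding="utf-8")
        self.metadata = {"text": text, "num_lines": len(text.splitlines())}


class GenericProcessor(ProcessorStrategy):
    """Fallback used when no processor is registered for the extension."""

    def process(self) -> None:
        self.metadata = {"message": "No dedicated processor for this file type."}


class FileContext:
    """Context object: picks a strategy by extension and delegates to it."""

    PROCESSORS = {".txt": TextProcessor}

    def __init__(self, file_path: str) -> None:
        extension = Path(file_path).suffix.lower()
        processor_class = self.PROCESSORS.get(extension, GenericProcessor)
        self.processor = processor_class(file_path)
        self.processor.process()

    @property
    def metadata(self) -> dict:
        return self.processor.metadata
```

Because the context only depends on the strategy interface, supporting a new extension in this sketch means adding one class and one dictionary entry, with no change to the calling code.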

Extending the Library

To add support for a new file type:

Create a New Processor Class:

from typing import Optional

from file_processing.file_processor_strategy import FileProcessorStrategy

class CustomFileProcessor(FileProcessorStrategy):
    def __init__(self, file_path: str, open_file: bool = True) -> None:
        super().__init__(file_path, open_file)
        # Holds the metadata that process() extracts from the file
        self.metadata = {}

    def process(self) -> None:
        # Implement processing logic: read the file and fill self.metadata
        pass

    def save(self, output_path: Optional[str] = None) -> None:
        # Implement save logic: write the file to output_path
        pass

Register the New Processor in file.py:

Add your new processor to the PROCESSORS dictionary in file_processing/file.py:
```
File.PROCESSORS['.custom_extension'] = CustomFileProcessor
```
Update the __init__.py File:

Add an import statement for your new processor in file_processing/processors/__init__.py:
```
from .custom_processor import CustomFileProcessor
```

Following these steps ensures your new processor is correctly integrated with the file-processing library.
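
To make the `process` and `save` placeholders more concrete, here is a minimal sketch of what a CSV processor could look like. It runs on its own by using `StandInStrategy`, a hypothetical stand-in that only mirrors the constructor signature shown above; the metadata keys, the call to `process()` from the constructor, and the copy-on-save behaviour are all assumptions rather than the library's documented behaviour.

```python
import csv
import shutil
from pathlib import Path
from typing import Optional


class StandInStrategy:
    """Hypothetical stand-in for FileProcessorStrategy so the sketch runs alone."""

    def __init__(self, file_path: str, open_file: bool = True) -> None:
        self.file_path = Path(file_path)
        self.open_file = open_file
        self.metadata: dict = {}


class CsvSummaryProcessor(StandInStrategy):
    def __init__(self, file_path: str, open_file: bool = True) -> None:
        super().__init__(file_path, open_file)
        # Assumption: processing happens at construction when open_file is True.
        if self.open_file:
            self.process()

    def process(self) -> None:
        # Read every row; the first row is treated as the header.
        with self.file_path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        self.metadata = {
            "columns": rows[0] if rows else [],
            "num_rows": max(len(rows) - 1, 0),
        }

    def save(self, output_path: Optional[str] = None) -> None:
        # Copy the file to output_path; do nothing when no path is given.
        if output_path is not None:
            shutil.copyfile(self.file_path, output_path)
```

In a real processor the class would inherit from `FileProcessorStrategy` instead of the stand-in and be registered for its extension as described in the steps above.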

Contributing

We welcome contributions from the community. If you'd like to contribute:

Fork the Repository: Create your own fork on GitHub.
Create a Feature Branch: Work on your feature or bug fix in a separate branch.
Write Tests: Ensure your changes are covered by tests.
Submit a Pull Request: When you're ready, submit a PR for review.

License

This project is licensed under the MIT License.

Contact

For questions or support, please contact:

Email: ocdo-bdpd@hc-sc.gc.ca

We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repository, contribute, or get in touch to learn more about our work.
