A versatile Python library for generalizing and streamlining the processing of diverse file types. It provides a unified File class that uses the Strategy Pattern to select appropriate processors based on file extensions. The library supports over 20 unique file processors and is designed for easy extensibility.
- Features
- Installation
- Quick Start
- Supported File Types
- Optional Features
- Architecture
- Extending the Library
- Contributing
- License
- Contact
- Unified File Interface: Interact with different file types using a single File class.
- Strategy Pattern Implementation: Dynamically selects the appropriate processor based on file extensions.
- Extensible Design: Easily add support for new file types by creating custom processors.
- Metadata Extraction: Extracts comprehensive metadata and text content from files.
- Lazy Loading for Optional Features: Supports optional OCR and transcription capabilities via file-processing-ocr and file-processing-transcription.
To install the file-processing library from GitHub (since it's not packaged yet), use pip with the repository URL:

    pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing.git

Note: Optional dependencies for OCR and transcription are available through file-processing-ocr and file-processing-transcription.
Here's how to get started with file-processing:

    from file_processing import File

    # Initialize a File object
    file = File('path/to/your/file.pdf')

    # Access metadata
    print(f"File Name: {file.file_name}")
    print(f"File Size: {file.size} bytes")
    print(f"Owner: {file.owner}")

    # Access extracted text (if applicable)
    print(f"Text Content: {file.metadata.get('text', 'No text extracted')}")

The library supports a wide range of file types:
- Text-Based Files: .txt, .csv, .json, .xml, .html, .py, .ipynb, .gitignore
- Documents: .pdf, .docx, .rtf, .xlsx, .pptx, .msg
- Images: .png, .jpg, .jpeg, .gif, .tif, .tiff, .heic, .heif
- Audio/Video Files: .mp3, .wav, .mp4, .flac, .aiff, .ogg
- Archives: .zip
- Packaged Software: .whl, .exe
- Model Files: .gguf (used with file-processing-models)
The file-processing library can be extended with OCR and transcription capabilities by installing additional packages:
- OCR: file-processing-ocr
- Transcription: file-processing-transcription
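Before relying on an optional feature, it can help to check whether its package is present. The snippet below is a minimal sketch using the standard library's importlib.util.find_spec. The import names file_processing_ocr and file_processing_transcription are assumptions derived from the package names, not confirmed by this README, so adjust them to whatever the installed packages actually expose.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names (hypothetical); check each package's own docs.
OPTIONAL_MODULES = {
    "OCR": "file_processing_ocr",
    "Transcription": "file_processing_transcription",
}

for feature, module_name in OPTIONAL_MODULES.items():
    status = "available" if is_installed(module_name) else "not installed"
    print(f"{feature}: {status}")
```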
The library utilizes the Strategy Pattern to select the appropriate processor based on the file extension. Here's how it works:
- The File class acts as a context that delegates the processing to a specific FileProcessorStrategy.
- Each file type has a corresponding processor class that implements the FileProcessorStrategy interface.
- If a file type is not explicitly supported, a GenericFileProcessor is used as a fallback.
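The dispatch described above can be sketched in a few lines. This is an illustration of the pattern only, not the library's actual code: ProcessorStrategy, TextProcessor, FallbackProcessor, and SketchFile are hypothetical stand-ins for FileProcessorStrategy, a concrete processor, GenericFileProcessor, and File respectively.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ProcessorStrategy(ABC):
    """Stand-in for the strategy interface (hypothetical)."""

    def __init__(self, file_path: str) -> None:
        self.file_path = Path(file_path)
        self.metadata: dict = {}

    @abstractmethod
    def process(self) -> None:
        """Populate self.metadata for this file type."""


class TextProcessor(ProcessorStrategy):
    def process(self) -> None:
        self.metadata["kind"] = "text"


class FallbackProcessor(ProcessorStrategy):
    """Used when no processor is registered for the extension."""

    def process(self) -> None:
        self.metadata["kind"] = "generic"


class SketchFile:
    """Context: picks a strategy from the extension and delegates to it."""

    PROCESSORS = {".txt": TextProcessor}

    def __init__(self, file_path: str) -> None:
        extension = Path(file_path).suffix.lower()
        strategy_cls = self.PROCESSORS.get(extension, FallbackProcessor)
        self.processor = strategy_cls(file_path)
        self.processor.process()

    @property
    def metadata(self) -> dict:
        return self.processor.metadata


print(SketchFile("notes.txt").metadata)     # handled by TextProcessor
print(SketchFile("blob.unknown").metadata)  # handled by FallbackProcessor
```

Keeping the extension-to-class mapping in a dictionary means that supporting a new type is a one-line registration rather than a change to the context class.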
To add support for a new file type:
- Create a New Processor Class:

      from file_processing.file_processor_strategy import FileProcessorStrategy

      class CustomFileProcessor(FileProcessorStrategy):
          def __init__(self, file_path: str, open_file: bool = True) -> None:
              super().__init__(file_path, open_file)
              self.metadata = {}

          def process(self) -> None:
              # Implement processing logic
              pass

          def save(self, output_path: str = None) -> None:
              # Implement save logic
              pass

- Register the New Processor in file.py: Add your new processor to the PROCESSORS dictionary in file_processing/file.py:

      File.PROCESSORS['.custom_extension'] = CustomFileProcessor

- Update the __init__.py File: Add an import statement for your new processor in file_processing/processors/__init__.py:

      from .custom_processor import CustomFileProcessor
Following these steps ensures your new processor is correctly integrated with the file-processing library.
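As a worked illustration of what a filled-in processor might look like, here is a minimal sketch for a made-up .lines format. So that it runs without the library installed, it subclasses FileProcessorStrategyStandIn, a hypothetical stand-in that only mirrors the constructor signature shown in the steps above; in a real processor you would subclass FileProcessorStrategy instead. The metadata keys and the behaviour of save are assumptions chosen for the example.

```python
import os
import tempfile
from typing import Optional


class FileProcessorStrategyStandIn:
    """Hypothetical stand-in so the sketch runs without the library."""

    def __init__(self, file_path: str, open_file: bool = True) -> None:
        self.file_path = file_path
        self.open_file = open_file


class LineCountProcessor(FileProcessorStrategyStandIn):
    """Example processor for a made-up '.lines' text format."""

    def __init__(self, file_path: str, open_file: bool = True) -> None:
        super().__init__(file_path, open_file)
        self.metadata = {}

    def process(self) -> None:
        # Skip reading when the caller asked not to open the file.
        if not self.open_file:
            return
        with open(self.file_path, encoding="utf-8") as handle:
            text = handle.read()
        self.metadata["text"] = text
        self.metadata["num_lines"] = len(text.splitlines())

    def save(self, output_path: Optional[str] = None) -> None:
        # Write the extracted text back out, defaulting to the source path.
        target = output_path or self.file_path
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(self.metadata.get("text", ""))


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = os.path.join(tmp_dir, "sample.lines")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("alpha\nbeta\n")

    processor = LineCountProcessor(sample)
    processor.process()
    print(processor.metadata["num_lines"])  # 2
```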
We welcome contributions from the community. If you'd like to contribute:
- Fork the Repository: Create your own fork on GitHub.
- Create a Feature Branch: Work on your feature or bug fix in a separate branch.
- Write Tests: Ensure your changes are covered by tests.
- Submit a Pull Request: When you're ready, submit a PR for review.
This project is licensed under the MIT License.
For questions or support, please contact:
- Email: ocdo-bdpd@hc-sc.gc.ca
We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repository, contribute, or get in touch to learn more about our work.