This repository contains reference implementations developed by the Vector AI Engineering team, focused on advancing multimodal learning across structured data, image, audio, and text.
This project explores cutting-edge techniques in multimodal AI through the following key areas:
- Multimodal Representation Learning: Learning representations from multiple modalities for improved understanding and downstream tasks.
- Table Question Answering: Extending Retrieval-Augmented Generation (RAG) to structured data for intelligent question answering and table summarization.
- Vision-Language Models (VLMs): Enhancing document understanding by integrating visual layouts with textual representations.
- Audio-Language Models (ALMs): Fusing audio and text inputs to improve speech and language understanding tasks.
`implementations/`: Implementations are organized by topic. Each topic has its own directory containing notebooks and a README for guidance.
To begin working with this repository:
- Clone this repository to your local environment.
- Explore each topic in the `implementations/` directory, guided by the respective README files.
- Follow the instructions in each topic's README to set up the environment.
- Run the notebooks in the topic directory (a minimal shell sketch of these steps follows below).
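
A rough sketch of the steps above from the command line is shown here. The repository URL, directory names, and the Python environment commands are placeholders and assumptions; each topic's README remains the authoritative setup guide.

```bash
# Hedged sketch of the getting-started workflow; placeholders in <angle brackets>
# are hypothetical and the environment commands are assumptions.
git clone <repository-url>            # clone this repository to your local environment
cd <repository-name>/implementations

ls                                    # list the available topics
cd <topic-of-interest>                # each topic directory has its own README

# Environment setup varies by topic; a typical Python workflow might be:
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt       # if the topic provides a requirements file
jupyter lab                           # open and run the topic's notebooks
```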
For more information or help navigating this repository, please contact members of the Vector AI Engineering team:
- Vahid Reza Khazaie — vahidreza.khazaie@vectorinstitute.ai
- Mahshid Alinoori — mahshid.alinoori@vectorinstitute.ai
- Aravind Narayanan — aravind.narayanan@vectorinstitute.ai