Towards Multi-Modal Interactive Systems That Connect Audio, Vision and Beyond