Multimodal Learning from Videos: Self-Supervised Pre-Training, Post-Training Alignment, and Benchmarks