Describe: Modular dual-stream visual fusion network for visual question answering