Describe: SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text