Describe: Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention