Adapting the two-branch network to video-text tasks