Does the agent say what it sees: an analysis in 3D question answering