January 29, 2024

How Important is a Golden Dataset for LLM Evaluation?

Technical guide

Thank you to all of those who read our prior posts on the Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval, Part 2: Generation).

In this article we want to address a key topic many of you reached out to discuss: whether you need a reference dataset for evaluation. The short answer is YES, and we will walk through the pros and cons of different LLM evaluation methodologies. We will also briefly touch on how to curate a golden dataset (which can start with synthetic generation) and what makes a golden dataset golden.

Three ways to evaluate your LLM pipelines

1) Reference-Free

Easiest to get started with (5 minutes to first results), but it only offers partial and often inconsistent insights into the pipeline.

2) Synthetic-Dataset-Based

A shortcut to consistent and complete metrics. The dataset can be biased and unrepresentative at first, but it can be improved over time.

3) Golden-Dataset-Based

Best approach! Takes time to curate initially but offers the most reliable and comprehensive insights into your LLM pipeline.

Our Recommendation

  • Start getting fast, complete, and directional insights using reference-free and synthetic-dataset-based metrics.
  • Curate and improve the Golden Dataset over time (using the Synthetic Dataset as a baseline) to get the most reliable evaluation pipeline.

Reference-Free Evaluation

Reference-free evaluation means that you are not evaluating against any benchmark or expected output. For example, you can directly compare the question and the answer and assess whether the answer directly addresses the question (answer_relevance).
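To make this concrete, here is a minimal sketch of a reference-free answer_relevance check that uses an LLM as a judge. The call_llm helper is a placeholder for whichever client you use to reach your LLM provider (it is not part of any specific library), and the 1-to-5 rubric is just one reasonable choice.

```python
# A minimal sketch of a reference-free answer_relevance check using an LLM as a judge.
# `call_llm` is a placeholder for your own LLM client, not part of any specific library.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your LLM provider (OpenAI, Anthropic, ...) here")

JUDGE_PROMPT = """You are grading whether an answer addresses a question.
Question: {question}
Answer: {answer}
Reply with a single digit from 1 (irrelevant) to 5 (fully addresses the question)."""

def answer_relevance(question: str, answer: str) -> float:
    """Return a 0-1 score for how directly the answer addresses the question."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip()[0])   # naive parsing; harden this for production use
    return (score - 1) / 4        # normalize the 1-5 rubric to 0-1
```

Note that nothing here requires a reference answer or reference contexts, which is exactly what makes this class of metrics so easy to start with.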

Pros:

  • Quick to start. It takes less than 5 minutes to get the first evaluation results.
  • Production monitoring. Reference-free metrics can be used to monitor production data and track trends. Metrics such as faithfulness don't require any references and are very helpful to watch over production traffic; a minimal sketch of such a check follows this list.
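Below is a rough, purely deterministic faithfulness proxy of the kind you could run cheaply over production traffic: the share of answer sentences whose tokens are mostly covered by the retrieved context. It is an illustration only; libraries such as continuous-eval provide more principled faithfulness metrics.

```python
# A rough, deterministic faithfulness proxy suitable for cheap production monitoring:
# the share of answer sentences whose tokens are mostly covered by the retrieved context.
# Purely illustrative; dedicated evaluation libraries offer more principled implementations.
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def faithfulness(answer: str, retrieved_contexts: list[str], threshold: float = 0.7) -> float:
    """Fraction of answer sentences supported by the retrieved chunks (token overlap)."""
    ctx_tokens = _tokens(" ".join(retrieved_contexts))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        sent_tokens = _tokens(sentence)
        if not sent_tokens or len(sent_tokens & ctx_tokens) / len(sent_tokens) >= threshold:
            supported += 1
    return supported / len(sentences)
```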

Cons:

  • Partial insights. It only offers a partial look at the pipeline, painting an incomplete picture of its performance. There's no way to identify common issues such as the retrieval pipeline not returning all the necessary chunks (context_recall).
  • Lack of consistency. We see many evaluation pipelines apply reference-free metrics to different datasets across runs. This happens frequently when people try to use ever-changing production data for offline evaluation. The problem with this approach is that you cannot objectively compare results or draw conclusions about pipeline performance. As a result, optimizing over reference-free metrics on inconsistent data is like shooting at a moving target.

Synthetic-Dataset-Based Evaluation

As we get into the realm of Reference-Dataset-Based Evaluation, many benefits come with this approach. A Synthetic Dataset, often generated by LLMs with creative prompting techniques, is essentially a shortcut to a Reference Dataset without spending much effort.

Pros:

  • Consistent measurement. This solves the consistency problem mentioned above. With a consistent dataset to measure against, you can truly compare different design choices, such as chunking strategy, retrieval algorithm, or prompting, and understand their impact.
  • Relatively easy to get started. The only input required to generate a Synthetic Dataset is a sample knowledge source (in the form of source documents or vector embeddings). A proper synthetic generation pipeline layers generation, transformation, filtering, and auditing to create Question-Context-Answer tuples of the expected output. The LLM evaluation library continuous-eval offers a simple_dataset_generator (link) as one example strategy for generating a diverse synthetic dataset; a simplified sketch of the generate-and-filter idea follows this list.
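The sketch below shows only the gist of a generate-and-filter pipeline. It is not continuous-eval's simple_dataset_generator; the prompt, the filters, and the call_llm helper are all illustrative placeholders.

```python
# The gist of a generate-and-filter synthetic data pipeline. This is NOT the
# continuous-eval simple_dataset_generator, just an illustration of the idea.
# `call_llm` is the same hypothetical provider wrapper used earlier in this post.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your LLM provider here")

GEN_PROMPT = """Given the passage below, write one question a user could realistically ask
that the passage answers, plus an answer grounded only in the passage.
Return JSON with keys "question" and "answer".
Passage:
{chunk}"""

def generate_synthetic_examples(chunks: list[str]) -> list[dict]:
    examples = []
    for chunk in chunks:
        raw = call_llm(GEN_PROMPT.format(chunk=chunk))
        try:
            qa = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # filter: drop malformed generations
        if len(qa.get("question", "")) < 10:
            continue                      # filter: drop trivial or empty questions
        examples.append({
            "question": qa["question"],
            "ground_truth_contexts": [chunk],
            "ground_truth_answers": [qa["answer"]],
        })
    return examples
```

Real pipelines add more layers (question transformation, deduplication, difficulty balancing, human auditing), which is exactly where the cons below come from.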

Cons:

  • Quality varies. Depending on the specific use case, LLMs may or may not be able to generate representative queries that your users would ask. LLMs are inherently probabilistic, so they are biased toward certain examples and may not generate examples with sufficient variety.
  • Still requires human-in-the-loop audits. Human audits are almost always required to sense-check that the generated examples are directionally aligned with reality and user expectations.
  • No standard methodology. There are many creative ways to generate synthetic RAG data, but there is no one-size-fits-all pipeline yet. This is an open research problem, and we will share more about strategies beyond the simple_dataset_generator (link) in a separate article.

Golden-Dataset-Based Evaluation

This is the best possible way to get a comprehensive assessment of a pipeline. In the context of RAG, a Golden Dataset includes human-verified Question-Context-Answer examples: the user question, the reference contexts that should be retrieved, and the expected answer(s). A sketch of what one record might look like is shown below.
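Here is one possible shape for a single Golden Dataset record, together with a reference-based metric (context_recall) that such references unlock. The field names and the example content are illustrative, not a required schema.

```python
# A sketch of what a single Golden Dataset record for RAG might look like, and one
# reference-based metric (context_recall) it unlocks. Field names and the example
# content are illustrative, not a required schema.

golden_example = {
    "question": "What is the refund window for annual plans?",
    "ground_truth_contexts": [
        "Annual plans can be refunded in full within 30 days of purchase.",
    ],
    "ground_truth_answers": ["Annual plans are fully refundable within 30 days."],
}

def context_recall(retrieved_contexts: list[str], ground_truth_contexts: list[str]) -> float:
    """Fraction of ground-truth chunks that appear among the retrieved chunks (exact match)."""
    if not ground_truth_contexts:
        return 1.0
    hits = sum(gt in retrieved_contexts for gt in ground_truth_contexts)
    return hits / len(ground_truth_contexts)
```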

Pros:

  • Most consistent and reliable measures. All metrics that reveal various aspects of pipeline health can be confidently applied.
  • Most complete insights. Not only can you evaluate against the most complete suite of metrics, but Golden Datasets also reveal the subtleties of what makes a good output good. It is often difficult to enumerate why users prefer certain answers over others; by comparing the LLM response against the gold standard, you can capture most of these nuances with creative metrics.

Cons:

  • Takes more time and effort. Usually this is a joint effort across Engineers, PMs, Domain Experts, and potentially Users. What makes a Golden Dataset golden is that you have confidence that good performance on the Golden Dataset in offline evaluation highly correlates with user satisfaction in production. To get there, it requires careful design and engineering.
  • Requires maintenance. A well-designed Golden Dataset needs to be a living entity, continually evolving to capture a diverse array of scenarios and user preferences, ensuring that the pipeline’s optimization aligns closely with genuine user requirements.

Our Recommendation

Start Today! A combination of Reference-Free and Synthetic-Dataset-Based Evaluation is a great starting point. You can get a consistent and complete view over your pipeline in a relatively short amount of time.

Build and improve the Golden Dataset Over Time. There's simply no substitute for a high-quality, human-verified Golden Dataset. Use the Synthetic Dataset as a baseline and leverage user data and human feedback to build a robust, diverse test set that you can use to optimize your pipeline. A minimal sketch of one curation workflow follows below.
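One simple way to operationalize this is to track where each example came from and whether a human has verified it, promoting only verified examples into the Golden Dataset. The dataclass and fields below are illustrative assumptions, not a prescribed format.

```python
# A minimal sketch of one way to grow a Golden Dataset from a Synthetic baseline.
# The dataclass and fields are illustrative assumptions, not a prescribed format.
from dataclasses import dataclass

@dataclass
class EvalExample:
    question: str
    ground_truth_contexts: list[str]
    ground_truth_answers: list[str]
    source: str = "synthetic"      # e.g. "synthetic", "production", "expert"
    human_verified: bool = False   # set to True after Engineer/PM/domain-expert review

def golden_subset(examples: list[EvalExample]) -> list[EvalExample]:
    """Only human-verified examples are promoted into the Golden Dataset."""
    return [ex for ex in examples if ex.human_verified]
```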

Want to see a real example?

In this separate post (link), we walk through a real-life case study comparing Reference-Free and Reference-Dataset-Based Evaluation, including code examples, evaluation results, and the insights each metric reveals.

Subscribe and stay tuned!

Written by
Yi Zhang