This paper presents a systematic survey and critical review of large language model evaluation, covering its challenges and limitations and offering recommendations for more rigorous evaluation practices.