VALL-E is a neural codec language model for zero-shot text-to-speech synthesis: it produces human-like speech in the voice of a speaker it was never trained on, using only a short recorded sample of that speaker and no speaker-specific training or fine-tuning. It offers a range of samples and synthesis options for generating diverse voices. When synthesizing text, VALL-E also preserves the acoustic environment and the speaker's emotion from the sample. It is built on deep learning, which helps it produce more natural-sounding speech. Samples drawn from the LibriSpeech and VCTK datasets are provided to illustrate the model's performance. VALL-E is well suited to AI applications that need text-to-speech output close to human quality.