World's Best 10 Large Language Models (LLMs) in 2023

Publisher: Jeffrey



After the release of ChatGPT by OpenAI, the race to build the best large language models (LLMs) has intensified manifold. Large companies, small startups, and the open-source community are all working to develop the most advanced LLMs.
Hundreds of LLMs have been released so far, but which ones are the most capable? To find out, follow our list of the best large language models (proprietary and open source) in 2023.

1. OpenAI

* GPT-4

The GPT-4 model by OpenAI is the best AI Large Language Model (LLM) available in 2023. Released in March 2023, GPT-4 demonstrated an excellent ability to handle complex reasoning, advanced coding, and multiple academic exams, reaching human-level performance on many of them.

In fact, it is OpenAI's first multimodal model, able to accept both text and images as input. Although this multimodal capability has not yet been added to ChatGPT, some users have gained access through Bing Chat, which is powered by the GPT-4 model.

GPT-4 is also one of the very few LLMs that have reined in hallucination and improved factual accuracy by a mile. Compared to GPT-3.5, the GPT-4 model scores close to 80% in internal factuality evaluations across several categories. OpenAI has done a lot of work to further align GPT-4 with human values using reinforcement learning from human feedback (RLHF) and adversarial testing by domain experts.

ChatGPT-4

Finally, you can use ChatGPT plugins and browse the web with Bing using the GPT-4 model. The few disadvantages are that responses are slow and inference time is high, which pushes developers toward the older GPT-3.5 model. Overall, the OpenAI GPT-4 model is the best LLM you can use in 2023, and I strongly recommend subscribing to ChatGPT Plus ($20 per month) if you want to use it for serious work. If you don't want to pay, you can try GPT-4 for free through third-party portals.
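If you want to call GPT-4 programmatically rather than through ChatGPT Plus, here is a minimal sketch using the 2023-era openai Python SDK (v0.x); the prompt is illustrative, your account needs GPT-4 API access, and newer SDK versions use a different client interface:

```python
import openai

openai.api_key = "sk-..."  # your OpenAI API key

# Chat Completions call against GPT-4
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RLHF in two sentences."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```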


* GPT-3.5

After GPT-4, OpenAI takes second place again with GPT-3.5. It is a general-purpose LLM similar to GPT-4, but not specialized for any particular domain. Its biggest pro is speed: it is a very fast model and produces a full response in seconds.

Whether you are throwing creative tasks at it, like writing an article with ChatGPT, or coming up with a business plan to earn money using ChatGPT, the GPT-3.5 model does an excellent job. Additionally, the company recently released a larger 16K context length for the GPT-3.5-Turbo model. Don't forget, it's free to use and has no hourly or daily limits.
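To take advantage of the larger context window from code, you simply select the 16K variant by model name. A minimal sketch with the same 2023-era SDK; the file name is a placeholder:

```python
import openai

openai.api_key = "sk-..."

# A long input that would overflow the default 4K context window
long_document = open("report.txt").read()  # placeholder file

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",  # 16K-token context variant of GPT-3.5-Turbo
    messages=[{"role": "user", "content": f"Summarize this document:\n\n{long_document}"}],
)
print(response.choices[0].message.content)
```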

ChatGPT-3.5

GPT-3.5's biggest con is that it hallucinates a lot and often spews out false information, so I would not suggest using it for serious research work. However, for basic coding questions, translation, understanding science concepts, and creative tasks, GPT-3.5 is a good enough model.

On the HumanEval benchmark, the GPT-3.5 model scored 48.1%, while GPT-4 scored 67%, the highest for a general-purpose large language model. Remember, GPT-3.5 has 175 billion parameters, while GPT-4 is reported to have over 1 trillion parameters.
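For context, these HumanEval numbers are pass@1 scores: the fraction of programming problems a model solves with a single sampled completion. Here is a short sketch of the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n: total completions sampled per problem
    c: completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples, 96 pass -> chance at least one of 10 random picks passes
print(pass_at_k(n=200, c=96, k=10))
```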

2. Google


* LaMDA

LaMDA is a family of Transformer-based models specialized for dialog. These models have up to 137B parameters and are trained on 1.56T words of public dialog data. LaMDA can engage in free-flowing conversations on a wide array of topics. Unlike traditional chatbots, it is not limited to pre-defined paths and can adapt to the direction of the conversation.

* Bard

Bard is a chatbot that uses machine learning and natural language processing to simulate conversations with humans and provide responses to questions. It is based on the LaMDA technology and has the potential to provide up-to-date information, unlike ChatGPT, which is based on data collected only up to 2021.

* PaLM 2 (Bison-001)

PaLM 2 is Google's next-generation large language model of 2023, building on the company's legacy of breakthrough research in machine learning and responsible AI.

It excels at advanced reasoning tasks, including code and math, classification and question answering, translation and multilingual proficiency, and natural language generation, outperforming Google's previous state-of-the-art LLMs, including the original PaLM. It can accomplish these tasks because of the way it was built: bringing together compute-optimal scaling, an improved dataset mixture, and model architecture improvements.

PaLM 2 is grounded in Google’s approach to building and deploying AI responsibly. It was evaluated rigorously for its potential harms and biases, capabilities and downstream uses in research and in-product applications. It’s being used in other state-of-the-art models, like Med-PaLM 2 and Sec-PaLM, and is powering generative AI features and tools at Google, like Bard and the PaLM API.
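Since the PaLM API is mentioned here, this is a minimal sketch of querying the Bison text model through Google's google-generativeai Python package as it worked in 2023; the prompt and parameters are illustrative:

```python
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")

completion = palm.generate_text(
    model="models/text-bison-001",  # the PaLM 2 Bison text model
    prompt="Explain compute-optimal scaling in one paragraph.",
    temperature=0.7,
    max_output_tokens=256,
)
print(completion.result)
```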

PaLM 2

Google has focused on commonsense reasoning, formal logic, mathematics, and advanced coding in 20+ languages on the PaLM 2 model. It’s being said that the largest PaLM 2 model has been trained on 540 billion parameters and has a maximum context length of 4096 tokens.

Google has announced four models based on PaLM 2 in different sizes (Gecko, Otter, Bison, and Unicorn). Of these, Bison is the one currently available, and it scored 6.40 on the MT-Bench test, whereas GPT-4 scored a whopping 8.99 points.


* mT5

Multilingual T5 (mT5) is a text-to-text transformer model whose largest variant has 13B parameters. It is trained on the mC4 corpus, covering 101 languages such as Amharic, Basque, Xhosa, and Zulu. mT5 is capable of achieving state-of-the-art performance on many cross-lingual NLP tasks.
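The mT5 checkpoints are available through Hugging Face Transformers. A minimal loading sketch, using the small checkpoint for illustration; note that raw mT5 is pretrained with span corruption only and needs fine-tuning on a downstream task before it produces useful output:

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

# "google/mt5-small" is the smallest checkpoint; mT5-XXL is the 13B-parameter variant
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

inputs = tokenizer("summarize: The quick brown fox...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```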

* Claude v1

In case you are unaware, Claude is a powerful LLM developed by Anthropic, a company backed by Google and co-founded by former OpenAI employees. Its approach is to build AI assistants that are helpful, honest, and harmless. In multiple benchmark tests, Anthropic's Claude v1 and Claude Instant models have shown great promise. In fact, Claude v1 performs better than PaLM 2 in MMLU and MT-Bench tests.

It's close to GPT-4, scoring 7.94 on the MT-Bench test versus GPT-4's 8.99. On the MMLU benchmark as well, Claude v1 secures 75.6 points against GPT-4's 86.4. Anthropic also became the first company to offer a 100K-token context window with its Claude-Instant-100k model; you can load close to 75,000 words into a single window.
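Here is a minimal sketch of using that 100K context window through Anthropic's 2023-era Python SDK; the exact model identifier changed over time, so treat it as an assumption and check the current documentation:

```python
import anthropic

client = anthropic.Client(api_key="YOUR_API_KEY")

response = client.completion(
    model="claude-instant-v1-100k",  # assumed 100K-context model id at the time
    prompt=f"{anthropic.HUMAN_PROMPT} Summarize the following 70,000-word report: ..."
           f"{anthropic.AI_PROMPT}",  # "..." stands in for the long document
    max_tokens_to_sample=512,
)
print(response["completion"])
```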

3. DeepMind


* Gopher

DeepMind's language model Gopher is more accurate than existing large language models on tasks such as answering questions about specialized subjects like science and the humanities, though its gains are smaller on tasks such as logical reasoning and mathematics. Gopher has 280 billion parameters, making it larger than OpenAI's GPT-3, which has 175 billion.

* Chinchilla

Chinchilla uses the same compute budget as Gopher but has only 70 billion parameters and is trained on four times as much data. It outperforms models such as Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on many downstream evaluation tasks, and it uses significantly less compute for fine-tuning and inference, greatly simplifying downstream usage.
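The idea behind Chinchilla is "compute-optimal" training: for a fixed compute budget, a smaller model trained on more tokens beats a bigger, under-trained one. A back-of-envelope sketch using widely cited approximations (training FLOPs of roughly 6 x parameters x tokens, and roughly 20 tokens per parameter); these constants are rules of thumb, not DeepMind's exact fitted values:

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Rough compute-optimal parameter/token split.
    Assumes C ~ 6*N*D and the ~20-tokens-per-parameter heuristic."""
    n_params = (compute_flops / (6 * 20)) ** 0.5
    return n_params, 20 * n_params

# Roughly Gopher's training budget: 280B params x 300B tokens x 6 FLOPs
budget = 6 * 280e9 * 300e9
n, d = chinchilla_optimal(budget)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")  # ~65B, ~1.3T
```

With Gopher's budget, this heuristic lands close to Chinchilla's actual 70 billion parameters and 1.4 trillion training tokens.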

* Sparrow

Sparrow is a chatbot developed by DeepMind that is designed to answer users' questions correctly and reduce the risk of unsafe and inappropriate answers. The motivation behind Sparrow is to solve the problem of language models producing incorrect, biased, or harmful outputs. Sparrow is trained using human judgments to be more helpful, correct, and harmless than baseline pre-trained language models.



4. NVIDIA


* Megatron-Turing NLG

The Megatron-Turing Natural Language Generation model (MT-NLG) is the largest and most powerful monolithic transformer English language model, with 530 billion parameters. This 105-layer, transformer-based MT-NLG improves upon prior state-of-the-art models in zero-, one-, and few-shot settings. It demonstrates unmatched accuracy in a broad set of natural language tasks such as completion prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.

NVIDIA

Training such a large model was made possible by novel parallelism techniques, demonstrated on the NVIDIA DGX SuperPOD-based Selene supercomputer. You can read more about the model and its accuracy on the NVIDIA Technical Blog.
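To give a flavor of what tensor (model) parallelism does, here is a toy sketch of a Megatron-style column-parallel matrix multiply in NumPy; this is an illustration of the concept, not NVIDIA's actual implementation:

```python
import numpy as np

def column_parallel_matmul(x, weight, n_devices):
    """Split the weight matrix column-wise across devices, compute partial
    outputs (in parallel on real hardware), then gather the results."""
    shards = np.split(weight, n_devices, axis=1)   # one shard per "device"
    partials = [x @ w for w in shards]             # independent matmuls
    return np.concatenate(partials, axis=-1)       # all-gather of outputs

x = np.random.randn(4, 1024)        # a batch of activations
W = np.random.randn(1024, 4096)     # the full weight matrix
assert np.allclose(column_parallel_matmul(x, W, n_devices=8), x @ W)
```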



5. LLaMA


Ever since the LLaMA models were leaked online, Meta has gone all-in on open source. It has officially released LLaMA models in various sizes, ranging from 7 billion to 65 billion parameters. According to Meta, its LLaMA-13B model outperforms OpenAI's GPT-3, which was trained on 175 billion parameters. Many developers use LLaMA to fine-tune and create some of the best open-source models out there. Remember, though, that LLaMA was released for research only and is not licensed for commercial use, unlike the Falcon model by TII.

As for the LLaMA-65B model, it has shown excellent capability in most usage scenarios and is one of the top 10 models on the Open LLM Leaderboard on Hugging Face. Meta said no proprietary material was used to train the model; instead, the company used publicly available data from CommonCrawl, C4, GitHub, ArXiv, Wikipedia, StackExchange, and more.

LLaMA

Simply put, after Meta's release of the LLaMA models, the open-source community saw rapid innovation and came up with new techniques to create smaller and more efficient models.
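For developers who obtain the weights through Meta's research access program, LLaMA loads through Hugging Face Transformers once converted to its checkpoint format. A minimal sketch; the local path is a placeholder:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "/path/to/llama-13b-hf"  # placeholder: converted LLaMA weights

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, device_map="auto")  # needs accelerate

inputs = tokenizer("The key idea behind instruction tuning is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```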


6. Baidu


* Ernie 3.0 Titan

Ernie 3.0 Titan was released by Baidu and Peng Cheng Laboratory. It has 260B parameters and excels in natural language understanding and generation. It is trained on massive unstructured text plus a large-scale knowledge graph, and it achieves state-of-the-art results on over 60 NLP tasks, including machine reading comprehension, text classification, and semantic similarity. Additionally, Titan performs well on few-shot and zero-shot benchmarks, showing its ability to generalize to various downstream tasks with small amounts of labeled data.

* Ernie Bot

Chinese technology company Baidu announced that it would complete internal testing of its "Ernie Bot" project in March. Ernie Bot is an AI-based language model similar to OpenAI's ChatGPT, capable of language understanding, language generation, and text-to-image generation. The technology is part of the global race to develop generative artificial intelligence.


7. Huawei


* PanGu-Alpha

Huawei has developed a Chinese-language equivalent of OpenAI's GPT-3 called PanGu-Alpha. The model is trained on 1.1 TB of Chinese-language sources, including books, news, social media, and web pages, and contains 200 billion parameters, 25 billion more than GPT-3. PanGu-Alpha is highly efficient at completing various language tasks such as text summarization, question answering, and dialogue generation.



8. Falcon


Falcon is the first fully open-source large language model on this list, and it surpasses all open-source models released so far, including LLaMA, StableLM, MPT, and more. Falcon LLM is a foundational large language model with 40 billion parameters, trained on one trillion tokens. It was developed by the Technology Innovation Institute (TII) in the UAE.

Falcon-LLM

The model uses only 75 percent of GPT-3's training compute, 40 percent of Chinchilla's, and 80 percent of PaLM-62B's.
Falcon was built using custom tooling and leverages a unique data pipeline that extracts high-quality content from web data for training; its codebase was developed independently from the works of NVIDIA, Microsoft, or HuggingFace.

A particular focus was put on data quality at scale. LLMs are notoriously sensitive to the quality of their training data, so significant care was taken in building a data pipeline that would both scale to tens of thousands of CPU cores for fast processing, and that would extract high-quality content from the web using extensive filtering and deduplication.

The architecture of Falcon was optimized for performance and efficiency. Combining high-quality data with these optimizations, Falcon significantly outperforms GPT-3 for only 75% of the training compute budget—and requires a fifth of the compute at inference time.

Falcon matches the performance of state-of-the-art LLMs from DeepMind, Google, and Anthropic.

Falcon 40B is open-sourced:

* Technology Innovation Institute has publicly released the model’s weights for research and commercial use.

* For researchers and developers, this makes Falcon 40B and 7B far more accessible, as the models are released under the Apache License, Version 2.0. (A minimal loading sketch follows this list.)
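Because the weights are public, Falcon can be loaded directly from the Hugging Face Hub. A minimal sketch; note the 40B model needs on the order of 90 GB of GPU memory in bfloat16, so the 7B variant is the practical choice on a single GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "tiiuae/falcon-40b"  # or "tiiuae/falcon-7b" for smaller hardware

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Falcon shipped custom modeling code at release
    device_map="auto",
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Falcon 40B is", max_new_tokens=40)[0]["generated_text"])
```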


9. Vicuna 33B


Vicuna is another powerful open-source LLM, developed by LMSYS. It is derived from LLaMA, like many other open-source models, and fine-tuned with supervised instruction data collected from sharegpt.com, a portal where users share their most impressive ChatGPT conversations. It is an auto-regressive large language model with 33 billion parameters.
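The Vicuna weights are published on the Hugging Face Hub. A minimal sketch, assuming the lmsys/vicuna-33b-v1.3 checkpoint and an approximation of Vicuna's conversation template (check the model card for the exact format):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-33b-v1.3"  # assumed checkpoint name; see the model card

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Approximation of the Vicuna v1.1-style prompt template
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "USER: What is instruction tuning? ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```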

In LMSYS's own MT-Bench test, it achieved a score of 7.12, while the best proprietary model, GPT-4, scored 8.99 points. It also scores 59.2 points on the MMLU benchmark, versus 86.4 for GPT-4. Despite being a much smaller model, Vicuna's performance is remarkable. You can check out the demo and interact with the chatbot by clicking on the link below.

Read more: Vicuna 33B Chat LLM



10. AI21 Labs (Jurassic-1)


AI21 Labs has launched AI21 Studio, a developer platform where you can use its state-of-the-art Jurassic-1 language models to build your own applications and services. Jurassic-1 models come in two sizes, where the Jumbo version, at 178B parameters, is the largest and most sophisticated language model ever released for general use by developers. AI21 Studio is currently in open beta, allowing anyone to sign up and immediately start querying Jurassic-1 through its API and interactive web environment.
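Querying Jurassic-1 from AI21 Studio is a plain REST call. A minimal sketch against the launch-era endpoint for the Jumbo model; the endpoint and response shape reflect the original Studio API and may have changed since, so verify against the current docs:

```python
import requests

API_KEY = "YOUR_AI21_API_KEY"

resp = requests.post(
    "https://api.ai21.com/studio/v1/j1-jumbo/complete",  # launch-era Jurassic-1 Jumbo endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "Large multi-word token vocabularies help because",
        "maxTokens": 100,
        "temperature": 0.7,
    },
)
print(resp.json()["completions"][0]["data"]["text"])
```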

AI21 Labs' stated mission is to fundamentally reimagine the way humans read and write by introducing machines as thought partners. The company has been researching language models since 2017, and Jurassic-1 builds on this research as the first generation of models it has made available for widespread use.

Jurassic

Jurassic-1 models are highly versatile, capable of human-like text generation as well as solving complex tasks such as question answering and text classification. The models use a unique 250,000-token vocabulary, which is not only much larger than most existing vocabularies (5x or more) but also the first to include multi-word tokens such as expressions, phrases, and named entities. Because of this, Jurassic-1 needs fewer tokens to represent a given amount of text, improving computational efficiency and reducing latency significantly.
