What Are Tokens?
Tokens are the fundamental units that large language models (LLMs) use to process text. Models don't read words or characters directly — they split text into a sequence of tokens. A token can be a full word (like hello), a subword (like un + believ + able), or even a single character or punctuation mark.
For CJK (Chinese, Japanese, Korean) characters, each character typically encodes to 1-2 tokens, meaning the same semantic content in Chinese usually consumes more tokens than in English.
Token Examples
| Text | Approx. Tokens | Explanation |
|---|---|---|
| Hello world | 2 | Common English words = 1 token each |
| 你好世界 | 4 | Each CJK character ≈ 1-2 tokens |
| unbelievable | 3 | Long words split into subwords |
| ChatGPT is amazing! | 5 | Proper nouns may be split |
| const x = 42; | 5 | Symbols and numbers each take tokens |
| https://example.com/path | 7-9 | URLs are split into many tokens |
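The per-character ratios above can be turned into a quick back-of-the-envelope estimator. The sketch below is a hypothetical `estimate_tokens` helper, not any model's real tokenizer: it assumes roughly 4 ASCII characters per token and ~1.5 tokens per CJK character, in line with the table.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (heuristic only, not a real tokenizer).

    Assumption: ~4 characters per token for non-CJK text and
    ~1.5 tokens per CJK character, per the table above.
    """
    # Count characters in the main CJK Unified Ideographs block
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return round(cjk * 1.5 + other / 4)
```

For exact counts you would still need the model's official tokenizer; this is only useful for rough budgeting before a request is sent.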
Token Limits & Pricing by Model
Below are the context window sizes and API pricing for major AI models as of late 2024 (price per 1M tokens, USD):
| Model | Context Window | Input / 1M | Output / 1M |
|---|---|---|---|
| GPT-4o | 128K | $2.50 | $10.00 |
| GPT-4 Turbo | 128K | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16K | $0.50 | $1.50 |
| Claude 3.5 Sonnet | 200K | $3.00 | $15.00 |
| Claude 3 Opus | 200K | $15.00 | $75.00 |
| Claude 3 Haiku | 200K | $0.25 | $1.25 |
| Gemini 1.5 Pro | 1M | $1.25 | $5.00 |
| Gemini 1.5 Flash | 1M | $0.075 | $0.30 |
| Llama 3.1 405B | 128K | Varies by provider | Varies by provider |
| Llama 3.1 70B | 128K | Varies by provider | Varies by provider |
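Since prices are quoted per 1M tokens, the cost of a single API call is just a weighted sum of input and output tokens. A minimal sketch (the `api_cost` helper and the token counts are illustrative, the prices come from the table above):

```python
def api_cost(input_tokens: int, output_tokens: int,
             input_per_m: float, output_per_m: float) -> float:
    """USD cost of one call, given per-1M-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Example: GPT-4o at $2.50 input / $10.00 output per 1M tokens,
# with a 10K-token prompt and a 2K-token response.
cost = api_cost(10_000, 2_000, 2.50, 10.00)  # 0.045 USD
```

Note that output tokens are typically several times more expensive than input tokens, which is why capping response length (see the tips below) has an outsized effect on cost.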
How Tokenization Works
Modern LLMs mostly use BPE (Byte Pair Encoding) or its variants to convert text into tokens. The core idea behind BPE is:
- Start from bytes: Initially treat each byte as an individual token.
- Iterative merging: Count all adjacent token pair frequencies and merge the most frequent pair into a new token.
- Repeat until vocabulary target: Keep merging until the vocabulary reaches a preset limit (e.g., cl100k_base has ~100K tokens).
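The three steps above can be sketched as a toy trainer. This simplified version starts from characters rather than bytes and stops after a fixed number of merges instead of a vocabulary target, but the merge loop is the same idea:

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Simplification: starts from characters; real byte-level BPE
    (e.g. cl100k_base) starts from the 256 possible bytes.
    """
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the pair (a, b) with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```

Running `bpe_train("aaabdaaabac", 3)` first merges the frequent pair "aa", then "aa"+"a", then "aaa"+"b", leaving the sequence tokenized as `['aaab', 'd', 'aaab', 'a', 'c']`.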
Different models use different tokenizers:
- OpenAI GPT-4: Uses the cl100k_base encoder with a ~100K vocabulary (GPT-4o moved to the newer o200k_base encoder with a ~200K vocabulary).
- Anthropic Claude: Uses a proprietary tokenizer with efficiency similar to cl100k_base, slightly better for natural language.
- Google Gemini: Uses SentencePiece tokenizer, optimized for multilingual text, slightly better CJK efficiency.
- Meta Llama 3: Uses a BPE-based tokenizer with ~128K vocabulary.
This tool uses a heuristic algorithm for approximate token estimation. For exact counts, use each model's official tokenizer library (e.g., OpenAI's tiktoken).
Tips for Reducing Token Usage
- Write concise prompts: Remove redundant phrasing and repeated instructions. Direct, concise prompts use fewer tokens and often produce better results.
- Use system messages: Place fixed background instructions in the system message to avoid repeating them in every conversation turn.
- Limit output length: Use the max_tokens parameter to cap response length, or explicitly ask for brief answers in your prompt.
- Avoid large code blocks: Paste only relevant code snippets instead of entire files. Code typically uses 1 token per 2-3 characters, making it less token-efficient than prose.
- Compress with summaries: For long conversations, periodically ask the model to summarize previous context and replace full history with the summary.
- Choose the right model: Use cost-effective models like GPT-3.5 or Claude Haiku for simple tasks; reserve GPT-4o or Claude Opus for complex reasoning.
Related Tools
- Advanced Text Statistics — Character count, word frequency, readability scores
- AI API Pricing Comparison — Compare API pricing across major models
- AI Models Comparison — Compare model capabilities, context windows, speed