Generative AI & LLMs

Tokenization: How does AI 'read' words?

Tokenization in LLMs

You may have noticed that ChatGPT bills you not by "words" but by "tokens". For AI, a token is the small unit that turns language into math. The AI does not know the word "Apple"; it knows Token #1245. In this post we look inside the tokenization factory.


1. What is Tokenization? (The Breakdown)

A computer does not understand text; it only understands numbers. Tokenization is the process of breaking text into smaller pieces:

  • Word-level: "Walking" -> ["Walking"] (Problem: "Walked" is treated as a completely new word).
  • Character-level: "Walking" -> ["W", "a", "l", "k", "i", "n", "g"] (Problem: the meaning gets lost).
  • Subword-level (Standard): "Walking" -> ["Walk", "ing"]. Today's models (GPT, Llama) use subword tokenization because it offers the best balance of meaning and flexibility.
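The three levels above can be sketched in Python. The subword split here is a hand-coded toy that peels off a few common English suffixes; real tokenizers learn their pieces from data:

```python
def word_level(text: str) -> list[str]:
    # Split on whitespace: "Walked" would become a brand-new token.
    return text.split()

def char_level(text: str) -> list[str]:
    # Every character is a token: tiny vocabulary, but meaning is lost.
    return list(text)

def subword_level(text: str) -> list[str]:
    # Toy subword split: peel off a few common English suffixes.
    suffixes = ("ing", "ed", "s")
    out = []
    for word in text.split():
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf) + 1:
                out.extend([word[: -len(suf)], suf])
                break
        else:
            out.append(word)
    return out

print(word_level("Walking"))            # ['Walking']
print(char_level("Walking"))            # ['W', 'a', 'l', 'k', 'i', 'n', 'g']
print(subword_level("Walking Walked"))  # ['Walk', 'ing', 'Walk', 'ed']
```

Notice how the subword split lets "Walking" and "Walked" share the piece "Walk", which is exactly the flexibility the bullet above describes.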

2. BPE (Byte Pair Encoding): The AI Syllables

BPE is today's industry-standard algorithm.

  • It repeatedly "merges" the characters (and pieces) that appear together most often.
  • Example: the AI might split "Antigravity" into Anti + grav + ity.
  • This lets the AI understand brand-new words (ones it has never seen) from their pieces. This is called Out-of-Vocabulary (OOV) handling.
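Here is a minimal sketch of the BPE merge loop. The toy corpus and the merge count are chosen purely for illustration; real tokenizers run tens of thousands of merges over huge corpora:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count every adjacent pair of tokens and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters, then repeatedly merge the most frequent pair.
tokens = list("walking walked walks")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # after 3 merges, 'walk' has been learned as a single subword
```

After three merges the shared stem "walk" exists as one token, so "walking", "walked", and "walks" are each just two pieces, which is exactly how OOV handling works.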

3. The "Hindi Tax": Why is Hindi expensive?

Here is a bitter truth: Devanagari script (Hindi) uses 3-4x more tokens than English.

  • The English word "Technology" = 1 token.
  • The Hindi word "तकनीक" (Takneek) = 4-5 tokens.
  • Why? Because AI tokenizers are trained on internet data that is roughly 90% English. To them, Hindi characters are "complex symbols" that get broken into tiny bytes. That is why Hindi AI apps always cost more.
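One way to see why: modern tokenizers typically fall back to raw UTF-8 bytes for scripts they have few learned merges for, and Devanagari simply needs more bytes per character. A quick check in Python:

```python
english = "Technology"
hindi = "तकनीक"  # "Takneek"

# ASCII letters are 1 byte each; Devanagari characters are 3 bytes each.
print(len(english), "chars ->", len(english.encode("utf-8")), "bytes")  # 10 chars -> 10 bytes
print(len(hindi), "chars ->", len(hindi.encode("utf-8")), "bytes")      # 5 chars -> 15 bytes
```

So even before any learned merges, a byte-level fallback has three times more raw pieces to deal with for Hindi than for the same amount of English.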

4. Token Limits: Why is memory limited?

Every model has a Context Window (e.g., 128k tokens).

  • This is your "working space".
  • If you feed in 500 pages, the model will forget the oldest tokens because its "token memory" is full.
  • The more efficient the tokens, the more context the AI can remember.
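The "forgetting" above is really truncation: once the window is full, the oldest tokens are dropped. A minimal sketch (the token IDs and the 8-token window are made up for illustration):

```python
def fit_to_window(history: list[int], new_tokens: list[int], window: int) -> list[int]:
    """Append new tokens, then keep only the most recent `window` tokens."""
    combined = history + new_tokens
    return combined[-window:]

history = [1, 2, 3, 4, 5, 6]
history = fit_to_window(history, [7, 8, 9, 10], window=8)
print(history)  # [3, 4, 5, 6, 7, 8, 9, 10] -- tokens 1 and 2 are "forgotten"
```

This is also why efficient tokenization matters: if your text needs fewer tokens, fewer old ones must be dropped to make room.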

5. Summary Table: Tokenization Math

Script           | ~Tokens per 1000 Words | Cost Factor
English          | 750 tokens             | 1x (base)
Spanish / French | 1200 tokens            | 1.6x
Hindi / Marathi  | 3000-4000 tokens       | 4x - 5x
Code (Python)    | 800 tokens             | 1.1x
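The table's ratios can be turned into a rough cost estimator. The tokens-per-1000-words figures come straight from the table above; the per-1k-token price in the example is a made-up number:

```python
# Approximate tokens per 1000 words, taken from the table above.
TOKENS_PER_1000_WORDS = {
    "english": 750,
    "spanish_french": 1200,
    "hindi_marathi": 3500,  # midpoint of the 3000-4000 range
    "python_code": 800,
}

def estimate_cost(words: int, script: str, price_per_1k_tokens: float) -> float:
    """Rough estimate: words -> tokens -> price."""
    tokens = words / 1000 * TOKENS_PER_1000_WORDS[script]
    return tokens / 1000 * price_per_1k_tokens

# Same 10,000-word document, at a hypothetical $0.01 per 1k tokens:
print(round(estimate_cost(10_000, "english", 0.01), 4))        # 0.075
print(round(estimate_cost(10_000, "hindi_marathi", 0.01), 4))  # 0.35
```

The same document costs several times more in Hindi, which is the "Hindi Tax" in concrete numbers.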

FAQs

1. What is "Tiktoken"? It is OpenAI's open-source library that tells you how many tokens your text will use. Always check with it before coding to save on cost.

2. Do emojis take tokens? Yes! A single simple emoji can take 1 to 3 tokens because it gets broken into Unicode bytes.
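You can verify the byte counts yourself; a byte-level tokenizer sees these bytes, not "one emoji":

```python
emoji = "🙂"
print(len(emoji))                  # 1 codepoint
print(len(emoji.encode("utf-8")))  # 4 UTF-8 bytes

# Emojis built from several codepoints joined with zero-width joiners
# are even heavier (man + ZWJ + woman + ZWJ + girl):
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family.encode("utf-8")))  # 18 UTF-8 bytes
```

More bytes generally means more tokens, which is why an emoji-heavy chat can be surprisingly expensive.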

3. Is the tokenizer part of the model's training? The tokenizer is built and frozen before the model's training begins. If you change the tokenizer, the entire model must be retrained (because the meaning of the numbers changes).

4. Any improvements in 2026? Newer models (such as Llama-3.2) now have a larger vocabulary size (128k tokens), which makes Hindi and regional languages somewhat cheaper and more efficient.


Tokenization is the AI's "alphabet". Once you understand it, you can optimize AI speed, cost, and performance! 🔢


About Tarun: Tarun specializes in subword encoding algorithms and linguistic compression. On AI-Gyani, every token is logical and optimized.

โ† Pichla Tutorial

Advanced Prompting: AI ko super-intelligent banayein

Agla Tutorial โ†’

Embeddings: AI 'matlab' kaise samajhta hai?

About the Author

Tarun Mankar
Software Engineer & AI Content Creator

I am a Software Engineer who writes about AI and Machine Learning in Hinglish. I built AI Gyani so that any Indian student can learn AI without the stress of English: completely free, completely easy.