CLEAR Series 3 · 5 min read

Tokenization and Word Embedding

How language models read messy clinical text

Series progress 3 / 8

C.L.E.A.R. Series | Post #3

Clinical Lens to Explain AI Relatably

Have you ever wondered how a machine reads a patient summary or a radiology report, or makes sense of a referral note full of half-finished sentences and acronyms?

Language models do it with math and structure.

Before an AI model can “understand” a sentence, it has to do two things:

  • Break the text down into workable units
  • Represent those units in a way that carries meaning

This is where tokenization and word embedding come in. Let’s take a closer look—without the jargon or code.

1. Tokenization - Breaking Down the Language

In clinical terms, tokenization is like breaking down a long discharge summary into readable, processable units. But how these chunks are made matters.

Here are the 3 main tokenization approaches used in language models:

a. WordPiece

  • 🔹 Common in models like BERT
  • 🔹 Breaks rare words into frequent subwords
  • 🔹 E.g., “neurotoxicity” might become “neuro”, “##toxic”, “##ity” (the “##” marks a piece that continues the word)

Useful when you're dealing with structured vocabularies, like drug names...
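For readers who do want a peek under the hood (feel free to skip it), here is a minimal sketch of the "longest match first" idea behind WordPiece-style splitting. The vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are made up for illustration; a real model ships with a learned vocabulary of tens of thousands of pieces and may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining chunk first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing in the vocabulary fits this word
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, invented for this example only.
TOY_VOCAB = {"neuro", "cardio", "##toxic", "##ity", "##logy"}

print(wordpiece_tokenize("neurotoxicity", TOY_VOCAB))
# ['neuro', '##toxic', '##ity']
```

With the same toy vocabulary, "cardiotoxicity" comes apart as “cardio”, “##toxic”, “##ity”: the model has never seen the whole word, but it recognises every part.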

b. BPE (Byte Pair Encoding)

  • 🔹 Starts from single characters and keeps merging the most frequent neighbouring pair into a new unit
  • 🔹 Helps models learn frequent patterns, even in made-up words

Think of it like building a mental abbreviation list based on repeated exposure in discharge summaries.
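As an optional aside, the merging idea can be sketched in a few lines. This is a simplified illustration, not the exact procedure any particular model uses; the function name `learn_bpe_merges` and the tiny word list are assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE)."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical mini "corpus": "sob" (shortness of breath) shows up a lot.
merges, corpus = learn_bpe_merges(["sob", "sob", "sob", "sore"], num_merges=2)
print(merges)
```

Because "sob" appears so often in this toy list, its letters get fused first ("s" + "o", then "so" + "b"), while the rarer "sore" stays partly in pieces.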

c. SentencePiece - The Chaos Handler

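SentencePiece is a tokenizer toolkit (it can run BPE or a related method called unigram) that doesn’t assume spaces separate words. It takes in raw text, treats a space as just another symbol, and figures things out, even if the input is multilingual or full of typos.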

If WordPiece is a textbook reader, SentencePiece is your senior who's fluent in deciphering rushed notes, acronyms or Google Translate copy-paste jobs.
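Another optional aside: the "doesn't assume spaces" trick can be shown with a toy sketch. SentencePiece replaces each space with a visible marker (the character "▁") so that spaces travel along with the pieces and the original text can be rebuilt exactly. The helper names below are made up, and the sketch leaves out details of the real library, such as the marker it adds in front of the text by default.

```python
MARKER = "\u2581"  # the "▁" symbol used to stand in for a space


def encode_spaces(text):
    """Turn spaces into a visible symbol so they are treated like any other character."""
    return text.replace(" ", MARKER)


def decode(pieces):
    """Glue pieces back together and restore the spaces."""
    return "".join(pieces).replace(MARKER, " ")


note = "pt c/o SOB"
print(encode_spaces(note))

# However the marked text is later cut into pieces, joining them restores the note.
pieces = ["pt", MARKER + "c/o", MARKER + "SOB"]
print(decode(pieces))
```

Since nothing is thrown away, the same approach works for languages that do not put spaces between words.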

2. Then Comes Word Embedding

Once the words are sliced, they need meaning—this is where embeddings come in. Every token gets mapped to a vector (a set of numbers).

But here's the cool bit:

  • 🔹 Words like “fever” and “infection” will end up close together
  • 🔹 So will “lung” and “bronchus”
  • 🔹 And yes, “MI” and “heart attack” might look nearly identical in this space

It’s like a clinical concept map, built not from definitions but from the contexts in which words appear: words that keep similar company end up as neighbours.
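For the curious, "close together" is usually measured with cosine similarity, which compares the direction of two vectors. The sketch below uses hand-made three-number vectors purely for illustration; real embeddings are learned from text and have hundreds of dimensions, and the numbers in `TOY_EMBEDDINGS` are invented.

```python
import math


def cosine_similarity(a, b):
    """Return a score near 1.0 for vectors pointing the same way, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors, not taken from any real model.
TOY_EMBEDDINGS = {
    "fever": [0.9, 0.1, 0.0],
    "infection": [0.8, 0.2, 0.1],
    "lung": [0.1, 0.9, 0.2],
    "bronchus": [0.0, 0.8, 0.3],
}

for a, b in [("fever", "infection"), ("lung", "bronchus"), ("fever", "lung")]:
    score = cosine_similarity(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b])
    print(f"{a} vs {b}: {score:.2f}")
```

In this toy setup, “fever” and “infection” score far higher with each other than “fever” does with “lung”, which is the kind of pattern a trained model would be expected to pick up on its own.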

So what’s really happening?

You feed the model a messy note. It breaks it down (tokenization). Then gives the pieces clinical meaning (embeddings). That’s how it “reads.”
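If you would like to see those two steps chained together, here is a deliberately tiny sketch. Everything in it is an assumption made for illustration: the vocabulary, the two-number vectors, and the use of a plain whitespace split in place of a real subword tokenizer.

```python
# Hypothetical vocabulary: each known token gets an ID number.
TOY_VOCAB = {"[UNK]": 0, "pt": 1, "c/o": 2, "sob": 3, "fever": 4}

# Hypothetical embedding table: row i is the vector for token ID i.
TOY_VECTORS = [
    [0.0, 0.0],  # [UNK]
    [0.2, 0.1],  # pt
    [0.1, 0.3],  # c/o
    [0.7, 0.6],  # sob
    [0.9, 0.2],  # fever
]


def read_note(note):
    """Tokenize a note, look up token IDs, then look up one vector per token."""
    tokens = note.lower().split()  # stand-in for a real subword tokenizer
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    vectors = [TOY_VECTORS[i] for i in ids]
    return tokens, ids, vectors


tokens, ids, vectors = read_note("Pt c/o SOB")
print(tokens)
print(ids)
print(vectors)
```

Anything outside the toy vocabulary falls back to the `[UNK]` entry, which is exactly the gap that subword tokenizers are designed to shrink.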

If you’re trying to make sense of how language models actually handle text—without diving into Python—this series is for you.

#CLEAR #ClinicalAI #NojargonsNoCoding #AIinMedicine #Tokenization #WordEmbeddings #MedicalAI #NLPinHealthcare #LLMsExplained #ExplainableAI

Original source

Read the original LinkedIn post

The full reading experience now lives on this website. The original LinkedIn post remains available as the source reference.
