Process
- Create Tokenizer
- Make Dataset
- NSP (Next Sentence Prediction)
- Masking
- Training
This somewhat contradicts what I learned earlier, so I’m noting it down.
For domain-specific tasks, training from scratch using only domain-specific data outperforms fine-tuning a pretrained model.
 ref: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
This is a paper that trained BERT from scratch using data from PubMed, one of the largest archives of biomedical journals.

The model card linked above reports BERT scores on biomedical tasks. For example, BC5-chem is a chemical NER task. On such domain-specific tasks, a BERT trained from scratch on biomedical data outperformed a fine-tuned general-domain BERT.
Data

Dataset
Transform data into a form the model can consume. For BERT, you need to create:
- input_ids: Vocab IDs generated through Token Embedding.
- token_type_ids: Segment IDs generated through Segment Embedding.
- Positional encoding information (position IDs consumed by the Position Embedding).
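The three inputs above can be sketched with a toy word-level vocabulary. The vocabulary, the sentences, and the encode_pair helper are all illustrative stand-ins, not real WordPiece tokenization:

```python
# Toy vocabulary standing in for a real WordPiece vocab.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
         "the": 4, "drug": 5, "binds": 6, "protein": 7, "it": 8, "works": 9}

def encode_pair(sent_a, sent_b):
    """Build BERT's input tensors for a sentence pair (word-level sketch)."""
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    # Segment A (including [CLS] and the first [SEP]) gets id 0, segment B gets id 1.
    len_a = len(sent_a.split()) + 2
    token_type_ids = [0] * len_a + [1] * (len(tokens) - len_a)
    # Position ids are simply 0..n-1; BERT learns one embedding per position.
    position_ids = list(range(len(tokens)))
    return input_ids, token_type_ids, position_ids

ids, types, pos = encode_pair("the drug binds protein", "it works")
```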
target_seq_length
This is from the BERT pretraining code on GitHub. The input sequence length is controlled as follows:
- max_num_tokens: maximum number of tokens that can go into BERT.
```python
target_seq_length = max_num_tokens
if random.random() < self.short_seq_probability:
    target_seq_length = random.randint(2, max_num_tokens)
```

short_seq_probability makes target_seq_length vary randomly.
The reason for this is model generalization. If every example is packed to max_num_tokens, the model might not handle other sequence lengths well. So the target sequence length is randomly shortened to produce a more flexible model.
Segment control
Starting from line 258, the code builds the segments. The dataset tries to fill the maximum sequence length. That is, if ‘sentence_1[SEP]sentence_2’ is too short, it will build something like ‘sentence_1+sentence_2[SEP]sentence_3+sentence_4’. There are still only two segments — ‘sentence_1+sentence_2’ becomes one segment.
The code randomly cuts Segment A’s length — it picks a random integer and uses only that many tokens as the segment.
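The packing and random split described above can be sketched as follows. build_segments and the word-level “tokens” are hypothetical simplifications of the actual dataset code:

```python
import random

def build_segments(sentences, target_seq_length, rng=None):
    """Sketch of segment packing: accumulate consecutive sentences until the
    chunk reaches target_seq_length, then split the chunk at a random
    sentence boundary into Segment A and Segment B."""
    rng = rng or random.Random(0)
    chunk, length = [], 0
    for sent in sentences:
        tokens = sent.split()       # word-level "tokens" for illustration
        chunk.append(tokens)
        length += len(tokens)
        if length >= target_seq_length:
            break
    # Segment A takes a random number of leading sentences (at least one,
    # leaving at least one for Segment B when possible).
    a_end = rng.randint(1, max(1, len(chunk) - 1))
    seg_a = [t for s in chunk[:a_end] for t in s]
    seg_b = [t for s in chunk[a_end:] for t in s]
    return seg_a, seg_b
```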
Truncation
‘SegmentA[SEP]SegmentB’ might exceed the maximum sequence length. That’s when truncation is needed.
Truncation repeats the following until the pair fits:
1. Randomly pick either Segment A or Segment B.
2. Remove the last token from the selected segment.
3. Check the token count; if truncation is still needed, go back to step 1.
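The loop above can be sketched like this. truncate_pair is an illustrative name; note that the actual BERT create_pretraining_data.py differs slightly — it always trims the longer segment and removes from the front or the back at random:

```python
import random

def truncate_pair(seg_a, seg_b, max_num_tokens, rng=None):
    """Sketch of the truncation loop described above: while the pair is too
    long, pick a segment at random and drop its last token."""
    rng = rng or random.Random(0)
    while len(seg_a) + len(seg_b) > max_num_tokens:
        target = seg_a if rng.random() < 0.5 else seg_b
        if not target:              # never empty a segment completely
            target = seg_b if target is seg_a else seg_a
        target.pop()                # remove the last token
    return seg_a, seg_b
```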
Dataloader
Determines how data is delivered to the model. For BERT, this boils down to the masking strategy.
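As a sketch of that masking strategy, here is BERT’s standard 80/10/10 MLM scheme: each non-special token is selected with probability 0.15; a selected token becomes [MASK] 80% of the time, a random word 10% of the time, and stays unchanged 10% of the time. mask_tokens is an illustrative helper, and the -100 ignore-label follows the common PyTorch cross-entropy convention:

```python
import random

def mask_tokens(tokens, vocab_words, mask_prob=0.15, rng=None):
    """Sketch of BERT's MLM masking (illustrative, not the actual code)."""
    rng = rng or random.Random(0)
    masked = list(tokens)
    labels = [-100] * len(tokens)   # -100: position ignored by the loss
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok             # prediction target = the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"                 # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab_words)  # 10%: replace with random word
        # remaining 10%: keep the original token unchanged
    return masked, labels
```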