7 Comments
Kacimi Imad

Hello Sir,

Thank you for all the different tutorials. I have a question about the instruction dataset. In my case, I want to fine-tune, for example, Llama 3 to extract features from a given real estate ad. My question is:

Is it better to format the dataset using the Llama 3 prompt format, or is it okay to use a different format, like the Alpaca format? For example:

"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

Could you provide a note discussing how to prepare the dataset?

Thank you!

Benjamin Marie

It's totally OK to use a different prompt format. The only thing that matters is that the prompt format used at inference time is the same as the one used during fine-tuning.

You can use the Alpaca prompt format to fine-tune Llama 3.
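
To illustrate the point about keeping the format identical, here is a minimal sketch in Python. It assumes the Alpaca template quoted in the question; the helper names `build_prompt` and `build_training_example` and the real estate example are hypothetical, and the end-of-sequence token is left as a parameter because it depends on the tokenizer of the model being fine-tuned.

```python
# Alpaca-style template, as quoted in the question above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str) -> str:
    """Prompt shown to the model; used unchanged at inference time."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)


def build_training_example(
    instruction: str, input_text: str, response: str, eos_token: str
) -> str:
    """Training text: the same prompt, then the target response and EOS.

    eos_token is an assumption to fill in from your tokenizer.
    """
    return build_prompt(instruction, input_text) + response + eos_token


if __name__ == "__main__":
    instruction = "Extract the features of the property described in the ad."
    ad = "Bright 3-room apartment, 75 m2, 2nd floor, balcony, near the station."

    # "<EOS>" is only a placeholder for the real end-of-sequence token.
    print(build_training_example(instruction, ad, '{"rooms": 3}', "<EOS>"))
    print(build_prompt(instruction, ad))
```

Because both functions go through the same template, the prompt at inference time is guaranteed to be a prefix of what the model saw during fine-tuning.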

Kacimi Imad

Thank you, sir.

Xinyu Wei

Can the files train.clean.pp.dedup.norm.spm8k.en and train.clean.pp.dedup.norm.spm8k.es be used for machine translation training as they are? Does the content of the files need to be numericalized for NLP training, meaning that each token is mapped to a unique numerical ID?

Benjamin Marie

Yes, they can be used as they are, without further preprocessing.

Xinyu Wei

I still have a question. As I remember, when training deep learning models, especially for NLP, we usually need to convert the text to IDs. Why, in this blog post of yours, can the generated English text files be used directly for training without converting them to IDs?

Benjamin Marie

Because the framework does this conversion for you.
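
For readers wondering what that conversion looks like, here is a rough sketch in plain Python. It is not the code of any particular framework: the function names, the special tokens, and the sample lines are all hypothetical. It assumes the files are already segmented into whitespace-separated subword tokens, as the spm8k suffix suggests.

```python
from collections import Counter


def build_vocab(lines, specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Map each distinct token to a unique integer ID.

    Special tokens get the first IDs; the remaining tokens are ordered by
    decreasing frequency, with ties broken alphabetically.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(line, vocab):
    """Turn one tokenized line into a list of IDs, ending with </s>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in line.split()] + [vocab["</s>"]]


if __name__ == "__main__":
    # Hypothetical lines standing in for an already-tokenized training file.
    train_lines = ["▁the ▁house ▁is ▁small", "▁the ▁cat ▁is ▁here"]
    vocab = build_vocab(train_lines)
    for line in train_lines:
        print(line, "->", encode(line, vocab))
```

A training framework typically runs a step like this when it reads or preprocesses the data, which is why the text files themselves can stay human-readable.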
