Hello Sir,
Thank you for all the different tutorials. I have a question about the instruction dataset. In my case, I want to fine-tune, for example, LLAMA3 for extracting features from a given real estate ad. My question is:
Is it better to format the dataset using the prompt format of LLAMA3, or is it okay to use a different format, like the Alpaca format? For example:
"""
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
"""
Could you provide a note discussing how to prepare the dataset?
Thank you!
It's totally OK to use a different prompt format. The only thing that matters is that the prompt format used during fine-tuning is the same one you use at inference time.
You can use the Alpaca prompt format for fine-tuning Llama 3.
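As a minimal sketch of what "same format at both stages" can look like, the snippet below builds the training text and the inference prompt from one shared template, so the two cannot drift apart. The helper names (`build_prompt`, `build_training_example`), the sample ad, and the default end-of-sequence string are assumptions made for illustration; in practice the EOS token would be taken from your tokenizer.

```python
# Alpaca-style template, copied from the question above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.\n"
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n"
    "### Input:\n{input}\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str) -> str:
    """Prompt used at inference time: everything up to the response."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)


def build_training_example(
    instruction: str,
    input_text: str,
    response: str,
    eos_token: str = "<|end_of_text|>",  # assumption: replace with your tokenizer's EOS token
) -> str:
    """Text used during fine-tuning: the same prompt, then the target and EOS."""
    return build_prompt(instruction, input_text) + response + eos_token


# Hypothetical real estate example.
instruction = "Extract the features of the property described in the ad."
ad = "Bright 2-bedroom apartment, 65 m2, with balcony."

train_text = build_training_example(instruction, ad, '{"bedrooms": 2, "balcony": true}')
infer_prompt = build_prompt(instruction, ad)
```

Because the training text always starts with exactly the string that `build_prompt` returns, the model sees the same prefix at inference time that it saw during fine-tuning.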
Thank you, sir.
Can the files train.clean.pp.dedup.norm.spm8k.en and train.clean.pp.dedup.norm.spm8k.es be used for machine translation training as they are? Does the content of the files need to be numericalized for NLP training, meaning that each token is mapped to a unique numerical ID?
Yes, they can be used as they are, without further preprocessing.
I still have a question. I remember that when training deep learning models, especially for NLP, we usually need to convert the text to IDs. Why, in this blog post of yours, can the generated English text files be used directly for training without being converted to IDs?
Because the framework does this conversion for you.
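To give a rough idea of what that conversion involves, here is a toy sketch in plain Python. It is not the framework's actual code: it assumes, as the `spm8k` suffix suggests, that the files hold already-tokenized, space-separated subword pieces, and the function names, special tokens, and sample sentences are all made up for illustration.

```python
def build_vocab(lines, specials=("<pad>", "<unk>", "</s>")):
    """Assign each distinct token a unique integer ID, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for line in lines:
        for tok in line.split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(line, vocab):
    """Map a tokenized line to IDs, using <unk> for unseen tokens and appending </s>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in line.split()] + [vocab["</s>"]]


# Hypothetical lines in the style of a SentencePiece-tokenized file.
corpus = ["▁the ▁house ▁is ▁small", "▁the ▁garden ▁is ▁big"]
vocab = build_vocab(corpus)
ids = encode("▁the ▁house ▁is ▁big", vocab)
print(ids)  # [3, 4, 5, 8, 2]
```

A training framework typically performs a step like this when it loads or preprocesses the text files, which is why you only need to provide the tokenized text.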