|
|
|
The following tables describe the configuration comparison with Megatron-LM.
|
|
|
|
|
|
|
This document covers precision comparison only for the mcore model. Therefore, `--use-mcore-model` must be set for Megatron-LM, and `use_legacy: False` must be set for MindSpore Transformers.
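
For reference, the following is a minimal, hypothetical sketch of how the MindSpore Transformers side of this requirement might look in a training YAML file; the exact placement of the key can vary between configurations, and the Megatron-LM side is simply the `--use-mcore-model` launch argument.

```yaml
# Hypothetical excerpt of a MindSpore Transformers training YAML.
# Only the key name comes from this document; its placement may differ.
use_legacy: False   # equivalent to passing --use-mcore-model to Megatron-LM
```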
|
|
|
|
|
|
|
|
|
|
| Megatron-LM | Description | MindSpore Transformers | Description |
|-------------|-------------|------------------------|-------------|
| `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. `use_legacy: False` is equivalent to `--use-mcore-model`. |
| `num-layers` | Number of network layers, that is, the number of transformer layers. | `num_layers` | Number of network layers, that is, the number of transformer layers. |
| `encoder-num-layers` | Number of encoder layers. | Not supported. | |
| `decoder-num-layers` | Number of decoder layers. | Not supported. | |
| `hidden-size` | Size of the hidden layer, that is, the dimension of the hidden state. | `hidden_size` | Size of the hidden layer, that is, the dimension of the hidden state. |
| `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. |
| `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. |
| `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. |
| `group-query-attention` | Specifies whether to enable grouped query attention (GQA). | `use_gqa` | Specifies whether to enable grouped query attention (GQA). |
| `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. |
| `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. |
| `position-embedding-type` | Position encoding type, such as `learned_absolute` or `rope`. | `position_embedding_type` | Position encoding type, such as `learned_absolute` or `rope`. |
| `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type` == `rope` | Specifies whether to use RoPE. |
| `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. |
| `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. |
| `rotary-interleaved` | Specifies whether to use interleaved RoPE. | `rotary_interleaved` | Specifies whether to use interleaved RoPE. |
| `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. |
| `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. |
| `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. |
| `no-position-embedding` | Specifies whether to disable position encoding. | `no_position_embedding` | Specifies whether to disable position encoding. |
| `disable-bias-linear` | Disables bias in linear layers. | `add_bias_linear` | Enables bias in linear layers. Note that the semantics are inverted: `--disable-bias-linear` corresponds to `add_bias_linear: False`. |
| `mrope-section` | Section sizes for multi-section RoPE (M-RoPE). | Not supported. | |
| `make-vocab-size-divisible-by` | Pads the vocabulary size so that it is divisible by the specified value. | Not supported. | By default, the vocabulary size is not changed. |
| `init-method-std` | Standard deviation of the normal distribution used for model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used for model parameter initialization. |
| `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. |
| `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. |
| `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. |
| `norm-epsilon` | Numerical stability factor (epsilon) for normalization. | `rms_norm_eps` | Epsilon used by RMSNorm. |
| `apply-layernorm-1p` | Specifies whether to add 1 to the LayerNorm weight. | Not supported. | |
| `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. |
| `openai-gelu` | Specifies whether to use the OpenAI version of the GELU activation function. | Not supported. | |
| `squared-relu` | Specifies whether to use the squared ReLU activation function. | Not supported. | |
| Specified by `swiglu`, `openai-gelu`, and `squared-relu` | Activation function; defaults to **torch.nn.functional.gelu** when none of these is set. | `hidden_act` | Activation function type. |
| `gated_linear_unit` | Specifies whether to use a gated linear unit in the multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use a gated linear unit in the MLP. |
| `swiglu` | Specifies whether to use the SwiGLU activation function. | `hidden_act` == `silu` and `gated_linear_unit` | Specifies whether to use the SwiGLU activation function. |
| `no-persist-layer-norm` | Disables the persistent fused LayerNorm kernel. | Not supported. | |
| `untie-embeddings-and-output-weights` | Specifies whether to untie the weights of the input embedding layer and the output layer. | `untie_embeddings_and_output_weights` | Specifies whether to untie the weights of the input embedding layer and the output layer. |
| Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. |
| `grad-reduce-in-bf16` | Reduces gradients in BFloat16. | Not supported. | |
| Not supported. | By default, the initialization tensor is generated in BFloat16 format. | `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. |
| Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. |
| `attention-softmax-in-fp32` | Executes attention softmax in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. |
| Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. |
| `loss-scale` | Overall loss scaling factor. | `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**. |
| `initial-loss-scale` | Initial loss scaling factor. | Not supported. | |
| `min-loss-scale` | Minimum loss scaling factor. | Not supported. | |
| `loss-scale-window` | Window size for dynamic loss scaling. | `loss_scale_window` | Window size for dynamic loss scaling. |
| `hysteresis` | Loss scale hysteresis parameter. | Not supported. | |
| `fp32-residual-connection` | Uses Float32 for residual connections. | `fp32_residual_connection` | Uses Float32 for residual connections. |
| `accumulate-allreduce-grads-in-fp32` | Accumulates and all-reduces gradients in Float32. | Not supported. | Accumulates and all-reduces gradients in Float32 by default. |
| `fp16-lm-cross-entropy` | Computes the LM cross-entropy loss in Float16. | Not supported. | Computes the LM cross-entropy loss in Float32 by default. |
| `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. |
| `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. |
| `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. |
| `qk-pos-emb-head-dim` | Number of rotary position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of rotary position embedding dimensions per Q/K head. |
| `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). |
| `rotary-scaling-factor` | RoPE scaling coefficient. | `scaling_factor` | RoPE scaling coefficient. |
| `use-precision-aware-optimizer` | Enables the precision-aware optimizer, which automatically manages parameter updates of different data types. | Not supported. | |
| `main-grads-dtype` | Data type of the main gradients. | Not supported. | By default, Float32 is used as the data type of the main gradients. |
| `main-params-dtype` | Data type of the main parameters. | Not supported. | By default, Float32 is used as the data type of the main parameters. |
| `exp-avg-dtype` | Data type of the exponential moving average (EMA) in the optimizer. | Not supported. | |
| `exp-avg-sq-dtype` | Data type of the squared-gradient EMA in the optimizer. | Not supported. | |
| `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 for the first and last layers. | Not supported. | |
| `num-layers-at-start-in-bf16` | Number of layers at the start of the model kept in BFloat16. | Not supported. | |
| `num-layers-at-end-in-bf16` | Number of layers at the end of the model kept in BFloat16. | Not supported. | |
| `multi-latent-attention` | Specifies whether to enable multi-latent attention (MLA). | `multi_latent_attention` | Specifies whether to enable multi-latent attention (MLA). |
| `qk-layernorm` | Enables query/key layer normalization. | `qk_layernorm` | Enables query/key layer normalization. |
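
To make the mapping more concrete, the following sketch collects several of the MindSpore Transformers keys from the table above into a single configuration fragment. Only the key names and the fact that `loss_scale_value` is configured in **runner_wrapper** come from the table; the nesting under `model.model_config`, the exact nesting inside `runner_wrapper`, and all example values are illustrative assumptions rather than a reference configuration.

```yaml
# Illustrative fragment only: key names follow the table above; nesting and
# values are assumptions and should be taken from a real model configuration.
model:
  model_config:
    num_layers: 32                    # --num-layers
    hidden_size: 4096                 # --hidden-size
    intermediate_size: 11008          # --ffn-hidden-size
    num_heads: 32                     # --num-attention-heads
    use_gqa: True                     # --group-query-attention
    n_kv_heads: 8                     # --num-query-groups
    position_embedding_type: rope     # --position-embedding-type rope
    rotary_base: 10000                # --rotary-base
    normalization: RMSNorm            # --normalization RMSNorm
    rms_norm_eps: 1.0e-5              # --norm-epsilon
    hidden_act: silu                  # together with gated_linear_unit, matches --swiglu
    gated_linear_unit: True
    add_bias_linear: False            # --disable-bias-linear
    untie_embeddings_and_output_weights: True  # --untie-embeddings-and-output-weights
    compute_dtype: bfloat16           # --bf16
    param_init_type: float32          # keep Float32 for the precision comparison
    layernorm_compute_type: float32
    softmax_compute_type: float32     # --attention-softmax-in-fp32
    rotary_dtype: float32
runner_wrapper:
  loss_scale_value: 1.0               # --loss-scale; 1.0 is typical when compute_dtype is BFloat16
  loss_scale_window: 1000             # --loss-scale-window
```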
|
|
|
|
|
|
|
- Optimizer and learning rate scheduling configurations |
|
|
|
|
|
|
|
|