
Continuous Autoregressive Language Models

Would be interesting to combine it with Reasoning in the Latent Space: feed the vector from the output layer of the transformer back to the input.

Obviously, you can't do it in pre-training. But you can add it later as an optional 'extra' vector, I think. E.g. `input_embedding + MLP(prev_output) * alpha`. Alpha is zero during pre-training.
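
For concreteness, a minimal sketch of what this gated feedback could look like (module structure and names are illustrative, not from the paper; the key point is that `alpha` starts at zero so pre-trained behaviour is untouched):

```python
import torch
import torch.nn as nn

class LatentFeedback(nn.Module):
    """Hypothetical gated feedback of the previous output vector into the input embedding."""

    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        # Gate initialized to zero: the model behaves exactly as pre-trained
        # until fine-tuning learns to open it.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, input_embedding: torch.Tensor, prev_output: torch.Tensor) -> torch.Tensor:
        # input_embedding, prev_output: (batch, d_model)
        return input_embedding + self.mlp(prev_output) * self.alpha
```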

3 hours ago | killerstorm

I like this plan, but don't you already get this from the input vector in the prompt, at least if inference is 'chunk-wise': generating a latent-space vector, decoding it, outputting it, then doing the next one?

What if you trained a separate thinking phase using the autoencoder, though? That might be more efficient, and then you've got it using neuralese internally.

Actually, reading the (summary) paper, they tried your idea and had trouble with it for a different reason:

> Once the generative head predicts the next vector, a natural next step would be to feed it directly as input to the Transformer for predicting the following one. However, we found that the model struggles to unpack the semantic information from such a compact representation. Instead, we ground the autoregressive process back in the more structured discrete space, where the predicted vector is passed through the autoencoder to reconstruct the K tokens.
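
The chunk-wise loop that excerpt describes might look roughly like this (all function and object names are placeholders for illustration, not the paper's actual API):

```python
# Hypothetical chunk-wise generation loop: each step predicts one latent vector,
# decodes it to K discrete tokens, and feeds those tokens (not the raw vector) back in.
def generate(transformer, generative_head, autoencoder, prompt_tokens, n_chunks, k):
    tokens = list(prompt_tokens)
    for _ in range(n_chunks):
        hidden = transformer(tokens)          # contextual state from the discrete tokens so far
        next_vec = generative_head(hidden)    # predicted continuous vector for the next chunk
        chunk = autoencoder.decode(next_vec)  # reconstruct K discrete tokens from the vector
        tokens.extend(chunk[:k])              # ground the next step in token space
    return tokens
```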
an hour ago | vessenes

Very interesting. I also find these training terms quite elegant:

- Diversity: This term encourages the model to generate a diverse set of samples, preventing mode collapse.
- Fidelity: This term rewards the model for making predictions that are close to the ground truth.
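
If those two terms are realized as an energy-score-style objective (an assumption on my part; the summary doesn't give the exact formula), a minimal PyTorch sketch could look like this:

```python
import torch

def energy_score_loss(samples: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Two-term objective over S candidate vectors for one position.

    samples: (S, D) vectors drawn from the generative head (S >= 2)
    target:  (D,)   ground-truth latent vector
    """
    s = samples.shape[0]
    # Fidelity: mean distance from each sample to the ground truth.
    fidelity = torch.norm(samples - target, dim=-1).mean()
    # Diversity: mean pairwise distance between distinct samples; subtracting it
    # rewards spread and discourages mode collapse.
    pairwise = torch.cdist(samples, samples)
    diversity = pairwise.sum() / (s * (s - 1))
    return fidelity - 0.5 * diversity
```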

I'm wondering if a continuous next-vector generative approach would also increase the innate "reasoning" capabilities of the model, since it could potentially capture more of the semantics of the data than tokens alone.

4 hours ago | mentalgear

And maybe even better suited to some sorts of RL fine-tuning?

4 hours ago | barrenko

The technique of compressing tokens down reminds me a bit of Byte Latent Transformers.