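The Fundamental Secret Behind ChatGPT's Text-Generation Language Model, and How to Build Your Own

Learn about the different types of language models and how they serve as one of the building blocks of generative AI tools.

A language model learns the probability of word occurrences from examples of text. Simpler models look at the context of a short sequence of words, whereas larger models work at the level of sentences or paragraphs. Once trained, a language model can predict the next character or word given a sample context.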
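To make the idea of learning word probabilities from text concrete, here is a minimal sketch of a count-based bigram model. It is not the neural approach used later in this article; the toy corpus and the function names `train_bigram` and `next_word_probs` are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the raw counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # `prev` was never seen with a following word
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus (hypothetical); real models are trained on far more text.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

In this toy corpus "the" is followed by "cat" twice and by "mat" once, so the estimated distribution puts two thirds of the probability on "cat". A neural language model plays the same role, but generalizes to contexts it has not seen verbatim.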

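Language models have been with us for decades; only recently was a highly general-purpose version released to the public. Since neural networks came into use, language modeling has reached new heights thanks to their ability to generalize.

Depending on the unit of context the model processes, there are two types of language model:

1) Character-based neural language models
2) Word-based neural language models

Such models are usually built with RNNs (recurrent neural networks), such as long short-term memory (LSTM) networks.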

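Character-based neural language models

These are neural networks that develop a language model at the character level. A character-based neural language model uses characters as the basic unit of its input and output. Unlike word-based models, character-based models do not depend on a fixed word vocabulary, so they can process any input text without word-level preprocessing or tokenization.

They work by training a neural network to predict the probability distribution of the next character in a sequence, given the preceding characters in that sequence. This approach has its own advantages and disadvantages.

Advantages: One of the main advantages of character-based models is that they can handle out-of-vocabulary (OOV) words, that is, words absent from the training data, as well as rare words. They can also capture morphological variation and spelling errors, which makes them useful for tasks such as machine translation, text classification, and speech recognition.

Disadvantages: Character-based models are computationally expensive to train, especially on large datasets, and may need more memory than word-based models. They also tend to produce more repetitive and less coherent text than word-based models, especially when generating longer sequences.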

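For example, here is how a character-based LSTM language model generates text:

Input text: "The quick brown fox jumps over the lazy dog."

First, the text is tokenized into individual characters, which are then one-hot encoded. The LSTM network is trained to predict the next character in the sequence from the preceding characters. During training, the network learns the patterns and relationships among the characters in the input text.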
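The tokenize-then-one-hot-encode step can be sketched in plain Python as follows. This is an illustration only: the window size `context_len` and the helper name `one_hot` are assumptions, and a real pipeline would typically build these arrays with NumPy or Keras utilities.

```python
text = "The quick brown fox jumps over the lazy dog"

# Build the character vocabulary from the text itself.
chars = sorted(set(text))
char_to_index = {ch: i for i, ch in enumerate(chars)}

def one_hot(ch):
    """Return a vector with a single 1 at the position assigned to `ch`."""
    vec = [0] * len(chars)
    vec[char_to_index[ch]] = 1
    return vec

# Each training pair is (preceding characters, next character).
context_len = 7  # assumed window size, chosen only for this illustration
pairs = [(text[i:i + context_len], text[i + context_len])
         for i in range(len(text) - context_len)]

# One-hot encode every character of the first context window.
encoded_first_context = [one_hot(ch) for ch in pairs[0][0]]
print(pairs[0])
print(len(encoded_first_context), len(encoded_first_context[0]))
```

The first pair is the context "The qui" with the target "c". Each character becomes a vector as long as the character vocabulary, which stays small no matter how many distinct words the text contains.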

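During inference, the model takes a seed sequence of characters and generates the next character based on the patterns it has learned. For example, if the seed sequence is "The qui", the model might predict that the next character is "c". The model then appends this predicted character to the seed to form "The quic"; this extended sequence becomes the input for the next prediction, and the process repeats to produce one character after another. The model's step-by-step output might look like this:

The qui   -> seed
The quic  -> first prediction, based on the context "The qui"
The quick -> second prediction, based on the context "The quic"
and so on

The seed is the input for the first prediction.
The seed with the first prediction appended becomes the input for the second prediction; similarly,
  that sequence with the second prediction appended becomes the input for the third prediction.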
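The append-and-repeat loop described above can be sketched as follows. The function `lookup_predictor` is a hypothetical stand-in for a trained network: it only looks up what followed the last few characters in a single sentence, which is enough to show how each prediction feeds the next step.

```python
def generate(predict_next, seed, n_steps):
    """Repeatedly append the predicted next character to the running text."""
    text = seed
    history = [text]
    for _ in range(n_steps):
        text += predict_next(text)  # the prediction becomes part of the next input
        history.append(text)
    return history

# Stand-in for a trained model (hypothetical): it looks up what followed
# the last `window` characters of the context in one training sentence.
training_text = "The quick brown fox jumps over the lazy dog"

def lookup_predictor(context, window=7):
    pos = training_text.find(context[-window:])
    if pos == -1 or pos + window >= len(training_text):
        raise ValueError("context not seen in the toy training text")
    return training_text[pos + window]

for step in generate(lookup_predictor, "The qui", 2):
    print(step)
```

With a real model, `predict_next` would encode the context, run the network, and pick a character from the predicted distribution; the surrounding loop stays the same.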

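Word-based neural language models

These are neural networks that develop a language model at the word level. A word-based neural language model generates text by predicting the next word from the words that precede it. Unlike character-based neural language models, which predict the next character, word-based models take whole words as input.

They work by training a neural network to predict the probability distribution of the next word in a sequence, given the preceding words in that sequence. This approach has its own advantages and disadvantages.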

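Advantages: 1) Better semantic representation: Compared with character-based models, word-based models capture the meaning of a sentence or sequence of words more directly, because words are the building blocks of language and carry more meaning than individual characters.

2) Contextual understanding: Word-based models can take into account the context in which a word is used within a sentence or sequence, which makes them more accurate at generating relevant and meaningful text.

3) Better performance: On some language-modeling tasks, such as text prediction, machine translation, and speech recognition, word-based models have been shown to outperform character-based models.

Disadvantages: 1) Out-of-vocabulary words and vocabulary size: Word-based models can struggle with words that are not in their vocabulary, which may lead to incorrect or unintelligible predictions. The size of the vocabulary is itself a challenge, since the output layer needs one unit per vocabulary word.

2) Long-term dependencies: Although word-based models can handle short-range dependencies between words, they may struggle to capture long-range dependencies, leading to incorrect or irrelevant predictions.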
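One common workaround for out-of-vocabulary words is to cap the vocabulary and map everything else to a placeholder token. The sketch below illustrates that idea; it is not something the tutorial later in this article does, and the token name `<unk>` and the helpers `build_vocab` and `encode` are assumptions for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token for unseen words

def build_vocab(tokens, max_words):
    """Map the `max_words` most frequent words to integer ids; id 0 is UNK."""
    most_common = [w for w, _ in Counter(tokens).most_common(max_words)]
    vocab = {UNK: 0}
    for word in most_common:
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each word by its id, falling back to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in tokens]

train_tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = build_vocab(train_tokens, max_words=3)
print(encode("the quick zebra".split(), vocab))
```

Here "zebra" never appeared in the training tokens, so it is encoded with the UNK id. The model can still process the sentence, but it has lost the identity of the unknown word, which is the trade-off the paragraph above describes.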

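For example, here is the same sentence again, this time showing the text a word-based LSTM language model might generate:

Input text: "The quick brown fox jumps over the lazy dog."

First, the text is tokenized into individual words, which are then one-hot encoded. The LSTM network is trained to predict the next word in the sequence from the preceding words. During training, the network learns the patterns and relationships among the words in the input text.

During inference, the model starts from a seed sequence of words and generates the next word based on the patterns it has learned. For example, if the seed sequence is "The quick", the model might predict that the next word is "brown". The model then appends this predicted word to the seed to form "The quick brown"; this extended sequence becomes the input for the next prediction, and the process repeats to produce one word after another. The model's step-by-step output might look like this:

The quick           -> seed
The quick brown     -> first prediction, based on the context "The quick"
The quick brown fox -> second prediction, based on the context "The quick brown"
and so on

The seed is the input for the first prediction.
The seed with the first prediction appended becomes the input for the second prediction; similarly,
  that sequence with the second prediction appended becomes the input for the third prediction.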

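Tutorial: developing a neural language model for text generation

First, we need a sample text to train our model on. For this we will use the Project Gutenberg e-book of The Republic by Plato. I have shared a Drive link containing a cleaned version of the text.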

→https://drive.google.com/file/d/1_oCjjkbt8uBmKEFNq-V-meIhBjhrwSs4/view?usp=sharing

I am using Colab. Next, we import the required libraries.

!pip install keras_preprocessing
import string
import re
from random import randint
from pickle import dump, load
from numpy import array, argmax
# Tokenizer and pad_sequences both come from the keras_preprocessing package installed above
from keras_preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.utils import plot_model, to_categorical
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Embedding

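Next comes sequence generation. The raw text is cleaned and converted so that each training example pairs a context with the word to predict. For example: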

Context      Prediction
I            went 
I went       down
I went down  yesterday
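The context/prediction pairs in the table above can be produced with a few lines of Python. This is a minimal sketch; the helper names `make_pairs` and `fixed_windows` are assumptions for illustration. The second helper shows the fixed-length variant, where every context has the same number of words.

```python
def make_pairs(tokens):
    """Pair every growing prefix (the context) with the word that follows it."""
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

def fixed_windows(tokens, context_len):
    """Fixed-length variant: `context_len` words of context plus one target word."""
    return [(tokens[i - context_len:i], tokens[i])
            for i in range(context_len, len(tokens))]

for context, prediction in make_pairs("I went down yesterday".split()):
    print(f"{context!r} -> {prediction!r}")

print(fixed_windows("a b c d e".split(), 2))
```

Fixed-length windows are convenient for training because every input has the same shape, so the examples can be stacked into a single array.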

The entire dataset is converted into sequences of this kind. Upload the text from the Drive link to Colab and name the file Plato.txt.

# load doc into memory
def load_doc(filename):
    # open the file read-only and read all text; the file is closed on exit
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a document into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# save sequences to file, one per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

# load document
in_filename = 'Plato.txt'
doc = load_doc(in_filename)
print(doc[:200])
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
# organize into sequences of tokens: 50 words of context plus 1 word to predict
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))
# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

Next comes training the model. More complex models are always an option, but because of limits on training time and resources, I used a simple one.

# load doc into memory
def load_doc(filename):
    # open the file read-only and read all text; the file is closed on exit
    with open(filename, 'r') as file:
        text = file.read()
    return text

# define the model
def define_model(vocab_size, seq_length):
    model = Sequential()
    # learn a 50-dimensional vector for each word id
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(LSTM(100, return_sequences=True))
    model.add(LSTM(100))
    model.add(Dense(100, activation='relu'))
    # one output unit per vocabulary word: a distribution over the next word
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size (+1 because word ids start at 1)
vocab_size = len(tokenizer.word_index) + 1
# separate into input (first 50 words) and output (last word)
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
# define model
model = define_model(vocab_size, seq_length)
# fit model
model.fit(X, y, batch_size=128, epochs=1)
# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
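The input/output split and the `to_categorical` step in the listing above can be illustrated without Keras. The sketch below spells out the same two operations with plain lists; the integer ids are made up for illustration, and `to_one_hot` is a hypothetical helper that mimics one row of what `to_categorical` returns.

```python
# Three integer-encoded sequences of length 4 (made-up ids for illustration).
sequences = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 1],
]
vocab_size = 6  # largest id + 1, because word ids start at 1 and 0 is left unused

# Plain-list equivalent of: X, y = sequences[:, :-1], sequences[:, -1]
X = [seq[:-1] for seq in sequences]  # every word except the last is context
y = [seq[-1] for seq in sequences]   # the last word is the prediction target

def to_one_hot(index, num_classes):
    """Plain-Python counterpart of one row of to_categorical."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

y_one_hot = [to_one_hot(label, vocab_size) for label in y]
print(X)
print(y)
print(y_one_hot[0])
```

Each target becomes a vector of length `vocab_size` with a single 1, which matches the softmax output layer and the categorical cross-entropy loss used in the model definition.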

Once the model is trained, we can generate text from a given context. The listing above fits the basic LSTM network for a single epoch. The full training run uses the identical script; only the number of epochs in the fit call changes:

# fit model
model.fit(X, y, batch_size=128, epochs=500)

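Once training is complete, we take a randomly selected passage from the dataset as the input. The model then generates text that continues the input context.

CONTEXT:
any one of those within the prescribed age who forms a connection with any
woman in the prime of life without the sanction of the rulers for we shall
say that he is raising up a bastard to the state uncertified and unconsecrated
very true he replied this applies however only

OUTPUT:
to those who are within the specified age after that he knows all
neither is the art or any other conflict and can the same knowledge had been
given to the soul that again is least ridiculous would be numbers seeing that
then if must be the greatest variety of

Thanks to the language model, the output follows on from the context. Read together, the two look like this: "any one of those within the prescribed age who forms a connection with any woman in the prime of life without the sanction of the rulers for we shall say that he is raising up a bastard to the state uncertified and unconsecrated very true he replied this applies however only to those who are within the specified age after that he knows all neither is the art or any other conflict and can the same knowledge had been given to the soul that again is least ridiculous would be numbers seeing that then if must be the greatest variety of"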
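The article does not show the code that produced this output, so the following is only a sketch of how generation could work, assuming greedy decoding (always taking the most probable next word). The function name `generate_seq` and the classes `ToyTokenizer` and `ToyModel` are hypothetical; the two toy classes stand in for the saved tokenizer and model so that the sketch runs without a trained network.

```python
import numpy as np

def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    """Greedy generation: repeatedly predict and append the most likely next word."""
    seed_words = seed_text.split()
    words = list(seed_words)
    for _ in range(n_words):
        # integer-encode the running text with the tokenizer used in training
        encoded = tokenizer.texts_to_sequences([" ".join(words)])[0]
        # keep the last seq_length ids, left-padding with 0 if the text is shorter
        encoded = encoded[-seq_length:]
        encoded = [0] * (seq_length - len(encoded)) + encoded
        # predict a probability for every word id and take the most likely one
        probs = model.predict(np.array([encoded]), verbose=0)[0]
        next_word = tokenizer.index_word.get(int(np.argmax(probs)))
        if next_word is None:  # id 0 is padding and has no word
            break
        words.append(next_word)
    return " ".join(words[len(seed_words):])

# Stand-ins (hypothetical) so the sketch runs without a trained model.
class ToyTokenizer:
    def __init__(self, vocab):
        self.word_index = {w: i + 1 for i, w in enumerate(vocab)}
        self.index_word = {i: w for w, i in self.word_index.items()}
    def texts_to_sequences(self, texts):
        return [[self.word_index[w] for w in t.split() if w in self.word_index]
                for t in texts]

class ToyModel:
    """Always predicts the id after the last input id, wrapping around."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
    def predict(self, x, verbose=0):
        probs = np.zeros((len(x), self.vocab_size))
        for row, ids in enumerate(x):
            probs[row, ids[-1] % (self.vocab_size - 1) + 1] = 1.0
        return probs

tok = ToyTokenizer(["the", "quick", "brown", "fox"])
result = generate_seq(ToyModel(5), tok, seq_length=3, seed_text="the quick", n_words=2)
print(result)
```

With the files saved by the training script, the same function would presumably be called with `load_model('model.h5')`, the tokenizer unpickled from `tokenizer.pkl`, and `seq_length=50`. Sampling from the predicted distribution instead of taking the argmax is a common alternative that gives more varied text.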

ChatGPT is built on much more complex models. For each query, it likewise has to complete the context in an appropriate and consistent way.


Page updated: 2024-04-01
