[HandsOn] 16. Natural Language Processing with RNNs and Attention - Summary 2

[Book Read-Through] Hands On Machine Learning 2022. 8. 18. 20:44

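The IMDb reviews dataset is said to be the 'hello world' of natural language processing.

It consists of 50,000 movie reviews written in English (25,000 for training, 25,000 for testing).

Each review comes with a simple binary target indicating whether it is negative (0) or positive (1).

Apparently it is popular because it is simple enough to process on a laptop in a reasonable amount of time, yet still fun. lol

It can be loaded directly from keras.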

from tensorflow import keras

(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

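This dataset is already preprocessed. Each integer represents one word.

All punctuation was removed, the words were converted to lowercase and split on spaces, and then each word was indexed by frequency.

(Low integers correspond to frequently occurring words.)

0: padding token, 1: start-of-sequence token, 2: unknown word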
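To read a review back as text, the ids have to be mapped to words again. The sketch below shows the idea with a tiny made-up index. `toy_word_index` is a hypothetical stand-in for what `keras.datasets.imdb.get_word_index()` returns. The shift by 3 is there because ids 0, 1 and 2 are reserved for the special tokens.

```python
# Toy stand-in for keras.datasets.imdb.get_word_index(); these entries are
# made up for illustration and are NOT the real IMDb index.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Every word rank is shifted by 3 so that ids 0, 1 and 2 stay free
# for the special tokens.
id_to_word = {rank + 3: word for word, rank in toy_word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token


def decode(ids):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(id_to_word.get(i, "<unk>") for i in ids)


encoded_review = [1, 4, 5, 2, 6, 0, 0]
decoded = decode(encoded_review)
print(decoded)  # <sos> the movie <unk> great <pad> <pad>
```

With the real index, the same dictionary trick should turn `X_train[0][:10]` into the opening words of the first review.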

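In a real project you have to preprocess the text yourself, but if you want to deploy the model you can't keep writing a different preprocessing function every time! You may also want to do the preprocessing using only TensorFlow operations, and in that case the preprocessing can be included in the model itself.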

import tensorflow as tf
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

Let's write the preprocessing function.

def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

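It truncates the review text and keeps only the first 300 characters of each review (to speed up training). It then replaces <br /> tags and every character other than letters and apostrophes with spaces, splits on whitespace, and pads all reviews in the batch to the same length with "<pad>".

Truncating is fine because you can generally tell whether a review is positive or not from the first sentence or two.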
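To see what each step does to an actual string, here is a pure-Python mirror of the same steps. This is only a sketch for intuition, not the TensorFlow version: `preprocess_one` and `pad_batch` are hypothetical helper names, and Python slicing counts characters whereas `tf.strings.substr` counts bytes by default (identical for plain ASCII text).

```python
import re


def preprocess_one(review: str, max_chars: int = 300) -> list:
    """Mirror of the steps in preprocess() for a single Python string."""
    text = review[:max_chars]                # keep the first 300 characters
    text = re.sub(r"<br\s*/?>", " ", text)   # drop HTML line-break tags
    text = re.sub(r"[^a-zA-Z']", " ", text)  # keep only letters and apostrophes
    return text.split()                      # split on whitespace


def pad_batch(token_lists, pad="<pad>"):
    """Pad every token list to the length of the longest one in the batch."""
    longest = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad] * (longest - len(tokens)) for tokens in token_lists]


reviews = ["Great movie!<br />Loved it.", "Didn't like it, at all..."]
tokens = pad_batch([preprocess_one(r) for r in reviews])
for row in tokens:
    print(row)
# ['Great', 'movie', 'Loved', 'it', '<pad>']
# ["Didn't", 'like', 'it', 'at', 'all']
```

Note that nothing here lowercases the words, so "Great" and "great" would count as different vocabulary entries.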

for X_batch, y_batch in datasets["train"].batch(2).take(1):
    print(preprocess(X_batch, y_batch))

Next, build the vocabulary.

Use a Counter to count how many times each word appears.

from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

Keep only the 10,000 most frequent words in the vocabulary and drop the rest! The model doesn't really need to know every single word in it.

vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]
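The count-then-truncate step can be tried on a toy batch without TensorFlow. The reviews below are made up and the vocabulary size is shrunk to 3 purely for illustration.

```python
from collections import Counter

# Made-up, already tokenized and padded reviews (illustration only).
toy_reviews = [
    ["good", "movie", "<pad>", "<pad>"],
    ["bad", "movie", "really", "bad"],
    ["movie", "<pad>", "<pad>", "<pad>"],
]

vocabulary = Counter()
for review in toy_reviews:
    vocabulary.update(review)

vocab_size = 3  # the post uses 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

# A word's id is simply its position in the truncated list.
word_to_id = {word: i for i, word in enumerate(truncated_vocabulary)}
print(truncated_vocabulary)  # ['<pad>', 'movie', 'bad']
```

In this toy data "<pad>" is the most frequent token, so it ends up at position 0.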

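The next preprocessing step replaces each word with its index in the vocabulary. As seen earlier in Chapter 13, we build a lookup table that uses 1,000 OOV (out-of-vocabulary) buckets. (Why 1,000, though..?)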

words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

What IDs do these words get?

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>

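"This", "movie" and "was" are in the lookup table, but "faaaaaantastic" is not, so it gets mapped to one of the OOV buckets, which is why its ID (10053) is 10,000 or higher!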
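Roughly speaking, an unknown word is hashed and the hash picks one of the `num_oov_buckets` buckets, whose ids start right after the vocabulary. The sketch below imitates that with `zlib.crc32` as a stand-in hash. This is an assumption for illustration: TensorFlow uses its own fingerprint function, so the actual bucket (53 above) will not match. `toy_vocab` and `lookup` are hypothetical names.

```python
import zlib

vocab_size = 10000
num_oov_buckets = 1000

# Tiny stand-in for the real table; the three ids are copied from the output above.
toy_vocab = {b"This": 22, b"movie": 12, b"was": 11}


def lookup(word: bytes) -> int:
    """Return the vocabulary id, or an OOV bucket id for unknown words."""
    if word in toy_vocab:
        return toy_vocab[word]
    # Stand-in hash (assumption): the real table uses a different hash function.
    bucket = zlib.crc32(word) % num_oov_buckets
    return vocab_size + bucket


ids = [lookup(w) for w in b"This movie was faaaaaantastic".split()]
print(ids)
```

As for the number of buckets, it looks like a tunable trade-off rather than a magic number: with more buckets, different unknown words are less likely to collide on the same id, but the embedding matrix gets one extra row per bucket.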

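Now we can build the final training set. Batch the reviews and preprocess them with the function written above, then encode the words using the table, and finally prefetch the next batch.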

def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

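This builds and trains the model.

Most of it is straightforward, but the embedding layer is new.

The Embedding layer takes an input of shape [batch size, time steps] and returns an output of shape [batch size, time steps, embedding size].

The batch size always comes first, and the word ID at each time step is converted into a 128-dimensional vector (embed_size = 128), which adds the last dimension.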
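The shape change is easy to check by hand, since an embedding lookup is just "pick row number word_id from a matrix". Here is a framework-free sketch with toy sizes (8 ids and 4 dimensions instead of 11,000 and 128). The random values stand in for trainable weights.

```python
import random

vocab_size, embed_size = 8, 4  # toy sizes for illustration
rng = random.Random(0)

# One row of embed_size numbers per word id (trainable weights in a real model).
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embed_size)]
                    for _ in range(vocab_size)]

# [batch size, time steps] = [2, 3]
batch = [[3, 1, 5],
         [2, 7, 0]]

# Replace every word id with its row from the embedding matrix.
embedded = [[embedding_matrix[word_id] for word_id in review]
            for review in batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 3, 4) -> [batch size, time steps, embedding size]
```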

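16.2.1 Masking

As it stands, the model would have to learn by itself that padding tokens should be ignored. Instead, just add the mask_zero=True argument when creating the Embedding layer.

Then the padding tokens with id=0 are ignored by the downstream layers that support masking. (Here "<pad>" presumably ends up with id 0 because it is the most frequent token in the vocabulary.)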
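Conceptually the mask is just a boolean array that is True wherever the id is not 0, and a recurrent layer that receives it skips the masked time steps and carries its state over unchanged. The sketch below imitates that in plain Python. `run_toy_rnn` is a hypothetical stand-in with a made-up update rule, not a real GRU.

```python
# Two padded reviews; id 0 is the padding token.
batch = [[22, 12, 11, 0, 0],
         [7, 5, 0, 0, 0]]

# The mask is True for real words and False for padding.
mask = [[word_id != 0 for word_id in review] for review in batch]
lengths = [sum(row) for row in mask]


def run_toy_rnn(review, mask_row):
    """Toy recurrent loop: masked steps leave the state untouched."""
    state = 0
    for word_id, keep in zip(review, mask_row):
        if keep:
            state = state * 2 + word_id  # made-up stand-in for a cell update
        # Masked step: the previous state is carried over as is.
    return state


final_states = [run_toy_rnn(r, m) for r, m in zip(batch, mask)]
print(lengths)       # [3, 2]
print(final_states)  # [123, 19]
```

Without the mask, the trailing zeros would keep changing the state (it would double at every padded step here), so the final state would depend on how much padding a review happened to get.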

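16.2.2 Reusing Pretrained Embeddings

Oh.. I'm a total deep learning newbie, and this is pretty cool. Find the module you want on TensorFlow Hub, copy its example code into your project, and the pretrained weights are downloaded automatically and included in the model.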

import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

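This example uses version 1 of the nnlm-en-dim50 sentence embedding module in a sentiment analysis model!

The hub.KerasLayer layer downloads the module from the given URL. This module is a 'sentence encoder': it takes a string as input and encodes it as a single vector (50-dimensional here).

It embeds each word using an embedding matrix that was pretrained on a huge corpus, one with 7 billion words in total!

It then computes the mean of all the word embeddings, and the result is the sentence embedding. By default this layer is not trainable, but it can be fine-tuned for the task by setting trainable=True when creating it.

The IMDb reviews dataset can be fed straight into the model without any separate preprocessing!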
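The averaging step is simple enough to sketch by hand. The 3-dimensional vectors below are made up for illustration (the real module uses 50 dimensions and a pretrained matrix), and `sentence_embedding` is a hypothetical helper, not part of TensorFlow Hub.

```python
# Made-up 3-dimensional word embeddings (illustration only).
toy_embeddings = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 0.0],
    "bad": [-1.0, 0.0, -2.0],
}


def sentence_embedding(sentence, dim=3):
    """Average the embeddings of the known words in the sentence."""
    vectors = [toy_embeddings[w] for w in sentence.lower().split()
               if w in toy_embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    # zip(*vectors) groups the values dimension by dimension.
    return [sum(values) / len(vectors) for values in zip(*vectors)]


print(sentence_embedding("great movie"))  # [0.5, 0.5, 1.0]
print(sentence_embedding("bad movie"))    # [-0.5, 0.5, -1.0]
```

Whatever the sentence length, the result is always one fixed-size vector, which is why the Dense layers can sit directly on top of the hub layer.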
