read

Loading Data From File

데이터가 많아지면 파일로 저장하게 된다. (보통 csv파일로 저장한다)

import numpy as np

# EXAM1, EXAM2, EXAM3, FINAL
# 73,80,75,152
# 93,88,93,185
# 89,91,90,180
# 96,98,100,196
# 73,66,70,142
# 53,46,55,101

xy = np.loadtxt('data-01-test-score.csv', delimiter=',', dtype=np.float32)
# 처음부터, 마지막 하나 빼고
x_data = xy[:, 0:-1]
# 마지막 하나만
y_data = xy[:, [-1]]

print(x_data.shape, x_data, len(x_data))
print(y_data.shape, y_data)

파일은 loadtxt를 사용하면 쉽게 가져올 수 있다.

import tensorflow as tf
import numpy as np

# EXAM1, EXAM2, EXAM3, FINAL
# 73,80,75,152
# 93,88,93,185
# 89,91,90,180
# 96,98,100,196
# 73,66,70,142
# 53,46,55,101

xy = np.loadtxt('data-01-test-score.csv', delimiter=',', dtype=np.float32)
# 처음부터, 마지막 하나 빼고
x_data = xy[:, 0:-1]
# 마지막 하나만
y_data = xy[:, [-1]]

print(x_data.shape, x_data, len(x_data))
print(y_data.shape, y_data)

X = tf.placeholder(tf.float32, shape=[None, 3])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([3, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

hypothesis = tf.matmul(X, W) + b

cost = tf.reduce_mean(tf.square(hypothesis - Y))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-5)
train = optimizer.minimize(cost)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for step in range(2001):
    cost_val, hy_val, _ = sess.run(
        [cost, hypothesis, train],
        feed_dict={X: x_data, Y: y_data})
    if step % 10 == 0:
        print(step, "Cost : ", cost_val, "\nPrediction : \n", hy_val)

print("Your score will be ", sess.run(hypothesis, feed_dict={X: [[100, 70, 101]]}))
print("Other scores will be ", sess.run(hypothesis, feed_dict={X: [[60, 70, 110], [90, 100, 80]]}))

모두 구현해 본 코드는 위와 같다.

이렇게 numpy를 이용해 구현한 것이고, 이제 이 파일이 굉장히 커서 메모리에 올리기 힘들 때, tensorflow에서는 Queue Runner 라는 것을 제공한다.

여러 파일들이 있을 때, Random Shuffle하여 Filename Queue 를 만든다.

그리고, Reader로 연결해서, 데이터에 맞게 Decode한 후, 읽어온 값을 Queue에 쌓는다.

학습을 시킬 때는 모든 값이 아닌, batch파일이 필요하므로, Queue에서 꺼내서 쓰면 된다.

위 과정이 복잡해 보이지만, 생각보다 간단하다.

과정은 다음과 같다.

filename_queue = tf.train.string_input_producer(
    ['data-01-test-score.csv', 'data-02-test-score.csv', ... ],
    shuffle=False, name='filename_queue')

기본적으로 filename을 list화 한다고 보면 된다.

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

다음은 위와 같다. 파일을 읽어올 reader로 TextLineReader(binary도 가능)를 사용하고, key, value로 읽는다.

record_defaults = [[0.],[0.],[0.],[0.]]
xy = tf.decode_csv(value, record_defaults=record_defaults)

그리고 읽어온 value를 어떻게 parsing을 할 것인가를 decode_csv로 가져온다.

이 때, record_defaults를 이용해 어떤 형태로 읽어올지 설정한다. (float)

train_x_batch, train_y_batch = tf.train.batch([xy[0:-1], xy[-1:]], batch_size=10)

sess.tf.Session()

...

이 값을 batch라는 형태로 읽어온다. 그리고, batch_size를 설정해 한번에 몇개씩 가져올 지 설정한다.

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

위 소스는 통상적으로 쓴다고 보면 된다.

for step in range(2001):
    x_batch, y_batch = sess.run([train_x_batch, train_y_batch])
    ...

Loop를 돌 때, batch들도 tensor 이므로, 값을 넘긴다. 추후에 feed_dict로도 가능하다.

coord.request_stop()
coord.join(threads)

이 부분 역시 통상적으로 쓰인다.

import tensorflow as tf

# Test Scores for General Psychology
# http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/frame.html
# The data (X1, X2, X3, X4) are for each student.
# X1 = score on exam #1
# X2 = score on exam #2
# X3 = score on exam #3
# X4 = score on final exam
# EXAM1,EXAM2,EXAM3,FINAL
# 73,80,75,152
# 93,88,93,185
# 89,91,90,180
# 96,98,100,196
# 73,66,70,142
# 53,46,55,101
# 69,74,77,149
# 47,56,60,115
# 87,79,90,175
# 79,70,88,164
# 69,70,73,141
# 70,65,74,141
# 93,95,91,184
# 79,80,73,152
# 70,73,78,148
# 93,89,96,192
# 78,75,68,147
# 81,90,93,183
# 88,92,86,177
# 78,83,77,159
# 82,86,90,177
# 86,82,89,175
# 78,83,85,175
# 76,83,71,149
# 96,93,95,192

filename_queue = tf.train.string_input_producer(
    ['data-01-test-score.csv'], shuffle=False, name='filename_queue')

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

record_defaults = [[0.],[0.],[0.],[0.]]
xy = tf.decode_csv(value, record_defaults=record_defaults)

train_x_batch, train_y_batch = tf.train.batch([xy[0:-1], xy[-1:]], batch_size=10)

X = tf.placeholder(tf.float32, shape=[None, 3])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([3, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

hypothesis = tf.matmul(X, W) + b

cost = tf.reduce_mean(tf.square(hypothesis - Y))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-5)
train = optimizer.minimize(cost)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

for step in range(2001):
    x_batch, y_batch = sess.run([train_x_batch, train_y_batch])
    cost_val, hy_val, _ = sess.run(
        [cost, hypothesis, train],
        feed_dict={X: x_batch, Y: y_batch})
    if step % 10 == 0:
        print(step, "Cost : ", cost_val, "\nPrediction:\n", hy_val)

coord.request_stop()
coord.join(threads)

모든 코드는 위와 같고 결과는 같다.

batch가 셔플하고 싶다하면 shuffle_batch를 쓰면 된다.

min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size
example_batch, label_batch = tf.train.shuffle_batch(
  [example, label], batch_size=batch_size, capacity=capacity,
  min_after_dequeue=min_after_dequeue)

Logistic (regression) classification

그럼 Logistic classification 을 구현해 보겠다. 먼저, Training Data를 진행할 것인데, 결과값은 binary다. (binary classification)

import tensorflow as tf

x_data = [[1,2],[2,3],[3,1],[4,3],[5,3],[6,2]]
y_data = [[0],[0],[0],[1],[1],[1]]

X = tf.placeholder(tf.float32, shape=[None, 2])
Y = tf.placeholder(tf.float32, shape=[None, 1])
# X 입력 2개, Y 출력 1개 -> [6*2] * [W] = [6*1]
W = tf.Variable(tf.random_normal([2,1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# sigmoid
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# 예측값, 0.5 기준. cast 하면 True일때 1, False일때 0
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
# predicted와 Y가 같은지 확인
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

# 특정 블럭 내 동작으로 제한
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for step in range(10001):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0:
            print(step, cost_val)

    h, c, a = sess.run([hypothesis, predicted, accuracy], feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis : ", h, "\nCorrect (Y) : ", c, "\nAccuracy : ", a)

위와 같이 진행할 수 있다.

그럼 조금 더 다양한 데이터로 진행해보자. (data-03-diabetes.csv 파일을 이용한다.)

import tensorflow as tf
import numpy as np

xy = np.loadtxt('data-03-diabetes.csv', delimiter=',', dtype=np.float32)
x_data = xy[:,0:-1]
y_data = xy[:,[-1]]

X = tf.placeholder(tf.float32, shape=[None, 8])
Y = tf.placeholder(tf.float32, shape=[None, 1])
# X 입력 2개, Y 출력 1개 -> [6*2] * [W] = [6*1]
W = tf.Variable(tf.random_normal([8,1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# sigmoid
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# 예측값, 0.5 기준. cast 하면 True일때 1, False일때 0
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
# predicted와 Y가 같은지 확인
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

# 특정 블럭 내 동작으로 제한
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for step in range(10001):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0:
            print(step, cost_val)

    h, c, a = sess.run([hypothesis, predicted, accuracy], feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis : ", h, "\nCorrect (Y) : ", c, "\nAccuracy : ", a)

하위에는 W값을 제외하고 대부분 같다.

간단히 해볼 부분은, CSV reading을 통해 구현해보도록 하자. 또 다른 부분은 Kaggle이라는 사이트의 데이터로 classification을 해보도록 하자.

이후에는 다음에 하도록 하자.

[ML] HunKim`s ML 3

Loading Data From File

Logistic (regression) classification

Written by

구찌

Supported by

구찌의 나도한번 해블로그

구찌의 개발 블로그 입니다.