Machine Learning – Tutorial 4

Regression – Training and Testing

This tutorial covered the first application of Regression to sample data.

Key takeaways:

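  • X and y hold the features and the labels respectively
  • Scaling the features often helps accuracy and can speed up training, although the scaling step itself adds processing time:
    Generally, you want your features in machine learning to be on a similar, small scale (roughly -1 to 1). This may do nothing, but it usually speeds up processing and can also help with accuracy. Because this is so commonly done, it is included in the preprocessing module of Scikit-Learn. Note that preprocessing.scale standardizes each column to zero mean and unit variance, so values are centred on 0 rather than strictly bounded by -1 and 1. To use it, apply preprocessing.scale to your X variable:
    X = preprocessing.scale(X)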
  • Quandl limits the number of anonymous API calls; the workaround is to register and pass an API key with your requests:
    quandl.ApiConfig.api_key = "<the API Key>"
  • cross_validation is deprecated (and was removed in scikit-learn 0.20) – model_selection is used instead:
    from sklearn import model_selection
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
  • It's easy to swap out different algorithms – some can run in parallel (check the algorithm's documentation for an n_jobs parameter)
  • n_jobs=-1 uses all available processors
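
The scaling and splitting points above can be checked on a small example. This is a minimal sketch on synthetic data (the two columns and their value ranges are made up for illustration, not taken from Quandl); it shows what preprocessing.scale does to each column and how test_size=0.2 divides the rows.

```python
import numpy as np
from sklearn import preprocessing, model_selection

# Synthetic stand-in for the features: two columns on very different scales
# (hypothetical values, not real Quandl data)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(500.0, 50.0, size=100),   # price-like column
    rng.normal(2e6, 5e5, size=100),      # volume-like column
])
y = rng.normal(size=100)

X_scaled = preprocessing.scale(X)        # standardize each column

col_means = X_scaled.mean(axis=0)        # approximately 0 for every column
col_stds = X_scaled.std(axis=0)          # approximately 1 for every column
max_abs = np.abs(X_scaled).max()         # greater than 1: values are not clipped to [-1, 1]

# Hold back 20% of the rows for testing; random_state only makes the split repeatable
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_scaled, y, test_size=0.2, random_state=0)

print(col_means, col_stds, max_abs)
print(X_train.shape, X_test.shape)
```

After scaling, both columns contribute on a comparable scale, which is the property that tends to help algorithms such as SVR.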
import pandas as pd
import quandl, math # Quandl data access and math helpers
import numpy as np # support for arrays
from sklearn import preprocessing, model_selection, svm # scaling, train/test splitting and support vector machines
from sklearn.linear_model import LinearRegression # regression

quandl.ApiConfig.api_key = "<the API Key>" # replace with your own key from your Quandl account

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# keep only the adjusted price and volume columns, then derive two percentage features
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]

forecast_col = 'Adj. Close' # define what we're forecasting
df.fillna(-99999, inplace=True) # replaces missing data with an outlier value (-99999) rather than getting rid of any data

forecast_out = int(math.ceil(0.01*len(df))) # forecast horizon in rows: 1% of the length of the dataframe, rounded up by math.ceil and converted to an integer

df['label'] = df[forecast_col].shift(-forecast_out) # adds a new column 'label' holding the 'Adj. Close' value from forecast_out rows (1% of the dataset) into the future
df.dropna(inplace=True) # the last forecast_out rows have no future value, so their label is NaN; drop them before fitting
# print (df.head()) # just used to check data

X = np.array(df.drop(columns=['label'])) # everything except the label column; drop returns a new dataframe that is then converted to a numpy array and stored as X
y = np.array(df['label']) # array of labels

X = preprocessing.scale(X) # scale X before fitting the model - this can help with accuracy and training speed but adds a processing step: can be skipped

### create training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2) # test_size=0.2 holds back 20% of the rows for testing

### Swapping different algorithms
# clf = LinearRegression() # simple linear regressions
# clf = LinearRegression(n_jobs=10) # linear regression using threading, 10 jobs at a time = faster
clf = LinearRegression(n_jobs=-1) # linear regression using as many parallel jobs as the processor will handle
# clf = svm.SVR() # base support vector regression
# clf = svm.SVR(kernel="poly") # support vector regression with specific kernel
clf.fit(X_train, y_train) # fit the model to the training data
accuracy = clf.score(X_test, y_test) # score it against the test set (for regressors this is R^2, the coefficient of determination)
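
The script above needs a Quandl key to run, so here is a rough, self-contained sketch of the same steps on made-up data. It shows how shift(-forecast_out) builds the label column and how the estimator can be swapped without touching the rest of the code. The random-walk 'close' series, the 'volume' column and the model names in the dictionary are all assumptions for illustration; the scores it prints say nothing about real stock data.

```python
import math
import numpy as np
import pandas as pd
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression

# Synthetic random-walk "price" standing in for 'Adj. Close' (hypothetical data)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'close': 100 + rng.normal(0, 1, 500).cumsum(),
    'volume': rng.integers(1_000, 5_000, 500).astype(float),
})

forecast_out = int(math.ceil(0.01 * len(df)))    # 1% of the rows, rounded up
df['label'] = df['close'].shift(-forecast_out)   # label = close value forecast_out rows later
n_missing = int(df['label'].isna().sum())        # the final rows have no future value
df.dropna(inplace=True)                          # drop them so the model never sees NaN

X = preprocessing.scale(np.array(df.drop(columns=['label'])))
y = np.array(df['label'])
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0)

# Swap algorithms by changing only the estimator object
models = {
    'linear': LinearRegression(n_jobs=-1),
    'svr_rbf': svm.SVR(),
    'svr_poly': svm.SVR(kernel='poly'),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)                    # train on 80% of the rows
    scores[name] = clf.score(X_test, y_test)     # R^2 on the held-out 20%
    print(name, scores[name])
```

Because every estimator exposes the same fit and score methods, comparing algorithms only requires changing the entries in the dictionary.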

