Machine Learning

Machine Learning – Tutorial 5

Regression – Forecasting and Predicting

This next tutorial covers using the trained regression model to forecast future values. Full notes here:-

https://pythonprogramming.net/forecasting-predicting-machine-learning-tutorial/

Key takeaways:-

import pandas as pd
import quandl, math, datetime # imports Quandl, math and datetime
import numpy as np # support for arrays
from sklearn import preprocessing, model_selection, svm # machine learning and
from sklearn.linear_model import LinearRegression # regression
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

quandl.ApiConfig.api_key = "9qfnyWSTDUpx6uhNX2dc"

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]


forecast_col = 'Adj. Close' # define what we're forecasting
df.fillna(-99999, inplace=True) # replaces missing data with an outlier value (-99999) rather than getting rid of any data

forecast_out = int(math.ceil(0.01*len(df))) # math.ceil rounds up to the nearest whole number - so this takes 1% of the length of the dataframe, rounds it up and converts it to an integer

df['label'] = df[forecast_col].shift(-forecast_out) # adds a new column 'label' containing the 'Adj. Close' value from forecast_out days in the future (1% of the dataset's length)

# print (df.head()) #just used to check data

X = np.array(df.drop(['label'], axis=1)) # everything except the label column; this returns a new dataframe that is then converted to a numpy array and stored as X
X = preprocessing.scale(X) # scale X before the classifier - this can help with performance but can also take longer: can be skipped
X_lately = X[-forecast_out:] # the most recent rows, used to predict against - note there are no y values for these to check against
X = X[:-forecast_out] # needs to happen after scaling



df.dropna(inplace=True)
y = np.array(df['label']) # array of labels

### create training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2) # test_size=0.2 reserves 20% of the data for testing

### Swapping different algorithms
# clf = LinearRegression() # simple linear regression
# clf = LinearRegression(n_jobs=10) # linear regression run in parallel, 10 jobs at a time = faster
clf = LinearRegression(n_jobs=-1) # linear regression run in parallel with as many jobs as the processor will handle
# clf = svm.SVR() # base support vector regression
# clf = svm.SVR(kernel="poly") # support vector regression with a specific kernel

clf.fit(X_train, y_train) # fit the classifier to the training data
accuracy = clf.score(X_test, y_test) # score it against the test set (for regression, score returns R²)

# print(accuracy)

### prediction - easy once the classifier is set up

forecast_set = clf.predict(X_lately)
print (forecast_set, accuracy, forecast_out)
df['Forecast'] = np.nan

last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day # moving to future dates not in dataset

## Extend the DataFrame with future dates holding the forecast values
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i] # NaN for every column except 'Forecast', which takes the predicted value

df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.savefig('ML_Tutorial5.svg', bbox_inches='tight') # bbox_inches='tight' minimises whitespace around the figure
plt.show()
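
The trickiest line above is probably the shift() that builds the label column. As a minimal sketch (toy numbers, invented for illustration), shift(-forecast_out) pairs each row with the close price forecast_out rows ahead; the trailing rows have no future value and get NaN labels, which is why they end up in X_lately rather than the training set:

import pandas as pd

df_demo = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2
df_demo['label'] = df_demo['Adj. Close'].shift(-forecast_out)
print(df_demo)
#    Adj. Close  label
# 0        10.0   12.0
# 1        11.0   13.0
# 2        12.0   14.0
# 3        13.0    NaN  <- no future value available...
# 4        14.0    NaN  <- ...these rows become X_lately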

Machine Learning – Tutorial 4

Regression – Training and Testing

This tutorial covered the first application of Regression to sample data.

https://pythonprogramming.net/training-testing-machine-learning-tutorial/

Key takeaways being:-

  • X & y – define features and labels respectively
  • Scaling data helps accuracy and performance but can take longer (see the sketch after this list):
    Generally, you want your features in machine learning to be in a range of -1 to 1. This may do nothing, but it usually speeds up processing and can also help with accuracy. Because this range is so popularly used, it is included in the preprocessing module of Scikit-Learn. To utilize this, you can apply preprocessing.scale to your X variable:
  • Quandl limits the number of anonymous API calls, the workaround being to register and pass a key with requests:-
    quandl.ApiConfig.api_key = "<the API Key>"
  • cross_validation is deprecated – model_selection can be used instead
    from sklearn import model_selection
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
  • It’s easy to swap out different algorithms – some can be parallelised for faster processing (check the algorithm documentation for n_jobs)
  • n_jobs=-1 sets the number of jobs to the maximum the processor will handle
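
As a quick illustration of the scaling point (a minimal sketch on made-up numbers, not the tutorial's data) - strictly speaking, preprocessing.scale standardises each feature column to zero mean and unit variance, which usually lands values in roughly the -1 to 1 range:

import numpy as np
from sklearn import preprocessing

# two made-up feature columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

X_scaled = preprocessing.scale(X)
print(X_scaled.mean(axis=0)) # ~[0. 0.] - each column now centred on zero
print(X_scaled.std(axis=0))  # [1. 1.] - with unit variance

The full script for this tutorial:-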
import pandas as pd
import quandl, math # imports Quandl and math
import numpy as np # support for arrays
from sklearn import preprocessing, model_selection, svm # machine learning and
from sklearn.linear_model import LinearRegression # regression

quandl.ApiConfig.api_key = "9qfnyWSTDUpx6uhNX2dc"

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]


forecast_col = 'Adj. Close' # define what we're forecasting
df.fillna(-99999, inplace=True) # replaces missing data with an outlier value (-99999) rather than getting rid of any data

forecast_out = int(math.ceil(0.01*len(df))) # math.ceil rounds up to the nearest whole number - so this takes 1% of the length of the dataframe, rounds it up and converts it to an integer

df['label'] = df[forecast_col].shift(-forecast_out) # adds a new column 'label' containing the 'Adj. Close' value from forecast_out days in the future (1% of the dataset's length)
df.dropna(inplace=True)
# print (df.head()) #just used to check data

X = np.array(df.drop(['label'], axis=1)) # everything except the label column; this returns a new dataframe that is then converted to a numpy array and stored as X
X = preprocessing.scale(X) # scale X before the classifier - this can help with performance but can also take longer: can be skipped
y = np.array(df['label']) # array of labels

### create training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2) # test_size=0.2 reserves 20% of the data for testing

### Swapping different algorithms
# clf = LinearRegression() # simple linear regression
# clf = LinearRegression(n_jobs=10) # linear regression run in parallel, 10 jobs at a time = faster
clf = LinearRegression(n_jobs=-1) # linear regression run in parallel with as many jobs as the processor will handle
# clf = svm.SVR() # base support vector regression
# clf = svm.SVR(kernel="poly") # support vector regression with a specific kernel

clf.fit(X_train, y_train) # fit the classifier to the training data
accuracy = clf.score(X_test, y_test) # score it against the test set (for regression, score returns R²)

print(accuracy)

Machine Learning – Tutorial 3

Regression – Features and Labels

So the first two tutorials basically introduced the topic and imported some stock data – straightforward. The biggest takeaway was the use of Quandl – I’ll be doing some research into them at a later date.

So this tutorial gets into the meat of regression, using NumPy to convert the data into arrays for Scikit-learn to do its thing.

Quick note on features and labels:-

  • features are the descriptive attributes
  • labels are what we’re trying to predict or forecast

A common example with regression might be trying to predict the dollar value of an insurance policy premium for someone. The company may collect your age, past driving infractions, public criminal record, and your credit score, for example. It will then use data from past customers, feeding in the “ideal premium” it thinks should have been charged for each one, or the premium actually charged if that turned out to be profitable.

Thus, for training the machine learning classifier, the features are customer attributes, the label is the premium associated with those attributes.
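
As a minimal sketch of that idea (the numbers and columns below are invented for illustration, not from any real insurer):

import numpy as np

# features: [age, past infractions, criminal record (0/1), credit score]
X = np.array([[25, 1, 0, 640],
              [40, 0, 0, 720],
              [33, 3, 1, 580]])

# labels: the premium (in dollars) charged to each customer
y = np.array([1200.0, 800.0, 2100.0])

# a regressor then learns the mapping from customer attributes to premium

Back to the tutorial code:-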

import pandas as pd
import quandl, math # imports Quandl and math

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]


forecast_col = 'Adj. Close' # define what we're forecasting
df.fillna(-99999, inplace=True) # replaces missing data with an outlier value (-99999) rather than getting rid of any data

forecast_out = int(math.ceil(0.01*len(df))) # math.ceil rounds up to the nearest whole number - so this takes 1% of the length of the dataframe, rounds it up and converts it to an integer

df['label'] = df[forecast_col].shift(-forecast_out) # adds a new column 'label' containing the 'Adj. Close' value from forecast_out days in the future (1% of the dataset's length)
df.dropna(inplace=True)
print (df.head()) #just used to check data

Tutorial script here:-

https://pythonprogramming.net/features-labels-machine-learning-tutorial/?completed=/regression-introduction-machine-learning-tutorial/

Machine Learning – Tutorial 2

Regression Intro

import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]
print (df.head()) # just used to check the data

Python interlude

So the key project I’ve been working on is looking at stock market data and trying to develop a set of Python tools that allow me to make better predictions.

This started with Beautiful Soup scripts designed to harvest company fundamentals from the London Stock Exchange website – integrating this with a SQL database for later analysis. This should give me all the raw data I need to determine whether a company is viable (passing a set of qualifier tests) along with (eventually) a way of predicting a base share value.

The second element of the project is then using sentiment analysis to look at how those same companies are being discussed on social media. This has been based on Sentdex’s tutorials using Twitter, but my hope is to adapt these to other platforms. This then complements the base data with some views on where current sentiment is – and hopefully there is a correlation between the two data-sets.

However, the current virus has sent most stock indexes into free fall, which is probably going to skew my model. So I’m going to let it pass while working on a few supplementary modules that I would have gotten around to including at a later date.

Namely Machine Learning 🙂

So as before I’m going to follow Sentdex’s tutorials on this.
There are a few Udemy courses I’ve done in this area – so I might not keep extensive notes – just cover the key elements I need to keep track of.

Anyway….music not related…