Covid-19

There have only been a few times when I felt I was living through history – the Berlin Wall coming down, 9/11, 7/7… And the difference with all of those big events was that I was always an observer, not caught up in them directly.

This one is very different as it’s touching everyone – no part of the world is escaping the impact in some way, shape or form.

So we’ve been in lockdown for about a week, with my boy (Isaac) being off school from Thursday, and so far it’s all been ok. It’s taken a bit of getting used to – replicating the school routine, and having to consciously put thought into shopping and food. It’s also struck me as slightly strange how people react and get into panic buying, a behaviour pattern that always strikes me at Christmas (when the shops are only closed for two days).

Anyway, I’ve promised myself to keep some notes through this period on how we’re coping – to document our part of it all.

Machine Learning – Tutorial 4

Regression – Training and Testing

This tutorial covered the first application of Regression to sample data.

https://pythonprogramming.net/training-testing-machine-learning-tutorial/

Key takeaways being:-

  • X & y – define features and labels respectively
  • Scaling data helps accuracy and performance, although the scaling step itself takes time (see the small sketch after this list):
    Generally, you want your features in machine learning to be in a range of -1 to 1. This may do nothing, but it usually speeds up processing and can also help with accuracy. Because this range is so popularly used, it is included in the preprocessing module of Scikit-Learn. To utilize this, you can apply preprocessing.scale to your X variable.
  • Quandl limits the number of anonymous API calls; the workaround requires registering and passing an API key with requests:-
    quandl.ApiConfig.api_key = "<the API Key>"
  • cross_validation is deprecated – model_selection can be used instead
    from sklearn import model_selection
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
  • It’s easy to swap out different algorithms – some can be threaded, allowing parallel processing (check the algorithm documentation for n_jobs)
  • n_jobs=-1 sets the number of jobs to the maximum number for the processor
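As a quick illustration of the scaling point above (a minimal sketch with made-up numbers, not from the tutorial), preprocessing.scale standardises each feature column to roughly zero mean and unit variance:

import numpy as np
from sklearn import preprocessing

# hypothetical feature matrix - two columns on very different scales
toy = np.array([[1.0, 200.0],
                [2.0, 300.0],
                [3.0, 400.0]])

print(preprocessing.scale(toy))
# each column now has mean ~0 and standard deviation ~1, roughly:
# [[-1.22 -1.22]
#  [ 0.    0.  ]
#  [ 1.22  1.22]]

The full script for the tutorial then being:-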
import pandas as pd
import quandl, math #imports Math and Quandl
import numpy as np # support for arrays
from sklearn import preprocessing, model_selection, svm #machine learning and
from sklearn.linear_model import LinearRegression # regression

quandl.ApiConfig.api_key = "<the API Key>"

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]


forecast_col = 'Adj. Close' # define what we're forecasting
df.fillna(-99999, inplace=True) # replaces missing data with an outlier value (-99999) rather than getting rid of any data

forecast_out = int(math.ceil(0.01*len(df))) # math.ceil rounds up to the nearest whole number - so this takes 1% of the length of the dataframe, rounds it up and finally converts it to an integer
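# worked example (hypothetical row count, not from the tutorial): if len(df) were 3,420
# then 0.01 * 3420 = 34.2 and math.ceil(34.2) = 35, so forecast_out = 35 days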

df['label'] = df[forecast_col].shift(-forecast_out) # adds a new column 'label' containing the 'Adj. Close' value from forecast_out days in the future (roughly 1% of the dataset length ahead)
df.dropna(inplace=True)
# print (df.head()) #just used to check data

X = np.array(df.drop(['label'], axis=1)) # everything except the label column; this returns a new dataframe that is then converted to a numpy array and stored as X
y = np.array(df['label']) # array of labels

X = preprocessing.scale(X) # scale X before classifier - this can help with performance but can also take longer: can be skipped
y = np.array(df['label'])

### create training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2) # 0.2 = 20% of the dataframe reserved for testing

### Swapping in different algorithms
# clf = LinearRegression() # simple linear regressions
# clf = LinearRegression(n_jobs=10) # linear regression using threading, 10 jobs at a time = faster
clf = LinearRegression(n_jobs=-1) # linear regression using threading, with as many jobs as the processor will handle
# clf = svm.SVR() # base support vector regression
# clf = svm.SVR(kernel="poly") # support vector regression with specific kernel

clf.fit(X_train, y_train) # fit the classifier to the training data
accuracy = clf.score(X_test, y_test) # score against the test set (for regression this is the R^2 value)

print(accuracy)

Machine Learning – Tutorial 3

Regression – Features and Labels

So the first two tutorials basically introduced the topic and imported some stock data – straightforward. The biggest takeaway was the use of Quandl – I’ll be doing some research into them at a later date.

So this tutorial gets into the meat of regression, using NumPy to convert the data into arrays for scikit-learn to do its thing.

Quick note on features and labels:-

  • features are the descriptive attributes
  • labels are what we’re trying to predict or forecast

A common example with regression might be to try to predict the dollar value of an insurance policy premium for someone. The company may collect your age, past driving infractions, public criminal record, and your credit score for example. The company will use past customers, taking this data, and feeding in the amount of the “ideal premium” that they think should have been given to that customer, or they will use the one they actually used if they thought it was a profitable amount.

Thus, for training the machine learning classifier, the features are customer attributes, the label is the premium associated with those attributes.
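To make that concrete, here’s a minimal sketch (entirely made-up numbers, not from the tutorial) of how those customer attributes and premiums would map onto features (X) and labels (y):

import numpy as np

# hypothetical customers: [age, past infractions, criminal record (0/1), credit score]
X = np.array([[25, 2, 0, 640],
              [47, 0, 0, 720],
              [33, 1, 1, 580]])

# the label for each customer is the premium (in dollars) we want to predict
y = np.array([1200, 650, 1500])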

import pandas as pd
import quandl, math #imports Math and Quandl

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]


forecast_col = 'Adj. Close' # define what we're forecasting
df.fillna(-99999, inplace=True) # replaces missing data with an outlier value (-99999) rather than getting rid of any data

forecast_out = int(math.ceil(0.01*len(df))) # math.ceil rounds up to the nearest whole number - so this takes 1% of the length of the dataframe, rounds it up and finally converts it to an integer

df['label'] = df[forecast_col].shift(-forecast_out) # adds a new column 'label' containing the 'Adj. Close' value from forecast_out days in the future (roughly 1% of the dataset length ahead)
df.dropna(inplace=True)
print (df.head()) #just used to check data

Tutorial script here:-

https://pythonprogramming.net/features-labels-machine-learning-tutorial/?completed=/regression-introduction-machine-learning-tutorial/

Machine Learning – Tutorial 2

Regression Intro

import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL') # import data from Quandl
# print (df.head()) # print out the head rows of the data to check what we're getting

# create a dataframe
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close','Adj. Volume']]
df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100

df = df[['Adj. Close','HL_pct','pct_Change','Adj. Volume']]
print (df.head()) # just used to check the data

Python interlude

So the key project I’ve been working on is looking at stock market data and trying to develop a set of Python tools that allow me to make better predictions.

This started with Beautiful Soup scripts designed to harvest company fundamentals from the London Stock Exchange website – integrating this with a SQL database for later analysis. This should give me all the raw data I need to determine whether a company is viable (passing a set of qualifier tests), along with (eventually) a way of predicting a base share value.

The second element to the project is then using sentiment analysis to look at how those same companies are being discussed on social media. This has been based on sentdex’s tutorials using Twitter, but my hope is to adapt these to other platforms. This then complements the base data with some views on where current sentiment is – and hopefully there is a correlation between the two data sets.

However, the current virus has sent most stock indices into free fall, which is probably going to skew my model. So I’m going to let it pass while working on a few supplementary modules that I would have got around to including at a later date.

Namely Machine Learning 🙂

So as before I’m going to follow sentdex’s tutorials on this.
There are a few Udemy courses I’ve done in this area – so I might not keep extensive notes – just cover the key elements I need to keep track of.


Natural Language Tool Kit – Tutorial 21

Graphing Live Twitter Sentiment 

The final stage of this set of tutorials is graphing the sentiment output… based on a more in-depth tutorial here:-

https://pythonprogramming.net/live-graphs-matplotlib-tutorial/

Code here:-

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style
import time

style.use("ggplot")

fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

def animate(i):
    pullData = open("twitter-out.txt","r").read()
    lines = pullData.split('\n')

    xar = []
    yar = []

    x = 0
    y = 0

    for l in lines[-200:]:
        x += 1
        if "pos" in l:
            y += 1
        elif "neg" in l:
            y -= 1

        xar.append(x)
        yar.append(y)
        
    ax1.clear()
    ax1.plot(xar,yar)
ani = animation.FuncAnimation(fig, animate, interval=1000)
plt.show()

As I’m running this on a headless server, it ran into issues straight away…

galiquis@localhost: $ python3 nltk_tutorial21.py
Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused

(nltk_tutorial21.py:26565): Gdk-CRITICAL **: 08:21:48.134: gdk_cursor_new_for_display: assertion ‘GDK_IS_DISPLAY (display)’ failed

(nltk_tutorial21.py:26565): Gdk-CRITICAL **: 08:21:48.137: gdk_cursor_new_for_display: assertion ‘GDK_IS_DISPLAY (display)’ failed

So after a little research I found a way of switching the matplotlib backend to Agg – allowing the output to be saved rather than shown.

import matplotlib as mpl
mpl.use('Agg')

It’s important that this is set before any other matplotlib imports or calls – otherwise it throws errors.

The other tweaks I made just switched off the animation for now.

import matplotlib as mpl
mpl.use('Agg')

import matplotlib.pyplot as plt
import matplotlib.animation as animation
#from matplotlib import style

import time

#style.use("ggplot")


fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

def animate(i):
    pullData = open("twitter-out.txt","r").read()
    lines = pullData.split('\n')

    xar = []
    yar = []

    x = 0
    y = 0

    for l in lines[-200:]:
        x += 1
        if "pos" in l:
            y += 1
        elif "neg" in l:
            y -= 1

        xar.append(x)
        yar.append(y)

    ax1.clear()
    ax1.plot(xar,yar)

#ani = animation.FuncAnimation(fig, animate, interval=100)
animate(1)
fig.savefig('temp.png')

Natural Language Tool Kit – Tutorial 20

Twitter sentiment analysis

First up, the Twitter API module (tweepy) needed installing:-

galiquis@raspberrypi: $ pip3 install tweepy

Next a Twitter App is required from this link:-

https://developer.twitter.com/en/apps

This required setting up a developer account – with more justification needed on the application form than I was expecting, especially around what I’d be using the app for. Anyway, once generated, it gave a live stream of tweets based on this code:-

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

#consumer key, consumer secret, access token, access secret.
ckey="*"
csecret="*"
atoken="*"
asecret="*"

class listener(StreamListener):

    def on_data(self, data):
        print(data)
        return(True)

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, listener())
twitterStream.filter(track=["car"])

https://pythonprogramming.net/twitter-api-streaming-tweets-python-tutorial/

The code below covers a few tweaks, with the output of the sentiment engine being saved off to a text file.

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
import sentiment_mod as s

#consumer key, consumer secret, access token, access secret.
ckey="*"
csecret="*"
atoken="*"
asecret="*"

class listener(StreamListener):

    def on_data(self, data):
        all_data = json.loads(data)
        tweet = all_data["text"]
        sentiment_value, confidence = s.sentiment(tweet)
        print(tweet, sentiment_value, confidence)

        if confidence*100 >= 80:
            output = open("twitter-out.txt", "a")
            output.write(sentiment_value)
            output.write('\n')
            output.close()

        return(True)

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, listener())
twitterStream.filter(track=["car"]) # term searched for in tweets

Next we’ll look at graphing this data.

Natural Language Tool Kit – Tutorial 19

Sentiment Analysis Module 

This section brings all the detail together to create a module that can be used to monitor Twitter sentiment.

The code for this is separated into two blocks.

The first pickles most of the heavy training to save time in future iterations:-

import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize
#from unidecode import unidecode

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

short_pos = open("positive.txt","r", encoding='utf-8', errors='replace').read() ## had to add a line to tell the open function to use utf-8
short_neg = open("negative.txt","r", encoding='utf-8', errors='replace').read()

# move this up here
all_words = []
documents = []

#  j is adjective, r is adverb, and v is verb
#allowed_word_types = ["J","R","V"]
allowed_word_types = ["J"]

for p in short_pos.split('\n'):
    documents.append( (p, "pos") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

for p in short_neg.split('\n'):
    documents.append( (p, "neg") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

save_documents = open("pickled_algos/documents.pickle","wb")
pickle.dump(documents, save_documents)
save_documents.close()


all_words = nltk.FreqDist(all_words)


word_features = list(all_words.keys())[:5000]


save_word_features = open("pickled_algos/word_features5k.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[7400:] # memory limitations on the Pi meant this needed reducing from 10,000
training_set = featuresets[:7400]

save_featuresets = open("pickled_algos/featuresets.pickle","wb") ## added code to pickle featuresets
pickle.dump(featuresets, save_featuresets)
save_featuresets.close()

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

###############
save_classifier = open("pickled_algos/originalnaivebayes5k.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/MNB_classifier5k.pickle","wb")
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/BernoulliNB_classifier5k.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LogisticRegression_classifier5k.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()


LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LinearSVC_classifier5k.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()


##NuSVC_classifier = SklearnClassifier(NuSVC())
##NuSVC_classifier.train(training_set)
##print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier accuracy percent:",nltk.classify.accuracy(SGDC_classifier, testing_set)*100)

save_classifier = open("pickled_algos/SGDC_classifier5k.pickle","wb")
pickle.dump(SGDC_classifier, save_classifier)
save_classifier.close()

I had to drop the size of the training set given the memory restrictions on the Pi. I’m not sure there was another way around this, and it probably means that to start doing some of this stuff seriously… I’m going to need a bigger/better server.

Also I added a section to pickle the featuresets:-

save_featuresets = open("pickled_algos/featuresets.pickle","wb") ## added code to pickle featuresets
pickle.dump(featuresets, save_featuresets)
save_featuresets.close()

One option might be clustering the Pi or putting together a low power server – which then leads into building a full home server farm.

The next section is then the sentiment module:-

#File: sentiment_mod.py

import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize



class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf


documents_f = open("pickled_algos/documents.pickle", "rb")
documents = pickle.load(documents_f)
documents_f.close()




word_features5k_f = open("pickled_algos/word_features5k.pickle", "rb")
word_features = pickle.load(word_features5k_f)
word_features5k_f.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features



featuresets_f = open("pickled_algos/featuresets.pickle", "rb")
featuresets = pickle.load(featuresets_f)
featuresets_f.close()

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[10000:]
training_set = featuresets[:10000]



open_file = open("pickled_algos/originalnaivebayes5k.pickle", "rb")
classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/MNB_classifier5k.pickle", "rb")
MNB_classifier = pickle.load(open_file)
open_file.close()



open_file = open("pickled_algos/BernoulliNB_classifier5k.pickle", "rb")
BernoulliNB_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/LogisticRegression_classifier5k.pickle", "rb")
LogisticRegression_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/LinearSVC_classifier5k.pickle", "rb")
LinearSVC_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/SGDC_classifier5k.pickle", "rb")
SGDC_classifier = pickle.load(open_file)
open_file.close()

voted_classifier = VoteClassifier(
                                  classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)


def sentiment(text):
    feats = find_features(text)
    return voted_classifier.classify(feats),voted_classifier.confidence(feats)
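A quick way to sanity-check the module from another script or the interpreter (a hedged example – the exact confidence value will vary with the shuffled training data):

import sentiment_mod as s

print(s.sentiment("This movie was awesome! The acting was great, plot was wonderful."))
# prints a tuple like ('pos', 1.0) - the voted classification and the confidence of the vote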

Natural Language Tool Kit – Tutorial 18

Better training data

This tutorial covers training the algorithm on a new, more tailored data set. The training data still consists of movie reviews, but much shorter ones – which should give better results.