Regression – Features and Labels
So the first two tutorials basically introduced the topic and imported some stock data, all straightforward. The biggest takeaway was the use of Quandl; I’ll be doing some research into them at a later date.
So this tutorial gets into the meat of regression, using NumPy to convert data into NumPy arrays for scikit-learn to do its thing.
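As a rough sketch of what that conversion can look like: the frame below uses made-up numbers, and the names `X` and `y` are just a common convention for the features and the label (both explained below), not something taken from the tutorial script.

```python
import numpy as np
import pandas as pd

# Made-up numbers purely for illustration (not real stock data)
df = pd.DataFrame({
    "Adj. Close": [10.0, 11.0, 12.0],
    "Adj. Volume": [100.0, 150.0, 120.0],
    "label": [11.0, 12.0, 13.0],
})

# Everything except the label column becomes a 2-D array of features
X = np.array(df.drop(columns=["label"]))
# The label column becomes a 1-D array
y = np.array(df["label"])

print(X.shape)
print(y.shape)
```

Each row of `X` lines up with the entry of `y` at the same position, which is the layout scikit-learn estimators generally expect.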
Quick note on features and labels:-
- features are the descriptive attributes
- labels are what we’re trying to predict or forecast
A common example with regression might be to try to predict the dollar value of an insurance policy premium for someone. The company may collect your age, past driving infractions, public criminal record, and your credit score, for example. The company will take this data for past customers and feed in, alongside it, the “ideal premium” that they think should have been given to each customer, or the premium they actually charged if they thought it was a profitable amount.
Thus, for training the machine learning model, the features are the customer attributes and the label is the premium associated with those attributes.
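To make the insurance example a bit more concrete, here is a minimal sketch with entirely made-up customers. It assumes only two features (age and infractions) and uses plain NumPy least squares as a stand-in for a proper regression library, so treat it as an illustration of features versus labels rather than how an insurer would actually do it.

```python
import numpy as np

# Hypothetical past customers: [age, driving infractions] (made-up values)
features = np.array([
    [25, 2],
    [40, 0],
    [33, 1],
    [58, 0],
    [47, 3],
], dtype=float)

# Hypothetical premiums, generated here as 1000 - 5*age + 150*infractions
labels = np.array([1175.0, 800.0, 985.0, 710.0, 1215.0])

# Add a column of ones so the fit can learn an intercept
design = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: find the coefficients that best map features to labels
coef, *_ = np.linalg.lstsq(design, labels, rcond=None)

# Predict the premium for a new (hypothetical) 30-year-old with 1 infraction
new_customer = np.array([1.0, 30.0, 1.0])
predicted = new_customer @ coef
print(coef)
print(predicted)
```

The point is just the shape of the problem: one row of features per customer, one label per row, and a fitted model that turns a new row of features into a predicted label.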
    import math  # used for math.ceil below
    import pandas as pd
    import quandl

    df = quandl.get('WIKI/GOOGL')  # import data from Quandl
    # print(df.head())  # print out the head rows of the data to check what we're getting

    # create a dataframe with just the columns we need
    df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
    df['HL_pct'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
    df['pct_Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100
    df = df[['Adj. Close', 'HL_pct', 'pct_Change', 'Adj. Volume']]

    forecast_col = 'Adj. Close'  # define what we're forecasting
    # replace missing data with an outlier value (-99999) rather than getting rid of any data
    df.fillna(-99999, inplace=True)

    # math.ceil rounds up to the nearest whole number, so this takes 1% of the
    # length of the dataframe, rounds it up and converts it to an integer
    forecast_out = int(math.ceil(0.01 * len(df)))

    # add a new column 'label' holding the 'Adj. Close' value from forecast_out rows
    # into the future (1% of the dataset's length, not 1 day)
    df['label'] = df[forecast_col].shift(-forecast_out)
    df.dropna(inplace=True)  # drop the last forecast_out rows, which have no label
    print(df.head())  # just used to check data
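The `shift` step is the part that is easiest to misread, so here is a small sketch of it on a toy frame. The prices, the row count of 3424 and the horizon of 2 are all made up for illustration; the real script derives the horizon from the length of the Quandl data.

```python
import math
import pandas as pd

# How the 1% horizon works out for a hypothetical dataset of 3424 rows
example_horizon = int(math.ceil(0.01 * 3424))  # 34.24 rounds up to 35

# Toy closing prices (made-up values, not real market data)
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

forecast_out = 2  # hypothetical horizon, small enough to see by eye

# shift(-2) moves the column up two rows, so each row's label
# is the closing price two rows later
toy["label"] = toy["Adj. Close"].shift(-forecast_out)

# The last two rows have no future price to use, so their label is NaN
labelled = toy.dropna()

print(toy)
print(labelled)
```

So the first row (close 10.0) gets the label 12.0, and the final `forecast_out` rows are dropped because there is nothing in the future to label them with.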
Tutorial script here:-