Step 1

Generating real-time time series data

Overview

The first step of this exercise is to generate the required historical time-series data for our Tank, T-1 process. The files created will hold two (2) sets of historical data: one set for training and another for validation. The reasoning for each is discussed below.

Assumptions

It is assumed that you have downloaded the Python-based trial repo from our GitHub repository and created the proper virtual environment (venv) by installing the requirements listed in requirements.txt. The code samples shown in this section are all from VS Code.

Executing Step 1

// Execute Step 1 from the terminal
python ./step1.py

// During execution, the following will read out on the command line
Generating the sample Training Data.....
Working, Please wait.....
Training Data Generation Results: {'status': True, 'message': 'Succesfully generated the test data traindata.csv'}
Generating the sample Validation Data.....
Working, Please wait.....
Validation DAta Generation Results: {'status': True, 'message': 'Succesfully generated the test data valdata.csv'}

It is important to wait during the 'Working, Please wait.....' notifications. Once the training data has been generated, an image should appear. Close that image to complete Step 1.

Training vs Validation Data

During this step, two (2) files were created and should be available in the main directory of your project. Please do not move these files, as they will be used in Steps 2 and 3. Data selection is the most valuable step in the process and a skill that is developed and honed over a career, so we automated this portion for you. The key concept here is the difference between training data and validation data. What makes good data to train on and to validate on is covered in a different lesson.

traindata.csv
valdata.csv
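
Before moving on to Step 2, it can help to confirm that both files are where they are expected. The sketch below assumes it is run from the main directory of the project. The helper name missing_files is hypothetical and is not part of the trial repo.

```python
from pathlib import Path

# File names reported by step1.py in its console output.
EXPECTED_FILES = ("traindata.csv", "valdata.csv")


def missing_files(project_dir="."):
    """Return the expected data files that are absent from project_dir.

    Hypothetical helper for illustration; not part of the trial repo.
    """
    root = Path(project_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", ", ".join(missing), "(re-run step1.py)")
    else:
        print("Both data files are present.")
```

If either file is reported missing, re-running Step 1 from the project's main directory should recreate it.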

Training Data

The traindata.csv file is the data set that will be used for building the digital environment in Step 2 and training the agent in Step 3. Training data, in practice, is a manually selected subset of time intervals, not the full historical timeline. This is important to understand because not all data is valuable for machine learning. In this example, the manual process of pulling sample sets from the historical database was fast-tracked and delivered as one .csv file.
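
To make the idea of "a manually selected subset of time intervals" concrete, the sketch below filters a small in-memory history down to a few hand-picked windows. The timestamps, the level field, and the select_intervals helper are all assumptions made for illustration. This is not how step1.py produces traindata.csv.

```python
from datetime import datetime, timedelta


def select_intervals(rows, intervals):
    """Keep only rows whose timestamp falls in one of the [start, end) intervals."""
    return [
        row for row in rows
        if any(start <= row["timestamp"] < end for start, end in intervals)
    ]


# Hypothetical history: one sample per hour for a single day.
day = datetime(2023, 1, 1)
history = [
    {"timestamp": day + timedelta(hours=h), "level": 50.0 + h}
    for h in range(24)
]

# Hand-picked windows judged useful for training (assumed for illustration).
train_intervals = [
    (day + timedelta(hours=2), day + timedelta(hours=6)),
    (day + timedelta(hours=12), day + timedelta(hours=15)),
]

train_rows = select_intervals(history, train_intervals)
print(len(train_rows), "of", len(history), "samples selected for training")
```

The intervals are half-open, so a sample that sits exactly on an end boundary is excluded from that window.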

Validation Data

The valdata.csv file is the data set that will be used for validating the digital environments in Step 2 and the trained responses of the agent in Step 3. Validation data, in practice, is a manually selected subset of time intervals, not the full historical timeline, and it does not cover the same intervals of time as the training data. The key rule of thumb is that training data should not appear in the validation data, and validation data should not appear in the training data. A good validation of a trained model must use data that the model has never been trained on.
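
The separation between training and validation intervals can be checked mechanically. The sketch below flags any training interval that shares time with a validation interval. The dates and the helper names intervals_overlap and find_leaks are hypothetical and do not come from the trial repo.

```python
from datetime import datetime


def intervals_overlap(a, b):
    """True if half-open intervals a=[start, end) and b=[start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]


def find_leaks(train_intervals, val_intervals):
    """Return every (train, validation) pair of intervals that overlap."""
    return [
        (t, v)
        for t in train_intervals
        for v in val_intervals
        if intervals_overlap(t, v)
    ]


# Hypothetical intervals, for illustration only.
train = [(datetime(2023, 1, 1), datetime(2023, 1, 10))]
val_ok = [(datetime(2023, 2, 1), datetime(2023, 2, 5))]
val_bad = [(datetime(2023, 1, 8), datetime(2023, 1, 12))]

print(find_leaks(train, val_ok))        # no shared time, so nothing is reported
print(len(find_leaks(train, val_bad)))  # one overlapping pair is reported
```

An empty result means the validation intervals contain no time the model could have seen during training.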

