Adnan Zaib Bhat
updated on 09 Jan 2019
Definition:
Parse essentially means to ''resolve (a sentence) into its component parts and describe their syntactic roles''. In computing, parsing is 'an act of parsing a string or a text'. [Google Dictionary] File parsing in a computer language means giving meaning to the characters of a text file according to a formal grammar. ''Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.'' (Wikipedia). A parser is a program that parses text files.
Converge File:
A converge file is usually some thermodynamic properties file containing data points related to various properties. In this project, I will be parsing an Engine output file. The file contains 17 thermodynamic properties like crank angle, pressure, temperature, volume, etc. There are thousands of data points for each property.
The Converge file that I will use in this project is named 'engine_data.out' and can be found here:
https://drive.google.com/file/d/1L8GY56d-M8mB1KfceM-xVhGNvnCIqxjm/view
Before one uses the information given in a file, it is very important to understand the file and to find its patterns and meaningful ways of extracting the data. This is a part of data preprocessing. Rigorously speaking, data preprocessing is a technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviours or trends, and is likely to contain many errors. Data preprocessing prepares raw data for further processing. (Techopedia)
In data preprocessing, techniques like data cleansing, data integration, data transformation etc. are used. The first two techniques deal with missing and inconsistent data. Data transformation involves transforming the raw data into meaningful and usable formats.
Now, looking at the engine_data.out file in Fig. 1.1 below, it can be seen that the first line in the file contains Converge release name and date. The second line contains the column numbers. The third and fourth lines contain properties and their units. The fifth line is blank and the data points for each property occur from the 6th line up till the last line of the file. It is also clear that lines which do not contain data points start with the '#' symbol. (I opened the data file in WordPad).
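The layout described above can be checked with a few lines of Python. The following is only a minimal sketch: the sample text is made up to mimic the described layout (it is not taken from engine_data.out), and the function name split_header_and_data is hypothetical.

```python
# Made-up sample that mimics the layout described above:
# four '#' lines, one blank line, then the data lines.
sample = """# CONVERGE Release X.Y (made-up header)
# column 1 2 3
# Crank Pressure Volume
# (DEG) (MPa) (m^3)

-360.0 0.10 5.7e-04
-359.9 0.11 5.7e-04
"""

def split_header_and_data(text):
    """Return (header_lines, data_lines) for text laid out as above."""
    header, data = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue                  # skip blank lines
        if stripped.startswith('#'):
            header.append(stripped)   # non-data line
        else:
            data.append(stripped)     # line holding data points
    return header, data

header, data = split_header_and_data(sample)
print(len(header), 'header lines,', len(data), 'data lines')
```

Separating the two kinds of lines first makes it easy to confirm that the assumptions about the file (which lines are labels, where the data starts) actually hold before any numbers are converted.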
In Python, one of the best ways to parse a file is to use a for-loop and read the file line by line, as shown in the following code.
#Reading and extracting data from engine_data.out file
#Extraction
engine_lines = [] #empty list to hold the lines
for line in open('engine_data.out','r'): #r stands for read only
    engine_lines.append(line) #appends the lines in a list
#Printing Information
print('No of lines = ',len(engine_lines),'\n') #number of lines in the list
print('\n First two lines: \n')
print(engine_lines[0:2]) #first 2 lines
print('\n First data-points line: \n')
print(engine_lines[5]) #first data points line
print('\nClass Type\n')
print('Class of the engine_lines variable = ',type(engine_lines))
print('Class of elements = ', type(engine_lines[0]), type(engine_lines[5]))
From Output 1.1, we see that there are 8670 lines in the file. With the above code, all lines are individual entries stored in the list named engine_lines. It can also be seen that each element in the engine_lines list is a string. However, I need to store each data point for a particular property separately as a number and not as a string.
Most data files are text files and contain some characters which are used to separate data points from each other. If a file is a Comma Separated Values file, then the data points are separated by commas. Looking at the engine_data.out file, it is clear that the data points are separated by spaces. At first glance, when checking the first data lines, one finds that there are three spaces between each data point. So, while coding, one can use this feature of the file and extract all the data points.
For splitting the data points, the built-in string method .split() is very useful. If the Converge file were a CSV file, each line could be split by writing line.split(','). As the data points are separated by (seemingly three) spaces, I could use line.split('   '). [Note the three spaces between the single quotes.] However, the three-space criterion for all lines was only my assumption. Looking keenly at the file, there are certain lines where there are more or fewer than three spaces between the data points (of course, I realised this only after getting errors). The best way is to pass no argument to split(). Called this way, it splits on any run of whitespace and ignores leading and trailing whitespace, so the number of spaces no longer matters. Also, the 'non-data' lines contain the '#' symbol at the beginning. Using these two properties of the Converge file, the code to extract the data points from it is shown below.
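As a quick aside, the difference between the two ways of calling split() can be seen on a single line. This is a small sketch on a made-up line with uneven spacing, not on a line from the actual file.

```python
# Made-up data line: 2 leading spaces, then 3 and 6 spaces between the values
line = '  -360.0   0.10      5.7e-04 \n'

by_three_spaces = line.split('   ')  # splits on exactly three spaces
by_whitespace = line.split()         # splits on any run of whitespace

print(by_three_spaces)  # contains an empty string and leftover spaces/newline
print(by_whitespace)    # only the three values

# Only the whitespace version converts cleanly to numbers
values = [float(v) for v in by_whitespace]
print(values)
```

With the three-space separator, the six-space gap yields an empty string, and float('') raises a ValueError, which is the kind of error mentioned above.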
Raw Data Extraction Code:
#Creating different lists for properties with respective data points
import matplotlib.pyplot as plt #for plotting
#Defining variable types
Crank = []
Pressure = []
Max_Pres = []
Min_Pres = []
Mean_Temp = []
Max_Temp = []
Min_Temp = []
Volume = []
Mass = []
Density = []
Integrated_HR = []
HR_Rate = []
C_p = []
C_v = []
Gamma = []
Kin_Visc = []
Dyn_Visc = []
'''the above variables can be named anything like A, B, C etc. But, for clarity in code, it is better to write the property name for each variable'''
for line in open('engine_data.out','r'):
    if line.strip() and "#" not in line: #skip blank lines and non-data lines
        values = line.split() #split the line only once
        Crank.append(float(values[0])) #python counts from 0 and not 1
        Pressure.append(float(values[1]))
        Max_Pres.append(float(values[2]))
        Min_Pres.append(float(values[3]))
        Mean_Temp.append(float(values[4]))
        Max_Temp.append(float(values[5]))
        Min_Temp.append(float(values[6]))
        Volume.append(float(values[7]))
        Mass.append(float(values[8]))
        Density.append(float(values[9]))
        Integrated_HR.append(float(values[10]))
        HR_Rate.append(float(values[11]))
        C_p.append(float(values[12]))
        C_v.append(float(values[13]))
        Gamma.append(float(values[14]))
        Kin_Visc.append(float(values[15]))
        Dyn_Visc.append(float(values[16]))
plt.plot(Volume,Pressure)
plt.show()
Pressure-Volume Plot from the above code is given in figure 1.2 below:
The code given in Section 1.1.3 works, but it is in no way interactive or appealing. Often, one would like to enter a given Converge engine file, select two desired properties to be plotted with proper labels and titles and be able to save the figures. That would require some complexity in coding. I will explain that step by step in the next sections.
The code given in Section 1.1.3 is also not versatile at all. An interactive program has to deal with many situations that could otherwise crash it. For example, if a user enters an invalid file or a column number that doesn't exist, the program won't work and is likely to crash. I will discuss each situation along with the solutions. A perfectly working program would be one which will:
Above all, it is important that the program
The program can crash when following cases occur:
While Inputting file:
In short, the program must be crash-proof.
The criterion to determine whether a given file is a valid Converge engine file depends on some unique characteristic that must exist in all such files. I have selected the existence of the word 'CONVERGE' (after the # symbol) in the first line of the file as the criterion for a valid file. With the pseudo code below, it is possible to make a 'software' solution for reading, extracting and plotting data from similar files. However, there are some assumptions that need to be considered. Assuming all Converge Engine files contain:
Based on the above assumptions, the following pseudo code illustrates the program idea.
import libraries
start Main Loop:
    ask for the file name from the user
    if the name does not exist:
        prompt the user to enter an existing file
    if the name exists:
        try parsing the file
        if not able to read file:
            prompt the user to enter a valid file
        if able to read and parse:
            check validity by finding 'CONVERGE' in the first line
            if 'CONVERGE' not in first line:
                prompt the user to enter valid file
            if 'CONVERGE' in first line:
                extract labels and units
                extract data columns
                extract Converge release
                Begin Particular-File-Loop:
                    print the converge release
                    print the column names with numbers
                    prompt the user to enter a valid column for x axis
                    print the selected column
                    prompt the user to enter a valid column for y axis
                    print the selected column
                    plot the graph with labels and titles
                    try:
                        to create a folder in the current directory
                    if folder exists:
                        save the plot in the folder
                    else:
                        create the folder and then save the plot
                    ask the user whether to re-run with the same file or a different file
                    if rerun:
                        then stay in Particular-File-Loop
                    else:
                        go to Main Loop
    ask the user whether to enter a new converge file or to exit the program
    if new file:
        stay in the Main Loop
    else:
        exit the program
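The validity check in the pseudo code above can also be isolated in a small helper. The sketch below is a simplified assumption of how that might look: the function name is_converge_file is hypothetical, the two test files are made up and written to a temporary folder, and the check looks for 'CONVERGE' anywhere in the first line rather than in a specific position.

```python
import os
import tempfile

def is_converge_file(path):
    """Return True if the first line of the file contains the word CONVERGE."""
    try:
        with open(path, 'r') as handle:
            first_line = handle.readline()
    except (OSError, UnicodeDecodeError):
        return False  # missing, unreadable or non-text file
    return 'CONVERGE' in first_line

# Demonstration on made-up files in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, 'good.out')
    bad = os.path.join(folder, 'bad.out')
    with open(good, 'w') as handle:
        handle.write('# CONVERGE Release (made-up header)\n')
    with open(bad, 'w') as handle:
        handle.write('# some other file\n')
    results = (is_converge_file(good),
               is_converge_file(bad),
               is_converge_file(os.path.join(folder, 'missing.out')))

print(results)
```

Returning a boolean instead of printing inside the check keeps the decision about what to tell the user in one place, the main loop.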
The Python code for interactive file parsing and data visualisation is given below:
### Program to Parse a Converge Engine Data Thermodynamic File ###
#This code checks the existence and validity of a Converge Engine Data file,
#Then parses the valid file, extracts release version, labels, units and data points
#Stores data points in each column (a property) in separate arrays/lists
#The arrays/lists can then be used to plot graphs between two properties at a time
#The plots are saved in proper folders with proper names
#User will always be prompted to enter valid inputs
#1) Importing libraries/modules:
import matplotlib.pyplot as plt #for plotting
import numpy as np #for creating arrays
from time import sleep #for interactive time delays
from pathlib import Path #provides the Path class for file paths
import os #Files and directory module
exit = 'y' #defining variable and setting default condition
while exit == 'y': #Main-Program-Loop
    #2) Entering and Checking the existence of the File:
    # 2.1) Inputting the file name:
    file_input = input(' \n\n Enter the name of the Converge Engine Data File: ')
    cwd = os.getcwd() #gets the current working directory
    path = cwd + os.sep #os.sep is the path separator (a backslash on Windows)
    file_path = path + file_input #full path name
    file_path_find = Path(file_path) #Path object (pathlib.WindowsPath on Windows)
    # 2.2) Checking the existence of file in the directory:
    exist = file_path_find.is_file() #.is_file method returns boolean values
    # 2.2.1) Code for non existing file:
    if exist == False:
        #A backslash at the end of a line tells the interpreter that the line continues.
        #The continued line is not indented, because here the leading spaces
        #would become part of the message.
        print(("\n No such file '%s' exists in the directory: '%s' \n\
 Make sure you enter the file name (case sensitive) along with the extension correctly.\n\n" %(file_input,path)))
        sleep(2) #for an interactive pause
    # 2.2.2) Code for existing file:
    if exist == True: #here, 'if exist == True:' can be replaced by an 'else:' statement
        all_columns = [] #empty list to hold the split lines
        #3) Extracting and Checking Validity of file:
        try: #try whether the file is readable or not
            # 3.1) Reading file and Extracting content:
            with open(file_input,'r') as data_file: #read ('r') file line by line
                for line in data_file:
                    separate = line.split() #splits the line on whitespace
                    all_columns.append(separate) #store each split line
            # 3.2) Checking the validity of converge file:
            # 3.2.1) Invalidity Test
            if 'CONVERGE' not in all_columns[0][1]:
                print('\n\n You have entered an invalid or corrupt file. Please enter a valid CONVERGE Engine Data Output File. \n')
                sleep(2)
                # The key concept here is to identify a certain unique word or words that will only be
                # contained in a Converge release file at a particular location. The above criterion is,
                # of course, valid only if we assume that all Converge engine data files are similar in
                # format, that all such files have 17 columns, that the first line contains the word
                # CONVERGE and then the Release etc., and lastly that the data points start from the 6th line.
                # There are four cases that may arise:
                # 1) Entered file is valid and passes the validity test. This is desired.
                # 2) Entered file is invalid and doesn't pass the validity test. This is also desired.
                # 3) Entered file is invalid and passes the validity test.
                # 4) Entered file is not readable, like an image file.
                # In cases 3) and 4), the program again prompts to enter a valid file
            # 3.2.2) Validity Test:
            # if the file isn't invalid, it is, consequently, valid
            else:
                #print('\n You have entered a valid Converge Engine Data File. \n')
                #4) Extraction:
                # 4.1) Labels/Property names and Units
                label_columns = all_columns[2:4]
                #Property name and units are in the 3rd and 4th lines respectively
                del(label_columns[0][0],label_columns[1][0]) #deleting the pound symbols
                #Note: The names and units are contained in lines while the data points
                #for each property are in columns
                # 4.2) Data points (lines)
                data_columns = all_columns[5:] #data points start on the 6th line, i.e. index 5
                #5) Grouping:
                # 5.1) Defining (lists)
                #list variables can be named anything like A,B,C etc for simplicity
                #but naming them properly makes the code easier to understand
                Crank = []
                Pressure = []
                Max_Pres = []
                Min_Pres = []
                Mean_Temp = []
                Max_Temp = []
                Min_Temp = []
                Volume = []
                Mass = []
                Density = []
                Integrated_HR = []
                HR_Rate = []
                C_p = []
                C_v = []
                Gamma = []
                Kin_Visc = []
                Dyn_Visc = []
                # 5.2) Grouping data into respective columns:
                for i in range(len(data_columns)):
                    Crank.append(float(data_columns[i][0]))
                    Pressure.append(float(data_columns[i][1]))
                    Max_Pres.append(float(data_columns[i][2]))
                    Min_Pres.append(float(data_columns[i][3]))
                    Mean_Temp.append(float(data_columns[i][4]))
                    Max_Temp.append(float(data_columns[i][5]))
                    Min_Temp.append(float(data_columns[i][6]))
                    Volume.append(float(data_columns[i][7]))
                    Mass.append(float(data_columns[i][8]))
                    Density.append(float(data_columns[i][9]))
                    Integrated_HR.append(float(data_columns[i][10]))
                    HR_Rate.append(float(data_columns[i][11]))
                    C_p.append(float(data_columns[i][12]))
                    C_v.append(float(data_columns[i][13]))
                    Gamma.append(float(data_columns[i][14]))
                    Kin_Visc.append(float(data_columns[i][15]))
                    Dyn_Visc.append(float(data_columns[i][16]))
                #The key idea here is that for each iteration (for each line, denoted by
                #'i'), the loop appends one data point to each (property) list. For example,
                #data_columns[i][7] will always append the 8th entry from each line. This way,
                #only data points belonging to Volume will be stored in the Volume list.
                DATA = [Crank,Pressure,Max_Pres,Min_Pres,Mean_Temp,Max_Temp,Min_Temp,Volume,Mass,
                        Density,Integrated_HR,HR_Rate,C_p,C_v,Gamma,Kin_Visc,Dyn_Visc] #groups each file column into a list
                # 5.3) Converting Arrays into absolute units
                #The three columns belonging to pressure in the file are in MPa, while the others are
                #in absolute units. Also, if needed, the crank angles can be converted to radians
                #with the np.radians() command. The other way is to multiply the pressure values in
                #Section 5.2, e.g., Max_Pres.append(float(data_columns[i][2])*1e6)
                #for i in range(1,4):
                #    DATA[i] = 1e6*np.array(DATA[i])
                #By converting a list to a numpy array, elementwise operations can be done.
                #Mega = 10^6, which in Python is written as 1e6.
                #However, when plotting, the units for pressures will be in MPa already.
                #Only while calculating shall we need to multiply the pressure arrays by 10^6
                # 6) Prompting Columns to be plotted from the user:
                rerun = 'r' #defining variable and setting default condition
                while rerun == 'r': #Particular-File-Loop (let's call it that)
                    # 6.1) Creating converge file name, version and column values for display:
                    version = ' ' #defining variable
                    for i in range(1,5):
                        version = version + ' ' + all_columns[0][i]
                    print('\n\n\n',(' '*10 + '*'*10)*5)
                    print('\n\t\t\t Current File: %s' %(file_input+version))
                    #\n creates new line and \t creates a tab space
                    for i in range(len(label_columns[0])):
                        print('\t',label_columns[0][i],'=',i+1)
                    #Note: the variable 'i' used in the loops can be reused.
                    #However, if the value of i is to be used after the loop ends, say as a
                    #counter etc, then different loops should use different variables
                    # 6.2) Prompting for the first column, X-axis:
                    #If the user enters a float, char or string, the program can crash. Also,
                    #input() always returns a string. Using try-except this can be fixed
                    x = 'anything' #anything but an integer between 1 and 17
                    while x not in list(range(1,18)): #because there are only 17 columns
                        x = input('\n\n Please Enter the column number (X-axis): ')
                        try:
                            x = int(x)
                            if x not in list(range(1,18)):
                                print(' Invalid Number. Accepted Values (1-17)')
                        except ValueError: #raised when the input is not an integer
                            print(' Invalid Number. Accepted Values (1-17)')
                    #The above code tries to convert the input value into an integer and then
                    #tests whether the integer lies between 1 and 17. If yes, both the 'try'
                    #condition and the while loop are satisfied. Otherwise, it keeps displaying
                    #the error message and keeps prompting the user for a valid number
                    x = x - 1 #because python counts from 0 :)
                    print('\t', label_columns[0][x]) #prints the selected column
                    # 6.3) Prompting for the second column, Y-axis:
                    y = 'anything'
                    while y not in list(range(1,18)):
                        y = input('\n Please Enter the column number (Y-axis): ')
                        try:
                            y = int(y)
                            if y not in list(range(1,18)):
                                print(' Invalid Number. Accepted Values (1-17)')
                        except ValueError:
                            print(' Invalid Number. Accepted Values (1-17)')
                    y = y - 1
                    print('\t',label_columns[0][y])
                    # 7) Plotting:
                    # 7.1) Creating Title, axes labels
                    title = 'Plot of ' + label_columns[0][y] + ' Vs ' + label_columns[0][x]
                    x_lab = label_columns[0][x] + ' ' + label_columns[1][x]
                    y_lab = label_columns[0][y] + ' ' + label_columns[1][y]
                    # 7.2) Creating The Folders and filename for the plot.
                    folder = os.path.join(path, 'File Parsing', 'Plot Figures')
                    #makedirs makes folders and subfolders; mkdir, only a single folder.
                    #exist_ok=True means no error is raised if the folder already exists
                    os.makedirs(folder, exist_ok=True)
                    plot_filename = os.path.join(folder, title + '.jpeg')
                    #'png' or any image format can be used
                    # 7.3) Plotting the figure:
                    plt.figure()
                    plt.plot(DATA[x],DATA[y])
                    plt.xlabel(x_lab)
                    plt.ylabel(y_lab)
                    plt.title(title)
                    plt.savefig(plot_filename)
                    plt.show()
                    #Note: savefig() must be placed before show(), else, a blank image is saved
                    #8) Rerunning the program with the current converge file:
                    #If the user wants to plot again with the same file, the program should not ask
                    #again for the converge file. Thus, the user explicitly has to declare whether
                    #the current file is to be used again or another file is to be used
                    rerun = input('\n Press R to rerun or H to exit to home (R/H): ')
                    rerun = rerun.lower()
                    #lower() converts string to lower case. User can enter R,r,H or h
                    while rerun != 'r' and rerun != 'h':
                        #prompting the user to enter either R,r or H,h only
                        print('Invalid Input')
                        rerun = input('\n Press R to rerun or H to exit to home (R/H): ')
                        rerun = rerun.lower()
                    if rerun == 'h':
                        #entering h ends the 'Particular-File-Loop' and thus
                        #returns to the Main-Program-Loop
                        print('\n Exiting to Home... \n\n')
                        sleep(1)
        except Exception: #file could not be read or parsed
            print('\n You have entered an invalid or corrupt file. Please enter a valid CONVERGE Engine Data Output File. \n\n')
            sleep(2)
    #9) Rerunning the program with a new file:
    exit = input('\n\n Press Y to enter Converge file or N to exit (Y/N): ')
    exit = exit.lower()
    while exit != 'n' and exit != 'y':
        print(' Invalid Input')
        exit = input('\n\n Press Y to enter Converge file or N to exit (Y/N):')
        exit = exit.lower()
    #entering n ends the 'Main-Program-Loop' and thus terminates the program
    if exit == 'n':
        print('\n Exiting Program...')
        sleep(1)
    #The (Y/N) prompt will be displayed every time a non-existing file case arises,
    #an invalid converge file is input, or the user returns to home after plotting
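The column prompts in Section 6.2 of the program repeat the same validation for the X and Y axes. One possible way to avoid the repetition is a small helper, sketched below. The name ask_column and its read parameter are hypothetical; read defaults to input() and is only there so the helper can be tried with scripted answers instead of a keyboard.

```python
def ask_column(prompt, n_columns, read=input):
    """Keep asking until an integer between 1 and n_columns is entered.

    Returns the zero-based index of the chosen column.
    """
    while True:
        answer = read(prompt)
        try:
            number = int(answer)
        except ValueError:
            number = None  # not an integer at all, e.g. 'abc' or '2.5'
        if number is not None and 1 <= number <= n_columns:
            return number - 1  # Python counts from 0
        print(' Invalid Number. Accepted Values (1-%d)' % n_columns)

# Scripted answers stand in for a user typing at the keyboard
answers = iter(['abc', '2.5', '42', '8'])
index = ask_column('Column (X-axis): ', 17, read=lambda _prompt: next(answers))
print(index)
```

In the program, the two prompt loops could then be replaced by calls such as ask_column('Please Enter the column number (X-axis): ', 17).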
The program does not depend on the number of data lines in the file. It is also independent of the folder it is run from, as long as the Converge file is in the same folder (the current working directory).
The only limitation of this program is that a user will not be able to plot properties which have a column number higher than 17 (if there are any). This is because I use only 17 variables to store the columns, so any column beyond the 17th gets parsed but is not stored in a separate variable. Also, I have set the valid column numbers to 1-17 only.
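One way this limitation could be lifted is to build the column lists from the data itself instead of naming 17 variables. The sketch below is not part of the program above; it uses made-up rows with four columns, and the names rows, columns and n_columns are illustrative only.

```python
# Made-up data lines, already split into strings (4 columns here)
rows = [
    ['-360.0', '0.10', '5.7e-04', '1.0'],
    ['-359.9', '0.11', '5.6e-04', '2.0'],
    ['-359.8', '0.12', '5.5e-04', '3.0'],
]

# zip(*rows) regroups the values column by column;
# each column is then converted to a list of floats
columns = [[float(value) for value in column] for column in zip(*rows)]
n_columns = len(columns)

print(n_columns)
print(columns[1])  # all values of the second column
```

The valid column numbers would then be 1 to n_columns rather than a fixed 1 to 17. Note that zip() stops at the shortest row, so a column that is missing from some lines would be dropped silently rather than stored incomplete.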
I have stored the Python file named 'engine_data.py' in a folder named Data Analysis. Along with it, I have copied the valid Converge file 'engine_data.out', an image file named image_png, a PDF file named 'engine.pdf', a copy of the Converge file named 'non_converge.out' with 'CONVERGE' erased from the first line, and a copy of the Converge engine file named 'more_columns.out' with an additional incomplete 18th column. Figs. 2.1 - 2.3 show the various files in the directory.
I will run the program through the following steps:
The program should have created a folder named File Parsing containing a sub-folder named Plot Figures, which in turn holds the figures that I generated from the program.
I have made a video of the working of the program (below).
NOTE: The final PV plot was created from the more_columns.out file, which, even though it is an invalid file, nonetheless contains valid data for 17 columns.
For the Engine Performance, check out the second part of this project:
File Parsing and Data Analysis in Python Part II (Area Under Curve and Engine Performance).
***END***