What is an array in Python?

Estimated reading time: 3 minutes

Python arrays are used to manage how data is analysed and have the data in a structured form for the data analyst or data scientist to use.

What is an Array?

An array has the following properties:

  1. The data in it is of the same data type.
  2. The data is stored as contiguous memory locations, meaning meaning the lowest value is at index 0 and the last value is at the highest index value.
  3. Arrays assign index values to each data point contained within it.
  4. You can append values to the array.
  5. You can delete values from the array, that is it is mutable.

What are the differences between arrays and lists?

For the most part arrays and lists are the same, but with one difference:

A list can store any data type you want e.g. strings, integers etc.

On the other hand, arrays can only store data that is of the same data type.

What are the different ways that I can create an array?

(A) Use Numpy

import numpy as np

a = np.array([1,2,3])

print(a)
print(type(a))

Output:
[1 2 3]
<class 'numpy.ndarray'>

(B) Use array

import array as test_array

a = test_array.array('i',[1,2,3])

print(a)

Output:
array('i', [1, 2, 3])
<class 'array.array'>

On the Python.org website, below are the list of values that can be populated into the above program, depending on what your need is:

When should I use arrays?

It really depends on the nature of your python program, but below are some examples that may help you make a decision:

(A) Many variables of the same type : There maybe a scenario where you have to create an array to store data that is of the same data type. For example you have a list of codes to look up against, which are all integers.

(B) Faster and more efficient : If speed is what you are looking for using arrays, will help improve performance of your computer program, using lists is much slower.

(C) Compactability and efficiency : If the nature of your program needs to store large amounts of data that needs to be accessed quickly , then this would be a good reason to use them.

(D) Ability to retrieve data quickly through indexing: As arrays have index values associated with them, they data can be easiy retrieved.

(E) Need to compute some mathematical values: Arrays are excellent for any numerical operations you need to complete, as the level of coding is minimal.

import array as test_array

a = test_array.array('i',[1,2,3])

mydivider = 2

mynewlist = [x / mydivider for x in a]
print(mynewlist)

result:
[0.5, 1.0, 1.5]

So in summary:

Speed , efficiency and ease of use are the main reasons to use an array.

We use arrays here in how to show percentage differences between files in python , why not go over and see it in action!

How to show percentage differences between files in Python

Estimated reading time: 5 minutes

In our previous post on how to compare CSV files for differences we showed how you could see the differences, but what if you wanted to see if a record was 100% matched or not?

Here we are going to use SequenceMatcher which is a python class that allows you to compare files and return if they are matched as a percentage.

Lets look at the code

Import statements and read in the CSV files. Here we do the usual of importing the libraries we need and read in the two CSV files we use:

Note we also set up the data frame settings here as well, so we can see the data frame properly. See further down.

import pandas as pd
import numpy as np
import math
from difflib import SequenceMatcher
pd.set_option('max_columns', None)
pd.set_option('display.width', None)


#Read in the CSV files
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The two CSV files look like this:

CSV1

CSV2

Next, we are going to create an array. Creating an array allows the comparison to be completed as it creates indexes, and the indexes in each array can be compared and then the percentage difference can be calculated.

#Create an array for both dataframes
array1 = np.array(df1)
array2 = np.array(df2)

Our next step is to transfer the arrays to a data frame, change all integer values to a string, then join both data frames into one.

In this instance the changing of values into a string allows those values to be iterated over, otherwise, you will get an error ” TypeError: ‘int’ object is not iterable

#Transfer the arrays to a dataframe
df_CSV_1 = pd.DataFrame(array1, columns=['No1','Film1','Year1','Length1'])
df_CSV_2 = pd.DataFrame(array2, columns=['No2','Film2','Year2','Length2'])

#Change all the values to a string, as numbers cannot be iterated over.
df_CSV_1['Year1'] = df_CSV_1['Year1'].astype('str')
df_CSV_2['Year2'] = df_CSV_2['Year2'].astype('str')
df_CSV_1['Length1'] = df_CSV_1['Length1'].astype('str')
df_CSV_2['Length2'] = df_CSV_2['Length2'].astype('str')

#join the dataframes
df = pd.concat([df_CSV_1,df_CSV_2], axis=1)

We are now moving to the main part of the program, which gives us the answers we need. Here we create a function that does the calculations for us:

#Create a function to calculate the differences and show as a ratio.
def create_ratio(df, columna, columnb):
    return SequenceMatcher(None,df[columna],df[columnb]).ratio()

Next we calculate the differences and format the output

#Here we use apply which will pull in the data that needs to be passed to the fuction above.
df['Film_comp'] = df.apply(create_ratio,args=('Film1','Film2'),axis=1)
df['Year_comp'] = df.apply(create_ratio,args=('Year1','Year2'),axis=1)
df['Length_comp'] = df.apply(create_ratio,args=('Length1','Length2'),axis=1)

#This creates the values we are looking for
df['Film_comp'] = round(df['Film_comp'].astype('float'),2)*100
df['Year_comp'] = round(df['Year_comp'].astype('float'),2)*100
df['Length_comp'] = round(df['Length_comp'].astype('float'),2)*100

#this removes the decimal point that is added as a result of using the datatype 'Float'
df['Film_comp'] = df['Film_comp'].astype('int')
df['Year_comp'] = df['Year_comp'].astype('int')
df['Length_comp'] = df['Length_comp'].astype('int')
#Print the output
print(df)

And the final output looks like this:

An explanation of the output

As can be seen, the last three columns are the percentages of the match obtained, 100 been an exact match.

For index value 1 there is Joker in the first file, but Jokers is in the second file.

The ratio is calculated as follows:

Joker is length 5, Jokers is length 6 = 11 characters in length

So the logic looks at the sequence, iterating through the sequence it can see that the first 10 characters are in the same order, but the 11th character is not, as a result, it calculates the following:

(10/11) * 100 = 90.90

Finally, the round function sets the value we are looking for to 91.

On the same line, we shall compare the year:

2019 and 2008 are a total of eight characters.

In the sequence, the first two of each match, and as they are also found, gives us four, and the ratio is as follows:

4/8 *100 = 50

For Index 20 we also compared the film name, in total there are 17 characters, but the program ignores what they call junk, so the space is not included, for that reason the ratio only calculates over sixteen characters.

In order to understand this better I have compiled the below:

Index 1Index 1
201920088JokerJokers11
2Correct Spot1JCorrect Spot1
0Correct Spot1oCorrect Spot1
1Incorrect spot0KCorrect Spot1
9Incorrect spot0eCorrect Spot1
rCorrect Spot1
2Found1JFound1
0Found1oFound1
1Found0KFound1
9Found0eFound1
rFound1
Total in Comparison8Total in Comparison11
Ratio0.50Ratio0.91
Index 20
The DirtThe Dirty16
TCorrect Spot1
HCorrect Spot1
ECorrect Spot1
Correct Spot1
TFound1
HFound1
EFound1
DCorrect Spot1
ICorrect Spot1
RCorrect Spot1
TCorrect Spot1
DFound1
IFound1
RFound1
TFound1
Total in Comparison16
Ratio0.94

How to Add Formulas to Excel using Python

Estimated reading time: 3 minutes

You may be working on automating some exports to Excel using Python to compare files or just pure and simple adding formulas to an Excel file before you open it up.

Here we explain adding formulas to your Excel output using Numpy or adding the calculations to specific cells in the output.

Adding formulas to specific cells

First of all, let’s look at the normal spreadsheet with some calculations, these have the formulas typed in. The ultimate objective is to have the Python code do this for us, one less step.

As can be seen, the cells have the formulas in them, but this would be a very time-consuming process if you had to do it multiple times, in multiple spreadsheets.

To get around this we can write the Python logic as follows:

  1. Create three lists and three dataframes as follows.
datasetA_list = np.array([1,2,3,4,5,6,7,8,9,10])

datasetB_list = np.array([9,8,65,43,3,21,3,2,1,7])

dataset_list = ('sum','average','median','standard deviation','count','correlation')


datasetA = pd.DataFrame(datasetA_list,columns=['ValueA'])
datasetB = pd.DataFrame(datasetB_list,columns=['ValueB'])
dataset_list_calcs = pd.DataFrame(dataset_list, columns=['Calcs'])

2. Next create a path to where you are going to store the data as follows:

path = 'output.xlsx'

3. In this next step create the workbook and location where the data will be stored. This will load the headings created in step 1 to a particular location on the spreadsheet.

workbook = pd.ExcelWriter(path, engine='openpyxl')
workbook.book = load_workbook(path)
workbook.sheets = dict((ws.title,ws) for ws in workbook.book.worksheets)

datasetA.to_excel(workbook,sheet_name="Sheet1", startrow=1,index=False, header=True,)
datasetB.to_excel(workbook,sheet_name="Sheet1", startrow=1, startcol=2,index=False, header=True)
dataset_list_calcs.to_excel(workbook,sheet_name="Sheet1", startrow=1, startcol=4,index=False, header=True)

4. Load the formulas into cells besides their relevant headings. This should line post these formulas beside the relevant heading created in step 1.

###Creating calculations for datasetA

sheet = workbook.sheets['Sheet1']
sheet['E2'] = 'CalcsA'
sheet['F3'] = '=SUM(A3:A12)'
sheet['F4'] = '=AVERAGE(A3:A12)'
sheet['F5'] = '=MEDIAN(A3:A12)'
sheet['F6'] = '=STDEV(A3:A12)'
sheet['F7'] = '=COUNT(A3:A12)'
sheet['F8'] = '=CORREL(A3:A12,C3:C12)'

###Creating calculations for datasetB

sheet = workbook.sheets['Sheet1']
sheet['H2'] = 'CalcsB'
sheet['H3'] = '=SUM(C3:C12)'
sheet['H4'] = '=AVERAGE(C3:C12)'
sheet['H5'] = '=MEDIAN(C3:C12)'
sheet['H6'] = '=STDEV(C3:C12)'
sheet['H7'] = '=COUNT(C3:C12)'
sheet['H8'] = '=CORREL(A3:A12,C3:C12)'

Use Numpy to create the calculations

a. Create the calculations that you will populate into the spreadsheet, using Numpy

a = np.sum(datasetA_list)
b = np.average(datasetA_list)
c = np.median(datasetA_list)
d = np.std(datasetA_list,ddof=1) ## Setting DDOF = 0 will give a differnt figure, this corrects to match the output.
f = np.count_nonzero(datasetA_list)
g = np.corrcoef(datasetA_list,datasetB_list)

b. Create the headings and assign them to particular cells

sheet['E14'] = 'Numpy Calculations'
sheet['E15'] = 'Sum'
sheet['E16'] = 'Average'
sheet['E17'] = 'Median'
sheet['E18'] = 'Standard Deviation'
sheet['E19'] = 'Count'
sheet['E20'] = 'Correlation'

c. Assign the variables in step a to a set of cells

sheet['F15'] = a
sheet['F16'] = b
sheet['F17'] = c
sheet['F18'] = d
sheet['F19'] = f
sheet['F20'] = str(g)

d. Save the workbook and close it – This step is important, and always include.

workbook.save()
workbook.close()

And the final output looks like…

how to add sine and cosine in python code

Explaining what this post is about
I was recently online and providing help to a fellow python coder, and a query came up about how you would rewrite some code that had included the sine function  and the cosine function

We were asked to see if we could translate this code:

L_1=1;
L_2=1.5;
q_0=0.5;
q_1=pi()/4;
q_2=pi()/6;
%Position of the end effetor
x= q_0+L_1*cos(q_1)-L_2*cos(pi()-q_1-q_2)
y=L_1*sin(q_1)+L_2*sin(pi()-q_1-q_2)

into its python equivalent.

Some background about the output

In order to get the desired result, there is a need to import a package to provide some of the mathematical analysis required, and was achieved through using Numpy statistical analysis

This package allows the following functions to be used in the logic:

  • Pi
  • Cosine
  • Sine

And the result is

As a result, the below shows the output of the above question converted to its Python equivalent:

import numpy as np

a= np.pi
print("PI value is ", a)

L_1=1
L_2=1.5
q_0=0.5
q_1=a/4
q_2=a/6

print("L_1 value is",L_1)
print("L_2 value is",L_2)
print("q_0 value is",q_0)
print("q_1 value is",q_1)
print("q_2 value is",q_2)


x= (q_0+(L_1*(np.cos(q_1)))-(L_2*(np.cos(a-q_1-q_2))))
y= (L_1*(np.sin(q_1))+(L_2*(np.sin(a-q_1-q_2))))

print("x value is " , x)
print("y value is " , y)

with its output showing:

PI value is  3.141592653589793
L_1 value is 1
L_2 value is 1.5
q_0 value is 0.5
q_1 value is 0.7853981633974483
q_2 value is 0.5235987755982988
0.7071067811865476
-0.25881904510252085
x value is  1.5953353488403288
y value is  2.15599552062015