Page 231 - Ai Book - 10
Opening a CSV File Using Python
As you have read earlier, Pandas is a popular Python package for data science because it offers powerful,
expressive and flexible data structures that make data manipulation and analysis easy, among many other things.
To read a CSV file in a Jupyter notebook, first import Pandas with the following command:
import pandas as pd
Next, create a variable to hold the contents of your CSV file. Here, the variable is named dataframe.
To load a CSV file into dataframe, type the following command after importing Pandas:
dataframe = pd.read_csv(r"F:\Softwares\Setups\DataScience\Anaconda\pkgs\pandas-1.0.5-py38h47e9c7a_0\Lib\site-packages\pandas\CHARTS.csv")
The contents of the CSV file are now stored in the variable dataframe. You can explore the data using the
.head() method, which returns the first five rows by default. To do this, type the following command:
print(dataframe.head())
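In the Jupyter notebook, the code looks as follows:
[Screenshot: the code in a Jupyter notebook cell]
The output of the code looks as follows:
[Screenshot: the output of print(dataframe.head())]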
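The file path used above exists only on one particular computer, so the following is a minimal sketch of the same steps that runs anywhere. It assumes a small made-up table (the names and values are hypothetical, not taken from CHARTS.csv) and passes it to pd.read_csv through io.StringIO, which behaves like an open file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk (not from CHARTS.csv).
csv_text = """Name,Class,RollNo
Asha,10,1
Ravi,10,2
Meena,9,3
John,9,4
Sara,10,5
Kabir,9,6
"""

# pd.read_csv accepts a file path or any file-like object.
dataframe = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; head(n) returns the first n rows.
first_rows = dataframe.head()
print(first_rows)

# shape gives (number of rows, number of columns) for the whole table.
print(dataframe.shape)
```

With a real file, you would replace io.StringIO(csv_text) with the path to the file, written as a raw string (r"...") on Windows so that the backslashes are read literally.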
Issues Related to Data Collection
You know that data visualisation plays an important role when training an AI model. If a model is trained on
biased data, it will produce biased predictions. Thus, you should know about the kinds of issues that
you can face while collecting data from various sources.
◆ Inaccurate Data: Erroneous data in any form is useless. Data can be erroneous in two ways:
• Wrong Values: Sometimes, values in the dataset are not correct. For example, suppose you draw a table with
three columns, i.e. Name, Class and Roll No, and you enter the students' names in the
Class column. The table is then incorrect, as it contains wrong values.
• Invalid or Null Values: In general, invalid or null values are those values whose format does not match
the attribute's data type and which cannot be converted to that data type. Data sets are not perfect. Sometimes
they end up with corrupt or missing values. You will often find NaN (Not a Number) values in a dataset. These
are null values, which hold no meaning and cannot be processed. In such cases, you will usually delete
these values.
◆ Outliers: As the name implies, an outlier is a data point that lies outside the overall pattern of a distribution.
This definition was given by Moore and McCabe in 1999. Data points that lie outside
the pattern can be problematic for many statistical analyses because they can cause tests to either miss
significant findings or distort real results. In a small sample, you can find an outlier by examining the
numbers in the given sample of the population.
It is very difficult to identify outliers by inspecting the numbers when working with huge amounts of data.
Thus, data visualisation is an important approach for interpreting the collected data and identifying
patterns in it.
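The two issues above, null values and outliers, can be sketched in Pandas as follows. The table is hypothetical, and the rule used to flag outliers (values more than 1.5 times the interquartile range beyond the quartiles) is one common convention chosen here for illustration. The text itself does not prescribe a specific rule.

```python
import numpy as np
import pandas as pd

# Hypothetical marks table: one missing value (NaN) and one suspiciously large value.
marks = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "John", "Sara", "Kabir"],
    "Marks": [72.0, 68.0, np.nan, 75.0, 70.0, 400.0],
})

# Count the null values in each column, then drop the rows that contain them.
null_counts = marks.isna().sum()
clean = marks.dropna()

# IQR rule (an assumption, not from the text): flag values more than
# 1.5 * IQR below the first quartile or above the third quartile.
q1 = clean["Marks"].quantile(0.25)
q3 = clean["Marks"].quantile(0.75)
iqr = q3 - q1
is_outlier = (clean["Marks"] < q1 - 1.5 * iqr) | (clean["Marks"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

print(null_counts)
print(outliers)
```

Here the row with the missing mark is dropped first, and the value 400.0 is then flagged because it lies far above the other marks. Whether a flagged value is a genuine error or a real but unusual observation still has to be judged by looking at the data.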