Page 231 - Ai Book - 10

Opening a CSV File Using Python

        As you have read earlier, Pandas is a popular Python package for data science because it offers powerful,
        expressive and flexible data structures that make data manipulation and analysis easy, among many other things.
        To read a CSV file in a Jupyter notebook, first import Pandas with the following command:
              import pandas as pd
        After that, create a new variable in which the contents of your CSV file will be stored. Here, we create a
        variable named dataframe. To load a CSV file into dataframe, type the following command after importing pandas:
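              # The r prefix makes this a raw string, so the backslashes in the Windows path are kept as they are
              dataframe = pd.read_csv(r"F:\Softwares\Setups\DataScience\Anaconda\pkgs\pandas-1.0.5-py38h47e9c7a_0\Lib\site-packages\pandas\CHARTS.csv")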
        The CSV file is now loaded into the variable dataframe. You can explore its data using the .head() method,
        which returns the first five rows by default. To do this, type the following command:

              print(dataframe.head())
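        The code looks as follows in the notebook:

        [Figure: screenshot of the code in a Jupyter notebook cell]

        The output of the code looks as follows:

        [Figure: screenshot of the output of dataframe.head()]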

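        The steps above can also be tried without a file on disk. The following is a minimal sketch, assuming a
        small made-up table: the CSV text, the column names (Name, Class, RollNo) and the values are hypothetical
        and only stand in for a real file such as CHARTS.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk;
# the column names and values are made up for illustration.
csv_text = """Name,Class,RollNo
Asha,10,1
Ravi,10,2
Meena,10,3
Kabir,10,4
Tara,10,5
Zoya,10,6
"""

# pd.read_csv accepts a file path or any file-like object.
dataframe = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; head(n) returns the first n rows.
first_rows = dataframe.head()
print(first_rows)

# shape is a (number of rows, number of columns) tuple.
print(dataframe.shape)
```

        To read your own file, pass its path to pd.read_csv() in place of the io.StringIO object.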
        Issues Related to Data Collection
        You know that data visualisation is an important factor while training an AI model. If a model is trained with
        biased data, it will produce biased predictions. Thus, you should know about the various kinds of issues that
        you can face while collecting data from various sources.

         ◆   Inaccurate Data: Erroneous data in any form is useless. There are two conditions in which data can be
             erroneous. These are:

             •  Wrong Values: Sometimes, values in the dataset are not correct. For example, if you draw a table with
                three columns, i.e. Name, Class and Roll No, and you fill in the student names in the Class column,
                then the table is incorrect because it contains wrong values.

             •  Invalid or Null Values: In general, invalid or null values are values whose format does not match
                the data type of the attribute and which cannot be converted to that data type. Datasets are not
                perfect. Sometimes they end up with corrupt or missing values. You will often find NaN (Not a Number)
                values in a dataset. These are null values that hold no meaning and cannot be processed. In such
                cases, you will delete these values.

         ◆   Outliers: As the name implies, an outlier is a data point that lies outside the overall pattern of a
             distribution. This definition was given by Moore and McCabe in 1999. Data points that lie outside the
             pattern can be problematic for many statistical analyses because they can cause tests to either miss
             significant findings or distort real results. You can find an outlier by examining the numbers in the
             given sample of the population.
             It is very difficult to identify outliers by inspection when working with huge amounts of data. Thus, data
             visualisation is an important approach for interpreting the collected data and identifying patterns in
             it.
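
        Outliers can also be flagged with a simple numeric check. The sketch below uses the 1.5 × IQR rule, a common
        rule of thumb that box plots also rely on. It is not a method taken from this chapter, and the marks in the
        example are made-up values chosen so that one of them clearly lies outside the overall pattern.

```python
import pandas as pd

# Hypothetical marks; 250 is deliberately far away from the other values.
marks = pd.Series([45, 52, 48, 50, 47, 53, 49, 250])

# The interquartile range (IQR) is the spread of the middle half of the data.
q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: values more than 1.5 * IQR beyond the quartiles are flagged.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = marks[(marks < lower_limit) | (marks > upper_limit)]
print(outliers)
```

        The factor 1.5 is only a convention. A flagged value should still be checked by hand before it is removed,
        because it may be a genuine observation rather than an error.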

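        For the invalid or null values described earlier in this section, Pandas offers ways to find and delete
        them. The following is a minimal sketch, assuming a small made-up student table. The names students,
        raw_roll and the values in them are hypothetical.

```python
import pandas as pd

# Hypothetical student table; None marks a missing entry.
students = pd.DataFrame({
    "Name": ["Asha", "Ravi", None, "Kabir"],
    "Class": [10, 10, 10, None],
    "RollNo": [1, 2, 3, 4],
})

# isna() flags the missing cells; sum() counts them for each column.
missing_per_column = students.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame without the rows that contain a missing value.
cleaned = students.dropna()
print(cleaned)

# An invalid value: "three" cannot be converted to a number.
# errors="coerce" turns such values into NaN instead of raising an error.
raw_roll = pd.Series(["1", "2", "three", "4"])
roll_numbers = pd.to_numeric(raw_roll, errors="coerce")
print(roll_numbers)
```

        Deleting rows is the approach described in the text. Note that dropna() does not change the original
        DataFrame unless its result is assigned back to a variable.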
