Big Data Python 1: Working with Tables

No matter what kind of data you want to access, the most important thing must be saving the data. Therefore, I’d like to know how to handle tables and save them to the most multi-use format, which is CSV!

4 min readAug 9, 2020

1. Purpose

a) handle different tables(e.g. CSV, excel…etc.)

b) use it to be a representative data set for machine learning(every row stands for an object, every column represents a characteristic of the object)

c) integrate conveniently with common graphing library to plot beautiful graphs

The tables could be handled by Pandas. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

2. Goal

I use Kaggle’s Ted data set to practice pandas basics. Kaggle is the world’s largest data science community which often is full of data sets and AI competitions. I use my go-to dataset, which is about the TED talks to practice Pandas: https://www.kaggle.com/rounakbanik/ted-talks

⚠️Sign in is required before downloading.

⚠️Only need to get the ted_main.csv in the page to analyze

3. Steps

3.1 read the table

Pandas only has 2 data types: DataFrame and Series
1. multiple rows * multiple cols -> DataFrame
2. one col * one or multiple rows -> Series

[Code]

3.2 size of DataFrame

4.2.1. Because we have 2 dimensions now, we cannot use len anymore. Instead, we should use .shape to get the 2 dimensions
4.2.2. .shape is a tuple, so the first element [0] is your row numbers, the second

[code]

[Output]

3.3 access the columns of table

4.3.1 single column

[code]

3.3.2 multiple columns

We have to assemble the tags we want into a list and then send it to the columns [], so the two [] represent different meanings. The outside [] is DataFrame’s columns; the inside [] is the list of tags.

[code]

3.4 access the rows of table

4.4.1 When accessing the rows, we use .loc (less used. We only use it when we create our own tags of rows.) and iloc (more used. We use it as Pandas already helped us create the tags of rows from 0.)

3.4.2 We will get data similar to a list when using .iloc. After that, we can use it like we use a list.

3.4.3 .iloc -> [“1st”, “2nd”,”3rd”,…,”last”]

[code]

3.5 access rows and columns

[code]

3.6 filter the rows

4.6.1 Data filtering is to keep the rows which did meet our expectation and drop the parts which didn’t.

4.6.2 The concept of filtering is we make a boolean list as big as our data quantity, and we keep the data matched True and drop the data matched False.

[code]

3.7 use to_csv to save the table

3.7.1 necessary argument: the file place

3.7.2 optional argument: encoding

3.7.3 optional argument: index(default is True)

[code]

3.8 delete the columns

[code]

3.9 convert types

Pandas.apply() is the most important function which helps you convert all the cells in a column. We can see the time in the dataset of TED strange as below. Why is that?

This is called UNIX time or POSIX time. It is the number of seconds that have elapsed since the Unix epoch; the Unix epoch is 00:00:00 UTC on 1 January 1970. We can easily convert it to our familiar time format by datatime of the inbuilt modules.

[code]