This page (currently under construction) hosts a collection of tools that I have found useful as a student/RA, mostly in the realm of data scraping. The hope is that some RA will stumble across this and save herself 200 hours of copying data from PDFs.

Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML; pair it with something like requests to fetch pages and you can scrape most websites. Pretty neat.
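
A minimal sketch of the idea (assuming you've pip-installed requests and beautifulsoup4; the URL is just a stand-in):

import requests
from bs4 import BeautifulSoup

# fetch a page and parse the HTML
page = requests.get("https://example.com")
soup = BeautifulSoup(page.text, "html.parser")

# print the target of every link on the page
for link in soup.find_all("a"):
    print(link.get("href"))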

PDF Miner

Not all data is in HTML. Pros: you can pull data out of PDFs! Cons: the files must be local. Here is the super-boring-to-read documentation… a friendlier tutorial is coming soon.
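
In the meantime, a minimal sketch, assuming you're using the pdfminer.six fork of the library (the filename is a placeholder):

from pdfminer.high_level import extract_text

# dump the raw text of a local PDF; cleaning it up is your problem
text = extract_text("my_report.pdf")
print(text)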

Amazon EC2 Instances

Amazon offers super cheap (+ 1 year free!) cloud computing services. Why would you want to use this? Because you a) don’t have access to your own server, and b) don’t have a laptop you can devote to scraping 24/7.

If you want to scrape weather data for Ulaanbaatar every 60 seconds day in and day out, you should set up an Amazon EC2 instance. This will act as your own personal server.
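
For concreteness, here's a rough sketch of what such a scraper might look like; it's the kind of script that my_scraper.py stands in for below (the URL is a made-up placeholder, not a real weather API):

import time
import requests

while True:
    # hypothetical endpoint; swap in whatever weather service you actually use
    response = requests.get("https://example.com/weather/ulaanbaatar")
    with open("weather_log.txt", "a") as f:
        f.write(response.text + "\n")
    time.sleep(60)  # wait 60 seconds, then do it again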

Once you are set up (steps 1-3, here), connect to your Amazon EC2 instance using the Terminal app. There are a few steps to this. Here’s what I had to do:

First, “use the chmod command to make sure that your private key file isn’t publicly viewable.” Whatever that means. (Concretely: chmod 600 makes the key file readable and writable by you alone; ssh will refuse to use a key that other users on your machine can read.)

chmod 600 path/to/key/keyname.pem

Then connect!

ssh -v -i path/to/key/keyname.pem ec2-user@your_ec2_name

your_ec2_name is probably something like ec2-11-111-111-11.us-east-2.compute.amazonaws.com, except, you know, some of the 1’s aren’t 1’s. Hopefully you’ve now connected to your instance. It’s just like your terminal, but running in Ohio or something.

To copy files from your machine to your EC2 instance, run this chunk (mind the :~/ !):

scp -i path/to/key/keyname.pem -r path/to/folder/with/files ec2-user@your_ec2_name.compute.amazonaws.com:~/

Now all of your scraping stuff will be on your EC2 instance. You will probably still need to install your dependencies; virtual environments (see below) are super helpful for this.
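
For example, if you keep your project's dependencies in a requirements.txt file (an assumption; substitute whatever you actually use), installing them on the instance is one line:

pip install -r requirements.txt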

Next, use the screen command to run your scraper:

screen

# this opens up a virtual terminal, or 'screen'

python my_scraper.py

# maybe some output that indicates that your scraper has started

Ctrl-a d

# so hold Control-a, then press d. This will detach the screen, but keep your program running

After you’ve Ctrl-a d-ed out of your screen, you will be back at the EC2 command line. Simply type “exit” to disconnect. Using screen in this way will keep your code running even after you disconnect from your EC2 instance, even after you shut down your computer, even after you throw your computer off a bridge, etc., etc. Nervous about your scraper? You can check on it by re-connecting to your EC2 instance and typing screen -r. This “re-attaches” your screen, so you can check on its progress.
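
So checking in on a long-running job looks something like this (same key and hostname as before):

ssh -i path/to/key/keyname.pem ec2-user@your_ec2_name
screen -r

# poke around, then Ctrl-a d to detach again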

How do you get your super valuable data from your instance? Like this:


scp -i path/to/key/keyname.pem -r ec2-user@your-instance-name.compute.amazonaws.com:~/your/file/ .

Unfortunately, the documentation for EC2 instances is, well, poor. Stack Overflow is your friend.

Virtual Environments

Virtual environments are good practice when using Python. This is because your projects may each require a specific version of a Python library (like Beautiful Soup!), and when you download libraries using, say, pip, they are just plopped into the global site-packages directory. A virtual environment creates an isolated set of directories for each project, which prevents these version conflicts. So in general, if you have a project, you should create a virtual environment specific to that project. More on virtual environments here and here.

pip install virtualenv
cd my_project_folder
virtualenv venv_my_project

# activate it (on Mac/Linux), then pip install your dependencies as usual
source venv_my_project/bin/activate