Data scraping in a nutshell

In the year 2022, it's pretty easy to live out my dreams of becoming a superhero. I'm not embarrassed to say it is one of my top missions in life.

I say it's easy because of digital technology. More specifically, because of my ability to code, I can do things that other people cannot, including scrape data from the web.

To summarize, web scraping means using automated code to pull structured data from a website and store it for later use. I'm a data journalist, so it makes sense for me to choose web scraping as my superhero 'thing'.

I've scraped data in the past, but not for any really serious project, so I'm pretty much a beginner.

I decided to make a sort of syllabus, and get an understanding of what I'll need to learn.


How to scrape data (the easy way)

The first step is learning how to scrape data. For most of my projects, I'm used to getting my data handed to me on a nice little platter, whether from a reporter or an institution.

I'm not sure where to start with this.

The simplest step would be to look for tools, right? Why reinvent the wheel?

After a five-minute Google search (Stack Overflow, Reddit), here are some web scraping services that stood out (July 2022 edition).

There are undoubtedly a lot more options that might fit your needs, and I might review some of these services in a later blog post. For now, these should be enough to get me going.


How do you scrape data? (the hard way)

Now, I can do things the easy way, but I do know how to code, so I think I can make my web scrapers a little more sophisticated if I put in the effort.

In my own words, scraping data is basically writing a few lines of code that go to a website, find some data in the web page, and save that data somewhere. Towards Data Science's article is probably the best tutorial for this, but in summary:

  • use Python
  • go to a website using Python
  • download the page using Python
  • write some code to make the stuff you download 1) useful and 2) readable
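
To make those steps concrete, here is a minimal sketch using only Python's standard library. It is not taken from the tutorial mentioned above. The function names, the User-Agent string, and the assumption that the data I want lives in an HTML table are all made up for illustration, so treat it as a starting point and not as the way to do it.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # the row being built, or None outside a <tr>
        self._cell = None   # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def download(url):
    """Steps 2 and 3: go to a page and download its HTML as text."""
    # The User-Agent value is a placeholder I invented for this sketch.
    req = Request(url, headers={"User-Agent": "my-scraper-sketch/0.1"})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_rows(html):
    """Step 4, 'useful': turn raw HTML into a list of table rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Step 4, 'readable': format the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

In practice I would chain the three functions, something like rows_to_csv(extract_rows(download(url))), and write the result to a file. Real pages are messier than this: many build their tables with JavaScript, which this sketch cannot see, and it's worth checking a site's terms and robots.txt before pointing a scraper at it.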

Again, I'll probably write a later blog post going through this in more detail, but until then, read what's already out there.


Where would my scraping be useful?

In the words of my friend Yeli, there is no need for creative saviors. The fact is there are people who have been doing this longer than I have, so I should probably just follow their lead.

There are many situations in which data scraping would be really helpful, which is the idea behind Scrapism.

I'm still learning what all those situations are. I don't necessarily need to decide which one I'll join just yet; for now, I'll just try to list out the basic categories.

  • preserving stuff (Web Archive project)
  • tracking real-time data (Covid Tracking, etc)
  • investigating some website or entity
  • machine learning training

There are probably more, but whatever.