{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "u1_tXzxs1dqx" }, "source": [ "[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)\n", "

Introduction to Python - Web Mining

\n", "

introml.analyticsdojo.com

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Web Mining" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4vknzA5K1dq2" }, "source": [ "## This tutorial is directly from the the BeautifulSoup documentation.\n", "[https://www.crummy.com/software/BeautifulSoup/bs4/doc/]\n", "\n", "### Before you begin\n", "If running locally you need to make sure that you have beautifulsoup4 installed. \n", "`conda install beautifulsoup4` \n", "\n", "It should already be installed on colab. \n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "2E_vSEPJ2nPf" }, "source": [ "## All html documents have structure. Here, we can see a basic html page. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "X9wgJCq91drI" }, "outputs": [], "source": [ "html_doc = \"\"\"\n", "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "\n", "

...

\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Tt20VJZp1drO" }, "source": [ "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "

...

\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "j8GsdPrC1drP" }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "soup = BeautifulSoup(html_doc, 'html.parser')\n", "\n", "print(soup.prettify())\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "WTbE3qYh1drZ" }, "source": [ "### A Retreived Beautiful Soup Object \n", "- Can be parsed via dot notation to travers down the hierarchy by *class name*, *tag name*, *tag type*, etc.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Z_fKqIc81drb", "scrolled": true }, "outputs": [], "source": [ "soup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "YivtiYOy1dri" }, "outputs": [], "source": [ "#Select the title class.\n", "soup.title\n", " \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "OHSL7iLZ1drn" }, "outputs": [], "source": [ "#Name of the tag.\n", "soup.title.name\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "pc4BUJ_21drt" }, "outputs": [], "source": [ "#String contence inside the tag\n", "soup.title.string\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "JJIaljuP1drw" }, "outputs": [], "source": [ "#Parent in hierarchy.\n", "soup.title.parent.name\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "9aHRkzDA1dr0" }, "outputs": [], "source": [ "#List the first p tag.\n", "soup.p\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Q8cZlFCc1dr4" }, "outputs": [], "source": [ "#List the class of the first p tag.\n", "soup.p['class']\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "CJG0TbO-1dr9" }, "outputs": [], "source": [ "#List the class of the first a tag.\n", "soup.a\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "6vbhMqmr1dsB" }, "outputs": [], "source": [ "#List all a tags.\n", "soup.find_all('a')\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "PaERPV3s1dsL" }, "outputs": [], "source": [ "\n", "soup.find(id=\"link3\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ZpAKma8p1dsO" }, "outputs": [], "source": [ "#The Robots.txt listing who is allowed.\n", "response = requests.get(\"https://en.wikipedia.org/robots.txt\")\n", "txt = response.text\n", "print(txt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "VfJ_AUlF1dsR" }, "outputs": [], "source": [ "response = requests.get(\"https://www.rpi.edu\")\n", "txt = response.text\n", "soup = BeautifulSoup(txt, 'html.parser')\n", "\n", "print(soup.prettify())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "4nZAjkNZ1dsX" }, "outputs": [], "source": [ "soup.find_all('a')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "AJmuDqLd1dsq" }, "outputs": [], "source": [ "# Experiment with selecting your own website. Selecting out a url. \n", "\n", "response = requests.get(\"enter url here\")\n", "txt = response.text\n", "soup = BeautifulSoup(txt, 'html.parser')\n", "\n", "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "V2YHWMZv1dsr" }, "source": [ "#For more info, see \n", "[https://github.com/stanfordjournalism/search-script-scrape](https://github.com/stanfordjournalism/search-script-scrape) " ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "02_intro_python_webmining.ipynb", "provenance": [ { "file_id": "https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/04-viz-api-scraper/02_intro_python_webmining.ipynb", "timestamp": 1548903782673 } ], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }