Database connection basics in Python: SQLAlchemy & pandas

Connecting to a database through Python

SQLite

SQLAlchemy

pandas

Basics

Create a database engine

  • SQLite database
    • simple and fast
  • SQLAlchemy
    • Works with many Relational Database Management Systems
      • databasetype:///database_name.extension
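Connection strings for other engines follow the same URL pattern. A few illustrative examples (the hosts, ports and credentials below are placeholders, not real servers; only the sqlite URL points at a local file):

```python
# SQLAlchemy connection URLs: dialect://user:password@host:port/dbname
# (placeholder credentials for illustration only)
sqlite_url = 'sqlite:///test.sqlite'
postgres_url = 'postgresql://user:password@localhost:5432/mydb'
mysql_url = 'mysql://user:password@localhost:3306/mydb'

# The dialect is everything before '://'
for u in (sqlite_url, postgres_url, mysql_url):
    print(u.split(':', 1)[0])
```

SQLite needs no host or credentials, which is why its URL is just the dialect plus a file path.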
In [3]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///test.sqlite')
In [5]:
table_names = engine.table_names()
table_names
Out[5]:
[]
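The list is empty because the freshly created file has no tables yet. A minimal sketch with the stdlib sqlite3 module (so nothing beyond the standard library is assumed) shows a table appearing once it is created; SQLite tracks its tables in the sqlite_master catalog:

```python
import sqlite3

# An in-memory database starts with no tables, just like the empty test.sqlite
con = sqlite3.connect(':memory:')
cur = con.cursor()

tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # []

# After creating a table, it shows up in the catalog
cur.execute('CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)')
tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # [('Album',)]
con.close()
```

Note that engine.table_names() was deprecated in SQLAlchemy 1.4; the modern equivalent is sqlalchemy.inspect(engine).get_table_names().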

example: Chinook Database

  • http://chinookdatabase.codeplex.com/
    • the Chinook database contains information about a semi-fictional digital media store: the media data is real, while the customer, employee and sales data have been manually created.
In [8]:
# Import necessary module
from sqlalchemy import create_engine

# Create engine: engine
engine = create_engine('sqlite:///Chinook_Sqlite.sqlite')

# Save the table names to a list: table_names
table_names = engine.table_names()

# Print the table names to the shell
print(table_names)
[u'Album', u'Artist', u'Customer', u'Employee', u'Genre', u'Invoice', u'InvoiceLine', u'MediaType', u'Playlist', u'PlaylistTrack', u'Track']

The Hello World of SQL Queries!

  • Use SELECT to retrieve all columns of the table Album in the Chinook database. Recall that SELECT * selects all columns.
In [12]:
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook_Sqlite.sqlite')

# Open engine connection
con = engine.connect()

# Perform query: rs
rs = con.execute('select * from Album')

# Save results of the query to list: ll
ll = rs.fetchall()


# Close connection
con.close()

# Print head of list
print len(ll),type(ll)
ll[:3]
347 <type 'list'>
Out[12]:
[(1, u'For Those About To Rock We Salute You', 1),
 (2, u'Balls to the Wall', 2),
 (3, u'Restless and Wild', 2)]
In [13]:
# Save results of the query to DataFrame
df=pd.DataFrame(ll)
df.head(3)
Out[13]:
0 1 2
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2

Customizing the Hello World of SQL Queries

  • Select specified columns from a table;
  • Select a specified number of rows;
  • Import column names from the database table.
In [28]:
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    
    # Using the rs object, set the DataFrame's column names to the corresponding names of the table columns.
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Print the head of the DataFrame df
df
3
Out[28]:
LastName Title
0 Adams General Manager
1 Edwards Sales Manager
2 Peacock Sales Support Agent
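The fetchmany-plus-column-names pattern can be sketched with the stdlib sqlite3 module, where cursor.description plays the role of rs.keys(). The in-memory table below is an illustrative stand-in for Chinook's Employee table, not the real data:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Employee (LastName TEXT, Title TEXT)')
cur.executemany('INSERT INTO Employee VALUES (?, ?)',
                [('Adams', 'General Manager'),
                 ('Edwards', 'Sales Manager'),
                 ('Peacock', 'Sales Support Agent'),
                 ('Park', 'Sales Support Agent')])

cur.execute('SELECT LastName, Title FROM Employee')
rows = cur.fetchmany(3)                  # only the first 3 rows
cols = [d[0] for d in cur.description]   # column names, like rs.keys()
print(cols)       # ['LastName', 'Title']
print(len(rows))  # 3
con.close()
```

fetchmany(size) is useful when a full fetchall() would pull more rows than you need into memory.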

Filtering your database records using SQL's WHERE

In [34]:
# Create engine: engine
engine = create_engine('sqlite:///Chinook_Sqlite.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('select * from Employee where EmployeeID >= 6')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df
print df.shape
df.head(1)
(3, 15)
Out[34]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 6 Mitchell Michael IT Manager 1 1973-07-01 00:00:00 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com

Ordering your SQL records with ORDER BY

In [37]:
# Create engine: engine
engine = create_engine('sqlite:///Chinook_Sqlite.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())

    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame
print df.shape
df.head(3)
(8, 15)
Out[37]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com

Querying relational databases directly with pandas

  • df = pd.read_sql_query("queries", engine)
In [40]:
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook_Sqlite.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
print df.shape
(347, 3)
  • compare
In [41]:
# Open engine in context manager
# Perform query and save results to DataFrame: df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()

# Confirm that both methods yield the same result: does df = df1 ?
print(df.equals(df1))
True

Pandas for more complex querying

  • You'll build a DataFrame that contains the rows of the Employee table for which the EmployeeId is greater than or equal to 6 and you'll order these entries by BirthDate.
In [44]:
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook_Sqlite.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query(
    "SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate",
    engine
)

# Print head of DataFrame
print df.shape
df
(3, 15)
Out[44]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 8 Callahan Laura IT Staff 6 1968-01-09 00:00:00 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
1 7 King Robert IT Staff 6 1970-05-29 00:00:00 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
2 6 Mitchell Michael IT Manager 1 1973-07-01 00:00:00 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com

INNER JOIN

In [47]:
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print df.shape
df.head(3)
(347, 2)
Out[47]:
Title Name
0 For Those About To Rock We Salute You AC/DC
1 Balls to the Wall Accept
2 Restless and Wild Accept
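The join can be seen in miniature with two tiny in-memory tables (stdlib sqlite3; the rows below are a hypothetical three-album subset, not the real Chinook tables). Each Album row is matched to the Artist row that shares its ArtistId:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)')
cur.execute('CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)')
cur.executemany('INSERT INTO Artist VALUES (?, ?)', [(1, 'AC/DC'), (2, 'Accept')])
cur.executemany('INSERT INTO Album VALUES (?, ?, ?)',
                [(1, 'For Those About To Rock We Salute You', 1),
                 (2, 'Balls to the Wall', 2),
                 (3, 'Restless and Wild', 2)])

# INNER JOIN keeps only Album/Artist pairs with a matching ArtistId
rows = cur.execute(
    'SELECT Title, Name FROM Album '
    'INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId').fetchall()
for title, name in rows:
    print(title, '-', name)
con.close()
```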

pandas filter

In [50]:
# Execute query and store records in DataFrame: df
df = pd.read_sql_query(
    "SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000",
    engine
)

# Print head of DataFrame
print df.shape
df.head(3)
(4073, 11)
Out[50]:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
0 1 3390 3390 One and the Same 271 2 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 2 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 2 23 None 218916 3577821 0.99
In [ ]:
 

Simple URL-based APIs tutorial

Simple URL-based APIs

 

APIs

Application Programming Interface

  • Protocols and routines
    • Building and interacting with software applications
  • fun: OMDB API
    • the Open Movie Database

JSONs

JavaScript Object Notation

  • Real-time server-to-browser communication
  • Douglas Crockford
  • Human readable
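A JSON document is just nested key-value pairs and arrays, so Python's stdlib json module converts between JSON strings and dictionaries. A quick round-trip sketch (the movie record below is a made-up example):

```python
import json

# Serialize a dict to a JSON string, then parse it back
movie = {'Title': 'Split', 'Year': '2016', 'Genre': ['Horror', 'Thriller']}
text = json.dumps(movie)
print(type(text))        # <class 'str'>

parsed = json.loads(text)
print(parsed['Title'])   # Split
print(parsed == movie)   # True -- the round-trip is lossless
```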

What is an API?

  • Set of protocols and routines
  • Bunch of code
    • Allows two software programs to communicate with each other

Connecting to an API in Python

fun: OMDB API

In [7]:
import requests
url = 'http://www.omdbapi.com/?t=Split'

r = requests.get(url)
json_data = r.json()

for key, value in json_data.items():
    print(key + ':', value)
    
with open("a_movie.json", 'w+') as save:
    save.write(r.text)
(u'Plot:', u'After three girls are kidnapped by a man with 24 distinct personalities they must find some of the different personalities that can help them while running away and staying alive from the others.')
(u'Rated:', u'PG-13')
(u'Response:', u'True')
(u'Language:', u'English')
(u'Title:', u'Split')
(u'Country:', u'USA')
(u'Writer:', u'M. Night Shyamalan')
(u'Metascore:', u'75')
(u'imdbRating:', u'7.6')
(u'Director:', u'M. Night Shyamalan')
(u'Released:', u'20 Jan 2017')
(u'Actors:', u'Anya Taylor-Joy, James McAvoy, Haley Lu Richardson, Kim Director')
(u'Year:', u'2016')
(u'Genre:', u'Horror, Thriller')
(u'Awards:', u'1 nomination.')
(u'Runtime:', u'117 min')
(u'Type:', u'movie')
(u'Poster:', u'https://images-na.ssl-images-amazon.com/images/M/MV5BOWFiNjViN2UtZjIwYS00ZmNhLWIzMTYtYTRiMTczOGMzZGE0L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMjY5ODI4NDk@._V1_SX300.jpg')
(u'imdbVotes:', u'864')
(u'imdbID:', u'tt4972582')

Loading and exploring a JSON

  • with open(file_path) as file:
In [9]:
import json
# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
(u'Plot: ', u'After three girls are kidnapped by a man with 24 distinct personalities they must find some of the different personalities that can help them while running away and staying alive from the others.')
(u'Rated: ', u'PG-13')
(u'Response: ', u'True')
(u'Language: ', u'English')
(u'Title: ', u'Split')
(u'Country: ', u'USA')
(u'Writer: ', u'M. Night Shyamalan')
(u'Metascore: ', u'75')
(u'imdbRating: ', u'7.6')
(u'Director: ', u'M. Night Shyamalan')
(u'Released: ', u'20 Jan 2017')
(u'Actors: ', u'Anya Taylor-Joy, James McAvoy, Haley Lu Richardson, Kim Director')
(u'Year: ', u'2016')
(u'Genre: ', u'Horror, Thriller')
(u'Awards: ', u'1 nomination.')
(u'Runtime: ', u'117 min')
(u'Type: ', u'movie')
(u'Poster: ', u'https://images-na.ssl-images-amazon.com/images/M/MV5BOWFiNjViN2UtZjIwYS00ZmNhLWIzMTYtYTRiMTczOGMzZGE0L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMjY5ODI4NDk@._V1_SX300.jpg')
(u'imdbVotes: ', u'864')
(u'imdbID: ', u'tt4972582')

API requests

  • pull some movie data down from the Open Movie Database (OMDB) using their API.
  • The movie you'll query the API about is The Social Network
    • The query string should have one argument t=social+network
  • Apply the json() method to the response object r and store the resulting dictionary in the variable json_data.
In [20]:
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)

print type(r.text)

print type(r.json())

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

print 
# Print each key-value pair in json_data
for key in json_data.keys():
    print(key + ': ', json_data[key])
{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 161 wins & 162 nominations.","Poster":"https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg","Metascore":"95","imdbRating":"7.7","imdbVotes":"496,009","imdbID":"tt1285016","Type":"movie","Response":"True"}
<type 'unicode'>
<type 'dict'>

(u'Plot: ', u'Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.')
(u'Rated: ', u'PG-13')
(u'Response: ', u'True')
(u'Language: ', u'English, French')
(u'Title: ', u'The Social Network')
(u'Country: ', u'USA')
(u'Writer: ', u'Aaron Sorkin (screenplay), Ben Mezrich (book)')
(u'Metascore: ', u'95')
(u'imdbRating: ', u'7.7')
(u'Director: ', u'David Fincher')
(u'Released: ', u'01 Oct 2010')
(u'Actors: ', u'Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons')
(u'Year: ', u'2010')
(u'Genre: ', u'Biography, Drama')
(u'Awards: ', u'Won 3 Oscars. Another 161 wins & 162 nominations.')
(u'Runtime: ', u'120 min')
(u'Type: ', u'movie')
(u'Poster: ', u'https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg')
(u'imdbVotes: ', u'496,009')
(u'imdbID: ', u'tt1285016')
In [2]:
# Import package
import requests

# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=machine+learning'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
extract = json_data['query']['pages']['233488']['extract']
print(extract)
<p><b>Machine learning</b> is the subfield of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data driven predictions or decisions, through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible; example applications include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.</p>
<p>Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses in prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is sometimes conflated with data mining, where the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can also be unsupervised and be used to learn and establish baseline behavioral profiles for various entities and then used to find meaningful anomalies.</p>
<p>Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data.</p>
<p></p>
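The chained indexing json_data['query']['pages']['233488']['extract'] simply walks a nested dictionary. A sketch with a hand-built miniature of the response (the keys and page id mirror the real payload's shape, but the content here is fabricated for illustration):

```python
# Hypothetical miniature of the Wikipedia API response structure
json_data = {
    'query': {
        'pages': {
            '233488': {'title': 'Machine learning',
                       'extract': '<p><b>Machine learning</b> is ...</p>'}
        }
    }
}

# When the page id is not known in advance, iterate over the pages dict
for page_id, page in json_data['query']['pages'].items():
    print(page_id, page['title'])

extract = json_data['query']['pages']['233488']['extract']
print(extract)
```

Iterating over the pages dict is the more robust pattern, since the numeric page id changes from query to query.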
In [ ]:
 

Get data from the web using Python (1): BeautifulSoup, requests, urllib

Basics

Using urllib

requests

BeautifulSoup

 

import-1

Importing flat files from the web

  • note: these examples use Python 2

  • University of California, Irvine's Machine Learning repository.

http://archive.ics.uci.edu/ml/index.html

  • 'winequality-red.csv': this flat file contains tabular data on physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating.
In [1]:
# Import package
import urllib

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urllib.urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print df.shape
(1599, 12)
In [2]:
df.head(3)
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

Opening and reading flat files from the web

  • To load a file from the web into a DataFrame without first saving it locally, pass the URL directly to pandas.
In [3]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame
# print(df.head())
print df.shape
# Plot first column of df
pd.DataFrame.hist(df.ix[:, 0:1], alpha=.4, figsize=(6,3))
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
(1599, 12)

Importing non-flat files from the web

  • use pd.read_excel() to import an Excel spreadsheet.
In [4]:
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheetname=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

print type(xl)
type(xl['1700'])
[u'1700', u'1900']
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000
<type 'dict'>
Out[4]:
pandas.core.frame.DataFrame
In [ ]:
 
In [5]:
from urllib2 import urlopen, Request

request = Request('http://jishichao.com')

response = urlopen(request)

html = response.read()

response.close()
In [6]:
print type(html)
len(html)
<type 'str'>
Out[6]:
4843

Printing HTTP request results in Python using urllib

  • You have just packaged and sent a GET request to "http://docs.datacamp.com/teach/" and caught the response. Such a response is an HTTP response object (from urllib2 in Python 2; an http.client.HTTPResponse in Python 3). The question remains: what can you do with this response?
  • Well, since it came from an HTML page, you can read it to extract the HTML; the response object has an associated read() method for exactly that.
In [7]:
# Import packages
from urllib2 import urlopen, Request

# Specify the url
url = "http://docs.datacamp.com/teach/"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

print type(html)
print 
# Print the html
print(html[:300])


# Be polite and close the response!
response.close()
<type 'str'>

<!DOCTYPE html>
<link rel="shortcut icon" href="images/favicon.ico" />
<html>

  <head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>Home</title>
  <meta name="description" content="A

Requests

  • higher-level and the most widely used HTTP library

Performing HTTP requests in Python using requests

  • do the same using the higher-level requests library.
In [8]:
import requests
r = requests.get('http://jishichao.com')
text = r.text
In [9]:
print type(text)
print type(text.encode('utf-8'))
<type 'unicode'>
<type 'str'>
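The unicode/str split above is Python 2 specific; in Python 3 the same distinction is str (text) versus bytes, with .encode()/.decode() converting between them:

```python
# Python 3: text is str, encoded text is bytes
text = 'café'
data = text.encode('utf-8')

print(type(text))   # <class 'str'>
print(type(data))   # <class 'bytes'>
print(len(text))    # 4 characters
print(len(data))    # 5 bytes -- 'é' takes two bytes in UTF-8
print(data.decode('utf-8') == text)  # True
```

In requests terms: r.text gives you decoded str, while r.content gives you the raw bytes.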

Beautiful Soup

Parsing HTML with BeautifulSoup

  • Import the function BeautifulSoup from the package bs4
  • Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
  • Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
  • Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup()
  • Use the method prettify() on soup and assign the result to pretty_soup
In [14]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print type(pretty_soup)
print 
print(pretty_soup[:300])
<type 'unicode'>

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="../static/my.css" rel="stylesheet" type="text/css"/>
  <title>
   welcome 23333
  </title>
  <!--      <script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r

Turning a webpage into data using BeautifulSoup: getting the text

  • Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
  • Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
In [11]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text[100:300])
<title>welcome 23333 </title>

An interactive Data Visualization Web App I wrote

My Notebook website built by Python Flask deployed on AWS
A image downloader for a specific website 'worldcosplay', the program helps you open their
  • Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a>; store the result in the variable a_tags
  • The variable a_tags is a result set: your job now is to iterate over it with a for loop and print the actual URL of each hyperlink; for every element link in a_tags, print link.get('href').
In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
<title>welcome 23333 </title>
http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=
      result&fr=&sf=1&fmq=1467292435965_R&pv=&ic=0&nc=1&z=&se=1&showtab
      =0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=草泥马动态图
http://www.shichaoji.com
http://www.jishichao.com:7777
http://www.jishichao.com:10086
https://jishichao.com
/windows0
/windows2
/windows1
/mac1
/linux64
/linux32
/plotting
./
In [13]:
type(a_tags), type(a_tags[0])
Out[13]:
(bs4.element.ResultSet, bs4.element.Tag)
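If bs4 is unavailable, the stdlib html.parser module can do the same href extraction. A minimal sketch on an inline HTML snippet (the snippet below is made up, not the real page):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html_doc = '<html><body><a href="/windows0">win</a> <a href="/mac1">mac</a></body></html>'
parser = LinkCollector()
parser.feed(html_doc)
print(parser.links)  # ['/windows0', '/mac1']
```

BeautifulSoup's find_all('a') is far more convenient for real pages; this just shows there is no magic behind it.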
In [ ]:
 
In [ ]:
 

An Image Crawler I wrote, with a GUI (graphical user interface) added and packaged into a single .exe file

In 2014, I wrote a Python image crawler for Cure WorldCosplay, a website where cosplayers from all over the world post their own pictures. It has about 10k active members and up to 10 million pictures posted.

The upside is that the program is packaged into a single executable file, so no programming environment is needed. However, some antivirus software may flag it as unsafe.

Here is the program!

Click names below for download

With interface

Without interface


Theoretically, if you have enough disk space, you can download all the pictures on that website (about 9,800 gigabytes); the only limit is your bandwidth. I have deployed 36 crawlers on a Linux server at the same time, downloading pictures 24/7 at the maximum internet bandwidth.

 

How to use it: