Using Requests Session Objects for web scraping

Some months ago I had to write a Python script to retrieve solar energy data from a website. It was basically a handful of HTTP calls: make the requests, parse the responses, which were mostly JSON, and store the results as JSON and CSV for later processing.

Since it was such a small task, I used the Requests module instead of a complete web scraping framework. The HTTP requests had to be made with a timeout and also pass certain headers. I started out customizing each call individually, until I learned about Requests Session objects.
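Without a session, every call has to repeat the same configuration. A minimal sketch of what that looks like (the header name, timeout value, URLs and credentials here are just placeholders):

import requests

# Every call repeats the same headers and timeout.
headers = {'x-test': 'true'}

login = requests.post('https://portal.login.url',
                      data={'user': 'username', 'password': 'secret'},
                      headers=headers, timeout=10)
home = requests.get('https://portal.home.url', headers=headers, timeout=10)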

You create a session, much as you would in an ORM/JPA, which defines a context with certain properties and controls behaviour that is orthogonal to the individual calls.

import requests

# Example timeout in seconds for every request; pick a value that suits the site.
TIMEOUT = 10

def get(session, url):
    """Utility function to HTTP GET a URL."""
    response = session.get(url, timeout=TIMEOUT)
    return response

def post(session, url, data):
    """Utility function to HTTP POST a URL with form data."""
    response = session.post(url, data=data, timeout=TIMEOUT)
    # Raise an exception for 4xx/5xx responses.
    response.raise_for_status()
    return response

# Placeholder credentials; the real script read these from configuration.
user = 'username'
password = 'secret'

with requests.Session() as s:
    # The x-test header will be sent with every request made through the session.
    s.headers.update({'x-test': 'true'})

    data = {'user': user, 'password': password, 'rememberme': 'false'}
    r = post(s, 'https://portal.login.url', data)

    r = get(s, 'https://portal.home.url')

Besides letting you set headers once for all requests, the session also means you don't have to worry about redirects: the library follows them automatically by default, up to a limit of 30.
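If you want to inspect or limit that behaviour, Requests exposes it directly: a response keeps the intermediate redirect responses in its history attribute, and the session has a max_redirects setting. A small sketch (the URL, timeout and limit are placeholders):

import requests

with requests.Session() as s:
    # Lower the redirect ceiling from the default of 30 to fail fast.
    s.max_redirects = 5

    r = s.get('https://portal.home.url', timeout=10)
    # r.history holds the redirect responses that led to the final r.url.
    print([resp.status_code for resp in r.history], r.url)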

Happy hacking!
