Front Page News | A Dash Application

We can't have, like, willy-nilly proliferation of fake news. That's crazy. You can't have more types of fake news than real news. That's allowing public deception to go unchecked. That's crazy.
- Elon Musk

This is what it looks like.
Front Page News App

Introduction

While there are plenty of news apps out there, most of them are not ad-free (and ad blockers are often not allowed nowadays). The news is sandwiched between ads that none of us want to see, and the whole experience suffers. With the intention of improving my general awareness (which still hasn't improved), I embarked on this project to build a news app with a twist.

I used web scraping to collect the front-page links of "The Hindu" newspaper (known to be credible) and compiled them into an app built with Dash, an open-source Plotly framework for building interactive applications at scale. I deployed the application on Heroku, a cloud platform-as-a-service.

A screenshot of the application is shown below.

application screenshot

Step 1: Structure of the app

First, the news links are scraped from the front page of the newspaper. This happens whenever the app is loaded, so the links are refreshed each day. The scraped links are then loaded as options in a drop-down menu.

The next step is to follow each link and extract content from the page. This involves extracting the heading, location, timestamp, intro title and body copy. This step is performed in real time, when a user selects a link.

Step 2: Coding the layout

Initializing the Dash layout

A basic Dash layout is shown in the code below.

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

app = dash.Dash()

# Expose the underlying Flask server (needed later for deployment with gunicorn on Heroku)
server = app.server

app.title = 'Front Page News | Sumit Kant'
app.layout = html.Div([
    html.H1('Front Page News | Sumit Kant'),
])

if __name__ == '__main__':
    app.run_server(debug=True)
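
Running this file with python app.py starts the Dash development server, which serves the app at http://127.0.0.1:8050/ by default.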

Getting front-page links

I used the requests, urllib and BeautifulSoup packages to extract links from the front page of the newspaper, as shown below.

import requests
import urllib.request
from bs4 import BeautifulSoup

# Fetch today's front page and parse it
url = 'https://www.thehindu.com/todays-paper/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect (headline text, article URL) pairs from the archive list
front_page = [(x.get_text(), x.get('href')) for x in soup.find_all('ul', {'class': 'archive-list'})[0].find_all('a')]
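
As a quick sanity check, you can print the first few scraped entries; the exact headlines and URLs will of course vary with each day's edition.

# Quick sanity check (illustrative): print the first few scraped links
for text, href in front_page[:3]:
    print(text.strip(), '->', href)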

Extracting content in real-time

Each link served in the drop-down, when clicked, is fetched and parsed into a new soup, and the relevant content is extracted, as shown in the function below.

def display_article(value):
    # Fetch and parse the selected article page
    temp_soup = BeautifulSoup(requests.get(value).text, "html.parser")

    # Intro text under the image
    intro_text = temp_soup.find('h2', {'class': 'intro'})
    if intro_text is None: intro_text = ' '
    else: intro_text = intro_text.get_text()

    # Article body copy
    article = temp_soup.find('div', {'class': 'paywall'})
    if article is None: article = 'Article not available'
    else: article = [html.P(x.get_text().strip()) for x in article.find_all('p')]

    # Article image (fall back to a placeholder if none is found)
    try:
        img = temp_soup.find_all('source')[0].get('srcset')
    except IndexError:
        img = 'https://i.ytimg.com/vi/wcTpTYQv7lg/maxresdefault.jpg'

    # Article title
    title = temp_soup.find('h1', {'class': 'title'}).get_text()

    # Location (city) and timestamp from the update-time container
    city = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[0].get_text().strip()[:-1]
    time_stamp = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[3].get_text().strip()

    return city, time_stamp, title, intro_text, img, article

Putting it all together

I used @app.callback to link inputs and outputs to the HTML layout.

The Input component of dash.dependencies receives the value selected in the drop-down, which is then passed to the function display_article() as an argument.

The function call returns multiple items, which are received by the Output components of dash.dependencies. Each Output takes the corresponding returned value and displays it in the children of the component selected by id (or, for the image, in its src).

For example, the "city" value returned by display_article() is displayed as the child element of the component with id = "display-city".
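
As a stripped-down sketch of that wiring (the full code follows), a single Output can be linked to the drop-down like this; show_city is just an illustrative placeholder that echoes the selected value rather than the real parsing logic.

# Stripped-down sketch: one Output wired to the drop-down's value
@app.callback(
    Output('display-city', 'children'),
    [Input('dropdown', 'value')])
def show_city(value):
    # The real app parses this URL with BeautifulSoup; here we just echo it back
    return value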

The finalized code is shown below.

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import requests
import urllib.request
from bs4 import BeautifulSoup


external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css', 'https://codepen.io/sumitkant/pen/oNgzyjw.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

# Scrape today's front-page links once, when the app starts
url = 'https://www.thehindu.com/todays-paper/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
front_page = [(x.get_text(), x.get('href')) for x in soup.find_all('ul', {'class': 'archive-list'})[0].find_all('a')]


# Expose the Flask server for gunicorn/Heroku
server = app.server

app.title = 'Front Page News | Sumit Kant'
app.layout = html.Div([
    html.H1('Front Page News'),
    
    # Drop-down populated with the (headline, URL) pairs scraped above
    html.Div(dcc.Dropdown(
        id='dropdown',
        options=[{'label': i[0], 'value':i[1] } for i in front_page],
        value=front_page[0][1]
    )),

    html.Small(id = 'display-city'),
    html.Small(id = 'display-timestamp'),
    html.H2(id = 'display-title'),
    html.H5(id = 'display-intro'),
    html.Img(id = 'display-image'),
    html.Div(id = 'display-article')
])

@app.callback([
    Output('display-city', 'children'),
    Output('display-timestamp', 'children'),
    Output('display-title', 'children'),
    Output('display-intro', 'children'),
    Output('display-image', 'src'),
    Output('display-article', 'children'),],
    [Input('dropdown', 'value')])

def display_article(value):
    # Fetch and parse the selected article page
    temp_soup = BeautifulSoup(requests.get(value).text, "html.parser")

    # Intro text under the image
    intro_text = temp_soup.find('h2', {'class': 'intro'})
    if intro_text is None: intro_text = ' '
    else: intro_text = intro_text.get_text()

    # Article body copy
    article = temp_soup.find('div', {'class': 'paywall'})
    if article is None: article = 'Article not available'
    else: article = [html.P(x.get_text().strip()) for x in article.find_all('p')]

    # Article image (fall back to a placeholder if none is found)
    try:
        img = temp_soup.find_all('source')[0].get('srcset')
    except IndexError:
        img = 'https://i.ytimg.com/vi/wcTpTYQv7lg/maxresdefault.jpg'

    # Article title
    title = temp_soup.find('h1', {'class': 'title'}).get_text()

    # Location (city) and timestamp from the update-time container
    city = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[0].get_text().strip()[:-1]
    time_stamp = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[3].get_text().strip()

    return city, time_stamp, title, intro_text, img, article


if __name__ == '__main__':
    app.run_server(debug=True)

Step 3: Deploying on Heroku

For deploying a Dash application on Heroku, the documentation provided by Plotly is clear enough. If you are deploying your own Dash app on Heroku, just replace the app.py code in the documentation with your own and follow along as described.
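
In that setup, the key pieces are a Procfile containing the single line web: gunicorn app:server (which is why app.py exposes server = app.server) and a requirements.txt listing the app's dependencies, including gunicorn.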

Results

Check the link below for the final result. This app is currently optimized for mobile.
Front Page News App