Front Page News | A Dash Application
We can't have, like, willy-nilly proliferation of fake news. That's crazy. You can't have more types of fake news than real news. That's allowing public deception to go unchecked. That's crazy.
- Elon Musk
This is how it looks.
Front Page News App
Introduction
While there are a lot of news apps out there, most of them are not ad-free (and ad blockers are often not allowed these days). The news is sandwiched between ads that none of us want to see, and the whole experience suffers. With the intention of improving my general awareness (which still hasn't improved), I embarked on this project to create a news app, but with a twist.
I used web scraping to collect front-page links of "The Hindu" newspaper (known to be credible) and compiled them into a Dash app. Dash is an open-source Plotly project for building interactive applications at scale. I deployed the application on Heroku, a cloud platform as a service.
A screenshot of the application is below.
Step 1: Structure of the app
Firstly, the news links are scraped from the front page of the newspaper. This happens each time the app is loaded, so the links are refreshed every day. The links are then loaded as options in a drop-down menu.
The next step is to follow each link and extract content from the page: the heading, location, timestamp, intro title and body copy. This is performed in real time when a user clicks a link.
Step 2: Coding the layout
Initiating Dash layout
A basic Dash layout looks as shown in the code below.
```python
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

app = dash.Dash()
server = app.server
app.title = 'Front Page News | Sumit Kant'

app.layout = html.Div([
    html.H1('Front Page News | Sumit Kant'),
])

if __name__ == '__main__':
    app.run_server(debug=True)
```
Getting front-page links
I used the `requests`, `urllib` and `BeautifulSoup` packages to extract links from the front page of the newspaper, as shown below.
```python
import requests
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.thehindu.com/todays-paper/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

front_page = [(x.get_text(), x.get('href'))
              for x in soup.find_all('ul', {'class': 'archive-list'})[0].find_all('a')]
```
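The `(text, href)` tuples produced above map directly onto the `options` format that `dcc.Dropdown` expects. A minimal sketch with made-up headlines (the sample data here is hypothetical, standing in for the scraped result):

```python
# Hypothetical sample of scraped (headline, url) pairs
front_page = [
    ('Headline A', 'https://www.thehindu.com/todays-paper/article-a'),
    ('Headline B', 'https://www.thehindu.com/todays-paper/article-b'),
]

# dcc.Dropdown expects a list of {'label': ..., 'value': ...} dicts:
# the label is what the user sees, the value is what the callback receives
options = [{'label': text, 'value': href} for text, href in front_page]
```

Because the value carried by each option is the article URL itself, the callback in the next step can fetch and parse the page directly.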
Extracting content in real-time
Each link served in the drop-down is, when clicked, treated as a new soup, and the relevant content is extracted, as shown in the function below.
```python
def display_article(value):
    # Parse the selected article page
    temp_soup = BeautifulSoup(requests.get(value).text, "html.parser")

    # Intro text under the image
    intro_text = temp_soup.find('h2', {'class': 'intro'})
    intro_text = ' ' if intro_text is None else intro_text.get_text()

    # Article body copy
    article = temp_soup.find('div', {'class': 'paywall'})
    if article is None:
        article = 'Article not available'
    else:
        article = [html.P(x.get_text().strip()) for x in article.find_all('p')]

    # Article image (fall back to a placeholder if none is found)
    try:
        img = temp_soup.find_all('source')[0].get('srcset')
    except IndexError:
        img = 'https://i.ytimg.com/vi/wcTpTYQv7lg/maxresdefault.jpg'

    # Title, city and timestamp
    title = temp_soup.find('h1', {'class': 'title'}).get_text()
    city = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[0].get_text().strip()[:-1]
    time_stamp = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[3].get_text().strip()

    return city, time_stamp, title, intro_text, img, article
```
Putting it all together
I used the `@app.callback` decorator to link inputs and outputs to the HTML layout. The `Input` component of `dash.dependencies` receives input from the drop-down, whose value is then passed to the function `display_article()` as an argument.
The function call returns multiple items, which are received by the `Output` components of `dash.dependencies`. Each component takes an output value and displays it in the `children` of the component selected by `id`.
For example, the output "city" of the function call `display_article()` will be displayed as the child element of the `div` tag with `id = "display-city"`.
The finalized code is shown below.
```python
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import requests
import urllib.request
from bs4 import BeautifulSoup

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css',
                        'https://codepen.io/sumitkant/pen/oNgzyjw.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

# Scrape today's front-page links at startup
url = 'https://www.thehindu.com/todays-paper/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
front_page = [(x.get_text(), x.get('href'))
              for x in soup.find_all('ul', {'class': 'archive-list'})[0].find_all('a')]

server = app.server
app.title = 'Front Page News | Sumit Kant'

app.layout = html.Div([
    html.H1('Front Page News'),
    html.Div(dcc.Dropdown(
        id='dropdown',
        options=[{'label': i[0], 'value': i[1]} for i in front_page],
        value=front_page[0][1]
    )),
    html.Small(id='display-city'),
    html.Small(id='display-timestamp'),
    html.H2(id='display-title'),
    html.H5(id='display-intro'),
    html.Img(id='display-image'),
    html.Div(id='display-article')
])

@app.callback(
    [Output('display-city', 'children'),
     Output('display-timestamp', 'children'),
     Output('display-title', 'children'),
     Output('display-intro', 'children'),
     Output('display-image', 'src'),
     Output('display-article', 'children')],
    [Input('dropdown', 'value')])
def display_article(value):
    # Parse the selected article page
    temp_soup = BeautifulSoup(requests.get(value).text, "html.parser")

    # Intro text under the image
    intro_text = temp_soup.find('h2', {'class': 'intro'})
    intro_text = ' ' if intro_text is None else intro_text.get_text()

    # Article body copy
    article = temp_soup.find('div', {'class': 'paywall'})
    if article is None:
        article = 'Article not available'
    else:
        article = [html.P(x.get_text().strip()) for x in article.find_all('p')]

    # Article image (fall back to a placeholder if none is found)
    try:
        img = temp_soup.find_all('source')[0].get('srcset')
    except IndexError:
        img = 'https://i.ytimg.com/vi/wcTpTYQv7lg/maxresdefault.jpg'

    # Title, city and timestamp
    title = temp_soup.find('h1', {'class': 'title'}).get_text()
    city = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[0].get_text().strip()[:-1]
    time_stamp = temp_soup.find('div', {'class': 'ut-container'}).findChildren()[3].get_text().strip()

    return city, time_stamp, title, intro_text, img, article

if __name__ == '__main__':
    app.run_server(debug=True)
```
Step 3: Deploying on Heroku
For deploying a Dash application on Heroku, the documentation provided by Plotly is clear enough. Just replace the code in the documentation with your `app.py` code and follow along as described.
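Beyond `app.py`, a Heroku deployment of a Dash app typically needs a `Procfile` telling Heroku how to start the server with gunicorn (pointing at the `server = app.server` object exposed above) and a `requirements.txt` listing the dependencies. A sketch of the two files follows; the exact package list should match your own imports, and you may want to pin versions.

Procfile:

```
web: gunicorn app:server
```

requirements.txt (versions omitted here; pin as needed):

```
dash
requests
beautifulsoup4
gunicorn
```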
Results
Check the link below for the final result. The app is currently optimized for mobile.
Front Page News App