News Analytics as a Service (NAaaS)

Extracting the focus time and focus location from news documents and plotting them on a map of Pakistan.

Problem overview

A news document has a focus time (the time or date the news refers to), a focus location (the place the news refers to), and a creation time (the time or date the news was created or uploaded).

However, it has been observed that the focus time and the focus location often differ from the creation time of the document and from the area mentioned in its headline. The reader is therefore usually left to work out the temporal and geographical specificity of the document on their own.

Description

NAaaS aims to be a dashboard-style web application with an integrated GIS (Geographic Information System). The application will offer several ways to filter news, for example by time frame, by geographical location, and by specific keywords that appear in the news documents. The filtered news will then be plotted on a map of Pakistan using the integrated GIS.
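The three filters described above can be sketched in a few lines of Python. This is only a minimal illustration: the `NewsItem` record, its field names, and the `filter_news` function are hypothetical and are not taken from the NAaaS code base.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class NewsItem:
    """Hypothetical record for one news document after extraction."""
    headline: str
    text: str
    focus_time: date
    focus_location: str


def filter_news(
    items: Iterable[NewsItem],
    start: Optional[date] = None,
    end: Optional[date] = None,
    location: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[NewsItem]:
    """Keep items that pass every filter that is set; None means 'no filter'."""
    result = []
    for item in items:
        # Time frame filter on the focus time (bounds are inclusive).
        if start is not None and item.focus_time < start:
            continue
        if end is not None and item.focus_time > end:
            continue
        # Location filter: case-insensitive exact match on the focus location.
        if location is not None and item.focus_location.lower() != location.lower():
            continue
        # Keyword filter: case-insensitive substring match in headline or body.
        if keyword is not None:
            haystack = (item.headline + " " + item.text).lower()
            if keyword.lower() not in haystack:
                continue
        result.append(item)
    return result
```

A dashboard query such as "robbery news in Lahore during 2021" would then be `filter_news(items, date(2021, 1, 1), date(2021, 12, 31), "Lahore", "robbery")`. The items that come back are the ones a GIS layer would plot.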

The system will automate the whole process: scraping news from online sources, extracting the focus time and focus location from each news document, and plotting the news on the map.
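To make the extraction step concrete, here is a deliberately naive sketch. It assumes that dates appear in the form "14 August 2021" and that locations can be found by looking up a small city list; the real module may well use a proper NLP approach instead. The gazetteer, its approximate coordinates, and the fallback to the creation time are all assumptions made for this illustration.

```python
import re
from datetime import date
from typing import Dict, Optional, Tuple

# Hypothetical mini-gazetteer: city -> approximate (latitude, longitude).
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "Karachi": (24.86, 67.00),
    "Lahore": (31.52, 74.36),
    "Islamabad": (33.68, 73.05),
}

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches explicit dates such as "14 August 2021".
DATE_RE = re.compile(r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b")


def extract_focus_time(text: str, creation_time: date) -> date:
    """Return the first explicit date in the text, else the creation time."""
    match = DATE_RE.search(text)
    if match is None:
        return creation_time  # assumed fallback when no date is mentioned
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS[month], int(day))
    except ValueError:  # e.g. "31 February 2021"
        return creation_time


def extract_focus_location(text: str) -> Optional[str]:
    """Return the gazetteer city mentioned most often, or None if none appear."""
    counts = {
        city: len(re.findall(r"\b" + re.escape(city) + r"\b", text))
        for city in GAZETTEER
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

The coordinates stored for the chosen city are what a map layer would need in order to place a marker for the document.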

The information extraction module will be implemented on a distributed computing architecture using big data tools such as Apache Kafka and Apache Spark for increased efficiency.
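The idea behind distributing the extraction work can be shown on a single machine: split the documents into partitions, process the partitions in parallel, and merge the results. The sketch below uses only the Python standard library as a stand-in. It does not use the Kafka or Spark APIs, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def partition(items: Sequence[T], n: int) -> List[List[T]]:
    """Split items into at most n contiguous chunks, preserving order."""
    size = max(1, -(-len(items) // n))  # ceiling division, at least 1
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def extract_in_parallel(
    docs: Sequence[T], extract: Callable[[T], R], workers: int = 4
) -> List[R]:
    """Apply `extract` to every document, one partition per worker task."""
    parts = partition(docs, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task processes a whole partition, like one executor in a cluster.
        processed = list(pool.map(lambda part: [extract(d) for d in part], parts))
    # Merge the per-partition results back into one list, in input order.
    return [result for part in processed for result in part]
```

In a real cluster the partitions would live on different machines and the documents would arrive as a stream, but the split, process, and merge pattern stays the same.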

Motivation

Imagine someone who wants to know about a specific event that occurred at a specific time. They have to go to a news website and manually check whether each reported event is relevant to them. If they want information about one specific event or topic over a whole time frame, things get even clumsier.

For example, X wants to know the number of robberies in city Y, area B, reported in the news from January 2021 to January 2022. If no third-party organization collects such information, X has to go through the news personally, look for the robbery keyword, and make sure the location is city Y, area B. X has to work quite hard to get this data for one year. But what about two years? Or not just robbery, but any other keyword or topic X might be interested in?

NAaaS provides a one-stop solution to many of these problems with news articles. It gives quick access to all the news about the selected event or topic within the given time frame and plots it on the map, which gives a better understanding of how the news relates to location and time. To deal with big data, the internal structure of NAaaS is to be distributed over a cluster of computers. Distributed computing should help improve response time and the execution of user queries, and data will be extracted from different sources with stream processing techniques using big data tools.

Architecture

The system architecture for NAaaS is divided into four main modules:

  1. Scraping modules to collect news and tweets.
  2. Information extraction modules to extract the focus time and focus location.
  3. Big data tools to enable distributed computing for increased efficiency.
  4. Geographic information module to plot queried news on a map of Pakistan.

A high-level architecture of the system can be seen in Figure 1.

Figure 1: High-level architecture of the system

Scraping Module
