Predictive Analytics

Table of Contents

Introduction
Task 1
1.1 Predictive Analytics
1.2 NoSQL Databases
1.3 Hadoop Ecosystem
1.4 Blockchain
1.5 CRISP Methodology
1.6 Supervised Learning
Task 2
Conclusions
References
Appendix

Introduction

This assignment concerns the Automobile research company and reflects applications of Big Data analysis and possible solutions based on our knowledge as data scientists. It is divided into two main tasks. Task 1 involves defining, reviewing and evaluating the following topics: Predictive Analytics, NoSQL Databases, Hadoop Ecosystem, Blockchain, CRISP Methodology and Supervised Learning. The discussion covers the implications of Big Data analytics for the company’s requirements. Task 2 involves interpreting the data set provided by the company using Tableau and is divided into five questions.
Task 1
1.1 Predictive Analytics

Predictive analytics is the process of discovering meaningful patterns in data (Abbott, 2014). It draws on several related disciplines, such as statistics, artificial intelligence, data mining and pattern recognition. Predictive analytics works on a database, where algorithms automate the search for patterns instead of relying on an analyst’s assumptions.
The most common use of predictive analytics is to maximize a company’s profit by developing tools that predict the behavior of its customers. Other uses of predictive analytics are:
• Detecting fraud: cyber security is a growing concern, and high-performance behavioral analytics scan the network in real time to discover abnormalities or vulnerabilities that could lead to fraud.
• Optimizing marketing campaigns: predictive analytics can be used to predict a customer’s purchases or responses from their browsing data.
• Managing resources: companies such as airlines and hotels use predictive analytics to forecast and manage their inventory in order to increase their profit.

For the Automobile research company, predictive analytics can be used to check a customer’s credit score in order to assess their purchase limit or, if the score is low, to decline to rent or lease them a car.
1.2 NoSQL Databases

NoSQL (Not only Structured Query Language) is one of the Big Data storage concepts that has grown quickly, and will continue to do so, as the number of users and the volume of data increase (EMC Education Services, 2014).
A NoSQL database, shown in figure 1, can be accessed through an API (Application Programming Interface) based query interface, which is often called from within an application. NoSQL databases mostly manage non-relational data, although several of them provide queries similar to SQL (Erl, 2015).

Figure 1 – NoSQL database (Erl, 2015)

Figure 2 – NoSQL providing API or SQL-like queries (Erl, 2015)
According to Erl (2015), the main characteristics that distinguish most NoSQL storage devices from a relational database management system (RDBMS) are:
• Schema-less data model,
• High scalability: the ability to add machines to, or replace machines in, the distributed system,
• High availability,
• Low operational costs: no licensing costs,
• Eventual consistency,
• Sharding and replication: the dataset is partitioned horizontally and the partitions are then copied to multiple nodes,
• A design that allows semi-structured and unstructured data to be stored,
• Fault tolerance.
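The sharding and replication item in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the replication factor of 2 and the hash-modulo placement rule are hypothetical and are not taken from Erl (2015) or from any particular NoSQL product.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names
REPLICAS = 2                            # assumed replication factor


def shard_for(key, num_shards):
    """Map a record key to a shard number with a stable hash (horizontal partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def nodes_for(key, nodes=NODES, replicas=REPLICAS):
    """Return the node that owns the key's shard plus the next nodes holding copies."""
    first = shard_for(key, len(nodes))
    return [nodes[(first + i) % len(nodes)] for i in range(replicas)]
```

Because the hash is stable, the same key is always routed to the same nodes, and losing one node still leaves a copy of each record on another.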
Depending on the storage method, NoSQL databases are mainly divided into key-value, document, column-family and graph storage.
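To make the four storage styles more concrete, the sketch below expresses one vehicle record in each of them using plain Python structures. Every name and value (the maker, the model, the sales figure, the field names) is invented for illustration, and real NoSQL products expose their own APIs rather than dictionaries.

```python
import json

# One hypothetical vehicle record expressed in each of the four storage styles.
# All names and values are invented for illustration.

# Key-value: the store only sees an opaque value addressed by a single key.
key_value = {"vehicle:1001": json.dumps({"maker": "Maker A", "model": "Alpha"})}

# Document: a self-describing, nested record whose fields can be queried.
document = {
    "_id": 1001,
    "maker": "Maker A",
    "model": "Alpha",
    "sales": [{"country": "UK", "units": 120}],
}

# Column family: each row key maps to named groups (families) of columns.
column_family = {
    1001: {
        "identity": {"maker": "Maker A", "model": "Alpha"},
        "engine": {"displacement_l": 2.0, "transmission": "Manual"},
    }
}

# Graph: entities are nodes and relationships are labeled edges.
graph = {
    "nodes": {"maker:A": "Maker A", "vehicle:1001": "Alpha"},
    "edges": [("maker:A", "PRODUCES", "vehicle:1001")],
}


def produced_by(graph, maker_id):
    """Follow PRODUCES edges from one maker node to the vehicles it makes."""
    return [
        graph["nodes"][target]
        for source, label, target in graph["edges"]
        if source == maker_id and label == "PRODUCES"
    ]
```

The graph form is the only one in which the relationship between maker and vehicle is stored explicitly, which is why it suits queries that follow connections.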

Figure 6 – NoSQL storage examples (Erl, 2015)
As figure 7 suggests, the Automobile research company can benefit from NoSQL storage in several ways, from storing to manipulating unstructured data about vehicle types, distribution, fuel consumption, sales and other technical details.

Figure 7 – Graph storage (Erl, 2015)

1.3 Hadoop Ecosystem

Hadoop is one of the main projects of the Apache Software Foundation and has been on the IT market for more than a decade. According to Prajapati (2013), it is used to store large volumes of data in a distributed environment so that the data can be processed in parallel. It is also a good fit for the Automobile research company because it provides distributed storage and computation across clusters of computers. The Hadoop ecosystem contains several components, each of which does a different job.
The most important components are:
• Hadoop Distributed File System (HDFS): the Hadoop storage layer, a distributed file system that provides high-throughput access to application data for both the input and the output of MapReduce jobs (Prajapati, 2013). It addresses three major issues for the company: cost, reliability (the data is copied several times and the copies are distributed to individual nodes) and speed (HDFS can deliver more than 2 GB of data per second per computer to MapReduce). Data is also shared between these data nodes.
• YARN: performs cluster resource management and job scheduling, and is the layer at which data processing comes into play. The YARN resource manager allows one or more processing engines to run on the Hadoop cluster (Hadoop.apache.org, 2021). Alongside YARN, the Hadoop Common module provides the Automobile research company with the utilities required by the other Hadoop modules, such as data compression, input/output operations and error detection, as well as interfaces and tools for authorizing proxy users, authentication, data privacy and the administration of cryptographic access keys.
Hadoop is one of the most powerful and widely used systems in handling large volumes of information quickly and cheaply.
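The MapReduce jobs mentioned above follow a map, shuffle and reduce pattern. The sketch below imitates that pattern in a single Python process; it is not Hadoop’s API, and the function names and the three sample records are hypothetical. On a real cluster the map and reduce calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (manufacturer, 1) pair for each input record."""
    yield record["manufacturer"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence and return the totals per key."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input records
records = [{"manufacturer": "A"}, {"manufacturer": "B"}, {"manufacturer": "A"}]
```

Calling `run_job(records)` counts the records per manufacturer, which is the same kind of aggregation that Task 2 performs in Tableau.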
1.4 Blockchain

Blockchain technology is a system in which information is recorded in such a way that it is almost impossible to modify or hack the data. It is a “digital ledger” whose database is managed on a decentralized network, which makes it a much more resilient system than a classic centralized structure, as Lantz and Cawrey (2020) point out.

Figure 8 – Blockchain illustration (source: https://builtin.com/blockchain)

It was first implemented as a public archive that collected all the transactions of the well-known cryptocurrency Bitcoin, with the data ordered sequentially and deposited into storage units called blocks. As Asharaf and Adarsh (2017) mention, each block has a unique public fingerprint (hash) that is generated from the hash of the preceding block and the data contained in the block itself, so any change or attempted change is visible and affects the whole chain. This characteristic offers a high level of security and, because the database is trust-free, decentralised and distributed, many companies have recently started to use the technology as a business solution.
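The way each block’s hash depends on the preceding block can be illustrated with a short Python sketch. It assumes SHA-256 as the hash function and a JSON encoding of the block contents; it leaves out everything else a real blockchain needs (consensus, signatures, networking), and the vehicle-history entries used in the usage note are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(previous_hash, data):
    """Fingerprint a block from the previous block's hash and the block's own data."""
    payload = json.dumps({"previous_hash": previous_hash, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(entries):
    """Create one block per entry, each linked to the block before it."""
    chain = []
    previous_hash = GENESIS_HASH
    for data in entries:
        current = block_hash(previous_hash, data)
        chain.append({"previous_hash": previous_hash, "data": data, "hash": current})
        previous_hash = current
    return chain


def is_valid(chain):
    """Recompute every hash and check every link; any edit breaks the chain."""
    previous_hash = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(block["previous_hash"], block["data"]):
            return False
        previous_hash = block["hash"]
    return True
```

For example, a chain built with `build_chain(["service", "MOT", "repair"])` passes `is_valid`, but changing the data of any block afterwards makes `is_valid` return `False`, because the stored hash no longer matches the recomputed one.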

Figure 9 – Blockchain links through hash (source: https://medium.com/swlh/blockchain-characteristics-and-its-suitability-as-a-technical-solution-bd65fc2c1ad1)

In the automotive industry, blockchains are now used to record vehicle history information (repairs, MOTs, accidents) and share it with insurance companies or other third parties. Big names in the industry such as BMW (2020) use blockchain to verify the supply chain of car parts and materials, as they mention on their website. Furthermore, it can be used by car sales companies, as in our example, where the manufacturer’s data is provided and can be accessed by both the sales companies and their customers, which offers maximum transparency about products and other information.
1.5 CRISP Methodology

The Cross Industry Standard Process for Data Mining, known as the CRISP-DM methodology, is described by De Ville (2001) as a multinational standard method of devising, documenting and continuously improving data mining processes.

Figure 10 – CRISP-DM Methodology (Oliva Ramos and Stirrup, 2017)
This methodology is a non-linear, top-down approach described in six phases, as shown in figure 10.
Business understanding and data understanding, the first two phases, consist of gathering information from the business perspective: what we want to achieve, what the objectives are and what the plan is. The tools and techniques for data mining are also assessed at this stage, as they influence the project, and the quality of the data is verified.
Data preparation is the following step, in which the data that will be used for the project is selected. The quality of the data is also improved in this step, shaping it for the modeling phase.
In the modeling phase we select the techniques that are applied to the data. The models will be tweaked and refined to fit the business needs.
The fifth step is evaluation, in which the model is assessed to check whether it meets the business goals and to decide the next step.
The last phase is deployment, in which a plan to deploy the model is developed and documented. Here, predictive analysis helps the most to improve the functional part of the business. A plan to monitor and maintain the model is put in place, including the necessary steps and how to perform them. The final report is then produced; it includes the deliverables and summarizes and organizes the results of the data mining engagement.
As an example, for our business, the Automobile research company, we can use Tableau with the gathered data to check in which country a certain car manufacturer sold the most units.
1.6 Supervised Learning

Supervised machine learning is an approach in which an intelligent application is trained to understand its input and then to predict or classify the output. Labeled datasets and their outcomes are supplied in the initial training phase so that the system learns to interpret the data and divide it into categories. The next step is to test the trained model by giving it similar but unlabeled data to categorize. The process is illustrated in figure 11, where the machine is initially fed categorized data for training purposes, and in figure 12, where the trained machine classifies unlabeled data (Dasgupta, 2018; Erl, 2015).
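The two stages described above, training on labeled data and then classifying unlabeled data, can be sketched with a very small classifier. The example uses a nearest-centroid rule, chosen here only because it is short; it is not the algorithm shown in the figures. The features (engine displacement, fuel economy), the labels and all the values are hypothetical.

```python
from math import dist


def train(labeled_rows):
    """Learn one centroid (mean feature vector) per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in labeled_rows:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(model, features):
    """Assign an unlabeled example to the label whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical labeled training data: (engine displacement, fuel economy) -> category
training = [
    ((1.4, 35.0), "economy"),
    ((1.6, 33.0), "economy"),
    ((5.0, 16.0), "performance"),
    ((6.0, 14.0), "performance"),
]
model = train(training)
```

After training, `predict(model, (1.5, 34.0))` categorizes a new, unlabeled car by comparing it with what was learned from the labeled examples.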

Figure 11 – Classification of the labeled dataset (Erl, 2015)

Figure 12 – Classification of the unlabeled dataset (Erl, 2015)
Supervised learning is categorised according to the algorithm used in the classification or prediction process. Linear regression supports forecasting and predicting outcomes from historical information. The Automobile research company could benefit from it in several marketing strategies, including anticipating sales, predicting the optimum car price or predicting the next car purchases based on campaigns or other sources. Logistic regression, on the other hand, is a probabilistic classification algorithm. In the Automobile research company, this algorithm could be applied to predict the probability of an online purchase (Prajapati, 2013).
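As a rough illustration of the two algorithms, the sketch below fits a straight line by ordinary least squares and evaluates the logistic (sigmoid) function that turns a linear score into a probability. The function names are hypothetical, and the weight and bias of the logistic model are assumed to be given rather than learned; in practice a statistics package would estimate them from historical data.

```python
from math import exp


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def purchase_probability(weight, bias, x):
    """Logistic regression prediction for one feature, with assumed weight and bias."""
    return sigmoid(weight * x + bias)
```

Linear regression returns a number on the scale of the target (for example, expected sales), whereas the logistic model returns a probability that can be compared with a threshold to classify an outcome such as an online purchase.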
Task 2

Task 2 involves analysing the dataset on fuel economy for 2015 model cars, shown in figure 13, provided by the Automobile research company. Tableau is the suggested application for the data visualization and the problem-solving queries essential to the research.

Figure 13 – FEguide: the desired data source

Question 1
In the dataset provided for analysis, the car manufacturer corresponds to Mfr Name and the car model corresponds to Carline.

Figure 14 – Car manufacturers

Figure 15 – Car models

Figure 16 – The measure used to provide the quantity of car models

Figure 17 – The quantity of models for each car manufacturer
The count measure and the dimension are prepared for data visualization and sorted in descending order by the number of models corresponding to each car manufacturer. Figure 19 illustrates how color is used to distinguish each car manufacturer and provide additional support in visualizing the data (Murray, 2013). The car manufacturer with the highest number of models is therefore General Motors, with 127 models.
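The same count can be sketched outside Tableau in a few lines of Python. The sketch assumes that distinct Carline values are counted for each Mfr Name; if the Tableau measure counts rows instead, the totals would differ. The sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def models_per_manufacturer(rows):
    """Count distinct car models per manufacturer, sorted in descending order."""
    carlines = defaultdict(set)
    for row in rows:
        carlines[row["Mfr Name"]].add(row["Carline"])
    counts = {mfr: len(models) for mfr, models in carlines.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows using the two column names from the dataset
sample_rows = [
    {"Mfr Name": "Maker A", "Carline": "Alpha"},
    {"Mfr Name": "Maker A", "Carline": "Beta"},
    {"Mfr Name": "Maker A", "Carline": "Alpha"},  # same model, second variant
    {"Mfr Name": "Maker B", "Carline": "Gamma"},
]
```

The first element of the returned list is the manufacturer with the most models, which corresponds to the top bar of the sorted chart.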

Figure 18 – Sorting

Figure 19 – Data visualization result Question 1
Question 2

The average fuel economy for city and motorway driving is calculated using the formula shown in figure 20. The car manufacturer dimension is used to group the results. As Murray (2013) explains, the highest average combined fuel economy is emphasized through stronger color and larger size, both of which are gradually reduced in line with the corresponding value for each car manufacturer. As a result, the highest average combined fuel economy is 29.71, associated with Mazda.
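A possible version of this calculation is sketched below. It assumes that the combined value of each car is the mean of its city and highway figures and that these are then averaged per manufacturer; the actual formula is the one in figure 20 and may differ. The column names "City FE" and "Hwy FE" and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_combined_fe(rows):
    """Average of (city + highway) / 2 per manufacturer, rounded to two decimals."""
    per_mfr = defaultdict(list)
    for row in rows:
        combined = (row["City FE"] + row["Hwy FE"]) / 2  # assumed formula
        per_mfr[row["Mfr Name"]].append(combined)
    return {mfr: round(mean(values), 2) for mfr, values in per_mfr.items()}


def best_manufacturer(averages):
    """Return the (manufacturer, value) pair with the highest average."""
    return max(averages.items(), key=lambda item: item[1])


# Hypothetical rows; "City FE" and "Hwy FE" are assumed column names
sample_rows = [
    {"Mfr Name": "Maker A", "City FE": 20.0, "Hwy FE": 30.0},
    {"Mfr Name": "Maker A", "City FE": 30.0, "Hwy FE": 40.0},
    {"Mfr Name": "Maker B", "City FE": 10.0, "Hwy FE": 20.0},
]
```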

Figure 20 – Average city and highway fuel economy

Figure 21 – Average fuel economy for city and motorway

Figure 22 – Data visualization result Question 2

Question 3

The calculation and the values of the average fuel economy are shown in figure 23, as the same measure is needed for Question 3.

Figure 23 – Average fuel economy for each car manufacturer
The choice of transmission field is open to question, since both Transmission and Transmission Description are available. The first is the more suitable dimension because, from a data analysis perspective, input that is too detailed makes rapid interpretation harder. The chart shown in figure 26 communicates the details needed to determine the fuel economy values in context. The data is converted into a visual analysis using an appropriate graphic: blue and orange distinguish the types of transmission, and the size of the bars shows their values (Kosara & Mackinlay, 2013). According to the chart, the highest and lowest average fuel economy for automatic transmission are 29.44 (Mazda) and 16.70 (Aston Martin). The highest and lowest average fuel economy for manual transmission are 33.17 (Ford Motor) and 16.00 (Aston Martin).
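The grouping behind this chart can be sketched as follows: the average fuel economy is computed for every combination of transmission type and manufacturer, and the highest and lowest manufacturers are then read off for one transmission type. The column name "Comb FE", the transmission labels "Auto" and "Manual" and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def fe_by_transmission(rows):
    """Average fuel economy for each (transmission, manufacturer) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Transmission"], row["Mfr Name"])].append(row["Comb FE"])
    return {key: mean(values) for key, values in groups.items()}


def extremes(averages, transmission):
    """Return ((best manufacturer, value), (worst manufacturer, value)) for one type."""
    subset = {mfr: avg for (trans, mfr), avg in averages.items() if trans == transmission}
    high = max(subset, key=subset.get)
    low = min(subset, key=subset.get)
    return (high, subset[high]), (low, subset[low])


# Hypothetical rows; "Comb FE" and the transmission labels are assumed
sample_rows = [
    {"Mfr Name": "Maker A", "Transmission": "Auto", "Comb FE": 30.0},
    {"Mfr Name": "Maker A", "Transmission": "Auto", "Comb FE": 28.0},
    {"Mfr Name": "Maker B", "Transmission": "Auto", "Comb FE": 18.0},
    {"Mfr Name": "Maker A", "Transmission": "Manual", "Comb FE": 32.0},
    {"Mfr Name": "Maker B", "Transmission": "Manual", "Comb FE": 16.0},
    {"Mfr Name": "Maker B", "Transmission": "Manual", "Comb FE": 20.0},
]
```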

Figure 24 – Analogy between dimensions

Figure 25 – Value of the average fuel economy for the transmission type corresponding to each car manufacturer.

Figure 26 – Data visualization result Question 3

Question 4

Given that the car manufacturer is represented by Mfr Name and the wheel drive type by Drive desc, the next step is to extract only the models that have 2-wheel or 4-wheel drive and an engine power over 3.5. The distinct color assigned to each car manufacturer helps in reading the output clearly.

Figure 27 – All wheel drive models classified by car manufacturer

Figure 28 – Filter wheel drive
Assuming that engine displacement corresponds to engine power, the next step is to calculate the average engine power for every model.
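The filtering steps for this question can be sketched in the same order: keep only the wanted drive types, average the engine displacement for each model, and then keep the models whose average is above 3.5. The drive labels, the column name "Eng Displ" and the sample rows are assumptions; only "Mfr Name", "Carline" and the drive description field are named in the dataset discussion.

```python
from collections import defaultdict
from statistics import mean

WANTED_DRIVES = {"2-Wheel Drive", "4-Wheel Drive"}  # assumed labels in the drive field
MIN_DISPLACEMENT = 3.5                              # threshold from the question


def large_engine_models(rows):
    """Average displacement per model for the wanted drive types, kept only if over 3.5."""
    displacement = defaultdict(list)
    for row in rows:
        if row["Drive Desc"] in WANTED_DRIVES:
            displacement[(row["Mfr Name"], row["Carline"])].append(row["Eng Displ"])
    averages = {key: mean(values) for key, values in displacement.items()}
    return {key: avg for key, avg in averages.items() if avg > MIN_DISPLACEMENT}


# Hypothetical rows; "Eng Displ" is an assumed column name
sample_rows = [
    {"Mfr Name": "Maker A", "Carline": "Alpha", "Drive Desc": "2-Wheel Drive", "Eng Displ": 3.0},
    {"Mfr Name": "Maker A", "Carline": "Alpha", "Drive Desc": "2-Wheel Drive", "Eng Displ": 5.0},
    {"Mfr Name": "Maker A", "Carline": "Beta", "Drive Desc": "4-Wheel Drive", "Eng Displ": 2.0},
    {"Mfr Name": "Maker B", "Carline": "Gamma", "Drive Desc": "All Wheel Drive", "Eng Displ": 6.0},
    {"Mfr Name": "Maker B", "Carline": "Delta", "Drive Desc": "4-Wheel Drive", "Eng Displ": 3.5},
]
```

Note that the comparison is strict, so a model whose average displacement is exactly 3.5 is excluded.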

Figure 29 – Average engine displacement

Figure 30 – Engine power adjustment over 3.5

Figure 31 – Data visualization Question 4

Question 5

Tableau vs Power BI

Tableau, Microsoft Excel and Power BI are some of the best-known business analytics platforms used for Business Intelligence and data visualization. Silva (2020) pointed out that both Tableau and Power BI are approachable for any user because they have an intuitive interface. Both are efficient in terms of setting up visualizations, creating agreements, reducing cost and time, and connecting to external data sources; however, Tableau allows its users to work with any number of data points and can connect to a wider variety of data sources.

Figure 32 – Tableau vs Power BI illustration (source: https://www.betterbuys.com/bi/tableau-vs-power-bi/)

In terms of customer support, Power BI offers limited support to free-account users, while paid accounts receive improved support. Tableau has subscription tiers tailored to its customers’ needs and tends to provide a higher standard of support for all its clients, according to customer feedback on TrustPilot.com. Regarding cost, Power BI can be an affordable option compared with Tableau: it offers a 60-day trial and its lowest payment plan is $9.99, while Tableau offers only a 14-day trial and its lowest payment plan is $25 more expensive than Power BI’s.
Overall, both systems have strengths and weaknesses. Power BI is more suitable for small companies that are looking for a reasonably priced solution, whereas Tableau offers high-quality support and an unlimited number of data points. The final decision depends on the customer’s budget and needs.

Conclusions

Data analysis tools, either in a data lake that stores data in native format or in a data warehouse, are still under development.
The combination of Big Data and analysis is an important part of keeping the company one step ahead of the competition. But the company must also create the conditions to allow data researchers and analysts to test theories based on the data they have.

References

Abbott, D., 2014. Applied predictive analytics. 1st ed. Wiley.
Asharaf, S. and Adarsh, S., 2017. Decentralised Computing Using Blockchain Technologies and Smart Contracts: Emerging Research and Opportunities. New York: IGI Global.
BMW, 2020. How Blockchain Automotive can help drivers. Available at: https://www.bmw.com/en/innovation/blockchain-automotive.html. (Accessed: 04/03/2021).
Dasgupta, N., 2018. Practical Big Data Analytics: Hands-on techniques to implement enterprise analytics and machine learning using Hadoop, Spark, NoSQL and R. Birmingham: Packt Publishing.
De Ville, B., 2001. Microsoft data mining. Woburn, Mass.: Butterworth-Heinemann.
EMC Education Services, 2014. Data science and big data analytics: discovering, analyzing, visualizing and presenting data. Indianapolis: Wiley-Blackwell.
Erl, T., 2015. Big Data Fundamentals: Concepts, Drivers & Techniques. Indiana: Pearson Education.
Kosara, R. & Mackinlay, J., 2013. Storytelling: the next step for visualisation. Computer, Volume 46, pp. 44-50.
Lantz, L. and Cawrey, D., 2020. Mastering Blockchain: Unlocking the Power of Cryptocurrencies, Smart Contracts and Decentralised Applications. 1st ed. USA: O’Reilly Media Inc.
Murray, D. G., 2013. Tableau your data! Fast and Easy Visual Analysis with Tableau Software. Indianapolis: John Wiley & Sons.
Oliva Ramos, R. and Stirrup, J., 2017. Advanced analytics with R and Tableau. Birmingham: Packt Publishing.
Prajapati, V., 2013. Big data analytics with R and Hadoop. Birmingham: Packt Publishing.
SaM Solutions, 2021. Pros and Cons of Tableau Software for Data Visualization. Available at: https://www.sam-solutions.com/blog/tableau-software-review-pros-and-cons-of-a-bi-solution-for-data-visualization/. (Accessed: 3/03/2021).
Silva, R. F., 2020. Power BI, Excel and Tableau - Business Intelligence Clinic. USA: independently published - Daniane Silva.

Appendix