Data Mining for Business Analytics
Use RapidMiner to explore and experiment with the data preprocessing techniques discussed in the lectures. The data file needed for this assignment is available for download on Blackboard. The cars data set was originally compiled by Barry Becker and Ronny Kohavi of Silicon Graphics. It contains information about 261 automobiles manufactured in the 1970s and 1980s, including gas mileage, number of cylinders, cubic inches, horsepower, weight, time to 60, year, and brand. The data set is available as a .csv file. Please complete the following and document the major steps by taking screenshots:
- Download the data file cars.csv and then save the file in a local directory.
- Import the data set into RapidMiner by performing the following steps:
a. Create a new process
b. Create a new repository
c. Create data and process folders
d. Add cars.csv to the data folder
e. Explore the general characteristics of the data by visualizing the minimums, maximums, means, and standard deviations of the numerical attributes, as well as the distributions of weights, the brand, etc.
- Design a process that performs the following data preprocessing steps on the data:
a. Retrieve the cars.csv data set.
b. Discretize the Cylinder attribute by using the Discretize by Binning operator. Set the number of bins as 4.
c. Use min-max normalization to transform the values of the cubicinches attribute onto the range [0.0, 1.0] by using the range normalization method.
d. Discretize the hp attribute by using the Discretize by Frequency operator. Set the number of bins as 6.
e. Use z-score normalization to standardize the values of the weightlbs attribute.
f. Discretize the time-to-60 attribute by using the Discretize by Binning operator. Set the number of bins as 4.
g. Save the results in a file named “Results by Your Name” by using the Write CSV operator. Please take a screenshot of your analysis process consisting of all the operators.
h. Run the process and save it as “Data Preprocessing by Your Name”.
- Create a second process to perform basic correlation analysis by performing the following steps:
a. Retrieve the original cars.csv data set.
b. Transform the nominal brand attribute into numeric variables by using the Nominal to Numerical operator.
c. Perform fundamental correlation analysis among all the attributes. For this purpose, construct a complete correlation matrix by using the Correlation Matrix operator. Discuss your results by indicating any strong correlations (positive or negative) among pairs of attributes.
d. Take a screenshot of your analysis process consisting of all the operators.
e. Run the process and save the process as “Correlation Analysis by Your Name”
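
As an optional sanity check outside RapidMiner, the transformations named in the steps above (equal-width binning, equal-frequency binning, min-max normalization, z-score standardization, and Pearson correlation) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the sample values are made up and are not rows from cars.csv, and RapidMiner's operators may handle details such as bin boundaries, ties, or the choice of sample versus population standard deviation differently.

```python
import math
from statistics import mean, stdev


def min_max(values, lo=0.0, hi=1.0):
    """Min-max (range) normalization onto [lo, hi]; assumes values are not all equal."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (v - vmin) / span * (hi - lo) for v in values]


def z_score(values):
    """Z-score standardization; uses the sample standard deviation (an assumption)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Equal-width binning: return a 0-based bin index for each value."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Equal-frequency binning: each bin receives roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values for illustration only (NOT taken from cars.csv).
weights = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0]
mpg = [34.0, 30.0, 25.0, 21.0, 17.0, 14.0]

print("min-max:        ", min_max(weights))
print("z-score:        ", z_score(weights))
print("equal width (4):", equal_width_bins(weights, 4))
print("equal freq (6): ", equal_frequency_bins(weights, 6))
print("pearson r:      ", pearson(weights, mpg))
```

In the toy data, heavier cars get lower mileage, so the correlation comes out strongly negative. This is the kind of attribute pair to look for when discussing the correlation matrix produced in RapidMiner.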