Introduction to the project
Completed work thus far
Up to now, we have finished part of the basic statistical study of the data from the 2019 Stack Overflow Annual Developer Survey.
(1) The popularity of programming languages
Based on our statistical results (Figure 1), JavaScript, CSS, HTML, SQL, Python, and Java are the six most popular programming languages: more than 40% of respondents were using each of these languages in their extensive development work. Comparing this with the languages respondents plan to learn, JavaScript is the most widely used language today and, given its large user base, is expected to remain the most desired language next year. However, the number of JavaScript users may decrease, as shown in Figure 2; instead, we may expect more users to turn to Python. We attribute this to the widespread application of Python in machine learning: mastering Python has become a basic requirement for future computational engineers.
Figure 1. The distribution of programming languages used among the respondents of the 2019 Stack Overflow Annual Developer Survey
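The language counts behind Figure 1 can be reproduced with a short pandas sketch. The semicolon-separated multi-choice column is, to our knowledge, named `LanguageWorkedWith` in the 2019 data; the toy answers below are purely illustrative:

```python
import pandas as pd

# Toy stand-in for the survey's ';'-separated multi-choice column.
df = pd.DataFrame({'LanguageWorkedWith': [
    'JavaScript;HTML/CSS;Python',
    'JavaScript;SQL',
    'Python;Java',
    None,  # respondent skipped the question
]})

# Split each answer on ';', explode to one row per language, then count.
langs = (df['LanguageWorkedWith']
         .dropna()
         .str.split(';')
         .explode())
# Percentage of answering respondents who use each language.
percents = langs.value_counts() / df['LanguageWorkedWith'].notna().sum() * 100
print(percents)
```

On the real data this yields the per-language percentages plotted in Figure 1; a language can exceed 40% because each respondent may list several.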
(2) Education Situation
We then study the education levels of the respondents. As shown in Figure 3, about half of the respondents have a bachelor’s degree as their highest education level, and about a quarter have a master’s degree. Doctoral degrees are rare, accounting for only 2.82%. By contrast, it is not uncommon for respondents to hold no bachelor’s degree or higher.
The majority of respondents majored in computer science, computer engineering, or software engineering (Figure 4). However, some respondents majored in other fields, such as other branches of engineering, administration, science, and business, which suggests that computer science is widely used in, and connected with, other fields.
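Since education level is a single-choice column, the breakdown above reduces to a `value_counts` sketch (the labels below are shortened stand-ins, not the survey's exact category names):

```python
import pandas as pd

# Illustrative education answers (single-choice column).
edu = pd.Series(['Bachelor', 'Bachelor', 'Master', 'Bachelor', 'Doctoral',
                 'Secondary', 'Master', 'Bachelor'])

# Percentage per education level, mirroring Figure 3.
shares = edu.value_counts(normalize=True) * 100
print(shares.round(2))
```

Here the toy data gives 50% bachelor's and 25% master's, echoing the proportions reported for the real survey.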
(3) Job Situation
Firstly, we want to know whether the field of computational engineering suffers from overtime work. As the statistical results in Figure 5 show, most people work 40 hours a week, which is about the same as in other fields. Encouragingly, most respondents are satisfied with their current career (Figure 6).
We then plot the distribution of salaries (Figure 7). The salary range is very wide, with a long tail, but most respondents’ salaries fall below $75,000/year.
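A minimal sketch of the long-tail check, on illustrative salary values (the 2019 survey's annualized USD column is, we believe, `ConvertedComp`):

```python
import pandas as pd

# Illustrative salaries (USD/year) with a long right tail.
sal = pd.Series([30000, 45000, 60000, 72000, 90000, 250000, 1_000_000])

share_under_75k = (sal < 75000).mean()  # fraction of respondents below $75,000
print(f'{share_under_75k:.0%} earn under $75,000/year')
# A long tail shows up as mean far above median.
print('median:', sal.median(), 'mean:', sal.mean())
```

On the real data the same two lines quantify "most salaries fall below $75,000" and how heavy the tail is.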
Figure 2. The distribution of programming languages expected to be used in the next year among the respondents of the 2019 Stack Overflow Annual Developer Survey
Figure 3. The distribution of education levels among the respondents of the 2019 Stack Overflow Annual Developer Survey
Figure 4. The distribution of majors among the respondents of the 2019 Stack Overflow Annual Developer Survey
Figure 5. The distribution of working hours per week
Figure 6. The distribution of job satisfaction
Figure 7. The distribution of salaries
What’s left to complete
(1) To extract further statistical information from the data.
(2) To include the feature “time” in our statistical study.
(3) To learn a regression model to study which features have an impact on income.
Timeline
- Data importing and mining with Python (01/10/2019-15/10/2019): We will use the Pandas, NumPy, SciPy, and Matplotlib libraries to import, sort, and group the data and to compute basic statistical results.
- Model study with machine learning (16/10/2019-31/10/2019): We will use linear regression (Bayesian learning) to find the relationship between salaries and factors such as education, programming language, and working years, so that we can build a model that predicts salaries from a student’s background.
- Result presentation (01/11/2019-15/11/2019): We will present our data mining results and salary prediction model with a user-friendly interface.
- Paper and PPT preparation (16/11/2019-30/11/2019).
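The model-study step above could be sketched with scikit-learn's `BayesianRidge` on synthetic data; the feature names and coefficients below are invented for illustration only:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 200
# Synthetic features standing in for encoded survey answers.
years_exp = rng.uniform(0, 20, n)
has_masters = rng.integers(0, 2, n)
# Invented ground truth: base 40k, +3k per year, +10k for a master's.
salary = 40000 + 3000 * years_exp + 10000 * has_masters + rng.normal(0, 5000, n)

X = np.column_stack([years_exp, has_masters])
model = BayesianRidge().fit(X, salary)

# Bayesian learning also yields a predictive standard deviation.
pred, std = model.predict([[5, 1]], return_std=True)  # 5 yrs exp, master's
print('coefficients:', model.coef_)
print('predicted salary:', pred[0], '+/-', std[0])
```

The `return_std=True` output is one practical reason to prefer Bayesian linear regression here: predictions for students come with an uncertainty estimate.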
What challenges have you faced
(1) The main challenges come from the data source itself. Firstly, since this dataset is the result of a questionnaire, the features contain different data types. For example, respondents’ ages are integers, while respondents’ working places are strings. The mixed-type features not only require different statistical methods to understand the results, but also complicate choosing a uniform data format for the later model study, i.e., the regression model for understanding the impact of different features on income.
Secondly, the types of questions in the questionnaire differ. There are mainly three types: single-choice, multiple-choice, and short-answer questions. Compared with single-choice questions, multiple-choice and short-answer questions need extra treatment before they can be categorized and sorted.
(2) Another problem is that the dataset contains numerous null values. If we drop every row containing any null value, the data shrinks to about 4% of its original size: using the 2019 database as an example, this strict dropping cuts the original 88,883 entries down to 3,475. This drastic reduction follows directly from the nature of the data, since respondents are expected to skip questions they are not interested in. We therefore need to handle the null values carefully.
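The trade-off between strict and per-column null handling can be seen on a toy frame (column names illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking sparse survey answers (NaN = skipped question).
df = pd.DataFrame({
    'Age': [25, np.nan, 31, 40],
    'Country': ['US', 'DE', None, 'IN'],
    'Salary': [70000, 50000, np.nan, np.nan],
})

strict = df.dropna()              # keeps only fully-answered rows
print(len(df), '->', len(strict))  # heavy data loss, as in the report

# Gentler: drop nulls per analysis, keeping every answer that exists.
print(df['Salary'].dropna().mean())
```

Here the strict drop keeps only 1 of 4 rows, while the per-column approach still uses both available salaries; the same effect, at scale, is the 88,883-to-3,475 collapse described above.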
What new risks do you see
(1) How do we include the feature “time” in our statistical study? It would be interesting to show trends over time, and fortunately we have 9 years of Stack Overflow Annual Developer Survey data. However, the surveys are stored in separate files, each holding about 60 MB of information, which makes for a very heavy workload. The more difficult part is finding tools that can present the variation over time.
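One way the yearly files could be combined is to tag each frame with its survey year and concatenate; the toy frames below stand in for the real per-year CSVs (which would each come from `pd.read_csv`), and the column names are illustrative:

```python
import pandas as pd

# Toy stand-ins for the per-year survey files.
by_year = {
    2018: pd.DataFrame({'Salary': [50000, 60000]}),
    2019: pd.DataFrame({'Salary': [55000, 70000]}),
}

frames = []
for year, df in by_year.items():
    frames.append(df.assign(Year=year))  # tag each row with its survey year
combined = pd.concat(frames, ignore_index=True, sort=False)

# A trend over time is then a simple groupby, e.g. median salary per year.
trend = combined.groupby('Year')['Salary'].median()
print(trend)
```

Column names drift between survey years, so in practice each frame would need renaming to a common schema before the concat; `sort=False` keeps pandas from reordering columns.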
(2) The most challenging part of our project is learning a regression model to study which features have an impact on income. One risk, as mentioned above, is that we need to settle on a uniform data format before the regression study. Another is that many features in the data have no relationship to the target “income”; we need to carefully select the effective features and discard the redundant ones.
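One simple first screening step, sketched on synthetic data, is ranking numeric features by absolute correlation with income (feature names invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    'YearsCode': rng.uniform(0, 25, n),
    'Age': rng.uniform(18, 65, n),
    'RandomId': rng.uniform(0, 1, n),  # deliberately irrelevant feature
})
# Invented ground truth: income depends only on YearsCode.
df['Income'] = 35000 + 2500 * df['YearsCode'] + rng.normal(0, 8000, n)

# Screen numeric features by absolute correlation with the target.
corr = df.corr()['Income'].drop('Income').abs().sort_values(ascending=False)
print(corr)
```

The irrelevant `RandomId` column lands near zero while `YearsCode` dominates; this kind of ranking is only a coarse filter (it misses nonlinear and categorical effects), but it helps discard clearly redundant features before the regression.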
What are your DB schemas
We use a relational DB schema.
What technologies are you using
(1) Data import:
We used the pandas library (DataFrame).
(2) Null data:
We used the pandas DataFrame API to drop null data.
(3) Statistical calculation:
We used the NumPy library.
(4) Figure plotting:
We used the pandas hist method and the seaborn library on top of matplotlib.pyplot.
(5) Document interpretation:
We used the sklearn.feature_extraction library.
Teams: who’s doing what
Appendix
The example code:
import warnings
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series
# Load data
df2019 = pd.read_csv(r'D:\course\database\developer_survey_2019\survey_results_public.csv')
# count_vect / document_counts come from vectorizing the ';'-separated
# multi-choice answers, e.g. (column name illustrative):
# from sklearn.feature_extraction.text import CountVectorizer
# count_vect = CountVectorizer(tokenizer=lambda s: s.split(';'), lowercase=False)
# document_counts = count_vect.fit_transform(df2019['LanguageWorkedWith'].dropna())
percents = np.asarray(document_counts.sum(axis=0)).ravel() / document_counts.shape[0] * 100
percents = np.round(percents)
df = pd.DataFrame({'name': count_vect.get_feature_names(), 'value': percents})
df = df.sort_values(by=['value'], ascending=False)
df['label'] = df['value'].map(lambda x: str(x) + '%')
df['name'] = df['name'].str.cat(df['label'], sep=':')
plt.figure(figsize=(12, 6))
sns.barplot(x=df['value'], y=df['name'], alpha=0.9)
plt.xlabel('Percent', fontsize=12)
plt.ylabel('', fontsize=12)
plt.show()
dw = df2019['WorkWeekHrs']
print(dw.describe())                      # summary statistics
print(dw.max())                           # pandas max() skips NaN values
counts, bins = np.histogram(dw.dropna())  # raw histogram values, if needed
dw.hist(bins=50, figsize=(10, 5), range=(0, 100))
plt.xlabel('Hours', fontsize=15)
plt.ylabel('Counts', fontsize=15)
plt.show()