Data Management System


Twenty pages of the reference document have already been written. Some editing is required, and the provided outline is to be used to finalize the document at seventy pages. A writer with good quantitative skills will be required. The target audience is already identified in the original document. Initial work will be submitted as soon as a writer is agreed on. The writer’s CV will be reviewed by the client, and additional information will be provided as the assignment progresses.

Concept Paper
By Bongs Lainjo
(Draft zero, not for circulation)
Data Management System (DAMSys)
Data management is a concept whose time has come. It has evolved gradually and surely from a limited number of fields to the current situation, where its ubiquity can no longer be obscured. Its broad appeal has made it popular in scientific, industrial, business, governmental and organizational settings, to mention just a few. This diversity has, however, generated its own controversies. For example, there are as many interpretations of the term ‘Data Management’ as there are interested stakeholders. The silver lining is what these interpretations have in common: generating and transforming data to support informed decision-making.

When one looks at the myriad fields involved in managing data, one is struck by the simplicity on the one hand and the complexity on the other. For instance, the store clerk who tracks and manages commodity supplies is interested in how stock flows while trying to minimize stock-outs. At the other end of the spectrum is the rocket scientist, whose complex systems depend on reliable and stable data sets in order to accomplish certain goals and objectives.

Data management as we know it today has also been popularized by the strong demands on companies and organizations to produce meaningful results. This is most pronounced in the area of Monitoring and Evaluation (M and E). Where M and E activities are available and used judiciously, they have served as a driving force in underscoring the importance and usefulness of data management.

From an M and E perspective (which is a significant focus of this document), data management will continue to be viewed as a complementary force towards the achievement of reliable results: output (intermediate result), outcome and goal (strategic objective). In program management, output, outcome and goal are strategic elements of a logical framework (log frame). They are generally used by governments and program implementing agencies. The other terms, intermediate results and strategic objective, are primarily used by the US government. The United States Agency for International Development (USAID) is a strong proponent of this framework. From a program management view, the outputs (or lower-level indicators) are generally monitored, while the higher-level indicators (outcomes and goals) are generally evaluated. In the former case, the objective is to assess the degree of progress (or lack of it) towards achieving the planned results. In the latter case, the objective is to establish the level of satisfaction among the beneficiaries of the intervention. Program implementing partners generally recognize two distinct categories of evaluation: process and impact.

The different definitions of data management (DM) notwithstanding, DM is defined here as a system of components ranging from the conceptual stage to the decision-making level. Details of this definition are presented below and as a model in Annex 1.

The framework presented below will generally apply to managing primary data sets, although in certain cases it also applies to secondary data sets. Details on primary and secondary data will be presented elsewhere in the book.
The approach used in developing this model was motivated by the author’s experience and a desire to develop a book that is as inclusive as possible. Because of this degree of inclusiveness, demographers, epidemiologists (working on cross-sectional, control and longitudinal studies), pollsters and others will also find the book useful. While a course in basic statistics is assumed, its absence will not be a major obstacle to understanding most of the material. Some basic understanding of the fundamentals of ‘univariate’ analysis will, however, be needed in order to better appreciate the section on analysis.

Conceptual Framework
The data management life cycle (framework) is made up of the following components:
•    Conceptualization;
•    Instrument Design;
•    Pilot Testing;
•    Instrument Review;
•    Implementation;
•    Data Analysis;
•    Report Generation;
•    Decision Making;
•    Process Review.

The elements that define each of these components are:



The approach used in developing this manual will, to the extent possible, focus on:

•    Presenting a contextual meaning of the subject matter based on the author’s experience and a review of the available literature;
•    Developing an example to illustrate the applicability of the subject matter; and
•    Presenting an appropriate and relevant case study when available.

All of this will be presented in fairly non-technical language, to the extent possible, in order to make the manual as reader-friendly as possible. Every effort will be made to stay focused on the subject and avoid unnecessary digressions.

Target audience
As mentioned elsewhere, the author has tried to interest as many readers as possible. Groups that will find this document useful include:

Developmentalists;
Academic Institutions;
Civil Societies;
Non-Governmental Organizations;
Program Implementing Partners.


It is anticipated that the final product of this initiative will be a book covering all the topics identified in this concept note. This should help narrow the gap that currently exists in the area of data management. It is hoped that such a book, if presented in precise, non-technical language, will go a long way towards generating interest among readers from different spheres of life.

(Estimated Pages >= 500, TNR, 12)

The outline presented below will serve in helping us accomplish our objectives. The following chapters will complement those that relate directly to the framework. In the model, each component represents a chapter with corresponding elements serving as sub-titles of the chapter.

1.    Introduction (Overview);
2.    Background;
3.    Objectives:
i.    General;
ii.    Specific
4.    Literature Review;
5.    Target Audience;
6.    Variables and Indicators:
i.    Definition;
ii.    Description and Types;
iii.    Indicator Selection;
iv.    Indicator Prioritization (See Annex 2 – PRISM table and description);

7. Application of Data Management Framework (Cross-Cutting themes)
8. Basic Statistics:
i.    Frequency Tables (Proportions);
ii.    Mean;
iii.    Median;
iv.    Percentile;
v.    Confidence Level;
vi.    Confidence Interval;
vii.    Significance Test;
9.  Sampling:
i.    Sampling Frame;
ii.    Sampling Unit;
iii.    Sample Size;
iv.    Sampling Methods:
a.    Purposive;
b.    Simple Random Sampling;
c.    Systematic Sampling;
d.    Probability Sampling;
e.    Cluster Sampling;
10. Measurement Scales:
i. Nominal;
ii. Ordinal;
iii. Metric (continuous);
iv. Categorical;
11. Data and Information:
i. Primary;
ii. Secondary;
iii. Qualitative;
iv. Quantitative;
12. Statistical Software Packages:
i. Epi Info;
ii. SPSS;
iii. SAS;
iv. Stata;
v. Systat;
13. Data Management Model (Annex 1. Each component represents a chapter):
a. Introduction;
b. Theoretical Background;
c. Classical Example;
d. Case Study;
14. Conclusion
Annex 1: Data Management Framework

Annex 2: Program Indicator Screening Matrix (PRISM)

PRISM – Description

Historically, development program implementation has been plagued by a complex set of challenges. These range from the program design stage to implementation and sustainability. A key element of program design and implementation has been the selection of relevant indicators and the ability to optimize their robustness and mitigate bias. These challenges continue to influence effective program management initiatives. Attempts to address some of these issues vary from program to program. For example, what some program designers identify as low-level indicators sometimes end up, in practice, representing higher-level indicators. Such a scenario without doubt misrepresents the potential results.

The PRISM tool is a table aimed at extensively analyzing each indicator. This effort is executed by a team of experts who are selected and grouped based on their relevant expertise. An initial attempt is made to clearly describe the matrix, its limitations and how it assists in addressing some of the challenges faced by program implementing partners in establishing meaningful indicators. The final outcome of this exercise is a consensus, or a degree of concordance (or discordance), among the team members.


A team in this context comprises a group and sub-groups, the latter obviously being sub-sets of the former. AS A RULE, these groups are made up of an odd number of members. For example, if only one expert is available, a team is not possible. If two experts are available, one group can be established, and where they disagree on whether or not to recommend an indicator, a coin is tossed. Furthermore, if three experts are available, the sub-group is the group: all three members work as a group, and the recommendations are considered a group decision based on the degree of concordance. The process in this case is simple and straightforward: the majority decision (here, two out of three) prevails. That is how the rule of odd numbers is used. The preceding description addresses the outlier scenarios.
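
The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the PRISM tool itself: the function name group_decision is hypothetical, and a random draw stands in for the coin toss.

```python
import random

def group_decision(votes, rng=random):
    """Return True (recommend) or False (reject) from a list of 0/1 votes.

    The majority prevails. A tie can only occur with an even number of
    experts (e.g. two) and is settled by a simulated coin toss.
    """
    if len(votes) < 2:
        raise ValueError("a team needs at least two experts")
    yeses = sum(votes)
    nays = len(votes) - yeses
    if yeses == nays:
        return rng.random() < 0.5  # coin toss (assumption: a fair coin)
    return yeses > nays

# Three experts, two of whom recommend the indicator
print(group_decision([1, 1, 0]))  # True
```

With an odd number of members the coin-toss branch is never reached, which is the practical reason for the rule of odd numbers.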

To the extent possible, this model works better if we can establish as many sub-groups and groups as possible without losing sight of the distribution (an odd number of sub-group and group members). For example, if we have ten experts, we can easily create two sub-groups of five members. In this case the group has ten members and there are two sub-groups. In general, the number of members in a sub-group should be limited to five. Experience has confirmed that when a sub-group has more than five members, some members become overwhelmed and tend not to participate fully. Finally, before each sub-group starts work, its members are required to select a moderator and a secretary. The moderator then presents the sub-group’s findings during the final group meeting.

The Matrix:
The table used in establishing the number of acceptable indicators is made up of as many ROWS as there are indicators and TEN COLUMNS. The first row holds the description of each column. For example, in row one, column one, we fill in the relevant thematic area, result level and indicator. In the next eight columns (still on row one), we fill in the respective criteria that will be used in screening the indicators. In the rows below, we have a table of binary elements, i.e. zeros and ones (0, 1). A zero marks an indicator that does not satisfy the corresponding criterion, and a one marks an indicator that fulfills it. The same process applies to all the criteria and corresponding indicators. Column seven summarizes the scores in terms of the number of yeses (or 1s). The seventh column is the final score attained by each indicator. This is represented as a percentage of yeses in the row. The last column is the final outcome. This column tells us whether, based on the scores (1s), we should go ahead and recommend the indicator or not. The ‘gold standard’ for this exercise is 100%. That is, an indicator that scores yeses on all the criteria automatically qualifies for implementation.

Because this is a composite analysis, we need to remember that a final outcome, that is, the outcome identified in the last column, is only valid when all these criteria are considered simultaneously. What happens if no indicator satisfies all these conditions? The answer is simple. Before the sub-groups begin their assignment, the team establishes a consensus on an acceptable level. For example, the team could agree before the exercise starts that any indicator that reaches a decision level of 70% (total yeses divided by the sum of yeses and nays) will be considered acceptable. This bar can vary. For example, if the team recognizes that a certain threshold tends to admit too many redundant indicators, the bar can be raised in order to further refine the choices.
The following paragraphs attempt to define the meaning of each criterion as it applies in the matrix.

Specificity: This refers to the likelihood of the indicator measuring the relevant result. In other words, is there a possibility that the result the indicator represents does not represent exactly what we are looking for?

Reliability: This criterion is synonymous with replicability. That is, does the indicator consistently produce the same result when measured over a certain period of time? For example, if two or more people calculated this indicator independently, would they come up with the same result? If the answer is yes, the indicator has satisfied the condition and a ‘one’ is entered in that cell; a zero is entered otherwise.
Sensitivity: It’s a test that tries to assess the stability of an indicator. For example, does the indicator continue to deliver the same result with a small variation of either the numerator or denominator? How does the result change when assumptions are modified? Does the indicator actually contribute to the next higher level? For example, an indicator at the output level accounting for one at the outcome level will yield a misleading result. If the same indicator accounts for two or more result levels simultaneously, it is not stable. As indicated earlier, any indicator that satisfies a criterion is given a one in the corresponding cell and a zero otherwise.
Simplicity: A convoluted indicator represents challenges at many levels. Hence here, we are looking for an indicator that is easy to collect, analyze and disseminate. Any indicator that satisfies these conditions automatically qualifies for inclusion. The zero/one process is then followed as indicated above.
Utility: This refers to the degree to which the information generated by the indicator will be used. The objective of this criterion is to help streamline indicators so that the decision maker can make an informed decision. This can be either during the planning process or during the re-alignment process, the latter being occasions when an organization is evaluating the current status of its mandate.
Affordability: This is simply a cost-effective perspective of the indicator in question. Can the program/project afford to collect and report on the indicator? In general, it takes at least two comparable indicators to establish a more efficient and cost-effective one. The one that qualifies is included at that criterion level. And the same process as outlined above is followed.

Inclusion: The penultimate column (8) simply represents the composite score. The total number of yeses is divided by the total number of criteria (in this case, seven) and multiplied by 100 to produce the score for each indicator. Each indicator is then classified as either accepted (if it scores 70% or more) or rejected.
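
The scoring arithmetic can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the criteria list holds only the six criteria defined in the paragraphs above, the divisor is simply the length of that list (so further criteria can be added), and the indicator names and 0/1 scores are invented for illustration.

```python
# Criteria defined in this annex; extend the list if the team uses more.
CRITERIA = ["specificity", "reliability", "sensitivity",
            "simplicity", "utility", "affordability"]

def prism_score(row):
    """Percentage of criteria scored 1 (yes) for one indicator."""
    return 100.0 * sum(row[c] for c in CRITERIA) / len(CRITERIA)

def screen(indicators, threshold=70.0):
    """Map each indicator to (score, outcome) using the agreed threshold."""
    results = {}
    for name, row in indicators.items():
        score = prism_score(row)
        outcome = "accepted" if score >= threshold else "rejected"
        results[name] = (round(score, 1), outcome)
    return results

# Hypothetical indicators and scores, invented for this example
indicators = {
    "Indicator A": dict(zip(CRITERIA, [1, 1, 1, 1, 1, 1])),
    "Indicator B": dict(zip(CRITERIA, [1, 1, 0, 1, 1, 1])),
    "Indicator C": dict(zip(CRITERIA, [1, 0, 0, 1, 0, 0])),
}

for name, (score, outcome) in screen(indicators).items():
    print(name, score, outcome)
# Indicator A 100.0 accepted
# Indicator B 83.3 accepted
# Indicator C 33.3 rejected
```

Raising the threshold argument (for example to 90) shows how the team can tighten the bar when too many indicators pass.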

Annex 2b: PRISM case-study – Intra-group Screening

Annex 2c: PRISM case study – Intra-group screening (Cont’d)

Annex 2d: PRISM case-study – Inter-group Screening

Annex 2e: PRISM case study – Inter-group screening (cont’d)

Annex 3: Sample illustration of Data Cleaning component

Introduction. This section describes how to consolidate and process quantitative data prior to analysis. An example is given that clarifies the application of each step outlined.
Why is Data Consolidation
As a critical first step, following data collection and prior to data analysis, raw quantitative data from questionnaires (or other data collection instruments) must be processed and consolidated in order to be usable. This will require some form of data cleaning, organising, and coding so that the data is ready to be entered into a database or spreadsheet, analysed and compared.
Quantitative data is usually collected using a data collection instrument such as a questionnaire. The number of questionnaires, or cases, is usually fairly large, especially where probability sampling strategies are used. Due to the nature of quantitative inquiry, most of the questions are closed-ended and solicit short ‘responses’ from respondents that are easy to process and code. It is almost always necessary to use computer software to analyse the data due to this relatively large number of cases (in comparison to qualitative data) as well as variables (e.g. questions on the questionnaire). Microsoft Excel and Access provide basic spreadsheet and database functions, whereas more specialised statistical software such as SPSS and Epi Info can be used where available and where expertise exists.
Ideally, consolidation and processing is conducted by the team of interviewers who completed the data collection (WFP or implementing partner staff or consultants). In many cases, however, additional staff are specifically tasked with the work of entering data into pre-formatted spreadsheets or databases. Data processing and consolidation needs to be well conducted and supervised, as it can significantly affect the quality of subsequent analysis.
Steps to follow for consolidating and processing Quantitative Data
The following 6 steps outline the main tasks related to consolidating and processing quantitative data, prior to analysis.
Step 1: Nominate a Person and set a Procedure to ensure the Quality of Data Entry
When entering quantitative data into the database or spreadsheet, set up a quality check procedure such as having someone who is not entering data check every 10th case to make sure it was entered correctly.
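
As a rough illustration of such a procedure, the Python sketch below picks every 10th case for checking and lists the fields where the entered record differs from the paper questionnaire. The function names and the sample record are hypothetical.

```python
def cases_to_check(case_ids, every=10):
    """Return every `every`-th case (the 10th, 20th, ...) for checking."""
    return [cid for position, cid in enumerate(case_ids, start=1)
            if position % every == 0]

def mismatched_fields(entered, questionnaire):
    """List the variables whose entered value differs from the questionnaire."""
    return [var for var in questionnaire
            if entered.get(var) != questionnaire[var]]

# 30 cases numbered 1..30: the checker reviews cases 10, 20 and 30
print(cases_to_check(range(1, 31)))  # [10, 20, 30]

# Hypothetical record: the village code was mistyped during data entry
on_paper = {"food_exp": 30, "total_exp": 50, "village": 1}
in_database = {"food_exp": 30, "total_exp": 50, "village": 2}
print(mismatched_fields(in_database, on_paper))  # ['village']
```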
Step 2: Entering Numeric Variables on Spreadsheets
Numeric variables should be entered into the spreadsheet or database with each variable on the questionnaire making up a column and each case or questionnaire making up a row. The type of ‘case’ will depend on the unit of study (e.g. individual, households, school, or other).
Step 3: Entering Continuous Variable Data on Spreadsheets
Enter raw numeric values for continuous variables (e.g. age, weight, height, anthropometric Z-scores, income). A new categorical variable can be created from the continuous variable later to assist in analysis. Where two or more variables will be combined to make a third variable, be sure to enter each separately. (For example, the number of children born and the number of children who died should be entered as separate variables, and the proportion of children who have died could be created as a third variable.) The intent is to ensure that detail is not lost during data entry, so that categories and variable calculations can be adjusted later if need be.
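
A minimal Python sketch of this idea follows. The two raw variables are kept as entered and the proportion is derived afterwards; the variable names and the figures are invented for illustration.

```python
def proportion_died(born, died):
    """Derived variable: share of children born who have died.

    Returns None when no children were born, since the proportion
    is undefined in that case.
    """
    if born == 0:
        return None
    return died / born

# Hypothetical cases: raw values are entered separately and left untouched
cases = [
    {"case": 1, "children_born": 4, "children_died": 1},
    {"case": 2, "children_born": 0, "children_died": 0},
]

for row in cases:
    row["prop_died"] = proportion_died(row["children_born"],
                                       row["children_died"])

print(cases[0]["prop_died"])  # 0.25
print(cases[1]["prop_died"])  # None
```

Because the raw counts remain in the data set, the derived variable can be recalculated or redefined later without going back to the questionnaires.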
Step 4: Coding and Labelling Variables
Code categorical nominal variables numerically (i.e. give each option in the variable a number). Where the variable is ordinal (i.e. it defines a thing’s position in a series), be sure to order the codes in a logical sequence (e.g. 1 equals the lowest and 5 equals the highest). In SPSS and some other software applications it is possible to give each numeric code a value label (i.e. the nominal label that corresponds with the numeric code). For Excel and other software that does not have this function, create a key for each nominal variable that lists the numeric codes and the corresponding nominal labels.
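
One way to keep such a key is sketched below in Python, using the village and education codes from the worked example later in this annex. The variable and function names are hypothetical.

```python
# Nominal variable: the codes carry no order
VILLAGE_CODES = {"Hagadera": 1, "Kulan": 2, "Bardera": 3}

# Ordinal variable: listed from least to most education, so the codes
# generated below (1, 2, 3, 4) follow the logical sequence
EDUCATION_LEVELS = [
    "no formal schooling",
    "some primary, did not complete",
    "completed primary",
    "completed primary, some secondary",
]
EDUCATION_CODES = {label: code
                   for code, label in enumerate(EDUCATION_LEVELS, start=1)}

def make_key(codes):
    """Invert a label-to-code mapping into a code-to-label key."""
    return {code: label for label, code in codes.items()}

village_key = make_key(VILLAGE_CODES)
print(VILLAGE_CODES["Kulan"])                # 2
print(village_key[3])                        # Bardera
print(EDUCATION_CODES["completed primary"])  # 3
```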
Step 5: Dealing with a Missing Value
Be sure to enter a 0 for cases in which the answer given is 0; do not leave the cell blank. A blank cell indicates a missing value (e.g. the respondent did not answer the question, the interviewer skipped the question by mistake, the question was not applicable to the respondent, or the answer was illegible). It is best practice to code missing values as 99, 999, or 9999. Make sure the number of 9s makes the value an impossible one for the variable (e.g. for a variable that is ‘number of cattle’, use 9999 since 99 cattle may be a plausible number in some areas). It is important to code missing values so that they can be excluded during analysis on a case-by-case basis (i.e. by setting the missing value outside the range of plausible values, you can selectively exclude it from analysis in any of the computer software packages described above).
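
The effect of excluding the missing-value code can be shown with a small Python sketch. The values are those of the ‘number of children under 5’ variable in the worked example later in this annex; the function name is hypothetical.

```python
MISSING = 99  # code chosen to lie outside the plausible range

def mean_excluding_missing(values, missing_code):
    """Average the valid responses only; None if nothing is valid."""
    valid = [v for v in values if v != missing_code]
    if not valid:
        return None
    return sum(valid) / len(valid)

children_under_5 = [2, 0, 99, 3]  # case 3 gave no answer

naive_mean = sum(children_under_5) / len(children_under_5)
clean_mean = mean_excluding_missing(children_under_5, MISSING)

print(naive_mean)            # 26.0, distorted by the missing-value code
print(round(clean_mean, 2))  # 1.67, based on the three valid answers
```

Note that the genuine answer of 0 for case 2 is kept in the calculation, which is why a zero must be entered rather than left blank.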
Step 6: Data Cleaning Methods
Even with quality controls it will be necessary to ‘clean the data’, especially for large data sets with many variables and cases. This allows for obvious errors in data entry to be corrected as well as for excluding responses that simply do not make sense. (Note that the majority of these should be caught in data collection, but even the best quality control procedures miss some mistakes.) To clean the data run simple tests on each variable in the dataset. For example a variable denoting the sex or gender of the respondent (1 = male, 2 = female) should only take values 1 or 2. If a value such as 3 exists, then you know a data entry mistake has occurred. Also look for impossible values (outside the range of plausibility) such as a child weighing 100 kg, a mother being 10 years old, a mother being a male, etc.
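
Such tests can be written as simple validity rules, as in the Python sketch below. The rule for sex follows the coding given above; the plausible ranges for child weight and mother’s age are assumptions chosen only for illustration, as are the sample rows.

```python
# Each rule returns True when the value is plausible for that variable.
RULES = {
    "sex": lambda v: v in (1, 2),                  # 1 = male, 2 = female
    "child_weight_kg": lambda v: 0.5 <= v <= 50,   # assumed plausible range
    "mother_age": lambda v: 12 <= v <= 60,         # assumed plausible range
}

def find_invalid(rows, rules):
    """Return (case, variable, value) for every value that fails its rule."""
    problems = []
    for row in rows:
        for var, is_valid in rules.items():
            if var in row and not is_valid(row[var]):
                problems.append((row["case"], var, row[var]))
    return problems

# Hypothetical data: case 2 contains three entry errors
rows = [
    {"case": 1, "sex": 2, "child_weight_kg": 12.5, "mother_age": 27},
    {"case": 2, "sex": 3, "child_weight_kg": 100, "mother_age": 10},
]

for problem in find_invalid(rows, RULES):
    print(problem)
# (2, 'sex', 3)
# (2, 'child_weight_kg', 100)
# (2, 'mother_age', 10)
```

Each flagged value should first be compared with the questionnaire; only if the questionnaire does not resolve it is the value set to missing.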
An Example of Quantitative Data Consolidation and Processing through the Application of the 6 Steps outlined above
In this example, each household is the unit of study for the survey and is considered a case.
Step 1: Nominate a Person and set a Procedure to ensure the Quality of Data Entry
Every 4th case will be checked by a non-data entry person (e.g. a field editor) to ensure quality in data entry.
Step 2: Entering Numeric Variables on Spreadsheets and
Step 3: Entering Continuous Variable Data on Spreadsheets
Q1: The estimated expenditure on food in the last 3 months
Responses on Questionnaires:
Case 1 $30
Case 2 $23
Case 3 $112
Case 4 $40

Q2: The estimated total expenditure in the last 3 months
Responses on Questionnaires:
Case 1 $50
Case 2 $35
Case 3 $140
Case 4 $35
Enter into database or spreadsheet and create a third variable that is food expenditure as a percentage of total expenditure.

          Food Expenditure    Total Expenditure    Food Exp as a
          in Last 3 Months    in Last 3 Months     % of Total Exp
Case 1    $30.00              $50.00               60.00%
Case 2    $23.00              $35.00               65.71%
Case 3    $112.00             $140.00              80.00%
Case 4    $40.00              $35.00               114.29%
Step 4: Coding and Labelling Variables
Code the nominal variables using numeric values. For ordinal variables make sure the order or sequence of numeric values makes sense.
Q3: Name of Village (with corresponding numeric code added)
Case 1 Hagadera = 1
Case 2 Hagadera = 1
Case 3 Kulan = 2
Case 4 Bardera = 3
Q4: Highest level of education completed by the head of household (with corresponding ordinal numeric codes that reflect least education to most)
Case 1 some primary, did not complete = 2
Case 2 no formal schooling = 1
Case 3 completed primary, some secondary = 4
Case 4 completed primary = 3
Enter into database:

          Food Expenditure    Total Expenditure    Food Exp as a     Village    Highest ed. level
          in Last 3 Months    in Last 3 Months     % of Total Exp               completion by HofHH
Case 1    $30.00              $50.00               60.00%            1          2
Case 2    $23.00              $35.00               65.71%            1          1
Case 3    $112.00             $140.00              80.00%            2          4
Case 4    $40.00              $35.00               114.29%           3          3
Step 5: Dealing with a Missing Value
Coding missing values
Q5: Number of children under 5 in household
Case 1 = 2
Case 2 = 0
Case 3 = no answer given (missing value)
Case 4 = 3
Enter into database, giving the missing value a value of 99 (we use 99 because, with multiple wives, 9 children under 5 within a household is a possibility, even though a remote one for this area).

          Food Expenditure    Total Expenditure    Food Exp as a     Village    Highest ed. level      Number of children
          in Last 3 Months    in Last 3 Months     % of Total Exp               completion by HofHH    U5 in HH
Case 1    $30.00              $50.00               60.00%            1          2                      2
Case 2    $23.00              $35.00               65.71%            1          1                      0
Case 3    $112.00             $140.00              80.00%            2          4                      99
Case 4    $40.00              $35.00               114.29%           3          3                      3
Step 6: Data Cleaning Methods
Run data validity checks to ‘clean the data’. Try to find impossible values for each variable. If they are found and reverting to the questionnaire does not clarify the mistake, then set the value to missing (step 5).
In this case the third variable in case 4 (refer to the table under step 5) suggests either an entry error or a mistake on the questionnaire. Food cannot be 114% of total expenditure, since food is a portion of expenditure and the maximum value it could take is 100% (food expenditure represents all expenditure).
After reverting to the questionnaire, it is confirmed that the data was entered correctly and that the error lies in the respondent’s understanding of the question or in the interviewer’s recording of the response. It is decided that the best course of action is to set variables 1, 2, and 3 for Case 4 to ‘missing’ so that the analysis is not misleading.

          Food Expenditure    Total Expenditure    Food Exp as a     Village    Highest ed. level      Number of children
          in Last 3 Months    in Last 3 Months     % of Total Exp               completion by HofHH    U5 in HH
Case 1    $30.00              $50.00               60.00%            1          2                      2
Case 2    $23.00              $35.00               65.71%            1          1                      0
Case 3    $112.00             $140.00              80.00%            2          4                      99
Case 4    9999                9999                 999               3          3                      3

