What Is Missing Data in Research

Some data can be lost during the research process, or values for certain participants or variables may never have been recorded. Such data is considered missing. Missing data can occur when:

  • the data entry is not complete;
  • you experience malfunctions of equipment that cause the loss of files.

Even if your research process is carefully designed, some data will almost always be missing. In quantitative research, missing data shows up as blank cells in the spreadsheet. So, let’s look at some facts about missing data and how to deal with this problem.

Missing Data Types

Missing data is a problem because you cannot see the true values you expected to obtain from your research. It is crucial to understand why the data is missing, because the reason determines the type of missing data you are dealing with.

There are three types of missing data:

  • MCAR (missing completely at random): the missing values are randomly spread across the variable, and the missingness is not related to any other variable;
  • MAR (missing at random): the missing values are not randomly spread, but the missingness can be accounted for by other observed variables;
  • MNAR (missing not at random): the missing values differ systematically from the observed ones, and the missingness is related to the missing values themselves.

Let’s consider the following example.

You need to gather data about holiday spending habits at the end of the year. You want to find out how much adults spend on gifts for their friends and relatives. The amounts should be in dollars.

MCAR (Missing Completely at Random)

With MCAR, the fact that a value is missing is not related to anything else in the dataset. Missing values can occur anywhere, so they are randomly distributed. MCAR data is also unrelated to any unmeasured variables.

In our example, you may notice that some people started the survey about their holiday habits but left the question about gift spending unanswered. If the answers you did collect still cover the full range from low to high values, and the respondents who skipped the question do not stand out on any other variable, the missing data is not tied to any specific holiday spending habits.

If the missing data does not relate to any specific values or unmeasured variables, it is treated as MCAR. In practice, however, the assumption of ‘true randomness’ is difficult to verify. Typical causes of MCAR data include lost samples and malfunctioning equipment.

MAR (Missing at Random)

The term is a bit confusing because data that is MAR is not actually missing at random. The missing values may differ systematically from the values you have gathered. However, the likelihood that a value is missing depends only on other variables you have observed, not on the missing value itself. The observed variables can therefore account for the missingness, although any value estimated from them is only an approximation of the true answer.

Continuing our example, suppose you collect the same data from a different group and notice that people aged 18-25 skip this question more often than adults in other age groups.

Looking closer, you also see that the values reported in the 18-25 group are widely spread, with no particular range missing. This suggests that the missingness relates to another observed variable (age) rather than to the spending amounts themselves. A possible explanation is that younger people are less willing to reveal how much they spend because they care about their privacy.
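One way to check whether skipping a question is related to an observed variable such as age is to compare the share of missing answers across groups. The snippet below is a minimal Python sketch of that check. The records and the field names `age_group` and `spending` are hypothetical, and `None` is assumed to mark a skipped answer.

```python
from collections import defaultdict

# Hypothetical survey records; None marks a skipped spending question.
responses = [
    {"age_group": "18-25", "spending": None},
    {"age_group": "18-25", "spending": 120.0},
    {"age_group": "18-25", "spending": None},
    {"age_group": "18-25", "spending": 450.0},
    {"age_group": "26-40", "spending": 300.0},
    {"age_group": "26-40", "spending": 80.0},
    {"age_group": "26-40", "spending": None},
    {"age_group": "26-40", "spending": 510.0},
    {"age_group": "41+", "spending": 200.0},
    {"age_group": "41+", "spending": 640.0},
    {"age_group": "41+", "spending": 95.0},
    {"age_group": "41+", "spending": 330.0},
]

def missing_rate_by_group(records, group_key, value_key):
    """Return the share of missing values of value_key within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by_group(responses, "age_group", "spending")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} missing")
```

A clearly higher rate in one group is consistent with MAR, but it does not rule out MNAR, because the skipped amounts themselves remain unknown.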

MNAR (Missing Not at Random)

Here, the reason a value is missing is related to the value itself. Continuing the example, suppose that low values are almost absent in a new dataset. This may mean that respondents with low incomes omit this question because they do not want to admit low holiday spending.

It is important to take this type of missing data into account because respondents from a particular subgroup of the population may skip the question for a specific reason. If many participants share this reason, the remaining sample no longer represents the population, and the analysis can lead to a completely wrong interpretation.
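To see how the three mechanisms differ in practice, you can simulate them. The following Python sketch is only an illustration under stated assumptions: the respondents, the link between age and spending, the probabilities, and the $300 threshold are all invented for the example.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical respondents. Spending is made to rise with age purely for illustration.
people = []
for _ in range(1000):
    age = rng.randint(18, 70)
    spending = round(rng.uniform(50, 600) + 5 * age, 2)
    people.append({"age": age, "spending": spending})

def observe(people, prob_missing):
    """Return the spending values left after each answer is dropped
    with probability prob_missing(person)."""
    return [p["spending"] for p in people if rng.random() >= prob_missing(p)]

# MCAR: every answer has the same 30% chance of being missing.
mcar = observe(people, lambda p: 0.3)
# MAR: missingness depends only on an observed variable (age).
mar = observe(people, lambda p: 0.6 if p["age"] <= 25 else 0.1)
# MNAR: missingness depends on the unobserved value itself (low spenders skip).
mnar = observe(people, lambda p: 0.8 if p["spending"] < 300 else 0.0)

true_mean = mean(p["spending"] for p in people)
print(f"true mean: {true_mean:.2f}")
for name, values in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: n={len(values)}, observed mean={mean(values):.2f}")
```

In this setup the MNAR mean ends up above the true mean because only low values are removed, while the MCAR mean should stay close to it. Under MAR the overall mean can also shift, but the shift can be accounted for by comparing within age groups.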

Attrition Bias

This phenomenon is typical for longitudinal studies. It occurs when the participants who drop out of a study, or who skip certain questions, differ systematically from those who remain. For example, some respondents may quit a lengthy medical study because they feel worse. Their data is MNAR because the final dataset will mostly include data from the patients who got better.

Why Is Missing Data a Problem?

When some data is missing, you may get biased results. The results also may not generalize beyond the research project because the remaining sample is unrepresentative.

Missing data that is MCAR or MAR is often treated as ignorable. With MCAR, the missing values do not differ systematically from the observed ones, so the main cost is a smaller sample. With MAR, the differences can be accounted for by the observed variables, provided that your analysis takes those variables into account.

However, missing data cannot be ignored if it differs systematically from the observed data in ways the observed variables cannot explain, as with MNAR. Such data is called non-ignorable.

Prevention of Missing Data

The most common reasons for missing data are non-response, poorly designed research protocols, and attrition. Design your study so that it is easy for respondents to answer the questions.

To minimize missing data, you can follow these tips:

  • ✔️ Reduce the amount of collected data.
  • ✔️ Use the techniques for data validation.
  • ✔️ Restrict the number of follow-ups.
  • ✔️ Develop user-friendly forms for collecting data.
  • ✔️ Provide perks and incentives.
  • ✔️ Create several backups to store the collected data safely.

Ways to Cope with Missing Data

When you cleanse the dataset after the data collection is completed, you have several ways to deal with missing data. You can accept the incomplete data as it is, remove it, or try to restore the missing parts.

It is up to you to decide which way to follow, but the decision should be based on a careful assessment of why the data is missing. It will be easier to choose the right way if you answer the following questions:

  • Is the data missing for random or non-random reasons?
  • Is the data missing because it stands for zero or null values?
  • Was the measure or question unclear or poorly formulated?

As a rule, you can accept the incomplete data as it is if you deal with MAR or MCAR. In the case of MNAR, you need more complex approaches to make the right decision.

Acceptance

If some cells are blank, you can leave them as they are. This is the most conservative approach, and you can opt for it in the case of MCAR or MAR. It is especially useful when you deal with a small sample, where deleting even a few incomplete cases can noticeably reduce statistical power.

You can also re-code the incomplete values with a single marker such as N/A, which stands for ‘not applicable’ (or ‘not available’). That will make missing data consistent within your dataset.
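As a rough illustration of re-coding, the Python sketch below maps blank cells and ad-hoc markers to one missing-value code and then analyzes only the observed values. The raw column and the set of markers are assumptions made for the example, so adjust them to your own data.

```python
from statistics import mean

# Hypothetical raw spreadsheet column: blanks and ad-hoc markers mixed with numbers.
raw_spending = ["120", "", "450.5", "n/a", "80", " ", "N/A", "300"]

MISSING_MARKERS = {"", "n/a", "na"}  # assumed markers; extend as needed

def recode(cell):
    """Return a float for a number, or None as the single missing-value code."""
    text = cell.strip().lower()
    if text in MISSING_MARKERS:
        return None
    return float(text)

spending = [recode(cell) for cell in raw_spending]
observed = [value for value in spending if value is not None]

print(spending)
print(f"{len(spending) - len(observed)} missing of {len(spending)}")
print(f"mean of observed values: {mean(observed):.2f}")
```

Keeping one explicit code for missing values, rather than zeros or empty strings, prevents them from being silently treated as real answers.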

Removal

You can remove cases with missing data using listwise or pairwise deletion.

Using Listwise Deletion

This technique removes every participant or case that has a missing value for any variable. As a result, you obtain a dataset in which every remaining participant has complete data.

If a lot of data is missing for a certain variable or measurement, the respondents who answered every question may differ considerably from those who did not provide all the information. In that case, your reduced sample cannot represent the population correctly.

Example

If you initially have 121 participants in your sample, deleting every case with incomplete or missing data can lead to a sample reduction of 84 respondents. Suppose that a specific question was left unanswered mostly by women. Then mostly men remain among those who completed the entire survey, and the result will be biased.
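Listwise deletion can be sketched in a few lines of Python. The rows below are hypothetical, and `None` is assumed to mark a missing answer.

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"id": 1, "gender": "F", "age": 34, "spending": 250.0},
    {"id": 2, "gender": "M", "age": None, "spending": 400.0},
    {"id": 3, "gender": None, "age": 22, "spending": 120.0},
    {"id": 4, "gender": "M", "age": 45, "spending": 610.0},
    {"id": 5, "gender": "F", "age": 29, "spending": None},
]

def listwise_delete(rows):
    """Keep only the rows that have a value for every variable."""
    return [row for row in rows if all(value is not None for value in row.values())]

complete = listwise_delete(rows)
print([row["id"] for row in complete])
print(f"kept {len(complete)} of {len(rows)} rows")
```

Before relying on the reduced dataset, it is worth comparing the dropped and retained rows on variables such as gender to see whether a particular subgroup was removed disproportionately.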

Using Pairwise Deletion

Here, each analysis uses all cases that have data for the variables it involves, and a case is left out only of the analyses that need a value it is missing. This preserves more data, but the sample size will differ from one variable to another. It is especially helpful for smaller samples, whereas for larger ones the uneven sample sizes may be more of a distraction than a benefit. You can also use this technique effectively when many values are missing for only certain variables.

Remember that analyses involving several variables, such as correlations, can include only the participants or cases with complete data for every variable involved.

Example

Suppose you have received the following answers to your survey:

  • 9 people have not responded to the question about their gender, so the variable ‘gender’ has been reduced from 121 to 112 participants.
  • 6 respondents have not answered the question about their age, so the variable ‘age’ has been reduced to 115 participants. An analysis that needs both variables has 106 participants if nobody skipped both questions.

With pairwise deletion, you keep all the other data points from the participants who have not answered a particular question. However, if you look across the variables, the sample size will be smaller for some of them.
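The following Python sketch illustrates the idea of pairwise deletion: each analysis selects the rows that have the variables it needs. The rows and the helper name `available` are hypothetical.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"gender": "F", "age": 34, "spending": 250.0},
    {"gender": "M", "age": None, "spending": 400.0},
    {"gender": None, "age": 22, "spending": 120.0},
    {"gender": "M", "age": 45, "spending": 610.0},
    {"gender": "F", "age": 29, "spending": None},
]

def available(rows, *variables):
    """Return the rows that have values for all variables used in one analysis."""
    return [row for row in rows if all(row[v] is not None for v in variables)]

# Single-variable analyses use every row that has that variable.
n_gender = len(available(rows, "gender"))
n_age = len(available(rows, "age"))
mean_spending = mean(row["spending"] for row in available(rows, "spending"))

# A two-variable analysis (e.g. a correlation) needs both values in a row.
age_spending = available(rows, "age", "spending")

print(f"gender: n={n_gender}, age: n={n_age}")
print(f"mean spending: {mean_spending:.2f}")
print(f"age and spending together: n={len(age_spending)}")
```

Because the sample size changes from one analysis to the next, it helps to report n alongside every result.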

Imputation

This technique replaces a missing value with another value based on a reasonable estimate. Use imputation if you want to obtain a complete dataset; there are several methods to choose from. The easiest one is to replace the missing value with the mean or median of the variable under consideration.
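Mean and median imputation can be sketched in Python as follows. The spending values are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median

# Hypothetical spending column in dollars; None marks a missing value.
spending = [120.0, None, 450.0, 80.0, None, 300.0, 1500.0]

def impute(values, estimator):
    """Replace each None with estimator(observed values); observed values stay unchanged."""
    observed = [v for v in values if v is not None]
    fill = estimator(observed)
    return [fill if v is None else v for v in values]

mean_imputed = impute(spending, mean)
median_imputed = impute(spending, median)

print(mean_imputed)
print(median_imputed)
```

The median is less affected by extreme values such as the 1500 above. Both variants fill every gap with the same number, which makes the variable look less spread out than it really is.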

Hot-Deck Imputation

Here, you replace a missing value with a value taken from a similar case or respondent within your own dataset. Such cases are called ‘donors,’ and the similarity is judged by the information from other variables.

Example

You conduct a survey about the effectiveness of a shopping app. Respondents have to rate it from 1 to 5 on several questions. In the end, you notice that a few participants did not answer Question 3, leaving the cells empty. You look through the responses to the other questions and find respondents who answered in a similar way to those who skipped Question 3. They become the donors for this question, so you take their answers to fill in the missing values.
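The example above can be sketched in Python as a nearest-neighbor variant of hot-deck imputation, which is only one of several ways to pick a donor. The ratings, the respondent ids, and the distance measure (sum of absolute rating differences) are assumptions made for illustration.

```python
# Hypothetical ratings (1 to 5) for four questions; None marks a skipped Question 3.
ratings = [
    {"id": "A", "q1": 5, "q2": 4, "q3": 5, "q4": 4},
    {"id": "B", "q1": 2, "q2": 1, "q3": 2, "q4": 2},
    {"id": "C", "q1": 5, "q2": 5, "q3": None, "q4": 4},
    {"id": "D", "q1": 1, "q2": 2, "q3": None, "q4": 2},
    {"id": "E", "q1": 3, "q2": 3, "q3": 3, "q4": 3},
]

def distance(a, b, keys):
    """Sum of absolute rating differences on the matching questions (assumed complete)."""
    return sum(abs(a[k] - b[k]) for k in keys)

def hot_deck(rows, target, match_on):
    """Fill missing target values with the value of the most similar donor
    from the same dataset."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is None:
            donor = min(donors, key=lambda d: distance(row, d, match_on))
            row = {**row, target: donor[target]}
        filled.append(row)
    return filled

completed = hot_deck(ratings, "q3", ["q1", "q2", "q4"])
for row in completed:
    print(row["id"], row["q3"])
```

Here respondent C receives the answer of A, and D receives the answer of B, because those donors rated the other questions most similarly.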

Cold-Deck Imputation

As an alternative, you can take values from a different dataset. The cases should be similar, but the donor sample is separate from the one whose data is missing.

Example

You can fill in the missing values using a sample collected by your colleagues. Suppose they ran a similar survey with a different sample. You can look for respondents in their data whose answers are similar to those of your participants and use their values.
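A cold-deck variant differs from hot-deck imputation mainly in where the donors come from. In the Python sketch below, both datasets are hypothetical, and matching on the closest age is an assumption chosen for simplicity.

```python
# Hypothetical current survey; None marks missing spending.
survey = [
    {"id": 1, "age": 23, "spending": None},
    {"id": 2, "age": 41, "spending": 380.0},
    {"id": 3, "age": 58, "spending": None},
]

# Hypothetical external dataset from a similar, earlier survey (the "cold deck").
external = [
    {"age": 21, "spending": 150.0},
    {"age": 39, "spending": 420.0},
    {"age": 60, "spending": 510.0},
]

def cold_deck(rows, donors, target, match_key):
    """Fill missing target values from the closest-matching record
    in an external dataset, and flag which values were imputed."""
    filled = []
    for row in rows:
        if row[target] is None:
            donor = min(donors, key=lambda d: abs(d[match_key] - row[match_key]))
            row = {**row, target: donor[target], "imputed": True}
        else:
            row = {**row, "imputed": False}
        filled.append(row)
    return filled

completed = cold_deck(survey, external, "spending", "age")
for row in completed:
    print(row)
```

The `imputed` flag is an optional addition that keeps borrowed values distinguishable from real answers in later analysis.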

Be Careful While Using Imputation

While using imputation, you need to consider its benefits and possible risks. Even if you manage to retain all other data, this technique can produce some bias or result in inaccurate information. You cannot be sure that the value you have used for replacement reflects what the answer was supposed to be. Therefore, use imputation cautiously.

Final Thoughts

Now you know about the types of missing data, how to prevent it, and how to deal with it. The techniques presented in this article will help you minimize the negative impact of missing data on the results of your research. Use them carefully to ensure that the analysis reflects the actual situation.

The methods above will be helpful if you want to keep as much data in your dataset as possible. This is especially important for small samples, where every data point contributes to an accurate picture and helps avoid bias.
