Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

Π’ part one it was described that this publication was made on the basis of a dataset of the results of a cadastral valuation of real estate in the Khanty-Mansi Autonomous Okrug.

The practical part is presented in the form of steps. All cleaning was carried out in Excel, since the most common tool and the described operations can be repeated by most specialists who know Excel. And quite well suited for hand-to-hand work.

The zero stage will be the work on launching, saving the file, since it is 100 mb in size, then with the number of these operations, tens and hundreds, they take a significant amount of time.
Opening, on average, - 30 seconds.
Saving - 22 sec.

The first stage begins with the determination of the statistical indicators of the dataset.

Table 1. Dataset stats
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

Technology 2.1.

We create an auxiliary field, I have it under the number - AY. For each entry, we form the formula "= DLSTR (F365502) + DLSTR (G365502) + ... + DLSTR (AW365502)"

The total time spent on step 2.1 (for the Schumann formula) t21 = 1 hour.
The number of errors found at stage 2.1 (for the Schumann formula) n21 = 0 pcs.

The second stage.
Checking dataset components.
2.2. All values ​​in records are formed by standard characters. Therefore, we will track the statistics by symbols.

Table 2. Character stats in the dataset with a preliminary analysis of the results.Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

Technology 2.2.1.

Create an auxiliary field - "alpha1". For each record, we form the formula "=CONCATENATE(Sheet1!B9;...Sheet1!AQ9)"
We create a fixed cell "Omega-1". In this cell, we will alternately enter character codes for Windows-1251 from 32 to 255.
Create an auxiliary field - "alpha2". With the formula "= FIND (CHAR (Omega, 1), "alpha1", N)".
Create an auxiliary field - "alpha3". With the formula "=IF(ISNUMBER("alpha2";N);1;0)"
Create a fixed cell "Omega-2", with the formula "=SUM("alpha3"N1:"alpha3"N365498)"

Table 3. Results of the preliminary analysis of the resultsCleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

Table 4. Fixed errors at this stageCleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

The total time spent on step 2.2.1 (for the Schumann formula) t221 = 8 hour.
The number of corrected errors at stage 2.2.1 (for the Schumann formula) n221 = 0 pcs.

Step 3.
The third step is to fix the state of the dataset. By assigning each record a unique number (ID) and each field. This is necessary to compare the converted dataset with the original one. It is also necessary to fully use the possibilities of grouping and filtering. Here we again turn to table 2.2.2 and select a symbol that is not used in the dataset. We get what is shown in Figure 10.

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Fig.10. Assignment of identifiers.

The total time spent on step 3 (for the Schumann formula) t3 = 0,75 hour.
The number of errors found at stage 3 (for the Schumann formula) n3 = 0 pcs.

Since the Schumann formula requires that the stage be completed by correcting errors. We return to stage 2.

Step 2.2.2.
In this step, we will also fix double and triple spaces.
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Fig.11. The number of double spaces.

Correction of errors identified in Table 2.2.4.

Table 5. Error Correction PhaseCleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

An example of why such an aspect as the use of the letters β€œe” or β€œΡ‘β€ is significant is shown in Figure 12.

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Fig.12. Mismatch on the letter "e".

Total time spent in step 2.2.2 t222 = 4 hours.
The number of errors found at stage 2.2.2 (for the Schumann formula) n222 = 583 pcs.

The fourth stage.
Checking for redundancy of fields fits well into this stage. Out of 44 fields 6 fields:
7 - Purpose of the structure
16 - Number of underground floors
17 - Parent object
21 - Village Council
38 - Structure parameters (description)
40 - Cultural heritage

They don't have any records. That is, redundant.
The field "22 - City" has one single entry, Figure 13.

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Fig.13. The only entry is Z_348653 in the "City" field.

The field "34 - Name of the building" contains records that clearly do not correspond to the purpose of the field, Figure 14.

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Fig.14. An example of a non-matching entry.

We exclude these fields from the dataset. Also we fix change of 214 records.

The total time spent on step 4 (for the Schumann formula) t4 = 2,5 hour.
The number of errors found at stage 4 (for the Schumann formula) n4 = 222 pcs.

Table 6. Analysis of dataset indicators after the 4th stage

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical

In general, analyzing the changes in indicators (Table 6), we can say that:
1) The leverage ratio of the average number of symbols to the standard deviation leverage is close to 3, that is, there are signs of a normal distribution (six sigma rule).
2) A significant deviation of the minimum and maximum levers from the average lever suggests that the study of tails is a promising direction in the search for errors.

We investigate the results of finding errors using the Schumann methodology.

idle stages

2.1. The total time spent on step 2.1 (for the Schumann formula) t21 = 1 hour.
The number of errors found at stage 2.1 (for the Schumann formula) n21 = 0 pcs.

3. The total time spent on step 3 (for the Schumann formula) t3 = 0,75 hour.
The number of errors found at stage 3 (for the Schumann formula) n3 = 0 pcs.

Effective stages
2.2. The total time spent on step 2.2.1 (for the Schumann formula) t221 = 8 hour.
The number of corrected errors at stage 2.2.1 (for the Schumann formula) n221 = 0 pcs.
Total time spent in step 2.2.2 t222 = 4 hours.
The number of errors found at stage 2.2.2 (for the Schumann formula) n222 = 583 pcs.

Total time spent in step 2.2 t22 = 8 + 4 = 12 hours.
The number of errors found at stage 2.2.2 (for the Schumann formula) n222 = 583 pcs.

4. The total time spent on step 4 (for the Schumann formula) t4 = 2,5 hour.
The number of errors found at stage 4 (for the Schumann formula) n4 = 222 pcs.

Since there are zero stages that should be included in the first stage of the Schumann model, and on the other hand, stages 2.2 and 4 are inherently independent, given that the Schuman model assumes that an increase in the duration of the check, the probability of detecting an error decreases, that is, the flow failures, then by investigating this flow, we will determine which of the stages to put first, according to the rule, where the failure density is more frequent, we put one of the stages first.

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Ris.15.

It follows from the formula in Figure 15 that it is preferable to put the fourth stage before stage 2.2 in the calculations.

Using the Schumann formula, we determine the estimated initial number of errors:

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Ris.16.

From the results in Figure 16, it can be seen that the predicted number of errors N2 = 3167, which is more than the minimum criterion of 1459.

As a result of the correction, we corrected 805 errors, and the predicted number is 3167 - 805 = 2362, which is still more than the minimum threshold accepted by us.

We define parameter C, lambda and reliability function:

Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 2. Practical
Ris.17.

In essence, lambda is the actual rate at which errors are detected at each stage. If you look above, the estimate of this indicator earlier was 42,4 errors per hour, which is quite comparable to the Schuman indicator. Referring to the first part of this material, it was determined that the rate of finding errors by the developer should not be lower than 1 error per 250,4 records, when checking 1 record per minute. Hence the critical lambda value for the Schumann model:
60 / 250,4 = 0,239617.

That is, the need to carry out the procedures for finding errors must be carried out until the lambda, from the available 38,964, drops to 0,239617.

Or until the indicator N (potential number of errors) minus n (corrected number of errors) decreases below the threshold we adopted (in the first part) - 1459 pcs.

Part 1. Theoretical.

Source: habr.com

Add a comment