Cleaning up data like Rock, Paper, Scissors. Is it a game with or without a finish? Part 1. Theoretical

1. Initial data

Data cleaning is one of the challenges of data analysis. This material reflects the approaches and solutions that arose while solving a practical problem: analyzing the database used in forming cadastral value. The source is "REPORT No. 01/OKS-2019 on the results of the state cadastral valuation of all types of real estate (except for land plots) in the territory of the Khanty-Mansiysk Autonomous Okrug - Ugra" (available here).

The file examined was "Comparative model total.ods" from "Appendix B. Results of determining KS 5. Information on the method of determining cadastral value. 5.1 Comparative approach".

Table 1. Statistical indicators of the dataset in the file "Comparative model total.ods"
Total number of fields — 44
Total number of records — 365,490
Total number of characters — 101,714,693
Average number of characters per record — 278.297
Standard deviation of characters per record — 15.510
Minimum number of characters per record — 198
Maximum number of characters per record — 363
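For reproducibility, here is a minimal sketch of how the indicators in Table 1 could be recomputed with pandas. It assumes the sheet from "Comparative model total.ods" has been exported to a CSV file named comparative_model_total.csv; the file name and the export step are assumptions, not part of the original workflow.

import pandas as pd

# Assumption: the ODS sheet was exported to CSV; alternatively,
# pd.read_excel("Comparative model total.ods", engine="odf") can read it directly.
df = pd.read_csv("comparative_model_total.csv", dtype=str).fillna("")

# Character length of every record (all fields concatenated).
record_lengths = df.apply(lambda row: sum(len(v) for v in row), axis=1)

print("Fields:", df.shape[1])
print("Records:", df.shape[0])
print("Total characters:", record_lengths.sum())
print("Mean characters per record:", round(record_lengths.mean(), 3))
print("Std of characters per record:", round(record_lengths.std(), 3))
print("Min characters per record:", record_lengths.min())
print("Max characters per record:", record_lengths.max())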

2. Introductory part. Basic standards

While analyzing this database, the task arose of specifying the requirements for the degree of cleaning, since the database clearly has legal and economic consequences for its users. In the course of the work it turned out that no specific requirements exist for the degree of cleaning of big data. Analyzing the legal norms on this question, I came to the conclusion that they are all shaped by capabilities: a task appears, information sources are assembled for it, a dataset is formed from those sources, and tools for solving the task are then built on top of the dataset. The resulting solutions become reference points for choosing among alternatives. I presented this in Figure 1.

Fig. 1.

Since in setting any standards it is preferable to rely on proven technologies, I chose the requirements set out in the "MHRA GxP Data Integrity Definitions and Guidance for Industry", which I considered the most comprehensive document on this issue. In particular, the document states: "It should be noted that data integrity requirements apply equally to manual (paper) and electronic data." This wording is quite specifically associated with the concept of "written evidence" in the provisions of Article 71 of the Code of Civil Procedure, Article 70 of the CAS, Article 75 of the APC, and with "in writing" in Article 84 of the Code of Civil Procedure.

Figure 2 presents a diagram of the formation of approaches to types of information in jurisprudence.

Fig. 2. Source here.

Figure 3 shows the mechanism of Figure 1 applied to the tasks of the above Guidance. A simple comparison shows that the approaches used to meet information-integrity requirements in modern information-system standards are significantly narrower than the legal concept of information.

Fig. 3.

In the Guidance itself, the tie to the technical side, that is, to the capabilities for processing and storing data, is well illustrated by a quote from Chapter 18.2, Relational database: "This file structure is inherently more secure, as the data is held in a large file format which preserves the relationship between data and metadata."

In fact, there is nothing abnormal in this approach of starting from existing technical capabilities; it is a natural process, since concepts expand from the most studied activity, database design. On the other hand, legal norms are appearing that make no allowance for the technical capabilities of existing systems, for example the GDPR (General Data Protection Regulation).

Fig. 4. Funnel of technical capabilities (source).

In light of this, it becomes clear that the original dataset (Fig. 1) must, first of all, be preserved and, secondly, serve as a basis for extracting additional information. A simple example: traffic-enforcement cameras are ubiquitous, and the processing systems filter out violators, but the same information can also be offered to other consumers, for example as marketing monitoring of the structure of customer flow to a shopping center. This is a source of additional added value when using big data. It is quite possible that the datasets being collected now will at some point in the future acquire value by a mechanism similar to the present-day value of rare editions from the 1700s. After all, the datasets of a given moment are unique and are unlikely to be repeated.

3. Introductory part. Evaluation criteria

During processing, the following classification of errors was developed.

1. Error class (based on GOST R 8.736-2011): a) systematic errors; b) random errors; c) blunders (gross errors).

2. By multiplicity: a) single distortions; b) multiple distortions.

3. By criticality of consequences: a) critical; b) non-critical.

4. By source of occurrence:

A) Technical – errors that occur during the operation of equipment. These are especially relevant for IoT systems and for systems whose operation depends significantly on communication quality and hardware.

B) Operator errors - errors in a wide range from operator typos during input to errors in the technical specifications for database design.

C) User errors - user errors across the entire range, from "forgot to switch the keyboard layout" to confusing meters with feet.

5. Separated into a separate class (an illustrative cleanup sketch follows this list):

a) the "separator problem", that is, duplicated separators (in our case, the space and ":");
b) words written together;
c) a missing space after service characters;
d) symmetric paired symbols: (), "", "…".
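As an illustration only, the sketch below shows how this separator class of errors might be detected and normalized with regular expressions; the specific patterns and the sample strings are assumptions, not the exact rules applied to the cadastral dataset.

import re

def normalize_separators(text: str) -> str:
    """Fix the 'separator' error class: duplicated separators and missing spaces."""
    text = re.sub(r"[ \t]{2,}", " ", text)      # collapse duplicated spaces
    text = re.sub(r": ?: ?", ": ", text)        # collapse a duplicated ':' separator
    text = re.sub(r";(?=\S)", "; ", text)       # restore a missing space after ';'
    return text.strip()

def has_unbalanced_pairs(text: str) -> bool:
    """Flag records where paired symbols such as () or quotes are not balanced."""
    return text.count("(") != text.count(")") or text.count('"') % 2 != 0

print(normalize_separators("Wall material:: brick;floor:  2"))   # -> "Wall material: brick; floor: 2"
print(has_unbalanced_pairs('Series "A (block 1'))                # -> True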

Taken together with the systematization of database errors presented in Figure 5, this forms a fairly effective coordinate system for finding errors and for developing a data-cleaning algorithm for this example.

Fig. 5. Typical errors corresponding to the structural units of the database (source: Oreshkov V.I., Paklin N.B., "Key Concepts of Data Consolidation").

Accuracy, Domain Integrity, Data Type, Consistency, Redundancy, Completeness, Duplication, Conformance to Business Rules, Structural Definiteness, Data Anomaly, Clarity, Timely, Usefulness, Adherence to Data Integrity Rules. (Page 334, Data Warehousing Fundamentals for IT Professionals / Paulraj Ponniah, 2nd ed. [3])

The book's English wording for each characteristic is quoted below.

Accuracy. The value stored in the system for a data element is the right value for that occurrence of the data element. If you have a customer name and an address stored in a record, then the address is the correct address for the customer with that name. If you find the quantity ordered as 1000 units in the record for order number 12345678, then that quantity is the accurate quantity for that order.

Domain Integrity. The data value of an attribute falls in the range of allowable, defined values. The common example is the allowable values being "male" and "female" for the gender data element.

Data Type. Value for a data attribute is actually stored as the data type defined for that attribute. When the data type of the store name field is defined as “text,” all instances of that field contain the store name shown in textual format and not numeric codes.

Consistency. The form and content of a data field is the same across multiple source systems. If the product code for product ABC in one system is 1234, then the code for this product is 1234 in every source system.

Redundancy. The same data must not be stored in more than one place in a system. If, for reasons of efficiency, a data element is intentionally stored in more than one place in a system, then the redundancy must be clearly identified and verified.

Completeness. There are no missing values for a given attribute in the system. For example, in a customer file, there must be a valid value for the "state" field for every customer. In the file for order details, every detail record for an order must be completely filled.

Duplication. Duplication of records in a system is completely resolved. If the product file is known to have duplicate records, then all the duplicate records for each product are identified and a cross-reference created.

Conformance to Business Rules. The values of each data item adhere to prescribed business rules. In an auction system, the hammer or sale price cannot be less than the reserve price. In a bank loan system, the loan balance must always be positive or zero.

Structural Definiteness. Wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure. For example, an individual's name naturally divides into first name, middle initial, and last name. Values for names of individuals must be stored as first name, middle initial, and last name. This characteristic of data quality simplifies enforcement of standards and reduces missing values.

Data Anomaly. A field must be used only for the purpose for which it is defined. If the field Address-3 is defined for any possible third line of address for long addresses, then this field must be used only for recording the third line of address. It must not be used for entering a phone or fax number for the customer.

Clarity. A data element may possess all the other characteristics of quality data but if the users do not understand its meaning clearly, then the data element is of no value to the users. Proper naming conventions help to make the data elements well understood by the users.

Timely. The users determine the timeliness of the data. If the users expect customer dimension data not to be older than one day, the changes to customer data in the source systems must be applied to the data warehouse daily.

Usefulness. Every data element in the data warehouse must satisfy some requirements of the collection of users. A data element may be accurate and of high quality, but if it is of no value to the users, then it is completely unnecessary for that data element to be in the data warehouse.

Adherence to Data Integrity Rules. The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules. Any table that permits null as the primary key does not have entity integrity. Referential integrity forces the establishment of the parent–child relationships correctly. In a customer-to-order relationship, referential integrity ensures the existence of a customer for every order in the database.
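To make the criteria above concrete, here is a small sketch of how a few of them (Completeness, Domain Integrity, Duplication) can be turned into automated checks with pandas; the column names and sample values are hypothetical and serve only to illustrate the idea.

import pandas as pd

# Hypothetical fragment of a real-estate table used only for illustration.
df = pd.DataFrame({
    "cadastral_number": ["86:01:000101:1", "86:01:000101:2", "86:01:000101:2", None],
    "object_type":      ["building", "room", "room", "garage"],
    "area_sq_m":        [120.5, 34.0, 34.0, -5.0],
})

# Completeness: a mandatory attribute must have no missing values.
missing_keys = df["cadastral_number"].isna().sum()

# Domain Integrity: values must fall within the allowed set / range.
allowed_types = {"building", "room", "garage", "parking"}
bad_types = (~df["object_type"].isin(allowed_types)).sum()
bad_areas = (df["area_sq_m"] <= 0).sum()

# Duplication: duplicate keys are identified for later cross-referencing.
dup_mask = df["cadastral_number"].notna() & df.duplicated("cadastral_number", keep=False)

print(f"missing keys: {missing_keys}, bad types: {bad_types}, bad areas: {bad_areas}")
print("duplicated records:")
print(df[dup_mask])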

4. Quality of data cleaning

The quality of data cleaning is a rather problematic issue for big data. Answering the question of what degree of cleaning is needed to complete the task is fundamental for every data analyst. In most current problems each analyst decides this personally, and it is unlikely that anyone from the outside can evaluate this aspect of the solution. But for the task at hand this question was extremely important, since the reliability of legal data should tend to one.

I looked at software testing technologies used to determine operational reliability. Today there are more than 200 such models, many of which are based on a request-servicing model:

Fig. 6.

My reasoning was as follows: "If a found error is an event analogous to the failure event in this model, then how do I find an analogue of the parameter t?" I built the following model. Suppose the time it takes a tester to check one record is 1 minute (for the database in question); then to find all the errors he would need 365,490 minutes, which is roughly three years of working time at an 8-hour working day. As we understand, this is a very large amount of work, and the cost of checking the database would be prohibitive for its compiler. In this reflection the economic concept of costs appears, and after some analysis I came to the conclusion that it is a fairly effective tool. It rests on the law of economics: "The volume of production (in units) at which a firm's maximum profit is achieved is located at the point where the marginal cost of producing a new unit of output equals the price that the firm can receive for that new unit." Based on the postulate that finding each subsequent error requires checking more and more records, the number of records checked is the cost factor. That is, the postulate adopted in testing models takes on a physical meaning in the following pattern: if finding the i-th error required checking n records, then finding the next error will require checking m records, where n < m. From there, two scenarios are possible:

  1. The number of records that must be checked before a new error is found stabilizes;
  2. The number of records that must be checked before the next error is found keeps increasing.

To determine the critical value, I turned to the concept of economic feasibility, which in this case, using the concept of social costs, can be formulated as follows: "The costs of correcting an error should be borne by the economic agent that can do it at the lowest cost." We have one agent, a tester, who spends 1 minute checking one record. In monetary terms, at earnings of 6,000 rubles/day, this is about 12.2 rubles per record (approximately, at today's rates). It remains to determine the other side of the economic equilibrium. I reasoned as follows: an existing error will require the interested party, the property owner, to expend effort to correct it. Suppose this takes one day of action (submitting an application, receiving a corrected document). Then, from a social point of view, the owner's costs equal the average daily salary. The average accrued salary in the Khanty-Mansiysk Autonomous Okrug - Ugra ("Results of the socio-economic development of the Khanty-Mansiysk Autonomous Okrug - Ugra for January-September 2019") is 73,285 rubles per month, or 3,053.542 rubles/day. Accordingly, we obtain a critical value equal to:
3,053.542 / 12.2 ≈ 250.4 records.

This means, from a social point of view, that if a tester checks 251 records and finds one error, it is equivalent to the user correcting this error himself. Accordingly, if the tester spends time equal to checking 252 records to find the next error, then it is better to shift the cost of correction to the user.

A simplified approach is presented here, since from a social point of view one should account for all the added value generated by each specialist, that is, costs including taxes and social payments, but the model is clear. One consequence of this relationship is the following requirement for specialists: an IT specialist should earn more than the national average salary. If his salary is less than the average salary of the database's potential users, then he should check the entire database record by record himself.

When the described criterion is used, the first requirement for the quality of the database is formed:
I(tr). The share of critical errors should not exceed 1/250.4 = 0.39938%, a purity slightly below that of industrial gold refining. In physical terms, that is no more than 1,459 records with errors.
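The arithmetic behind this requirement can be reproduced in a few lines; the figures are those given in the text, and the rounding of the result differs slightly from the quoted 250.4 and 1,459.

# Reproducing the I(tr) calculation from the figures in the text.
check_cost_per_record = 12.2       # rubles: tester's cost to check one record (one minute of work)
user_fix_cost_per_day = 3053.542   # rubles: average daily salary in the region
total_records = 365_490

critical_records = user_fix_cost_per_day / check_cost_per_record  # records checked per error
max_error_share = 1 / critical_records                            # admissible share of critical errors
max_error_records = total_records * max_error_share               # admissible number of erroneous records

print(f"critical value: {critical_records:.1f} records per error")
print(f"max share of critical errors: {max_error_share:.5%}")
print(f"max records with errors: {max_error_records:.0f}")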

An economic digression.

In fact, by allowing this number of erroneous records, society agrees to economic losses in the amount of:

1459 * 3,053.542 ≈ 4,455,118 rubles.

This amount is determined by the fact that society does not have the tools to reduce these costs. It follows that if someone has a technology that makes it possible to reduce the number of records with errors to, for example, 259, then this will allow society to save:
1200 * 3,053.542 ≈ 3,664,250 rubles.

But at the same time, he can ask, say, 1 million rubles for his talent and work.
That is, social costs are reduced by:

3,664,250 - 1,000,000 = 2,664,250 rubles.

In essence, this effect is the added value from the use of big data technologies.

But it should be kept in mind that this is a social effect, while the owner of the database is the municipal authorities, whose income from the use of the property recorded in this database, at a rate of 0.3%, is 2.778 billion rubles/year. These costs (4,455,118 rubles) do not bother them much, since they are passed on to the property owners. In this respect, a developer of better big data refining technologies will have to demonstrate the ability to convince the owner of this database, and such things require considerable talent.

In this example, the error-assessment algorithm was based on the Shooman model [2] for software verification during reliability testing, chosen because of its wide availability online and the possibility of obtaining the necessary statistical indicators. The methodology is taken from Monakhov Yu. M., "Functional Stability of Information Systems" [1]; see Fig. 7-9.

Fig. 7-9. Methodology of the Shooman model.

The second part of this material presents a worked example of data cleaning in which the results of applying the Shooman model are obtained.
Let me present the results obtained there:
Estimated number of errors N = 3,167.
Parameter C, lambda, and the reliability function:

Fig. 17.

Essentially, lambda is the actual intensity with which errors are detected at each stage. If you look at the second part, the estimate of this indicator there was 42.4 errors per hour, which is quite comparable to the Shooman figure. Above, it was determined that the rate at which developers find errors should be no lower than 1 error per 250.4 records when checking 1 record per minute. Hence the critical value of lambda for the Shooman model:

60 / 250.4 = 0.239617.

That is, the error-search procedures must continue until lambda drops from its current value of 38.964 to 0.239617.

Or until the indicator N (the potential number of errors) minus n (the number of corrected errors) falls below our accepted threshold of 1,459 records.
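Here is a minimal sketch of this stopping rule, assuming error search is done in passes and that the intensity lambda is observed directly as errors found per hour; the pass figures in the example are invented purely for illustration.

# Stopping rule: stop cleaning passes when the observed error-detection
# intensity drops below the critical lambda, or when the estimated number
# of remaining errors falls below the accepted threshold.
CRITICAL_LAMBDA = 60 / 250.4      # ~0.2396 errors per hour
ERROR_THRESHOLD = 1459            # admissible number of erroneous records
N_ESTIMATED = 3167                # Shooman estimate of the total number of errors

def should_stop(errors_corrected_total: int, errors_this_pass: int, hours_this_pass: float) -> bool:
    observed_lambda = errors_this_pass / hours_this_pass
    remaining_errors = N_ESTIMATED - errors_corrected_total
    return observed_lambda < CRITICAL_LAMBDA or remaining_errors < ERROR_THRESHOLD

# Example pass: 12 errors found in 8 hours, 1,600 errors already corrected.
print(should_stop(errors_corrected_total=1600, errors_this_pass=12, hours_this_pass=8.0))  # False: keep cleaning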

Literature

  1. Monakhov, Yu. M. Functional Stability of Information Systems. In 3 parts. Part 1: Software Reliability: a textbook. Vladimir: Vladimir State University Publishing, 2011. 60 p. ISBN 978-5-9984-0189-3.
  2. Martin L. Shooman. "Probabilistic Models for Software Reliability Prediction."
  3. Paulraj Ponniah. Data Warehousing Fundamentals for IT Professionals. 2nd ed.

Part two. Theoretical

Source: habr.com
