What can go wrong with Data Science? Data collection

Today there are countless Data Science courses, and it has long been known that the easiest money in Data Science is made by selling Data Science courses (why dig for gold when you can sell shovels?). The main drawback of these courses is that they have little to do with real work: nobody will hand you clean, preprocessed data in the right format. And when you leave the course and start solving a real problem, many nuances surface.

So we are starting a series of notes, "What can go wrong with Data Science", based on real events that happened to me, my friends and colleagues. We will walk through typical Data Science tasks using real examples of how things actually go. Today we start with the task of collecting data.

The first thing people stumble over when they start working with real data is actually collecting the data that is relevant to them. The key message of this article:

We systematically underestimate the time, resources and effort to collect, clean and prepare data.

And most importantly, we will discuss what to do to prevent this.

According to various estimates, cleaning, transformation, data processing, feature engineering, etc. take 80-90% of the time, and analysis 10-20%, while almost all educational material focuses exclusively on analysis.

As a typical example, let's dissect a simple analytical problem in three variants and see what the "aggravating circumstances" look like.

Specifically, we will consider similar variations of the task of collecting data and comparing communities for:

  1. Two subreddits of Reddit
  2. Two sections of Habr
  3. Two Odnoklassniki groups

The approach in theory

Open the site and read the examples; if everything seems clear, budget a few hours for reading, a few hours for coding from the examples and debugging, and a few more for the collection itself. Add a few hours in reserve (multiply by two and add N hours).

Key point: The time estimate is based on assumptions and guesses about how long it will take.

A proper time analysis starts with estimating the following parameters for the problem described above:

  • What volume of data needs to be collected and how much of it has to be physically downloaded (*see below*).
  • How long it takes to collect one record, and how long you have to wait before you can collect the next one.
  • Budget for writing code that saves its state and can restart when (not if) everything falls over.
  • Figure out whether authorization is needed and budget the time to get API access.
  • Estimate the number of errors as a function of data complexity for the specific task: the structure, how many transformations, what to extract and how.
  • Budget for network errors and non-standard behavior of the project.
  • Check whether the functions you need are in the documentation, and if not, how (and how long) a workaround will take.
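To make this concrete, here is a minimal back-of-the-envelope sketch of how such parameters combine into a wall-clock estimate. Every number below is a made-up placeholder to be replaced with values measured during your own reconnaissance run.

# A toy estimate of pure collection time; all numbers are hypothetical placeholders.
records_needed = 100_000        # how many records have to be physically pulled
records_per_request = 25        # batch size allowed by the API
seconds_per_request = 0.5       # measured time of one request
sleep_between_requests = 1.0    # pause imposed by the rate limit
expected_failure_rate = 0.05    # share of requests that will need a retry

requests_needed = records_needed / records_per_request
requests_with_retries = requests_needed * (1 + expected_failure_rate)
wall_clock_seconds = requests_with_retries * (seconds_per_request + sleep_between_requests)

print(f"~{wall_clock_seconds / 3600:.1f} hours of pure collection time")
# ...and that is before authorization delays, state-saving/restart code
# and debugging the inevitable parsing errors.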

Most importantly, to estimate the time you actually need to spend time and effort on "reconnaissance in force" - only then will your planning be adequate. So no matter how hard you are pushed to say "how long will it take to collect the data", carve out time for a preliminary analysis and argue how much the estimate will vary depending on the actual parameters of the task.

Now let's demonstrate specific examples where these parameters change.

Key Point: The assessment is based on an analysis of the key factors affecting the scope and complexity of the work.

Guess-based estimation works well when the functional elements are small and there are few factors that can significantly affect the structure of the problem. But in many Data Science tasks such factors become extremely numerous, and this approach becomes inadequate.

Comparison of Reddit Communities

Let's start with the simplest case (as it turns out later). To be completely honest, this is a nearly ideal case; let's check it against our difficulty checklist:

  • There is a neat, understandable and documented API.
  • It is extremely simple, and most importantly, a token can be obtained automatically.
  • There is a Python wrapper with plenty of examples.
  • There is a community that analyzes and collects Reddit data (down to YouTube videos explaining how to use the Python wrapper), for example.
  • The methods we need almost certainly exist in the API. Moreover, the code looks compact and clean - below is an example of a function that collects the comments on a post.

import logging
from praw import Reddit

# Credentials for the Reddit API are expected in praw.ini; AGENT is a placeholder user agent.
AGENT = 'python:habr-example:v0.1'
logger = logging.getLogger(__name__)

def get_comments(submission_id):
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # replace_more() expands "load more comments" stubs; anything left over is skipped
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()

Taken from this collection of handy wrapper utilities.
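For completeness, a minimal usage sketch: the submission id below is made up, and Reddit credentials are assumed to be configured in praw.ini.

# Hypothetical usage of get_comments(); '7zzzzz' is a made-up submission id.
comments = get_comments('7zzzzz')
print(len(comments), 'comments collected')
for comment in comments[:5]:
    print(comment.body[:80])   # peek at the first few comment bodies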

Even though this is the best possible case, a number of important real-life factors still have to be taken into account:

  • API limits - we are forced to take data in batches (with sleeps between requests, etc.).
  • Collection time - for a full analysis and comparison, you will have to budget a significant amount of time just for the spider to walk through the subreddit.
  • The bot has to run on a server - you can't just run it on your laptop, put the laptop in your backpack and go about your business. So I ran everything on a VPS.
  • Physical inaccessibility of some data (it is visible only to admins or is too hard to collect) - this has to be taken into account: not all data can, in principle, be collected in a reasonable time.
  • Network errors: networking is a pain.
  • This is live, real data - it is never clean.
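To make the first few points concrete, here is a minimal sketch of the kind of loop you end up writing: batched requests, a sleep between them, and a checkpoint file so the collector can be restarted after a crash. The file name and pause are arbitrary choices, and get_comments() is the helper shown above.

import json
import time
from pathlib import Path

CHECKPOINT = Path('collected_submissions.json')  # hypothetical checkpoint file
PAUSE = 2.0                                      # sleep to respect API limits

# Load the ids we already processed so a restart does not redo the work.
done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def collect(submission_ids):
    for sid in submission_ids:
        if sid in done:
            continue                      # skip what we already have
        try:
            comments = get_comments(sid)  # helper from the example above
            done[sid] = [c.body for c in comments]
        except Exception as exc:          # network errors *will* happen
            print('failed on', sid, exc)
        CHECKPOINT.write_text(json.dumps(done))   # save state after every step
        time.sleep(PAUSE)                 # stay within the rate limit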

Of course, these nuances must be built into the development plan. The specific hours or days depend on development experience or experience with similar tasks. Still, we can see that the task here is purely an engineering one and does not require any extra contortions: everything can be estimated, written out and done quite well.

Comparison of Habr sections

Let's move on to a more interesting and less trivial case: comparing flows and/or sections of Habr.

Let's check our difficulty checklist - here, to understand each item, you already have to poke at the task itself a little and experiment.

  • At first you assume there is an API. There is one, but it is not available to ordinary users (or maybe it doesn't work at all).
  • Then you just start parsing HTML - "import requests", what could go wrong?
  • How do you parse? The simplest and most commonly used approach is to iterate over post IDs. Note that this is not the most efficient approach, and you will have to handle different cases - for example, the density of real IDs among all possible ones (a minimal sketch of such an enumeration loop appears a little further below).

    [Figure: taken from this article.]

  • Raw data wrapped in HTML pulled from the web is a pain. For example, you want to collect and store an article's rating: you rip the score out of the HTML and decide to save it as a number for further processing:

    1) int(score) throws an error: on Habr the minus in a string like "–5" is a typographic dash, not a minus sign (unexpected, right?), so at some point I had to bring the parser back to life with a terrible fix like this:

    try:
        # the site renders the minus as a typographic dash, so normalize it first
        score_txt = post.find(class_="score").text.replace(u"–", "-").replace(u"+", "")
        score = int(score_txt)
        if check_date(date):
            post_score += score
    except (AttributeError, ValueError):
        pass  # the score block may be missing entirely or fail to parse
    

    Dates, pluses and minuses may be missing altogether (as the check_date call above hints, this did actually happen).

    2) Unescaped special characters - they will show up, so be ready for them.

    3) The structure varies depending on the type of post.

    4) Old posts can have **weird structure**.

  • In practice, error handling and everything that may or may not happen has to be dealt with as it comes; you cannot reliably predict what will break, what other structures exist and what will fall off where - you just have to try and account for the errors the parser throws.
  • Then you realize you need to parse in several threads, otherwise a single-threaded parse will take 30+ hours (this is purely the execution time of an already working single-threaded parser that sleeps between requests and does not trigger any bans). In this article, this eventually led to a scheme like this:

[Figure: the resulting multi-threaded parsing scheme.]
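Before getting to threads, here is what the core of such an ID-enumeration parser looks like when the earlier points are pulled together. The URL pattern, class name and ranges are assumptions for illustration, not a verified description of Habr's markup.

import time
import requests
from bs4 import BeautifulSoup

def parse_post(post_id):
    # the URL pattern is an assumption; adjust it to the real site layout
    resp = requests.get(f'https://habr.com/ru/post/{post_id}/', timeout=10)
    if resp.status_code == 404:
        return None                      # many ids simply do not exist
    soup = BeautifulSoup(resp.text, 'html.parser')
    score_tag = soup.find(class_='score')
    if score_tag is None:
        return None                      # the structure varies by post type
    # negative scores are rendered with a typographic dash, not an ASCII minus
    score = int(score_tag.text.replace('\u2013', '-').replace('+', ''))
    return {'id': post_id, 'score': score}

posts = []
for post_id in range(1, 1000):           # enumerate ids; most will be misses
    try:
        post = parse_post(post_id)
    except (requests.RequestException, ValueError):
        post = None                      # network and parsing errors are routine
    if post:
        posts.append(post)
    time.sleep(0.5)                      # be polite to the site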

The resulting difficulty checklist:

  • Working with the network, and HTML parsing with iteration and enumeration over IDs.
  • Documents with heterogeneous structure.
  • Lots of places where the code can easily fall over.
  • Parallel code has to be written (a minimal sketch follows this list).
  • The necessary documentation, code examples and/or community are missing.
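And a minimal sketch of the "parallel code" item, reusing the hypothetical parse_post() from the previous sketch; the thread count and ID range are arbitrary, and the site's rate limits still apply.

from concurrent.futures import ThreadPoolExecutor

def parse_safe(post_id):
    # wrap the single-post parser so one bad post does not kill the whole run
    try:
        return parse_post(post_id)
    except Exception:
        return None

with ThreadPoolExecutor(max_workers=8) as pool:   # 8 threads is an arbitrary choice
    results = list(pool.map(parse_safe, range(1, 100_000)))

posts = [p for p in results if p is not None]
print(len(posts), 'posts parsed')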

The conditional time estimate for this task will be 3-5 times higher than for collecting data from Reddit.

Comparison of Odnoklassniki groups

Let's move on to the technically most interesting case of the three. It was interesting to me precisely because at first glance it looks quite trivial, but it turns out not to be at all - as soon as you start poking at it.

Let's start with our difficulty checklist and note that many of its items will turn out to be much harder than they look at first:

  • There is an API, but it lacks almost all of the functions we need.
  • For certain functions you have to request access by e-mail, i.e. access is not granted instantly.
  • It is terribly documented (to begin with, Russian and English terms are mixed everywhere, completely inconsistently - sometimes you just have to guess what is wanted of you), and, moreover, its design is not well suited to retrieving data, for example, with the function we want.
  • The documentation requires a session but does not actually use it - and there is no way to figure out all the intricacies of the API modes other than poking around and hoping something works.
  • There are no examples and no community; the only foothold is a small Python wrapper (without many usage examples).
  • Selenium looks like the most workable option, since much of the data we need is under lock and key.
    1) That is, authorization happens through a fictitious user (registered by hand).

    2) However, with Selenium there are no guarantees of correct and repeatable behavior (at least in the case of ok.ru, for sure).

    3) The ok.ru website contains JavaScript errors and sometimes behaves strangely and inconsistently.

    4) You have to deal with pagination, lazy loading of elements, etc. (a minimal Selenium sketch follows this checklist).

    5) The API errors that the wrapper returns have to be patched over with crutches, for example like this (a piece of experimental code):

    import time

    import odnoklassniki.api             # the third-party API wrapper used here
    from pdb import set_trace as bp      # drop into a debugger on unexpected errors
    from tqdm import tqdm

    def get_comments(args, context, discussions):
        pause = 1
        if args.extract_comments:
            all_comments = set()
            # makes sense to keep track of already processed discussions
            for discussion in tqdm(discussions):
                try:
                    comments = get_comments_from_discussion_via_api(context, discussion)
                except odnoklassniki.api.OdnoklassnikiError as e:
                    if "NOT_FOUND" in str(e):
                        comments = set()
                    else:
                        print(e)
                        bp()
                        comments = set()  # keep going after inspecting the error
                all_comments |= comments
                time.sleep(pause)
            return all_comments
    

    My favorite error was:

    OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")

    6) In the end, a combination of Selenium and the API looks like the most reasonable option.

  • You have to save state and restart the collector, and handle a large number of errors, including inconsistent site behavior - and these errors are quite hard to anticipate (unless you write parsers professionally, of course).
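To illustrate the Selenium part, here is a minimal sketch of logging in as a fictitious user and scrolling a group page to force lazy-loaded content to appear. The URL, form field names, group id and timings are assumptions, not a verified recipe for ok.ru.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                  # or Chrome - whichever driver is installed
driver.get('https://ok.ru/')

# Log in as a fictitious, manually registered user.
# The field names below are assumptions - inspect the real login form first.
driver.find_element(By.NAME, 'st.email').send_keys('fake_user@example.com')
driver.find_element(By.NAME, 'st.password').send_keys('fake_password')
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()
time.sleep(5)                                 # crude wait; explicit waits are better

# Scroll to trigger lazy loading / pagination of the group feed.
driver.get('https://ok.ru/group/12345678')    # made-up group id
for _ in range(20):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)                             # give the JavaScript time to load more items

html = driver.page_source                     # hand the rendered page to an HTML parser
driver.quit()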

The conditional time estimate for this task will be 3-5 times higher than for collecting data from Habr - even though in the Habr case we use a head-on approach with HTML parsing, while in the OK case we can use the API in the critical places.

Conclusions

No matter how much you are required to estimate "on the spot" (we have a planning meeting today!) the time for a large data-processing pipeline module, its execution time can almost never be estimated even qualitatively without analyzing the parameters of the task.

On a slightly more philosophical note, agile estimation strategies are well suited to engineering tasks, but tasks that are more experimental and, in a sense, "creative" and exploratory - that is, less predictable - run into difficulties, as in the examples of similar topics we have covered here.

Of course, data collection is just a telling example: it usually seems incredibly simple and technically straightforward, and it is in the details that the devil often lurks. And it is precisely on this task that one can show the whole range of possible options for what can go wrong and exactly how much the work can drag on.

If you only skim the characteristics of the task without additional experiments, then Reddit and OK look similar: both have an API and a Python wrapper, but in reality the difference is huge. Judging by these parameters alone, parsing Habr looks harder than OK - yet in practice it is quite the opposite, and this is exactly what simple experiments to analyze the parameters of the task reveal.

In my experience, the most effective approach is a rough estimate of the time needed for the preliminary analysis itself - simple first experiments and reading the documentation - which then lets you give an accurate estimate for the work as a whole. In terms of the popular agile methodology, I ask for a ticket for "estimating the parameters of the task", on the basis of which I can assess what can be accomplished within the "sprint" and give a more accurate estimate for each task.

Therefore, the most effective argument is one that shows a "non-technical" specialist how much the time and resources will vary depending on parameters that still have to be estimated.


Source: habr.com
