<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>That CS guy</title>
    <description>I write about computer science, mainly focused on application programming, with an emphasis on C#, my favourite language. Sometimes I feel like writing about technology too.
</description>
    <link>http://thatcsharpguy.com/</link>
    <atom:link href="http://thatcsharpguy.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 21 Jun 2018 07:55:15 +0100</pubDate>
    <lastBuildDate>Thu, 21 Jun 2018 07:55:15 +0100</lastBuildDate>
    <generator>Jekyll v3.8.2</generator>
    
      <item>
        <title>Text as Data</title>
        <description>&lt;h1 id=&quot;text-as-data&quot;&gt;Text as Data&lt;/h1&gt;
&lt;p&gt;Text as Data is a fairly generic name for a course about text analysis &amp;amp; text classification.&lt;/p&gt;

&lt;p&gt;Since a significant part of real-world information is contained in text documents, the idea behind this course was to teach us the basics of three things:&lt;/p&gt;

&lt;p&gt;The first was how to &lt;strong&gt;organise and categorise&lt;/strong&gt; text. The second was how to &lt;strong&gt;search and retrieve&lt;/strong&gt; documents or fragments of them, and the third was how to &lt;strong&gt;analyse&lt;/strong&gt; text to extract the sentiments that the authors were expressing.&lt;/p&gt;

&lt;h2 id=&quot;lecture-1-introduction-to-text&quot;&gt;Lecture 1. Introduction to text&lt;/h2&gt;
&lt;p&gt;In the first lecture, we reviewed how text can be represented as &lt;strong&gt;sparse vectors&lt;/strong&gt; and how we can calculate different &lt;strong&gt;similarity measures&lt;/strong&gt; between two vectors using similarity measures for sets. After that, we looked at the &lt;strong&gt;bag of words&lt;/strong&gt; representation, at how we can go beyond single, tokenised words and consider pairs or triples of words using &lt;strong&gt;n-grams&lt;/strong&gt;, and at another similarity measure, the &lt;strong&gt;cosine similarity&lt;/strong&gt;. We finished this lecture by reviewing the problems with using term frequency as the only criterion to describe our documents.&lt;/p&gt;
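
&lt;p&gt;To make these ideas a bit more concrete, here is a minimal sketch of a bag of words with a set-based (Jaccard) and a cosine similarity. It is not the code from the course: the whitespace tokeniser, the function names and the two example sentences are my own assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real tokeniser would do more.
    return Counter(text.lower().split())

def jaccard(a, b):
    # Set-based similarity: shared words divided by all distinct words.
    set_a, set_b = set(a), set(b)
    return len(set_a.intersection(set_b)) / len(set_a.union(set_b))

def cosine(a, b):
    # Dot product over the shared words, divided by both vector lengths.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical example documents.
doc_1 = bag_of_words("the cat sat on the mat")
doc_2 = bag_of_words("the cat ate the fish")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Counter only stores the words that actually occur, which is what makes the vector sparse. Both functions assume non-empty documents.&lt;/p&gt;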

&lt;h2 id=&quot;lecture-2-text-distributions&quot;&gt;Lecture 2. Text distributions&lt;/h2&gt;
&lt;p&gt;As we learned in the previous lecture, using the &lt;em&gt;raw term frequency&lt;/em&gt; is not the best idea, so in this lecture we saw different options to overcome this issue, such as &lt;strong&gt;operating in log space&lt;/strong&gt; for the term frequency, as well as how to take the whole collection into account using something known as the &lt;strong&gt;inverse document frequency or IDF&lt;/strong&gt;.&lt;/p&gt;
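
&lt;p&gt;As a rough illustration, this is one common way of combining a log-scaled term frequency with the IDF. There are several variants of both formulas and I am not claiming this is the exact one from the lecture; the tiny collection and the function names are made up for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
from collections import Counter

# Hypothetical collection of three tokenised documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def log_tf(term, doc):
    # Dampened term frequency: 1 + log(count), or 0 if the term is absent.
    count = Counter(doc)[term]
    return 1 + math.log(count) if count else 0.0

def idf(term, collection):
    # Terms that occur in fewer documents get a larger weight.
    containing = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / containing) if containing else 0.0

def tf_idf(term, doc, collection):
    return log_tf(term, doc) * idf(term, collection)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this collection, “mat” ends up with a higher weight than “the” in the first document, even though “the” occurs twice there, because “the” also appears in another document.&lt;/p&gt;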

&lt;h2 id=&quot;lecture-3-distributions-and-clustering&quot;&gt;Lecture 3. Distributions and clustering&lt;/h2&gt;
&lt;p&gt;Continuing with term distributions, we saw how they can be used to cluster documents in an unsupervised manner using algorithms such as k-means or hierarchical clustering.&lt;/p&gt;

&lt;h2 id=&quot;lecture-4-language-modelling&quot;&gt;Lecture 4. Language Modelling&lt;/h2&gt;
&lt;p&gt;In the fourth lecture, we learned another approach to representing documents, this time through probabilities: the probability of a sequence of words and the probability of a word given a sequence of words, both estimated via language models, and how considering n-grams allows us to get better models.&lt;/p&gt;

&lt;h2 id=&quot;lecture-5-word-vectors&quot;&gt;Lecture 5. Word Vectors&lt;/h2&gt;
&lt;p&gt;In lecture number five we learned about word embeddings, a somewhat more modern approach to representing words as dense vectors created from the context of each word.&lt;/p&gt;
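
&lt;p&gt;A minimal sketch of the idea, assuming a bigram model estimated by simple counting with no smoothing. The three-sentence corpus and the function names are invented for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter, defaultdict

# Hypothetical training corpus, already tokenised.
corpus = [
    "i like text".split(),
    "i like data".split(),
    "i study text".split(),
]

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        following[prev][word] += 1

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev), with no smoothing.
    total = sum(following[prev].values())
    return following[prev][word] / total if total else 0.0

def sequence_prob(words):
    # Probability of the sequence given its first word (chain rule).
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Without smoothing, any sequence that contains an unseen pair of words gets a probability of zero, which is one of the reasons real language models need more than plain counts.&lt;/p&gt;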

&lt;h2 id=&quot;lecture-6-text-classification&quot;&gt;Lecture 6. Text classification&lt;/h2&gt;
&lt;p&gt;Lecture six was about classification; we briefly reviewed classifiers such as Naïve Bayes, logistic regression, SVM and decision trees.&lt;/p&gt;

&lt;h2 id=&quot;lecture-7-intro-to-nlp&quot;&gt;Lecture 7. Intro to NLP&lt;/h2&gt;
&lt;p&gt;Natural Language Processing was the topic of the seventh lecture, which covered things like part-of-speech tagging and dependency parsing.&lt;/p&gt;

&lt;h2 id=&quot;lecture-8-more-on-text-classification&quot;&gt;Lecture 8. More on text classification&lt;/h2&gt;
&lt;p&gt;Lecture eight was another look at classification, reviewing some good practices to avoid over- or underfitting, as well as some ethical concerns that may arise from using machine learning for real-world applications.&lt;/p&gt;

&lt;h2 id=&quot;lecture-9-more-on-clustering&quot;&gt;Lecture 9. More on clustering&lt;/h2&gt;
&lt;p&gt;Just like the previous lecture, this one was about revisiting an old lecture from another perspective, in this case: clustering using &lt;strong&gt;Latent Semantic Indexing&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;applications&quot;&gt;Applications&lt;/h2&gt;
&lt;p&gt;The last lectures were about applications of what we saw during the course:&lt;/p&gt;

&lt;h3 id=&quot;lecture-10-information-extraction-named-entity-recognition-and-relation-extraction&quot;&gt;Lecture 10. Information Extraction, Named Entity recognition and Relation Extraction&lt;/h3&gt;

&lt;h3 id=&quot;lecture-11-question-answering-and-qa-architectures&quot;&gt;Lecture 11. Question Answering, and QA architectures&lt;/h3&gt;

&lt;h3 id=&quot;lecture-12-dialogue-systems-chatbots-slot-filling&quot;&gt;Lecture 12. Dialogue Systems, chatbots, slot filling&lt;/h3&gt;

&lt;h2 id=&quot;labs&quot;&gt;Labs.&lt;/h2&gt;
&lt;p&gt;The labs for this course were by far the most interesting of all the courses I had this semester (don’t feel bad Big Data, yours were cool as well). We worked with tools like NLTK and spaCy, and with Colab, Google’s version of Jupyter Notebooks.&lt;/p&gt;
</description>
        <pubDate>Tue, 19 Jun 2018 19:00:00 +0100</pubDate>
        <link>http://thatcsharpguy.com/tv/text-as-data/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/text-as-data/</guid>
        
        <category>DataScience</category>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
        <category>data-science</category>
        
      </item>
    
      <item>
        <title>Apache Spark</title>
        <description>&lt;h1 id=&quot;spark&quot;&gt;Spark.&lt;/h1&gt;
&lt;p&gt;A few months ago I told you about MapReduce, a framework for processing large data sets. In that video I also talked about the limitations of that model, such as working with iterative algorithms or processing data in “real time”.&lt;/p&gt;

&lt;p&gt;Today I am going to talk about a somewhat different way of working with large amounts of data, this time using Apache Spark. Spark is a computing framework initially developed at the University of California, Berkeley, and it is now one of the projects of the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;Spark revolves around one very important concept: &lt;strong&gt;Resilient Distributed Datasets&lt;/strong&gt; or &lt;strong&gt;RDDs&lt;/strong&gt;, that is, datasets that are distributed and recoverable. These RDDs are read-only collections of data: once created they can no longer be modified, and every time one is “modified” a new one is actually being created.&lt;/p&gt;

&lt;p&gt;These RDDs are not created immediately. Their creation is delayed until they are &lt;em&gt;materialised&lt;/em&gt;, that is, stored in memory or on disk, or in some cases until they are needed to perform an operation. It is a concept very, very similar to lazy evaluation.&lt;/p&gt;

&lt;p&gt;Their most important characteristic is that instead of keeping a record of the information an RDD contains, Spark keeps a record of the operations that created it, so the RDD can be built again in case of failure.&lt;/p&gt;
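
&lt;p&gt;To illustrate the idea of recording operations instead of data, here is a toy sketch in plain Python. It is not how Spark is implemented and it uses none of Spark’s API: the LazyDataset class and its methods are hypothetical names that only mimic the behaviour described above on a single machine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class LazyDataset:
    """Toy stand-in for an RDD: it stores its lineage, not its data."""

    def __init__(self, source, operations=()):
        self.source = source            # function that re-reads the input
        self.operations = operations    # lineage: transformations, in order

    def map(self, fn):
        # Nothing is computed here; a new dataset with a longer lineage is returned.
        return LazyDataset(self.source, self.operations + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.operations + (("filter", fn),))

    def collect(self):
        # Materialise: replay the recorded operations over the source data.
        data = list(self.source())
        for kind, fn in self.operations:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

numbers = LazyDataset(lambda: range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling even_squares.collect() returns [4, 16], and calling it again simply replays the same lineage, which is the property that lets a lost dataset be rebuilt. The original numbers dataset is never modified.&lt;/p&gt;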

&lt;p&gt;As RDDs are created, a directed acyclic graph, or &lt;strong&gt;DAG&lt;/strong&gt; in Spark terminology, is built as well. This graph is a way of representing the operations on RDDs that I just mentioned. When an RDD is materialised, the system takes care of deciding the best way of building it from the operations that were specified on it and on its derivatives.&lt;/p&gt;

&lt;h2 id=&quot;distribuido&quot;&gt;Distributed&lt;/h2&gt;
&lt;p&gt;It is worth noting that, just like MapReduce, Spark runs on a cluster of servers, and there is a whole orchestration nightmare behind making it work correctly. Luckily for you, this task is already solved and you can focus only on solving your problem (most of the time).&lt;/p&gt;

&lt;p&gt;It is precisely the DAG that defines how data moves between nodes, reducing the movement of information between servers as much as possible.&lt;/p&gt;

&lt;h2 id=&quot;ejemplo&quot;&gt;Example&lt;/h2&gt;
&lt;p&gt;Another of its advantages is that, unlike with MapReduce, with Spark you do not need to design the parts of the system separately: you can treat your code as if you were developing a monolithic system, with no extra considerations. It is true that to achieve good performance you need to deal with load distribution and work partitioning, but you do not need to know any of this to start programming. On top of that, the code is much simpler to read and to write.&lt;/p&gt;

&lt;p&gt;This is the word count example in Spark, using the three languages that Spark supports by default:&lt;/p&gt;

&lt;h3 id=&quot;scala&quot;&gt;Scala&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;val textFile = sc.textFile(&quot;hdfs://...&quot;)
val counts = textFile.flatMap(line =&amp;gt; line.split(&quot; &quot;))
                 .map(word =&amp;gt; (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile(&quot;hdfs://...&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&quot;python&quot;&gt;Python&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;text_file = sc.textFile(&quot;hdfs://...&quot;)
counts = text_file.flatMap(lambda line: line.split(&quot; &quot;)) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(&quot;hdfs://...&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&quot;java&quot;&gt;Java&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;JavaRDD&amp;lt;String&amp;gt; textFile = sc.textFile(&quot;hdfs://...&quot;);
JavaPairRDD&amp;lt;String, Integer&amp;gt; counts = textFile
    .flatMap(s -&amp;gt; Arrays.asList(s.split(&quot; &quot;)).iterator())
    .mapToPair(word -&amp;gt; new Tuple2&amp;lt;&amp;gt;(word, 1))
    .reduceByKey((a, b) -&amp;gt; a + b);
counts.saveAsTextFile(&quot;hdfs://...&quot;);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, there is no need to split your code into a mapper and a reducer, as you would have had to do with MapReduce:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;  public static class TokenizerMapper
       extends Mapper&amp;lt;Object, Text, Text, IntWritable&amp;gt;{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer&amp;lt;Text,IntWritable,Text,IntWritable&amp;gt; {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable&amp;lt;IntWritable&amp;gt; values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;spark-vs-mapreduce&quot;&gt;Spark vs. MapReduce&lt;/h2&gt;
&lt;p&gt;Well, from everything I have said it might seem that we should all stop using MapReduce (and in most cases this seems to be good advice). In general, though, you should know that Spark uses a lot of RAM to do its job efficiently, while MapReduce uses disk storage, which is still somewhat cheaper. Besides, if you do not need speed when processing your information, considering Spark could be overkill.&lt;/p&gt;

&lt;h2 id=&quot;key-value-pairs&quot;&gt;key-value pairs&lt;/h2&gt;
&lt;p&gt;You could say that Spark is compatible with MapReduce thanks to its key-value data types, as well as the fact that we can use the same file-reading connectors. As a result, some of the things that are done with MapReduce are directly portable to Spark. But anyway, let’s keep talking about the benefits of Spark…&lt;/p&gt;

&lt;h2 id=&quot;extensiones&quot;&gt;Extensions&lt;/h2&gt;
&lt;p&gt;Spark offers some extensions that enable it to handle work related to SQL and DataFrames, machine learning, data streaming and graphs.&lt;/p&gt;

&lt;h2 id=&quot;streaming&quot;&gt;Streaming&lt;/h2&gt;
&lt;p&gt;On the MapReduce video someone left me this question and I replied that Spark was an alternative, because, as I mentioned, this tool lets you work with data streams in &lt;em&gt;real time&lt;/em&gt;… but we can talk about that in another video, maybe even with a screencast.&lt;/p&gt;
</description>
        <pubDate>Tue, 12 Jun 2018 19:00:00 +0100</pubDate>
        <link>http://thatcsharpguy.com/tv/spark/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/spark/</guid>
        
        <category>DataScience</category>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
        <category>big-data</category>
        
      </item>
    
      <item>
        <title>Big Data</title>
        <description>&lt;h1 id=&quot;big-data&quot;&gt;Big Data&lt;/h1&gt;
&lt;p&gt;We took a look at some of the foundations of big data systems (some of them are even outdated now), from a more academic point of view.&lt;/p&gt;

&lt;p&gt;This is going to be a short video, since what we mostly did was try to understand the motivation and design decisions behind all these systems. I’ll put the links to all the papers we reviewed so you can take a look at them.&lt;/p&gt;

&lt;p&gt;Starting with…&lt;/p&gt;

&lt;h2 id=&quot;google-file-system&quot;&gt;Google File System&lt;/h2&gt;
&lt;p&gt;This was the first distributed file system Google created to store all the information they manage; it has since been replaced by Colossus. However, the foundations remain.&lt;/p&gt;

&lt;p&gt;From there we jumped to …&lt;/p&gt;

&lt;h2 id=&quot;hdfs-or-hadoop-file-system&quot;&gt;HDFS, or Hadoop Distributed File System&lt;/h2&gt;
&lt;p&gt;Which is, again, a distributed file system, in this case inspired by GFS. As I mentioned earlier, we started with the foundations, that is, version 1 of HDFS, only to then see the differences with the second version (and now there is a third version out, yay!).&lt;/p&gt;

&lt;p&gt;Once we learned a bit about HDFS, we learned about its companion, the programming model called…&lt;/p&gt;

&lt;h2 id=&quot;mapreduce&quot;&gt;MapReduce&lt;/h2&gt;
&lt;p&gt;Which is a useful technique to process vast amounts of information in a distributed way, taking advantage of having lots of relatively cheap computers. I made a whole video dedicated to MapReduce; you can check the link in the description.&lt;/p&gt;

&lt;p&gt;However, MapReduce is somewhat outdated too, and it has some limitations. We reviewed other more modern approaches to work with Big Data problems using…&lt;/p&gt;

&lt;h2 id=&quot;spark&quot;&gt;Spark&lt;/h2&gt;
&lt;p&gt;Which is a framework for distributed computing that allows us to specify transformations over a dataset without actually doing them right away but in a lazy manner. Spark has its foundation on the concept of Resilient Distributed Datasets: read-only collections of data distributed over nodes in a cluster. I’ll probably make a video about Spark in the future.&lt;/p&gt;

&lt;p&gt;Both MapReduce and Spark run on top of a distributed filesystem, benefiting from the characteristics of such systems.&lt;/p&gt;

&lt;p&gt;After learning about these two processing approaches, we went on to learn about other ways of storing information using distributed NoSQL data stores like…&lt;/p&gt;

&lt;h2 id=&quot;bigtable&quot;&gt;Bigtable&lt;/h2&gt;
&lt;p&gt;Another one of those Google creations. The first line of the paper says what Bigtable actually is: Bigtable is a distributed storage system for managing structured data that is designed to scale to a huge size. And that’s about it. I mean, it is way more complex than I’m making it sound here, but I won’t go into details.&lt;/p&gt;

&lt;p&gt;Again, we also briefly saw an open-source version of Bigtable called &lt;strong&gt;HBase&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;cassandra&quot;&gt;Cassandra&lt;/h2&gt;
&lt;p&gt;Finally, we reviewed Cassandra, another highly distributed data store, and its approach of decentralising the knowledge that the other systems had centralised in a master node. Another interesting thing is how easily it works across data centres.&lt;/p&gt;

&lt;p&gt;As for the practical side of things, we did a couple of exercises: one using MapReduce and the other one using Spark, on a school-provided cluster. Both exercises involved calculating PageRank scores of some Wikipedia articles.&lt;/p&gt;
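
&lt;p&gt;For reference, this is roughly what PageRank computes, as a single-machine sketch in plain Python. It is not the MapReduce or Spark code from the exercises: the four-page graph, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over every page.
            targets = outgoing if outgoing else pages
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical link graph: every link target is also a key.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(graph)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this graph page C collects the highest score, because three pages link to it, and page D the lowest, because nothing links to it. In a distributed version, each iteration would become one round of sending rank shares along the links and summing them per page.&lt;/p&gt;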

&lt;p&gt;As you can probably guess, all of these systems involve a coordination hell, as all of them are distributed and hold redundant copies of data, some of them not only on a single cluster but across the entire world.&lt;/p&gt;

&lt;p&gt;And that was it. As I said, for all of those systems we reviewed their main components, such as Master nodes or DataNodes or whatever they were called in each of the implementations, and the basic techniques that power their reliability, like writing to logs or creating checkpoints, along with some of the structures that help these systems achieve great performance, like LSM-trees, SSTables and Bloom filters.&lt;/p&gt;
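
&lt;p&gt;Of those structures, the Bloom filter is the easiest to show in a few lines. The following is only a sketch to convey the idea: the size, the number of hash functions and the trick of salting SHA-256 are my own choices, not what any of these systems actually use.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[position] for position in self._positions(item))

seen = BloomFilter()
seen.add("row-42")   # hypothetical row key
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A data store can ask the filter before touching the disk: if might_contain says no, the key is certainly not in that file and the read can be skipped. A yes can be a false positive, so it still has to be checked.&lt;/p&gt;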

</description>
        <pubDate>Tue, 12 Jun 2018 19:00:00 +0100</pubDate>
        <link>http://thatcsharpguy.com/tv/big-data/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/big-data/</guid>
        
        <category>DataScience</category>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
        <category>big-data</category>
        
      </item>
    
      <item>
        <title>Information Retrieval</title>
        <description>&lt;h1 id=&quot;information-retrieval&quot;&gt;Information retrieval&lt;/h1&gt;
&lt;p&gt;Well, the title is self-explanatory but a good definition is the following: information retrieval is a field concerned with the structure, analysis, organisation, storage, searching and retrieval of information. In this case, we focused on textual information.&lt;/p&gt;

&lt;p&gt;We started by looking at the…&lt;/p&gt;

&lt;h2 id=&quot;architecture-of-ir-systems&quot;&gt;Architecture of IR systems&lt;/h2&gt;
&lt;p&gt;And its three components: query terms, a collection of documents and the retrieval system. Then we continued with the critical parts of its operation:&lt;/p&gt;

&lt;h3 id=&quot;document-processing&quot;&gt;Document processing&lt;/h3&gt;
&lt;p&gt;We need to prepare the documents to be retrieved, and this is done through a series of steps. The first one is &lt;strong&gt;tokenisation&lt;/strong&gt;, that is, converting sentences into words; then comes &lt;strong&gt;stopword removal&lt;/strong&gt;, getting rid of highly frequent words; and finally &lt;strong&gt;conflation/stemming&lt;/strong&gt;, taking similar words and transforming them into a single unique symbol.&lt;/p&gt;

&lt;p&gt;All of this is done to create a structure called the &lt;strong&gt;inverted index&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&quot;inverted-index&quot;&gt;Inverted index&lt;/h3&gt;
&lt;p&gt;The inverted index is a structure that maps words to the documents where they appear. I won’t go into much detail about inverted indexes in this video but if you want to know more, just let me know in the comments.&lt;/p&gt;
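
&lt;p&gt;Just to give a flavour of it, here is a minimal sketch that tokenises, removes stopwords and builds an inverted index. The stopword list, the three documents and the function names are invented for the example, and stemming is left out to keep it short.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import defaultdict

STOPWORDS = {"the", "a", "of", "on"}   # tiny illustrative list

def process(text):
    # Tokenise on whitespace, lowercase, and drop stopwords (no stemming here).
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_index(documents):
    # Map each term to the sorted list of ids of documents containing it.
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in process(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical collection.
documents = {
    1: "The cat sat on the mat",
    2: "The dog chased the cat",
    3: "A history of dogs",
}
index = build_index(documents)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here index[&quot;cat&quot;] is [1, 2]. Note that “dog” and “dogs” end up as two different terms, which is exactly the kind of thing stemming is meant to fix.&lt;/p&gt;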

&lt;h3 id=&quot;retrieving&quot;&gt;Retrieving&lt;/h3&gt;
&lt;p&gt;Finally, we took a look at one of the basic algorithms for document retrieval given a query: the &lt;em&gt;best-match&lt;/em&gt; algorithm, which only considers whether or not a document contains a word of the query we are evaluating. However, most of the time this approach is not a good idea, so we then reviewed &lt;strong&gt;term weighting&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;term-weighting&quot;&gt;Term Weighting&lt;/h2&gt;
&lt;p&gt;The idea behind term weighting is to give each word a value that represents its importance for a document. To understand this idea a bit better, we first studied &lt;strong&gt;Zipf’s law&lt;/strong&gt; and how it applies to vast collections of texts.&lt;/p&gt;

&lt;p&gt;After that, we learned about the heuristic of using something known as &lt;strong&gt;Inverse Document Frequency&lt;/strong&gt;, which encodes the assumption that if a term appears in many different documents of the collection, it is less representative of any single document.&lt;/p&gt;

&lt;h2 id=&quot;vector-space-model&quot;&gt;Vector Space Model&lt;/h2&gt;
&lt;p&gt;We also saw another way to represent documents: as high-dimensional vectors, where each dimension represents one word of our whole vocabulary. Operating in a vector space allows us to compute similarities between documents and queries.&lt;/p&gt;

&lt;p&gt;However, what’s the idea behind building a retrieval system if there is no way to evaluate it?&lt;/p&gt;

&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h2&gt;
&lt;p&gt;With this in mind, we took a look at how retrieval systems are evaluated, starting from the creation of &lt;strong&gt;test collections&lt;/strong&gt;, which are extensive collections of documents accompanied by queries and relevance assessments provided by humans.&lt;/p&gt;

&lt;p&gt;We then covered metrics such as Precision @ rank R, Precision at standard recall levels and Average Precision, along with their averaged values, like Mean Average Precision and Mean Reciprocal Rank, as well as the Discounted Cumulative Gain.&lt;/p&gt;
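
&lt;p&gt;As a small worked example, this sketch computes three of those metrics for a single query, assuming binary relevance judgements. The ranking and the set of relevant documents are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def precision_at(ranking, relevant, k):
    # Fraction of the top-k results that are relevant.
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    # Mean of the precision values at each rank where a relevant document appears.
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    # 1 / rank of the first relevant document, or 0 if none was retrieved.
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical system output and human judgements for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For this query the precision at rank 2, the average precision and the reciprocal rank are all 0.5. Mean Average Precision and Mean Reciprocal Rank are simply the averages of the last two values over all the queries of the test collection.&lt;/p&gt;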

&lt;h2 id=&quot;relevance-feedback&quot;&gt;Relevance Feedback&lt;/h2&gt;
&lt;p&gt;Then we saw how to improve, or at least try to improve, the results our system outputs by using feedback from the user, something called &lt;strong&gt;relevance feedback&lt;/strong&gt;. Specifically, we saw three different kinds of feedback, &lt;strong&gt;explicit feedback&lt;/strong&gt;, &lt;strong&gt;implicit feedback&lt;/strong&gt; and &lt;em&gt;pseudo-relevance feedback&lt;/em&gt;, and their relationship to &lt;strong&gt;query expansion&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Following the trend of involving the user in the retrieval process, we looked into &lt;strong&gt;Interactive IR&lt;/strong&gt; and some other ways to involve the user in the process of information retrieval, mainly through the user interface of the results output. After this, we went back to retrieval models, in this case &lt;em&gt;advanced retrieval models&lt;/em&gt;…&lt;/p&gt;

&lt;h2 id=&quot;advanced-retrieval-models&quot;&gt;Advanced Retrieval Models&lt;/h2&gt;
&lt;p&gt;All the retrieval models we had seen so far were… simple.&lt;/p&gt;

&lt;p&gt;In this section, we started taking probability into account and treating relevance estimation as a classification problem, using retrieval models like BM25 or language models, or even some more advanced ones like PL2, part of the Divergence From Randomness framework. We also considered the proximity between terms in an attempt to improve search results.&lt;/p&gt;

&lt;h2 id=&quot;learning-to-rank&quot;&gt;Learning to Rank&lt;/h2&gt;
&lt;p&gt;Moreover, we can use machine learning to refine the final ranking shown to the user. We reviewed the &lt;em&gt;cascade-like&lt;/em&gt; pipeline used to filter documents and prepare our machine learning models, what kind of features we can use, and the three kinds of learning tasks: pointwise, pairwise and listwise.&lt;/p&gt;

&lt;p&gt;For a while, we focused on &lt;strong&gt;web search&lt;/strong&gt; and its challenges, like personalising the results for each user or considering the context the user is in to resolve ambiguities. Keeping on with web search, we studied a way to get documents through &lt;strong&gt;web crawling&lt;/strong&gt;, which is not an easy task. You need to consider issues like respecting the servers you’re querying and withstanding traps on the web or malformed HTML, not to mention the fact that we need to keep track of the documents we’ve downloaded, and that we can’t do it at scale with a single computer.&lt;/p&gt;

&lt;p&gt;Then we went back to evaluation…&lt;/p&gt;

&lt;h2 id=&quot;online-evaluation&quot;&gt;Online evaluation&lt;/h2&gt;
&lt;p&gt;The evaluation that we studied earlier in the course is known as offline evaluation, but a more exciting and challenging task is to evaluate our system while real users are using it. We reviewed two techniques: A/B testing and Interleaving.&lt;/p&gt;

&lt;h2 id=&quot;ir-infrastructures-and-efficiency&quot;&gt;IR infrastructures and efficiency&lt;/h2&gt;
&lt;p&gt;After all we saw, we had to review how to make these systems efficient by looking at compression techniques for the index such as &lt;strong&gt;pruning&lt;/strong&gt; or &lt;strong&gt;unary or gamma coding&lt;/strong&gt;, as well as caching techniques for queries, terms or documents. We saw some of the infrastructures and how they affect the performance and evaluation of the queries that are issued to the system.&lt;/p&gt;
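
&lt;p&gt;To show what unary and gamma coding look like, here is a small sketch that works on strings of “0” and “1” instead of real bits. There is more than one convention for the unary code; this one (n ones followed by a zero) is an assumption, and the function names are mine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def unary(n):
    # n ones followed by a terminating zero (one common convention).
    return "1" * n + "0"

def gamma(n):
    # Offset = binary form of n without its leading 1; prefix = offset length in unary.
    offset = bin(n)[3:]
    return unary(len(offset)) + offset

def gamma_decode(bits):
    # Count the ones before the first zero, then read that many offset bits.
    length = bits.index("0")
    offset = bits[length + 1:length + 1 + length]
    return int("1" + offset, 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For example, gamma(5) gives 11001 and gamma(9) gives 1110001. Small numbers get short codes, which is why these codes suit the small integers that dominate an index; the code is only defined for positive integers.&lt;/p&gt;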

&lt;h2 id=&quot;real-time-ir&quot;&gt;Real-time IR&lt;/h2&gt;
&lt;p&gt;We saw the necessity and challenges of performing some of the tasks we previously saw, but using streams of data such as tweets or Facebook status updates.&lt;/p&gt;

&lt;h2 id=&quot;applications-beyond-search&quot;&gt;Applications Beyond Search&lt;/h2&gt;
&lt;p&gt;To finish the course, we were introduced to other applications such as text summarisation, real-time event detection, recommendations and question answering.&lt;/p&gt;

&lt;p&gt;So that was about it. On the practical side of things we worked with Terrier, which is a retrieval system developed here at the University of Glasgow.&lt;/p&gt;
</description>
        <pubDate>Tue, 05 Jun 2018 19:00:00 +0100</pubDate>
        <link>http://thatcsharpguy.com/tv/information-retrieval/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/information-retrieval/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
        <category>c-sharp</category>
        
      </item>
    
      <item>
        <title>Data Fundamentals</title>
        <description>&lt;p&gt;This is the first video of a series I’ll be uploading in the coming weeks about the courses that I took as part of my master’s degree here at the University of Glasgow.&lt;/p&gt;

&lt;p&gt;The first course I want to talk to you about is Data Fundamentals. Basically, this course was a refresher of some of the things that you learned if you did an engineering degree, not to mention if you did a computer science degree. In my opinion, this was also the perfect introduction to Data Science, with basic but very interesting stuff that prepared us for the challenges.&lt;/p&gt;

&lt;p&gt;I think it is worth mentioning that this was by far my favourite course, not only because of the topics we reviewed but also because of the lecturer, John Williamson; the way he explained things to us was simply amazing. But anyway, here in the university the courses are divided into lectures. Usually there is one topic per lecture, and that is precisely how I’ll be describing them to you.&lt;/p&gt;

&lt;p&gt;Starting with lecture 1:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lecture 1&lt;/strong&gt;: We saw how to work with NumPy, a practical Python library to deal with numerical data in the form of arrays (vectors, matrices and tensors), and how to transform them by flipping, slicing, transposing and many more operations. We also took a brief look at the concept of vectorized computing, which applies a single operation to multiple data elements at once to improve the performance of our code by taking maximum advantage of our computers.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Lecture 2&lt;/strong&gt;, we learned how arrays are efficiently stored in a computer. We also saw how floats are stored in IEEE 754 notation, and this enabled us to know what the most common errors are when working with them. Finally, we were introduced to the concept of higher-rank tensors, and how we can work efficiently with them using vectorized computing.&lt;/p&gt;
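
&lt;p&gt;A classic example of those errors, using only the Python standard library. The float_bits helper is a hypothetical name; it just splits a 64-bit double into its three IEEE 754 fields.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import struct

total = 0.1 + 0.2

# 0.1 and 0.2 have no exact binary representation, so the sum is slightly off.
exactly_equal = total == 0.3
close_enough = math.isclose(total, 0.3)

def float_bits(x):
    # Sign, exponent and fraction fields of a 64-bit IEEE 754 double.
    raw = struct.unpack("!Q", struct.pack("!d", x))[0]
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]

sign, exponent, fraction = float_bits(1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here exactly_equal is False while close_enough is True, which is why floats should be compared with a tolerance. For 1.0 the sign bit is 0, the exponent field is 01111111111 (the bias, 1023) and the fraction is all zeros.&lt;/p&gt;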

&lt;p&gt;In &lt;strong&gt;Lecture 3&lt;/strong&gt;, we were shown some basics of scientific visualisation using &lt;code&gt;matplotlib&lt;/code&gt;, a Python library to plot data, and learned the “language” of graphics: what a stat, a scale, a guide, a geom, a layer and a facet are. We also saw which plots make more sense when dealing with different types of data, and which colours we can use to correctly communicate our findings.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Fourth lecture&lt;/strong&gt; was an introduction to linear algebra, its concepts and how we can work with them using Python and NumPy. This is somewhat an extension of what we saw in the first lecture, but now with linear algebra in mind. Here we reviewed the basic operations that can be performed on vectors, the concept of norm and more complex vector operations. By the end of this lecture we reviewed matrices and the operations that are defined for them, as well as some useful properties that we can use to make our computations simpler.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Fifth lecture&lt;/strong&gt; started with a refresher of how graphs can be represented using matrices, whether they are directed or undirected, weighted or unweighted. After that, we went directly back to linear algebra, where concepts like eigenvalues and eigenvectors appeared, along with how they are related to Principal Component Analysis. Talking about decomposition of matrices, we also reviewed the concept of Singular Value Decomposition, a way to decompose a matrix into simpler forms to enable efficient computation on them.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Lecture 6&lt;/strong&gt;, we started with the core concept on which machine learning relies: optimization. Optimization is the task of finding the optimal settings for a process in order to improve it. We saw the main parts of an optimization problem:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Parameters&lt;/li&gt;
  &lt;li&gt;Objective function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also saw how constraints can be added to our parameter selection to make our optimization search more realistic. We focused mainly on iterative optimization, guided by heuristics that try to improve the performance of the algorithms behind them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lecture 7&lt;/strong&gt; was also about optimization concepts, including the basics of how a neural network works, and how derivatives can help us find the optimal parameters using the gradient descent algorithm. We saw what differentiable programming is and how we can use it to perform optimization.&lt;/p&gt;
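
&lt;p&gt;The simplest possible illustration of gradient descent, on a one-parameter problem where the derivative is written by hand. The objective function, the learning rate and the number of steps are arbitrary choices for this sketch.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def objective(x):
    # A simple convex function with its minimum at x = 3.
    return (x - 3.0) ** 2

def gradient(x):
    # Derivative of the objective with respect to x.
    return 2.0 * (x - 3.0)

def gradient_descent(start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        # Move against the gradient, that is, downhill.
        x = x - learning_rate * gradient(x)
    return x

best_x = gradient_descent(start=0.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each step shrinks the distance to the minimum by a constant factor, so best_x ends up very close to 3. Differentiable programming is what lets us skip writing the gradient function by hand.&lt;/p&gt;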

&lt;p&gt;In &lt;strong&gt;Lecture 8&lt;/strong&gt; the topic was probability, starting from the basics, including the concepts of probability mass functions and probability density functions. We also reviewed joint, marginal and conditional probability, which are the basis for Bayes’ Rule, a very important concept as well, and how we can work with probabilities without running into numerical issues.&lt;/p&gt;
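
&lt;p&gt;One common numerical issue with probabilities is underflow when many small values are multiplied together, and a usual remedy is to work with logarithms instead. The snippet below is a sketch of that idea under an assumed toy setting (1000 independent events with probability 0.01 each); it is not taken from the lecture notes.&lt;/p&gt;

```python
import math

# Assumed toy data: 1000 independent events, each with probability 0.01.
probabilities = [0.01] * 1000

# Multiplying them directly underflows to exactly 0.0 in floating point.
direct_product = 1.0
for p in probabilities:
    direct_product *= p

# Summing logarithms keeps the same information in a representable range.
log_probability = sum(math.log(p) for p in probabilities)

print(direct_product)
print(log_probability)
```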

&lt;p&gt;In &lt;strong&gt;lab number eight&lt;/strong&gt;, corresponding to this lecture, we learned what a stochastic process is, and practised with a specific kind of such process: the Markov process.&lt;/p&gt;
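
&lt;p&gt;To give an idea of what a Markov process looks like in code, here is a minimal sketch with an invented two-state weather model. The states, the transition probabilities and the helper &lt;code&gt;simulate&lt;/code&gt; are all assumptions for illustration, not the lab’s actual exercise.&lt;/p&gt;

```python
import random

# Assumed toy weather model: each row gives the next-state probabilities.
transitions = {
    'sunny': {'sunny': 0.8, 'rainy': 0.2},
    'rainy': {'sunny': 0.4, 'rainy': 0.6},
}


def simulate(start, steps, rng):
    # The next state depends only on the current state (the Markov property).
    state = start
    path = [state]
    for _ in range(steps):
        options = list(transitions[state])
        weights = [transitions[state][name] for name in options]
        state = rng.choices(options, weights=weights)[0]
        path.append(state)
    return path


path = simulate('sunny', steps=10, rng=random.Random(0))
print(path)
```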

&lt;p&gt;&lt;strong&gt;Lecture 9&lt;/strong&gt; was also about probability; we saw the concept of expectation, then some statistics and finally what Bayesian inference is and why Monte Carlo approaches are useful when dealing with it.&lt;/p&gt;
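
&lt;p&gt;The Monte Carlo idea in its simplest form is to estimate an expectation by averaging random samples. The sketch below does this for a fair six-sided die, an assumed toy example; it is not Bayesian inference itself, only the sampling principle that Monte Carlo methods build on.&lt;/p&gt;

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Draw many samples from a fair six-sided die and average them.
samples = [rng.randint(1, 6) for _ in range(100000)]
estimate = sum(samples) / len(samples)

# The exact expectation is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
print(estimate)
```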

&lt;p&gt;To conclude the course, in &lt;strong&gt;lecture number 10&lt;/strong&gt; we saw a bit of digital signal processing and time series: what sampling is, how it relates to the Nyquist limit, and how aliasing can appear if we don’t respect this limit. The last section of this lecture was about convolution and the Fourier transform.&lt;/p&gt;

&lt;p&gt;Of course, we covered a lot more topics and went more in depth on almost all of them. Summarizing hours of lectures and labs in a short video is not that easy, so if you want to know more about something that I mentioned here, please leave it in the comments.&lt;/p&gt;

&lt;p&gt;So, that’s it from me, at least for now. I hope you liked this video, and if you did, please give it a Like. More videos about the remaining courses are coming; I’m creating them as I review the lecture notes in preparation for the exam, and I think the next one is going to be about machine learning.&lt;/p&gt;
</description>
        <pubDate>Thu, 05 Apr 2018 19:00:00 +0100</pubDate>
        <link>http://thatcsharpguy.com/tv/data-fundamentals/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/data-fundamentals/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
      </item>
    
      <item>
        <title>Programming in English?</title>
        <description>
</description>
        <pubDate>Tue, 20 Mar 2018 18:00:00 +0000</pubDate>
        <link>http://thatcsharpguy.com/tv/programar-ingles/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/programar-ingles/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
      </item>
    
      <item>
        <title>Greedy algorithms</title>
        <description>&lt;p&gt;Greedy algorithms (&lt;em&gt;algoritmos voraces&lt;/em&gt; in Spanish) are algorithms that implement a heuristic (a technique) whose goal is to optimize the search for an optimal solution to a problem.&lt;/p&gt;

&lt;p&gt;The idea behind greedy algorithms is to always make the best decision among those available at that moment, hoping that putting all these small best decisions together will yield the best solution to the overall problem.&lt;/p&gt;

&lt;p&gt;You will often see this explained as the algorithm making locally optimal decisions in the hope that they will lead to the globally optimal solution.&lt;/p&gt;

&lt;p&gt;Take this problem as an example: suppose these cities exist, and your task is to spend as little as possible getting from Pueblo Paleta to X, to Y. As you can see, there are many routes. A greedy algorithm would do something like this:&lt;/p&gt;

&lt;p&gt;At each step, buy the cheapest ticket until reaching town Y.&lt;/p&gt;

&lt;h3 id=&quot;no-siempre-es-lo-mejor&quot;&gt;Not always the best&lt;/h3&gt;

&lt;p&gt;As I said, the algorithm hopes that doing this will produce the best result in the end, but that is not necessarily the case.&lt;/p&gt;

&lt;p&gt;Imagine a scenario like the previous one, but with slightly different prices. Following the greedy heuristic, an algorithm would do something like this:&lt;/p&gt;

&lt;p&gt;It would again take the cheapest ticket each time, yet in the end these locally optimal decisions led it to spend more money.&lt;/p&gt;
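
&lt;p&gt;The map from the video is not reproduced here, so the following is only a sketch with invented data: the intermediate towns &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; and every price are hypothetical. It compares the greedy rule (always buy the cheapest ticket) with an exhaustive search, assuming a map with no cycles and no dead ends.&lt;/p&gt;

```python
# Hypothetical ticket prices (not the ones from the video): origin to {destination: price}.
prices = {
    'Pueblo Paleta': {'A': 1, 'B': 4},
    'A': {'Y': 10},
    'B': {'Y': 2},
    'Y': {},
}


def greedy_route(start, goal):
    # At every town, buy the cheapest ticket available right now.
    town, route, total = start, [start], 0
    while town != goal:
        next_town = min(prices[town], key=prices[town].get)
        total += prices[town][next_town]
        town = next_town
        route.append(town)
    return route, total


def cheapest_route(town, goal):
    # Exhaustive search over every route, used here only for comparison.
    if town == goal:
        return [town], 0
    candidates = []
    for next_town, price in prices[town].items():
        rest, rest_total = cheapest_route(next_town, goal)
        candidates.append(([town] + rest, price + rest_total))
    return min(candidates, key=lambda candidate: candidate[1])


print(greedy_route('Pueblo Paleta', 'Y'))
print(cheapest_route('Pueblo Paleta', 'Y'))
```

&lt;p&gt;With these made-up prices the greedy route goes through A and costs 11, while the route through B costs only 6: the cheap first ticket leads to an expensive second one.&lt;/p&gt;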

&lt;h3 id=&quot;por-qué-existen&quot;&gt;Why do they exist?&lt;/h3&gt;
&lt;p&gt;But then, if in the end it may not get the best result, why does this technique exist?&lt;/p&gt;

&lt;p&gt;Well, there are problems for which computing the globally optimal solution is unimaginable, or incredibly hard, as is the case with the travelling salesman problem (which I talked about in a previous video). And you, as a developer, have to decide whether to sacrifice some of your algorithm’s optimality in exchange for processing time.&lt;/p&gt;

&lt;p&gt;They also exist because they are generally simple to understand and also simple to program.&lt;/p&gt;

&lt;p&gt;Remember that there is no golden hammer that solves every problem you run into; this is just one of the tools you can use.&lt;/p&gt;

&lt;h3 id=&quot;los-componentes&quot;&gt;The components&lt;/h3&gt;
&lt;p&gt;Now, not every problem can be tackled with this kind of algorithm. For a problem to be solvable with a greedy algorithm, it must have the following property:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The problem must be decomposable into sequential subproblems, and each time one of these subproblems is solved (that is, each time a decision is made), it contributes to the final result, reducing the size of the problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;ejemplos&quot;&gt;Examples&lt;/h3&gt;
&lt;p&gt;Among the examples you will find of problems that can be tackled with these algorithms are:&lt;/p&gt;

&lt;p&gt;The travelling salesman problem, Prim’s and Kruskal’s Minimum Spanning Tree algorithms, Huffman coding and the knapsack problem.&lt;/p&gt;
</description>
        <pubDate>Sun, 11 Mar 2018 18:00:00 +0000</pubDate>
        <link>http://thatcsharpguy.com/tv/algoritmos-voraces/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/algoritmos-voraces/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
      </item>
    
      <item>
        <title>What is MapReduce?</title>
        <description>&lt;p&gt;MapReduce is a programming model strongly oriented towards parallel, distributed execution across multiple computers, used to work with large collections of data, say a few terabytes (or petabytes). The idea of MapReduce is to offer a simple, fast, scalable and fault-tolerant way to manipulate enormous amounts of data. In MapReduce terminology these manipulations are known as jobs, and that is how I will refer to them from now on.&lt;/p&gt;

&lt;p&gt;MapReduce came out of Google and was presented in 2004 by two Google researchers, who published it in a paper. From there, an open source implementation was developed that runs within Hadoop (which, by the way, is another open source marvel, inspired by papers published by Google; I can talk about it in another video if you like). The truth is that the explanation I am going to give here is very, very basic, so I invite you to read the paper, look for more information and, of course, leave your questions in the comments. I may not be able to answer them myself, but I can surely ask someone else.&lt;/p&gt;

&lt;h2 id=&quot;composición&quot;&gt;Composition&lt;/h2&gt;
&lt;p&gt;Its name comes from two old acquaintances from functional programming: the Map and Reduce functions. As the name suggests, it consists of two stages (there are actually three, but the third is transparent to the developer): the Map stage and the Reduce stage. I will go into detail about these stages a little later.&lt;/p&gt;

&lt;p&gt;As I mentioned before, MapReduce was created to run on multiple computers at once, distributing the workload among them. It is worth noting that these computers do not have to be very expensive supercomputers; they can be simple, relatively cheap machines. That is why, when people talk about MapReduce, they talk about clusters of computers.&lt;/p&gt;

&lt;p&gt;Every computer that takes part in a MapReduce job runs an agent that coordinates its resources, known as the NodeManager. The NodeManager in turn communicates with the component in charge of orchestrating the whole job: a program called the ResourceManager, which runs on another computer in the cluster. The ResourceManager assigns the work, keeps track of the state of each NodeManager and reassigns unfinished work if one of the nodes stops responding.&lt;/p&gt;

&lt;h2 id=&quot;entradas-y-salidas&quot;&gt;Inputs and outputs&lt;/h2&gt;
&lt;p&gt;As I also mentioned briefly at the start of the video, the goal of a MapReduce job is to take some amount of data (from a file, a database or any other data source), manipulate it in some way and deliver a result at the end. Despite what the name “reduce” suggests, the generated file does not have to be a reduced version of the input; it can end up being larger.&lt;/p&gt;

&lt;p&gt;It works more or less like this:&lt;/p&gt;

&lt;h3 id=&quot;mapping&quot;&gt;Mapping&lt;/h3&gt;
&lt;p&gt;From your input data you take a set of records in the form of key-value pairs. In the mapper phase you perform some kind of processing on them, such as splitting a string on whitespace or finding all the links in a web document. The idea is that the mapper generates a new data set, again in key-value form. Note that neither the keys nor the values need to be the same as in the input files; they can be completely different.&lt;/p&gt;

&lt;h3 id=&quot;shuffling&quot;&gt;Shuffling&lt;/h3&gt;
&lt;p&gt;At the output of the mappers another stage runs, known as shuffling, which sorts the generated data according to the key you assigned in the mapper. Once the shuffling phase has run, the data is ready to be consumed by the reducers.&lt;/p&gt;

&lt;h3 id=&quot;reducing&quot;&gt;Reducing&lt;/h3&gt;
&lt;p&gt;The reducers’ task begins by retrieving the information from the mappers; each reducer processes all the data assigned to a single key at a time. In the reducer we get the chance to do further processing on the values, but now with the certainty that all the information we have is identified by a single key.&lt;/p&gt;

&lt;p&gt;The reducer, like the mapper, has the task of generating a new set of information in key-value form, which is again written to a file.&lt;/p&gt;

&lt;h2 id=&quot;claves&quot;&gt;Key ideas&lt;/h2&gt;

&lt;p&gt;Yes, I know this key-value business can sound complicated at first, but after implementing a very basic version of Google’s first PageRank in MapReduce (I leave the link to the post in the description) I can identify a few key ideas:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Think in parallel, and try to temporarily forget every other paradigm you have in mind.&lt;/li&gt;
  &lt;li&gt;When working in the mappers you will not have all the information you need in one place; see the mappers as a stage that prepares the information.&lt;/li&gt;
  &lt;li&gt;If you want something to run in a single place, that happens on the reducer side, and for it to happen all the elements you want together must have the same key.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;el-hola-mundo&quot;&gt;The Hello World&lt;/h2&gt;
&lt;p&gt;The hello world of MapReduce is a word count example that I will try to explain right now. Imagine you have the following text:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Cuando cuentes cuentos
Cuenta cuantos cuentos cuentas
Por que si no cuentas cuantos cuentos cuentas 
Nunca sabrás cuantos cuentos cuentas
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One way to count the words using MapReduce is the following:&lt;/p&gt;

&lt;p&gt;The mappers receive fragments of the text, split by lines. The information arrives as follows: each line is a value, and the key for that value is the line number within the file. Inside the mapper, we take each line and split it on spaces, take each of those words as a key, associate a one with it as the value, and write the pairs to the output.&lt;/p&gt;

&lt;p&gt;The shuffler’s task is to sort this information by key, something like this:&lt;/p&gt;

&lt;p&gt;After that, the reducers fetch this information; remember that each reducer takes all the elements associated with a single key and processes them.&lt;/p&gt;

&lt;p&gt;In the reducer, a process takes all the numbers associated with a word and adds them up, finally writing another key-value pair to the output, where the key is the word and the value is the number of times that word appears in the file.&lt;/p&gt;
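
&lt;p&gt;The three stages can be imitated in a single Python process. This is only a sketch of the data flow, not a real Hadoop job: the function names &lt;code&gt;mapper&lt;/code&gt;, &lt;code&gt;shuffle&lt;/code&gt; and &lt;code&gt;reducer&lt;/code&gt; are assumptions for illustration, and nothing here is distributed across machines.&lt;/p&gt;

```python
from collections import defaultdict

text = '''Cuando cuentes cuentos
Cuenta cuantos cuentos cuentas
Por que si no cuentas cuantos cuentos cuentas
Nunca sabrás cuantos cuentos cuentas'''


def mapper(line_number, line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    # Group the values by key and sort by key, as the shuffle stage does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(word, ones):
    # Add up all the ones emitted for a single word.
    return word, sum(ones)


mapped = []
for line_number, line in enumerate(text.splitlines()):
    mapped.extend(mapper(line_number, line))

counts = dict(reducer(word, ones) for word, ones in shuffle(mapped))
print(counts)
```

&lt;p&gt;Note that this sketch is case-sensitive, so “Cuenta” and “cuentas” are counted as different words.&lt;/p&gt;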

&lt;h2 id=&quot;usos-y-limitaciones&quot;&gt;Uses and limitations&lt;/h2&gt;
&lt;p&gt;As a rule of thumb, MapReduce is used when we want to tackle problems involving large, enormous amounts of data. That is why it is generally expected to run on a distributed file system, such as HDFS (Hadoop Distributed File System).&lt;/p&gt;

&lt;p&gt;Although MapReduce is an approach to parallelizing and distributing computation over large amounts of data, we have to keep in mind that not every problem can be tackled with it. It is also not a good solution for jobs that need real-time responses; MapReduce is better suited to offline processing of information.&lt;/p&gt;
</description>
        <pubDate>Mon, 19 Feb 2018 18:00:00 +0000</pubDate>
        <link>http://thatcsharpguy.com/tv/mapreduce/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/mapreduce/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
      </item>
    
      <item>
        <title>Python</title>
        <description>&lt;h1 id=&quot;python&quot;&gt;Python.&lt;/h1&gt;

&lt;p&gt;As many of you probably know, Python is a programming language which, as very few probably know, took its name not from a snake but from a British comedy show. Python was released in 1991 by Guido van Rossum. It was initially conceived as a simple scripting language, but today it has made its way into web development, data science, machine learning and related fields.&lt;/p&gt;

&lt;h2 id=&quot;filosofía&quot;&gt;Philosophy&lt;/h2&gt;

&lt;p&gt;The philosophy behind Python could be summarised by a document written in 1999, eight years after the language was released. You can read the document at this link: but I will tell you some of its principles, which do sound very philosophical:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Beautiful is better than ugly&lt;/li&gt;
  &lt;li&gt;Explicit is better than implicit&lt;/li&gt;
  &lt;li&gt;Simple is better than complex&lt;/li&gt;
  &lt;li&gt;Readability counts&lt;/li&gt;
  &lt;li&gt;There should be one—and preferably only one—obvious way to do it.&lt;/li&gt;
  &lt;li&gt;If the implementation is hard to explain, it’s a bad idea.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The truth is that, nice as these principles sound, writing software still falls to humans, so they often go unapplied. For example, in Python it is normal to find more than one way of doing things.&lt;/p&gt;

&lt;h2 id=&quot;características&quot;&gt;Features&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;It is dynamically typed&lt;/strong&gt;: because we can do something like this:&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;a = 1
b = 'C'
c = [0.1, 0.5]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That is, you do not need to specify a variable’s data type when declaring it. And neither a compiler nor the interpreter checks the types before the program runs.&lt;/p&gt;

&lt;p&gt;It also allows something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;a = 1
a = 'C'
a = [0.1, 0.5]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That is, completely changing a variable’s data type without anyone complaining. Believe me, this can cause a lot of confusion, but once you get used to it, it can become a very useful tool.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;However, it is also considered a &lt;strong&gt;strongly typed&lt;/strong&gt; language (it is worth stressing that this combination can exist: dynamically and strongly typed at the same time). It is considered strongly typed because the language defines a set of rules (behaviours) under which data types can be mixed, and breaking those rules raises an exception. Take the following code as an example:&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;a3 = &quot;a&quot; + 3 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Incredible as it may seem, this raises an error, because the int and string data types do not define a way to be mixed. If you want to concatenate them, you first have to convert the integer to a string.&lt;/p&gt;
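
&lt;p&gt;A short sketch of both the failure and the fix, using the built-in &lt;code&gt;str&lt;/code&gt; function for the conversion:&lt;/p&gt;

```python
try:
    a3 = 'a' + 3  # mixing str and int with + is not defined
except TypeError as error:
    print('TypeError:', error)

# Converting the integer explicitly makes the concatenation valid.
a3 = 'a' + str(3)
print(a3)
```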

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;There are no braces&lt;/strong&gt;: code blocks are defined by indentation instead, so an &lt;code&gt;if&lt;/code&gt; block is written like this:&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;if b == 'C':
	print(&quot;b es C&quot;)
elif b == 'A':
	print(&quot;b es A&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or a slightly more complex piece of code would look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def del_none(d):
    for key, value in list(d.items()):
        if value is None:
            del d[key]
        elif isinstance(value, str):
            d[key] = d[key].strip()
        elif isinstance(value, dict):
            del_none(value)
    return d
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Ah, you probably noticed, but Python also does not require a &lt;code&gt;;&lt;/code&gt; to end each statement; the idea is that there is one statement per line.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;As you may have seen, it is also a &lt;strong&gt;high-level language&lt;/strong&gt;: the idea is to abstract (hide) as many implementation details as possible. Programs written in it are sometimes very easy to read and, in my opinion, in many cases it is like reading a book written in English.&lt;/li&gt;
  &lt;li&gt;Python is also &lt;strong&gt;multi-paradigm&lt;/strong&gt;: you can organise your code in classes, use it as a functional language, or simply write a program that runs procedurally… or a combination of all of these.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Highly extensible&lt;/strong&gt;: it supports downloading modules or libraries from package repositories that let you add functionality to your programs, so it is normal that when you download a project you have to install the associated packages with commands like the following:&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;pip install package-name
easy_install package-name
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;desventajas&quot;&gt;Disadvantages&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;It is considered &lt;strong&gt;slow&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Despite being widely used, there are areas where it does not have much impact, such as mobile development.&lt;/li&gt;
  &lt;li&gt;It uses a lot of memory.&lt;/li&gt;
  &lt;li&gt;It can make other languages harder to work with: you get used to Python’s niceties very quickly, and I already find myself forgetting semicolons in C#.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 05 Feb 2018 18:00:00 +0000</pubDate>
        <link>http://thatcsharpguy.com/tv/python/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/python/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
      </item>
    
      <item>
        <title>Sleep or code?</title>
        <description>&lt;p&gt;This is more of an opinion video, and I would like to start by saying this: not sleeping is not a badge of honour, and not getting enough sleep because you were working or doing homework should not be bragged about as an achievement. This may sound odd, but I think it is very common in the software development world, although I have also heard it in other professions, such as among doctors.&lt;/p&gt;

&lt;p&gt;Sure, there are times when it may be necessary: you ran into something really complicated, maybe you had to help someone else or look after your children… There are also occasions when staying awake can be fun, as with some hackathons, conferences or the now old-fashioned LAN parties. Whatever the reason, sometimes staying awake is necessary or fun, but in my opinion you should not let it become a way of life.&lt;/p&gt;

&lt;p&gt;There was a stage in my life when I thought sleeping was a waste of time: I always had to “be doing something productive”. That has changed, because I realised that sleeping the right amount is doing something productive; getting the sleep you need is not an expense but an investment, and a very good one. You have surely stayed up all night before, and you know how terrible you feel the next day:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You are in a bad mood,&lt;/li&gt;
  &lt;li&gt;your head hurts,&lt;/li&gt;
  &lt;li&gt;you do not feel much like talking to anyone,&lt;/li&gt;
  &lt;li&gt;you feel discouraged and uncreative,&lt;/li&gt;
  &lt;li&gt;you find it impossible to concentrate…&lt;/li&gt;
  &lt;li&gt;and on top of that you keep nodding off everywhere!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But think about it… Something got complicated, you started working in the afternoon and night is approaching; before you know it you start feeling sleepy and yawning. What do you do? Do you go to sleep? Or, if you already have a job, what happens if for some reason the development of a module was delayed and the client expects it the next morning? In any scenario that pushes you to the limit, do you go for a can of Red Bull or a cup of coffee to keep working? What do you normally do? Has something similar happened to you, or do you often stay up late? Tell us in the comments.&lt;/p&gt;

&lt;p&gt;In my opinion, having to stay awake to finish something is often a problem that can be tackled productively, and if you have to stay up late very often for this reason, you should try to find the causes that keep you from sleeping and eliminate them.&lt;/p&gt;

&lt;p&gt;It is also true, going back to what I said at the start of the video, that many companies turn not sleeping into a kind of invisible medal. I say invisible because I really doubt anyone would dare to hang an “employee of the month” style plaque that says “the one who sleeps the least”. This kind of medal is something we sometimes brag about among ourselves. You cannot imagine how many times I heard, and said myself: “I stayed up until three doing this” or “Dude, I have gone two days without sleep to finish X or Y”. At first it may sound fun and be a source of pride, but over time it will do nothing but hurt your performance and your health.&lt;/p&gt;

&lt;p&gt;Sometimes, though, you feel sooo inspired that you would rather finish what you are doing before going to sleep. My advice is: don’t. It is not worth sacrificing part (or all) of the next day to work two or three more hours at night. My recommendation is to take 5 minutes to write your idea down in the part of the code you are working on, or just grab a sheet of paper and write it there, and once that is done, go to sleep without remorse. If the idea you had before sleeping was that good, it will still be there when you wake up in the morning.&lt;/p&gt;

&lt;p&gt;If getting the sleep you need is a luxury you can afford, believe me, take it. Sleep as much as you need; your keyboard (and the problem you are facing) will still be there the next morning.&lt;/p&gt;
</description>
        <pubDate>Sat, 27 Jan 2018 18:00:00 +0000</pubDate>
        <link>http://thatcsharpguy.com/tv/dormir-programar/</link>
        <guid isPermaLink="true">http://thatcsharpguy.com/tv/dormir-programar/</guid>
        
        <category>Meta</category>
        
        <category>Tv</category>
        
        
        <category>tv</category>
        
      </item>
    
  </channel>
</rss>