<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[BrainsToBytes]]></title><description><![CDATA[Articles on software engineering, programming and good practices for the software industry.]]></description><link>https://www.brainstobytes.com/</link><image><url>https://www.brainstobytes.com/favicon.png</url><title>BrainsToBytes</title><link>https://www.brainstobytes.com/</link></image><generator>Ghost 5.79</generator><lastBuildDate>Fri, 23 Feb 2024 01:44:02 GMT</lastBuildDate><atom:link href="https://www.brainstobytes.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[It's Fine, Nobody Can Remember Everything]]></title><description><![CDATA[It's ok if you can't remember every single detail about the tech you use, nobody can.]]></description><link>https://www.brainstobytes.com/no-one-can-remember-everything/</link><guid isPermaLink="false">61605c9837f63f003b75a075</guid><category><![CDATA[Career & Soft Skills]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Mon, 11 Oct 2021 13:34:12 GMT</pubDate><content:encoded><![CDATA[<p>A couple of days ago I had a conversation with a friend who is learning to program. </p><p>We were talking about the difficulty of remembering what each concept means and what every keyword does. The conversation eventually led to this question:<br><br><em><strong>Ok, but when will I stop needing the docs?</strong></em><br><br>I (probably most people) had the same feeling when I was learning how to program. 
Somehow I envisioned that eventually, after learning for a very long time, I would know every single detail of the language I was using and every single function of the libraries I used.</p><p>It took me a few months working as a software developer to learn that this is not how you write software. I want to share some of the things I&apos;ve learned over the years about the relationship software developers have with documentation, their memory, and other online resources:</p><ul><li>Most of the important keywords and concepts will eventually stick, given that you write enough code. This happens every time you start learning a new programming language or technique: In the beginning, you will need to look up lots of stuff, but the more familiar you become with the tools you are using, the more details your brain will retain.</li><li>Understand concepts, look up details: It&apos;s much more important to understand the basic concepts and ideas and where to use them than to know details about a language&apos;s syntax. When you develop &quot;programmer intuition&quot; you will know what to use and when to use it. Everything else can be found in the documentation.</li><li>Everyone uses docs, everyone looks things up. I used to have this weird idea that veteran programmers never look anything up. It doesn&apos;t happen that way: every seasoned programmer understands the value of good docs and that the important part of programming is not the stuff you can memorize, but the connections you can make between concepts and how to turn specs into good software. Go to the office of any software development team and you will find the same thing on every desk: One monitor has a text editor or IDE, and the other has docs. </li><li>Use your <em>exocortex. </em>All those notebooks, doc pages, books, and basically anything where you can write down details are there to help you. They declutter your brain from specifics and let you concentrate on the important stuff. 
Remember, your brain has limited capacity, use it where it matters.</li></ul><p>That&apos;s it, do not stress too much if you can&apos;t remember everything you learn. Just keep writing code, keep building things, keep asking questions, and looking things up. Eventually, all these things will become second nature. &#xA0;And the ones that don&apos;t, it&apos;s fine, you can always find them in docs.</p><p>Thanks for reading! </p>]]></content:encoded></item><item><title><![CDATA[On Abstraction and Coupling]]></title><description><![CDATA[Article that talks about the role that abstractness and coupling play in the architecture of our software.]]></description><link>https://www.brainstobytes.com/on-abstraction-and-coupling/</link><guid isPermaLink="false">6092bf0895c335003b7a81cf</guid><category><![CDATA[Software Engineering & Design]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Wed, 05 May 2021 17:08:19 GMT</pubDate><content:encoded><![CDATA[<p>This article is about the second group of concepts I wanted to talk about after re-reading Clean Architecture.</p><p>I want to try something different this time: Instead of elaborating each idea in long, continuous prose, I&apos;ll just list them as separate chunks.</p><p>So, here it goes:</p><ul><li>We already know that tight coupling is a bad thing to have. It binds together software in a way that makes it hard to change and adapt to future changes. If two pieces of software are tightly coupled, they can&apos;t change independently and maintenance becomes harder as a result.</li><li>In an ideal scenario, most of your code should depend on <em>abstractions </em>instead of depending on concrete classes/software constructs. In statically typed OOP languages, this means that most references should be to either interfaces or abstract classes and that DI is used to control the dependency flow. 
For dynamically typed languages you can accomplish pretty much the same with duck typing.</li><li>Dependency cycles happen when a class depends on a class that depends on ... that depends on the first class of the chain. These are some of the most annoying things to deal with in a project but can be solved (in most cases) by using the Dependency Inversion Principle.</li><li>A common form of <em>depending on concrete things </em>is when you tie your project to specific technologies/libraries. Like directly calling DB wrappers across all your code (or worse, smearing SQL everywhere). Instead, you should hide those <em>details </em>behind a clearly defined interface tailored to the problem you are trying to solve. It will make things easier if you need to change tech in the future.</li><li>The real reason for depending on <em>abstractions </em>is that high-level concepts tend to change less frequently than implementation details. If you define a coherent set of behaviors in an interface, it&apos;s unlikely that it will change as often as the classes that implement it. It&apos;s about them being far more <strong>stable </strong>software constructs.</li><li>It&apos;s <strong>impossible </strong>to have a system with only stable components. Stability is tied to requirements, and if they change your system will need to change. The only thing you can do is to ensure that the components that embody that instability are not directly referenced and are hidden behind clean interfaces/abstractions.</li><li>Talking about &quot;concretions&quot;, you can measure the <em>dirtiness </em>of a component by checking how dependent it is on implementation details. The <strong>main</strong> method will very often be the dirtiest part of your system. 
This is ok, if any part of your system is allowed to binge on dependencies and tight coupling, that&apos;s the main method.</li><li>Another benefit of not tying your system to concrete implementations/technologies is that it becomes easier to totally remove or change them down the road. Hiding them behind an interface lets you <strong>delegate </strong>those choices to <em>future you. </em>The counterpart of you that lives in the future has much more knowledge about the system needs and is better equipped to make those choices, so make it easy for your buddy to implement those decisions.</li></ul><p>Those are the ones I can remember, they all embody the exact same idea: Build your system in a way that protects your business logic from knowing implementation details.</p><p>Nowadays we have a huge array of amazing languages that make this extremely easy, so go ahead and use those features to build great software!</p><p>Also, if you are interested in design principles for achieving this, you can take a look at a series of articles I wrote on the S.O.L.I.D principles, I hope you&apos;ll find them useful.</p><p>Thanks for reading.</p>]]></content:encoded></item><item><title><![CDATA[On Shape and Behavior]]></title><description><![CDATA[An article about the two most important ways a software engineer can provide value to a project: The architecture of the software and the functions it performs.]]></description><link>https://www.brainstobytes.com/on-shape-and-behavior/</link><guid isPermaLink="false">6087d829ebbab0003bd18606</guid><category><![CDATA[Software Engineering & Design]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Thu, 29 Apr 2021 16:35:48 GMT</pubDate><content:encoded><![CDATA[<p>I recently started re-reading Bob Martin&apos;s Clean Architecture and found two other ideas I wanted to share. 
One of them (the topic of this article) is the dual nature of the way software developers provide value through code.</p><p>When you implement (or modify) a feature in your system you are creating value by altering or expanding its <strong>behavior</strong>. Most business managers (or even developers) believe that this is all there is to software development: Find people who know how to make machines behave in a specific way and pay them to do it. Create requirements and they will implement them.</p><p>There is, of course, another important quality of the software we often overlook: The architecture/design/shape of the system. If you&apos;ve had the chance to work on projects with architectures of varying quality, you know the impact it can have on the future of development.</p><h3 id="so-why-is-architecture-so-important">So, why is architecture so important?</h3><p></p><p>Martin makes the case that the architecture of the system is more important than its behavior. His argument is that a system with good architecture, even if lacking in features, can be easily changed to accommodate any requirement. The opposite is not true: A system with bad architecture, even if doing all the required things for today&apos;s needs, is doomed to obsolescence.</p><p>And yes, software is <em>soft</em>, so you are right in thinking that <em>all </em>software can (technically) be changed. The problem happens when the costs of doing the changes exceed the benefits, leaving the system <em>practically immutable. </em>The reality is that most systems, sooner or later, become very difficult to change at least in part: Some classes or config values become so entangled that change becomes impractical.</p><p>A good architecture makes <em>the difficulty of implementing a change proportional to its scope, not the type of change. 
</em>An architecture that develops a strong preference for a specific type (or types) of change, and makes others very difficult, will eventually slow down ongoing development. That&apos;s why good architectures are usually <em>shape/type-of-change </em>agnostic and generalize well.</p><h3 id="so-if-its-such-a-bad-thing-why-does-it-happen">So, if it&apos;s such a bad thing, why does it happen?</h3><p></p><p>I think it&apos;s a combination of factors:</p><ul><li>Most business managers don&apos;t have the technical expertise to assess the importance of good architecture. From their point of view, the only thing that matters is that the system does what it needs to do, no matter how it&apos;s written.</li><li>Market pressure pushes developers into a train of thought that goes something like this: <em>Ok, now we need to put these features into production before the competition. Later, when all this is done, we will have time to refactor everything into better shape. </em>Spoiler alert: it never happens. Once the system hits production the demands and pressure only increase. The team is now responsible for ongoing development, maintenance, and operations!</li></ul><p>What happens next is something that you might have experienced, no matter if you are from business or tech: The changes become harder and harder to implement and the project slows down to a crawl. The costs of running the project increase year by year and recruiting more people doesn&apos;t seem to fix it.</p><p>This is frustrating for everyone. Management feels like they are just providing a stream of changes with similar scope, but each time it takes longer to get them into production. 
Developers feel as if they are handed a stream of increasingly complex pieces to fit into a monster that becomes more complex and difficult to tame.</p><h3 id="ok-get-it-but-what-can-i-do">Ok, get it, but what can I do?</h3><p></p><p>There is no easy answer, but in most successful teams, it goes something like this:</p><p>You need to remind yourself that <strong>you </strong>are the main stakeholder when it comes to the architecture and design of the system. Business managers are not equipped to evaluate the importance of the architecture, that&apos;s why they hired you! Your job is not only to make the system <em>behave </em>in the right way, but it&apos;s also to watch over the long-term quality of the system.</p><p>Also, people are willing to listen when the stakes are high. A system where the architecture comes last will eventually reach a point where changes are impossibly expensive. Talk to the other stakeholders and create the habit of balancing the development of new features, bug-fixing, and refactoring.</p><p>The reality is that at the end of the day, nobody but the development team <strong>really </strong>cares about the architecture of the system. So pick your fights, push back when needed and communicate the importance (and consequences) of keeping the system healthy.</p><p>I wish you luck.</p>]]></content:encoded></item><item><title><![CDATA[Domain-Driven Design]]></title><description><![CDATA[This article explores the basics of designing software systems based on strong conceptual models.]]></description><link>https://www.brainstobytes.com/domain-driven-design/</link><guid isPermaLink="false">60087e8c6704900039ecedeb</guid><category><![CDATA[Software Engineering & Design]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Wed, 20 Jan 2021 19:06:41 GMT</pubDate><content:encoded><![CDATA[<p><em>This article is a summary of what I consider to be the most important concepts of the book Domain-Driven Design, by Eric Evans. 
I tried to condense the most important ideas in a single article for anyone interested in the topic. I attempted to pack in as much information as possible, but it was not an easy task: The book is a very dense work with lots of practical examples. I urge you to read the complete book if you want to really get what this is all about. With that out of the way, let&apos;s take a look at domain-driven design in a single article.</em></p>
<p>I think it&apos;s a good idea to start by defining what we mean by <strong>domain</strong>.</p>
<p>We build software because we want to solve a problem related to a specific area or activity. This activity/area is the <strong>domain</strong> of our software and can be something as concrete as amigurumi dolls or as abstract as accounting.</p>
<p>Developers usually don&apos;t build software in isolation. Most applications require the input of <em>domain experts</em> who bring to the table valuable <em>domain knowledge</em> that developers often lack. Not every piece of knowledge about the domain is relevant, though. A team must put aside all the non-relevant details of the domain and focus on the most important concepts to build a <strong>model</strong> to serve as the base for our development.</p>
<p>This model will provide the team with the abstractions needed to build software that can solve the right problems and adapt to future changes in requirements. As you might imagine, there is an infinite number of different models for every single domain. It all depends on which details from the domain you ignore and which ones you add. The following 3 ideas will guide you in the selection of a good model:</p>
<ol>
<li>The model is distilled knowledge: It has only the details that are relevant to solving the problem at hand.</li>
<li>The model forms the basis for the language (spoken and written) used by the team.</li>
<li>The model and the implementation should shape each other through the course of the project.</li>
</ol>
<p>Let&apos;s discuss all 3 ideas in a bit more detail.</p>
<h4 id="1crunching-knowledge">1. Crunching knowledge</h4>
<p>A good model helps you create objects that are more than just glorified data structures that share a name with the concept they represent. They need to have meaningful behavior and relations to other objects in the model.</p>
<p>You have to be selective: Add to the model important concepts and discard unimportant ones. This is hard to achieve in practice, so you will probably need to try many different iterations of the model before you find out what <strong>important</strong> means in your particular context.</p>
<p>At the beginning of any project, team members lack the knowledge needed to create a good model. That&apos;s fine: as the project goes on, the team&apos;s knowledge improves and refining the model becomes easier. After several iterations business rules are better defined, ambiguities are resolved and the quality of the objects improves.</p>
<p>Creating a good model is hard, but engaging in continuous refinement will do the trick.</p>
<h4 id="2-the-model-as-the-foundation-of-the-teams-language">2. The model as the foundation of the team&apos;s language</h4>
<p>Humans have amazing language skills, so we might as well use them to aid us in the design process. In most projects, the language gap between developers and domain experts can become a problem. Talking in terms of database tables or data structures might mean very little to domain experts, whereas developers might find the professional jargon of the domain confusing.</p>
<p>Creating a language based on the model provides the team with a tool to discuss the project with enough precision to create a technical implementation.</p>
<p>The team should make the effort to create a language based on the model and use it as often as possible as an integral part of the development process.</p>
<h4 id="3binding-model-and-implementation">3. Binding model and implementation</h4>
<p>A model is much more than just a useful tool for aiding the initial stages of design. Models are the foundation of the design for the software we build.</p>
<p>The software entities that we create in our design should be representations of our model, but this is usually not easy. A model produced by careful analysis might be correct and still be difficult to implement.</p>
<p>A model that doesn&apos;t map easily into an implementation must be refined through several iterations. If done correctly we will create a model that captures the problem at hand and lends itself to an easy implementation.</p>
<p>Usually, the best strategy is to start with a limited design that reflects the model in a literal way with an obvious mapping. After that, we proceed to refine the model iteratively, making it easier and easier to implement without losing the essential details.</p>
<p><img src="https://www.brainstobytes.com/content/images/2021/01/RefinementCycle.png" alt="RefinementCycle" loading="lazy"></p>
<p>Models aren&apos;t just fancy constructs: they encode the most important knowledge about the domain. The model you produce as part of the design and development process is valuable in itself. You can make the argument that the software you make is valuable <em>because</em> it implements the model.</p>
<p>Just remember that every time you change the code (design), you perform an implicit change in the model. This is the reason all the developers in the team should be encouraged to take part in the modeling process.</p>
<p>Good. Now that we discussed the basic team aspects of domain-driven design, let&apos;s talk about the building blocks we can use to represent elements of a model.</p>
<h3 id="building-blocks-and-general-structure">Building blocks and general structure</h3>
<p>Layered architectures are quite useful for building software based on strong models. If you haven&apos;t heard this term before don&apos;t worry, the idea is quite simple: The software is organized into conceptual layers with well-defined responsibilities. To preserve integrity, the communication between layers is constrained: A layer can only communicate with (call methods on and hold references to) the layer immediately under it.</p>
<p>DDD can be implemented in a scheme with 4 layers:</p>
<p><img src="https://www.brainstobytes.com/content/images/2021/01/LayeredArchitecture.png" alt="LayeredArchitecture" loading="lazy"></p>
<p>Of all these layers, the one we focus on the most when doing domain-driven design is the domain layer. All the distilled knowledge will go into creating the objects that populate this layer.</p>
<p>Let&apos;s discuss a couple of ideas that will help you create a rich, expressive and clean domain layer.</p>
<h4 id="keep-your-associations-simple">Keep your associations simple</h4>
<p>An association between two objects is a dependency. Holding a reference to an object, calling one of its methods, or having any knowledge about it creates a dependency on it.</p>
<p>Associations aren&apos;t a bad thing. In fact, without them most software would be practically impossible to build. The problem starts when the number of associations becomes unmanageably large. The secret is to keep things simple and get rid of as many unnecessary associations as possible. In practice, I&apos;ve found that the most value is gained from two things:</p>
<ul>
<li>Get rid of bidirectional associations (both objects depend on each other).</li>
<li>Minimize the number of associations for every object in your system (duh).</li>
</ul>
<p>It&apos;s not easy to do, but there are well-known techniques to solve this problem. Most books on design patterns or software construction will give you tools for managing object dependencies (you can check my favorites in the reading list, but almost any book on the topic is good enough).</p>
<p>If you don&apos;t know how to do this yet, go and learn it. It will make your life much, much easier.</p>
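<p>As a rough illustration of the first point, here is a minimal Python sketch of replacing a bidirectional association with a one-way reference plus a query. The <em>Customer</em>, <em>Order</em>, and <em>orders_for</em> names are made up for this example and do not come from the book.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical type: it knows nothing about orders."""
    customer_id: str
    name: str


@dataclass(frozen=True)
class Order:
    """Hypothetical type: holds the only direction of the association."""
    order_id: str
    customer: Customer


def orders_for(customer, orders):
    """Answer 'which orders does this customer have?' with a query
    instead of a back-reference stored on Customer."""
    return [order for order in orders if order.customer == customer]


ana = Customer("c-1", "Ana")
luis = Customer("c-2", "Luis")
all_orders = [Order("o-1", ana), Order("o-2", luis), Order("o-3", ana)]
ana_order_ids = [order.order_id for order in orders_for(ana, all_orders)]
```

<p>Because <em>Customer</em> never references <em>Order</em>, each type can change without forcing a change in the other direction.</p>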
<h4 id="understand-object-identity">Understand object identity</h4>
<p>In the book (DDD), Evans makes a distinction between objects based on their <em>identity</em>.</p>
<p>An object whose identity is defined entirely by its attributes is called a <strong>value object</strong>. Two value objects with the same attributes are essentially the same thing, and the system will treat them as such. Suppose you are building software for a car factory. Due to specific project considerations, you end up creating a type for the different <strong>Brands</strong> of car the factory makes. The <em>Brand</em> object is defined entirely by its attributes: they are what set one brand apart from all the rest.</p>
<p>If two <em>Brand</em> objects with the same attributes were to exist at the same time, the system would not be able to tell them apart. Because of this, cars don&apos;t need to have their own copies of a specific <em>Brand</em>: cars of the same make can all reference a single, immutable instance of their particular brand in memory. Value objects tend to be either private members of other objects, one-off data containers to pass as arguments to functions, or immutable objects with multiple incoming references.</p>
<p>The other type of object is the <strong>entity</strong>. Entities transcend the contents of their attributes: Two entities can have the exact same attributes, but they still need to be treated as separate objects. Imagine an application that for one reason or another has an object <em>Person</em> with name, nationality, and favorite ice-cream flavor. Despite having the same attributes, the objects for me and my dad (both Juan Luis Orozco, both Costa Rican, and both with mint as favorite ice-cream flavor) should be treated as separate entities from creation to destruction. In practice, most objects with rich behaviors and relations in your application will be entities.</p>
<p>It&apos;s important to understand that the same real-world concept can take either of the two forms. Which one applies is dictated by the specifics of the problem you are trying to solve. Evans explains this with the following example:</p>
<p><em>Imagine you are building an application for managing the seats in a stadium for a ticketing system. If the application needs to take into consideration the specific seat customers book, the <em>Seat</em> objects will be entities: We care which seat we are going to get. If the application ignores their positions and numbers and customers can sit wherever they please, the <em>Seat</em> objects are going to be modeled as value objects because there is no important difference from the point of view of the ticket.</em></p>
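<p>One possible way to express the difference in code is sketched below in Python. The attribute names, the sample brand, and the choice of a random UUID as the identity of a <em>Person</em> are assumptions made for the example, not something the book prescribes.</p>

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Brand:
    """Value object: immutable, and equal whenever the attributes match."""
    name: str
    country: str


@dataclass(eq=False)
class Person:
    """Entity: equality follows the identity, not the attributes."""
    name: str
    nationality: str
    favorite_flavor: str
    # Assumed identity scheme for this sketch: a random UUID per object.
    id: UUID = field(default_factory=uuid4)

    def __eq__(self, other):
        return isinstance(other, Person) and self.id == other.id

    def __hash__(self):
        return hash(self.id)


# Two value objects with the same attributes are interchangeable.
same_brand = Brand("Acme", "Costa Rica") == Brand("Acme", "Costa Rica")

# Two entities with the same attributes are still different people.
me = Person("Juan Luis Orozco", "Costa Rican", "mint")
dad = Person("Juan Luis Orozco", "Costa Rican", "mint")
same_person = me == dad
```

<p>Here <em>same_brand</em> is true and <em>same_person</em> is false, even though both pairs of objects carry identical attribute values.</p>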
<h4 id="other-useful-constructs">Other useful constructs</h4>
<h5 id="services">Services</h5>
<p>Services let you model procedures in an OO way. They are objects that perform procedural tasks that don&apos;t belong to any other domain object and can find a place in any of the architectural layers we discussed before. Designing a good service is not a simple task, but most great services have some characteristics in common:</p>
<ul>
<li>The operations they perform relate to a domain concept that is not a natural responsibility of your entities or value objects.</li>
<li>The interface is defined in terms of other elements of the domain model (both arguments and how the methods fit in the model).</li>
<li>The operation they perform can have side effects, but the object that implements the service must be stateless.</li>
</ul>
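<p>The characteristics above could look roughly like this in Python. The <em>Account</em> entity and the <em>TransferService</em> are hypothetical names chosen for the sketch: the service has side effects on the accounts it receives but keeps no state of its own.</p>

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical entity used only for this illustration."""
    owner: str
    balance: int  # amount in cents

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

    def deposit(self, amount):
        self.balance += amount


class TransferService:
    """Stateless service: moving money between two accounts is not a
    natural responsibility of either account, so it lives here."""

    def transfer(self, source, target, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        source.withdraw(amount)
        target.deposit(amount)


checking = Account("Juan", 10_000)
savings = Account("Juan", 0)
TransferService().transfer(checking, savings, 2_500)
```

<p>The interface only mentions concepts from the model (accounts and amounts), which matches the second characteristic in the list.</p>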
<h5 id="modules">Modules</h5>
<p>Modules are a way of packaging together closely related parts of your system. Ideally, you want to put together concepts that can be understood and reasoned about independently of other parts of the model. Modules encompass a cohesive set of concepts that tell a story when put together and help you achieve high cohesion and low coupling.</p>
<p>As you might suspect by now, finding the boundaries and contents of each module is not an easy task. Gaining the amount of knowledge and understanding necessary for creating a good set of module boundaries will probably take you several iterations, and that&apos;s fine.</p>
<h5 id="aggregates">Aggregates</h5>
<p>Aggregates are a way of transforming a collection of objects with complicated relationships into a consistent construct that is easy to use and reason about. The following image shows a class diagram with a client object (Customer) interacting with two aggregates.</p>
<p><img src="https://www.brainstobytes.com/content/images/2021/01/Aggregates.png" alt="Aggregates" loading="lazy"></p>
<p>These are some important considerations when working with aggregates:</p>
<ul>
<li>An aggregate can contain any number of objects, but the root is always an Entity.</li>
<li>Nothing outside of the aggregate boundaries can hold a reference to anything inside of it. This means that all the functionality of the aggregate is served through the root (like a Facade). In the previous example, this means that the Customer can&apos;t hold permanent references to Wheel, Position, and/or Tire.</li>
<li>Roots can, in some circumstances, return a reference for an object inside of the boundary, but they can only be used by the client in a transient way.</li>
<li>Objects inside of the aggregate can hold references to the roots of other aggregates.</li>
<li>The root object is responsible for ensuring all the invariants of the aggregate.</li>
<li>Deleting the root of an aggregate deletes all the objects in it.</li>
</ul>
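<p>A small Python sketch of these rules follows. It assumes the root of the aggregate that contains Wheel, Position, and Tire is a <em>Car</em> entity identified by a VIN. That name, the four positions, and the methods are assumptions made for the illustration.</p>

```python
class Tire:
    def __init__(self):
        self.mileage = 0


class Wheel:
    def __init__(self, position):
        self.position = position
        self.tire = Tire()


class Car:
    """Aggregate root (assumed name): the only entry point to the
    wheels and tires inside the boundary."""

    POSITIONS = ("front-left", "front-right", "rear-left", "rear-right")

    def __init__(self, vin):
        self.vin = vin  # identity of the root entity
        # Invariant enforced by the root: one wheel per position.
        self._wheels = {p: Wheel(p) for p in self.POSITIONS}

    def drive(self, distance):
        if distance < 0:
            raise ValueError("distance cannot be negative")
        for wheel in self._wheels.values():
            wheel.tire.mileage += distance

    def replace_tire(self, position):
        if position not in self._wheels:
            raise KeyError(position)
        self._wheels[position].tire = Tire()

    def tire_mileage(self, position):
        # Return a plain value instead of leaking a reference to the Tire.
        return self._wheels[position].tire.mileage


car = Car("VIN-001")
car.drive(120)
car.replace_tire("front-left")
```

<p>Clients only ever talk to <em>Car</em>, so the root is free to reorganize wheels and tires internally without breaking anyone.</p>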
<h5 id="factories">Factories</h5>
<p>Factories are useful patterns for making the creation of objects easier. In this context, they become especially useful because they can create complete aggregates and ensure they start in a valid state where all invariants are satisfied.</p>
<p>And yes, I am talking about the good ol&apos; Abstract Factory and Factory Method patterns. So remember to delegate the creation of aggregates to the factories of your system.</p>
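<p>As a simplified sketch (a plain factory object rather than a full Abstract Factory), this is one way a factory can refuse to build an aggregate in an invalid state. The <em>Order</em> aggregate and its invariant are hypothetical and exist only for the example.</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OrderLine:
    product: str
    quantity: int


@dataclass
class Order:
    """Hypothetical aggregate root."""
    customer_id: str
    lines: list = field(default_factory=list)


class OrderFactory:
    """Builds complete Order aggregates that satisfy their invariants."""

    def create(self, customer_id, items):
        # Invariant assumed for this example: an order has at least one
        # line, and every line has a quantity of one or more.
        if not items:
            raise ValueError("an order needs at least one line")
        lines = []
        for product, quantity in items.items():
            if quantity < 1:
                raise ValueError(f"invalid quantity for {product}")
            lines.append(OrderLine(product, quantity))
        return Order(customer_id, lines)


order = OrderFactory().create("c-1", {"yarn": 2, "needle": 1})
```

<p>Client code never assembles order lines by hand, so every <em>Order</em> that comes out of the factory starts out valid.</p>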
<h5 id="persistence">Persistence</h5>
<p>Almost every application needs to have some form of persistent storage for the data of the system. There are two main ways of giving your system these capabilities:</p>
<ul>
<li>Make use of the ActiveRecord pattern and give objects the responsibility of managing their own CRUD capabilities.</li>
<li>Create a special family of objects whose only responsibility is to retrieve and store objects in a DB or any other form of storage. An object that does this is called a Repository and can be implemented in many ways depending on the project&apos;s needs.</li>
</ul>
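<p>To make the second option more concrete, here is a minimal in-memory Repository sketch in Python. The class and method names are assumptions for the example. A real implementation would talk to a database, but the domain code would use it the same way.</p>

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical entity stored by the repository."""
    customer_id: str
    name: str


class InMemoryCustomerRepository:
    """Repository sketch backed by a dict instead of a real database."""

    def __init__(self):
        self._rows = {}

    def add(self, customer):
        self._rows[customer.customer_id] = customer

    def get(self, customer_id):
        # Returns None when no customer is stored under that id.
        return self._rows.get(customer_id)

    def remove(self, customer_id):
        self._rows.pop(customer_id, None)


repository = InMemoryCustomerRepository()
repository.add(Customer("c-1", "Ana"))
found = repository.get("c-1")
missing = repository.get("c-404")
```

<p>The domain objects stay free of storage concerns, and swapping the dict for a database only touches the repository.</p>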
<h3 id="the-role-of-refactoring-in-ddd">The role of refactoring in DDD</h3>
<p>Refactoring usually refers to the act of performing incremental changes to the codebase to improve the structure and quality of the code. This is not what we mean by refactoring in this context.</p>
<p>In DDD, refactoring is about distilling the model into a better representation of the problem at hand. Changes performed to the model (and as a result, the implementation) are usually fueled by increased insight into the problems we are trying to solve. At the beginning of the project, this might be a difficult task, but after enough refactorings, we find ourselves at a point where modest time investments provide huge value increases in functionality.</p>
<p>Pay attention to the insights you gain as the project advances, and don&apos;t ignore an opportunity to improve a model. Really, don&apos;t be afraid to &apos;break it&apos;, models are malleable, and the potential benefits are huge!</p>
<p>Models are usually improved by <strong>taking an implicit concept and making it explicit</strong>. And yes, unfortunately, there is no shortcut to achieving this.</p>
<p>You will need to use the project&apos;s model language, talk to experts in the field, read books and refactor until you find the most important concepts. It&apos;s hard work, but all these things will help you in creating consistent, flexible, and explicit models that will make implementing new features fast and easy.</p>
<h3 id="supple-design">Supple design</h3>
<p>According to Evans, a supple design is <em>A design that puts the power inherent in a deep model into the hands of a client developer to make clear, flexible expressions that give expected results robustly. Equally important, it leverages that same deep model to make the design itself easy for the implementer to mold and reshape to accommodate new insight.</em></p>
<p>Sounds nice, doesn&apos;t it? The goal of the book (and this article) is to give you tools for achieving a supple design in any project you tackle. We have already discussed most of the domain-driven design fundamentals you will use to achieve this goal. These are other helpful ideas that will aid you in this task:</p>
<ul>
<li>Create intention-revealing interfaces. The names of the functions and their parameters must speak an intent.</li>
<li>Use assertions for preventing unexpected behavior and also for expressing intent.</li>
<li>Decompose design elements (interfaces, classes, aggregates) into cohesive units with clearly defined boundaries.</li>
<li>When possible, use a declarative style of architecture that can express functionality as behavior combinations.</li>
<li>Refresh your knowledge of design patterns and analysis patterns. If you don&apos;t know where to start, grab a copy of Analysis Patterns by Martin Fowler.</li>
</ul>
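<p>To make the first two ideas a bit more concrete, here is a minimal sketch in Python. The <code>Paint</code> class, its fields, and the <code>mixed_with</code> method are hypothetical names chosen only for illustration. The method name says what the operation means in the domain, and the assertions state what must hold before and after it runs:</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Paint:
    """Hypothetical value object: an amount of paint with a pigment strength."""
    volume_liters: float
    pigment_strength: float  # 0.0 means no pigment, 1.0 means fully saturated

    def mixed_with(self, other: "Paint") -> "Paint":
        """Return the paint you get by mixing this paint with another one."""
        total = self.volume_liters + other.volume_liters
        # Precondition: the assertion documents and enforces what callers must provide.
        assert total > 0, "cannot mix two empty paints"
        # Volume-weighted average of the two pigment strengths.
        strength = (
            self.volume_liters * self.pigment_strength
            + other.volume_liters * other.pigment_strength
        ) / total
        result = Paint(total, strength)
        # Postcondition: mixing never creates or destroys volume.
        assert result.volume_liters == total
        return result


mixed = Paint(1.0, 1.0).mixed_with(Paint(1.0, 0.0))
</code></pre>
<p>Because <code>Paint</code> is immutable, <code>mixed_with</code> returns a new object instead of changing either input, which keeps the behavior easy to predict for whoever calls it.</p>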
<h3 id="maintaining-model-integrity">Maintaining model integrity</h3>
<p>If a system is large enough, smaller models will start to sprout in different parts of the system. This is one of the reasons keeping model integrity becomes harder as systems grow. Let&apos;s take a look at some of the things you can do to protect your model from corruption if you find yourself working on a big enough project:</p>
<h5 id="bounded-context">Bounded context</h5>
<p>When code from different modules is combined, bugs eventually arise. You need to identify the context (parts of the system) where the model applies and protect it at all costs from creeping inconsistencies. Your team should watch out for inconsistencies inside of those boundaries at all times and correct them immediately.</p>
<h5 id="context-map">Context map</h5>
<p>When multiple models emerge in the system, you need a way to figure out how concepts in those models relate. For this, create a context map that explicitly specifies these relationships.</p>
<h5 id="shared-kernel">Shared kernel</h5>
<p>Sometimes, two or more models will have a part in common that you can&apos;t separate. For this, designate the shared pieces as the <em>shared kernel</em> of your models. To keep it consistent, teams must not change the shared kernel without consulting each other.</p>
<h5 id="anticorruption-layer">Anticorruption layer</h5>
<p>An anticorruption layer provides clients with functionality in terms of their own models. This way, you can keep the two parts of the system isolated and consistent. If you need to access functionality from the other parts of the system, you can use the interfaces they already have in place.</p>
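<p>A small sketch may help here. Everything in it is made up for illustration: <code>LegacyBillingSystem</code> stands in for some other part of the system with its own vocabulary, <code>Customer</code> is a concept from our model, and <code>BillingAnticorruptionLayer</code> is the only place that knows how to translate between the two:</p>
<pre><code class="language-python">from dataclasses import dataclass


class LegacyBillingSystem:
    """Hypothetical external system that speaks its own model."""

    def fetch_cust_rec(self, cust_no: str) -> dict:
        # The legacy record uses its own field names and stores money in cents.
        return {"CUST_NO": cust_no, "NM": "Ada Lovelace", "BAL_CENTS": 12550}


@dataclass(frozen=True)
class Customer:
    """Concept from our own model."""
    customer_id: str
    name: str
    balance: float


class BillingAnticorruptionLayer:
    """Translates legacy records into objects of our own model."""

    def __init__(self, legacy: LegacyBillingSystem) -> None:
        self._legacy = legacy

    def customer(self, customer_id: str) -> Customer:
        # Use the interface the other system already exposes...
        record = self._legacy.fetch_cust_rec(customer_id)
        # ...and convert its names and units into ours in one single place.
        return Customer(
            customer_id=record["CUST_NO"],
            name=record["NM"],
            balance=record["BAL_CENTS"] / 100,
        )


layer = BillingAnticorruptionLayer(LegacyBillingSystem())
customer = layer.customer("42")
</code></pre>
<p>The rest of our code only ever sees <code>Customer</code>, so a change in the legacy field names would be absorbed by the layer instead of leaking into the model.</p>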
<h3 id="distillation">Distillation</h3>
<p>These are some ideas you can use to split your model into different sub-sections:</p>
<p><strong>Core model</strong>: The core model is the most important part of the system, the essence of the problem your software solves. Separate it from the supporting models.<br>
<strong>Separate generic concepts</strong>: Grab the parts of your system that represent generic problem features and put them into their own model.<br>
<strong>Separate mechanisms</strong>: Grab the procedural/mechanical parts of your system and hide them behind an interface. If you need to provide your objects with some graph-like behavior, code it into its own construct and provide the feature through a clean interface.</p>
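<p>As a rough sketch of that last idea, assume a hypothetical <code>OrgChart</code> domain object that needs graph-like behavior. The graph traversal lives in its own generic <code>DirectedGraph</code> class (also a made-up name), and the domain object exposes it through methods named after domain concepts:</p>
<pre><code class="language-python">from collections import defaultdict, deque


class DirectedGraph:
    """Generic mechanism: it knows nothing about the domain."""

    def __init__(self) -> None:
        self._edges = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._edges[source].add(target)

    def reachable_from(self, start: str) -> set:
        """Breadth-first search for every node reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in self._edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen


class OrgChart:
    """Domain object: it speaks in terms of managers and reports."""

    def __init__(self) -> None:
        self._graph = DirectedGraph()

    def add_report(self, manager: str, report: str) -> None:
        self._graph.add_edge(manager, report)

    def everyone_under(self, manager: str) -> set:
        return self._graph.reachable_from(manager)


chart = OrgChart()
chart.add_report("ana", "ben")
chart.add_report("ben", "cho")
team = chart.everyone_under("ana")
</code></pre>
<p>Clients of <code>OrgChart</code> never see nodes or edges, so the traversal could be replaced later without touching the model.</p>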
<h3 id="final-considerations">Final considerations</h3>
<p>These are some other ideas for enriching the domain-driven design process:</p>
<ul>
<li>The design process must absorb feedback. Continuous communication and collaboration from every team member are essential for coming up with a successful model.</li>
<li>Decisions must reach the entire team. DDD is not an &apos;ivory tower architect&apos; process and requires that team members contribute and have a say in design matters.</li>
<li>Make a place in your design for the dynamic nature of the process. It will change, several times, before you reach a satisfactory solution.</li>
<li>Good design requires both minimalism and humility.</li>
</ul>
<h3 id="thanks-for-reading">Thanks for reading</h3>
<p>I am a believer in the importance of investing in our own professional growth as software developers and technology professionals. Lots of important ideas and processes have been documented in the last decades, but somehow we find ourselves learning the same lessons over and over again.</p>
<p>Lots of great books and documents have been written on the topic of making great software, and Domain-Driven Design is, in my opinion, one of the most important ones. I spent a lot of time trying to write an article with all the important stuff, but it&apos;s almost impossible to do it justice in so little space.</p>
<p>I really recommend the book. Despite being almost two decades old, I find the contents to be both refreshing and enlightening for the modern development world. Perhaps because of the importance that software has nowadays, I&apos;d dare to say that it&apos;s never been as relevant as it is today.</p>
<p>So, if you have the time, go and take a look. It&apos;s really worth it.</p>
<p>Thanks for reading!</p>
<h2 id="what-to-do-next">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li>This article is based on Domain-Driven Design, by Eric Evans. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments, or suggestions (the address is on the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[BrainsToBytes will be on hiatus until 2021]]></title><description><![CDATA[<p>Hello there,</p><p>I recently started working on a couple of side projects that require a bit more attention and time than I expected, so I decided to put the article schedule on hold for a while so that I can focus on finishing them.</p><p>I expect to be back with</p>]]></description><link>https://www.brainstobytes.com/brainstobytes-will-be-on-hiatus-until-2021/</link><guid isPermaLink="false">5fa13b4977c9a700393a0e53</guid><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 03 Nov 2020 11:21:44 GMT</pubDate><content:encoded><![CDATA[<p>Hello there,</p><p>I recently started working on a couple of side projects that require a bit more attention and time than I expected, so I decided to put the article schedule on hold for a while so that I can focus on finishing them.</p><p>I expect to be back with the usual posting frequency at the beginning of 2021, there are a couple of cool series in the works, but I&apos;d rather spend the time refining them than posting something I don&apos;t feel totally happy with. 
I might still post a couple of shorter articles every now and then, but for the time being BTB will be on hiatus.</p><p>Thank you for reading!</p>]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(11): The apply function]]></title><description><![CDATA[This article explains the basic usage of Pandas' apply function and serves as the conclusion for the Hands-on Pandas series.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-11-the-apply-function/</link><guid isPermaLink="false">5f941a3077c9a700393a0e42</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 27 Oct 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/10/1200px-Pandas_logo.svg.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/10/1200px-Pandas_logo.svg.png" alt="Hands-on Pandas(11): The apply function"><p>We have already covered most of the fundamentals of working with data using the Pandas library. There is one more topic I&apos;d like to discuss before concluding the series: the apply function.</p>
<p>In the previous article, we learned how to create subgroups of data using the <strong>groupby</strong> function. This is quite useful when you want to gain a better understanding of certain subsets of data or perform group aggregations. Today we will add another resource to your toolbox that will let you use those groups for much more.</p>
<p>Apply lets you perform more complex computations on the groups you create. It works like this: the function you provide to apply is called on each group, and the results are concatenated into a single final data structure.</p>
<p><img src="https://www.brainstobytes.com/content/images/2020/10/ApplyAtWork.png" alt="Hands-on Pandas(11): The apply function" loading="lazy"></p>
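<p>If you prefer code to diagrams, the same idea can be sketched by hand. The tiny <code>scores</code> table and the <code>top_scorer</code> function below are hypothetical and exist only for this illustration. We split the data into groups, call the function on each group, and concatenate the pieces, which is roughly what apply does for us:</p>
<pre><code class="language-python">import pandas as pd

# Tiny hypothetical table, just big enough to form two groups.
scores = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "player": ["ann", "bob", "cy", "dee"],
    "score": [3, 9, 5, 1],
})
cols = ["player", "score"]

def top_scorer(group):
    # Row with the highest score inside one group.
    return group.sort_values(by="score").tail(1)

# group_keys=False keeps the original row labels, so both results are comparable.
grouped = scores.groupby("team", group_keys=False)

# By hand: split into groups, call the function on each one, concatenate.
by_hand = pd.concat([top_scorer(group[cols]) for _, group in grouped])

# With apply: the same three steps in a single call.
with_apply = grouped[cols].apply(top_scorer)
</code></pre>
<p>Both variables should hold the same two rows, one per team.</p>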
<p>Again, this is much easier to understand with practical examples, so let&apos;s get started!</p>
<h2 id="basicapplicationsofapply">Basic applications of apply</h2>
<p>We will use the same table with Pokemon data we used in the last article.</p>
<p>First, let&apos;s import pandas and examine the contents of our DataFrame.</p>
<pre><code class="language-python">import pandas as pd

pdata = pd.read_csv(&apos;./sample_data/poke_colors.csv&apos;)
pdata
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Caterpie</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>30</td>
      <td>35</td>
      <td>20</td>
      <td>20</td>
      <td>45</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Metapod</td>
      <td>Green</td>
      <td>True</td>
      <td>50</td>
      <td>20</td>
      <td>55</td>
      <td>25</td>
      <td>25</td>
      <td>30</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Scyther</td>
      <td>Green</td>
      <td>False</td>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>55</td>
      <td>80</td>
      <td>105</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Bulbasaur</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Dratini</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Squirtle</td>
      <td>Blue</td>
      <td>True</td>
      <td>44</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>64</td>
      <td>43</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Poliwag</td>
      <td>Blue</td>
      <td>True</td>
      <td>40</td>
      <td>50</td>
      <td>40</td>
      <td>40</td>
      <td>40</td>
      <td>90</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Poliwhirl</td>
      <td>Blue</td>
      <td>True</td>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Charmander</td>
      <td>Red</td>
      <td>True</td>
      <td>39</td>
      <td>52</td>
      <td>43</td>
      <td>60</td>
      <td>50</td>
      <td>65</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Magmar</td>
      <td>Red</td>
      <td>False</td>
      <td>65</td>
      <td>95</td>
      <td>57</td>
      <td>100</td>
      <td>85</td>
      <td>93</td>
    </tr>
    <tr>
      <th>10</th>
      <td>Paras</td>
      <td>Red</td>
      <td>True</td>
      <td>35</td>
      <td>70</td>
      <td>55</td>
      <td>45</td>
      <td>55</td>
      <td>25</td>
    </tr>
    <tr>
      <th>11</th>
      <td>Parasect</td>
      <td>Red</td>
      <td>False</td>
      <td>60</td>
      <td>95</td>
      <td>80</td>
      <td>60</td>
      <td>80</td>
      <td>30</td>
    </tr>
    <tr>
      <th>12</th>
      <td>Pikachu</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>40</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>13</th>
      <td>Abra</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
      <td>20</td>
      <td>15</td>
      <td>105</td>
      <td>55</td>
      <td>90</td>
    </tr>
    <tr>
      <th>14</th>
      <td>Psyduck</td>
      <td>Yellow</td>
      <td>True</td>
      <td>50</td>
      <td>52</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>55</td>
    </tr>
    <tr>
      <th>15</th>
      <td>Kadabra</td>
      <td>Yellow</td>
      <td>True</td>
      <td>40</td>
      <td>35</td>
      <td>30</td>
      <td>120</td>
      <td>70</td>
      <td>10</td>
    </tr>
  </tbody>
</table>
</div>
<p>Apply&apos;s most important argument is a function. This function will be run on every group of data and the results will be concatenated in a final data structure. We will create a simple function that returns the two Pokemon with the highest attack value, something like this:</p>
<pre><code class="language-python"># Two pokes with the highest attack

def highest_attack(data_frame):
    # Remember how [] works, this selects the last two (highest) Attack entries after sorting
    return data_frame.sort_values(by=&apos;Attack&apos;)[-2:]

# Let&apos;s test it on the complete dataframe
highest_attack(pdata)
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>11</th>
      <td>Parasect</td>
      <td>Red</td>
      <td>False</td>
      <td>60</td>
      <td>95</td>
      <td>80</td>
      <td>60</td>
      <td>80</td>
      <td>30</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Scyther</td>
      <td>Green</td>
      <td>False</td>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>55</td>
      <td>80</td>
      <td>105</td>
    </tr>
  </tbody>
</table>
</div>
<p>Now let&apos;s see how to use apply to do something a bit more interesting. We want to find the two Pokemon with the highest attack value on a by-color basis. To do this, we will group them by Color and then pass <em>highest_attack</em> to apply, something like this:</p>
<pre><code class="language-python"># Now, let&apos;s find which are the two pokemon with the highest attack on each color group:
pdata.groupby(&apos;Color&apos;).apply(highest_attack)
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="2" valign="top">Blue</th>
      <th>4</th>
      <td>Dratini</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Poliwhirl</td>
      <td>Blue</td>
      <td>True</td>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">Green</th>
      <th>3</th>
      <td>Bulbasaur</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Scyther</td>
      <td>Green</td>
      <td>False</td>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>55</td>
      <td>80</td>
      <td>105</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">Red</th>
      <th>9</th>
      <td>Magmar</td>
      <td>Red</td>
      <td>False</td>
      <td>65</td>
      <td>95</td>
      <td>57</td>
      <td>100</td>
      <td>85</td>
      <td>93</td>
    </tr>
    <tr>
      <th>11</th>
      <td>Parasect</td>
      <td>Red</td>
      <td>False</td>
      <td>60</td>
      <td>95</td>
      <td>80</td>
      <td>60</td>
      <td>80</td>
      <td>30</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">Yellow</th>
      <th>14</th>
      <td>Psyduck</td>
      <td>Yellow</td>
      <td>True</td>
      <td>50</td>
      <td>52</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>55</td>
    </tr>
    <tr>
      <th>12</th>
      <td>Pikachu</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>40</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
  </tbody>
</table>
</div>
<p>Notice how the final table is the result of concatenating together the results of running <em>highest_attack</em> on every group!</p>
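<p>The group label shows up as an extra, outer index level in that result. If you would rather keep only the original row labels, <code>groupby</code> accepts a <code>group_keys</code> parameter. The snippet below is a small sketch that uses a few rows copied from the table above (the <code>sample</code>, <code>nested</code>, and <code>flat</code> names are mine) so that it runs without the CSV file. It also selects the columns to work on explicitly, which is an assumption I make to keep the behavior the same across Pandas versions:</p>
<pre><code class="language-python">import pandas as pd

# A few rows copied from the table above, so the snippet runs without the CSV.
sample = pd.DataFrame({
    "Name": ["Caterpie", "Scyther", "Bulbasaur", "Dratini", "Squirtle", "Poliwhirl"],
    "Color": ["Green", "Green", "Green", "Blue", "Blue", "Blue"],
    "Attack": [30, 110, 49, 64, 48, 65],
})
cols = ["Name", "Attack"]

def highest_attack(data_frame):
    # Same function as before: the last two rows after sorting by Attack.
    return data_frame.sort_values(by="Attack")[-2:]

# Default behavior: the group label becomes an outer index level.
nested = sample.groupby("Color")[cols].apply(highest_attack)

# group_keys=False: only the original row labels are kept.
flat = sample.groupby("Color", group_keys=False)[cols].apply(highest_attack)
</code></pre>
<p>Here <code>nested</code> has a two-level index (Color plus the original row label), while <code>flat</code> has a single level.</p>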
<h2 id="functionswithextraarguments">Functions with extra arguments</h2>
<p>The functions you pass to the apply method can receive additional arguments. Let&apos;s create another version of our function, this time called <strong>highest_attribute</strong>, that lets you specify the attribute to take into consideration and the n highest pokemon you want to select from each group:</p>
<pre><code class="language-python"># We set the default attribute as HP and the default n to 2
def highest_attribute(data_frame, attribute=&apos;HP&apos;, n=2):
    return data_frame.sort_values(by=attribute)[-n:]

pdata.groupby(&apos;Color&apos;).apply(highest_attribute, &apos;Defense&apos;, 3)
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="3" valign="top">Blue</th>
      <th>4</th>
      <td>Dratini</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Squirtle</td>
      <td>Blue</td>
      <td>True</td>
      <td>44</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>64</td>
      <td>43</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Poliwhirl</td>
      <td>Blue</td>
      <td>True</td>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th rowspan="3" valign="top">Green</th>
      <th>3</th>
      <td>Bulbasaur</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Metapod</td>
      <td>Green</td>
      <td>True</td>
      <td>50</td>
      <td>20</td>
      <td>55</td>
      <td>25</td>
      <td>25</td>
      <td>30</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Scyther</td>
      <td>Green</td>
      <td>False</td>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>55</td>
      <td>80</td>
      <td>105</td>
    </tr>
    <tr>
      <th rowspan="3" valign="top">Red</th>
      <th>10</th>
      <td>Paras</td>
      <td>Red</td>
      <td>True</td>
      <td>35</td>
      <td>70</td>
      <td>55</td>
      <td>45</td>
      <td>55</td>
      <td>25</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Magmar</td>
      <td>Red</td>
      <td>False</td>
      <td>65</td>
      <td>95</td>
      <td>57</td>
      <td>100</td>
      <td>85</td>
      <td>93</td>
    </tr>
    <tr>
      <th>11</th>
      <td>Parasect</td>
      <td>Red</td>
      <td>False</td>
      <td>60</td>
      <td>95</td>
      <td>80</td>
      <td>60</td>
      <td>80</td>
      <td>30</td>
    </tr>
    <tr>
      <th rowspan="3" valign="top">Yellow</th>
      <th>15</th>
      <td>Kadabra</td>
      <td>Yellow</td>
      <td>True</td>
      <td>40</td>
      <td>35</td>
      <td>30</td>
      <td>120</td>
      <td>70</td>
      <td>10</td>
    </tr>
    <tr>
      <th>12</th>
      <td>Pikachu</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>40</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>14</th>
      <td>Psyduck</td>
      <td>Yellow</td>
      <td>True</td>
      <td>50</td>
      <td>52</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>55</td>
    </tr>
  </tbody>
</table>
</div>
<p>Notice how the additional arguments are passed to <strong>apply</strong>, not to <strong>highest_attribute</strong> directly. Internally, apply forwards them to the function it calls on each group.</p>
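<p>Extra arguments can also be given by name, which tends to be easier to read than a list of positional values. This is a short sketch using a few rows copied from the table above (the <code>sample</code> frame and the explicit column selection are my own choices for the example, not part of the original code):</p>
<pre><code class="language-python">import pandas as pd

# A few rows copied from the table above, so the snippet runs without the CSV.
sample = pd.DataFrame({
    "Name": ["Caterpie", "Metapod", "Scyther", "Dratini", "Squirtle", "Poliwag"],
    "Color": ["Green", "Green", "Green", "Blue", "Blue", "Blue"],
    "Defense": [35, 55, 80, 45, 65, 40],
})

def highest_attribute(data_frame, attribute="HP", n=2):
    return data_frame.sort_values(by=attribute)[-n:]

grouped = sample.groupby("Color")[["Name", "Defense"]]

# Positional extra arguments, as in the example above.
positional = grouped.apply(highest_attribute, "Defense", 1)

# The same call with keyword arguments; apply forwards them unchanged.
by_keyword = grouped.apply(highest_attribute, attribute="Defense", n=1)
</code></pre>
<p>Both calls select the Pokemon with the highest Defense in each color group.</p>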
<h2 id="usinglambdasasanargumentforapply">Using lambdas as an argument for apply</h2>
<p>As a final note, sometimes you won&apos;t want to write a complete function definition if what you want to accomplish is very simple. In this case, you can pass a lambda function. In our next example we will use this approach to select, from each group, the Pokemon whose name comes first in alphabetical order:</p>
<pre><code class="language-python">pdata.groupby(&apos;Color&apos;).apply(lambda df: df.sort_values(&apos;Name&apos;).head(1) )
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <th>4</th>
      <td>Dratini</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>Green</th>
      <th>3</th>
      <td>Bulbasaur</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
    </tr>
    <tr>
      <th>Red</th>
      <th>8</th>
      <td>Charmander</td>
      <td>Red</td>
      <td>True</td>
      <td>39</td>
      <td>52</td>
      <td>43</td>
      <td>60</td>
      <td>50</td>
      <td>65</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <th>13</th>
      <td>Abra</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
      <td>20</td>
      <td>15</td>
      <td>105</td>
      <td>55</td>
      <td>90</td>
    </tr>
  </tbody>
</table>
</div>
<h2 id="practicemakesperfect">Practice makes perfect</h2>
<p>Apply is an incredibly flexible function that, if used in creative ways, lets you solve a huge variety of problems in data manipulation and transformation. This article exposed you to the basic concepts of the function, but make sure to study it further and experiment with real datasets.</p>
<p>As a closing remark, I&apos;d like to share a quotation from the book Python For Data Analysis (2nd), on which this series is largely based:</p>
<blockquote>
<p>Beyond these basic usage mechanics, getting the most out of apply may require some creativity. What occurs inside the function passed is up to you; it only needs to return a pandas object or a scalar value. The rest of this chapter will mainly consist of examples showing you how to solve various problems using groupby.</p>
</blockquote>
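<p>The examples in this article all returned DataFrames, so here is a quick sketch of the scalar case mentioned in the quote. It uses a few rows copied from the Pokemon table, and the <code>attack_spread</code> name is made up for the example. The lambda returns one number per group, and apply collects those numbers into a Series indexed by color:</p>
<pre><code class="language-python">import pandas as pd

# A few rows copied from the Pokemon table, so the snippet runs without the CSV.
sample = pd.DataFrame({
    "Name": ["Caterpie", "Scyther", "Bulbasaur", "Dratini", "Squirtle", "Poliwhirl"],
    "Color": ["Green", "Green", "Green", "Blue", "Blue", "Blue"],
    "Attack": [30, 110, 49, 64, 48, 65],
})

# The function returns a scalar: highest minus lowest Attack within the group.
attack_spread = sample.groupby("Color")["Attack"].apply(lambda s: s.max() - s.min())
</code></pre>
<p>For these rows the spread is 17 for the Blue group and 80 for the Green group.</p>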
<h2 id="thatsallthepandasicansharefornow">That&apos;s all the Pandas I can share, for now</h2>
<p>With this article, we conclude our Hands-on Pandas series. It&apos;s been a lot of fun to write, and I really hope you learned one or two interesting things along the way.</p>
<p>Pandas, like every other software tool or skill, requires a good amount of practice before it becomes truly useful. Don&apos;t worry if you don&apos;t immediately know how to tackle a dataset or which function to call. With experience and continued exposure, it will become second nature.</p>
<p>If you need help, remember that Pandas has some of the best docs around and a huge, helpful community that will guide you toward a solution. I wish you a happy and productive learning process!</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments, or suggestions (the address is on the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(10): Group Operations using groupby]]></title><description><![CDATA[This article explains how to use the groupby method and how to perform common aggregations on Pandas GroupBy objects.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-10-group-operations-using-groupby/</link><guid isPermaLink="false">5f2ace28be644300456d043a</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 20 Oct 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/08/1200px-Pandas_logo.svg.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/08/1200px-Pandas_logo.svg.png" alt="Hands-on Pandas(10): Group Operations using groupby"><p>Sometimes you need to perform operations on subsets of data. Your rows might have attributes in common or somehow form logical groups based on other properties. Finding the average, maximum, count, or standard deviation of values within groups of data is a very common task, and Pandas makes it easy to accomplish.</p>
<p>In this article, we will learn how to use the <code>groupby</code> function and study some of the built-in aggregations you can run on groups. This will give you another valuable tool for data analysis, and I hope it&apos;ll help you accomplish your tasks in a much simpler way.</p>
<p>Great, let&apos;s get started!</p>
<h2 id="loadingoursampledata">Loading our sample data</h2>
<p>We will use data from a CSV file I created with info about 16 Pokemon. It contains attributes like the Name, Color (Green, Blue, Red, Yellow), and other stats like HP, Attack, Defense, and Speed.</p>
<p>We are interested in calculating some common aggregations over groups of Pokemon with different colors.</p>
<pre><code class="language-python">import pandas as pd

pdata = pd.read_csv(&apos;./sample_data/poke_colors.csv&apos;)
pdata
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Caterpie</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>30</td>
      <td>35</td>
      <td>20</td>
      <td>20</td>
      <td>45</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Metapod</td>
      <td>Green</td>
      <td>True</td>
      <td>50</td>
      <td>20</td>
      <td>55</td>
      <td>25</td>
      <td>25</td>
      <td>30</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Scyther</td>
      <td>Green</td>
      <td>False</td>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>55</td>
      <td>80</td>
      <td>105</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Bulbasaur</td>
      <td>Green</td>
      <td>True</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Dratini</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Squirtle</td>
      <td>Blue</td>
      <td>True</td>
      <td>44</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>64</td>
      <td>43</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Poliwag</td>
      <td>Blue</td>
      <td>True</td>
      <td>40</td>
      <td>50</td>
      <td>40</td>
      <td>40</td>
      <td>40</td>
      <td>90</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Poliwhirl</td>
      <td>Blue</td>
      <td>True</td>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Charmander</td>
      <td>Red</td>
      <td>True</td>
      <td>39</td>
      <td>52</td>
      <td>43</td>
      <td>60</td>
      <td>50</td>
      <td>65</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Magmar</td>
      <td>Red</td>
      <td>False</td>
      <td>65</td>
      <td>95</td>
      <td>57</td>
      <td>100</td>
      <td>85</td>
      <td>93</td>
    </tr>
    <tr>
      <th>10</th>
      <td>Paras</td>
      <td>Red</td>
      <td>True</td>
      <td>35</td>
      <td>70</td>
      <td>55</td>
      <td>45</td>
      <td>55</td>
      <td>25</td>
    </tr>
    <tr>
      <th>11</th>
      <td>Parasect</td>
      <td>Red</td>
      <td>False</td>
      <td>60</td>
      <td>95</td>
      <td>80</td>
      <td>60</td>
      <td>80</td>
      <td>30</td>
    </tr>
    <tr>
      <th>12</th>
      <td>Pikachu</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>40</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>13</th>
      <td>Abra</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
      <td>20</td>
      <td>15</td>
      <td>105</td>
      <td>55</td>
      <td>90</td>
    </tr>
    <tr>
      <th>14</th>
      <td>Psyduck</td>
      <td>Yellow</td>
      <td>True</td>
      <td>50</td>
      <td>52</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>55</td>
    </tr>
    <tr>
      <th>15</th>
      <td>Kadabra</td>
      <td>Yellow</td>
      <td>True</td>
      <td>40</td>
      <td>35</td>
      <td>30</td>
      <td>120</td>
      <td>70</td>
      <td>10</td>
    </tr>
  </tbody>
</table>
</div>
<h2 id="groupby">GroupBy</h2>
<p>The Pandas <code>groupby</code> method creates GroupBy objects. These objects can perform lots of useful built-in aggregations with a single function call. <code>groupby</code> takes a key (or a list of keys) that determines how the rows are grouped. In our first example, we will group the Pokemon by color:</p>
<pre><code class="language-python">pg = pdata.groupby(&apos;Color&apos;)
pg
</code></pre>
<pre><code>&lt;pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff848e80f28&gt;
</code></pre>
<p>As the output above shows, we got a DataFrameGroupBy object. If you want a bit more information about it, use the <code>size</code> method, which returns the number of rows in each group:</p>
<pre><code class="language-python">pg.size()
</code></pre>
<pre><code>Color
Blue      4
Green     4
Red       4
Yellow    4
dtype: int64
</code></pre>
<p>Another way to get basic information about the groups is the <code>groups</code> attribute, a dictionary that maps each key to the index labels of its rows:</p>
<pre><code class="language-python">pg.groups
</code></pre>
<pre><code>{&apos;Blue&apos;: Int64Index([4, 5, 6, 7], dtype=&apos;int64&apos;),
 &apos;Green&apos;: Int64Index([0, 1, 2, 3], dtype=&apos;int64&apos;),
 &apos;Red&apos;: Int64Index([8, 9, 10, 11], dtype=&apos;int64&apos;),
 &apos;Yellow&apos;: Int64Index([12, 13, 14, 15], dtype=&apos;int64&apos;)}
</code></pre>
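<p>Because <code>groups</code> stores index labels, you can feed them back into <code>loc</code> to recover the rows of a group. The following is a minimal sketch that assumes a small stand-in dataframe (<code>pdata_small</code>, a name made up for this example) built from four rows of the table above:</p>

```python
import pandas as pd

# Hypothetical stand-in for pdata: four rows copied from the table above
pdata_small = pd.DataFrame({
    'Name': ['Caterpie', 'Scyther', 'Dratini', 'Squirtle'],
    'Color': ['Green', 'Green', 'Blue', 'Blue'],
    'HP': [45, 70, 41, 44],
})

pg_small = pdata_small.groupby('Color')

# groups maps each key to the index labels of the rows in that group
blue_labels = list(pg_small.groups['Blue'])
print(blue_labels)

# Those labels can be passed to loc to recover the rows themselves
blue_rows = pdata_small.loc[blue_labels]
print(blue_rows)

# ngroups reports how many groups were formed
print(pg_small.ngroups)
```

<p>The labels are those of the original dataframe, so the Blue rows keep labels 2 and 3 instead of being renumbered from zero.</p>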
<p>Great!</p>
<p>Now that you know how to create a basic group, let&apos;s check two important properties:</p>
<pre><code class="language-python"># The first one is that you can iterate over GroupBy objects.

for color, group in pg:
    print(color)
    print(group)
    print(&apos;\n\n&apos;)
</code></pre>
<pre><code>Blue
        Name Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
4    Dratini  Blue     True  41      64       45     50     50     50
5   Squirtle  Blue     True  44      48       65     50     64     43
6    Poliwag  Blue     True  40      50       40     40     40     90
7  Poliwhirl  Blue     True  65      65       65     50     50     90



Green
        Name  Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
0   Caterpie  Green     True  45      30       35     20     20     45
1    Metapod  Green     True  50      20       55     25     25     30
2    Scyther  Green    False  70     110       80     55     80    105
3  Bulbasaur  Green     True  45      49       49     65     65     45



Red
          Name Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
8   Charmander   Red     True  39      52       43     60     50     65
9       Magmar   Red    False  65      95       57    100     85     93
10       Paras   Red     True  35      70       55     45     55     25
11    Parasect   Red    False  60      95       80     60     80     30



Yellow
       Name   Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
12  Pikachu  Yellow     True  35      55       40     50     50     90
13     Abra  Yellow     True  25      20       15    105     55     90
14  Psyduck  Yellow     True  50      52       48     65     50     55
15  Kadabra  Yellow     True  40      35       30    120     70     10
</code></pre>
<pre><code class="language-python"># The second is that you can access the subgroups by providing the right key to the get_group method
pg.get_group(&apos;Blue&apos;)
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>4</th>
      <td>Dratini</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Squirtle</td>
      <td>Blue</td>
      <td>True</td>
      <td>44</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>64</td>
      <td>43</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Poliwag</td>
      <td>Blue</td>
      <td>True</td>
      <td>40</td>
      <td>50</td>
      <td>40</td>
      <td>40</td>
      <td>40</td>
      <td>90</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Poliwhirl</td>
      <td>Blue</td>
      <td>True</td>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
  </tbody>
</table>
</div>
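<p>The two properties fit together: iterating yields <code>(key, sub-dataframe)</code> pairs, and <code>get_group</code> returns the sub-dataframe for a single key. Here is a small sketch of that relationship, again assuming a made-up four-row stand-in for <code>pdata</code>:</p>

```python
import pandas as pd

# Hypothetical stand-in for pdata: four rows copied from the table above
pdata_small = pd.DataFrame({
    'Name': ['Caterpie', 'Scyther', 'Dratini', 'Squirtle'],
    'Color': ['Green', 'Green', 'Blue', 'Blue'],
    'HP': [45, 70, 41, 44],
})
pg_small = pdata_small.groupby('Color')

# Iterating yields (key, sub-dataframe) pairs, so a dict comprehension
# collects every group under its key
by_color = {color: group for color, group in pg_small}
print(sorted(by_color))

# get_group returns the rows stored under one key
blue = pg_small.get_group('Blue')
print(list(blue['Name']) == list(by_color['Blue']['Name']))
```

<p>Collecting the groups in a dictionary is handy when you need several of them at once. If you only need one, <code>get_group</code> is shorter.</p>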
<p>Groups are interesting because they let you calculate aggregations on user-defined subsections of your data. Let&apos;s calculate the mean of every stat for the color groups:</p>
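<pre><code class="language-python"># Evolves is boolean, so True is treated as 1 and False as 0 when calculating the mean.
# numeric_only=True skips the text column Name; pandas 2.0 and later raise a TypeError without it.
pg.mean(numeric_only=True)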
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>1.00</td>
      <td>47.50</td>
      <td>56.75</td>
      <td>53.75</td>
      <td>47.50</td>
      <td>51.00</td>
      <td>68.25</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>0.75</td>
      <td>52.50</td>
      <td>52.25</td>
      <td>54.75</td>
      <td>41.25</td>
      <td>47.50</td>
      <td>56.25</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>0.50</td>
      <td>49.75</td>
      <td>78.00</td>
      <td>58.75</td>
      <td>66.25</td>
      <td>67.50</td>
      <td>53.25</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>1.00</td>
      <td>37.50</td>
      <td>40.50</td>
      <td>33.25</td>
      <td>85.00</td>
      <td>56.25</td>
      <td>61.25</td>
    </tr>
  </tbody>
</table>
</div>
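<p>To make the boolean behavior concrete, here is a small sketch that uses only the four Red rows from the table above (the <code>red</code> dataframe is a stand-in created for this example). Two of the four Red Pokemon evolve, so the mean of <code>Evolves</code> is the share of <code>True</code> values:</p>

```python
import pandas as pd

# The four Red rows from the table above, reduced to a few columns
red = pd.DataFrame({
    'Name': ['Charmander', 'Magmar', 'Paras', 'Parasect'],
    'Color': ['Red'] * 4,
    'Evolves': [True, False, True, False],
    'HP': [39, 65, 35, 60],
})

# numeric_only=True skips the text column Name; booleans count as numeric
means = red.groupby('Color').mean(numeric_only=True)
print(means)

evolves_share = means.loc['Red', 'Evolves']  # 2 of 4 evolve
hp_mean = means.loc['Red', 'HP']             # (39 + 65 + 35 + 60) / 4
```

<p>Both values match the Red row of the table above: 0.50 for Evolves and 49.75 for HP.</p>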
<h2 id="groupingjustasubsetofcolumns">Grouping just a subset of columns</h2>
<p>Sometimes you only care about one or two columns. In that case, select them (together with the column you want to group by) from the dataframe and then call <code>groupby</code>:</p>
<pre><code class="language-python"># Let&apos;s create groups with just the HP values and then calculate the mean
pdata[[&apos;HP&apos;, &apos;Color&apos;]].groupby(&apos;Color&apos;).mean()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>47.50</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>52.50</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>49.75</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>37.50</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Let&apos;s do the same for Attack, Defense and Speed
pdata[[&apos;Color&apos;, &apos;Attack&apos;, &apos;Defense&apos;, &apos;Speed&apos;]].groupby(&apos;Color&apos;).mean()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Attack</th>
      <th>Defense</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>56.75</td>
      <td>53.75</td>
      <td>68.25</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>52.25</td>
      <td>54.75</td>
      <td>56.25</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>78.00</td>
      <td>58.75</td>
      <td>53.25</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>40.50</td>
      <td>33.25</td>
      <td>61.25</td>
    </tr>
  </tbody>
</table>
</div>
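<p>An alternative that is not used in the examples above is to group first and select the columns afterwards, by indexing the GroupBy object. The sketch below assumes a made-up four-row dataframe and compares both orders:</p>

```python
import pandas as pd

# Hypothetical stand-in for pdata: four rows copied from the table above
pdata_small = pd.DataFrame({
    'Name': ['Pikachu', 'Abra', 'Charmander', 'Magmar'],
    'Color': ['Yellow', 'Yellow', 'Red', 'Red'],
    'HP': [35, 25, 39, 65],
    'Attack': [55, 20, 52, 95],
})

# Select first, then group (the approach used above)
select_then_group = pdata_small[['HP', 'Color']].groupby('Color').mean()

# Group first, then select: a single label gives a Series
group_then_select = pdata_small.groupby('Color')['HP'].mean()

# A list of labels keeps the result as a dataframe
two_stats = pdata_small.groupby('Color')[['HP', 'Attack']].mean()

print(group_then_select)
print(two_stats)
```

<p>Both orders produce the same numbers. Selecting after grouping saves you from repeating the key column in the selection list.</p>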
<h2 id="groupsfrommultiplekeys">Groups from multiple keys</h2>
<p>You can create groups using more than one column as key. If you pass a list of keys to the <code>groupby</code> method, you will create a hierarchical grouping. Let&apos;s group by both the Color and Evolves columns:</p>
<pre><code class="language-python">opg = pdata.groupby([&apos;Color&apos;, &apos;Evolves&apos;])
opg.size()
</code></pre>
<pre><code>Color   Evolves
Blue    True       4
Green   False      1
        True       3
Red     False      2
        True       2
Yellow  True       4
dtype: int64
</code></pre>
<pre><code class="language-python">opg.groups
</code></pre>
<pre><code>{(&apos;Blue&apos;, True): Int64Index([4, 5, 6, 7], dtype=&apos;int64&apos;),
 (&apos;Green&apos;, False): Int64Index([2], dtype=&apos;int64&apos;),
 (&apos;Green&apos;, True): Int64Index([0, 1, 3], dtype=&apos;int64&apos;),
 (&apos;Red&apos;, False): Int64Index([9, 11], dtype=&apos;int64&apos;),
 (&apos;Red&apos;, True): Int64Index([8, 10], dtype=&apos;int64&apos;),
 (&apos;Yellow&apos;, True): Int64Index([12, 13, 14, 15], dtype=&apos;int64&apos;)}
</code></pre>
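<p>With more than one key, every group is identified by a tuple, as the keys of <code>groups</code> show. The following sketch assumes a made-up five-row stand-in for <code>pdata</code> and shows how to fetch one group by its tuple and how to flatten an aggregated result back into regular columns with <code>reset_index</code>:</p>

```python
import pandas as pd

# Hypothetical stand-in for pdata: five rows copied from the table above
pdata_small = pd.DataFrame({
    'Name': ['Caterpie', 'Scyther', 'Bulbasaur', 'Magmar', 'Paras'],
    'Color': ['Green', 'Green', 'Green', 'Red', 'Red'],
    'Evolves': [True, False, True, False, True],
    'HP': [45, 70, 45, 65, 35],
})
opg_small = pdata_small.groupby(['Color', 'Evolves'])

# Each group is identified by a (Color, Evolves) tuple
green_final = opg_small.get_group(('Green', False))
print(green_final)

# Aggregations come back with a two-level index;
# reset_index turns both levels into ordinary columns
flat = opg_small['HP'].mean().reset_index()
print(flat)
```

<p>The flattened result has one row per (Color, Evolves) combination that actually occurs in the data, which is often easier to filter or plot than a hierarchical index.</p>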
<h2 id="commonaggregations">Common aggregations</h2>
<p>Pandas ships with several built-in aggregations, and you can also define your own if you need to. In this section, we will run the following functions to get an idea of Pandas&apos; default behavior. All of them take only non-NA values into consideration:</p>
<ul>
<li><strong>count</strong>: Number of non-NA elements in each group.</li>
<li><strong>sum</strong>: Sum of values in each group.</li>
<li><strong>mean</strong>: Arithmetic mean of each group.</li>
<li><strong>max</strong>: Maximum value of each group.</li>
<li><strong>min</strong>: Minimum value of each group.</li>
<li><strong>std</strong>: Standard deviation of each group.</li>
<li><strong>var</strong>: Variance of each group.</li>
</ul>
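<p>The Pokemon data has no missing values, so the non-NA rule is easy to overlook. Here is a minimal sketch with made-up data (one HP value is deliberately missing) that shows how <code>count</code>, <code>size</code>, and <code>mean</code> treat it:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: one HP value is missing on purpose
df = pd.DataFrame({
    'Color': ['Blue', 'Blue', 'Blue', 'Red'],
    'HP': [40.0, np.nan, 50.0, 39.0],
})
g = df.groupby('Color')['HP']

counts = g.count()  # non-NA values only: 2 for Blue
sizes = g.size()    # every row, NA included: 3 for Blue
means = g.mean()    # the NaN is skipped: (40 + 50) / 2 = 45.0 for Blue

print(counts, sizes, means, sep='\n\n')
```

<p>If <code>count</code> and <code>size</code> disagree for a group, that group contains missing values in the column you are aggregating.</p>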
<p>Also, it&apos;s important to know that you don&apos;t need to assign the group to a variable. Very often, you will just use it as an intermediate value to call an aggregation. So, instead of writing this:</p>
<pre><code class="language-python">pg = pdata.groupby(&apos;Color&apos;)
# numeric_only=True skips the text column Name (required on pandas 2.0 and later)
pg.mean(numeric_only=True)
</code></pre>
<p>You can write this:</p>
<pre><code class="language-python">pdata.groupby(&apos;Color&apos;).mean(numeric_only=True)
</code></pre>
<pre><code class="language-python"># Let&apos;s drop the Name and Evolves columns to create a new dataframe with only Color and the numeric stats
spdata = pdata.drop([&apos;Name&apos;, &apos;Evolves&apos;], axis=1)
spdata
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Color</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Green</td>
      <td>45</td>
      <td>30</td>
      <td>35</td>
      <td>20</td>
      <td>20</td>
      <td>45</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Green</td>
      <td>50</td>
      <td>20</td>
      <td>55</td>
      <td>25</td>
      <td>25</td>
      <td>30</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Green</td>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>55</td>
      <td>80</td>
      <td>105</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Green</td>
      <td>45</td>
      <td>49</td>
      <td>49</td>
      <td>65</td>
      <td>65</td>
      <td>45</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Blue</td>
      <td>41</td>
      <td>64</td>
      <td>45</td>
      <td>50</td>
      <td>50</td>
      <td>50</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Blue</td>
      <td>44</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>64</td>
      <td>43</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Blue</td>
      <td>40</td>
      <td>50</td>
      <td>40</td>
      <td>40</td>
      <td>40</td>
      <td>90</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Blue</td>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Red</td>
      <td>39</td>
      <td>52</td>
      <td>43</td>
      <td>60</td>
      <td>50</td>
      <td>65</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Red</td>
      <td>65</td>
      <td>95</td>
      <td>57</td>
      <td>100</td>
      <td>85</td>
      <td>93</td>
    </tr>
    <tr>
      <th>10</th>
      <td>Red</td>
      <td>35</td>
      <td>70</td>
      <td>55</td>
      <td>45</td>
      <td>55</td>
      <td>25</td>
    </tr>
    <tr>
      <th>11</th>
      <td>Red</td>
      <td>60</td>
      <td>95</td>
      <td>80</td>
      <td>60</td>
      <td>80</td>
      <td>30</td>
    </tr>
    <tr>
      <th>12</th>
      <td>Yellow</td>
      <td>35</td>
      <td>55</td>
      <td>40</td>
      <td>50</td>
      <td>50</td>
      <td>90</td>
    </tr>
    <tr>
      <th>13</th>
      <td>Yellow</td>
      <td>25</td>
      <td>20</td>
      <td>15</td>
      <td>105</td>
      <td>55</td>
      <td>90</td>
    </tr>
    <tr>
      <th>14</th>
      <td>Yellow</td>
      <td>50</td>
      <td>52</td>
      <td>48</td>
      <td>65</td>
      <td>50</td>
      <td>55</td>
    </tr>
    <tr>
      <th>15</th>
      <td>Yellow</td>
      <td>40</td>
      <td>35</td>
      <td>30</td>
      <td>120</td>
      <td>70</td>
      <td>10</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Count the elements in each group
spdata.groupby(&apos;Color&apos;).count()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Calculate the sum in each group
spdata.groupby(&apos;Color&apos;).sum()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>190</td>
      <td>227</td>
      <td>215</td>
      <td>190</td>
      <td>204</td>
      <td>273</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>210</td>
      <td>209</td>
      <td>219</td>
      <td>165</td>
      <td>190</td>
      <td>225</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>199</td>
      <td>312</td>
      <td>235</td>
      <td>265</td>
      <td>270</td>
      <td>213</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>150</td>
      <td>162</td>
      <td>133</td>
      <td>340</td>
      <td>225</td>
      <td>245</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Calculate the mean in each group
spdata.groupby(&apos;Color&apos;).mean()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>47.50</td>
      <td>56.75</td>
      <td>53.75</td>
      <td>47.50</td>
      <td>51.00</td>
      <td>68.25</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>52.50</td>
      <td>52.25</td>
      <td>54.75</td>
      <td>41.25</td>
      <td>47.50</td>
      <td>56.25</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>49.75</td>
      <td>78.00</td>
      <td>58.75</td>
      <td>66.25</td>
      <td>67.50</td>
      <td>53.25</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>37.50</td>
      <td>40.50</td>
      <td>33.25</td>
      <td>85.00</td>
      <td>56.25</td>
      <td>61.25</td>
    </tr>
  </tbody>
</table>
</div>
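<p>The first three aggregations are related: the mean of a group is its sum divided by its count. The sketch below checks this with the HP and Attack values of the four Blue Pokemon (the <code>blue</code> dataframe is a stand-in created for this example):</p>

```python
import pandas as pd

# HP and Attack of the four Blue Pokemon from the table above
blue = pd.DataFrame({
    'Color': ['Blue'] * 4,
    'HP': [41, 44, 40, 65],
    'Attack': [64, 48, 50, 65],
})
g = blue.groupby('Color')

totals = g.sum()    # HP: 190, Attack: 227
counts = g.count()  # 4 values in each column
manual_mean = totals / counts

print(manual_mean)
print(g.mean())
```

<p>Both printouts show 47.50 for HP and 56.75 for Attack, the same values as the Blue row in the table above.</p>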
<pre><code class="language-python"># Find the maximum of each stat for every group
spdata.groupby(&apos;Color&apos;).max()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>65</td>
      <td>65</td>
      <td>65</td>
      <td>50</td>
      <td>64</td>
      <td>90</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>70</td>
      <td>110</td>
      <td>80</td>
      <td>65</td>
      <td>80</td>
      <td>105</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>65</td>
      <td>95</td>
      <td>80</td>
      <td>100</td>
      <td>85</td>
      <td>93</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>50</td>
      <td>55</td>
      <td>48</td>
      <td>120</td>
      <td>70</td>
      <td>90</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Find the minimum of each stat for every group
spdata.groupby(&apos;Color&apos;).min()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>40</td>
      <td>48</td>
      <td>40</td>
      <td>40</td>
      <td>40</td>
      <td>43</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>45</td>
      <td>20</td>
      <td>35</td>
      <td>20</td>
      <td>20</td>
      <td>30</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>35</td>
      <td>52</td>
      <td>43</td>
      <td>45</td>
      <td>50</td>
      <td>25</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>25</td>
      <td>20</td>
      <td>15</td>
      <td>50</td>
      <td>50</td>
      <td>10</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Calculate the standard deviation of each stat for every group
spdata.groupby(&apos;Color&apos;).std()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>11.789826</td>
      <td>8.995369</td>
      <td>13.149778</td>
      <td>5.000000</td>
      <td>9.865766</td>
      <td>25.276801</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>11.902381</td>
      <td>40.335055</td>
      <td>18.803812</td>
      <td>22.126530</td>
      <td>29.580399</td>
      <td>33.260337</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>14.952703</td>
      <td>20.960280</td>
      <td>15.456929</td>
      <td>23.584953</td>
      <td>17.559423</td>
      <td>31.920474</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>10.408330</td>
      <td>16.258331</td>
      <td>14.221463</td>
      <td>32.914029</td>
      <td>9.464847</td>
      <td>37.941841</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Calculate the variance of each stat for every group
spdata.groupby(&apos;Color&apos;).var()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>139.000000</td>
      <td>80.916667</td>
      <td>172.916667</td>
      <td>25.000000</td>
      <td>97.333333</td>
      <td>638.916667</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>141.666667</td>
      <td>1626.916667</td>
      <td>353.583333</td>
      <td>489.583333</td>
      <td>875.000000</td>
      <td>1106.250000</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>223.583333</td>
      <td>439.333333</td>
      <td>238.916667</td>
      <td>556.250000</td>
      <td>308.333333</td>
      <td>1018.916667</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>108.333333</td>
      <td>264.333333</td>
      <td>202.250000</td>
      <td>1083.333333</td>
      <td>89.583333</td>
      <td>1439.583333</td>
    </tr>
  </tbody>
</table>
</div>
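<p>Both <code>std</code> and <code>var</code> compute the sample statistic by default (<code>ddof=1</code>), which divides by n - 1 instead of n. The following sketch works through the SpAtk values of the four Blue Pokemon (the <code>blue</code> dataframe is a stand-in created for this example):</p>

```python
import pandas as pd

# SpAtk of the four Blue Pokemon from the table above
blue = pd.DataFrame({
    'Color': ['Blue'] * 4,
    'SpAtk': [50, 50, 40, 50],
})
g = blue.groupby('Color')['SpAtk']

# By hand: the mean is 47.5, the squared deviations are
# 6.25, 6.25, 56.25 and 6.25 (75 in total), and 75 / (4 - 1) = 25
sample_var = g.var()            # ddof=1 is the default
population_var = g.var(ddof=0)  # divides by n instead: 75 / 4 = 18.75
sample_std = g.std()            # square root of the sample variance

print(sample_var, population_var, sample_std, sep='\n\n')
```

<p>The sample variance of 25 and the standard deviation of 5 match the Blue SpAtk entries in the two tables above. Pass <code>ddof=0</code> when your data is the whole population and not a sample.</p>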
<h2 id="customdefinedaggregations">Custom-defined aggregations</h2>
<p>GroupBy objects have a method called <code>agg</code> (short for <code>aggregate</code>) that lets you define and run your own aggregations over groups. In the next example, we will define a function that calculates the sum of the squares of each stat in every group:</p>
<pre><code class="language-python">def sum_of_squares(arr):
    sos = 0
    for element in arr:
        sos += element**2
    return sos

spdata.groupby(&apos;Color&apos;).agg(sum_of_squares)
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>HP</th>
      <th>Attack</th>
      <th>Defense</th>
      <th>SpAtk</th>
      <th>SpDef</th>
      <th>Speed</th>
    </tr>
    <tr>
      <th>Color</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Blue</th>
      <td>9442</td>
      <td>13125</td>
      <td>12075</td>
      <td>9100</td>
      <td>10696</td>
      <td>20549</td>
    </tr>
    <tr>
      <th>Green</th>
      <td>11450</td>
      <td>15801</td>
      <td>13051</td>
      <td>8275</td>
      <td>11650</td>
      <td>15975</td>
    </tr>
    <tr>
      <th>Red</th>
      <td>10571</td>
      <td>25654</td>
      <td>14523</td>
      <td>19225</td>
      <td>19150</td>
      <td>14399</td>
    </tr>
    <tr>
      <th>Yellow</th>
      <td>5950</td>
      <td>7354</td>
      <td>5029</td>
      <td>32150</td>
      <td>12925</td>
      <td>19325</td>
    </tr>
  </tbody>
</table>
</div>
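<p>The explicit loop works, but the values passed to the function support vectorized arithmetic, so the same aggregation can usually be written without a loop. This is a sketch of that variant, checked against the Blue HP and Attack values (the <code>blue</code> dataframe is a stand-in created for this example):</p>

```python
import pandas as pd

# HP and Attack of the four Blue Pokemon from the table above
blue = pd.DataFrame({
    'Color': ['Blue'] * 4,
    'HP': [41, 44, 40, 65],
    'Attack': [64, 48, 50, 65],
})

def sum_of_squares(arr):
    # arr holds the values of one group, so squaring and summing
    # can be done with vectorized arithmetic instead of a loop
    return (arr ** 2).sum()

result = blue.groupby('Color').agg(sum_of_squares)
print(result)
```

<p>The result is 9442 for HP and 13125 for Attack, the same numbers as the Blue row in the table above. On large dataframes the vectorized version tends to be noticeably faster than looping in Python.</p>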
<h2 id="aggregationsarepowerful">Aggregations are powerful!</h2>
<p>Grouping data for analysis using Pandas is efficient, elegant, and powerful. Traditional relational databases owe part of their popularity to this kind of functionality, and now you can get much of the same work done using Python.</p>
<p>This last function (<code>agg</code>) might be one of the most useful capabilities Pandas provides. Custom-defined aggregations are incredibly powerful. They let you calculate any crazy function you can come up with, in a repeatable and consistent way.</p>
<p>Custom aggregations are useful enough that we will dedicate the next article to expanding on the topic: we will talk about the <code>apply</code> method.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. These and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments, or suggestions (my address is on the <a href="https://www.brainstobytes.com/about">About Me page</a>).</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(9): Merging Dataframes]]></title><description><![CDATA[This article teaches you how to perform merges/joins using pandas dataframes.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-9-merging-dataframes/</link><guid isPermaLink="false">5f21882abe644300456d042c</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 13 Oct 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-4.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-4.png" alt="Hands-on Pandas(9): Merging Dataframes"><p>Merge/join operations in Pandas let you gather information from many tables into a single dataframe for further processing or analysis. This is another important skill that you will probably use a lot when working with data.</p>
<p>If you have some experience with relational databases, you will recognize the similarity to table joins. In this article, we will demo some of the most important behavior offered by Pandas&apos; <code>merge</code> function. It will probably be more than enough to keep you going before you need to consult the documentation.</p>
<p>Great, let&apos;s get started!</p>
<h2 id="loadingthedatafortheexamples">Loading the data for the examples</h2>
<p>For the demos, we will use data from two different files. One of them contains some non-numeric attributes of a few Pokemon, the other contains base stats like HP, Attack, and Speed.</p>
<pre><code class="language-python">import pandas as pd

attribs = pd.read_csv(&apos;./sample_data/poke_attributes.csv&apos;)
attribs
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python">stats = pd.read_csv(&apos;./sample_data/poke_stats.csv&apos;)
stats
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Ditto</td>
      <td>48</td>
      <td>48</td>
      <td>48</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Dratini</td>
      <td>41</td>
      <td>64</td>
      <td>50</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Pikachu</td>
      <td>35</td>
      <td>55</td>
      <td>90</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Caterpie</td>
      <td>45</td>
      <td>30</td>
      <td>45</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Vulpix</td>
      <td>38</td>
      <td>41</td>
      <td>65</td>
    </tr>
  </tbody>
</table>
</div>
<p>Notice that Ditto, Dratini, and Pikachu appear in both tables, while Abra and Ekans appear only in the first one and Caterpie and Vulpix only in the second.</p>
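<p>If you don&apos;t have the CSV files at hand, the sketch below rebuilds both tables inline (copying the values shown above) and uses plain Python sets to confirm which names overlap. The variable names <code>in_both</code>, <code>only_left</code>, and <code>only_right</code> are just illustrative choices.</p>

```python
import pandas as pd

# Inline stand-ins for the two CSV files, using the values from the tables above.
attribs = pd.DataFrame({
    'Name': ['Abra', 'Ekans', 'Ditto', 'Dratini', 'Pikachu'],
    'Type': ['Psychic', 'Poison', 'Normal', 'Dragon', 'Electric'],
    'Color': ['Yellow', 'Purple', 'Pink', 'Blue', 'Yellow'],
    'Evolves': [True, True, False, True, True],
})
stats = pd.DataFrame({
    'Name': ['Ditto', 'Dratini', 'Pikachu', 'Caterpie', 'Vulpix'],
    'HP': [48, 41, 35, 45, 38],
    'Attack': [48, 64, 55, 30, 41],
    'Speed': [48, 50, 90, 45, 65],
})

left_names = set(attribs['Name'])
right_names = set(stats['Name'])

in_both = sorted(left_names & right_names)      # keys an inner join would keep
only_left = sorted(left_names - right_names)    # keys present only in attribs
only_right = sorted(right_names - left_names)   # keys present only in stats

print(in_both)     # ['Ditto', 'Dratini', 'Pikachu']
print(only_left)   # ['Abra', 'Ekans']
print(only_right)  # ['Caterpie', 'Vulpix']
```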
<p>Now let&apos;s go back to the programming bits: Pandas&apos; <strong>merge</strong> function links the rows of two dataframes using one or more keys. In the following examples, we will use the <strong>Name</strong> column as the key to show the different types of merge/join.</p>
<h2 id="innerjoins">Inner Joins</h2>
<p>An inner join is performed on the intersection of the keys of the two dataframes. In this case, it finds the names that appear in both dataframes (Ditto, Dratini, and Pikachu) and creates a new dataframe with the columns of both originals.</p>
<p>We will call the merge function specifying two additional arguments:</p>
<ul>
<li><strong>how</strong>: Tells merge which type of join to perform.</li>
<li><strong>on</strong>: Tells merge which columns to use as key.</li>
</ul>
<pre><code class="language-python">innerjoin = pd.merge(attribs, stats, on=&apos;Name&apos;, how=&apos;inner&apos;)
innerjoin
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
      <td>48</td>
      <td>48</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>50</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>90</td>
    </tr>
  </tbody>
</table>
</div>
<p>If you want to perform an inner join you can skip the <strong>how</strong> argument, because an inner join is the default behavior of the merge function:</p>
<pre><code class="language-python">innerjoin = pd.merge(attribs, stats, on=&apos;Name&apos;)
innerjoin
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
      <td>48</td>
      <td>48</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>50</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>90</td>
    </tr>
  </tbody>
</table>
</div>
<h2 id="outerjoins">Outer joins</h2>
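<p>Outer joins are performed on the union of the keys of the two dataframes. In this case, the result contains every name from both dataframes, and the fields with no matching data are filled with NaN. The numeric columns show up as floats (48.0 instead of 48) because NaN is a floating-point value.</p>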
<pre><code class="language-python">outerjoin = pd.merge(attribs, stats, on=&apos;Name&apos;, how=&apos;outer&apos;)
outerjoin
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
      <td>48.0</td>
      <td>48.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41.0</td>
      <td>64.0</td>
      <td>50.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35.0</td>
      <td>55.0</td>
      <td>90.0</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Caterpie</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>45.0</td>
      <td>30.0</td>
      <td>45.0</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Vulpix</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>38.0</td>
      <td>41.0</td>
      <td>65.0</td>
    </tr>
  </tbody>
</table>
</div>
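<p>When an outer join produces many NaN values, it can help to see which side each row came from. The sketch below uses merge&apos;s <code>indicator=True</code> option, which adds a <code>_merge</code> column, on two small made-up frames. The row order of an outer join can differ between pandas versions, so the example looks rows up by name instead of relying on position.</p>

```python
import pandas as pd

# Small illustrative frames (not the article's CSV files).
attribs = pd.DataFrame({'Name': ['Abra', 'Ditto'], 'Type': ['Psychic', 'Normal']})
stats = pd.DataFrame({'Name': ['Ditto', 'Vulpix'], 'HP': [48, 38]})

outer = pd.merge(attribs, stats, on='Name', how='outer', indicator=True)

# _merge tells us whether the key was found on the left, the right, or both sides.
origin = dict(zip(outer['Name'], outer['_merge'].astype(str)))
print(origin['Abra'])    # left_only
print(origin['Ditto'])   # both
print(origin['Vulpix'])  # right_only

# HP started as integers, but the NaN for Abra forces a float column.
print(outer['HP'].dtype)  # float64
```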
<h2 id="leftjoins">Left joins</h2>
<p>A left join keeps every row of the left dataframe and fills in the columns of the right dataframe wherever the key matches. Rows without a match get NaN in those columns. This might be a bit easier to understand with an example:</p>
<pre><code class="language-python">leftjoin = pd.merge(attribs, stats, on=&apos;Name&apos;, how=&apos;left&apos;)
leftjoin
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
      <td>48.0</td>
      <td>48.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41.0</td>
      <td>64.0</td>
      <td>50.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35.0</td>
      <td>55.0</td>
      <td>90.0</td>
    </tr>
  </tbody>
</table>
</div>
<p>Notice how Abra and Ekans don&apos;t have values in HP, Attack, and Speed. This happens because the right table does not contain values for these particular Pokemon.</p>
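<p>A left join is also a handy way to find the rows that have no partner in the other table. The following sketch, on small made-up frames, filters the merged result for rows whose <code>HP</code> is missing. The name <code>unmatched</code> is only illustrative, and the approach assumes HP is never missing in the right table itself.</p>

```python
import pandas as pd

# Small illustrative frames (not the article's CSV files).
attribs = pd.DataFrame({'Name': ['Abra', 'Ditto', 'Pikachu'],
                        'Type': ['Psychic', 'Normal', 'Electric']})
stats = pd.DataFrame({'Name': ['Ditto', 'Pikachu'], 'HP': [48, 35]})

left = pd.merge(attribs, stats, on='Name', how='left')

# Rows whose HP is NaN had no partner in the right dataframe.
unmatched = left.loc[left['HP'].isna(), 'Name'].tolist()
print(unmatched)  # ['Abra']
```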
<h2 id="rightjoins">Right joins</h2>
<p>A right join keeps every row of the right dataframe and fills in the columns of the left dataframe wherever the key matches. Rows without a match get NaN in those columns. This might be a bit easier to understand with an example:</p>
<pre><code class="language-python">rightjoin = pd.merge(attribs, stats, on=&apos;Name&apos;, how=&apos;right&apos;)
rightjoin
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>Attack</th>
      <th>Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
      <td>48</td>
      <td>48</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
      <td>64</td>
      <td>50</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
      <td>55</td>
      <td>90</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Caterpie</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>45</td>
      <td>30</td>
      <td>45</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Vulpix</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>38</td>
      <td>41</td>
      <td>65</td>
    </tr>
  </tbody>
</table>
</div>
<p>Notice how Caterpie and Vulpix don&apos;t have values in Type, Color, and Evolves. This happens because the left table does not contain values for these particular Pokemon.</p>
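<p>One way to convince yourself that left and right joins are mirror images is to swap the arguments. In this sketch (made-up frames again), a right join of <code>attribs</code> with <code>stats</code> holds the same data as a left join of <code>stats</code> with <code>attribs</code>. Only the column order differs, so the columns are aligned before comparing.</p>

```python
import pandas as pd

# Small illustrative frames (not the article's CSV files).
attribs = pd.DataFrame({'Name': ['Abra', 'Ditto', 'Pikachu'],
                        'Type': ['Psychic', 'Normal', 'Electric']})
stats = pd.DataFrame({'Name': ['Ditto', 'Pikachu', 'Vulpix'],
                      'HP': [48, 35, 38]})

right_join = pd.merge(attribs, stats, on='Name', how='right')
swapped_left = pd.merge(stats, attribs, on='Name', how='left')

# Same rows and values; reorder the columns so both frames line up.
same = right_join.equals(swapped_left[right_join.columns])
print(same)
```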
<h2 id="otherconsiderations">Other considerations</h2>
<h4 id="joinswithkeycolumnsofdifferentnames">Joins with key columns of different names</h4>
<p>If the key columns have different names in each dataframe, you can specify them with <code>left_on</code> and <code>right_on</code>:</p>
<pre><code class="language-python">pd.merge(leftdataframe, rightdataframe, left_on=&apos;left_key&apos;, right_on=&apos;right_key&apos;)
</code></pre>
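<p>Here is a runnable version of the call above. The frames and the <code>Pokemon</code> column name are hypothetical, chosen only to have two differently named key columns. Note that both key columns end up in the result, so you may want to drop one of them.</p>

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
attribs = pd.DataFrame({'Name': ['Ditto', 'Pikachu'], 'Type': ['Normal', 'Electric']})
stats = pd.DataFrame({'Pokemon': ['Pikachu', 'Ditto'], 'HP': [35, 48]})

merged = pd.merge(attribs, stats, left_on='Name', right_on='Pokemon')
print(list(merged.columns))  # ['Name', 'Type', 'Pokemon', 'HP']

# Both key columns survive the merge; drop the redundant one if you don't need it.
merged = merged.drop(columns='Pokemon')
print(merged)
```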
<h4 id="joiningonindexes">Joining on indexes</h4>
<p>Sometimes you want to use the index of the dataframes as the key for the join. For this to work, you just need to pass two additional parameters set to True: <code>left_index=True, right_index=True</code></p>
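<p>A minimal sketch of an index-based join, assuming two made-up frames that use the Pokemon name as their index:</p>

```python
import pandas as pd

# Hypothetical frames indexed by Pokemon name instead of having a Name column.
attribs = pd.DataFrame({'Type': ['Normal', 'Electric']}, index=['Ditto', 'Pikachu'])
stats = pd.DataFrame({'HP': [35, 48]}, index=['Pikachu', 'Ditto'])

# Rows are matched by index label, not by position.
by_index = pd.merge(attribs, stats, left_index=True, right_index=True)
print(by_index)
print(by_index.loc['Ditto', 'HP'])  # 48
```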
<h4 id="rememberingwhichjointypeyouneed">Remembering which join type you need</h4>
<p>If you are having trouble recalling which type of join is which, take a look at the following image:</p>
<p><img src="https://www.brainstobytes.com/content/images/2020/10/joins.png" alt="Hands-on Pandas(9): Merging Dataframes" loading="lazy"></p>
<h2 id="multisourcedata">Multi-source data</h2>
<p>In the previous article, we started talking about the reality of working with real data: You&apos;d be really lucky to find a clean dataset in a single repository. In reality, you will probably need to select and aggregate data from lots of different sources.</p>
<p>Merge is a relatively simple function at a basic level, but it&apos;s an incredibly useful and rich tool. You just learned how to use the basics, but it&apos;s also a good idea to read the documentation to get an idea of the full capabilities of this function.</p>
<p>In the next article, we will learn how to perform group operations in Pandas data structures.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. These and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(8): Cleaning Data]]></title><description><![CDATA[This article explains how to use Pandas' basic functions for data cleaning.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-8-cleaning-data/</link><guid isPermaLink="false">5f16d21b77355700397743f5</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 06 Oct 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-3.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-3.png" alt="Hands-on Pandas(8): Cleaning Data"><p>In an ideal world, all the data you need is available in the right format and with complete content.</p>
<p>In the real world, you will probably need to scrape data from lots of different and incomplete sources. That&apos;s why it&apos;s important to learn how to clean your data before analyzing it or feeding it into a ML algorithm.</p>
<p>Data cleaning might not be the most glamorous part of your work, but it&apos;s a crucial part of the development of data-based products. Not only is it important for the whole pipeline, but it&apos;s also one of the most time-consuming tasks in a project (some estimate that it consumes around 80% of a project&apos;s time).</p>
<p>In this article, we will learn some basic data cleaning techniques that will let you handle the most common situations.</p>
<p>Great, let&apos;s get started!</p>
<h2 id="dealingwithmissingvalues">Dealing with missing values</h2>
<p>Dealing with missing values is a basic (and extremely useful) technique. The data you&apos;ll have access to will probably be missing some attributes for some entries, and it&apos;s better to spend some time explicitly handling each case.</p>
<p>Imagine you have a csv file with the following content:</p>
<pre><code>Name,Type,Color,Evolves,HP
Abra,Psychic,,True,
Pikachu,Electric,,True,35
Ekans,,Purple,,35
Dratini,,Blue,,41
Ditto,Normal,Pink,False,48
</code></pre>
<p>As you can see, it&apos;s missing entries for several rows. If you load it using <code>read_csv</code>, the resulting dataframe will look like this:</p>
<pre><code class="language-python">import pandas as pd
frame = pd.read_csv(&apos;./sample_data/pokes_missing.csv&apos;)
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>NaN</td>
      <td>True</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>NaN</td>
      <td>True</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>NaN</td>
      <td>Purple</td>
      <td>NaN</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>NaN</td>
      <td>Blue</td>
      <td>NaN</td>
      <td>41.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
    </tr>
  </tbody>
</table>
</div>
<p>Pandas fills every entry without a value with the default NaN <strong>sentinel value</strong>. In many situations, you just want to remove the rows with missing values and keep the ones that have exclusively non-NaN fields. For this, you can use the <code>dropna</code> function:</p>
<pre><code class="language-python"># dropna does not affect the original dataframe, it returns a new one.
frame.dropna()
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># If you want to perform the dropna in place, pass inplace=True
frame.dropna(inplace=True)
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Let&apos;s reload the data:
frame = pd.read_csv(&apos;./sample_data/pokes_missing.csv&apos;)
</code></pre>
<p>Dropna can also be used differently: you can tell the function to drop only the rows that are composed exclusively of NaN values. This way you can do away with entries that don&apos;t provide any information. To do this, provide the optional parameter <code>how=&apos;all&apos;</code> when calling dropna.</p>
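<p>The difference between the default behavior and <code>how=&apos;all&apos;</code> is easier to see on a tiny example. The frame below is made up for illustration and contains one row that is entirely empty:</p>

```python
import pandas as pd

# Made-up frame: row 0 is partially filled, row 1 is completely empty, row 2 is complete.
frame = pd.DataFrame({
    'Name': ['Abra', None, 'Ditto'],
    'Color': [None, None, 'Pink'],
    'HP': [None, None, 48],
})

any_missing = frame.dropna()           # default how='any': drops rows 0 and 1
all_missing = frame.dropna(how='all')  # drops only row 1, which is entirely missing

print(list(any_missing.index))  # [2]
print(list(all_missing.index))  # [0, 2]
```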
<p>Dropping all rows with NaN values might not be the best option. Very often, it is better to fill in the missing entries with values you provide or with values calculated from the dataframe itself. Pandas provides a function called <code>fillna</code> that lets you handle missing values:</p>
<pre><code class="language-python"># Let&apos;s fill every missing entry with the value 10
frame.fillna(10)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>10</td>
      <td>True</td>
      <td>10.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>10</td>
      <td>True</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>10</td>
      <td>Purple</td>
      <td>10</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>10</td>
      <td>Blue</td>
      <td>10</td>
      <td>41.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
    </tr>
  </tbody>
</table>
</div>
<p>Do you notice anything wrong with the dataframe? Well, for starters, the value 10 might work reasonably well for HP, but it doesn&apos;t make any sense for Type, Color or Evolves. <code>fillna</code> lets you specify how to fill NaN values for every column if you pass a dictionary instead of a single value:</p>
<pre><code class="language-python">frame.fillna({&apos;Type&apos;: &apos;Unknown&apos;,
              &apos;Color&apos;: &apos;Yellow&apos;,
              &apos;Evolves&apos;: True,
              &apos;HP&apos;: 30})
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>30.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Unknown</td>
      <td>Purple</td>
      <td>True</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Unknown</td>
      <td>Blue</td>
      <td>True</td>
      <td>41.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
    </tr>
  </tbody>
</table>
</div>
<p>Now the dataframe makes a bit more sense! As with dropna, the change performed by fillna doesn&apos;t happen on the original dataframe; it creates a new dataframe with the new values. If you want to change the original dataframe, provide the inplace argument:</p>
<pre><code class="language-python">frame.fillna({&apos;Type&apos;: &apos;Unknown&apos;,
              &apos;Color&apos;: &apos;Yellow&apos;,
              &apos;Evolves&apos;: True,
              &apos;HP&apos;: 30},
             inplace=True)
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>30.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Unknown</td>
      <td>Purple</td>
      <td>True</td>
      <td>35.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Unknown</td>
      <td>Blue</td>
      <td>True</td>
      <td>41.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48.0</td>
    </tr>
  </tbody>
</table>
</div>
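<p>This section mentioned filling missing entries with values calculated from the dataframe itself. One common choice, shown here as a sketch on made-up numbers, is to use the mean of the column. Whether the mean is a sensible replacement depends on your data.</p>

```python
import pandas as pd

# Made-up frame with one missing HP value.
frame = pd.DataFrame({
    'Name': ['Abra', 'Pikachu', 'Ekans', 'Ditto'],
    'HP': [None, 35.0, 35.0, 50.0],
})

# mean() skips missing entries, so this is (35 + 35 + 50) / 3 = 40.0
mean_hp = frame['HP'].mean()
frame['HP'] = frame['HP'].fillna(mean_hp)

print(frame)
```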
<h2 id="dealingwithduplicates">Dealing with duplicates</h2>
<p>Sometimes your data will contain duplicate rows or values with the same fields. Take a look at this dataframe:</p>
<pre><code class="language-python">frame = pd.read_csv(&apos;./sample_data/pokes_duplicates.csv&apos;)
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
  </tbody>
</table>
</div>
<p>The frame has two Dratinis and two Abras, so let&apos;s get rid of the repeated rows using drop_duplicates:</p>
<pre><code class="language-python">frame.drop_duplicates()
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
<p>Sometimes you want to remove duplicates based on the value of an attribute. In this case, you can provide an extra argument (a column name, or a list of column names if you want to use multiple attributes) to specify the columns to take into consideration. For example, the following call ensures we have only rows with unique colors:</p>
<pre><code class="language-python"># Abra and Pikachu are both yellow; only the first one (Abra) is kept, so Pikachu has to go
frame.drop_duplicates(&apos;Color&apos;)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
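<p>Two related tools are worth knowing: <code>duplicated</code> shows which rows would be removed, and the <code>keep</code> argument of <code>drop_duplicates</code> controls which copy survives. The frame below is made up for illustration.</p>

```python
import pandas as pd

# Made-up frame: row 3 repeats row 0, and three rows share the color Yellow.
frame = pd.DataFrame({
    'Name': ['Abra', 'Pikachu', 'Ekans', 'Abra'],
    'Color': ['Yellow', 'Yellow', 'Purple', 'Yellow'],
})

# duplicated() marks rows that repeat an earlier row in full.
flags = frame.duplicated().tolist()
print(flags)  # [False, False, False, True]

# keep='last' retains the last row for each color instead of the first.
last_per_color = frame.drop_duplicates('Color', keep='last')
print(list(last_per_color.index))  # [2, 3]
```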
<h2 id="mappingsandothertransformations">Mappings and other transformations</h2>
<p>In this section, we will learn about some other techniques for modifying your data. The first technique is to use the map function to alter a column in a dataframe:</p>
<pre><code class="language-python">frame = pd.read_csv(&apos;./sample_data/pokes.csv&apos;)
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Let&apos;s transform the type into an all-uppercase string
frame[&apos;Type&apos;] = frame[&apos;Type&apos;].map(lambda x: x.upper())
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>PSYCHIC</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>ELECTRIC</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>POISON</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>DRAGON</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>NORMAL</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
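<p>Besides a lambda, there are two other common ways to transform a column, sketched here on a made-up frame. The <code>.str</code> accessor offers vectorized string methods (and, as far as I know, passes missing values through instead of raising an error), while <code>map</code> also accepts a dictionary. The abbreviations in the dictionary are invented for this example.</p>

```python
import pandas as pd

frame = pd.DataFrame({'Name': ['Abra', 'Pikachu', 'Ditto'],
                      'Type': ['Psychic', 'Electric', 'Normal']})

# Vectorized string method; same result as map(lambda x: x.upper()) here.
upper = frame['Type'].str.upper()
print(upper.tolist())  # ['PSYCHIC', 'ELECTRIC', 'NORMAL']

# map also accepts a dictionary; values missing from it become NaN.
short = frame['Type'].map({'Psychic': 'PSY', 'Electric': 'ELE'})
print(short.isna().tolist())  # [False, False, True]
```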
<p>The next thing we can do is use the replace function to change values. In the following example, we will replace the boolean values of the Evolves column with the strings &apos;Yes&apos; and &apos;No&apos;:</p>
<pre><code class="language-python">frame[&apos;Evolves&apos;] = frame[&apos;Evolves&apos;].replace([True, False], [&apos;Yes&apos;,&apos;No&apos;])
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>PSYCHIC</td>
      <td>Yellow</td>
      <td>Yes</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>ELECTRIC</td>
      <td>Yellow</td>
      <td>Yes</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>POISON</td>
      <td>Purple</td>
      <td>Yes</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>DRAGON</td>
      <td>Blue</td>
      <td>Yes</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>NORMAL</td>
      <td>Pink</td>
      <td>No</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
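<p>As a side note, the same substitution can be written with a dictionary, which keeps each old value next to its replacement. The sketch below is only an illustration: <code>sample</code> is a made-up two-row frame standing in for the article&apos;s data, not the frame used above.</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical two-row stand-in for the frame used in this article
sample = pd.DataFrame({"Name": ["Abra", "Ditto"],
                       "Evolves": [True, False]})

# map looks up every value of the column in the dictionary
sample["Evolves"] = sample["Evolves"].map({True: "Yes", False: "No"})
sample
</code></pre>
<p>Keep in mind that <code>map</code> turns any value missing from the dictionary into NaN, so this form is best suited to columns whose possible values you know in advance.</p>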
<p>We can also use conditional selection to get just the data we need. Suppose we are interested only in the rows with HP under 40:</p>
<pre><code class="language-python">frame[frame[&apos;HP&apos;] &lt; 40]
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>PSYCHIC</td>
      <td>Yellow</td>
      <td>Yes</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>ELECTRIC</td>
      <td>Yellow</td>
      <td>Yes</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>POISON</td>
      <td>Purple</td>
      <td>Yes</td>
      <td>35</td>
    </tr>
  </tbody>
</table>
</div>
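<p>Conditional selection also works with <code>loc</code>, which lets you pick rows and columns in one step. The following is a minimal sketch on a reduced copy of the data (only the Name and HP columns); the names <code>sample</code>, <code>mid_hp</code> and <code>selected</code> are just for illustration.</p>
<pre><code class="language-python">import pandas as pd

# Reduced copy of the article's data: only the columns needed here
sample = pd.DataFrame({"Name": ["Abra", "Pikachu", "Ekans", "Dratini", "Ditto"],
                       "HP": [25, 35, 35, 41, 48]})

# between builds a boolean mask; both ends are inclusive by default
mid_hp = sample["HP"].between(30, 45)

# loc takes the mask for the rows and a list of labels for the columns
selected = sample.loc[mid_hp, ["Name", "HP"]]
selected
</code></pre>
<p>The mask is an ordinary boolean series, so you can store it in a variable and reuse it for several selections.</p>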
<p>The last thing we will deal with is transforming categorical data into one-hot encoded columns. <a href="https://www.brainstobytes.com/one-hot-encoding-with-pokemon/">If you want a more in-depth explanation of one-hot encoding you can read this article</a>:</p>
<pre><code class="language-python"># get_dummies transforms categorical data into a one-hot encoded dataframe
# we will concatenate the original dataframe and the one-hot encoded type columns
# dtype=int keeps the 0/1 values shown below (recent pandas versions return booleans by default)

pd.concat([frame, pd.get_dummies(frame[&apos;Type&apos;], dtype=int)], axis=1)
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
      <th>DRAGON</th>
      <th>ELECTRIC</th>
      <th>NORMAL</th>
      <th>POISON</th>
      <th>PSYCHIC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>PSYCHIC</td>
      <td>Yellow</td>
      <td>Yes</td>
      <td>25</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>ELECTRIC</td>
      <td>Yellow</td>
      <td>Yes</td>
      <td>35</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>POISON</td>
      <td>Purple</td>
      <td>Yes</td>
      <td>35</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>DRAGON</td>
      <td>Blue</td>
      <td>Yes</td>
      <td>41</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>NORMAL</td>
      <td>Pink</td>
      <td>No</td>
      <td>48</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
</div>
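<p>If you prefer to replace the categorical column instead of keeping it next to the encoded ones, <code>get_dummies</code> can also work on the whole dataframe. This is a small sketch under assumed data: <code>sample</code> is a made-up two-row frame, and the <code>Type_</code> prefix is just one possible naming choice.</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical two-row frame used only for this illustration
sample = pd.DataFrame({"Name": ["Abra", "Pikachu"],
                       "Type": ["PSYCHIC", "ELECTRIC"]})

# columns= selects what to encode; the original Type column is dropped
# and the new columns are named prefix + "_" + category
encoded = pd.get_dummies(sample, columns=["Type"], prefix="Type", dtype=int)
encoded
</code></pre>
<p>The prefix makes it easier to tell where each encoded column came from when you encode more than one categorical column.</p>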
<h2 id="keepitclean">Keep it clean</h2>
<p>Getting perfect data at the beginning of the project is extremely unlikely. In reality, you will probably need to put together a dataset from many sources, and in the process, there might be some malformed, erroneous, or missing data.</p>
<p>That&apos;s ok as long as you know how to clean your data and get it in proper shape for your analysis/algorithms. It is often said that you will probably spend 80% of your time cleaning and wrangling data, so it&apos;s a good idea to learn one or two tricks.</p>
<p>In the next article, we will learn how to put together data from different sources into a single collection.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on <em>Python for Data Analysis</em>. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(7): Loading data from files]]></title><description><![CDATA[This article teaches the basics of data loading from files using Pandas' built-in functions.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-7-loading-data-from-files/</link><guid isPermaLink="false">5f0f084fb37b18004544c710</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 29 Sep 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-2.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-2.png" alt="Hands-on Pandas(7): Loading data from files"><p>Data analysis usually starts by loading data into the structures of your library/tools of choice. Almost always this data will either come from a database, the web, or a collection of files.</p>
<p>The files that contain your data can come in many different formats: comma-separated values in a text file, JSON files, Excel files, or files with values separated by custom characters. In this article, we will learn how to read data from some of the common file formats using Pandas&apos; built-in functions.</p>
<p>Great, let&apos;s get started!</p>
<h2 id="readingdatafromcsvfiles">Reading data from CSV files</h2>
<p>CSV is an incredibly common file format, and many tutorials and small data repositories use it as their default data format. Imagine you have a file with these contents at <em>./sample_data/pokes.csv</em>:</p>
<pre><code>Name,Type,Color,Evolves,HP
Abra,Psychic,Yellow,True,25
Pikachu,Electric,Yellow,True,35
Ekans,Poison,Purple,True,35
Dratini,Dragon,Blue,True,41
Ditto,Normal,Pink,False,48
</code></pre>
<p>You can easily put this data in a dataframe using <em>read_csv</em>:</p>
<pre><code class="language-python">import pandas as pd

poke_data = pd.read_csv(&apos;./sample_data/pokes.csv&apos;)
poke_data
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
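<p>If you want to try this without creating a file, <em>read_csv</em> also accepts file-like objects. The sketch below feeds it an in-memory copy of two of the rows (the variable name <code>csv_text</code> is just for illustration) and checks the column types that Pandas inferred:</p>
<pre><code class="language-python">import io

import pandas as pd

# In-memory copy of part of the file, so the example runs anywhere
csv_text = """Name,Type,Color,Evolves,HP
Abra,Psychic,Yellow,True,25
Ditto,Normal,Pink,False,48
"""

poke_data = pd.read_csv(io.StringIO(csv_text))

# read_csv infers a type for every column: Evolves becomes bool, HP an integer
poke_data.dtypes
</code></pre>
<p>Checking <code>dtypes</code> right after loading is a cheap way to catch columns that were read as text when you expected numbers.</p>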
<p>Note that <em>read_csv</em> automatically uses the first line as the column names and adds its own default row index. You can specify which column to use as the index with the <em>index_col</em> parameter:</p>
<pre><code class="language-python">poke_data = pd.read_csv(&apos;./sample_data/pokes.csv&apos;, index_col=&apos;Name&apos;)
poke_data
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Abra</th>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>Pikachu</th>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>Ekans</th>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>Dratini</th>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>Ditto</th>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
<p>Sometimes files don&apos;t have column names on the first row. In this case, you can pass the <em>names</em> parameter to <em>read_csv</em> to assign column names:</p>
<pre><code class="language-python">poke_data = pd.read_csv(&apos;./sample_data/pokes_no_header.csv&apos;,
                        names=[&apos;Name&apos;, &apos;Type&apos;, &apos;Color&apos;, &apos;Evolves&apos;, &apos;HP&apos;],
                        index_col=&apos;Name&apos;)
poke_data
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Abra</th>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>Pikachu</th>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>Ekans</th>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>Dratini</th>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>Ditto</th>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
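<p>The same idea can be tried on in-memory text. In this sketch <code>csv_text</code> is a made-up, header-less copy of two rows; passing <code>header=None</code> is optional when <em>names</em> is given, but it makes the intent explicit:</p>
<pre><code class="language-python">import io

import pandas as pd

# Header-less data: the first line is already a record
csv_text = """Abra,Psychic,Yellow,True,25
Ditto,Normal,Pink,False,48
"""

poke_data = pd.read_csv(io.StringIO(csv_text),
                        header=None,  # no header row in the data
                        names=["Name", "Type", "Color", "Evolves", "HP"],
                        index_col="Name")
poke_data
</code></pre>
<p>Without <em>names</em> (or <code>header=None</code>), the first record would be consumed as column names and silently disappear from the data.</p>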
<h2 id="readingdatafromtables">Reading data from tables</h2>
<p>Pandas has a function called <code>read_table</code> that lets you load files with data in tabular form. This function receives a parameter <em>sep</em> that specifies the separator character in the table. Imagine you have a file with the following contents:</p>
<pre><code>Name|Type|Color|Evolves|HP
Abra|Psychic|Yellow|True|25
Pikachu|Electric|Yellow|True|35
Ekans|Poison|Purple|True|35
Dratini|Dragon|Blue|True|41
Ditto|Normal|Pink|False|48
</code></pre>
<p>This table uses the &apos;|&apos; character as separator. We can load the data into pandas with the following call:</p>
<pre><code class="language-python">poke_data = pd.read_table(&apos;./sample_data/pokes_table&apos;, sep=&apos;|&apos;)
poke_data
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># You can emulate the behavior of read_csv by passing sep=&apos;,&apos; as an argument:
poke_data = pd.read_table(&apos;./sample_data/pokes.csv&apos;, sep=&apos;,&apos;)
poke_data
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
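<p>When no <em>sep</em> is given, <code>read_table</code> splits on tab characters. A minimal sketch, using a made-up in-memory string (<code>tsv_text</code>) instead of a file:</p>
<pre><code class="language-python">import io

import pandas as pd

# Tab-separated text; "\t" is the tab character
tsv_text = "Name\tType\tHP\nAbra\tPsychic\t25\nDitto\tNormal\t48\n"

# No sep argument: read_table uses the tab character by default
poke_data = pd.read_table(io.StringIO(tsv_text))
poke_data
</code></pre>
<p>This default is the main practical difference from <em>read_csv</em>, which splits on commas unless told otherwise.</p>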
<p>pd.read_table is very useful if you have a file with entries that are separated by a varying number of spaces or tabs. By passing the regular expression sep=&apos;\s+&apos; you will be able to load such files. Let&apos;s load a file with the following contents:</p>
<pre><code>Name  Type Color   Evolves      HP
Abra Psychic           Yellow  True  25
Pikachu  Electric  Yellow  True  35
Ekans   Poison       Purple  True  35
Dratini                 Dragon              Blue        True              41
Ditto    Normal  Pink  False  48
</code></pre>
<pre><code class="language-python">poke_data = pd.read_table(&apos;./sample_data/pokes_varspace&apos;, sep=r&apos;\s+&apos;)
poke_data
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Name</th>
      <th>Type</th>
      <th>Color</th>
      <th>Evolves</th>
      <th>HP</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Abra</td>
      <td>Psychic</td>
      <td>Yellow</td>
      <td>True</td>
      <td>25</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Pikachu</td>
      <td>Electric</td>
      <td>Yellow</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Ekans</td>
      <td>Poison</td>
      <td>Purple</td>
      <td>True</td>
      <td>35</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Dratini</td>
      <td>Dragon</td>
      <td>Blue</td>
      <td>True</td>
      <td>41</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Ditto</td>
      <td>Normal</td>
      <td>Pink</td>
      <td>False</td>
      <td>48</td>
    </tr>
  </tbody>
</table>
</div>
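<p>Here is a self-contained variation of the same call that you can run without the sample file. The text in <code>spaced_text</code> is made up for the illustration, and the pattern is written as a raw string so Python does not try to interpret the backslash:</p>
<pre><code class="language-python">import io

import pandas as pd

# Columns separated by an irregular number of spaces
spaced_text = """Name     Type    HP
Abra  Psychic        25
Ditto      Normal 48
"""

# The regular expression \s+ matches one or more whitespace characters
poke_data = pd.read_table(io.StringIO(spaced_text), sep=r"\s+")
poke_data
</code></pre>
<p>Note that this approach only works when the values themselves contain no spaces; a name like "Mr. Mime" would be split into two fields.</p>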
<h2 id="readingotherformats">Reading other formats</h2>
<p>The two other common formats you will probably find in practice are Excel-like files and JSON files. Pandas has the functions read_excel and read_json to aid you when working with those files. You can read the official documentation here:</p>
<ul>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html?ref=brainstobytes.com">read_excel</a></li>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html?ref=brainstobytes.com">read_json</a></li>
</ul>
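<p>As a quick taste of <code>read_json</code>, the sketch below builds a small JSON document in memory and loads it. The records and the names <code>records</code>, <code>json_text</code> and <code>poke_json</code> are only illustrative; real files often have a different layout, in which case the <em>orient</em> parameter described in the documentation is the place to look.</p>
<pre><code class="language-python">import io
import json

import pandas as pd

# A list of records, one dictionary per row
records = [{"Name": "Abra", "HP": 25},
           {"Name": "Pikachu", "HP": 35}]
json_text = json.dumps(records)

# read_json accepts file-like objects as well as paths
poke_json = pd.read_json(io.StringIO(json_text))
poke_json
</code></pre>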
<h2 id="filesarenoteverythingbutitsagreatstart">Files are not everything, but it&apos;s a great start</h2>
<p>While databases are probably the most common form of data storage you will deal with, knowing how to read data from files is a fundamental skill when working with Pandas. Lots of online repositories and tutorials offer huge libraries of data in several different file formats.</p>
<p>With the material covered in this article and access to Pandas&apos; excellent documentation, you will have no problem working with files.</p>
<p>Now that we can get the data into dataframes, the next step is learning how to clean it. In the next article, we will learn some techniques for preparing data for processing.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on <em>Python for Data Analysis</em>. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(6): Descriptive Statistics]]></title><description><![CDATA[This article explains how to calculate basic descriptive statistics using Pandas.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-6-descriptive-statistics/</link><guid isPermaLink="false">5f072f6f0d524f00394a657e</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 22 Sep 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-1.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg-1.png" alt="Hands-on Pandas(6): Descriptive Statistics"><p>Pandas provides many options for calculating descriptive statistics and other reduction operations with just a simple function call. You might want to calculate these values as part of a ML/Data Analysis pipeline, or just because you want to get a better understanding of the data you are dealing with.</p>
<p>Most of these operations are similar to NumPy reductions: applied to a series they return a single value, and applied to a dataframe they return a structure with the same number of dimensions or fewer (for example, a series with one value per column).</p>
<p>In this article, we will explore some of the most used functions and see some examples. Great, let&apos;s get started!</p>
<pre><code class="language-python">import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.rand(4,5),
                     index=[&apos;A&apos;, &apos;B&apos;, &apos;C&apos;, &apos;D&apos;],
                     columns=[&apos;One&apos;, &apos;Two&apos;, &apos;Three&apos;, &apos;Four&apos;, &apos;Five&apos;])
frame
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>One</th>
      <th>Two</th>
      <th>Three</th>
      <th>Four</th>
      <th>Five</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>A</th>
      <td>0.973939</td>
      <td>0.427195</td>
      <td>0.790004</td>
      <td>0.027722</td>
      <td>0.686339</td>
    </tr>
    <tr>
      <th>B</th>
      <td>0.190250</td>
      <td>0.891813</td>
      <td>0.238110</td>
      <td>0.636394</td>
      <td>0.104428</td>
    </tr>
    <tr>
      <th>C</th>
      <td>0.951482</td>
      <td>0.207945</td>
      <td>0.081066</td>
      <td>0.815889</td>
      <td>0.785882</td>
    </tr>
    <tr>
      <th>D</th>
      <td>0.699541</td>
      <td>0.154921</td>
      <td>0.752932</td>
      <td>0.066052</td>
      <td>0.825628</td>
    </tr>
  </tbody>
</table>
</div>
<p>The first thing we will learn is how to perform sums. The <code>sum</code> function performs sums along the rows axis by default (returns the sum of the values of every column). You can pass <code>axis=&apos;columns&apos;</code> as an additional parameter to perform the sum along the columns axis:</p>
<pre><code class="language-python">frame.sum()
</code></pre>
<pre><code>One      2.815213
Two      1.681874
Three    1.862111
Four     1.546057
Five     2.402276
dtype: float64
</code></pre>
<pre><code class="language-python">frame.sum(axis=&apos;columns&apos;)
</code></pre>
<pre><code>A    2.905198
B    2.060995
C    2.842264
D    2.499073
dtype: float64
</code></pre>
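<p>Because the frame above is random, the axis behaviour is easier to verify by hand on a tiny example. The following sketch uses a made-up 2x2 frame (<code>small</code>) with values chosen so the totals are obvious:</p>
<pre><code class="language-python">import pandas as pd

# Tiny frame with easy numbers so the sums can be checked by eye
small = pd.DataFrame([[1, 2],
                      [3, 4]],
                     index=["A", "B"],
                     columns=["One", "Two"])

# Default: collapse the rows, one total per column (One=4, Two=6)
column_totals = small.sum()

# axis="columns": collapse the columns, one total per row (A=3, B=7)
row_totals = small.sum(axis="columns")
</code></pre>
<p>A useful way to remember it: the axis you name is the one that disappears from the result.</p>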
<p>Pandas also lets you calculate the minimum and maximum values in a dataframe&apos;s columns or rows with the functions <code>min</code> and <code>max</code>. Like before, you can specify the axis:</p>
<pre><code class="language-python">frame.max()
</code></pre>
<pre><code>One      0.973939
Two      0.891813
Three    0.790004
Four     0.815889
Five     0.825628
dtype: float64
</code></pre>
<pre><code class="language-python">frame.min(axis=&apos;columns&apos;)
</code></pre>
<pre><code>A    0.027722
B    0.104428
C    0.081066
D    0.066052
dtype: float64
</code></pre>
<p>If you are instead interested in the <em>indexes</em> where the minimum and maximum values are located, just use <code>idxmax</code> and <code>idxmin</code>:</p>
<pre><code class="language-python">frame.idxmax() # Row label of the maximum value in each column
</code></pre>
<pre><code>One      A
Two      B
Three    A
Four     C
Five     D
dtype: object
</code></pre>
<pre><code class="language-python">frame.idxmin(axis=&apos;columns&apos;) # Column label of the minimum value in each row
</code></pre>
<pre><code>A     Four
B     Five
C    Three
D     Four
dtype: object
</code></pre>
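<p>To make the meaning of the returned labels concrete, here is a minimal sketch on a made-up 2x2 frame (<code>small</code>) where the extreme values are easy to spot:</p>
<pre><code class="language-python">import pandas as pd

small = pd.DataFrame({"One": [1, 5],
                      "Two": [7, 2]},
                     index=["A", "B"])

# For each column, the row label holding its maximum: One is B, Two is A
max_rows = small.idxmax()

# For each row, the column label holding its minimum: A is One, B is Two
min_columns = small.idxmin(axis="columns")
</code></pre>
<p>The results contain labels rather than positions, so they can be passed straight to <code>loc</code> to retrieve the matching rows or columns.</p>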
<p>Pandas also has functions for calculating (among many others) the mean, median, standard deviation and variance:</p>
<pre><code class="language-python">frame.mean()
</code></pre>
<pre><code>One      0.703803
Two      0.420468
Three    0.465528
Four     0.386514
Five     0.600569
dtype: float64
</code></pre>
<pre><code class="language-python">frame.median(axis=&apos;columns&apos;)
</code></pre>
<pre><code>A    0.686339
B    0.238110
C    0.785882
D    0.699541
dtype: float64
</code></pre>
<pre><code class="language-python">frame.var()
</code></pre>
<pre><code>One      0.132691
Two      0.112631
Three    0.129139
Four     0.159410
Five     0.112835
dtype: float64
</code></pre>
<pre><code class="language-python">frame.var(axis=&apos;columns&apos;)
</code></pre>
<pre><code>A    0.134738
B    0.113646
C    0.155681
D    0.129304
dtype: float64
</code></pre>
<pre><code class="language-python">frame.std()
</code></pre>
<pre><code>One      0.364268
Two      0.335605
Three    0.359358
Four     0.399262
Five     0.335909
dtype: float64
</code></pre>
<pre><code class="language-python">frame.std(axis=&apos;columns&apos;)
</code></pre>
<pre><code>A    0.367067
B    0.337114
C    0.394564
D    0.359588
dtype: float64
</code></pre>
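<p>One detail worth knowing when comparing results with NumPy: Pandas computes the <em>sample</em> variance and standard deviation by default (<code>ddof=1</code>), while NumPy&apos;s <code>np.var</code> and <code>np.std</code> default to the population versions (<code>ddof=0</code>). A small sketch with made-up values shows the difference:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Pandas default: divide by n - 1 (sample variance)
sample_var = values.var()

# Passing ddof=0 divides by n instead (population variance)
population_var = values.var(ddof=0)

# NumPy's default matches the ddof=0 result
numpy_var = np.var(values.to_numpy())
</code></pre>
<p>For these four values the squared deviations add up to 5, so the sample variance is 5/3 and the population variance is 5/4.</p>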
<p>Pandas also has an incredibly useful function called <code>describe</code>. It will calculate a battery of standard reductions and show you the summary:</p>
<pre><code class="language-python">frame.describe()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>One</th>
      <th>Two</th>
      <th>Three</th>
      <th>Four</th>
      <th>Five</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>count</th>
      <td>4.000000</td>
      <td>4.000000</td>
      <td>4.000000</td>
      <td>4.000000</td>
      <td>4.000000</td>
    </tr>
    <tr>
      <th>mean</th>
      <td>0.703803</td>
      <td>0.420468</td>
      <td>0.465528</td>
      <td>0.386514</td>
      <td>0.600569</td>
    </tr>
    <tr>
      <th>std</th>
      <td>0.364268</td>
      <td>0.335605</td>
      <td>0.359358</td>
      <td>0.399262</td>
      <td>0.335909</td>
    </tr>
    <tr>
      <th>min</th>
      <td>0.190250</td>
      <td>0.154921</td>
      <td>0.081066</td>
      <td>0.027722</td>
      <td>0.104428</td>
    </tr>
    <tr>
      <th>25%</th>
      <td>0.572218</td>
      <td>0.194689</td>
      <td>0.198849</td>
      <td>0.056470</td>
      <td>0.540861</td>
    </tr>
    <tr>
      <th>50%</th>
      <td>0.825512</td>
      <td>0.317570</td>
      <td>0.495521</td>
      <td>0.351223</td>
      <td>0.736110</td>
    </tr>
    <tr>
      <th>75%</th>
      <td>0.957096</td>
      <td>0.543349</td>
      <td>0.762200</td>
      <td>0.681268</td>
      <td>0.795818</td>
    </tr>
    <tr>
      <th>max</th>
      <td>0.973939</td>
      <td>0.891813</td>
      <td>0.790004</td>
      <td>0.815889</td>
      <td>0.825628</td>
    </tr>
  </tbody>
</table>
</div>
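<p>If the default quartiles are not what you need, <code>describe</code> accepts a <em>percentiles</em> parameter. The sketch below uses a made-up frame (<code>small</code>) and asks for the 10th and 90th percentiles:</p>
<pre><code class="language-python">import pandas as pd

small = pd.DataFrame({"One": [1.0, 2.0, 3.0, 4.0],
                      "Two": [10.0, 20.0, 30.0, 40.0]})

# Percentiles are given as fractions between 0 and 1
summary = small.describe(percentiles=[0.1, 0.9])

# The result is a regular dataframe, so single values can be read with loc
mean_of_one = summary.loc["mean", "One"]
</code></pre>
<p>Since the summary is itself a dataframe, you can slice it, transpose it, or export it like any other.</p>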
<p>The last thing we will learn about is correlation. You can use the <code>corr</code> method to calculate the correlation between two columns (or rows) of a dataframe. This is something you will probably do often if you are into data exploration/analysis:</p>
<pre><code class="language-python"># Calculate the correlation between the columns One and Five
frame[&apos;One&apos;].corr(frame[&apos;Five&apos;])
</code></pre>
<pre><code>0.879646855332041
</code></pre>
<p>You can provide an additional parameter <code>method</code> to specify the correlation method. The options are:</p>
<ul>
<li>pearson : Standard correlation coefficient</li>
<li>kendall : Kendall Tau correlation coefficient</li>
<li>spearman : Spearman rank correlation</li>
</ul>
<pre><code class="language-python">frame[&apos;One&apos;].corr(frame[&apos;Five&apos;], method=&apos;spearman&apos;)
</code></pre>
<pre><code>0.19999999999999998
</code></pre>
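<p>To see where that Spearman value comes from, remember that Spearman&apos;s coefficient is the Pearson correlation computed on the <em>ranks</em> of the values. The sketch below copies the One and Five columns from the frame shown at the start of this article (rounded to six decimals, as displayed) and computes it both ways:</p>
<pre><code class="language-python">import pandas as pd

# Values copied from the One and Five columns displayed earlier
one = pd.Series([0.973939, 0.190250, 0.951482, 0.699541], index=list("ABCD"))
five = pd.Series([0.686339, 0.104428, 0.785882, 0.825628], index=list("ABCD"))

# Spearman computed directly
spearman = one.corr(five, method="spearman")

# Same number: rank both series, then use the default Pearson correlation
pearson_on_ranks = one.rank().corr(five.rank())
</code></pre>
<p>Both calculations give 0.2. The ranks of One are 4, 1, 3, 2 and the ranks of Five are 2, 1, 3, 4, so the two columns agree on rows B and C but disagree on A and D, which explains why the rank correlation is much lower than the Pearson one.</p>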
<p>Alternatively, you can calculate the correlation matrix of the dataframe by just calling the <code>corr</code> method:</p>
<pre><code class="language-python">frame.corr()
</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>One</th>
      <th>Two</th>
      <th>Three</th>
      <th>Four</th>
      <th>Five</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>One</th>
      <td>1.000000</td>
      <td>-0.795497</td>
      <td>0.275002</td>
      <td>-0.269384</td>
      <td>0.879647</td>
    </tr>
    <tr>
      <th>Two</th>
      <td>-0.795497</td>
      <td>1.000000</td>
      <td>-0.275345</td>
      <td>0.271683</td>
      <td>-0.982924</td>
    </tr>
    <tr>
      <th>Three</th>
      <td>0.275002</td>
      <td>-0.275345</td>
      <td>1.000000</td>
      <td>-0.999982</td>
      <td>0.370300</td>
    </tr>
    <tr>
      <th>Four</th>
      <td>-0.269384</td>
      <td>0.271683</td>
      <td>-0.999982</td>
      <td>1.000000</td>
      <td>-0.366111</td>
    </tr>
    <tr>
      <th>Five</th>
      <td>0.879647</td>
      <td>-0.982924</td>
      <td>0.370300</td>
      <td>-0.366111</td>
      <td>1.000000</td>
    </tr>
  </tbody>
</table>
</div>
<h2 id="understandingyourdatausuallystartswithacalltodescribeorcorr">Understanding your data usually starts with a call to describe or corr</h2>
<p>Calculating a few values from your data can grant you a better understanding of the phenomenon that generated it.</p>
<p>One of the first things you will do when selecting features for a ML algorithm is plotting the result of the correlation matrix. This will give you an idea of which features have a better shot at predicting labels if you intend to train a supervised model.</p>
<p>This, again, is just an example of the many applications of statistical analysis, and these are just some basic functions to aid you in the process.</p>
<p>Now that we have learned the basics, we need to talk about pulling the data into Pandas. In the next article, we will learn how to create dataframes from common file formats.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on <em>Python for Data Analysis</em>. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(5): Mapping, apply and applymap]]></title><description><![CDATA[In this article you will learn how to apply functions to dataframes using apply and applymap.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-5-mapping-apply-and-applymap/</link><guid isPermaLink="false">5efc8b835dd2bc0039626992</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 15 Sep 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/07/1200px-Pandas_logo.svg.png" alt="Hands-on Pandas(5): Mapping, apply and applymap"><p>In this article, we will learn about mapping and the <em>apply</em> and <em>applymap</em> functions.</p>
<p>This technique will help you manipulate your data in very convenient ways, and is another important addition to your toolbox.</p>
<p>As always, we will explore the topic with examples that will help you understand what&apos;s going on.</p>
<p>Great, let&apos;s get started!</p>
<h2 id="mapping">Mapping</h2>
<p>Mapping means applying a function that transforms the elements of one domain into the elements of another. Here, the elements are the entries, rows, and columns of a series or dataframe. Pandas lets you apply functions at the element, row, and column level to create new series and dataframes.</p>
<p>Pandas is also compatible with many of the operations defined in NumPy. This lets you apply functions in a very convenient and performant fashion. Let&apos;s see some examples:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(4,5),
                     columns=list(&apos;abcde&apos;),
                     index=[&apos;one&apos;, &apos;two&apos;, &apos;three&apos;, &apos;four&apos;])
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>one</th>
      <td>3.007277</td>
      <td>0.388730</td>
      <td>0.113406</td>
      <td>2.119481</td>
      <td>-0.975847</td>
    </tr>
    <tr>
      <th>two</th>
      <td>0.636278</td>
      <td>0.206911</td>
      <td>1.778134</td>
      <td>-1.663180</td>
      <td>-1.211043</td>
    </tr>
    <tr>
      <th>three</th>
      <td>0.946199</td>
      <td>-0.397836</td>
      <td>-0.127306</td>
      <td>-0.588036</td>
      <td>1.026060</td>
    </tr>
    <tr>
      <th>four</th>
      <td>-0.315198</td>
      <td>-0.496803</td>
      <td>-0.918301</td>
      <td>0.389656</td>
      <td>-1.515556</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># You can apply NumPy functions directly on dataframes.
# You can, for example, calculate the absolute value of every entry
np.abs(frame)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>one</th>
      <td>3.007277</td>
      <td>0.388730</td>
      <td>0.113406</td>
      <td>2.119481</td>
      <td>0.975847</td>
    </tr>
    <tr>
      <th>two</th>
      <td>0.636278</td>
      <td>0.206911</td>
      <td>1.778134</td>
      <td>1.663180</td>
      <td>1.211043</td>
    </tr>
    <tr>
      <th>three</th>
      <td>0.946199</td>
      <td>0.397836</td>
      <td>0.127306</td>
      <td>0.588036</td>
      <td>1.026060</td>
    </tr>
    <tr>
      <th>four</th>
      <td>0.315198</td>
      <td>0.496803</td>
      <td>0.918301</td>
      <td>0.389656</td>
      <td>1.515556</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># You can also calculate the 3rd power of every entry
np.power(frame, 3)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>one</th>
      <td>27.196948</td>
      <td>0.058741</td>
      <td>0.001459</td>
      <td>9.521129</td>
      <td>-0.929277</td>
    </tr>
    <tr>
      <th>two</th>
      <td>0.257597</td>
      <td>0.008858</td>
      <td>5.622036</td>
      <td>-4.600633</td>
      <td>-1.776145</td>
    </tr>
    <tr>
      <th>three</th>
      <td>0.847125</td>
      <td>-0.062967</td>
      <td>-0.002063</td>
      <td>-0.203335</td>
      <td>1.080236</td>
    </tr>
    <tr>
      <th>four</th>
      <td>-0.031315</td>
      <td>-0.122617</td>
      <td>-0.774382</td>
      <td>0.059162</td>
      <td>-3.481093</td>
    </tr>
  </tbody>
</table>
</div>
<p>You can apply many of NumPy&apos;s ufuncs to Pandas data structures. In most situations, they return a result with the same dimensions as the original structure.</p>
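<p>As a small illustration of that shape-preserving behavior, the sketch below applies two ufuncs to a series. The <em>prices</em> series and its labels are made up for this example; the point is only that the result keeps the same index and length as the input.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"])

# Element-wise ufuncs return a new series with the same labels
roots = np.sqrt(prices)
logs = np.log(prices)

print(roots)
print(logs.index.tolist())
</code></pre>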
<p>Another important (and quite common) operation creates a new structure after applying an operation to every row or column in the original dataframe. Let&apos;s see how to create a new structure whose entries are the result of summing every column/row of our frame:</p>
<pre><code class="language-python"># Pandas&apos; apply runs a function along an axis.
# By default it runs along the rows axis, which means the function is applied to every column

# Let&apos;s produce a Series where each entry is the sum of the values in every column:

ser = frame.apply(np.sum)
ser
</code></pre>
<pre><code>a    4.274556
b   -0.298998
c    0.845934
d    0.257921
e   -2.676385
dtype: float64
</code></pre>
<pre><code class="language-python"># If you want to perform the operation using columns as an axis (the operation will be applied on a per-row basis)
# You can pass the optional argument axis

ser = frame.apply(np.sum, axis=&apos;columns&apos;)
ser
</code></pre>
<pre><code>one      4.653047
two     -0.252900
three    0.859082
four    -2.856201
dtype: float64
</code></pre>
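<p>The function passed to <em>apply</em> does not have to come from NumPy. Any function that can take a whole column (or row) as a series will work. Here is a minimal sketch, assuming a small hand-made frame (<em>small_frame</em>) and a hypothetical helper (<em>value_range</em>) that are not part of the example above:</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand
small_frame = pd.DataFrame({"a": [1, 5, 3], "b": [10, 40, 20]},
                           index=["one", "two", "three"])

def value_range(values):
    # values is a whole column (or row), passed in as a series
    return values.max() - values.min()

# Default axis: one result per column
per_column = small_frame.apply(value_range)

# axis="columns": one result per row
per_row = small_frame.apply(value_range, axis="columns")

print(per_column)
print(per_row)
</code></pre>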
<p>Again, you can pass most NumPy functions to <em>apply</em>, but it doesn&apos;t end there: you can define your own functions and run them on every entry with <em>applymap</em>. The following example applies a function that adds 2 to every entry:</p>
<pre><code class="language-python">def sum_two(entry):
    return entry + 2

frame.applymap(sum_two)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>one</th>
      <td>5.007277</td>
      <td>2.388730</td>
      <td>2.113406</td>
      <td>4.119481</td>
      <td>1.024153</td>
    </tr>
    <tr>
      <th>two</th>
      <td>2.636278</td>
      <td>2.206911</td>
      <td>3.778134</td>
      <td>0.336820</td>
      <td>0.788957</td>
    </tr>
    <tr>
      <th>three</th>
      <td>2.946199</td>
      <td>1.602164</td>
      <td>1.872694</td>
      <td>1.411964</td>
      <td>3.026060</td>
    </tr>
    <tr>
      <th>four</th>
      <td>1.684802</td>
      <td>1.503197</td>
      <td>1.081699</td>
      <td>2.389656</td>
      <td>0.484444</td>
    </tr>
  </tbody>
</table>
</div>
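<pre><code class="language-python"># You can also use a lambda, which is often easier to read for short functions.
# Note that apply passes each column to the lambda as a series, so x+3 adds 3 to every entry: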

sum_three = lambda x: x+3

frame.apply(sum_three)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>one</th>
      <td>6.007277</td>
      <td>3.388730</td>
      <td>3.113406</td>
      <td>5.119481</td>
      <td>2.024153</td>
    </tr>
    <tr>
      <th>two</th>
      <td>3.636278</td>
      <td>3.206911</td>
      <td>4.778134</td>
      <td>1.336820</td>
      <td>1.788957</td>
    </tr>
    <tr>
      <th>three</th>
      <td>3.946199</td>
      <td>2.602164</td>
      <td>2.872694</td>
      <td>2.411964</td>
      <td>4.026060</td>
    </tr>
    <tr>
      <th>four</th>
      <td>2.684802</td>
      <td>2.503197</td>
      <td>2.081699</td>
      <td>3.389656</td>
      <td>1.484444</td>
    </tr>
  </tbody>
</table>
</div>
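<p>Series have their own element-wise method, <em>map</em>. Also note that pandas 2.1 deprecated <em>applymap</em> in favor of <em>DataFrame.map</em>, which does the same job. The sketch below is one way to write code that runs on both older and newer versions; the <em>scores</em> frame and the <em>format_entry</em> helper are made up for illustration.</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({"a": [1.234, 2.345], "b": [3.456, 4.567]},
                      index=["one", "two"])

def format_entry(entry):
    # Receives one scalar at a time
    return f"{entry:.1f}"

# Use DataFrame.map when it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = scores.map if hasattr(scores, "map") else scores.applymap
formatted = elementwise(format_entry)

# Series.map is the element-wise equivalent for a single column
formatted_a = scores["a"].map(format_entry)

print(formatted)
print(formatted_a)
</code></pre>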
<h2 id="simpleconceptendlessapplications">Simple concept, endless applications</h2>
<p>Mappings let you do almost anything you need with your data. Everything from statistical aggregations to advanced machine learning tools is built upon this foundation.</p>
<p>As you may have noticed, the concept is very simple, but knowing how to apply NumPy functions to Pandas data structures will help you on a daily basis. This is even more obvious when you start to explore the potential of applying your own functions!</p>
<p>In the next article, we will learn about data summarization and descriptive statistics.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. These and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(4): Arithmetics with DataFrames and Series]]></title><description><![CDATA[This article explains the basics of performing arithmetic operations on pandas series and dataframes.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-3-arithmetics-with-dataframes-and-series/</link><guid isPermaLink="false">5edf6c485c144f00390e9168</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 08 Sep 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/06/1200px-Pandas_logo.svg-2.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/06/1200px-Pandas_logo.svg-2.png" alt="Hands-on Pandas(4): Arithmetics with DataFrames and Series"><p>Arithmetic operations are some of the most fundamental (and important) things you can do with series and dataframes. In this article, we will learn how to perform basic operations using both series and dataframes.</p>
<p>We are interested in the following scenarios:</p>
<ul>
<li>Operations between series with the same index.</li>
<li>Operations between dataframes with the same index.</li>
<li>Operations between dataframe/series with the same index.</li>
<li>Operations between series with different indexes.</li>
<li>Operations between dataframes with different indexes.</li>
<li>Operations between dataframe/series with different indexes.</li>
</ul>
<p>Good, let&apos;s get started!</p>
<h2 id="sameindexobviousbehavior">Same index, obvious behavior</h2>
<p>If two (or more) series/dataframes share the same index (both row and column index in the case of dataframes), operations follow the obvious element-wise behavior you would expect if you&apos;ve used NumPy in the past:</p>
<pre><code class="language-python">import pandas as pd
ser_1 = pd.Series([1,2,3,4], index=[&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;])
ser_2 = pd.Series([10,20,30,40], index=[&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;])

print(ser_1)
print(ser_2)
</code></pre>
<pre><code>a    1
b    2
c    3
d    4
dtype: int64
a    10
b    20
c    30
d    40
dtype: int64
</code></pre>
<pre><code class="language-python"># Addition of two series with the same index
ser_1 + ser_2
</code></pre>
<pre><code>a    11
b    22
c    33
d    44
dtype: int64
</code></pre>
<pre><code class="language-python"># Subtraction of two series with the same index
ser_2 - ser_1
</code></pre>
<pre><code>a     9
b    18
c    27
d    36
dtype: int64
</code></pre>
<pre><code class="language-python"># Multiplication of two series with the same index
ser_1 * ser_2
</code></pre>
<pre><code>a     10
b     40
c     90
d    160
dtype: int64
</code></pre>
<pre><code class="language-python"># Division of two series with the same index
ser_2 / ser_1
</code></pre>
<pre><code>a    10.0
b    10.0
c    10.0
d    10.0
dtype: float64
</code></pre>
<p>The same behavior is shown when you apply operations on two dataframes that share both the row and column index:</p>
<pre><code class="language-python">import numpy as np
df_1 = pd.DataFrame(np.arange(1,17).reshape(4,4),
                    index= [&apos;Fi&apos;, &apos;Se&apos;, &apos;Th&apos;, &apos;Fo&apos;],
                    columns = [&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;])

df_2 = pd.DataFrame(np.arange(1,17).reshape(4,4) * 10,
                    index= [&apos;Fi&apos;, &apos;Se&apos;, &apos;Th&apos;, &apos;Fo&apos;],
                    columns = [&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;])
</code></pre>
<pre><code class="language-python">df_1
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>1</td>
      <td>2</td>
      <td>3</td>
      <td>4</td>
    </tr>
    <tr>
      <th>Se</th>
      <td>5</td>
      <td>6</td>
      <td>7</td>
      <td>8</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>9</td>
      <td>10</td>
      <td>11</td>
      <td>12</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>13</td>
      <td>14</td>
      <td>15</td>
      <td>16</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python">df_2
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>10</td>
      <td>20</td>
      <td>30</td>
      <td>40</td>
    </tr>
    <tr>
      <th>Se</th>
      <td>50</td>
      <td>60</td>
      <td>70</td>
      <td>80</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>90</td>
      <td>100</td>
      <td>110</td>
      <td>120</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>130</td>
      <td>140</td>
      <td>150</td>
      <td>160</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Addition of two dataframes with the same index
df_1 + df_2
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>11</td>
      <td>22</td>
      <td>33</td>
      <td>44</td>
    </tr>
    <tr>
      <th>Se</th>
      <td>55</td>
      <td>66</td>
      <td>77</td>
      <td>88</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>99</td>
      <td>110</td>
      <td>121</td>
      <td>132</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>143</td>
      <td>154</td>
      <td>165</td>
      <td>176</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Multiplication of two dataframes with the same index
df_1 * df_2
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>10</td>
      <td>40</td>
      <td>90</td>
      <td>160</td>
    </tr>
    <tr>
      <th>Se</th>
      <td>250</td>
      <td>360</td>
      <td>490</td>
      <td>640</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>810</td>
      <td>1000</td>
      <td>1210</td>
      <td>1440</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>1690</td>
      <td>1960</td>
      <td>2250</td>
      <td>2560</td>
    </tr>
  </tbody>
</table>
</div>
<p>It&apos;s also possible to perform operations between dataframes and series that share an index. The default behavior is to align the index of the series with the column index of the dataframe and perform the operations between each row and the series.</p>
<pre><code class="language-python"># Sum a series and a dataframe
ser_1 + df_1
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>2</td>
      <td>4</td>
      <td>6</td>
      <td>8</td>
    </tr>
    <tr>
      <th>Se</th>
      <td>6</td>
      <td>8</td>
      <td>10</td>
      <td>12</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>10</td>
      <td>12</td>
      <td>14</td>
      <td>16</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>14</td>
      <td>16</td>
      <td>18</td>
      <td>20</td>
    </tr>
  </tbody>
</table>
</div>
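<p>If you want the series to line up with the row index instead, the arithmetic methods accept an <em>axis</em> argument. A minimal sketch, assuming a hypothetical <em>row_offsets</em> series whose index matches the row labels of the frame:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4),
                  index=["Fi", "Se", "Th", "Fo"],
                  columns=["a", "b", "c", "d"])

# Hypothetical series indexed by the row labels of df
row_offsets = pd.Series([100, 200, 300, 400], index=["Fi", "Se", "Th", "Fo"])

# axis="index" matches the series against the row index,
# so every column is combined with row_offsets
by_rows = df.add(row_offsets, axis="index")
print(by_rows)
</code></pre>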
<h2 id="differentindexouterjoins">Different index, outer joins</h2>
<p>If you perform operations between series/dataframes with different indexes, the result will be a new data structure whose index is the union of the original indexes. If you have worked with databases before, this is similar to an outer join using the indexes of the original series/dataframes. This is much easier to see with an example:</p>
<pre><code class="language-python">ser_1 = pd.Series([1,1,1,1,1], index=[&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;, &apos;e&apos;])
ser_2 = pd.Series([5,5,5,5,5], index=[&apos;c&apos;, &apos;d&apos;, &apos;e&apos;, &apos;f&apos;, &apos;g&apos;])

print(ser_1)
print(ser_2)
</code></pre>
<pre><code>a    1
b    1
c    1
d    1
e    1
dtype: int64
c    5
d    5
e    5
f    5
g    5
dtype: int64
</code></pre>
<p>If the operation is performed on series with different indexes, the index of the result is the union of the original indexes. Only labels present in both series (the intersection) get a computed value; every other label is filled with NaN.</p>
<p>In this case, the intersection is <code>[&apos;c&apos;, &apos;d&apos;, &apos;e&apos;]</code>.</p>
<pre><code class="language-python">ser_1 + ser_2
</code></pre>
<pre><code>a    NaN
b    NaN
c    6.0
d    6.0
e    6.0
f    NaN
g    NaN
dtype: float64
</code></pre>
<pre><code class="language-python">ser_1 * ser_2
</code></pre>
<pre><code>a    NaN
b    NaN
c    5.0
d    5.0
e    5.0
f    NaN
g    NaN
dtype: float64
</code></pre>
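<p>To see which labels will hold real values and which will become NaN, you can inspect the indexes directly. This sketch recreates the two series above and uses the <em>union</em> and <em>intersection</em> methods of the index objects:</p>
<pre><code class="language-python">import pandas as pd

ser_1 = pd.Series([1, 1, 1, 1, 1], index=["a", "b", "c", "d", "e"])
ser_2 = pd.Series([5, 5, 5, 5, 5], index=["c", "d", "e", "f", "g"])

# Labels of the result: every label that appears in either series
all_labels = ser_1.index.union(ser_2.index)

# Labels that get a computed value: those present in both series
shared_labels = ser_1.index.intersection(ser_2.index)

total = ser_1 + ser_2
print(all_labels.tolist())
print(shared_labels.tolist())
print(total.dropna())
</code></pre>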
<p>Dataframes behave the same way, but the alignment is performed on both the row index and the column index.</p>
<pre><code class="language-python">import numpy as np

# In this case, the shared labels (the intersection) are [a,b,c] in the columns and [Fi,Fo,Th] in the rows

df_1 = pd.DataFrame(np.arange(1,17).reshape(4,4),
                    index= [&apos;Fi&apos;, &apos;Ma&apos;, &apos;Th&apos;, &apos;Fo&apos;],
                    columns = [&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;])

df_2 = pd.DataFrame(np.arange(1,17).reshape(4,4) * 10,
                    index= [&apos;Fi&apos;, &apos;Se&apos;, &apos;Th&apos;, &apos;Fo&apos;],
                    columns = [&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;e&apos;])

df_1 + df_2
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>11.0</td>
      <td>22.0</td>
      <td>33.0</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>143.0</td>
      <td>154.0</td>
      <td>165.0</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Ma</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Se</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>99.0</td>
      <td>110.0</td>
      <td>121.0</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
</div>
<p>In the case of operations between dataframes and series with different indexes, a union will be performed between the column index of the dataframe and the index of the series:</p>
<pre><code class="language-python">df_1 + ser_2
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
      <th>f</th>
      <th>g</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fi</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>8.0</td>
      <td>9.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Ma</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>12.0</td>
      <td>13.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Th</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>16.0</td>
      <td>17.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>Fo</th>
      <td>NaN</td>
      <td>NaN</td>
      <td>20.0</td>
      <td>21.0</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
</div>
<h2 id="fillinginmissingvalues">Filling in missing values</h2>
<p>Instead of using the normal arithmetic operators, you can use a set of built-in Pandas methods that accept an argument to fill in missing values:</p>
<ul>
<li>add/radd</li>
<li>sub/rsub</li>
<li>div/rdiv</li>
<li>mul/rmul</li>
<li>pow/rpow</li>
</ul>
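<p>The r-prefixed versions swap the operands: <code>ser.rsub(10)</code> computes <code>10 - ser</code>, while <code>ser.sub(10)</code> computes <code>ser - 10</code>. A short sketch with a made-up series:</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical data for illustration
ser = pd.Series([1, 2, 4], index=["a", "b", "c"])

minus_ten = ser.sub(10)       # same as ser - 10
ten_minus = ser.rsub(10)      # same as 10 - ser
reciprocal = ser.rdiv(1)      # same as 1 / ser

print(minus_ten)
print(ten_minus)
print(reciprocal)
</code></pre>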
<p>Let&apos;s revisit series addition and use 1 as the placeholder value:</p>
<pre><code class="language-python">ser_1.add(ser_2, fill_value=1)
</code></pre>
<pre><code>a    2.0
b    2.0
c    6.0
d    6.0
e    6.0
f    6.0
g    6.0
dtype: float64
</code></pre>
<p>If a label is not in the overlap of the two series, the missing side is replaced by the placeholder value of 1 before adding. For example, a and b are both 1+1, and f and g are both 5+1. The same behavior applies to dataframes.</p>
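<p>Here is the dataframe version as a minimal sketch, using two small made-up frames (<em>left</em> and <em>right</em>). One detail worth knowing: when a position is missing from both frames, there is nothing to fill on either side, so the result is still NaN.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical frames that only partially overlap
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["y", "z"])

# Positions present in only one frame are combined with fill_value.
# Positions missing from both frames stay NaN.
filled = left.add(right, fill_value=0)
print(filled)
</code></pre>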
<h2 id="nowyouknowmaths">Now you know maths</h2>
<p>The toughest thing about arithmetic operations on pandas data structures is understanding how they work when the indexes are not the same. As long as you remember that the alignment behaves like an outer join, everything will be clear and easy.</p>
<p>In the next article, we will talk about mapping and function application, our first advance-y Pandas topics!</p>
<p>Thanks for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. These and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(3): Reindexing and Deletion]]></title><description><![CDATA[In this article we will learn how to alter indexes and remove elements from both series and dataframes.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-3-reindexing-and-deletion/</link><guid isPermaLink="false">5ed8de4735eabc0039053c8c</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 01 Sep 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/06/1200px-Pandas_logo.svg-1.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/06/1200px-Pandas_logo.svg-1.png" alt="Hands-on Pandas(3): Reindexing and Deletion"><p>Today we will deal with two techniques we need to cover before moving to more advanced Pandas topics: Reindexing and element deletion.</p>
<p>It will be a bit shorter than the first two articles in the series, but that doesn&apos;t mean it&apos;s not important. Both techniques are very useful, and you will probably use them in your day-to-day work if you become a Pandas practitioner.</p>
<p>Good, let&apos;s get started!</p>
<h3 id="reindexing">Reindexing</h3>
<p>Reindexing is a fancy word for <em>creating a new dataframe/series with an altered index</em>.</p>
<pre><code class="language-python">import pandas as pd

ser = pd.Series([2,1,3,4,7,6,5], index=[&apos;b&apos;, &apos;a&apos;, &apos;c&apos;, &apos;d&apos;, &apos;g&apos;, &apos;f&apos;, &apos;e&apos;])
print(ser)
</code></pre>
<pre><code>b    2
a    1
c    3
d    4
g    7
f    6
e    5
dtype: int64
</code></pre>
<p>The <em>reindex</em> function receives a list of index elements and creates a new dataframe (or series) in which the rows/elements follow the order specified in that list.</p>
<p>For example, we can create a new series where the numbers are ordered in ascending order by providing the following input for reindex:</p>
<pre><code class="language-python">ordered_ser = ser.reindex([&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;, &apos;e&apos;, &apos;f&apos;, &apos;g&apos;])
print(ordered_ser)
</code></pre>
<pre><code>a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64
</code></pre>
<p>You don&apos;t need to pass every element in the original index; you can provide a list with only the elements you need:</p>
<pre><code class="language-python"># This will create a new series with the last four elements, in descending order
ordered_ser = ser.reindex([&apos;g&apos;, &apos;f&apos;, &apos;e&apos;, &apos;d&apos;])
print(ordered_ser)
</code></pre>
<pre><code>g    7
f    6
e    5
d    4
dtype: int64
</code></pre>
<p>Sometimes you want to reindex the series/dataframe to expand the range of elements. In this case, you will probably find that some of the elements are set to NaN:</p>
<pre><code class="language-python">ser = pd.Series([&apos;azul&apos;, &apos;rojo&apos;, &apos;verde&apos;], index=[0,4,8])
ser.reindex(range(12))
</code></pre>
<pre><code>0      azul
1       NaN
2       NaN
3       NaN
4      rojo
5       NaN
6       NaN
7       NaN
8     verde
9       NaN
10      NaN
11      NaN
dtype: object
</code></pre>
<pre><code class="language-python"># In this case, you can specify a fill method to dictate what will happen to the empty entries
# ffill, for example, performs a forward fill

ser.reindex(range(12), method=&apos;ffill&apos;)
</code></pre>
<pre><code>0      azul
1      azul
2      azul
3      azul
4      rojo
5      rojo
6      rojo
7      rojo
8     verde
9     verde
10    verde
11    verde
dtype: object
</code></pre>
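<p>Forward fill is not the only option. As a sketch of two alternatives on the same series: <code>method="bfill"</code> fills backward from the next known value, and <code>fill_value</code> sets a constant for every new entry. The placeholder string <em>sin color</em> is an arbitrary choice for this example.</p>
<pre><code class="language-python">import pandas as pd

ser = pd.Series(["azul", "rojo", "verde"], index=[0, 4, 8])

# Backward fill: each gap takes the next known value (nothing follows label 8)
back_filled = ser.reindex(range(12), method="bfill")

# Constant placeholder for every label that was not in the original index
constant_filled = ser.reindex(range(12), fill_value="sin color")

print(back_filled)
print(constant_filled)
</code></pre>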
<p>Frames behave pretty much the same way, but they also let you reindex by column. Let&apos;s take a look at a final reindexing example using a dataframe:</p>
<pre><code class="language-python">import numpy as np

frame = pd.DataFrame(np.arange(16).reshape(4,4),
                     index = [&apos;First&apos;, &apos;Second&apos;, &apos;Third&apos;, &apos;Fourth&apos;],
                     columns = [&apos;Alpha&apos;, &apos;Beta&apos;, &apos;Gamma&apos;, &apos;Delta&apos;])

frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Alpha</th>
      <th>Beta</th>
      <th>Gamma</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>First</th>
      <td>0</td>
      <td>1</td>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <th>Second</th>
      <td>4</td>
      <td>5</td>
      <td>6</td>
      <td>7</td>
    </tr>
    <tr>
      <th>Third</th>
      <td>8</td>
      <td>9</td>
      <td>10</td>
      <td>11</td>
    </tr>
    <tr>
      <th>Fourth</th>
      <td>12</td>
      <td>13</td>
      <td>14</td>
      <td>15</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># We can reindex using the row index
frame.reindex([&apos;Fourth&apos;, &apos;Second&apos;])
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Alpha</th>
      <th>Beta</th>
      <th>Gamma</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fourth</th>
      <td>12</td>
      <td>13</td>
      <td>14</td>
      <td>15</td>
    </tr>
    <tr>
      <th>Second</th>
      <td>4</td>
      <td>5</td>
      <td>6</td>
      <td>7</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Or, reindex using the columns
frame.reindex(columns=[&apos;Alpha&apos;, &apos;Gamma&apos;])
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Alpha</th>
      <th>Gamma</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>First</th>
      <td>0</td>
      <td>2</td>
    </tr>
    <tr>
      <th>Second</th>
      <td>4</td>
      <td>6</td>
    </tr>
    <tr>
      <th>Third</th>
      <td>8</td>
      <td>10</td>
    </tr>
    <tr>
      <th>Fourth</th>
      <td>12</td>
      <td>14</td>
    </tr>
  </tbody>
</table>
</div>
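<p>One detail worth keeping in mind: reindexing with a label that does not exist in the frame does not raise an error. The sketch below rebuilds a frame like the one above (the construction with <code>np.arange</code> is an assumption made so the snippet runs on its own) and asks for two hypothetical labels, <code>Fifth</code> and <code>Omega</code>. Missing entries come back as NaN unless you pass <code>fill_value</code>.</p>

```python
import numpy as np
import pandas as pd

# Assumed construction of a frame shaped like the one used above
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['First', 'Second', 'Third', 'Fourth'],
                     columns=['Alpha', 'Beta', 'Gamma', 'Delta'])

# 'Fifth' is a hypothetical row label that is not in the frame: its row is all NaN
with_gaps = frame.reindex(['First', 'Fifth'])

# 'Omega' is a hypothetical column label: fill_value replaces the NaN placeholder
filled = frame.reindex(columns=['Alpha', 'Omega'], fill_value=0)

print(with_gaps)
print(filled)
```

<p>Because NaN is a floating-point value, the columns of <code>with_gaps</code> are converted to floats, which is something to watch for when you expected integers.</p>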
<h3 id="deletingelements">Deleting elements</h3>
<p>Now we will learn how to remove elements from both series and dataframes. This is usually achieved using the <em>drop</em> method.</p>
<p>Note that calls to drop don&apos;t alter the original series/dataframe. Instead, they return a new one without the specified elements. If for some reason you need to alter the original series/dataframe, you can pass <code>inplace=True</code> as an argument.</p>
<pre><code class="language-python">ser = pd.Series([1,2,3,4], index=[&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;])
print(ser)
</code></pre>
<pre><code>a    1
b    2
c    3
d    4
dtype: int64
</code></pre>
<pre><code class="language-python"># Pass drop the index value of the element you want to delete
ser.drop(&apos;b&apos;)
</code></pre>
<pre><code>a    1
c    3
d    4
dtype: int64
</code></pre>
<pre><code class="language-python"># You can also pass a list of index values
ser.drop([&apos;a&apos;, &apos;c&apos;])
</code></pre>
<pre><code>b    2
d    4
dtype: int64
</code></pre>
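<p>To make the point about the original object concrete, here is a minimal sketch. The names <code>trimmed</code>, <code>scratch</code> and <code>result</code> are only illustrative. It shows that <code>drop</code> leaves the series untouched by default, and that with <code>inplace=True</code> the call modifies the object and returns <code>None</code>.</p>

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Default behavior: a new Series comes back, ser keeps all four elements
trimmed = ser.drop('b')
print(list(ser.index))
print(list(trimmed.index))

# With inplace=True the object itself changes and the call returns None
scratch = ser.copy()
result = scratch.drop('b', inplace=True)
print(result)
print(list(scratch.index))
```

<p>A common mistake is writing <code>ser = ser.drop(&apos;b&apos;, inplace=True)</code>, which replaces your series with <code>None</code>. Use either the assignment or <code>inplace=True</code>, not both.</p>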
<p>Dataframes let you drop elements using both the row index and the column index.</p>
<pre><code class="language-python">frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Alpha</th>
      <th>Beta</th>
      <th>Gamma</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>First</th>
      <td>0</td>
      <td>1</td>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <th>Second</th>
      <td>4</td>
      <td>5</td>
      <td>6</td>
      <td>7</td>
    </tr>
    <tr>
      <th>Third</th>
      <td>8</td>
      <td>9</td>
      <td>10</td>
      <td>11</td>
    </tr>
    <tr>
      <th>Fourth</th>
      <td>12</td>
      <td>13</td>
      <td>14</td>
      <td>15</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Let&apos;s drop the second and fourth rows
frame.drop([&apos;Second&apos;, &apos;Fourth&apos;])
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Alpha</th>
      <th>Beta</th>
      <th>Gamma</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>First</th>
      <td>0</td>
      <td>1</td>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <th>Third</th>
      <td>8</td>
      <td>9</td>
      <td>10</td>
      <td>11</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Pass axis=&apos;columns&apos; (or axis=1) as an additional argument to drop by column label instead of row label
# Let&apos;s get rid of the Alpha and Beta columns
frame.drop([&apos;Alpha&apos;, &apos;Beta&apos;], axis=&apos;columns&apos;)
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Gamma</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>First</th>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <th>Second</th>
      <td>6</td>
      <td>7</td>
    </tr>
    <tr>
      <th>Third</th>
      <td>10</td>
      <td>11</td>
    </tr>
    <tr>
      <th>Fourth</th>
      <td>14</td>
      <td>15</td>
    </tr>
  </tbody>
</table>
</div>
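<p>As an alternative to the <code>axis</code> argument, <code>drop</code> also accepts <code>index</code> and <code>columns</code> keywords, which some people find easier to read. The sketch below rebuilds the frame (its construction is an assumption, so the snippet is self-contained) and uses a hypothetical <code>Omega</code> label to show the <code>errors</code> parameter.</p>

```python
import numpy as np
import pandas as pd

# Assumed construction of a frame shaped like the one used above
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['First', 'Second', 'Third', 'Fourth'],
                     columns=['Alpha', 'Beta', 'Gamma', 'Delta'])

# Same result as frame.drop(['Alpha', 'Beta'], axis='columns')
no_alpha_beta = frame.drop(columns=['Alpha', 'Beta'])

# Rows and columns can be dropped in a single call
trimmed = frame.drop(index=['Second'], columns=['Delta'])

# 'Omega' is a hypothetical label; errors='ignore' skips it instead of raising KeyError
unchanged = frame.drop(columns=['Omega'], errors='ignore')

print(no_alpha_beta.columns.tolist())
print(trimmed.shape)
print(unchanged.equals(frame))
```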
<h2 id="datawranglingbasics">Data-wrangling basics</h2>
<p>When exploring data, you will need to alter indexes and delete rows with elements you don&apos;t need. As with all previous articles, I&apos;d like to encourage you to practice these techniques on your own until you feel comfortable with them.</p>
<p>In the next article, we will learn how to perform arithmetic operations with dataframes and series.</p>
<p>Thank you for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>).</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hands-on Pandas(2): Selection, Filtering, loc and iloc]]></title><description><![CDATA[This article teaches the basics of selection and filtering using, among others, tools like the loc and iloc methods.]]></description><link>https://www.brainstobytes.com/hands-on-pandas-2-selection-filtering-loc-and-iloc/</link><guid isPermaLink="false">5ed7a757adbbe400394521f1</guid><category><![CDATA[Machine Learning & Data]]></category><dc:creator><![CDATA[Juan Orozco Villalobos]]></dc:creator><pubDate>Tue, 25 Aug 2020 07:00:00 GMT</pubDate><media:content url="https://www.brainstobytes.com/content/images/2020/06/1200px-Pandas_logo.svg.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.brainstobytes.com/content/images/2020/06/1200px-Pandas_logo.svg.png" alt="Hands-on Pandas(2): Selection, Filtering, loc and iloc"><p>In the last article, we learned about the two basic pandas data structures: Series and DataFrames. We also built a couple of them on our own and learned the basics of indexing and selection.</p>
<p>Today we will learn a bit more about selecting and filtering elements from Pandas data structures. This might seem like an incredibly basic topic, but it&apos;s very useful. That&apos;s why it&apos;s important to understand it well before tackling more advanced topics.</p>
<p>Knowing how to wrangle data is one of the most important skills for anyone working on data science and machine learning, and the foundation of those skills is data selection and filtering.</p>
<p>Good, let&apos;s get started!</p>
<h3 id="playingwithseries">Playing with Series</h3>
<p>Selecting elements from a Series object is pretty straightforward. The following examples show different ways of selecting elements from a small 8-element series.</p>
<pre><code class="language-python">import pandas as pd
import numpy as np 

ser = pd.Series(np.arange(8), index=[&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;, &apos;e&apos;, &apos;f&apos;, &apos;g&apos;, &apos;h&apos;])
print(ser)
</code></pre>
<pre><code>a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64
</code></pre>
<pre><code class="language-python"># You can select elements from a series using its index
ser[&apos;d&apos;]
</code></pre>
<pre><code>3
</code></pre>
<pre><code class="language-python"># You can also pass a list of index elements if you need to retrieve more than one element
ser[[&apos;a&apos;, &apos;d&apos;, &apos;g&apos;]]
</code></pre>
<pre><code>a    0
d    3
g    6
dtype: int64
</code></pre>
<p>Pandas is so cool that it even supports selection with label-based slices! There is an important difference between these and regular Python slices: <strong>the last element of the slice is included</strong>.</p>
<pre><code class="language-python"># Select all elements from b to g (both edges included)
ser[&apos;b&apos;:&apos;g&apos;]
</code></pre>
<pre><code>b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
</code></pre>
<p>The fact that you are not using the default index does not mean that position-based selection is not permitted. You can still select elements from a Series by position. Plain square brackets with integers used to fall back to positions on a non-integer index, but recent pandas versions deprecate that fallback, so the examples below use <code>iloc</code>, which always selects by position.</p>
<pre><code class="language-python"># Select the third element (position 2, remember? 0-indexed) from our series
ser.iloc[2]
</code></pre>
<pre><code>2
</code></pre>
<pre><code class="language-python"># Now, select the elements at positions 2, 3 and 6
ser.iloc[[2, 3, 6]]
</code></pre>
<pre><code>c    2
d    3
g    6
dtype: int64
</code></pre>
<pre><code class="language-python"># And finally, slice selection is still supported (but in this case, the last element is excluded as usual)
ser.iloc[2:8]
</code></pre>
<pre><code>c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64
</code></pre>
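<p>The two kinds of slices shown in this section are easy to mix up, so here is a small side-by-side sketch. It uses the explicit <code>loc</code> (labels) and <code>iloc</code> (positions) accessors, and the variable names are only illustrative.</p>

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(8), index=list('abcdefgh'))

# Label-based slice: both 'b' and 'g' are included
by_label = ser.loc['b':'g']

# Position-based slice: starts at position 1 ('b') and stops before position 6 ('g')
by_position = ser.iloc[1:6]

print(list(by_label.index))
print(list(by_position.index))
```

<p>Even though <code>&apos;g&apos;</code> sits at position 6, the label slice returns six elements and the positional slice returns five.</p>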
<h2 id="playingwithdataframes">Playing with DataFrames</h2>
<p>Because of an extra dimension, selecting elements from DataFrames is richer than from Series. We will start with the most basic scenario: Selecting whole columns.</p>
<pre><code class="language-python">pokedata = {&apos;Name&apos;: [&apos;Abra&apos;, &apos;Koffing&apos;, &apos;Milcery&apos;, &apos;Pikachu&apos;, &apos;Shellder&apos;, &apos;Vulpix&apos;],
            &apos;Type&apos;: [&apos;Psychic&apos;, &apos;Poison&apos;, &apos;Fairy&apos;, &apos;Electric&apos;, &apos;Water&apos;, &apos;Fire&apos;],
            &apos;HP&apos;: [25, 40, 45, 35, 30, 38],
            &apos;Speed&apos;: [90, 35, 34, 90, 40, 65],
            &apos;Color&apos;: [&apos;Yellow&apos;, &apos;Purple&apos;, &apos;White&apos;, &apos;Yellow&apos;, &apos;Purple&apos;, &apos;Red&apos;],
            &apos;FirstGen&apos;: [True, True, False, True, True, True]}

# We will use the Name column as index
pframe = pd.DataFrame(pokedata).set_index(&apos;Name&apos;)
pframe
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>HP</th>
      <th>Speed</th>
      <th>Color</th>
      <th>FirstGen</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Abra</th>
      <td>Psychic</td>
      <td>25</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Koffing</th>
      <td>Poison</td>
      <td>40</td>
      <td>35</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Milcery</th>
      <td>Fairy</td>
      <td>45</td>
      <td>34</td>
      <td>White</td>
      <td>False</td>
    </tr>
    <tr>
      <th>Pikachu</th>
      <td>Electric</td>
      <td>35</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Shellder</th>
      <td>Water</td>
      <td>30</td>
      <td>40</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Vulpix</th>
      <td>Fire</td>
      <td>38</td>
      <td>65</td>
      <td>Red</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># You can select a column from the frame by passing the name between brackets
pframe[&apos;Type&apos;]
</code></pre>
<pre><code>Name
Abra         Psychic
Koffing       Poison
Milcery        Fairy
Pikachu     Electric
Shellder       Water
Vulpix          Fire
Name: Type, dtype: object
</code></pre>
<pre><code class="language-python"># If you pass a list of column names you will retrieve them in that order
pframe[[&apos;FirstGen&apos;, &apos;HP&apos;, &apos;Color&apos;]]
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>FirstGen</th>
      <th>HP</th>
      <th>Color</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Abra</th>
      <td>True</td>
      <td>25</td>
      <td>Yellow</td>
    </tr>
    <tr>
      <th>Koffing</th>
      <td>True</td>
      <td>40</td>
      <td>Purple</td>
    </tr>
    <tr>
      <th>Milcery</th>
      <td>False</td>
      <td>45</td>
      <td>White</td>
    </tr>
    <tr>
      <th>Pikachu</th>
      <td>True</td>
      <td>35</td>
      <td>Yellow</td>
    </tr>
    <tr>
      <th>Shellder</th>
      <td>True</td>
      <td>30</td>
      <td>Purple</td>
    </tr>
    <tr>
      <th>Vulpix</th>
      <td>True</td>
      <td>38</td>
      <td>Red</td>
    </tr>
  </tbody>
</table>
</div>
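<p>A subtle point in the two examples above is the type of the result: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry. The sketch below uses a trimmed-down, three-row version of the Pokemon data (an assumption made to keep the snippet short).</p>

```python
import pandas as pd

# Trimmed-down version of the data used in this article
pframe = pd.DataFrame({'Name': ['Abra', 'Koffing', 'Milcery'],
                       'Type': ['Psychic', 'Poison', 'Fairy'],
                       'HP': [25, 40, 45]}).set_index('Name')

single = pframe['HP']      # a Series
wrapped = pframe[['HP']]   # a DataFrame with one column

print(type(single).__name__)
print(type(wrapped).__name__)
```

<p>This matters when the next step in your code expects one type or the other, for example when a library requires two-dimensional input.</p>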
<p>Square brackets also support selection based on content. Let&apos;s select rows that satisfy specific criteria to see how it works.</p>
<pre><code class="language-python"># Select all Pokemon with speed lower than 50
pframe[pframe[&apos;Speed&apos;] &lt; 50]
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>HP</th>
      <th>Speed</th>
      <th>Color</th>
      <th>FirstGen</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Koffing</th>
      <td>Poison</td>
      <td>40</td>
      <td>35</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Milcery</th>
      <td>Fairy</td>
      <td>45</td>
      <td>34</td>
      <td>White</td>
      <td>False</td>
    </tr>
    <tr>
      <th>Shellder</th>
      <td>Water</td>
      <td>30</td>
      <td>40</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Select all yellow Pokemon
pframe[pframe[&apos;Color&apos;] == &apos;Yellow&apos;]
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>HP</th>
      <th>Speed</th>
      <th>Color</th>
      <th>FirstGen</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Abra</th>
      <td>Psychic</td>
      <td>25</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Pikachu</th>
      <td>Electric</td>
      <td>35</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
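<pre><code class="language-python"># Select all first generation Pokemon with HP greater than 37
# FirstGen is already boolean, so it can be used directly; the comparison needs parentheses because &amp; binds more tightly than &gt;
pframe[pframe[&apos;FirstGen&apos;] &amp; (pframe[&apos;HP&apos;] &gt; 37)]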
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>HP</th>
      <th>Speed</th>
      <th>Color</th>
      <th>FirstGen</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Koffing</th>
      <td>Poison</td>
      <td>40</td>
      <td>35</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Vulpix</th>
      <td>Fire</td>
      <td>38</td>
      <td>65</td>
      <td>Red</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
<p>You can go as specific as you want with this form of filtering. Selecting subsets of rows is a very useful skill, so play a bit selecting based on your own conditions.</p>
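<p>If you want a starting point for your own conditions, the sketch below shows three more building blocks: <code>|</code> for "or", <code>~</code> for "not", and <code>isin</code> for membership in a list. The thresholds and variable names are arbitrary choices for illustration.</p>

```python
import pandas as pd

pokedata = {'Name': ['Abra', 'Koffing', 'Milcery', 'Pikachu', 'Shellder', 'Vulpix'],
            'Type': ['Psychic', 'Poison', 'Fairy', 'Electric', 'Water', 'Fire'],
            'HP': [25, 40, 45, 35, 30, 38],
            'Speed': [90, 35, 34, 90, 40, 65],
            'Color': ['Yellow', 'Purple', 'White', 'Yellow', 'Purple', 'Red'],
            'FirstGen': [True, True, False, True, True, True]}
pframe = pd.DataFrame(pokedata).set_index('Name')

# "Or": Pokemon that are either fast or have high HP
fast_or_sturdy = pframe[(pframe['Speed'] > 80) | (pframe['HP'] > 40)]

# "Not": Pokemon that are not from the first generation
not_first_gen = pframe[~pframe['FirstGen']]

# Membership: Pokemon whose color is in a list of values
yellow_or_red = pframe[pframe['Color'].isin(['Yellow', 'Red'])]

print(fast_or_sturdy.index.tolist())
print(not_first_gen.index.tolist())
print(yellow_or_red.index.tolist())
```

<p>Python&apos;s own <code>and</code>, <code>or</code> and <code>not</code> keywords do not work element by element on a Series, which is why the symbol operators are used here.</p>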
<p>Good, I think we are in good shape when it comes to selecting based on column labels. Now let&apos;s select specific rows based on the index. For this, Pandas offers you two very valuable indexers: <em>loc</em> and <em>iloc</em>. They are used with square brackets rather than called like functions.</p>
<p>loc lets you select based on axis labels, whereas iloc lets you select based on integers that represent the position of the row. Again, it&apos;s easier to understand with examples:</p>
<pre><code class="language-python"># Select the row with index Shellder
pframe.loc[&apos;Shellder&apos;]
</code></pre>
<pre><code>Type         Water
HP              30
Speed           40
Color       Purple
FirstGen      True
Name: Shellder, dtype: object
</code></pre>
<pre><code class="language-python"># You can pass a list of index values and get the rows in the specified order
pframe.loc[[&apos;Shellder&apos;, &apos;Abra&apos;, &apos;Pikachu&apos;]]
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>HP</th>
      <th>Speed</th>
      <th>Color</th>
      <th>FirstGen</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Shellder</th>
      <td>Water</td>
      <td>30</td>
      <td>40</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Abra</th>
      <td>Psychic</td>
      <td>25</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Pikachu</th>
      <td>Electric</td>
      <td>35</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># It&apos;s also possible to get only a subset of columns using loc
# Let&apos;s get data for Shellder, but only the Type and Color
pframe.loc[&apos;Shellder&apos;, [&apos;Type&apos;, &apos;Color&apos;]]
</code></pre>
<pre><code>Type      Water
Color    Purple
Name: Shellder, dtype: object
</code></pre>
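<p>The row part of <code>loc</code> is not limited to single labels or lists. As a sketch (the variable names and the speed threshold are illustrative), it also accepts a boolean condition or a label slice, and the column part accepts a label slice too.</p>

```python
import pandas as pd

pokedata = {'Name': ['Abra', 'Koffing', 'Milcery', 'Pikachu', 'Shellder', 'Vulpix'],
            'Type': ['Psychic', 'Poison', 'Fairy', 'Electric', 'Water', 'Fire'],
            'HP': [25, 40, 45, 35, 30, 38],
            'Speed': [90, 35, 34, 90, 40, 65],
            'Color': ['Yellow', 'Purple', 'White', 'Yellow', 'Purple', 'Red'],
            'FirstGen': [True, True, False, True, True, True]}
pframe = pd.DataFrame(pokedata).set_index('Name')

# Filter rows with a condition and keep only two columns, in one step
fast = pframe.loc[pframe['Speed'] > 50, ['Type', 'HP']]

# Label slices work on both axes, and both ends are included
block = pframe.loc['Koffing':'Pikachu', 'HP':'Color']

print(fast)
print(block)
```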
<pre><code class="language-python"># If instead, you need to select elements based on order, you can use iloc
# For example, the following line selects the third row (index 2, because 0-indexed)
pframe.iloc[2]
</code></pre>
<pre><code>Type        Fairy
HP             45
Speed          34
Color       White
FirstGen    False
Name: Milcery, dtype: object
</code></pre>
<pre><code class="language-python"># Just like with loc, you can pass a list of positions and get back a dataframe with the rows in that order
pframe.iloc[[2,4,0]]
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Type</th>
      <th>HP</th>
      <th>Speed</th>
      <th>Color</th>
      <th>FirstGen</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Milcery</th>
      <td>Fairy</td>
      <td>45</td>
      <td>34</td>
      <td>White</td>
      <td>False</td>
    </tr>
    <tr>
      <th>Shellder</th>
      <td>Water</td>
      <td>30</td>
      <td>40</td>
      <td>Purple</td>
      <td>True</td>
    </tr>
    <tr>
      <th>Abra</th>
      <td>Psychic</td>
      <td>25</td>
      <td>90</td>
      <td>Yellow</td>
      <td>True</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Remember that little trick for selecting just a subset of columns? It also works for iloc
# This selects the third row, and only the Type (column at position 0) and HP (column at position 1)
pframe.iloc[2, [0, 1]]
</code></pre>
<pre><code>Type    Fairy
HP         45
Name: Milcery, dtype: object
</code></pre>
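<p><code>iloc</code> follows the same rules as regular Python sequences, so slices and negative positions work as you would expect. A short sketch, with illustrative variable names:</p>

```python
import pandas as pd

pokedata = {'Name': ['Abra', 'Koffing', 'Milcery', 'Pikachu', 'Shellder', 'Vulpix'],
            'Type': ['Psychic', 'Poison', 'Fairy', 'Electric', 'Water', 'Fire'],
            'HP': [25, 40, 45, 35, 30, 38],
            'Speed': [90, 35, 34, 90, 40, 65],
            'Color': ['Yellow', 'Purple', 'White', 'Yellow', 'Purple', 'Red'],
            'FirstGen': [True, True, False, True, True, True]}
pframe = pd.DataFrame(pokedata).set_index('Name')

# Rows at positions 1, 2 and 3, and the first two columns (end of each slice excluded)
window = pframe.iloc[1:4, :2]

# Negative positions count from the end: the last row and the last column
last_row = pframe.iloc[-1]
last_column = pframe.iloc[:, -1]

print(window)
print(last_row.name)
print(last_column.name)
```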
<h3 id="awordonnumericindexes">A word on numeric indexes</h3>
<p>loc and iloc are pretty straightforward, but it&apos;s important to understand the difference between them. This is especially true when dealing with numeric indexes. A dataframe whose numeric index is not the default one (in order, starting at 0 and without gaps) can behave in surprising ways unless you remember how the two indexers differ. Take the following dataframe as an example:</p>
<pre><code class="language-python">frame = pd.DataFrame(np.arange(36).reshape(6,6), 
                     columns = [&apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;, &apos;e&apos;, &apos;f&apos; ],
                     index = [5, 3, 1, 4, 2, 0])
frame
</code></pre>
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
      <th>c</th>
      <th>d</th>
      <th>e</th>
      <th>f</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>5</th>
      <td>0</td>
      <td>1</td>
      <td>2</td>
      <td>3</td>
      <td>4</td>
      <td>5</td>
    </tr>
    <tr>
      <th>3</th>
      <td>6</td>
      <td>7</td>
      <td>8</td>
      <td>9</td>
      <td>10</td>
      <td>11</td>
    </tr>
    <tr>
      <th>1</th>
      <td>12</td>
      <td>13</td>
      <td>14</td>
      <td>15</td>
      <td>16</td>
      <td>17</td>
    </tr>
    <tr>
      <th>4</th>
      <td>18</td>
      <td>19</td>
      <td>20</td>
      <td>21</td>
      <td>22</td>
      <td>23</td>
    </tr>
    <tr>
      <th>2</th>
      <td>24</td>
      <td>25</td>
      <td>26</td>
      <td>27</td>
      <td>28</td>
      <td>29</td>
    </tr>
    <tr>
      <th>0</th>
      <td>30</td>
      <td>31</td>
      <td>32</td>
      <td>33</td>
      <td>34</td>
      <td>35</td>
    </tr>
  </tbody>
</table>
</div>
<pre><code class="language-python"># Now, let&apos;s check what loc[2] and iloc[2] return
frame.loc[2]
</code></pre>
<pre><code>a    24
b    25
c    26
d    27
e    28
f    29
Name: 2, dtype: int64
</code></pre>
<pre><code class="language-python">frame.iloc[2]
</code></pre>
<pre><code>a    12
b    13
c    14
d    15
e    16
f    17
Name: 1, dtype: int64
</code></pre>
<p>Can you see they return different rows? This happens because <code>loc[2]</code> looks for a row with an index with a <strong>value</strong> of two, in this case, the penultimate row. On the other hand, <code>iloc[2]</code> just looks for the third row, the one with <em>positional index 2, starting from 0</em>. If you remember this, you will have no problem dealing with dataframes with numeric indexes!</p>
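<p>The same difference shows up with slices. In the sketch below (variable names are illustrative), <code>loc[3:4]</code> returns every row from label 3 through label 4 in the order they are stored, while <code>iloc[3:4]</code> returns only the row at position 3. Sorting the index is one way to make labels and positions line up again.</p>

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     columns=['a', 'b', 'c', 'd', 'e', 'f'],
                     index=[5, 3, 1, 4, 2, 0])

# Label slice: from the row labeled 3 to the row labeled 4, both included
by_label = frame.loc[3:4]

# Positional slice: only position 3, because the end is excluded
by_position = frame.iloc[3:4]

# After sorting, the labels 0..5 match the positions 0..5
ordered = frame.sort_index()

print(by_label.index.tolist())
print(by_position.index.tolist())
print(ordered.loc[2].equals(ordered.iloc[2]))
```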
<h2 id="selectionisarichtopic">Selection is a rich topic</h2>
<p>One of the great things about Pandas is how easy it makes selecting only the data you need. As you may already know, almost every advanced application rests on this foundation, and now you know how to use it!</p>
<p>Now that we can select data and understand how indexes work, we can deal with two interesting topics: Reindexing and deletion of entries. The next article will talk about these topics, so make sure to come back to check it.</p>
<p>Thanks for reading!</p>
<h2 id="whattodonext">What to do next</h2>
<ul>
<li>Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.</li>
<li><a href="https://github.com/don-juancito/BrainsToBytes_CodeSamples/tree/master/?ref=brainstobytes.com">You can find the source code for this series in this repo</a>.</li>
<li>This article is based on Python for Data Analysis. This and other very helpful books can be found in the <a href="https://www.brainstobytes.com/recommended-books/">recommended reading list</a>.</li>
<li>Send me an email with questions, comments or suggestions (it&apos;s in the <a href="https://www.brainstobytes.com/about">About Me page</a>).</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>