Pandas

A Simple Introduction: DataFrames and Series

While DataFrames excel at handling tabular data, Series offer a versatile way to work with one-dimensional labeled data. They provide the foundation for building DataFrames and serve as the building blocks for various data manipulation tasks. Here we put the pandas Series in the spotlight: how to create Series, perform key operations, and leverage their unique strengths for efficient data manipulation.

Understanding the Pandas Series

At its core, the Series is simply a sequence of data. Think of a Series as a single column in a spreadsheet. It holds a collection of elements (data points) with corresponding labels (indexes). These indexes can be unique, but they can also be non-unique. While unique indexes allow for efficient retrieval by index value, non-unique indexes might require additional considerations when manipulating the Series.

A Series can be created from data types like lists, NumPy arrays, dictionaries, etc. Below, we create a Series from a list of names. By default, the data points in the Series are indexed from zero. However, the Series constructor has an index parameter that we can use to pass a custom index to the data points. In the example below, we pass alphabetic characters as the index. For this to work, the length of the custom index must match the length of the data points. If there is a mismatch (if the data points outnumber the index or vice versa), the code raises a ValueError.

The data type of a Series created from strings is "object"; it is inferred from the data. If the data is of the integer data type, the inferred data type is int64. We can also use the dtype parameter to set the data type of the Series explicitly, for example to a float data type. While a Series usually holds a single data type, if the data is mixed, the data type falls back to "object", the only data type that accommodates all different types of data. Please note that extensive mixing of data types can lead to performance issues and unexpected behavior; it is always recommended to have a Series that holds a single type of data.
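
Since the original code listings are not reproduced here, the following is a minimal sketch of the points above; the example names, values, and index labels are assumptions:

    import pandas as pd

    # Series from a list of names; pandas assigns a default integer index (0, 1, 2, ...)
    names = pd.Series(["Amara", "Ben", "Carla"])
    print(names)
    print(names.dtype)          # object, inferred from the string data

    # Custom index and explicit dtype; the index length must match the data length,
    # otherwise pandas raises a ValueError
    numbers = pd.Series([20, 30, 40, 10], index=["a", "b", "c", "d"], dtype="float64")
    print(numbers)

    # Mixed data falls back to the "object" dtype
    mixed = pd.Series([1, "two", 3])
    print(mixed.dtype)          # object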

Creating a Series from a Dictionary

We can also create a Series from a dictionary. By default, the keys of the dictionary become the index and the values become the data points. If you want to create a Series with only specific key-value pairs, or you need a different order, you can pass a custom index that matches the keys of the key-value pairs you want to include in your Series. In the example below, the key "gender" and its value "Male" do not end up in the Series. Remember, for this to work, the custom index must match the keys you want to include in the Series.
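
Again a small sketch; only the "gender"/"Male" pair is taken from the text above, the other dictionary entries are assumptions:

    import pandas as pd

    person = {"name": "Sam", "age": 30, "city": "Nairobi", "gender": "Male"}

    # Default: the dictionary keys become the index, the values become the data points
    full_series = pd.Series(person)
    print(full_series)

    # A custom index keeps only the listed keys (here, "gender" is left out)
    subset_series = pd.Series(person, index=["name", "age", "city"])
    print(subset_series)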

Indexing and Slicing pandas Series

The Series supports both label-based and positional indexing, so you can access elements by their index label or by their integer position. In the example below, we use the index label "a" to access 20.0 from the Series. Additionally, the pandas Series supports slicing operations, allowing you to extract subsets of data easily. Let's say we want to extract a subset with the numbers 20 to 40 from the above Series; we can do this with label slicing.

Using Filtering to Extract a Subset

We can also extract specific subsets of data based on conditions, leveraging Boolean expressions or comparison operators to define the filtering conditions. Let's say we want to extract the numbers greater than 10 from the Series. In the code below, my_series[my_series > 10] uses a Boolean mask to filter the original Series: only the elements for which the corresponding value in the mask is True are included in the subset, in this case only the values greater than 10.
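
A sketch of these operations, assuming a Series consistent with the text above (20.0 stored at label "a", and so on):

    import pandas as pd

    my_series = pd.Series([20.0, 30.0, 40.0, 10.0], index=["a", "b", "c", "d"])

    print(my_series["a"])          # label-based access -> 20.0
    print(my_series.iloc[0])       # positional access  -> 20.0

    # Label slicing includes both endpoints
    print(my_series["a":"c"])      # 20.0, 30.0, 40.0

    # Filtering with a Boolean mask
    print(my_series[my_series > 10])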

Short Introduction

  • Object creation
  • Viewing data
  • Selection
  • Getting
  • Selection by label
  • Selection by position
  • Boolean indexing
  • Setting
  • Missing data
  • Operations
  • Merge
  • Grouping
  • Reshaping
  • Time series
  • Categoricals
  • Plotting
  • Getting data in/out
    The name pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.

    Customarily, we import as follows:

    # Note
    import numpy as np
    import pandas as pd

    Object creation

    See the Data Structure Intro section.

    Creating a Series by passing a list of values, letting pandas create a default integer index:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)

    Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    print(dates)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df)

    Creating a DataFrame by passing a dict of objects that can be converted to series-like.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df2 = pd.DataFrame(
            {
                "A": 1.0,
                "B": pd.Timestamp("20210301"),
                "C": pd.Series(1, index=list(range(4)), dtype="float32"),
                "D": np.array([3] * 4, dtype="int32"),
                "E": pd.Categorical(["test", "train", "test", "train"]),
                "F": "foo",
            }
        )
    print(df2)
    print()
    print(df2.dtypes)

    The columns of the resulting DataFrame have different dtypes.

    If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

    # Note
    df2.<TAB>    # noqa: E225, E999
    df2.A                  df2.bool
    df2.abs                df2.boxplot
    df2.add                df2.C
    df2.add_prefix         df2.clip
    df2.add_suffix         df2.columns
    df2.align              df2.copy
    df2.all                df2.count
    df2.any                df2.combine
    df2.append             df2.D
    df2.apply              df2.describe
    df2.applymap           df2.diff
    df2.B                  df2.duplicated
    

    As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the attributes have been truncated for brevity.

    Viewing data

    See the Basics section.

    Here is how to view the top and bottom rows, the index, and the columns of the frame:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.head(),"\n")
    print(df.tail(3),"\n")
    print(df.index,"\n")
    print(df.columns,"\n")

    DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

    For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.to_numpy())

    For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df2 = pd.DataFrame(
            {
                "A": 1.0,
                "B": pd.Timestamp("20210301"),
                "C": pd.Series(1, index=list(range(4)), dtype="float32"),
                "D": np.array([3] * 4, dtype="int32"),
                "E": pd.Categorical(["test", "train", "test", "train"]),
                "F": "foo",
            }
        )
    
    print(df2.to_numpy())

    Note

    DataFrame.to_numpy() does not include the index or column labels in the output.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.describe(),"\n")                           # description
    print(df.T,"\n")                                    # Transposing the data
    print(df.sort_index(axis=1, ascending=False),"\n")  # sorting by an axis
    print(df.sort_values(by="B"))                       # sorting by values

    Selection

    While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.

    See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

    Getting

    Selecting a single column, which yields a Series, equivalent to df.A:

    import numpy as np
    import pandas as pd
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df["A"])

    Selecting via [], which slices the rows.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df[0:3],"\n")
    print(df["20210302":"20210304"])

    Selection by label

    See more in Selection by Label.

    For getting a cross section using a label:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.loc[dates[0]],"\n")
    print(df.loc[:, ["A", "B"]],"\n")                      # multi-axis label
    print(df.loc["20130102":"20130104", ["A", "B"]],"\n")  # label slicing, both endpoints are included
    print(df.loc["20130102", ["A", "B"]],"\n")             # Reduction in the dimensions of the returned object
    print(df.loc[dates[0], "A"], "\n")                     # getting a scalar value
    print(df.at[dates[0], "A"])                            # getting fast access to a scalar (equivalent to the prior method)

    Selection by position

    See more in Selection by Position.

    Select via the position of the passed integers:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.iloc[3], "\n")
    print(df.iloc[3:5, 0:2],"\n")                      # By integer slices, acting similar to numpy/Python
    print(df.iloc[[1, 2, 4], [0, 2]], "\n")            # By lists of integer position locations, similar to the NumPy/Python style
    print(df.iloc[1:3, :], "\n")                       # slicing rows explicitly
    print(df.iloc[:, 1:3], "\n")                       # slicing columns explicitly
    print(df.iloc[1, 1], "\n")                         # getting a value explicitly
    print(df.iat[1, 1])                                # For getting fast access to a scalar (equivalent to the prior method)

    Boolean indexing

    Using a single column’s values to select data.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df[df["A"] > 0], "\n") 
    print(df[df > 0], "\n")         # Selecting values from a DataFrame where a boolean condition is met
    
    df2 = df.copy()                 # Using the isin() method for filtering
    df2["E"] = ["one", "one", "two", "three", "four", "three"]
    print(df2, "\n")
    print(df2[df2["E"].isin(["two", "four"])])

    Setting

    Setting a new column automatically aligns the data by the indexes.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20210301", periods=6))
    print(s1, "\n")
    df["F"] = s1
    df.at[dates[0], "A"] = 0                  # Setting values by label
    df.iat[0, 1] = 0                          # Setting values by position
    df.loc[:, "D"] = np.array([5] * len(df))  # Setting by assigning with a NumPy array
    print(df, "\n")                           # The result of the prior setting operations
    
    df2 = df.copy()
    df2[df2 > 0] = -df2                       # A 'where' operation with setting
    print(df2)                                # The result

    Missing data

    pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.

    Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
    df1.loc[dates[0] : dates[1], "E"] = 1
    print(df1, "\n")
    
    print(df1.dropna(how="any"), "\n")         # To drop any rows that have missing data.
    
    print(df1.fillna(value=5), "\n")           # Filling missing data.
    
    print(pd.isna(df1))                        # To get the boolean mask where values are nan.

    Operations

    See the Basic section on Binary Ops.

    Stats

    Operations in general exclude missing data.

    Performing a descriptive statistic:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.mean(), "\n")
    print(df.mean(1))            # Same operation on the other axis

    Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
    print(s, "\n")
    print(df.sub(s, axis="index"))

    Apply

    Applying functions to the data:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    print(df.apply(np.cumsum), "\n")
    print(df.apply(lambda x: x.max() - x.min()))

    Histogramming

    See more at Histogramming and Discretization.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    dates = pd.date_range("20210301", periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    s = pd.Series(np.random.randint(0, 7, size=10))
    print(s, "\n")
    print(s.value_counts())

    String Methods

    Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
    print(s.str.lower())

    Merge

    Concat

    pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

    See the Merging section.

    Concatenating pandas objects together with concat():

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df = pd.DataFrame(np.random.randn(10, 4))
    print(df, "\n")
    
    pieces = [df[:3], df[3:7], df[7:]]    # break it into pieces
    for i in pieces:
        print(i, "\n")
    print(pd.concat(pieces))

    Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it. See Appending to dataframe for more.
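
    As a small illustrative sketch of this recommendation (the record values are made up):

    import pandas as pd

    # Recommended: collect the rows first, then build the DataFrame in a single call
    records = [
        {"name": "a", "value": 1},
        {"name": "b", "value": 2},
        {"name": "c", "value": 3},
    ]
    df = pd.DataFrame(records)
    print(df)

    # Not recommended: building the frame by appending/concatenating one row at a time,
    # because every step copies all of the existing data.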

    Join

    SQL style merges. See the Database style joining section.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
    right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
    
    print(left, "\n")
    print(right, "\n")
    print(pd.merge(left, right, on="key"), "\n")
    
    left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
    right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
    
    print(left, "\n")
    print(right, "\n")
    print(pd.merge(left, right, on="key"))

    Grouping

    By “group by” we are referring to a process involving one or more of the following steps:

    • Splitting the data into groups based on some criteria

    • Applying a function to each group independently

    • Combining the results into a data structure

    See the Grouping section.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df = pd.DataFrame(
            {
                "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
                "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
                "C": np.random.randn(8),
                "D": np.random.randn(8),
            }
        )
    print(df, "\n")
    print("Grouping and then applying the sum() function to the resulting groups:")
    print(df.groupby("A").sum(), "\n")
    print("Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:")
    print(df.groupby(["A", "B"]).sum())

    Reshaping

    See the sections on Hierarchical Indexing and Reshaping.

    Stack

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    tuples = list(
            zip(
                *[
                    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
                    ["one", "two", "one", "two", "one", "two", "one", "two"],
                ]
            )
        )
    index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
    df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
    df2 = df[:4]
    print(df2, "\n")
    stacked = df2.stack()    # The stack() method “compresses” a level in the DataFrame’s columns
    print(stacked, "\n")
    
    print("""With a “stacked” DataFrame or Series (having a MultiIndex as the index), 
    the inverse operation of stack() is unstack(), which by default unstacks the last level:""")
    print(stacked.unstack(), "\n")
    print(stacked.unstack(1), "\n")
    print(stacked.unstack(0), "\n")

    Pivot tables

    See the section on Pivot Tables.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df = pd.DataFrame(
            {
                "A": ["one", "one", "two", "three"] * 3,
                "B": ["A", "B", "C"] * 4,
                "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
                "D": np.random.randn(12),
                "E": np.random.randn(12),
            }
        )
        
    print(df, "\n")
    print("We can produce pivot tables from this data very easily:")
    print(pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"]))

    Time series

    pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    rng = pd.date_range("1/1/2021", periods=100, freq="S")
    ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    print(ts.resample("5Min").sum(), "\n")
    
    rng = pd.date_range("3/6/2021 00:00", periods=5, freq="D")   # Time zone representation:
    ts = pd.Series(np.random.randn(len(rng)), rng)
    print(ts, "\n")
    ts_utc = ts.tz_localize("UTC")
    print(ts_utc, "\n")
    print(ts_utc.tz_convert("US/Eastern"), "\n")                 # Converting to another time zone:
    
    rng = pd.date_range("1/1/2021", periods=5, freq="M")         # Converting between time span representations:
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    print(ts, "\n")
    ps = ts.to_period()
    print(ps, "\n")
    print(ps.to_timestamp())

    Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    prng = pd.period_range("2020Q1", "2030Q4", freq="Q-NOV")
    ts = pd.Series(np.random.randn(len(prng)), prng)
    ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
    print(ts.head())

    Categoricals

    pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df = pd.DataFrame(
            {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
        )
       
    df["grade"] = df["raw_grade"].astype("category")   # Convert the raw grades to a categorical data type.
    print(df["grade"], "\n")
    
    df["grade"].cat.categories = ["very good", "good", "very bad"] # Rename the categories to more meaningful names (assigning to Series.cat.categories() is in place!).
    
    # Reorder the categories and simultaneously add the missing categories (methods under Series.cat() return a new Series by default).
    df["grade"] = df["grade"].cat.set_categories( ["very bad", "bad", "medium", "good", "very good"] )
    print(df["grade"], "\n")
    
    print(df.sort_values(by="grade"), "\n")            # Sorting is per order in the categories, not lexical order.
    print(df.groupby("grade").size())                  # Grouping by a categorical column also shows empty categories.

    Plotting

    See the Plotting docs.

    We use the standard convention for referencing the matplotlib API:

    # Note
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    #import pandas._libs.tslib
    
    plt.close("all")
    ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
    ts = ts.cumsum()
    ts.plot()

    Created image (JupyterLab): series_plot_basic.png

    On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

    # Note
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    #import pandas._libs.tslib
    
    plt.close("all")
    
    df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))
    df = df.cumsum()
    plt.figure()
    df.plot();
    plt.legend(loc='best');

    Created image (JupyterLab): frame_plot_basic.png

    Getting data in/out

    CSV

    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))  # 1000 rows x 4 columns
    df = df.cumsum()
    df.to_csv("foo.csv")                                               # write the data to a CSV file
    
    df2 = pd.read_csv("foo.csv")                                       # read it back in
    print(df2)                                                         # print to verify

    Excel

    Reading and writing to MS Excel.

    Writing to an Excel file.

    # pip3 install openpyxl xlrd
    import numpy as np
    import pandas as pd
    #import pandas._libs.tslib
    
    df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))  # 1000 rows x 4 columns
    df = df.cumsum()     
    df.to_excel("foo.xlsx", sheet_name="Sheet1", engine='openpyxl')
    
    print("Reading data back from an excel file. Needs a special module openpxyl! ")
    df2=pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"], engine='openpyxl')
    print(df2)                                                         # zur Kontrolle ausgeben

    There are several ways to replace values in a pandas Series object. One of them is the where() method. In the code below, we create a Series object called "students_series" with the names of students. We are going to use the where() method together with the isin() method to replace "Kelly" with "John" and "Lunda" with "Ruta".
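
    The original listing is not included here, so the following is a minimal sketch of the described approach; the exact contents of students_series and new_names are assumptions:

    import pandas as pd

    students_series = pd.Series(["Kelly", "Rose", "Lunda", "Sam"])   # assumed names
    new_names = pd.Series(["John", "Rose", "Ruta", "Sam"])           # replacements aligned by position

    # True for elements that are NOT "Kelly" or "Lunda"
    keep_mask = ~students_series.isin(["Kelly", "Lunda"])

    # where() keeps values where the mask is True and takes them from new_names where it is False
    students_series = students_series.where(keep_mask, new_names)
    print(students_series)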

    In this code, the isin() method checks each element in students_series against the list ["Kelly", "Lunda"] and returns a Boolean Series that is True where the condition is met and False where it is not. The "~" operator negates the Boolean values returned by isin(): it flips True to False and False to True (similar to the "not" operator), so it marks the elements that are not equal to "Kelly" or "Lunda" as True. The where() method applies this condition to students_series: it keeps the elements where the condition is True and replaces the elements where the condition is False with the corresponding elements of the new_names Series object.

    Dropping Elements

    When working with a pandas Series object, you can easily drop elements using the drop() method. Let's say we want to drop "Rose" and "Sam" from the students_series object. We are going to use the index labels of the elements "Rose" and "Sam" and pass them to the drop() method. See the code below:
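
    Continuing the sketch from above (the names are assumptions; under the default integer index, "Rose" and "Sam" sit at the labels 1 and 3):

    import pandas as pd

    students_series = pd.Series(["John", "Rose", "Ruta", "Sam"])   # assumed names after the replacements above

    # Drop "Rose" and "Sam" by their index labels
    modified_series = students_series.drop([1, 3])
    print(modified_series)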

    In this code, we have used the index labels of the elements to drop "Rose" and "Sam" from the Series object and created a new Series object called "modified_series" with the remaining elements.

    Concatenating Multiple Series

    Pandas has a concat() function that makes combining multiple Series objects easy. In the code below, we create a new Series object, and then we will use the concat() function to combine this new Series object with the students_series object from above.
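
    A short sketch of this step, with assumed names for both Series objects:

    import pandas as pd

    students_series = pd.Series(["John", "Rose", "Ruta", "Sam"])   # assumed names
    new_students = pd.Series(["Amara", "Chen"])                    # assumed new names

    # ignore_index=True resets the index of the combined Series
    combined_series = pd.concat([students_series, new_students], ignore_index=True)
    print(combined_series)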

    You can see in the output that the two Series have been combined. Setting the ignore_index parameter to True ensures that the new Series object resets the index and ignores the indices of the two objects being concatenated.

    Filling Missing Values

    When you are working with data, quite often you are going to deal with missing values. Filling missing values with a specific value is a common operation when dealing with data that has missing information. The pandas library provides the fillna() method to accomplish this. Here's how you can use it on a pandas Series object:
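
    A minimal sketch covering this step and the string transformation discussed further below; the ages, and which names appear besides "Lu", are assumptions:

    import numpy as np
    import pandas as pd

    # Assumed ages; only the fact that the value for "Lu" is missing comes from the text
    students_age_series = pd.Series({"John": 21, "Rose": 24, "Lu": np.nan, "Sam": 27})

    # Fill the missing value with the mean of the non-missing values
    students_age_series.fillna(students_age_series.mean(), inplace=True)
    print(students_age_series)

    # The index labels (the names) can be transformed with vectorized string methods
    students_age_series.index = students_age_series.index.str.upper()
    print(students_age_series)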

    In this code, we create a Series object that has a missing value for "Lu". Using the fillna() method, we fill the missing value with the mean of the non-missing values in the Series via fillna(students_age_series.mean(), inplace=True). Setting inplace to True means that we modify the original Series in place. You can replace NaN values with any desired value or with the result of some computation, as we have done here.

    String Manipulation

    When working with string data, you will have to perform various text cleaning or transformation tasks. String methods such as str.upper(), str.lower(), and str.strip() can be used on a pandas Series (or on its index). In the sketch above, str.upper() converts the names, which serve as the index labels, to uppercase letters. This is just one of many string methods that you can use on Series objects.

    Conclusion

    These are just a few of the methods that you can use on a pandas Series object. Mastering them will come in handy when manipulating data for analysis. Practice makes perfect, so don't shy away from creating your own Series objects and trying out these methods and more.

    Examples

    Fetching the capital cities of the world from the web:

    import pandas as pd
    import requests
    
    url = "https://de.wikipedia.org/wiki/Liste_der_Hauptst%C3%A4dte_der_Erde"
    
    html = requests.get(url).content
    df_list = pd.read_html(html)     # df -> DataFrame (pandas type)
    print("There are", len(df_list), "tables")
    
    df = df_list[0]                  # use only the first table
    Liste = df.values.tolist()       # convert the DataFrame to a list
    
    print(Liste[0])                  # check whether a header row exists
    
    z = 0
    for zeile in Liste:
        if str(zeile[2]) == '' or str(zeile[2]) == "nan":
            zeile[2] = "0"
        einwohner = str(zeile[2]).replace(".","")
    #    print(einwohner,type(einwohner))
        if einwohner.isdigit():
           if int(einwohner) > 10000000:
              print(zeile)
              z += 1
    print(z,"Städte")
