torcharrow/tutorial/tutorial.py at main · pytorch/torcharrow · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
#!/usr/bin/env python
# coding: utf-8
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# # TorchArrow in 10 minutes
#
# TorchArrow is a torch.Tensor-like Python DataFrame library for data preprocessing in deep learning. It supports multiple execution runtimes and Arrow as a common memory format.
#
# (Remark. In case the following looks familiar, it is with gratitude that portions of this tutorial were borrowed and adapted from the 10 Minutes to Pandas (and CuDF) tutorial.)
#
#

# The TorchArrow library consists of 3 parts:
#
#   * *DTypes* define *Schema*, *Fields*, primitive and composite *Types*.
#   * *Columns* defines sequences of strongly typed data with vectorized operations.
#   * *Dataframes*  are sequences of named and typed columns of same length with relational operations.
#
# Let's get started...

# In[1]:


# ## Constructing data: Columns
#
# ### From Pandas to TorchArrow
# To start let's create a Panda series and a TorchArrow column and compare them:

# In[2]:


import pandas as pd
import torcharrow as ta
import torcharrow.dtypes as dt


pd.Series([1, 2, None, 4])


# In Pandas each Series has an index, here depicted as the first column. Note also that the inferred type is float and not int, since in Pandas None implicitly  promotes an int list to a float series.

# TorchArrow has a much more precise type system:

# In[3]:


s = ta.column([1, 2, None, 4])
s


# TorchArrow creates CPU column by default, which is supported by [Velox](https://github.com/facebookincubator/velox) backend.

# In[4]:


s.device


# TorchArrow infers that the type is `Int64(nullable=True)`. Of course, we can always get lots of more information from a column: the length, count, null_count determine the total number, the number of non-null, and the number of nulls, respectively.
#
#
#

# In[5]:


len(s), s._count(), s.null_count


# TorchArrow infers Python float as float32 (instead of float64). This follows PyTorch and other deep learning libraries.

# In[6]:


ss = ta.column([2.718, 3.14, 42.42])
ss


# TorchArrow supports (almost all of Arrow types), including arbitrarily nested structs, maps, lists, and fixed size lists. Here is a non-nullable column of a list of non-nullable strings of arbitrary length.

# In[7]:


sf = ta.column([["hello", "world"], ["how", "are", "you"]], dtype=dt.List(dt.string))
sf.dtype


# And here is a column of average climate data, one map per continent, with city as key and yearly average min and max temperature:
#

# In[8]:


mf = ta.column(
    [
        {"helsinki": [-1.3, 21.5], "moscow": [-4.0, 24.3]},
        {"algiers": [11.2, 25.2], "kinshasa": [22.2, 26.8]},
    ]
)
mf


# ### Append and concat

# Columns are immutable. Use `append` to create a new column with a list of values appended.

# In[9]:


sf = sf.append([["I", "am", "fine", "and", "you"]])
sf


# Use `concat` to combine a list of columns.

# In[10]:


# TODO: Fix this!
# sf = sf.concat([ta.column("I", "am", "fine", "too")]


# ## Constructing data: Dataframes
#
# A Dataframe is just a set of named and strongly typed columns of equal length:

# In[11]:


df = ta.dataframe(
    {"a": list(range(7)), "b": list(reversed(range(7))), "c": list(range(7))}
)
df


# To access a dataframes columns write:

# In[12]:


df.columns


# Dataframes are also immutable, except you can always add a new column or overwrite an existing column.
#
# When a new column is added, it is appended to the set of existing columns at the end.

# In[13]:


df["d"] = ta.column(list(range(99, 99 + 7)))
df


# You can also overwrite an existing column.

# In[14]:


df["c"] = df["b"] * 2
df


# Similar to Column, we can also use `append` to create a new DataFrame with a listed of tuples appended.

# In[15]:


df = df.append([(7, 77, 777, 7777), (8, 88, 888, 8888)])
df


# Dataframes can be nested. Here is a Dataframe having sub-dataframes.
#

# In[16]:


df_inner = ta.dataframe({"b1": [11, 22, 33], "b2": [111, 222, 333]})
df_outer = ta.dataframe({"a": [1, 2, 3], "b": df_inner})
df_outer


# ## Interop
#
# Coming soon!

# ## Viewing (sorted) data
#
# Take the the top n rows

# In[17]:


df.head(2)


# Or return the last n rows

# In[18]:


df.tail(1)


# or sort the values before hand.

# In[19]:


df.sort(by=["c", "b"]).head(2)


# Sorting can be controlled not only by which columns to sort on, but also whether to sort ascending or descending, and how to deal with nulls, are they listed first or last.
#
# ## Selection using Indices
#
# Torcharrow supports two indices:
#  - Integer indices select rows
#  - String indices select columns
#
# So projecting a single column of a dataframe is simply

# In[20]:


df["a"]


# Selecting a single row uses an integer index. (In TorchArrow everything is zero-based.)

# In[21]:


df[1]


# Selecting a slice keeps the type alive. Here we slice rows:
#

# In[22]:


df[2:6:2]


# TorchArrow follows the normal Python semantics for slices: that is a slice interval is closed on the left and open on the right.

# ## Selection by Condition
#
# Selection of a column or dataframe *c* by a condition takes a boolean column *b* of the same length as *c*. If the *i*th row in *b* is true, *c*'s *i*th row is included in the result otherwise it is dropped. Below expression selects the first row, since it is true, and drops all remaining rows, since they are false.
#
#

# In[24]:


df[[True] + [False] * (len(df) - 1)]


# Conditional expressions over vectors return boolean vectors. Conditionals are thus the usual way to write filters.

# In[25]:


b = df["a"] > 4
df[b]


# Torcharrow supports all the usual predicates, like ==,!=,<,<=,>,>= as well as _in_. The later is denoted by `isin`
#

# In[26]:


df[df["a"].isin([5])]


# ## Missing data
#  Missing data can be filled in via the `fill_null` method

# In[27]:


t = s.fill_null(999)
t


# Alternatively data that has null data can be dropped:

# In[28]:


s.drop_null()


# ## Operators
# Columns and dataframes support all of Python's usual binary operators, like  ==,!=,<,<=,>,>= for equality  and comparison,  +,-,*,**,/,// for performing arithmetic and &,|,~ for conjunction, disjunction and negation.
#
# The semantics of each operator is given by lifting their scalar operation to vectors and dataframes. So given for instance a scalar comparison operator, in TorchArrow a scalar can be compared to each item in a column, two columns can be compared pointwise, a column can be compared to each column of a dataframe, and two dataframes can be compared by comparing each of their respective columns.
#
# Here are some example expressions:

# In[29]:


u = ta.column(list(range(5)))
v = -u
w = v + 1
v * w


# In[30]:


uv = ta.dataframe({"a": u, "b": v})
uu = ta.dataframe({"a": u, "b": u})
(uv == uu)


# ## Null strictness
#
# The default behavior of torcharrow operators and functions is that *if any argument is null then the result is null*. For instance:

# In[31]:


u = ta.column([1, None, 3])
v = ta.column([11, None, None])
u + v


# If null strictness does not work for your code you could call first `fill_null` to provide a value that is used instead of null.

# ## Numerical columns and descriptive statistics
# Numerical columns also support lifted operations, for `abs`, `ceil`, `floor`, `round`. Even more excited might be to use their aggregation operators like `count`, `sum`, `prod`, `min`, `max`, or descriptive statistics like `std`, `mean`, `median`, and `mode`. Here is an example ensemble:
#

# In[32]:


(t.min(), t.max(), t.sum(), t.mean())


# The `describe` method puts this nicely together:

# In[33]:


t.describe()


# Sum, prod, min and max are also available as accumulating operators called `cumsum`, `cumprod`, etc.
#
# Boolean vectors are very similar to numerical vector. They offer the aggregation operators `any` and `all`.

# ## String, list and map methods
# Torcharrow provides all of Python's string, list and map processing methods, just lifted to work over columns. Like in Pandas they are all accessible via the `str`, `list` and `map` property, respectively.
#
# ### Strings
# Let's convert a Column of strings to upper case.
#

# In[34]:


s = ta.column(["what a wonderful world!", "really?"])
s.str.upper()


# We can also split each string to a list of strings with the given delimiter.
#

# In[35]:


ss = s.str.split(pat=" ")
ss


# ### Lists
#
# To operate on a list column use the usual pure list operations, like `len(gth)`, `slice`, `index` and `count`, etc. But there are a couple of additional operations.
#
# For instance to invert the result of a string split operation a list of string column also offers a join operation.
#

# In[36]:


ss.list.join(sep="-")


# In addition lists provide `filter`, `map`, `flatmap` and `reduce` operators, which we will discuss as in more details in functional tools.
#
# ### Maps
#
# Column of type map provide the usual map operations like `len(gth)`, `[]`, `keys` and `values`. Keys and values both return a list column. Key and value columns can be reassembled by calling `mapsto`.

# In[37]:


mf.maps.keys()


# ## Relational tools: Where, select, groupby, join, etc.
#
# TorchArrow also plans to support all relational operators on DataFrame. The following sections discuss what exists today.
#
# ### Where
# The simplest operator is `df.where(p)` which is just another way of writing `df[p]`. (Note: TorchArrow's `where`  is not Pandas' `where`, the latter is a vectorized if-then-else which we call in Torcharrow `ite`.)

# In[38]:


xf = ta.dataframe({"A": ["a", "b", "a", "b"], "B": [1, 2, 3, 4], "C": [10, 11, 12, 13]})

xf.where(xf["B"] > 2)


# Note that in `xf.where` the predicate `xf['B']>2` refers to self, i.e. `xf`. To access self in an expression TorchArrow introduces the special name `me`. That is, we can also write:
#

# In[39]:


from torcharrow import me


xf.where(me["B"] > 2)


# ### Select
#
# Select is SQL's standard way to define a new set of columns. We use *positional args to keep columns and kwargs to give new bindings. Here is a typical example that keeps all of xf's columns but adds column 'D').
#

# In[40]:


xf.select(*xf.columns, D=me["B"] + me["C"])


# The short form of `*xf.columns` is '\*', so `xf.select('*', D=me['B']+me['C'])` does the same.

# ### Grouping, Join and Tranpose
#
# Coming soon!
#

# ## Functional tools:  map, filter, reduce
#
# Column and dataframe pipelines support map/reduce style programming as well. We first explore column oriented operations.
#
# ###  Map and its variations
#
# `map` maps values of a column according to input correspondence. The input correspondence can be given as a mapping or as a (user-defined-) function (UDF). If the mapping is a dict, then non mapped values become null.
#
#
#

# In[41]:


ta.column([1, 2, None, 4]).map({1: 111})


# If the mapping is a defaultdict, all values will be mapped as described by the default dict.

# In[42]:


from collections import defaultdict

ta.column([1, 2, None, 4]).map(defaultdict(lambda: -1, {1: 111}))


# **Handling null.** If the mapping is a function, then it will be applied on all values (including null), unless na_action is `'ignore'`, in which case, null values are passed through.

# In[43]:


def add_ten(num):
    return num + 10


ta.column([1, 2, None, 4]).map(add_ten, na_action="ignore")


# Note that `.map(add_ten, na_action=None)` would fail with a type error since `add_ten` is not defined for `None`/null. So if we wanted to pass null to `add_ten` we would have to prepare for it, maybe like so:

# In[44]:


def add_ten_or_0(num):
    return 0 if num is None else num + 10


ta.column([1, 2, None, 4]).map(add_ten_or_0, na_action=None)


# **Mapping to different types.** If `map` returns a column type that is different from the input column type, then `map` has to specify the returned column type.

# In[45]:


ta.column([1, 2, 3, 4]).map(str, dtype=dt.string)


# Instead of specifying `dtype` argument, you can also rely on type annotations (both Python annotations and `dtypes` are supported):

# In[46]:


from typing import Optional


def str_only_even(x) -> Optional[str]:
    if x % 2 == 0:
        return str(x)
    return None


ta.column([1, 2, 3, 4]).map(str_only_even)  # dt.string(nullable=True) is inferred


# **Map over Dataframes** Of course, `map` works over Dataframes, too. In this case the callable gets the whole row as a tuple.

# In[47]:


def add_unary(tup):
    return tup[0] + tup[1]


ta.dataframe({"a": [1, 2, 3], "b": [1, 2, 3]}).map(add_unary, dtype=dt.int64)


# **Multi-parameter functions**. So far all our functions were unary functions. But `map` can be used for n-ary functions, too: simply specify the set of `columns` you want to pass to the nary function.
#

# In[48]:


def add_binary(a, b):
    return a + b


ta.dataframe({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [1, 2, 3]}).map(
    add_binary, columns=["a", "c"], dtype=dt.int64
)


# **Multi-return functions.** Functions that return more than one column can be specified by returning a dataframe  (aka as struct column); providing the  return type is mandatory.

# In[49]:


ta.dataframe({"a": [17, 29, 30], "b": [3, 5, 11]}).map(
    divmod,
    columns=["a", "b"],
    dtype=dt.Struct([dt.Field("quotient", dt.int64), dt.Field("remainder", dt.int64)]),
)


# **Functions with state**. Functions sometimes need additional precomputed state. We capture the state in a (data)class and use a method as a delegate:
#

# In[50]:


def fib(n):
    if n == 0:
        return 0
    elif n == 1 or n == 2:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)


from dataclasses import dataclass


@dataclass
class State:
    state: int

    def __post_init__(self):
        self.state = fib(self.state)

    def add_fib(self, x):
        return self.state + x


m = State(10)
ta.column([1, 2, 3]).map(m.add_fib)


# TorchArrow requires that only global functions or methods on class instances can be used as user defined functions. Lambdas, which can can capture arbitrary state and are not inspectable, are not supported.

# ### Filter
#
# `filter` takes a predicate and returns all those rows for which the predicate holds:

# In[51]:


ta.column([1, 2, 3, 4]).filter(lambda x: x % 2 == 1)


# Instead of the predicate you can pass an iterable of boolean of the same length as the column:

# In[52]:


ta.column([1, 2, 3, 4]).filter([True, False, True, False])


# If the predicate is an n-ary function, use the  `columns` argument as we have seen for `map`.

# ### Flatmap
#
# `flatmap` combines `map` with `filter`. Each callable can return a list of elements. If that list is empty, flatmap filters, if the returned list is a singleton, flatmap acts like map, if it returns several elements it 'explodes' the input. Here is an example:

# In[53]:


def selfish(words):
    return [words, words] if len(words) >= 1 and words[0] == "I" else []


sf.flatmap(selfish)


# `flatmap` has all the flexibility of `map`, i.e it can take the `ignore`, `dtype` and `column` arguments.

# ### Reduce
# `reduce` is just like Python's `reduce`. Here we compute the product of a column.

# In[54]:


import operator


ta.column([1, 2, 3, 4]).reduce(operator.mul)


# ## Batch Transform
#

# Batch `transform` is similar to `map`, except the functions takes batch input/output (represented by a Python list, PyTorch tensors, etc).

# In[55]:


from typing import List


def multiple_ten(val: List[int]) -> List[int]:
    return [x * 10 for x in val]


ta.column([1, 2, 3, 4]).transform(multiple_ten, format="python")


# In[56]:


"End of tutorial"