Convert Pandas dataframe to NumPy array

Question

How do I convert a Pandas dataframe into a NumPy array?

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1],
        'B': [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan],
        'C': [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan],
    },
    index=[1, 2, 3, 4, 5, 6, 7],
).rename_axis('ID')

That gives this DataFrame:

      A    B    C
ID                                 
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I would like to convert this to a NumPy array, like so:

array([[ nan,  0.2,  nan],
       [ nan,  nan,  0.5],
       [ nan,  0.2,  0.5],
       [ 0.1,  0.2,  nan],
       [ 0.1,  0.2,  0.5],
       [ 0.1,  nan,  0.5],
       [ 0.1,  nan,  nan]])

Also, is it possible to preserve the dtypes, like this?

array([[ 1, nan,  0.2,  nan],
       [ 2, nan,  nan,  0.5],
       [ 3, nan,  0.2,  0.5],
       [ 4, 0.1,  0.2,  nan],
       [ 5, 0.1,  0.2,  0.5],
       [ 6, 0.1,  nan,  0.5],
       [ 7, 0.1,  nan,  nan]],
     dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Why do you need this? Aren't dataframes based on numpy arrays anyway? You should be able to use a dataframe where you need a numpy array. That's why you can use dataframes with scikit-learn where the functions ask for numpy arrays. — chrisfs, Commented Apr 22, 2018 at 17:56
Here are a couple of possibly relevant links about dtypes & recarrays (aka record arrays or structured arrays): (1) stackoverflow.com/questions/9949427/… (2) stackoverflow.com/questions/52579601/… — JohnE, Commented Oct 11, 2018 at 4:49
NOTE: Having to convert Pandas DataFrame to an array (or list) like this can be indicative of other issues. I strongly recommend ensuring that a DataFrame is the appropriate data structure for your particular use case, and that Pandas does not include any way of performing the operations you're interested in. — AMC, Commented Jan 7, 2020 at 19:57

wjandrea · Accepted Answer · 2022-12-20 22:00:54Z

Use `df.to_numpy()`

It's better than df.values; here's why.*

It's time to deprecate your usage of values and as_matrix().

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

to_numpy(), which is defined on Index, Series, and DataFrame objects, and
array, which is defined on Index and Series objects only.
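As a rough illustration of the difference between the two (a sketch only; the Series `s` and the one-column frame are made up for this example): `.array` hands back a pandas ExtensionArray that wraps the data, while `to_numpy()` produces a plain NumPy ndarray.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])          # hypothetical example data

arr = s.to_numpy()                # plain NumPy ndarray
ext = s.array                     # pandas ExtensionArray wrapping the same values

print(type(arr).__name__)
print(isinstance(ext, pd.api.extensions.ExtensionArray))

# DataFrame has to_numpy() but no .array attribute
frame = pd.DataFrame({"A": [1, 2, 3]})
print(hasattr(frame, "to_numpy"), hasattr(frame, "array"))
```

If the goal is to hand data to NumPy or another array library, `to_numpy()` is usually the one you want; `.array` is more useful when you want to stay inside pandas.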

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.

* to_numpy() is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in Jupyter or the terminal, using .values to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.

Towards Better Consistency: `to_numpy()`

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, 
                  index=['a', 'b', 'c'])

# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])

# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
#        [2, 8],
#        [3, 9]])

As mentioned above, this method is also defined on Index and Series objects (see here).

df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)

df['A'].to_numpy()
#  array([1, 2, 3])

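By default (copy=False), to_numpy() may return a view of the underlying data rather than a copy, as it does here where every column shares one dtype, so modifications to the array can affect the original. (With Copy-on-Write enabled in newer pandas versions, the returned view is read-only instead.)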

v = df.to_numpy()
v[0, 0] = -1
 
df
   A  B  C
a -1  4  7
b  2  5  8
c  3  6  9

If you need a copy instead, use to_numpy(copy=True).
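A small sketch of the copy case, using a made-up two-column frame. It only demonstrates the `copy=True` side, since whether the default call returns a view depends on the dtypes involved and on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  # hypothetical example data

snapshot = df.to_numpy(copy=True)   # independent array that owns its data
snapshot[0, 0] = -1                 # only the copy changes

print(df.loc[0, "A"])                               # still 1
print(np.shares_memory(snapshot, df.to_numpy()))    # False
```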

pandas >= 1.0 update for ExtensionTypes

If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Correct
a.to_numpy(dtype='float', na_value=np.nan)                                 
# array([ 1.,  2., nan])

# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1,  2, -1])

This is called out in the docs.
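The same idea should carry over to a whole DataFrame, since `DataFrame.to_numpy` also accepts `dtype` and `na_value`. A minimal sketch, assuming a reasonably recent pandas (1.1 or later for `na_value` on DataFrames) and a made-up frame that mixes a nullable integer column with a float column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "n": pd.array([1, 2, None], dtype="Int64"),  # nullable integer column
        "x": [0.5, 1.5, 2.5],                        # ordinary float column
    }
)

# Ask for one float dtype and say what <NA> should become
out = df.to_numpy(dtype="float64", na_value=np.nan)
print(out.dtype)
print(out)
```

Without the explicit `dtype`, a frame like this would typically fall back to an object array, which is the thing the example above is trying to avoid.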

If you need the `dtypes` in the result...

As shown in another answer, DataFrame.to_records is a good way to do this.

df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

Performance-wise, it's nearly the same (actually, using rec.fromrecords is a bit faster).

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
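For the frame in the question, which has a named integer index, `to_records` can also be told which dtype to use for the index through its `index_dtypes` argument. The following is a sketch on a shortened two-row version of that frame (the data is trimmed for illustration), asking for the 32-bit `ID` field shown in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [np.nan, 0.1],
        "B": [0.2, np.nan],
        "C": [np.nan, 0.5],
    },
    index=pd.Index([1, 2], name="ID"),
)

# The index name becomes the field name; index_dtypes overrides its dtype
rec = df.to_records(index_dtypes="<i4")
print(rec.dtype)
print(rec["ID"])
```

There is a matching `column_dtypes` argument for the data columns if those need a different dtype as well.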

Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]

to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.

Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() was quietly removed in v1.0 and was previously deprecated in v0.25. Before that, it was simply a wrapper around DataFrame.values, so everything said above applies.

DataFrame.as_matrix() was removed in v1.0 and was previously deprecated in v0.23. Do NOT use!

I don't understand how it is possible to read page after page after page of people screaming at the top of their lungs to switch from as_matrix to another solution, in this case, to_numpy without explaining how to recover the column selecting functionality of as_matrix! I am sure there are other ways to select columns, but as_matrix was at least one of them! — Jérémie, Commented Jul 31, 2019 at 23:50
@Jérémie besides the obvious df[[col1, col2]].to_numpy()? Not sure why you think wanting to advertise an updated alternative to a deprecated function warrants a downvote on the answer. — cs95, Commented Aug 1, 2019 at 0:31
What if some of the columns are of list type? How can I create a flat numpy array out of this? — Moniba, Commented Aug 26, 2019 at 4:37
@Moniba you may want to explode the list items into separate columns/rows as per your requirement first. — cs95, Commented Aug 26, 2019 at 4:48
Unless I'm wrong, getting more than one column in the same call gets all the data merged into one big array. Am I missing something? — Andrea Moro, Commented Aug 29, 2019 at 13:58

Abdulrahman Bres · Accepted Answer · 2024-02-06 11:27:27Z

To convert a pandas dataframe (df) to a numpy ndarray, use this code:

df.values

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

If a specific column is needed:

df['column'].values

edited Feb 6, 2024 at 11:27 by Abdulrahman Bres
answered May 5, 2016 at 5:29 by User456898

This is not the recommended method anymore!
– Marine Galantin
Commented Jan 25, 2021 at 18:27
Perhaps mention this is for versions less than v0.24?
– joel
Commented Aug 18, 2023 at 11:49

Jean-François Corbett · Accepted Answer · 2019-03-13 14:09:04Z

Note: The .as_matrix() method used in this answer is deprecated. Pandas 0.23.4 warns:

Method .as_matrix will be removed in a future version. Use .values instead.

Pandas has something built in...

numpy_matrix = df.as_matrix()

gives

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

edited Mar 13, 2019 at 14:09 by Jean-François Corbett
answered Jul 17, 2014 at 1:13 by ZJS

This does not give a structured array, all columns are of dtype object.
– sebix
Commented Oct 9, 2014 at 11:24
This is now deprecated. From v0.24 onwards, please use to_numpy instead (not .values either). More here.
– cs95
Commented Feb 5, 2019 at 5:47

prl900 · Accepted Answer · 2014-03-26 07:35:16Z

I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the Numpy representation of the dataframe, including the index:

In [8]: df
Out[8]: 
          A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333

[8 rows x 3 columns]

In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
       [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
       [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
       [ 3.        , -0.16636303, -0.95775849,  1.17865945],
       [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
       [ 5.        , -0.34016865, -0.29369841,  1.23179064],
       [ 6.        , -1.06282542,  0.55627285,  1.50805754],
       [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])

To get the dtypes we'd need to transform this ndarray into a structured array using view:

In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([( 0, -0.98272574,  0.150726  ,  0.69162512),
       ( 1,  0.61729734, -0.47187926,  0.50554728),
       ( 2,  0.4171228 , -1.35680324, -1.01349922),
       ( 3, -0.16636303, -0.95775849,  1.17865945),
       ( 4, -0.16410334,  0.0745164 , -0.67432474),
       ( 5, -0.34016865, -0.29369841,  1.23179064),
       ( 6, -1.06282542,  0.55627285,  1.50805754),
       ( 7,  0.95961001,  0.24753911,  0.09133339)],
       dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
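One caveat with `view`: it reinterprets the raw bytes of the array instead of converting values. Because `reset_index().values` upcasts an integer index to float64 when the other columns are floats, viewing that column as an integer may not give back the original index values. A minimal sketch of the difference, using a small made-up frame and filling the structured array field by field instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.5, 1.5, 2.5]})   # hypothetical data, default index 0, 1, 2

flat = df.reset_index().to_numpy()          # int index + float column -> float64 array
dtype = [("index", np.int64), ("A", np.float64)]

# view() reinterprets the float64 bit patterns as int64; no value conversion happens
viewed = flat.ravel().view(dtype)

# Assigning field by field converts the values properly
converted = np.empty(len(flat), dtype=dtype)
converted["index"] = flat[:, 0]
converted["A"] = flat[:, 1]

print(viewed["index"])
print(converted["index"])
```

The field-by-field version costs one extra allocation, but the integer column comes out with the values you would expect.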

the only thing missing in this answer is how to construct the dtype from the data frame so that you can write a generic function — Joseph Garvin, Commented Feb 13, 2017 at 17:07

meteore · Accepted Answer · 2012-11-02 10:16:00Z

You can use the to_records method, but have to play around a bit with the dtypes if they are not what you want from the get go. In my case, having copied your DF from a string, the index type is string (represented by an object dtype in pandas):

In [102]: df
Out[102]: 
label    A    B    C
ID                  
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN

In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]: 
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Converting the recarray dtype does not work for me, but one can do this in Pandas already:

In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Note that Pandas does not set the name of the index properly (to ID) in the exported record array (a bug?), so we profit from the type conversion to also correct for that.

At the moment Pandas has only 8-byte integers, i8, and floats, f8 (see this issue).

To get the sought-after structured array (which has better performance than a recarray) you just pass the recarray to the np.array constructor. — meteore, Commented Nov 2, 2012 at 10:19
We just put in a fix for setting the name of the index shown above. — Chang She, Commented Nov 2, 2012 at 22:23

Jamie Doyle · Accepted Answer · 2018-04-23 11:51:16Z

It seems like df.to_records() will work for you. The exact feature you're looking for was requested and to_records pointed to as an alternative.

I tried this out locally using your example, and that call yields something very similar to the output you were looking for:

rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])

Note that this is a recarray rather than an array. You could convert the result into a regular NumPy array by calling its constructor as np.array(df.to_records()).
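To make the recarray versus structured array distinction concrete, here is a short sketch on a made-up two-row frame. The recarray allows attribute-style field access, while the array built with `np.array` is a plain ndarray that keeps the same named fields:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]}).rename_axis("ID")

rec = df.to_records()      # np.recarray: fields are also available as attributes
arr = np.array(rec)        # plain ndarray with the same named fields

print(type(rec).__name__, type(arr).__name__)
print(rec.A)               # attribute access works on the recarray
print(arr["A"])            # key access works on both
```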

answered Apr 23, 2018 at 11:51 by Jamie Doyle

Wait, what does this answer add compared to the other answer by @meteore which mentioned to_records() over 5 years earlier?
– JohnE
Commented Oct 11, 2018 at 4:55

Dadu Khan · Accepted Answer · 2019-05-28 15:46:22Z

Try this:

import numpy

a = numpy.asarray(df)

answered May 28, 2019 at 15:46 by Dadu Khan

Hi! Please add some explanation to your answer. It is currently being marked as low quality by review due to length and content and is at risk of being deleted by the system. Thanks!
– d_kennetz
Commented May 28, 2019 at 17:31
It basically converts the input to an array (as the name suggests). So, in the context of the question, this answer is valid. Check docs.scipy.org/doc/numpy/reference/generated/…
– Lautaro Parada Opazo
Commented Sep 4, 2019 at 2:58
Thanks, I think it's kind of self-explanatory.
– Dadu Khan
Commented Sep 27, 2019 at 15:17

Phil · Accepted Answer · 2017-06-23 14:28:23Z

Here is my approach to making a structured array from a pandas DataFrame.

Create the data frame

import pandas as pd
import numpy as np
import six

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

Define a function to make a numpy structured array (not a record array) from a pandas DataFrame.

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_records())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns

    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structured array.

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.

doesn't work for me, error: TypeError: data type not understood — Joseph Garvin, Commented Feb 13, 2017 at 17:55
Thanks for your comment and to halcyon for the correction. I updated my answer so I hope it works for you now. — Phil, Commented Jun 23, 2017 at 14:30

cs95 · Accepted Answer · 2019-02-14 21:47:14Z

A Simpler Way for Example DataFrame:

df

         gbm       nnet        reg
0  12.097439  12.047437  12.100953
1  12.109811  12.070209  12.095288
2  11.720734  11.622139  11.740523
3  11.824557  11.926414  11.926527
4  11.800868  11.727730  11.729737
5  12.490984  12.502440  12.530894

USE:

np.array(df.to_records().view(type=np.matrix))

GET:

array([[(0, 12.097439  , 12.047437, 12.10095324),
        (1, 12.10981081, 12.070209, 12.09528824),
        (2, 11.72073428, 11.622139, 11.74052253),
        (3, 11.82455653, 11.926414, 11.92652727),
        (4, 11.80086775, 11.72773 , 11.72973699),
        (5, 12.49098389, 12.50244 , 12.53089367)]],
dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
       ('reg', '<f8')]))

Priyanshu Chauhan · Accepted Answer · 2017-12-29 10:02:05Z

Two ways to convert the data-frame to its Numpy-array representation.

mah_np_array = df.as_matrix(columns=None)
mah_np_array = df.values

Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html

answered Dec 29, 2017 at 10:02 by Priyanshu Chauhan

Arsam · Accepted Answer · 2019-05-30 04:05:44Z

I went through the answers above. The as_matrix() method works, but it's obsolete now. For me, what worked was .to_numpy().

This returns a multidimensional array. I'd prefer using this method if you're reading data from an Excel sheet and you need to access data from any index. Hope this helps :)

answered May 30, 2019 at 4:05 by Arsam

What do you mean by "and you need to access data from any index"? Depending on the nature of your data, a Pandas DataFrame may not even be the right choice in the first place.
– AMC
Commented Jan 7, 2020 at 19:47

cs95 · Accepted Answer · 2019-03-14 18:30:43Z

Just had a similar problem when exporting from dataframe to arcgis table and stumbled on a solution from usgs (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table). In short your problem has a similar solution:

df

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
# On Python 3 the field names must be str, so no .encode() is needed
np_data.dtype.names = tuple(str(name) for name in np_names)

np_data

array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
       ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
       ( 0.1,  nan,  nan)], 
      dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))

user1460675 · Accepted Answer · 2019-11-21 05:13:25Z

A simple way to convert dataframe to numpy array:

import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
df_to_array
# array([[1, 3],
#        [2, 4]])

Use of to_numpy is encouraged to preserve consistency.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html

answered Nov 21, 2019 at 5:13 by user1460675

What is the difference between the solution provided by Arsam and yours?
– qaiser
Commented Nov 21, 2019 at 5:47
Just tried to make it more complete and usable with a code example, which is what I personally prefer.
– user1460675
Commented Nov 21, 2019 at 6:27
What is the difference between this answer and the second most upvoted answer here?
– cs95
Commented Jun 16, 2020 at 21:40

Hermes Morales · Accepted Answer · 2019-12-30 19:46:00Z

Try this:

np.array(df) 

array([['ID', nan, nan, nan],
       ['1', nan, 0.2, nan],
       ['2', nan, nan, 0.5],
       ['3', nan, 0.2, 0.5],
       ['4', 0.1, 0.2, nan],
       ['5', 0.1, 0.2, 0.5],
       ['6', 0.1, nan, 0.5],
       ['7', 0.1, nan, nan]], dtype=object)

Some more information at: [https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html] Valid for numpy 1.16.5 and pandas 0.25.2.

answered Dec 30, 2019 at 19:46 by Hermes Morales

James L · Accepted Answer · 2016-03-13 16:03:54Z

Further to meteore's answer, I found the code

df.index = df.index.astype('i8')

doesn't work for me. So I put my code here for the convenience of others stuck with this issue.

city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string, when converted to Numpy array, it will be an object
city_cluster_arr = city_cluster_df[['city_en','lat','lon','cluster','cluster_filtered']].to_records()
descr=city_cluster_arr.dtype.descr
# change the field 'city_en' to string type (the index for 'city_en' here is 1 because before the field is the row index of dataframe)
descr[1]=(descr[1][0], "S20")
newArr=city_cluster_arr.astype(np.dtype(descr))
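In recent pandas versions the same result can probably be reached in one step with the `column_dtypes` argument of `to_records`. This is only a sketch: the small `cities` frame stands in for the CSV data above (its values are made up), and `U20` is used instead of `S20` so that non-ASCII city names would not be a problem.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data above
cities = pd.DataFrame(
    {"city_en": ["Paris", "Lima"], "lat": [48.86, -12.05], "lon": [2.35, -77.04]}
)

# Store the string column as fixed-width unicode instead of object
rec = cities.to_records(index=False, column_dtypes={"city_en": "U20"})
print(rec.dtype)
print(rec["city_en"])
```

Strings longer than the chosen width would be truncated, so the width needs to be picked with the actual data in mind.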

cottontail · Accepted Answer · 2023-08-31 21:41:44Z

As mentioned in cs95's answer, to_numpy() will consistently convert a pandas dataframe into a numpy array. On the other hand, because .values (as suggested in several other answers here) returns the underlying data of a dataframe, if that is not a numpy array, it will not return a numpy array.

For example, if a column is of extension dtype such as the nullable integer dtype (Int64), then .values will return a pandas IntegerArray object, not a numpy ndarray, which may not be what is desired. However, to_numpy() can only return a numpy array.

df = pd.DataFrame({'A': [10, 20, 30]}, dtype='Int64')

type(df['A'].values)     # <class 'pandas.core.arrays.integer.IntegerArray'>

type(df['A'].to_numpy()) # <class 'numpy.ndarray'>

Lorenz Walthert · Accepted Answer · 2024-11-01 14:56:40Z

Summarising cs95's answer, you want to_numpy(na_value=np.nan):

>>> import numpy as np
>>> import pandas as pd

>>> index = [1, 2, 3, 4, 5, 6, 7]
>>> a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
>>> pd.DataFrame(a).to_numpy(na_value=np.nan)
array([[nan],
       [nan],
       [nan],
       [0.1],
       [0.1],
       [0.1],
       [0.1]])

answered Nov 1, 2024 at 14:56 by Lorenz Walthert
