Adding Mypy type signatures to Numpy
Posted by Agustín Bartó 1 year ago
Continuing our coverage of MyPy (check parts #1, #2, and #3 of our “A Day With MyPy” series), this time we want to show you how we applied what we have learned so far by creating a type stub for a package that we use on a daily basis: NumPy.
Our goals regarding this experiment were:
- To make a tangible contribution to NumPy, a project on which we rely.
- To test the usage of MyPy in our production pipelines.
All the code for the NumPy stub is available on GitHub.
When you want to add type annotations to code you don’t own, one solution is to write type stubs: files that describe the public interface of a module without containing any implementation. Given that MyPy allows mixing dynamic and static typing, we decided to write the declarations for the most popular parts of numpy.
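As a rough illustration of what a stub looks like, here is a minimal sketch for a hypothetical module named `shapes` (the module, its class, and its function are invented for this example and have nothing to do with NumPy). In a real project these declarations would live in a `shapes.pyi` file; the bodies are just `...`, which is also valid Python, so the snippet can be run as-is.

```python
# Contents of a hypothetical stub file, shapes.pyi.
# Only signatures are declared; every body is the Ellipsis placeholder.
from typing import List, Optional


class Polygon:
    def __init__(self, sides: List[float]) -> None: ...
    def perimeter(self) -> float: ...
    def longest_side(self) -> Optional[float]: ...


def regular_polygon(n_sides: int, side_length: float) -> Polygon: ...
```

A type checker reads only the signatures, so nothing here needs to compute a result.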
At the core of numpy are the ``ndarray``s, which are multi-dimensional arrays that hold fixed-size items. Given that it’s the most popular part of the library, and that the rest of numpy is built on it, we decided to start by adding types to its interface.
This is when we encountered our first obstacle: most of numpy is written in C. With a regular package written in Python, we would have walked through the code and written signatures that match it, adding the type information. This wasn’t possible with numpy. In some cases we used introspection, but we relied mostly on the reference documentation.
The second problem we faced was numpy’s inherent flexibility. Take this example, for instance:
    In : import numpy as np
    In : np.array('a string')
    Out: array('a string', dtype='<U8')
    In : np.array(2)
    Out: array(2)
    In : np.array([1,2,3])
    Out: array([1, 2, 3])
    In : np.array((1,2,"3"))
    Out: array(['1', '2', '3'], dtype='<U21')
The ``array`` function is used to create array objects and, as you can see, no matter what you pass as the ``object`` argument, it does its best to convert it to homogeneous values that can go into the ``ndarray`` it returns. This is great for users, but it is a source of headaches if you want to add type annotations.
Luckily, our type signature for ``ndarray`` allows us to be explicit about the type of items stored in the arrays:
class ndarray(_ArrayLike[_S], Generic[_S]):...
so we can do things like:
my_array = np.array([1,2,3]) # type: np.ndarray[int]
You’re probably wondering about the ``_ArrayLike[_S]`` class, as it doesn’t exist in the ``numpy`` namespace. We wrote this fictional class to describe the array interface that is common to arrays and scalars.
Little gotcha regarding type expressions
While testing the stub we found something that might affect other type stubs for structures that work as containers. Take a look at this example:
    import numpy as np

    def do_something(array: np.ndarray[bool]):
        return array.all()

    some_array = np.array([True, False]) # type: np.ndarray[bool]
    if do_something(some_array):
        print('done something')
It all seems fine, and mypy doesn’t complain about it, but if you try to run it, you’ll get the following error:
    $ python test_numpy.py
    Traceback (most recent call last):
      File "test_numpy.py", line 3, in <module>
        def do_something(array: np.ndarray[bool]):
    TypeError: 'type' object is not subscriptable
This makes total sense because, at runtime, ``ndarray`` is not a descendant of ``Generic``. This is why we have classes like ``Dict`` in the ``typing`` module, so the type declaration doesn’t clash with the actual classes. There’s an easy workaround, which is to surround the type declaration in quotes:
    import numpy as np

    def do_something(array: 'np.ndarray[bool]'):
        return array.all()

    some_array = np.array([True, False]) # type: np.ndarray[bool]
    if do_something(some_array):
        print('done something')
This way the type expression is stored as a plain string instead of being evaluated when the function is defined, so no errors are generated. Notice that there was no problem with the second declaration, as it was in a comment, and comments aren’t evaluated.
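The same behaviour can be reproduced without numpy. The sketch below uses a hypothetical `Plain` class standing in for the `ndarray` of the example above: subscripting it raises `TypeError`, while a quoted annotation is simply kept as text.

```python
class Plain:
    """Ordinary, non-generic class standing in for ndarray in the example."""


def subscript_fails() -> bool:
    # Subscripting a non-generic class raises TypeError when evaluated.
    try:
        Plain[bool]
    except TypeError:
        return True
    return False


def do_something(value: "Plain[bool]") -> None:
    # The quoted annotation is stored as text and is not evaluated here.
    return None


stored_annotation = do_something.__annotations__["value"]
```

Here `stored_annotation` is the literal string `"Plain[bool]"`, which is why the function definition succeeds even though `Plain[bool]` itself cannot be evaluated.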
Although this is a valid workaround, we will most likely introduce a class named ``NDarray`` (to follow the pattern established by the ``typing`` module) that can be used safely in type declarations.
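One way such a class could look is sketched below. This is only an assumption about the shape it might take, not the class that ended up in the stub: a placeholder deriving from `Generic`, which makes it subscriptable at runtime so that no quotes are needed.

```python
from typing import Generic, TypeVar

_S = TypeVar("_S")


class NDarray(Generic[_S]):
    """Hypothetical annotation-only placeholder; not an actual NumPy class."""


def do_something(array: NDarray[bool]) -> None:
    # No quotes needed: Generic subclasses can be subscripted at runtime.
    return None


annotation = do_something.__annotations__["array"]
```

The trade-off is that annotations then refer to the placeholder rather than to the runtime class, just as `Dict` stands in for `dict` in the ``typing`` module.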
Problems, problems everywhere
We tried our best to provide meaningful type declarations for mypy’s type inference engine, but the dynamic nature of numpy sometimes made it difficult. Take this signature, for example:
def all(self, axis: AxesType=None, out: '_ArrayLike[_U]'=None, keepdims: bool=None) -> Union['_ArrayLike[_U]', '_ArrayLike[bool]']: ...
According to the ``ndarray.all`` documentation, it returns True when all array elements along a given axis evaluate to True. It actually returns a ``numpy.bool_`` scalar, hence the ``_ArrayLike[bool]`` signature. However, if the ``out`` parameter is passed, the return value has the same type as the array passed for that parameter.
The proper way to declare all would have been something like:
    @overload
    def all(self, axis: AxesType=None, keepdims: bool=None) -> '_ArrayLike[bool]': ...
    @overload
    def all(self, axis: AxesType=None, keepdims: bool=None, *, out: '_ArrayLike[_U]') -> '_ArrayLike[_U]': ...
But due to a mypy bug we had to go with the former declaration. Once the bug has been dealt with, we’ll improve the declarations to help mypy’s type inference engine.
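For readers unfamiliar with `@overload`, here is a small self-contained sketch of the pattern outside a stub file. `ToyArray` is a hypothetical container invented for this example, its `all` method drops the `axis` and `keepdims` parameters, and it returns a plain `bool` rather than an `_ArrayLike[bool]`; only the overload mechanics carry over.

```python
from typing import Generic, List, Optional, TypeVar, Union, overload

_S = TypeVar("_S")
_U = TypeVar("_U")


class ToyArray(Generic[_S]):
    """Hypothetical flat container used only to illustrate the overloads."""

    def __init__(self, items: List[_S]) -> None:
        self.items = list(items)

    @overload
    def all(self) -> bool: ...

    @overload
    def all(self, *, out: "ToyArray[_U]") -> "ToyArray[_U]": ...

    def all(
        self, *, out: "Optional[ToyArray[_U]]" = None
    ) -> "Union[bool, ToyArray[_U]]":
        # all() below is the builtin, not this method.
        result = all(self.items)
        if out is None:
            return result
        # Mimic the behaviour described above: fill `out` and return it.
        out.items[:] = [result]
        return out
```

With the two overloads, a checker can report `bool` for `arr.all()` and `ToyArray[int]` for `arr.all(out=buf)` when `buf` is a `ToyArray[int]`, instead of a `Union` of both.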
We also encountered problems within numpy itself.
    In : import numpy as np
    In : nda = np.random.rand(4,5) < 0.5
    In : ndb = np.arange(5)
    In : nda.all(axis=0, out=ndb)
    Out: array([0, 1, 0, 0, 0])
    In : nda.all(0, ndb)
    (traceback not shown)
    TypeError: data type not understood
According to the argument specification of ndarray.all, there shouldn’t be any problem with the last call. In the implementation, however, the positional arguments are not exactly the same as in the docs.
With these problems in mind, we tried the stub against some of our own code. Here’s a snippet that shows what we found:
    $ mypy --strict-optional --check-untyped-defs lr.py
    lr.py:2: error: No library stub file for module 'scipy'
    lr.py:2: note: (Stub files are from https://github.com/python/typeshed)
    lr.py:5: error: No library stub file for module 'sklearn'
    lr.py:7: error: No library stub file for module 'sklearn.utils.fixes'
    lr.py:8: error: No library stub file for module 'sklearn.utils.extmath'
    lr.py:9: error: No library stub file for module 'sklearn.datasets'
    lr.py:10: error: No library stub file for module 'sklearn.linear_model'
    lr.py: note: In member "fit" of class "LR":
    lr.py:17: error: "module" has no attribute "unique"
    lr.py: note: In member "decision_function" of class "LR":
    lr.py:33: error: "module" has no attribute "dot"
    lr.py: note: In member "predict" of class "LR":
    lr.py:39: error: "module" has no attribute "int"
    lr.py: note: In member "predict_proba" of class "LR":
    lr.py:46: error: "module" has no attribute "dot"
    lr.py: note: In member "likelihood" of class "LR":
    lr.py:59: error: "module" has no attribute "dot"
    lr.py:65: error: "module" has no attribute "sum"
    lr.py:65: error: "module" has no attribute "dot"
    lr.py:71: error: "module" has no attribute "dot"
Besides the missing stubs for scipy and sklearn (we might tackle those in the future), most of the problems came from the fact that the developer used the array operation functions (like dot or sum) defined on the ``numpy`` namespace instead of the methods defined on the ``ndarray`` class. Here’s an example of this:
    def decision_function(self, X_test):
        scores = np.dot(X_test, self.weights[:-1].T) + self.weights[-1]
        return scores.ravel() if len(scores.shape) > 1 and scores.shape[1] == 1 else scores
Here, the developer used ``np.dot`` instead of ``X_test.dot``. We found that this happens quite often (at least in our code), so we’re going to add type declarations for the most common functions defined in the top-level namespace.
During one of our meetings we reviewed our findings and decided that we could improve the stub with a little bit of user input. So if you think you’re up to it, please take a look at the code and give us your feedback. Even if you think we did everything wrong, that’ll be a great help for us, as we aim to make a meaningful contribution to the NumPy, MyPy, and Python communities in general.