Python as the ultimate case of C++. Part 2/2

This is a continuation. The beginning is in "Python as the Ultimate Case of C++. Part 1/2".







Variables and Data Types



Now that we’ve finally figured out the math, let's decide what variables should mean in our language.







In C++, a programmer has a choice: use automatic variables placed on the stack, or keep values in the program's data memory and place only pointers to these values on the stack. What if we choose only one of these options for Python?







Of course, we cannot always use only values, since large data structures will not fit on the stack, and constantly moving them around the stack would create performance problems. Therefore, we will use only pointers in Python. This will conceptually simplify the language.







So the expression







a = 3
      
      





will mean that we created an object “3” in the program data memory (the so-called “heap”) and made the name “a” a reference to it. And the expression







 b = a
      
      





will mean that we made the variable “b” refer to the same object in memory that “a” refers to. In other words, we copied the pointer.
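The pointer-copying behaviour described above can be observed directly. The following is a minimal sketch; the names `first` and `second` are illustrative and not from the original text.

```python
a = 3
b = a                   # copies the reference, not the object

same_object = a is b    # both names refer to one object
print(same_object)      # True

# With a mutable object, the sharing becomes visible:
first = [1, 2]
second = first          # again, only the reference is copied
second.append(3)
print(first)            # [1, 2, 3]
```

The `is` operator compares identity (which object a name refers to), while `==` compares values.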







If everything is a pointer, then how many list types do we need to implement in our language? Only one, of course: a list of pointers! You can use it to store integers, strings, other lists, or anything else, since all of these are pointers.







How many types of hash tables do we need to implement? (In Python, this type is called a "dictionary", dict.) One! Let it associate pointers to keys with pointers to values.







Thus, we do not need to implement a huge part of the C++ specification in our language: templates. We perform all operations on objects, and objects are always accessed through a pointer. Of course, programs written in Python do not have to limit themselves to working with pointers: there are libraries like NumPy that help scientists work with arrays of data in memory, as they would in Fortran. But the basis of the language, expressions like “a = 3”, always works with pointers.







The concept of “everything is a pointer” also simplifies the composition of types to the limit. Want a list of dictionaries? Just create a list and put dictionaries there! No need to ask Python for permission, no need to declare additional types, everything works out of the box.
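As a small illustration of this composition, here is a sketch with made-up data (the host names and ports are assumptions chosen for the example):

```python
# One list type holds references to objects of any type.
mixed = [42, "forty-two", [4, 2], {"answer": 42}]

# A list of dictionaries needs no declaration of a new type.
records = [
    {"host": "example.com", "port": 80},
    {"host": "example.org", "port": 443},
]
ports = [record["port"] for record in records]
print(ports)            # [80, 443]

# The list stores a reference, so later changes to the dict are visible through it.
shared = {"host": "localhost"}
records.append(shared)
shared["port"] = 8080
print(records[-1])      # {'host': 'localhost', 'port': 8080}
```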







But what if we want to use compound objects as keys? A key in a dictionary must have an immutable value; otherwise, how would we look up values by it? Lists are mutable, so they cannot be used in this capacity. For such situations, Python has a data type that, like a list, is a sequence of objects but, unlike a list, never changes. This type is called a tuple (pronounced “tuh-ple” or “too-ple”).







Tuples in Python solve a longstanding problem of scripting languages. If this feature does not impress you, then you have probably never tried to do serious data work in a scripting language that allows only strings or only primitive types as hash table keys.
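A brief sketch of a compound key in practice. The latency table is hypothetical data invented for this example.

```python
# Hypothetical data: latency samples keyed by a (host, port) pair.
latency_ms = {
    ("example.com", 80): 12.5,
    ("example.com", 443): 15.0,
}
print(latency_ms[("example.com", 443)])   # 15.0

# A list is mutable, so Python refuses to use it as a key.
try:
    latency_ms[["example.com", 80]] = 1.0
    rejected = False
except TypeError:
    rejected = True
print(rejected)                           # True
```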







Another thing tuples give us is the ability to return several values from a function without declaring additional data types for this, as you have to do in C and C++. Moreover, to make this feature easier to use, the assignment operator was given the ability to automatically unpack tuples into separate variables.







    def get_address():
        ...
        return host, port

    host, port = get_address()
      
      





Unpacking has several useful side effects. For example, swapping the values of two variables can be written as follows:







 x, y = y, x
      
      





Everything is a pointer, which means functions and data types can be used as data. If you are familiar with the book “Design Patterns” by the “Gang of Four”, you must remember what complex and convoluted methods it offers for parameterizing, at run time, the choice of the type of object your program creates. Indeed, in many programming languages this is difficult to do! In Python, all these difficulties disappear, because we know that a function can return a data type, that both functions and data types are just references, and that references can be stored, for example, in dictionaries. This simplifies the task to the limit.
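One possible shape of such a run-time choice is sketched below. The registry `CONTAINERS` and the function `make_container` are hypothetical names introduced only for this illustration.

```python
# Hypothetical registry: map a configuration string to the type to instantiate.
CONTAINERS = {
    "list": list,
    "set": set,
    "tuple": tuple,
}

def make_container(kind, items):
    """Look up a type by name and call it to build the object."""
    factory = CONTAINERS[kind]      # a type is just another object stored in a dict
    return factory(items)

print(make_container("set", [1, 1, 2]))     # {1, 2}
print(make_container("tuple", [1, 1, 2]))   # (1, 1, 2)
```

The choice of type is an ordinary dictionary lookup, so no dedicated factory class hierarchy is needed.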







David Wheeler said: "All programming problems are solved by creating an additional level of indirection." References in Python are the level of indirection that is traditionally used to solve many problems in many languages, including C++. But while there it is used explicitly, which complicates programs, in Python it is used implicitly, uniformly for data of all types, and in a user-friendly way.







But if everything is a reference, then what do these references refer to? Languages like C++ have many types. Let's leave only one data type in Python: the object! Specialists in type theory shake their heads disapprovingly, but I believe that a single root data type, from which all other types in the language are derived, is a good idea that ensures the uniformity of the language and its ease of use.







As for the actual memory contents, different Python implementations (PyPy, Jython, or MicroPython) can manage memory in different ways. But to better understand how Python's simplicity and uniformity are implemented, and to form the right mental model, it is best to look at the reference implementation of Python in C, called CPython, which can be downloaded from python.org.







    struct {
        struct _typeobject *ob_type;
        /* followed by object's data */
    }
      
      





What we will see in the CPython source code is a structure that consists of a pointer to information about the type of a given variable and a payload that defines the specific value of the variable.







How does type information work? Let's dig into the CPython source code again.







    struct _typeobject {
        /* ... */
        getattrfunc tp_getattr;
        setattrfunc tp_setattr;
        /* ... */
        newfunc tp_new;
        freefunc tp_free;
        /* ... */
        binaryfunc nb_add;
        binaryfunc nb_subtract;
        /* ... */
        richcmpfunc tp_richcompare;
        /* ... */
    }
      
      





We see pointers to functions that ensure that all operations that are possible for this type are performed: addition, subtraction, comparison, access to attributes, indexing, slicing, etc. These operations know how to work with the payload that is located in memory below a pointer to type information, be it an integer, string, or object of a type created by the user.







This is radically different from C and C++, in which type information is associated with names, not with the values of variables. In Python, all names are associated with references. The value a reference points to, in turn, has a type. This is the essence of dynamic languages.
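The difference can be seen from Python itself. This is only a sketch: it uses the Python-level special method `__add__` as a visible counterpart of the C-level slots, which is a simplification.

```python
x = 3
first_type = type(x).__name__     # the type belongs to the object
x = "three"                       # the same name now refers to a str object
second_type = type(x).__name__
print(first_type, second_type)    # int str

# Operations are found through the type of the object, not through the name.
int_sum = type(3).__add__(3, 4)
str_sum = type("a").__add__("a", "b")
print(int_sum, str_sum)           # 7 ab
```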







To implement all the features of the language, it is enough for us to define two operations on references. The most obvious one is copying. When we assign a value to a variable, a slot in a dictionary, or an attribute of an object, we copy a reference. This is a simple, fast and completely safe operation: copying a reference does not change the contents of the object.







The second operation is a function or method call. As we showed above, a Python program can interact with memory only through methods implemented in built-in objects. Therefore, it cannot cause an error related to a memory access.







You may wonder: if all the variables contain references, then how can I protect the value of a variable from changes by passing it to the function as a parameter?







    n = 3
    some_function(n)
    # Q: I just passed a pointer!
    # Could some_function() have changed “3”?
      
      





The answer is that simple types in Python are immutable: they simply do not implement any method that would change their value. In an "everything is a pointer" language, the immutable int, float, tuple and str provide the same semantic effect that automatic variables provide in C.
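To make the contrast concrete, here is a minimal sketch with two hypothetical functions, `try_to_change` and `append_item`, one receiving an immutable int and the other a mutable list:

```python
def try_to_change(value):
    value += 1          # rebinds the local name to a new int object
    return value

def append_item(items):
    items.append(4)     # mutates the shared list object in place

n = 3
result = try_to_change(n)
print(n, result)        # 3 4  (the caller's object is untouched)

numbers = [1, 2, 3]
append_item(numbers)
print(numbers)          # [1, 2, 3, 4]  (the caller sees the mutation)
```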







Uniform types and methods greatly simplify generic programming, or generics. The functions min(), max(), sum() and the like are built in; there is no need to import them. And they work with any data type that implements comparison (for min() and max()), addition (for sum()), and so on.
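As an illustration, the same built-ins work on standard types and on a user-defined one. The `Version` class below is a hypothetical example type, not something from the original text.

```python
from fractions import Fraction

class Version:
    """Hypothetical user-defined type that supports the '<' comparison."""

    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def __lt__(self, other):
        return (self.major, self.minor) < (other.major, other.minor)

    def __repr__(self):
        return f"Version({self.major}, {self.minor})"

smallest_int = min(3, 1, 2)
largest_word = max("pear", "apple")
total = sum([Fraction(1, 2), Fraction(1, 3)])
oldest = min([Version(1, 9), Version(2, 0), Version(1, 10)])

print(smallest_int)   # 1
print(largest_word)   # pear
print(total)          # 5/6
print(oldest)         # Version(1, 9)
```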







Creating Objects



We have worked out in general terms how objects should behave. Now let's determine how we will create them. This is a matter of language syntax. C++ supports at least three ways to create an object:







  1. Automatic, by declaring a variable of this class:

     my_class c(arg);
          
          



  2. Using the new operator:

     my_class *c = new my_class(arg);
          
          



  3. Factory, by calling an arbitrary function that returns a pointer:

     my_class *c = my_factory(arg);
          
          





As you have probably guessed, having studied the way of thinking of Python's creators in the examples above, we must now choose one of them.







From the same Gang of Four book, we learned that a factory is the most flexible and universal way to create objects. Therefore, only this method is implemented in Python.







In addition to universality, this method is good in that you do not need to overload the language with unnecessary syntax to ensure it: a function call is already implemented in our language, and a factory is nothing more than a function.







Another rule for creating objects in Python is this: any data type is its own factory. Of course, you can write any number of additional, custom factories (which will be ordinary functions or methods, of course), but the general rule will remain valid:







    # Let's make type objects
    # their own type's factories!
    c = MyClass()
    i = int('7')
    f = float(length)
    s = str(bytes)
      
      





All types are callable objects, and each of them returns a value of its own type, determined by the arguments passed in the call.







Thus, using only the basic syntax of the language, any special handling during object creation, such as the “Arena” or “Adapter” patterns, can be encapsulated. This rests on another great idea borrowed from C++: the type itself determines how its objects are created, that is, how the new operator works for it.
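At the Python level, a type can customise its own creation by defining `__new__`. The sketch below uses a hypothetical `Host` class that hands out one shared object per distinct name; it is just one example of creation logic hidden behind an ordinary call.

```python
class Host:
    """Hypothetical type that reuses one object per distinct name."""

    _cache = {}

    def __new__(cls, name):
        # Runs when Host(...) is called and decides which object is returned.
        if name not in cls._cache:
            instance = super().__new__(cls)
            instance.name = name
            cls._cache[name] = instance
        return cls._cache[name]

a = Host("example.com")
b = Host("example.com")
c = Host("example.org")
print(a is b)   # True: the caller cannot tell a cache is involved
print(a is c)   # False
```

The calling code still writes `Host(...)`, exactly as it would for any other type.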







How about NULL?



Handling a null pointer adds complexity to the program, so we outlaw NULL. Python syntax makes it impossible to create a null pointer. Two elementary operations on pointers, which we spoke about earlier, are defined in such a way that any variable points to some object.







As a result, the user cannot use Python to cause a memory access error, such as a segmentation fault or a buffer overflow. In other words, Python programs are not affected by the two most dangerous types of vulnerabilities that have threatened the security of the Internet over the past 20 years.
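A short sketch of what happens instead of a memory error: invalid accesses are turned into ordinary exceptions, and even `None` is a real object with a type. The variable names are illustrative.

```python
items = [10, 20, 30]
try:
    items[5]                        # out of bounds: no stray memory read
    index_outcome = "no error"
except IndexError as exc:
    index_outcome = type(exc).__name__
print(index_outcome)                # IndexError

nothing = None                      # None is an object, not a null pointer
none_type = type(nothing).__name__
print(none_type)                    # NoneType

try:
    nothing.port                    # a failed attribute lookup, not a crash
    attr_outcome = "no error"
except AttributeError as exc:
    attr_outcome = type(exc).__name__
print(attr_outcome)                 # AttributeError
```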







You may ask: “If the structure of operations on objects is unchanged, as we saw earlier, then how will users create their own classes, with methods and attributes not listed in this structure?”







The magic lies in the fact that for user-defined classes Python has a very simple “template” with a small number of methods implemented. Here are the most important ones:







    struct _typeobject {
        getattrfunc tp_getattr;
        setattrfunc tp_setattr;
        /* ... */
        newfunc tp_new;
        /* ... */
    }
      
      





tp_new() creates a hash table for the user class, the same kind as the one used by the dict type. tp_getattr() extracts something from this hash table, and tp_setattr(), on the contrary, puts something there. Thus, the ability of arbitrary classes to store any methods and attributes is provided not at the level of C structures but at a higher level: a hash table. (With the exception, of course, of some cases related to performance optimization.)
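For ordinary classes, this hash table is visible from Python as the instance's `__dict__`. A minimal sketch, using an `Address` class like the one that appears later in the article (the `scheme` attribute is an assumption added for the demonstration):

```python
class Address:
    def __init__(self, host, port):
        self.host = host
        self.port = port

addr = Address("example.com", 443)
print(addr.__dict__)                # {'host': 'example.com', 'port': 443}

# Writing into the dictionary creates an attribute...
addr.__dict__["scheme"] = "https"
print(addr.scheme)                  # https

# ...and setting an attribute writes into the dictionary.
setattr(addr, "port", 8443)
print(addr.__dict__["port"])        # 8443
```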







Access modifiers



What do we do with all the rules and concepts built around the C++ keywords private and protected? Python, being a scripting language, does not need them. We already have “protected” parts of the language: the data of built-in types. Under no circumstances will Python allow a program to, for example, manipulate the bits of a floating-point number! This level of encapsulation is enough to maintain the integrity of the language itself. We, the creators of Python, believe that language integrity is the only good excuse for hiding information. All other structures and user program data are considered public.







You can write an underscore (_) at the beginning of a class attribute name to warn a colleague that they should not rely on this attribute. But otherwise Python learned the lessons of the early 90s: back then, many believed that the main reason we write bloated, unreadable and buggy programs is the lack of private variables. I think the 20 years since have convinced everyone in the programming industry that private variables are not the only remedy for bloated and buggy programs, and far from the most effective one. Therefore, the creators of Python decided not to worry about private variables at all and, as you can see, they were not wrong.







Memory management



What happens to our objects, numbers and strings at a lower level? How exactly do they fit in memory, how does CPython share them, when and under what conditions are they destroyed?







Here too, we chose the most general, predictable and efficient way of working with memory: from the C program's point of view, all our objects are shared pointers.







With this knowledge in mind, the data structures that we examined earlier in the “Variables and data types” section should be supplemented as follows:







    struct {
        Py_ssize_t ob_refcnt;
        struct {
            struct _typeobject *ob_type;
            /* followed by object's data */
        }
    }
      
      





So, every object in Python (we mean the implementation of CPython, of course) has its own reference counter. As soon as it becomes zero, the object can be deleted.







The reference counting mechanism does not rely on additional computation or background processes: an object can be destroyed immediately. In addition, it provides high data locality: often, memory starts to be used again immediately after being freed. The object that was just destroyed was most likely used recently, which means it was in the processor cache. Therefore, a newly created object that reuses this memory will also be in the cache. These two factors, simplicity and locality, make reference counting a very efficient way to collect garbage.
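The counter and the immediate destruction can both be observed in CPython. This sketch is specific to CPython; other implementations may report different counts or free objects later. `Payload` is a hypothetical class used only here.

```python
import sys
import weakref

class Payload:
    """Hypothetical user-defined type (plain object() instances cannot be weakly referenced)."""

data = Payload()
before = sys.getrefcount(data)
alias = data                        # one more reference to the same object
gained = sys.getrefcount(data) - before
print(gained)                       # 1

probe = weakref.ref(data)           # observes the object without keeping it alive
del alias
del data                            # the last strong reference is gone
destroyed = probe() is None
print(destroyed)                    # True in CPython: freed at once, no collector run
```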







(Because objects in real programs often refer to each other, the reference counter in certain cases cannot drop to zero even when the objects are no longer used by the program. Therefore, CPython also has a second, periodically run garbage collection mechanism based on generations of objects. - translator's note)
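The situation from the note can be reproduced with two objects that refer to each other. This is a CPython-oriented sketch; `Node` is a hypothetical class, and the automatic collector is switched off only to make the demonstration deterministic.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

gc.disable()                    # keep the cyclic collector from running on its own
a, b = Node(), Node()
a.other, b.other = b, a         # a reference cycle
probe = weakref.ref(a)

del a, b                        # no names left, but the counters are still non-zero
alive_after_del = probe() is not None
print(alive_after_del)          # True: reference counting alone cannot free the pair

gc.collect()                    # the cyclic collector finds and frees the cycle
alive_after_collect = probe() is not None
print(alive_after_collect)      # False
gc.enable()
```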







Mistakes by Python's developers



We tried to develop a language that is simple enough for beginners, but also attractive enough for professionals. At the same time, we were not able to avoid mistakes in understanding and using the tools that we ourselves created.







Python 2, due to the inertia of thinking associated with scripting languages, tried to convert string types implicitly, as a weakly typed language would. If you try to concatenate a byte string with a Unicode string, the interpreter implicitly converts the byte string to Unicode using the default encoding available on the system and returns the result as Unicode:







    >>> 'byte string ' + u'unicode string'
    u'byte string unicode string'
      
      





As a result, some websites worked just fine while their users were using English, but they produced cryptic errors when using characters from other alphabets.







This language design bug has been fixed in Python 3:







    >>> b'byte string ' + u'unicode string'
    TypeError: can't concat bytes to str
      
      





A similar error in Python 2 was related to the “naive” sorting of lists consisting of incomparable elements:







    >>> sorted(['b', 1, 'a', 2])
    [1, 2, 'a', 'b']
      
      





In this case, Python 3 makes it clear to users that they are trying to do something not very meaningful:







    >>> sorted(['b', 1, 'a', 2])
    TypeError: unorderable types: int() < str()
      
      





Abuses



Users sometimes abuse the dynamic nature of Python, and back in the 90s, when best practices were not yet widely known, this happened especially often:







    class Address(object):
        def __init__(self, host, port):
            self.host = host
            self.port = port
      
      





“But this is not optimal!” some said. “What if the port is the same as the default value? We would still spend a whole attribute on storing it!” And the result was something like







    class Address(object):
        def __init__(self, host, port=None):
            self.host = host
            if port is not None:  # so terrible
                self.port = port
      
      





So, objects of the same type appear in the program, which, however, cannot be operated uniformly, since some of them have some attribute, while others do not! And we cannot touch this attribute without checking its presence in advance:







    # code was forced to use introspection
    # (terrible!)
    if hasattr(addr, 'port'):
        print(addr.port)
      
      





Nowadays, an abundance of hasattr(), isinstance() and other introspection is a sure sign of bad code, and it is considered best practice to make attributes always present in the object. This allows simpler syntax when accessing them:







    # today's best practice:
    # every attribute always present
    if addr.port is not None:
        print(addr.port)
      
      





So the early experiments with dynamically adding and deleting attributes came to an end, and now we look at classes in Python in much the same way as in C++.







Another bad habit of early Python was writing functions in which one argument can have completely different types. For example, you might think that it is too much trouble for the user to create a list of column names each time, and that you should also allow passing them as a single string in which the names of individual columns are separated by, say, a comma:







    class Dataframe(object):
        def __init__(self, columns):
            if isinstance(columns, str):
                columns = columns.split(',')
            self.columns = columns
      
      





But this approach creates problems of its own. For example, what if a user accidentally gives us a string that was not intended to be used as a list of column names? Or what if a column name needs to contain a comma?







Also, such code is more difficult to maintain, debug, and especially test: the tests may exercise only one of the two types we support, yet the coverage will still be 100%, and the other type will go untested.







As a result, we came to this conclusion: Python allows the user to pass arguments of any type to a function, but in most situations most users will call a function the same way they would in C, always passing an argument of the same type.
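One way to keep a single argument type while still offering the string convenience is a separate, explicitly named constructor. This is only a sketch of one possible design, not something the original text prescribes; `from_csv_header` is a hypothetical name.

```python
class Dataframe:
    def __init__(self, columns):
        # Always an iterable of column names; no type switching here.
        self.columns = list(columns)

    @classmethod
    def from_csv_header(cls, header):
        """Hypothetical alternative constructor for a comma-separated string."""
        return cls(header.split(","))

direct = Dataframe(["name", "age"])
parsed = Dataframe.from_csv_header("name,age")
print(direct.columns)   # ['name', 'age']
print(parsed.columns)   # ['name', 'age']
```

Each entry point accepts exactly one type, so each can be tested on its own.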







The need to use eval() in a program is considered a clear architectural mistake. Most likely, you just have not figured out how to do the same thing in a normal way. But in some cases, for example, if you are writing a program such as a Jupyter notebook or an online sandbox for running and testing user code, the use of eval() is quite justified, and Python performs great in this type of task! Indeed, implementing something like this in C++ would be much more difficult.







As we have shown above, introspection (getattr(), hasattr(), isinstance()) is not always a good way to perform ordinary user tasks. But these features are nevertheless built into the language, and they really shine in situations where our code needs to describe itself: logging, testing, static checking, debugging!
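As an example of introspection used for this purpose, here is a minimal sketch of a debugging helper. The function `describe` and the `Address` class are hypothetical and exist only for this illustration.

```python
def describe(obj):
    """Return a one-line debug description built from introspection."""
    type_name = type(obj).__name__
    attrs = getattr(obj, "__dict__", {})    # objects without __dict__ get no fields
    fields = ", ".join(f"{key}={value!r}" for key, value in sorted(attrs.items()))
    return f"<{type_name} {fields}>" if fields else f"<{type_name}>"

class Address:
    def __init__(self, host, port):
        self.host = host
        self.port = port

print(describe(Address("example.com", 443)))   # <Address host='example.com', port=443>
print(describe(3))                              # <int>
```

The helper works for any object without knowing its class in advance, which is what logging and debugging tools need.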







Era of consolidation



: , . 20 , C++ Python. , , . .







, shared_ptr



TensorFlow 2016 2018 .







TensorFlow − C++-, Python- ( C++ − TensorFlow, ).














TensorFlow, shared_ptr



, . , .







C++? . , ? , , C++ Python!







