A walkthrough of parts of the str source code (CPython, Python 3)

Source file

https://github.com/python/cpython/blob/main/Objects/unicodeobject.c

Creating string objects

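The most important function for creating string (str) objects in Python is PyUnicode_DecodeUTF8Stateful.
String text that originates in source code, as in both statements below, eventually passes through PyUnicode_DecodeUTF8Stateful: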

a = 'hello'
b = str('world')
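
The same UTF-8 decoding path can be exercised from Python code. The snippet below is a minimal sketch: it assumes that CPython's "utf-8" codec and its incremental decoder are backed by the C decoder named above, and the variable names are illustrative only. It shows why the function is "stateful": an incomplete trailing byte sequence is held back rather than treated as an error.

```python
import codecs

# Decode UTF-8 bytes into a str object in one step.
raw = "hello, 世界".encode("utf-8")   # 7 ASCII bytes + two 3-byte characters
text = raw.decode("utf-8")

# Incremental ("stateful") decoding: the first chunk ends in the middle of a
# 3-byte character, so the decoder keeps that byte pending until more arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(raw[:8])               # "hello, " plus one pending byte
rest = decoder.decode(raw[8:], final=True)    # completes the pending character

print(repr(first), repr(rest), first + rest == text)
```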

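Source (_PyUnicode_New, the allocation helper for legacy wstr-based strings in older CPython 3 releases)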

static PyUnicodeObject *
_PyUnicode_New(Py_ssize_t length)
{
    PyUnicodeObject *unicode;
    size_t new_size;

    /* Optimization for empty strings */
    if (length == 0) {
        return (PyUnicodeObject *)unicode_new_empty();
    }

    /* Ensure we won't overflow the size. */
    if (length > ((PY_SSIZE_T_MAX / (Py_ssize_t)sizeof(Py_UNICODE)) - 1)) {
        return (PyUnicodeObject *)PyErr_NoMemory();
    }
    if (length < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to _PyUnicode_New");
        return NULL;
    }

    unicode = PyObject_New(PyUnicodeObject, &PyUnicode_Type);
    if (unicode == NULL)
        return NULL;
    new_size = sizeof(Py_UNICODE) * ((size_t)length + 1);

    _PyUnicode_WSTR_LENGTH(unicode) = length;
    _PyUnicode_HASH(unicode) = -1;
    _PyUnicode_STATE(unicode).interned = 0;
    _PyUnicode_STATE(unicode).kind = 0;
    _PyUnicode_STATE(unicode).compact = 0;
    _PyUnicode_STATE(unicode).ready = 0;
    _PyUnicode_STATE(unicode).ascii = 0;
    _PyUnicode_DATA_ANY(unicode) = NULL;
    _PyUnicode_LENGTH(unicode) = 0;
    _PyUnicode_UTF8(unicode) = NULL;
    _PyUnicode_UTF8_LENGTH(unicode) = 0;

    _PyUnicode_WSTR(unicode) = (Py_UNICODE*) PyObject_Malloc(new_size);
    if (!_PyUnicode_WSTR(unicode)) {
        Py_DECREF(unicode);
        PyErr_NoMemory();
        return NULL;
    }

    /* Initialize the first element to guard against cases where
     * the caller fails before initializing str -- unicode_resize()
     * reads str[0], and the Keep-Alive optimization can keep memory
     * allocated for str alive across a call to unicode_dealloc(unicode).
     * We don't want unicode_resize to read uninitialized memory in
     * that case.
     */
    _PyUnicode_WSTR(unicode)[0] = 0;
    _PyUnicode_WSTR(unicode)[length] = 0;

    assert(_PyUnicode_CheckConsistency((PyObject *)unicode, 0));
    return unicode;
}
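
The size checks at the top of _PyUnicode_New can be modelled in a few lines of Python. This is only an illustrative sketch, not CPython code: the function name legacy_buffer_size and the default unit_size of 4 (standing in for sizeof(Py_UNICODE), which is platform dependent) are assumptions made for the example.

```python
import sys

PY_SSIZE_T_MAX = sys.maxsize  # largest value a Py_ssize_t can hold


def legacy_buffer_size(length, unit_size=4):
    """Hypothetical model of the length checks in _PyUnicode_New.

    unit_size stands in for sizeof(Py_UNICODE); 4 is an assumption.
    Returns the number of bytes the C code would request for the buffer.
    """
    if length == 0:
        # The C code returns the shared empty string and allocates nothing.
        return 0
    if length > PY_SSIZE_T_MAX // unit_size - 1:
        # The byte count would overflow, so the C code reports MemoryError.
        raise MemoryError("string too large")
    if length < 0:
        raise SystemError("Negative size passed to _PyUnicode_New")
    # One extra slot is reserved for the terminating zero.
    return unit_size * (length + 1)


print(legacy_buffer_size(5))
```

The extra slot in the last line corresponds to the `length + 1` in the C code, where the final element is set to 0 as a terminator.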

Three key properties of strings

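  • 1. Empty-string cache: the empty string (unicode_empty) always lives at the same
    address. When an empty string is needed again, its reference count is simply
    incremented; _PyUnicodeWriter_Finish, for example, returns the shared empty string.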

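  • 2. Single-character cache: each single character in the Latin-1 range (unicode_latin1)
    lives at one address. When that character is needed again, its reference count is
    simply incremented; the cache is implemented in get_latin1_char.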

  • 3. Constant string pool

    a = 'hello'
    b = 'hello'
    a is b #True

    The example above shows that Python caches constant strings.
    The core of this cache is implemented in PyUnicode_InternInPlace.

    // unicodeobject.c
    static PyObject *interned = NULL;

    void
    PyUnicode_InternInPlace(PyObject **p)
    {
        PyObject *s = *p;
        PyObject *t;
    #ifdef Py_DEBUG
        assert(s != NULL);
        assert(_PyUnicode_CHECK(s));
    #else
        if (s == NULL || !PyUnicode_Check(s))
            return;
    #endif
        /* If it's a subclass, we don't really know what putting
           it in the interned dict might do. */
        if (!PyUnicode_CheckExact(s))
            return;
        if (PyUnicode_CHECK_INTERNED(s))
            return;
        if (interned == NULL) {
            interned = PyDict_New();
            if (interned == NULL) {
                PyErr_Clear(); /* Don't leave an exception */
                return;
            }
        }
        Py_ALLOW_RECURSION
        t = PyDict_SetDefault(interned, s, s);
        Py_END_ALLOW_RECURSION
        if (t == NULL) {
            PyErr_Clear();
            return;
        }
        if (t != s) {
            Py_INCREF(t);
            Py_SETREF(*p, t);
            return;
        }
        /* The two references in interned are not counted by refcnt.
           The deallocator will take care of this */
        Py_REFCNT(s) -= 2;
        _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
    }
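
The three properties above can be observed, or at least modelled, from Python. The sketch below relies on CPython implementation details for the first two checks, so the identity results are not guaranteed by the language. The names _pool and intern_model are hypothetical; they only mimic the PyDict_SetDefault step of PyUnicode_InternInPlace and leave out the reference-count adjustment.

```python
# 1. Empty string: different ways of producing "" yield the same object
#    (CPython implementation detail).
empty_a = str()
empty_b = "".join([])
same_empty = empty_a is empty_b

# 2. Single characters below code point 256 are shared objects
#    (CPython implementation detail).
char_a = chr(97)       # "a" built from its code point
char_b = "xay"[1]      # "a" obtained by indexing
same_char = char_a is char_b

# 3. A dict-based model of the pooling step in PyUnicode_InternInPlace.
_pool = {}  # hypothetical stand-in for the C-level `interned` dict


def intern_model(s):
    """Return the pooled object equal to s, adding s to the pool if absent."""
    if type(s) is not str:
        # Mirrors PyUnicode_CheckExact: subclasses of str are left alone.
        return s
    # setdefault stores s on first sight and returns the stored object after.
    return _pool.setdefault(s, s)


first = "".join(["he", "llo"])    # equal strings built at run time
second = "".join(["hel", "lo"])
pooled_same = intern_model(first) is intern_model(second)

print(same_empty, same_char, pooled_same)
```

The dict maps each string to itself, so a lookup with any equal string returns the one canonical object; that is the same idea as `t = PyDict_SetDefault(interned, s, s)` followed by replacing `*p` with `t` when they differ.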

The magic of the intern mechanism

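The Python interpreter uses interning (string interning) to make string handling more efficient. What is interning?
Equal string objects are stored only once, in a shared string pool.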
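
The pool is also reachable from Python code through sys.intern. The following is a small sketch with illustrative variable names; the strings are built at run time so that they are not compile-time constants.

```python
import sys

# Two equal strings assembled at run time.
parts = ["py", "thon", " ", "str"]
s1 = "".join(parts)
s2 = "".join(parts)

equal_before = s1 == s2    # contents are equal

# sys.intern returns the single pooled object for a given string value.
p1 = sys.intern(s1)
p2 = sys.intern(s2)

print(equal_before, p1 is p2)
```

Because interned strings with equal contents are the same object, comparing them can short-circuit on identity, which is one reason names used as dictionary keys benefit from interning.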

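  • 1. A string that contains a space is not interned automatically: the compiler only interns constants made up of ASCII letters, digits and underscores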
    >>> s1="hello"
    >>> s2="hello"
    >>> s1 is s2
    True

    >>> s1="hell o"
    >>> s2="hell o"
    >>> s1 is s2
    False
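  • 2. A string longer than 20 characters that is produced by constant folding is not interned. This limit applies to older CPython releases; from 3.7 on, constants are folded up to 4096 characters, so the second comparison may also return True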
    >>> s1 = "a" * 20
    >>> s2 = "a" * 20
    >>> s1 is s2
    True

    >>> s1 = "a" * 21
    >>> s2 = "a" * 21
    >>> s1 is s2
    False
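
The two rules above can be approximated in Python. This is a simplified sketch, not the compiler's actual code: looks_like_identifier is a hypothetical helper that models the "ASCII letters, digits and underscores only" test, and the exact behaviour of automatic interning varies between CPython versions. Explicit interning with sys.intern works for any string.

```python
import string
import sys

NAME_CHARS = set(string.ascii_letters + string.digits + "_")


def looks_like_identifier(s):
    """Hypothetical model of the check applied to string constants:
    True when every character is an ASCII letter, digit or underscore."""
    return all(ch in NAME_CHARS for ch in s)


candidates = ["hello", "hell o", "a" * 21]
report = {s: looks_like_identifier(s) for s in candidates}

# Strings that fail the check can still be pooled explicitly.
a = sys.intern("".join(["hell", " ", "o"]))
b = sys.intern("".join(["hell ", "o"]))

print(report)
print(a is b)
```

In application code, compare strings with == rather than is; identity only reflects whether interning happened to apply.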