str部分源码解析(Cpython python3)

2021-06-18

2021-06-17

本文字数: 765 阅读时长: 4 分钟

阅读次数次

源码文件

https://github.com/python/cpython/blob/main/Objects/unicodeobject.c

字符对象创建

Python中字符串（strs）对象最重要的创建方法为PyUnicode_DecodeUTF8Stateful，
最终都会调用到PyUnicode_DecodeUTF8Stateful：如下

1 2	a = 'hello' b = str('world')

源码

static PyUnicodeObject *
_PyUnicode_New(Py_ssize_t length)
{
    PyUnicodeObject *unicode;
    size_t new_size;

    /* Optimization for empty strings */
    if (length == 0) {
        return (PyUnicodeObject *)unicode_new_empty();
    }

    /* Ensure we won't overflow the size. */
    if (length > ((PY_SSIZE_T_MAX / (Py_ssize_t)sizeof(Py_UNICODE)) - 1)) {
        return (PyUnicodeObject *)PyErr_NoMemory();
    }
    if (length < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to _PyUnicode_New");
        return NULL;
    }

    unicode = PyObject_New(PyUnicodeObject, &PyUnicode_Type);
    if (unicode == NULL)
        return NULL;
    new_size = sizeof(Py_UNICODE) * ((size_t)length + 1);

    _PyUnicode_WSTR_LENGTH(unicode) = length;
    _PyUnicode_HASH(unicode) = -1;
    _PyUnicode_STATE(unicode).interned = 0;
    _PyUnicode_STATE(unicode).kind = 0;
    _PyUnicode_STATE(unicode).compact = 0;
    _PyUnicode_STATE(unicode).ready = 0;
    _PyUnicode_STATE(unicode).ascii = 0;
    _PyUnicode_DATA_ANY(unicode) = NULL;
    _PyUnicode_LENGTH(unicode) = 0;
    _PyUnicode_UTF8(unicode) = NULL;
    _PyUnicode_UTF8_LENGTH(unicode) = 0;

    _PyUnicode_WSTR(unicode) = (Py_UNICODE*) PyObject_Malloc(new_size);
    if (!_PyUnicode_WSTR(unicode)) {
        Py_DECREF(unicode);
        PyErr_NoMemory();
        return NULL;
    }

    /* Initialize the first element to guard against cases where
     * the caller fails before initializing str -- unicode_resize()
     * reads str[0], and the Keep-Alive optimization can keep memory
     * allocated for str alive across a call to unicode_dealloc(unicode).
     * We don't want unicode_resize to read uninitialized memory in
     * that case.
     */
    _PyUnicode_WSTR(unicode)[0] = 0;
    _PyUnicode_WSTR(unicode)[length] = 0;

    assert(_PyUnicode_CheckConsistency((PyObject *)unicode, 0));
    return unicode;
}

字符串三大特性

1.空串缓存：空串（unicode_empty）为同一个地址第二次需要空串时，只是将计
数加1，在_PyUnicodeWriter_Finish中实现空串缓存。
2.字符缓冲池：字符（unicode_latin1）为同一个地址，
第二次需要该字符时，只是将计数加1，在get_latin1_char中实现字符缓存。

3.常量字符串池

1
2
3

a = 'hello'
b = 'hello'
a is b  #True

由上例可以看出Python对常量字符串做了缓存。
缓存的关键性实现在PyUnicode_InternInPlace方法中。

// unicodeobject.c
static PyObject *interned = NULL;

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

神奇的intern机制

Python解释器中使用了 intern（字符串驻留）的技术来提高字符串效率，什么是intern机制？
就是同样的字符串对象仅仅会保存一份，放在一个字符串储蓄池

1.如果有空格不使用intern机制

>>> s1="hello"
>>> s2="hello"
>>> s1 is s2
True

>>> s1="hell o"
>>> s2="hell o"
>>> s1 is s2
False

2.如果一个字符串长度超过20个字符，不启动intern机制

>>> s1 = "a" * 20
>>> s2 = "a" * 20
>>> s1 is s2
True

>>> s1 = "a" * 21
>>> s2 = "a" * 21
>>> s1 is s2
False