Source file
https://github.com/python/cpython/blob/main/Objects/unicodeobject.c
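String object creation
The most important function for creating Python string (str) objects is PyUnicode_DecodeUTF8Stateful.
A string literal such as the one below ultimately goes through PyUnicode_DecodeUTF8Stateful: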
a = 'hello'
Source code
static PyUnicodeObject *
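Three key properties of strings
1. Empty-string cache: the empty string (unicode_empty) always lives at the same address. The second time an empty string is needed, only its reference count is incremented. The empty-string cache is implemented in _PyUnicodeWriter_Finish.
2. Character cache pool: a single character (unicode_latin1) always lives at the same address. The second time that character is needed, only its reference count is incremented. The character cache is implemented in get_latin1_char.
3. Constant string pool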
a = 'hello'
b = 'hello'
a is b  # True
The example above shows that Python caches constant strings.
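The first two properties listed above can also be observed from Python, without reading the C source. The following is a minimal sketch, assuming CPython: the helper names `make_empty_strings` and `make_single_chars` are hypothetical and exist only for this illustration, and other Python implementations are free to behave differently.

```python
def make_empty_strings():
    """Produce empty strings through three unrelated code paths."""
    return ["".join([]), "abc"[:0], str()]


def make_single_chars():
    """Produce the one-character string 'a' through three unrelated code paths."""
    return [chr(97), "abc"[0], "banana"[1]]


empties = make_empty_strings()
chars = make_single_chars()

# Compare object identity, not just equality: if the cache is in effect,
# every result is the very same object.
empties_shared = all(e is empties[0] for e in empties)
chars_shared = all(c is chars[0] for c in chars)

print("empty strings share one object:", empties_shared)
print("single Latin-1 characters share one object:", chars_shared)
```

On CPython both flags are expected to be true, because each code path ends up returning the cached object instead of allocating a new one.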
The key part of this caching is implemented in PyUnicode_InternInPlace.
// unicodeobject.c
static PyObject *interned = NULL;

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}
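The control flow of PyUnicode_InternInPlace can be modelled in a few lines of Python. This is only a rough sketch under stated assumptions: the class `InternPool` is hypothetical, a plain dict stands in for the C-level `interned` dictionary, and the reference-count adjustment and the interned state flag of the C code are not modelled at all.

```python
class InternPool:
    """Toy model of the interned dictionary used by PyUnicode_InternInPlace."""

    def __init__(self):
        # Maps each string to its canonical object (key and value are the same object).
        self._pool = {}

    def intern(self, s):
        # Mirrors the PyUnicode_CheckExact test: subclasses and non-strings are left alone.
        if type(s) is not str:
            return s
        # Mirrors PyDict_SetDefault(interned, s, s): return the existing entry
        # if an equal string is already stored, otherwise store s and return it.
        return self._pool.setdefault(s, s)


pool = InternPool()

# Build two equal strings at runtime so that they start as distinct objects.
first = "".join(["inter", "ned"])
second = "".join(["inter", "ned"])

canonical_first = pool.intern(first)
canonical_second = pool.intern(second)

print(first is second)                       # the two inputs are distinct objects
print(canonical_first is canonical_second)   # both map to one canonical object
```

The `t != s` branch of the C code corresponds to the case where `setdefault` returns an object other than its argument: the caller's reference is then redirected to the string that was interned first.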
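The magic of the intern mechanism
The Python interpreter uses interning (string interning) to make string handling more efficient. What is interning?
Equal string objects are stored only once, in a shared string pool.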
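Interning can also be requested explicitly through the standard-library function `sys.intern`. The sketch below builds its strings at runtime (the helper `build` is hypothetical) so that the compiler cannot turn them into shared constants, then shows that interning maps equal strings onto a single object.

```python
import sys


def build(parts):
    """Join parts at runtime so the result is a freshly allocated string."""
    return "".join(parts)


a = build(["hell", " ", "o"])
b = build(["hell", " ", "o"])

equal_before = a == b    # same contents
same_before = a is b     # but two separate objects

# sys.intern returns the canonical copy stored in the interned pool.
a = sys.intern(a)
b = sys.intern(b)
same_after = a is b

print(equal_before, same_before, same_after)
```

Because identity of equal strings depends on whether they happen to be interned, `==` remains the right way to compare string contents; `is` is only meaningful after both sides have gone through `sys.intern`.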
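- 1. A string that contains a space is not interned automatically (shown here in the interactive interpreter, where each statement is compiled separately)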
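>>> s1="hello"
>>> s2="hello"
>>> s1 is s2
True
>>> s1="hell o"
>>> s2="hell o"
>>> s1 is s2
False
- 2. A string longer than 20 characters that is produced by constant folding (such as "a" * 21) is not interned. The limit of 20 comes from the peephole optimizer of older CPython releases; newer releases fold much longer results, so the output below depends on the interpreter version.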
>>> s1 = "a" * 20
>>> s2 = "a" * 20
>>> s1 is s2
True
>>> s1 = "a" * 21
>>> s2 = "a" * 21
>>> s1 is s2
False
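The space rule above appears to be a special case of a broader check: in several CPython releases, a string constant is interned at compile time only if every character is an ASCII letter, a digit, or an underscore. The helper below, `looks_like_identifier`, is a hypothetical Python approximation of that check, written for illustration only. It is not part of any Python API, and the exact rule is an implementation detail that can change between versions.

```python
import string

# Characters assumed to qualify a constant for automatic interning.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def looks_like_identifier(s):
    """Return True if every character of s is an ASCII letter, digit or underscore."""
    return all(ch in NAME_CHARS for ch in s)


samples = ["hello", "hell o", "a" * 21, "snake_case_1", "naïve"]
results = {s: looks_like_identifier(s) for s in samples}

for text, ok in results.items():
    print(repr(text), "->", "candidate for interning" if ok else "not a candidate")
```

Note that "a" * 21 passes this character test. In the session above it fails to be shared for a different reason: older compilers did not fold the multiplication into a constant in the first place.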