Metadata-Version: 2.1
Name: charset-normalizer
Version: 2.0.10
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
Home-page:
Author: Ahmed TAHRI @Ousret
Author-email:
License: MIT
Project-URL: Bug Reports,
Project-URL: Documentation,
Keywords: encoding,i18n,txt,text,charset,charset-detector,normalization,unicode,chardet
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Typing :: Typed
Requires-Python: >=3.5.0
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: unicode_backport
Requires-Dist: unicodedata2 ; extra == 'unicode_backport'

<h1 align="center">Charset Detection, for Everyone 👋 <a href=",encoding,chardet,developers"><img src=""/></a></h1>

<p align="center">
  <sup>The Real First Universal Charset Detector</sup><br>
  <a href="">
    <img src="" />
  </a>
  <a href="">
    <img src="" />
  </a>
  <a href="">
    <img alt="Download Count Total" src="" />
  </a>
</p>

> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
> I'm trying to resolve the issue by taking a new approach.
> All IANA character set names for which the Python core library provides codecs are supported.

<p align="center">
  >>>>> <a href="" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
</p>

This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet]( | Charset Normalizer | [cChardet]( |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌<br> | ✅<br> | ✅ <br> |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `Free & Open` | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1 | MIT | MPL-1.1 |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `Supported Encoding` | 30 | :tada: [93]( | 40 |

<p align="center">
<img src="" alt="Reading Normalized Text" width="226"/><img src="" alt="Cat Reading Text" width="200"/>
</p>

*\*\* : They clearly rely on encoding-specific code, even though they cover most of the commonly used encodings.*<br>

## ⭐ Your support

*Fork it, test it, star it, submit your ideas! We do listen.*

## ⚡ Performance

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet]( | 92 % | 220 ms | 5 file/sec |
| charset-normalizer | **98 %** | **40 ms** | 25 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet]( | 1115 ms | 300 ms | 27 ms |
| charset-normalizer | 460 ms | 240 ms | 18 ms |

Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

> Stats are generated from 400+ files using default parameters. For more details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.
> The actual delays heavily depend on your CPU capabilities. The factors should remain the same.

[cchardet]( is a non-native (C++ binding), unmaintained, faster alternative whose
accuracy is better than chardet's but lower than this package's. If speed is the most important factor, you should try it.
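
The estimated files-per-second figures in the first table follow directly from the mean time per file. As a minimal sketch of that arithmetic (the helper name `files_per_second` is hypothetical and not part of any package mentioned here):

```python
def files_per_second(mean_ms_per_file: float) -> float:
    """Estimate throughput from a mean per-file latency in milliseconds."""
    if mean_ms_per_file <= 0:
        raise ValueError("mean_ms_per_file must be positive")
    return 1000.0 / mean_ms_per_file


# Mean latencies taken from the first performance table.
for package, mean_ms in (("chardet", 220), ("charset-normalizer", 40)):
    print(f"{package}: ~{files_per_second(mean_ms):.1f} files/sec")
```

Note that 1000 / 220 is roughly 4.5, which the table rounds up to 5.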

## ✨ Installation

Install the latest stable release from PyPI:

```sh
pip install charset-normalizer -U
```

If you want a more up-to-date `unicodedata` than the one available in your Python setup, install the `unicode_backport` extra:

```sh
pip install charset-normalizer[unicode_backport] -U
```

## 🚀 Basic Usage

### CLI

This package comes with a CLI.

```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```

```bash
normalizer ./data/
```

:tada: Since version 1.4.0, the CLI produces easily usable JSON output on stdout.

```json
{
    "path": "/home/default/projects/charset_normalizer/data/",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```
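
Because the output is plain JSON, it is straightforward to consume from a script. The sketch below parses a shortened copy of the sample above with the standard library only. The helper `summarize` is hypothetical; it accepts both a single object and the list that the help text says is produced with `--with-alternative`.

```python
import json

# Shortened copy of the sample output shown above (not a complete record).
SAMPLE = """
{
    "encoding": "cp1252",
    "language": "French",
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "is_preferred": true
}
"""


def summarize(raw: str) -> list:
    """Return one 'encoding (language)' string per result in the CLI output."""
    data = json.loads(raw)
    # With -a/--with-alternative the top-level JSON is a list of results.
    results = data if isinstance(data, list) else [data]
    return [f"{r['encoding']} ({r['language']})" for r in results]


print(summarize(SAMPLE))
```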

### Python

*Just print out normalized text*

```python
from charset_normalizer import from_path

results = from_path('./')

# best() returns the most probable match; str() gives its decoded text
print(str(results.best()))
```

*Normalize any text file*

```python
from charset_normalizer import normalize

try:
    normalize('./') # should write to disk my_subtitle-***.srt
except IOError as e:
    print('Sadly, we are unable to perform charset normalization.', str(e))
```

*Upgrade your code without effort*

```python
from charset_normalizer import detect
```

The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) backward-compatible result possible.

See the documentation for advanced usage.

## 😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
reliable alternative using a completely different method. Also! I never back down from a good challenge!

I **don't care** about the **originating charset** encoding, because **two different tables** can
produce **two identical rendered strings.**
What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that? 😎

Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer's goal is to convert raw files in an unknown encoding to Unicode.

## 🍰 How

- Discard all charset encoding tables that could not fit the binary content.
- Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding.
- Extract matches with the lowest mess detected.
- Additionally, we measure coherence / probe for a language.
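
The first three steps above can be illustrated with a toy sketch. This is not the library's implementation: the candidate list, the `toy_mess_ratio` heuristic, and the function names are all assumptions made for illustration, and only the standard library is used.

```python
import unicodedata

# Hypothetical short candidate list; the real package tries far more codecs.
CANDIDATES = ("ascii", "utf_8", "cp1252", "latin_1")


def toy_mess_ratio(text: str) -> float:
    """Toy heuristic: share of non-ASCII characters that are not letters or marks."""
    if not text:
        return 0.0
    suspicious = sum(
        1
        for ch in text
        if ord(ch) > 127 and unicodedata.category(ch)[0] not in ("L", "M")
    )
    return suspicious / len(text)


def toy_detect(payload: bytes) -> list:
    """Return (mess, encoding) pairs for every candidate that fits, lowest mess first."""
    matches = []
    for encoding in CANDIDATES:
        try:
            text = payload.decode(encoding)
        except UnicodeDecodeError:
            continue  # Step 1: this table cannot fit the binary content.
        matches.append((toy_mess_ratio(text), encoding))  # Step 2: measure the mess.
    # Step 3: lowest mess first; ties keep the candidate order (stable sort).
    return sorted(matches, key=lambda match: match[0])


payload = "Déjà vu, café crème".encode("utf_8")
for mess, encoding in toy_detect(payload):
    print(f"{encoding}: mess={mess:.3f}")
```

With this payload, `ascii` is discarded outright, while `cp1252` and `latin_1` decode without error but produce symbols such as `©` in place of accented letters, so they score a higher mess than `utf_8`.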

**Wait a minute**, what is chaos/mess and coherence according to **YOU?**

*Chaos :* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is chaotic is very subjective, so feel free to contribute in order to
improve or rewrite it.

*Coherence :* For each language on earth, we have computed ranked letter appearance occurrences (the best we can). I thought
that this intel was worth something here, so I use those records against the decoded text to check whether I can detect intelligent design.
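
To make the coherence idea concrete, here is a toy sketch that compares the most frequent letters of a text against a ranked letter list for a language. The rankings below are approximate and hand-written, and `toy_coherence` with its scoring rule is an illustrative assumption; the package's own frequency data and scoring differ.

```python
from collections import Counter

# Approximate, hand-written rankings (most frequent letter first); illustrative only.
TOY_RANKINGS = {
    "English": "etaoinshrdlu",
    "French": "esaitnrulodc",
}


def toy_coherence(text: str, ranking: str, top: int = 6) -> float:
    """Share of the ranking's top letters that are also among the text's top letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    observed = {ch for ch, _ in Counter(letters).most_common(top)}
    expected = set(ranking[:top])
    return len(observed & expected) / top


sample = "the rain in spain stays mainly in the plain and the train is on time"
for language, ranking in TOY_RANKINGS.items():
    print(language, round(toy_coherence(sample, ranking), 2))
```

A text decoded with the wrong table tends to produce letters that match no language's ranking, which is what drives such a score down.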

## ⚡ Known limitations

- Language detection is unreliable when text contains two or more languages sharing identical letters (e.g., HTML with English tags plus Turkish content, both using Latin characters).
- Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.<br />
Feel free to check [issues page]( if you want to contribute.

## 📝 License

Copyright © 2019 [Ahmed TAHRI @Ousret](<br />
This project is [MIT]( licensed.

Character frequencies used in this project © 2012 [Denny Vrandečić](