- Metadata-Version: 2.1
- Name: charset-normalizer
- Version: 2.0.10
- Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
- Home-page: https://github.com/ousret/charset_normalizer
- Author: Ahmed TAHRI @Ousret
- Author-email: ahmed.tahri@cloudnursery.dev
- License: MIT
- Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
- Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
- Keywords: encoding,i18n,txt,text,charset,charset-detector,normalization,unicode,chardet
- Platform: UNKNOWN
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Intended Audience :: Developers
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.5
- Classifier: Programming Language :: Python :: 3.6
- Classifier: Programming Language :: Python :: 3.7
- Classifier: Programming Language :: Python :: 3.8
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Topic :: Text Processing :: Linguistic
- Classifier: Topic :: Utilities
- Classifier: Programming Language :: Python :: Implementation :: PyPy
- Classifier: Typing :: Typed
- Requires-Python: >=3.5.0
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Provides-Extra: unicode_backport
- Requires-Dist: unicodedata2 ; extra == 'unicode_backport'
- <h1 align="center">Charset Detection, for Everyone 👋 <a href="https://twitter.com/intent/tweet?text=The%20Real%20First%20Universal%20Charset%20%26%20Language%20Detector&url=https://www.github.com/Ousret/charset_normalizer&hashtags=python,encoding,chardet,developers"><img src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social"/></a></h1>
- <p align="center">
- <sup>The Real First Universal Charset Detector</sup><br>
- <a href="https://pypi.org/project/charset-normalizer">
- <img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
- </a>
- <a href="https://codecov.io/gh/Ousret/charset_normalizer">
- <img src="https://codecov.io/gh/Ousret/charset_normalizer/branch/master/graph/badge.svg" />
- </a>
- <a href="https://pepy.tech/project/charset-normalizer/">
- <img alt="Download Count Total" src="https://pepy.tech/badge/charset-normalizer/month" />
- </a>
- </p>
- > A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
- > I'm trying to resolve the issue by taking a new approach.
- > All IANA character set names for which the Python core library provides codecs are supported.
- <p align="center">
- >>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
- </p>
- This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.
- | Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
- | ------------- | :-------------: | :------------------: | :------------------: |
- | `Fast` | ❌<br> | ✅<br> | ✅ <br> |
- | `Universal**` | ❌ | ✅ | ❌ |
- | `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
- | `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
- | `Free & Open` | ✅ | ✅ | ✅ |
- | `License` | LGPL-2.1 | MIT | MPL-1.1 |
- | `Native Python` | ✅ | ✅ | ❌ |
- | `Detect spoken language` | ❌ | ✅ | N/A |
- | `Supported Encoding` | 30 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/support.html) | 40 |
- <p align="center">
- <img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
- </p>
- *\*\* : They clearly use code specific to each encoding, even if they cover most of the ones in use.*<br>
- ## ⭐ Your support
- *Fork, test-it, star-it, submit your ideas! We do listen.*
-
- ## ⚡ Performance
- This package offers better performance than its counterpart, Chardet. Here are some numbers.
- | Package | Accuracy | Mean per file (ms) | File per sec (est) |
- | ------------- | :-------------: | :------------------: | :------------------: |
- | [chardet](https://github.com/chardet/chardet) | 92 % | 220 ms | 5 file/sec |
- | charset-normalizer | **98 %** | **40 ms** | 25 file/sec |
- | Package | 99th percentile | 95th percentile | 50th percentile |
- | ------------- | :-------------: | :------------------: | :------------------: |
- | [chardet](https://github.com/chardet/chardet) | 1115 ms | 300 ms | 27 ms |
- | charset-normalizer | 460 ms | 240 ms | 18 ms |
- Chardet's performance on larger files (1MB+) is very poor. Expect a huge difference on large payloads.
- > Stats are generated from 400+ files using default parameters. For details on the files used, see the GHA workflows.
- > And yes, these results might change at any time. The dataset can be updated to include more files.
- > The actual delays depend heavily on your CPU capabilities. The factors should remain the same.
- [cchardet](https://github.com/PyYoshi/cChardet) is a non-native (C++ binding), unmaintained, faster alternative with
- better accuracy than chardet but lower than this package's. If speed is the most important factor, you should try it.
- ## ✨ Installation
- Using PyPI for the latest stable release:
- ```sh
- pip install charset-normalizer -U
- ```
- If you want a more up-to-date `unicodedata` than the one available in your Python setup:
- ```sh
- pip install charset-normalizer[unicode_backport] -U
- ```
- ## 🚀 Basic Usage
- ### CLI
- This package comes with a CLI.
- ```
- usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
- file [file ...]
- The Real First Universal Charset Detector. Discover originating encoding used
- on text file. Normalize text to unicode.
- positional arguments:
- files File(s) to be analysed
- optional arguments:
- -h, --help show this help message and exit
- -v, --verbose Display complementary information about file if any.
- Stdout will contain logs about the detection process.
- -a, --with-alternative
- Output complementary possibilities if any. Top-level
- JSON WILL be a list.
- -n, --normalize Permit to normalize input file. If not set, program
- does not write anything.
- -m, --minimal Only output the charset detected to STDOUT. Disabling
- JSON output.
- -r, --replace Replace file when trying to normalize it instead of
- creating a new one.
- -f, --force Replace file without asking if you are sure, use this
- flag with caution.
- -t THRESHOLD, --threshold THRESHOLD
- Define a custom maximum amount of chaos allowed in
- decoded content. 0. <= chaos <= 1.
- --version Show version information and exit.
- ```
- ```bash
- normalizer ./data/sample.1.fr.srt
- ```
- :tada: Since version 1.4.0 the CLI produces an easily usable stdout result in JSON format.
- ```json
- {
- "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
- "encoding": "cp1252",
- "encoding_aliases": [
- "1252",
- "windows_1252"
- ],
- "alternative_encodings": [
- "cp1254",
- "cp1256",
- "cp1258",
- "iso8859_14",
- "iso8859_15",
- "iso8859_16",
- "iso8859_3",
- "iso8859_9",
- "latin_1",
- "mbcs"
- ],
- "language": "French",
- "alphabets": [
- "Basic Latin",
- "Latin-1 Supplement"
- ],
- "has_sig_or_bom": false,
- "chaos": 0.149,
- "coherence": 97.152,
- "unicode_path": null,
- "is_preferred": true
- }
- ```
- ### Python
- *Just print out normalized text*
- ```python
- from charset_normalizer import from_path
- results = from_path('./my_subtitle.srt')
- print(str(results.best()))
- ```
- *Normalize any text file*
- ```python
- from charset_normalizer import normalize
- try:
- normalize('./my_subtitle.srt') # should write to disk my_subtitle-***.srt
- except IOError as e:
- print('Sadly, we are unable to perform charset normalization.', str(e))
- ```
- *Upgrade your code without effort*
- ```python
- from charset_normalizer import detect
- ```
- The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) backward-compatible result possible.
- See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
- ## 😇 Why
- When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
- reliable alternative using a completely different method. Also, I never back down from a good challenge!
- I **don't care** about the **originating charset** encoding, because **two different tables** can
- produce **two identical rendered strings.**
- What I want is to get readable text, the best I can.
- In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
- Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer's is to convert a raw file of unknown encoding to Unicode.
- ## 🍰 How
- Discard all charset encoding tables that could not fit the binary content.
- Measure the chaos, or mess, once opened (by chunks) with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- - Additionally, we measure coherence / probe for a language.
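The steps above can be sketched with a toy, standard-library-only loop. This is only an illustration of the idea, not the project's actual implementation: the candidate list and the `mess_ratio` heuristic below are simplified assumptions made up for this example.

```python
# Toy sketch of the detection steps: try candidate encodings, discard
# the tables that cannot decode the bytes, then keep the decoding with
# the least "mess" (crudely approximated here by counting replacement
# and control characters).
CANDIDATES = ("utf_8", "cp1252", "latin_1", "utf_16")

def mess_ratio(text: str) -> float:
    """Fraction of characters that look suspicious in human-written text."""
    suspicious = sum(
        1 for ch in text
        if ch == "\ufffd" or (not ch.isprintable() and ch not in "\r\n\t")
    )
    return suspicious / max(len(text), 1)

def guess_encoding(payload: bytes) -> str:
    scored = []
    for enc in CANDIDATES:
        try:
            decoded = payload.decode(enc)  # step 1: discard non-fitting tables
        except (UnicodeDecodeError, LookupError):
            continue
        scored.append((mess_ratio(decoded), enc))  # step 2: measure the mess
    return min(scored)[1]  # step 3: keep the lowest-mess match

print(guess_encoding("Ímpossible".encode("utf-8")))  # → utf_8
```

The real library scores far more signals than this, and over a much larger set of candidate encodings, but the discard-then-rank shape is the same.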
- **Wait a minute**, what are chaos/mess and coherence according to **YOU**?
- *Chaos :* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
- **I established** some ground rules about **what is obvious** when **it seems like** a mess.
- I know that my interpretation of what is chaotic is very subjective; feel free to contribute in order to
- improve or rewrite it.
- *Coherence :* For each language there is on Earth, we have computed ranked letter-appearance occurrences (the best we can). So I thought
- that intel is worth something here. I use those records against decoded text to check whether I can detect intelligent design.
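The coherence idea can also be illustrated with a toy probe. The letter ranking below is a rough, hypothetical approximation for English invented for this example, not the project's actual frequency data:

```python
# Toy coherence probe: compare the most frequent letters of a decoded
# text against an expected ranking for a language.
from collections import Counter

# Hypothetical top English letters, most frequent first (illustrative only).
EXPECTED_ENGLISH = list("etaoinshrd")

def coherence(text: str, expected: list) -> float:
    """Share of the text's top letters that also appear in the expected top list."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    top = [ch for ch, _ in Counter(letters).most_common(len(expected))]
    hits = sum(1 for ch in top if ch in expected)
    return hits / max(len(top), 1)

good = coherence("The quick brown fox jumps over the lazy dog", EXPECTED_ENGLISH)
bad = coherence("Ã©Ã¨Ã§Ã¹Ã»Ã¤Ã¶Ã¼ÃŸ", EXPECTED_ENGLISH)
print(f"english-like: {good:.2f}, mojibake: {bad:.2f}")
```

A plausible decoding of English text scores high, while mojibake scores near zero, which is the signal used to pick between otherwise equally low-mess candidates.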
- ## ⚡ Known limitations
- Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML with English tags + Turkish content, both using Latin characters).
- Every charset detector heavily depends on having sufficient content. In common cases, do not bother running detection on very tiny content.
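The second limitation is easy to demonstrate with the standard library alone: a very short payload can decode without error under many single-byte tables, each yielding a plausible character, so no detector has enough signal to pick a winner.

```python
# A single non-ASCII byte decodes cleanly under many single-byte
# encodings, each producing a different (but equally "valid") character.
payload = b"\xe9"  # 'é' in latin-1 and cp1252, but also valid elsewhere

decoded = {
    enc: payload.decode(enc)
    for enc in ("latin_1", "cp1252", "cp1251", "koi8_r", "mac_roman")
}
for enc, ch in decoded.items():
    print(enc, "->", ch)
```

With more surrounding content, most of these interpretations would start producing mess or incoherent letter statistics, which is why detectors need a reasonable amount of input.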
- ## 👤 Contributing
- Contributions, issues and feature requests are very much welcome.<br />
- Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
- ## 📝 License
- Copyright © 2019 [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
- This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
- Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)