encoding - How to detect illegal UTF-8 byte sequences to replace them in java inputstream?


The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or some other encoding).

The file contains a few illegal byte sequences, which should be replaced with the replacement character.

This is not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper that does what I need:

Is there anything like that available (commercially or as free software)?

Thanks.

Solution:

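    // Needs imports from java.io and java.nio.charset; istream is the source InputStream.
    final BufferedInputStream in = new BufferedInputStream(istream);
    final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
    // Substitute U+FFFD for bad input instead of throwing an exception.
    charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
    charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    final Reader inputReader = new InputStreamReader(in, charsetDecoder);

java.nio.charset.CharsetDecoder does what you need. This class provides character decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).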
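
To see the replacement behaviour end to end, here is a minimal self-contained sketch. The class name LenientUtf8Demo and the helper readLenient are made up for illustration, and the three-byte input is just a test vector (0xFF can never appear in valid UTF-8).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientUtf8Demo {

    // Hypothetical helper: reads the whole stream as UTF-8 and
    // substitutes U+FFFD for every malformed byte sequence.
    static String readLenient(InputStream istream) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(istream, decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', a lone 0xFF byte (illegal in UTF-8), then 'b'
        byte[] input = {'a', (byte) 0xFF, 'b'};
        String text = readLenient(new ByteArrayInputStream(input));
        System.out.println(text.equals("a\uFFFDb")); // prints true
    }
}
```

Valid sequences pass through unchanged, so only the broken bytes are lost.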

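CharsetDecoder itself turns bytes into characters. Those characters can be written back out as UTF-8 to an OutputStream, which you can pipe into an InputStream (for example with java.io.PipedOutputStream and java.io.PipedInputStream), effectively creating a filtered InputStream.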
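
One way the pipe idea might look in practice is sketched below. The class Utf8FilterStream and its filter method are hypothetical names, the background thread is an assumption of this sketch (piped streams need the writer and reader on different threads), and readAllBytes requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8FilterStream {

    // Hypothetical helper: returns an InputStream that yields the bytes of
    // `raw` re-encoded as valid UTF-8, with bad sequences replaced by U+FFFD.
    static InputStream filter(InputStream raw) throws IOException {
        PipedInputStream pipeIn = new PipedInputStream();
        PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Pump thread: decode leniently, re-encode, and push into the pipe.
        Thread pump = new Thread(() -> {
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(pipeOut, StandardCharsets.UTF_8)) {
                char[] buf = new char[1024];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: the pipe is closed by try-with-resources, so the
                // consumer sees end of stream; real code should report the error.
            }
        });
        pump.setDaemon(true);
        pump.start();
        return pipeIn;
    }

    public static void main(String[] args) throws IOException {
        // 0xC3 opens a two-byte sequence, but '!' is not a continuation byte.
        byte[] input = {'o', 'k', (byte) 0xC3, '!'};
        try (InputStream clean = filter(new ByteArrayInputStream(input))) {
            String text = new String(clean.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(text.equals("ok\uFFFD!")); // prints true
        }
    }
}
```

If the consumer only needs characters rather than bytes, the Reader from the solution above is simpler and avoids the extra thread.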

