Welcome to MSFN

tomasz86

How to check text file encoding from command-line?

15 posts in this topic

Is there any simple way to check text file encoding from command-line?

I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are encoded in ANSI, but in a Korean Windows XP they're encoded in UCS-2 Little Endian.

I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.


UCS-2 Little Endian sounds to me a lot like "Unicode". :whistle:

The simplest check you can make is to look for a hex 00 byte: if there is at least one, the file is Unicode (UCS-2 stores ASCII-range characters with a 00 high byte); conversely, if there are none, it's not Unicode and is very likely to be "plain ANSI text".
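As a rough illustration of the zero-byte check described above, here is a minimal Python sketch. Python is not among the Windows default tools, and the function names and the 4096-byte sample size are assumptions made for the example, not anything from this thread.

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic only: any 0x00 byte suggests UCS-2/UTF-16 rather than ANSI."""
    return b"\x00" in data


def file_looks_like_utf16(path: str, sample_size: int = 4096) -> bool:
    # Hypothetical helper: inspects only the first sample_size bytes of the file.
    with open(path, "rb") as f:
        return looks_like_utf16(f.read(sample_size))
```

The heuristic depends on ASCII-range characters or line breaks being present, which holds for an INF file containing "[Version]". UCS-2 text with neither might contain no 00 bytes at all.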

http://betterexplained.com/articles/unicode/

jaclaz


Nice one, jaclaz!


Thanks for your suggestions!

@allen2

It would be nice to be able to do it without any external application although I'll have a look at it if I can't manage to do it with just the Windows default tools.

@jaclaz

Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.

I wonder what you think about this: if you ECHO something to a text file encoded in UCS-2 Little Endian from CMD (without the /U switch), the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then searching for it with FINDSTR. If FINDSTR can't find it, that will mean the file is UCS-2 Little Endian.
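A minimal Python sketch of why a plain ECHO breaks such a file, and of appending in the matching encoding instead. The helper name `append_line`, the cp1252 default and the sample data are assumptions for illustration; only the FF FE header is checked.

```python
UCS2_LE_BOM = b"\xff\xfe"


def append_line(existing: bytes, line: str, ansi_codepage: str = "cp1252") -> bytes:
    """Return existing plus line, encoded to match the file's apparent encoding."""
    text = line + "\r\n"
    if existing.startswith(UCS2_LE_BOM):
        # Two bytes per character, low byte first.
        return existing + text.encode("utf-16-le")
    return existing + text.encode(ansi_codepage)


# Sample UCS-2 LE content: BOM followed by "[Version]" and a line break.
ucs2_file = UCS2_LE_BOM + "[Version]\r\n".encode("utf-16-le")

# Roughly what a plain ECHO does: single-byte text lands in a two-byte file,
# so each pair of appended bytes is later read as one unrelated character.
broken = ucs2_file + b"ABCD\r\n"
fixed = append_line(ucs2_file, "ABCD")
```

Decoding `broken` as UCS-2 no longer yields "ABCD", while `fixed` decodes cleanly.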

Edited by tomasz86

endian.zip - endian.exe (console-32 app), endian.bat (sample usage)

endian.exe source snippet based on header info detailed at http://betterexplained.com/articles/unicode/:


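// read first three (or more) bytes from file into byte array s[] (unsigned char), then:
return
    (s[0]==0xFF) && (s[1]==0xFE) ? 255 :                     // UCS-2 Little Endian BOM
    (s[0]==0xFE) && (s[1]==0xFF) ? 254 :                     // UCS-2 Big Endian BOM
    (s[0]==0xEF) && (s[1]==0xBB) && (s[2]==0xBF) ? 239 : 0;  // UTF-8 BOM, otherwise none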

Sample usage: endian.bat


@echo off
%0\..\endian %1

IF ERRORLEVEL 255 GOTO UCS2LE
IF ERRORLEVEL 254 GOTO UCS2BE
IF ERRORLEVEL 239 GOTO UTF8

echo ANSI
GOTO End

:UCS2LE
echo UCS-2 Little Endian
GOTO End

:UCS2BE
echo UCS-2 Big Endian
GOTO End

:UTF8
echo UTF-8

:End


If debug.exe is available, I think it can be used to achieve the same results. Debug can be scripted to open the file, analyze the first three bytes, and create a temp com file that sets an appropriate ERRORLEVEL. If Debug doesn't set the ERRORLEVEL itself upon exit, the temp file can be eliminated by running the temp program within Debug.


Well, you have gsar already used in your "standard" set of tools, haven't you?

I doubt - no offence intended :) - that an ANSI file contains hex 00, but you can (better :thumbup ) use gsar to find the initial FFFE as jumper suggested.

In any case, FOR /F won't "like" Unicode, thus:

@ECHO OFF
::ISANSI.CMD - small example batch to check if a file is ANSI or UNICODE
FOR /F %%A in (%1) DO ECHO ANSI&GOTO :EOF
ECHO UNICODE

BUT this won't work the same:

@ECHO OFF
::NOANSI.CMD - small example batch that will always return ANSI
FOR /F %%A in ('TYPE %1') DO ECHO ANSI&GOTO :EOF
ECHO UNICODE

jaclaz


Well, in this particular script the only "non-Windows" tool is gsar.exe...

But for the moment I've actually managed to solve this specific problem with this very simple check:

SETLOCAL ENABLEDELAYEDEXPANSION
FINDSTR/IL "[Version]" I386\hivedef.inf >NUL
IF !ERRORLEVEL! EQU 0 (
TYPE temp.txt>>I386\hivedef.inf
) ELSE (
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
)

FINDSTR can't find "[Version]" if the file is encoded in UCS-2 Little Endian.
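One plausible explanation, sketched in Python under the assumption that FINDSTR compares single-byte text: in a UCS-2 Little Endian file every character of "[Version]" is followed by a 00 byte, so the ANSI byte sequence never appears. The function name and sample contents below are made up for illustration.

```python
def contains_ansi_text(data: bytes, needle: str = "[Version]") -> bool:
    """Case-insensitive byte-wise search for the single-byte form of needle."""
    return needle.lower().encode("ascii") in data.lower()


# Hypothetical file contents in the two encodings discussed in the thread.
sample = '[Version]\r\nSignature="$Windows NT$"\r\n'
ansi_file = sample.encode("ascii")
ucs2_file = b"\xff\xfe" + sample.encode("utf-16-le")

found_in_ansi = contains_ansi_text(ansi_file)  # True
found_in_ucs2 = contains_ansi_text(ucs2_file)  # False: stored as b"[\x00V\x00e\x00..."
```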

Edited by tomasz86

The BOM (if present) might help to detect the encoding of the file.

And this vbs will report the encoding of the file passed as argument:

Function ByteHex(data, i)
    ' Two-digit hex of byte i ("" past the end); Hex() alone drops the leading zero
    ByteHex = ""
    If i <= LenB(data) Then ByteHex = Right("0" & Hex(AscB(MidB(data, i, 1))), 2)
End Function

Function encoding(fpn)
    ' Read the file as binary and compare its first bytes with known BOMs
    Set file = CreateObject("ADODB.Stream")
    file.Type = 1
    file.Open
    file.LoadFromFile fpn
    data = file.Read
    file.Close
    a = ByteHex(data, 1)
    b = ByteHex(data, 2)
    c = ByteHex(data, 3)
    d = ByteHex(data, 4)
    encoding = "unknown or ASCII"
    If a = "EF" And b = "BB" And c = "BF" Then encoding = "UTF-8"
    If a = "FE" And b = "FF" Then encoding = "UTF-16 (BE)"
    If a = "FF" And b = "FE" Then encoding = "UTF-16 (LE)"
    If a = "00" And b = "00" And c = "FE" And d = "FF" Then encoding = "UTF-32 (BE)"
    ' Checked after UTF-16 (LE) on purpose: the UTF-32 (LE) BOM starts with FF FE
    If a = "FF" And b = "FE" And c = "00" And d = "00" Then encoding = "UTF-32 (LE)"
    If a = "2B" And b = "2F" And c = "76" And (d = "38" Or d = "39" Or d = "2B" Or d = "2F") Then encoding = "UTF-7"
    If a = "F7" And b = "64" And c = "4C" Then encoding = "UTF-1"
    If a = "DD" And b = "73" And c = "66" And d = "73" Then encoding = "UTF-EBCDIC"
    If a = "0E" And b = "FE" And c = "FF" Then encoding = "SCSU"
    If a = "FB" And b = "EE" And c = "28" Then encoding = "BOCU-1"
    If a = "84" And b = "31" And c = "95" And d = "33" Then encoding = "GB-18030"
End Function
WScript.Echo encoding(WScript.Arguments.Item(0))

But many files don't contain a BOM.
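For comparison, a compact Python sketch of the same kind of BOM lookup, limited to the UTF-8/16/32 signatures that the standard `codecs` module defines. The function name and the returned strings are assumptions; the table is ordered longest first because the UTF-32 (LE) BOM begins with the UTF-16 (LE) one.

```python
import codecs

# Longer signatures first, so FF FE 00 00 is not mistaken for UTF-16 (LE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
]


def encoding_from_bom(head: bytes) -> str:
    """Name the encoding implied by the leading bytes, if they form a known BOM."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return "unknown (no BOM)"
```

A file without a BOM falls through to the last line, which is exactly the case where a content-based check is still needed.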


It is clear how we all give a different meaning to the word "simple". :whistle:

FOR /F %%A in  (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
:skip

jaclaz


As some system files are encoded in ANSI, is it necessary to append using the same encoding? What happens if you convert the file to ANSI and then use or append to it as such?


Thank you :w00t: I meant simpler when compared to the other options where other tools must be used.

It should be possible in the case of these HIVE*.INF files because they use only English and Korean characters, but how about Unicode files where a lot of different languages are used, like this one?

Netrtle.7z

If you check the [strings.*] sections you'll see that there are a lot of them for several different languages. If I try to convert such a file to ANSI then many characters are lost.
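To illustrate the loss described above, here is a small Python sketch that lists which characters a given ANSI code page cannot represent. The function name and the sample strings are invented for the example and are not taken from the attached INF file.

```python
def lost_characters(text: str, codepage: str) -> list:
    """Return the characters of text that the given code page cannot encode."""
    lost = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


# Hypothetical localized string: fine in the Korean code page, lossy in Western European.
sample = "Realtek 한국어"
lost_in_cp949 = lost_characters(sample, "cp949")    # Korean ANSI code page
lost_in_cp1252 = lost_characters(sample, "cp1252")  # Western European ANSI code page
```

A file whose [strings.*] sections mix several scripts will usually have some characters in the lost list whichever single code page is chosen.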


Completely off topic, but the netrtle.inf file you attached contains an "r" at the beginning of the first line, which shouldn't be there no matter the encoding or the INF file.


It's just a typo of mine :blushing: You can safely ignore it.


For those with pre-ME systems that don't support FINDSTR, here's a batch file that uses FIND to process one file from the command line or a whole directory of files:


@echo off
if not %1*==* goto TEST

for %%s in (*.inf) do call %0 %%s
goto EXIT

:TEST
find>nul /i "[Version]" %1
IF ERRORLEVEL 1 goto NONANSI

:ANSI
echo %1 is ANSI
goto EXIT

:NONANSI
echo %1 is non-ANSI

:EXIT
