How to check text file encoding from command-line?
#1
Posted 07 October 2012 - 07:37 AM
I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are coded in ANSI but in a Korean Windows XP they're coded in UCS-2 Little Endian.
I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.
#3
Posted 07 October 2012 - 09:38 AM
tomasz86, on 07 October 2012 - 07:37 AM, said:
I'm working on a script which adds lines to the I386\HIVE*.INF files in Windows 2000/XP source. In an English system those HIVE*.INF files are coded in ANSI but in a Korean Windows XP they're coded in UCS-2 Little Endian.
I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.
UCS-2 Little Endian sounds to me a lot like "Unicode".
The simplest check you can make is looking for a hex 00 (if there is at least one, it's Unicode, conversely, if there are none, it's not Unicode - and it is very likely to be "plain ANSI text").
http://betterexplain...ticles/unicode/
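The zero-byte check described above can be sketched in a few lines. This is only an illustration in Python (not one of the Windows default tools the thread is after); the function name `looks_like_utf16` and the sample strings are made up for the example.

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic only: a 0x00 byte is unusual in ANSI text but
    common in UTF-16/UCS-2 text that contains ASCII characters."""
    return b"\x00" in data


# Hypothetical sample data: the same INF header line in both encodings.
ansi_sample = "[Version]\r\n".encode("cp1252")
ucs2_sample = "[Version]\r\n".encode("utf-16-le")

print(looks_like_utf16(ansi_sample))   # False
print(looks_like_utf16(ucs2_sample))   # True

# Limitation: UTF-16 text without any ASCII may contain no zero byte at all.
hangul_only = "\ud55c".encode("utf-16-le")   # one Hangul syllable, bytes 5C D5
print(looks_like_utf16(hangul_only))   # False
```

For INF files the heuristic should be adequate, since section names and keys are ASCII and therefore produce a 00 byte after every character in UCS-2 Little Endian.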
jaclaz
#5
Posted 07 October 2012 - 03:59 PM
@allen2
It would be nice to be able to do it without any external application although I'll have a look at it if I can't manage to do it with just the Windows default tools.
@jaclaz
Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.
I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.
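Why the appended text ends up "completely broken" can be illustrated at the byte level. The following Python sketch is an assumption-laden model, not a reproduction of what CMD does internally: it treats the file as BOM plus UTF-16 LE text and the ECHO output as single-byte cp1252 text, with `TEST` as a made-up marker string.

```python
# A UCS-2 LE file body (BOM plus one line), as raw bytes.
original = b"\xff\xfe" + "[Version]\r\n".encode("utf-16-le")

# Single-byte text, as a plain ECHO without /U is assumed to write it.
appended = "TEST\r\n".encode("cp1252")

mixed = original + appended
decoded = mixed.decode("utf-16")        # the BOM selects little endian

# Pairs of single-byte characters are read as one 16-bit character each,
# so the marker is no longer readable as text.
print("TEST" in decoded)                # False

# At the byte level the marker is still present unchanged.
print(b"TEST" in mixed)                 # True

# Appending in the file's own encoding keeps the text intact.
good = original + "TEST\r\n".encode("utf-16-le")
print("TEST" in good.decode("utf-16"))  # True
```

Note that a tool which searches raw bytes would still see the appended marker in the mixed file, so searching for a string that is already part of the file may be the more reliable test.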
This post has been edited by tomasz86: 07 October 2012 - 04:00 PM
#6
Posted 08 October 2012 - 01:07 AM
endian.zip (988 bytes): endian.exe (console-32 app), endian.bat (sample usage)
endian.exe source snippet based on header info detailed at http://betterexplain...ticles/unicode/:
// read first three (or more) bytes from file into byte array s[], then:
return
(s[0]==255) && (s[1]==254)? 255 :
(s[0]==254) && (s[1]==255)? 254 :
(s[0]==0xEF) && (s[1]==0xBB) && (s[2]==0xBF)? 239 : 0;
Sample usage: endian.bat
@echo off
%0\..\endian %1
IF ERRORLEVEL 255 GOTO UCS2LE
IF ERRORLEVEL 254 GOTO UCS2BE
IF ERRORLEVEL 239 GOTO UTF8
echo ANSI
GOTO End
:UCS2LE
echo UCS-2 Little Endian
GOTO End
:UCS2BE
echo UCS-2 Big Endian
GOTO End
:UTF8
echo UTF-8
:End
If debug.exe is available, I think it can be used to achieve the same results. Debug can be scripted to open the file, analyze the first three bytes, and create a temp com file that sets an appropriate ERRORLEVEL. If Debug doesn't set the ERRORLEVEL itself upon exit, the temp file can be eliminated by running the temp program within Debug.
#7
Posted 08 October 2012 - 02:18 AM
tomasz86, on 07 October 2012 - 03:59 PM, said:
Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.
I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.
Well, you have gsar already used in your "standard" set of tools, haven't you?
I doubt - no offence intended
In any case, FOR /F won't "like" Unicode, thus:
@ECHO OFF
::ISANSI.CMD - small example batch to check if a file is ANSI or UNICODE
FOR /F %%A in (%1) DO ECHO ANSI&GOTO :EOF
ECHO UNICODE
BUT this won't work the same:
@ECHO OFF
::NOANSI.CMD - small example batch that will always return ANSI
FOR /F %%A in ('TYPE %1') DO ECHO ANSI&GOTO :EOF
ECHO UNICODE
jaclaz
#8
Posted 08 October 2012 - 02:25 AM
But at the moment I actually managed to solve this specific problem with this very simple checking:
SETLOCAL ENABLEDELAYEDEXPANSION
FINDSTR/IL "[Version]" I386\hivedef.inf >NUL
IF !ERRORLEVEL! EQU 0 (
    TYPE temp.txt>>I386\hivedef.inf
) ELSE (
    CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
)
FINDSTR can't find "[Version]" if the file is encoded in UCS-2 Little Endian.
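The reason a byte-oriented search misses the string can be shown with the raw bytes. This is a small Python sketch of the idea, not a model of FINDSTR itself; `has_ansi_marker` is a hypothetical helper that mimics a case-insensitive literal search.

```python
def has_ansi_marker(data: bytes, marker: bytes = b"[Version]") -> bool:
    """Case-insensitive literal search on raw bytes."""
    return marker.lower() in data.lower()


ansi_bytes = "[Version]\r\n".encode("cp1252")
ucs2_bytes = b"\xff\xfe" + "[Version]\r\n".encode("utf-16-le")

# In UCS-2 LE every ASCII character is followed by a 00 byte,
# so the contiguous byte sequence "[Version]" never occurs.
print(ucs2_bytes[2:8])               # b'[\x00V\x00e\x00'

print(has_ansi_marker(ansi_bytes))   # True
print(has_ansi_marker(ucs2_bytes))   # False
```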
This post has been edited by tomasz86: 08 October 2012 - 02:26 AM
#9
Posted 08 October 2012 - 02:32 AM
And this vbs will report the encoding of the file passed as argument:
Function encoding(fpn)
set file=CreateObject("ADODB.Stream")
file.Type=1
file.Open
file.LoadFromFile fpn
data = file.Read
file.Close
a = hex(Ascb(Midb(data, 1, 1)))
b = hex(Ascb(Midb(data, 2, 1)))
c = hex(Ascb(Midb(data, 3, 1)))
d= hex(Ascb(Midb(data, 4, 1)))
encoding="unknown ascii"
If a = "EF" AND b = "BB" AND c = "BF" Then encoding = "UTF-8"
If (a = "FE" AND b = "FF" AND not c = "00" ) then encoding = "UTF-16 (BE)"
If (a = "FF" AND b = "FE") Then encoding = "UTF-16 (LE)"
If (a = "00" AND b = "00" AND c = "FE" AND d = "FF" ) then encoding = "UTF-32 (BE)"
If (a = "FF" AND b = "FE" AND c = "00" AND d = "00" ) then encoding = "UTF-32 (LE)"
If (a = "2B" AND b = "2F" AND c = "76" AND (d = "38" or d = "39" or d = "2B" or d = "2F" )) then encoding = "UTF-7"
If (a = "F7" AND b = "64" AND c = "4C") then encoding = "UTF-1"
If (a = "DD" AND b = "73" AND c = "66" AND d = "73") then encoding = "UTF-EBCDIC"
If (a = "0E" AND b = "FE" AND c = "FF") then encoding = "SCSU"
If (a = "FB" AND b = "EE" AND c = "28") then encoding = "BOCU-1"
If (a = "84" AND b = "31" AND c = "95" AND d = "33") then encoding = "GB-18030"
End Function
wscript.echo encoding(WScript.Arguments.Item(0))
But many files don't contain a BOM.
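For comparison, the same BOM lookup can be written as a table. This is a minimal Python sketch covering only the UTF-8, UTF-16 and UTF-32 signatures; `detect_bom` and `detect_file` are names chosen for the example, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longer signatures are listed first because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]


def detect_bom(data: bytes) -> str:
    """Return the encoding named by a leading BOM, if there is one."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "unknown (no BOM)"


def detect_file(path: str) -> str:
    """Read only the first four bytes; short or empty files are fine."""
    with open(path, "rb") as handle:
        return detect_bom(handle.read(4))


print(detect_bom(b"\xff\xfe[\x00"))   # UTF-16 (LE)
print(detect_bom(b"[Version]"))       # unknown (no BOM)
```

A file without a BOM always falls through to the last branch, so a result of "unknown" says nothing about whether the content is ANSI, UTF-8 or BOM-less UTF-16.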
#10
Posted 08 October 2012 - 03:26 AM
FOR /F %%A in (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
:skip
jaclaz
#11
Posted 08 October 2012 - 09:13 AM
#12
Posted 09 October 2012 - 12:58 AM
jaclaz, on 08 October 2012 - 03:26 AM, said:
FOR /F %%A in (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
:skip
Thank you
Yzöwl, on 08 October 2012 - 09:13 AM, said:
It should be possible in case of these HIVE*.INF files because they utilise only English and Korean characters but how about unicode files where a lot of different languages are used like this one?
Netrtle.7z (32.31K)
If you check the [Strings.*] sections you'll see that there are a lot of them for several different languages. If I try to convert such a file to ANSI then many characters are lost.
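The loss is easy to reproduce in miniature. The Python sketch below uses a made-up sample string and assumes cp1252 as the "ANSI" code page; any single code page has the same problem with characters outside its repertoire.

```python
# Hypothetical sample: "café" followed by two Hangul syllables.
text = "caf\u00e9 \ud55c\uae00"

# Converting to a single ANSI code page: characters the code page
# does not contain are replaced with "?" and cannot be recovered.
as_ansi = text.encode("cp1252", errors="replace")
print(as_ansi)                            # b'caf\xe9 ??'
print(as_ansi.decode("cp1252") == text)   # False

# Keeping the file in UTF-16 LE preserves every character.
as_ucs2 = text.encode("utf-16-le")
print(as_ucs2.decode("utf-16-le") == text)  # True
```

For files that mix several languages, it therefore seems safer to detect the encoding and append in that same encoding than to convert the file to ANSI first.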
#13
Posted 09 October 2012 - 02:02 AM
#14
Posted 09 October 2012 - 02:41 AM
#15
Posted 09 October 2012 - 09:28 PM
@echo off
if not %1*==* goto TEST
for %%s in (*.inf) do call %0 %%s
goto EXIT
:TEST
find>nul /i "[Version]" %1
IF ERRORLEVEL 1 goto NONANSI
:ANSI
echo %1 is ANSI
goto EXIT
:NONANSI
echo %1 is non-ANSI
:EXIT


